Abstract
Structural variations (SVs), genomic alterations of DNA segments larger than 1 kb, comprise copy number variations (CNVs), such as deletions, insertions and duplications, as well as inversions and translocations, all of which have been shown to contribute to variation within the human population. Characterizing and documenting all SVs, and determining the extent to which they are linked to disorders and diseases, is of great interest and the subject of ongoing research.
We explore the signatures of these structural variations in paired-end read data obtained from recent sequencing technologies. Read pairs that map farther apart than expected indicate deletions and duplications, a signal that coincides with discrepant read depth. Read pairs that map closer together than expected indicate insertions, while a discrepant orientation of the read pairs indicates inversions. Split reads provide additional information about the actual breakpoint sequence. Each of these discrepancies contributes to the discovery of SVs, not only in terms of size but also in terms of location.
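The read-pair signatures described above can be illustrated with a minimal sketch. It is not the calling method of this work: the `ReadPair` record, the `classify_pair` function, the insert-size parameters (mean 500 bp, standard deviation 50 bp) and the three-standard-deviation cutoff are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical record for one mapped read pair (illustration only)."""
    pos1: int      # leftmost mapped position of the first read
    pos2: int      # leftmost mapped position of the second read
    strand1: str   # '+' or '-'
    strand2: str   # '+' or '-'


def classify_pair(pair, mean_insert=500.0, sd_insert=50.0, n_sd=3.0):
    """Label a read pair by the kind of discrepancy it shows.

    Assumes a library in which concordant pairs map to opposite strands
    and lie roughly mean_insert apart; all thresholds are illustrative.
    """
    # Both reads on the same strand: discrepant orientation.
    if pair.strand1 == pair.strand2:
        return "inversion"
    separation = abs(pair.pos2 - pair.pos1)
    # Farther apart than expected: deletion or duplication candidate;
    # read depth is needed to tell the two apart.
    if separation > mean_insert + n_sd * sd_insert:
        return "deletion_or_duplication"
    # Closer together than expected: insertion candidate.
    if separation < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"
```

For example, under these assumed parameters a pair mapped 2,000 bp apart on opposite strands would be labelled a deletion or duplication candidate, and a pair mapped 100 bp apart an insertion candidate.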
Experimental design is also an important component of CNV discovery. The standard Solexa/Illumina protocol uses a library with fragments of about 500 bp, which defines the expected distance between the two reads of a pair. Uncertainty in the actual fragment length hampers the detection of CNVs; however, this uncertainty can be incorporated into a statistical model. To improve the calling algorithm for insertions and deletions, we propose a probabilistic model that accounts for the amount of evidence at a particular breakpoint and includes the joint information gained from several individuals.
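One simple way to see how fragment-length uncertainty and evidence from several individuals could enter a statistical model is sketched here. This is not the probabilistic model proposed in this work. It assumes that read-pair separations are normally distributed around the library mean, that all individuals supplied carry the same indel, and that the function name and default parameters are hypothetical.

```python
def indel_log_likelihood_ratio(separations_by_individual,
                               mean_insert=500.0, sd_insert=50.0):
    """Pool read-pair separations at one candidate breakpoint.

    separations_by_individual: one list of observed separations (bp)
    per individual. Assumed model (illustration only):
      no variant:  separation ~ Normal(mean_insert, sd_insert^2)
      indel:       separation ~ Normal(mean_insert + delta, sd_insert^2)
    Returns (delta_hat, llr): the maximum-likelihood shift and the
    log-likelihood ratio of the indel model over the no-variant model.
    """
    pooled = [s for seps in separations_by_individual for s in seps]
    n = len(pooled)
    if n == 0:
        return 0.0, 0.0  # no evidence at this breakpoint
    # Maximum-likelihood estimate of the shift in separation.
    delta_hat = sum(pooled) / n - mean_insert
    # For a normal model with known variance the ratio has a closed
    # form that grows with the number of supporting read pairs.
    llr = n * delta_hat ** 2 / (2.0 * sd_insert ** 2)
    return delta_hat, llr
```

In this sketch a positive `delta_hat` corresponds to pairs mapping farther apart than expected and a negative one to pairs mapping closer together. Because the ratio scales with the number of pooled read pairs, a breakpoint supported by a few pairs in each of several individuals can score as highly as one supported by many pairs in a single individual.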