Model-based base-calling and de novo error correction algorithms for short-read sequencing
Seminar Room 1, Newton Institute
An important computational challenge associated with recent advances in sequencing technology is to develop efficient methods that can extract accurate sequence information from raw instrument data. In this talk, I will describe two algorithms that significantly improve the accuracy of short-read sequence data, particularly in the later cycles of a sequencing run. First, I will describe a novel model-based base-calling algorithm for the Illumina sequencing platform. Founded on the tools of statistical learning, our approach is flexible enough to incorporate various features of the sequencing process; in particular, it can easily incorporate cycle-dependent parameters and model residual effects. I will then describe an efficient algorithm for correcting base-call errors. Our algorithm does not require a reference genome, and it significantly outperforms previous error correction algorithms under various realistic settings. Finally, I will demonstrate how the improved data quality resulting from our algorithms may facilitate de novo assembly and SNP calling.