Two Seminars on Genomic Sequence Analysis – Thursday 16th April 2015

Please see below for details of two upcoming seminars at QUT Gardens Point. These may be of interest to anyone working in, or curious about, statistical genetics.

The Data Science discipline in the School of EECS is pleased to host two special seminars on genomic sequence analysis from visiting researchers next Thursday, April 16. We will commence at 2PM, break for coffee after the first seminar, and then re-convene for a second talk around 3:30. Full details are given below.

Date: Thursday, 16 April 2015. Venue: QUT Gardens Point Campus, Room S637

Seminar 1 – Commencing at 2:00PM

Title: The Family-Free approach applied to genome rearrangement problems

Speaker: Pedro Feijão, University of Bielefeld

Abstract: Genomes are subject to mutations or rearrangements in the course of evolution. Large-scale rearrangements can change the number of chromosomes and/or the positions and orientations of large blocks of DNA.  A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of rearrangements required to transform a given genome into another given genome. In order to study this problem, one usually adopts a high-level view of genomes, in which only “relevant” fragments of the DNA (e.g., genes) are taken into consideration. A pre-processing step is required, grouping genes in both genomes into gene families. This setting is said to be family-based, and many polynomial models have been proposed to compute the genomic distance in this setting.

Classifying each gene unambiguously into a single gene family is a difficult and error-prone task. Because of this, an alternative to the family-based setting was recently proposed: studying the rearrangement distance without prior family assignment. In this Family-Free approach, instead of families, the pairwise similarity between genes is used directly. In this talk I will present recent results about calculating the distance of two given genomes in a family-free setting, using the double cut and join (DCJ) operation, which consists of cutting a genome in two distinct positions and joining the four resultant open ends in a different way, and which represents most of the large-scale rearrangements that modify genomes.
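To make the DCJ operation concrete, here is a minimal Python sketch of the simpler family-based case, not the family-free method of the talk. It assumes both genomes contain exactly the same genes, each once, and that all chromosomes are circular; under those assumptions the standard DCJ distance is N - C, where N is the number of genes and C is the number of cycles formed by alternating between the adjacencies of the two genomes. The function names and the signed-integer encoding of genes are illustrative choices only.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene.

    A gene g > 0 is read tail-to-head, a gene g < 0 head-to-tail.
    """
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacencies(genome):
    """Map every gene extremity to its neighbour (circular chromosomes only)."""
    partner = {}
    for chrom in genome:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]
            right = extremities(gene)[1]
            left = extremities(nxt)[0]
            partner[right] = left
            partner[left] = right
    return partner


def apply_dcj(partner, p, r, swap=False):
    """Cut adjacencies {p,q} and {r,s}; rejoin as {p,r},{q,s} or {p,s},{q,r}."""
    q, s = partner[p], partner[r]
    if r in (p, q):
        raise ValueError("a DCJ must cut two distinct adjacencies")
    new = dict(partner)
    pairs = ((p, s), (q, r)) if swap else ((p, r), (q, s))
    for u, v in pairs:
        new[u] = v
        new[v] = u
    return new


def dcj_distance(genome_a, genome_b):
    """Family-based DCJ distance N - C for circular genomes on the same genes."""
    adj_a, adj_b = adjacencies(genome_a), adjacencies(genome_b)
    if set(adj_a) != set(adj_b):
        raise ValueError("genomes must contain the same genes, each exactly once")
    seen = set()
    cycles = 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:  # alternate an A-adjacency with a B-adjacency
            seen.add(x)
            y = adj_a[x]
            seen.add(y)
            x = adj_b[y]
    return len(adj_a) // 2 - cycles
```

For example, `dcj_distance([[1, 2, 3]], [[1, -2, 3]])` counts a single inversion as one DCJ. In the family-free setting this fixed one-to-one gene correspondence is not available, which is what makes that problem harder.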

Break for Coffee and Tea: 3:00-3:30PM (Arrangements TBC)

Seminar 2 – Commencing at 3:30PM

Title: Applications of k-mer consensus and context (for moderate ‘k’)

Speaker: Paul Greenfield, CSIRO

Abstract: The k-mer repetition histogram derived from tiling a single-organism sequencing dataset looks like a slightly right-side-heavy Poisson distribution, with a sharp spike at the origin. The ‘Poisson’ part of the histogram represents the consensus of the k-mers present in the genome, with the peak of the curve at the average depth of coverage, and the left-hand spike comes from sequencing errors. Most sequencing error correction tools are based on this idea of k-mer consensus, and simply scan reads, looking for k-mers whose repetition depth is too low, and replacing them with a ‘better’ k-mer from the consensus set. The tools differ largely in how they decide which k-mer is best, and even very naïve k-mer consensus algorithms perform better than the earlier published algorithms.
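As a rough illustration of what a k-mer repetition histogram is, the following Python sketch tiles a list of reads into k-mers and counts how many distinct k-mers occur at each repetition depth. It is a toy, in-memory version that ignores reverse complements and ambiguous bases; the function names are hypothetical and it is not how any of the tools mentioned here are implemented.

```python
from collections import Counter


def count_kmers(reads, k):
    """Tile every read into overlapping k-mers and count each distinct k-mer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def repetition_histogram(counts):
    """Map repetition depth -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())
```

On real sequencing data, the entry for depth 1 would correspond to the spike near the origin, and the bulk of the remaining entries would cluster around the average depth of coverage.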

Finding the correct replacement k-mer is moderately straightforward if only substitution errors are corrected, as there will often only be one k-mer in the consensus set that is close to the broken k-mer. The challenge for a good error correction algorithm is choosing the right k-mer when there are multiple alternatives, and this will be the case at every k-mer if insertion and deletion errors are to be corrected in addition to substitutions. One way of solving this problem is to consider the faulty k-mer’s context within its read – and the effect that any proposed replacement will have on the correctness of the rest of the read. The Blue software chooses which fix/k-mer to use to correct an error by recursively exploring the impact of each change on the remainder of the read. This algorithm can be thought of as building a tree of potential correct reads, and then choosing the path that requires the fewest fixes to correct the entire read.
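The idea of exploring a tree of candidate fixes and keeping the path with the fewest changes can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Blue's algorithm: it handles substitutions only, it only ever changes the last base of the first untrusted k-mer (so errors in the first k-1 bases of a read are not repaired), and the names `correct_read`, `min_depth` and `max_fixes` are assumptions made for the example.

```python
def correct_read(read, counts, k, min_depth=2, max_fixes=3):
    """Return (corrected_read, n_fixes) using the fewest substitutions, or None.

    `counts` maps k-mer -> repetition depth; a k-mer is trusted when its
    depth is at least `min_depth`.
    """
    best = None

    def trusted(kmer):
        return counts.get(kmer, 0) >= min_depth

    def explore(seq, pos, fixes):
        nonlocal best
        if best is not None and fixes >= best[1]:
            return  # this branch cannot beat the best path found so far
        # Skip over k-mers that are already supported by the consensus set.
        while pos <= len(seq) - k and trusted(seq[pos:pos + k]):
            pos += 1
        if pos > len(seq) - k:
            best = (seq, fixes)  # every k-mer in the read is now trusted
            return
        if fixes == max_fixes:
            return
        # Branch: try each alternative for the last base of the bad k-mer.
        last = pos + k - 1
        for base in "ACGT":
            if base == seq[last]:
                continue
            candidate = seq[:last] + base + seq[last + 1:]
            if trusted(candidate[pos:pos + k]):
                explore(candidate, pos + 1, fixes + 1)

    explore(read, 0, 0)
    return best
```

Each recursive call corresponds to one node in the tree of potential reads; a branch is abandoned as soon as it needs at least as many fixes as the best complete path already found.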

Blue makes use of a file of distinct k-mers (and their repetition counts) that is built by another tool called Tessel. Tessel is a scalable k-mer counter, much like Jellyfish and a number of other such tools. Building such a k-mer counter is conceptually trivial – the challenge is to build a tool that will be able to handle billions of reads for large genomes, and run in a practical time on machines without terabytes of memory. Tessel takes advantage of the nature of sequencing data, where most of the distinct k-mers only appear once (and cover sequencing errors), and most of the k-mers tiled come from repeated k-mers. Tessel is slightly slower than Jellyfish on small datasets but much faster and more memory efficient on large datasets.
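The observation about singletons can be checked on any table of k-mer counts. The short Python sketch below, with a hypothetical function name, reports the fraction of distinct k-mers that appear exactly once and the fraction of all tiled k-mers that come from repeated k-mers. It only measures the property described above and says nothing about how Tessel exploits it.

```python
def singleton_profile(counts):
    """Return (fraction of distinct k-mers seen once,
               fraction of tiled k-mers that come from repeated k-mers).

    `counts` maps each distinct k-mer to its repetition count.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    distinct = len(counts)
    total = sum(counts.values())
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / distinct, (total - singles) / total
```

When the first value is high, most of the memory in a naive counter is spent on k-mers that occur once; when the second value is also high, most of the counting work is spent on k-mers that repeat.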

In this talk I will discuss the algorithms behind these tools and their implementation, along with applications to error correction, to k-mer-based sequence comparison (avoiding the rapid fall-off in similarity that is inherent in straight D2-like k-mer metrics), and to k-mer-based filtering to find reads of interest from genomic or metagenomic datasets. One recent application of these tools was to generate perfect mitochondrial genomes from fungal sequence datasets, filtering to find putative mitochondrial reads, assembling these reads, checking the results and fixing errors using a k-mer-based contig growing tool.
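For readers unfamiliar with the terms, the plain D2 statistic is the sum, over all k-mers, of the product of that k-mer's counts in the two sequences, and k-mer-based filtering keeps reads that share enough k-mers with a target set. The Python sketch below shows both in their most basic form. It is the baseline the abstract contrasts against, not the speaker's improved metric, and the `min_fraction` threshold and function names are assumptions for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the list of overlapping k-mers in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def d2(seq_x, seq_y, k):
    """Plain D2: sum over k-mers of count_in_x * count_in_y."""
    cx, cy = Counter(kmers(seq_x, k)), Counter(kmers(seq_y, k))
    return sum(n * cy[w] for w, n in cx.items())


def filter_reads(reads, target_kmers, k, min_fraction=0.5):
    """Keep reads in which at least `min_fraction` of k-mers are in the target set."""
    kept = []
    for read in reads:
        tiled = kmers(read, k)
        if tiled and sum(w in target_kmers for w in tiled) / len(tiled) >= min_fraction:
            kept.append(read)
    return kept
```

Because D2 only rewards exact k-mer matches, a single substitution removes up to k matching k-mers at once, which is one way to see why similarity falls off quickly as sequences diverge.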

