AMALGAM - Automating microbial genome assemblies

Continuous improvements in Next-Generation Sequencing (NGS) produce ever-increasing volumes of sequencing data that serve various studies such as metagenomics, transcriptomics or whole-genome sequencing (WGS) projects. De novo assembly of raw data from WGS projects was long a tedious task requiring strong expertise, but the recent development of powerful tools has brought the technical part of this process within the reach of “standard” bioinformaticians. However, the choice of the appropriate tool depends on the nature of the data (a huge amount of short, high-quality reads vs. a few long and noisy reads) and probably on the question being addressed (assembling an isolate vs. a collection of genomes included in a complex sample).

De Bruijn Graphs (DBG) and Overlap-Layout-Consensus (OLC) are the standard approaches used to assemble NGS reads, the former mostly addressing short reads and the latter being much more suitable for longer reads. At present, several available assemblers are Swiss Army knives of sorts, as they can handle both short and long reads, but their performance still varies with the input data because tools sharing a given core algorithm do not solve special cases in the same way. For instance, we do not expect all DBG assemblers to produce exactly the same assembled sequence. The idea of our project is therefore to package those tools in an “all-in-one” solution that generates, evaluates and compares several assemblies, without aiming to be as exhaustive as MetAMOS [1], in order to i) limit software dependencies; ii) be able to run on a standard desktop computer; and iii) exploit observed discrepancies to generate an accurate consensus (easier with a few methods).
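
To make the DBG idea concrete, here is a minimal sketch in Python of how a De Bruijn graph can be built from reads: each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `de_bruijn_edges`, the toy reads and the value of k are illustrative assumptions; real assemblers add error correction, graph simplification and multiple k values, which is precisely where their outputs diverge.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a toy De Bruijn graph (hypothetical helper, not part of AMALGAM).

    Returns a dict mapping each (k-1)-mer to the set of (k-1)-mers that
    directly follow it in at least one read.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of size k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (made-up data for illustration).
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_edges(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the edges from `AC` spells out the merged sequence `ACGTCA`, which is how overlapping short reads collapse into a single path.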

Within the FRANCE-GENOMIQUE 2.7.2 work package, we have designed a bioinformatics pipeline for automatic microbial genome assembly called AMALGAM (Automatic MicrobiAL Genome AsseMbler). It includes several popular tools (SPAdes [2], IDBA-UD [3], ABySS [4] and Canu [5]) selected for their ability to handle both short and long reads, whether paired/mate-paired or not, and for their overall accuracy as reported in recent literature. At present, polished assemblies are generated only for short-read input, via the GapCloser software from the SOAPdenovo package [6], but other options are under investigation, especially for long-read input. At the very end of the process, QUAST/Icarus [7] is used to evaluate the quality of the assemblies produced. A first evaluation is performed on each assembly alone (before vs. after the polishing phase), and a second is performed when several assemblers have been used, to pinpoint discrepancies between assemblies.
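
As an illustration of the kind of comparison made at the evaluation step, the sketch below computes N50, one of the contiguity statistics reported by QUAST, for several assemblies. The contig lengths and the assembler labels used as dictionary keys are made-up values for illustration only, and the helper `n50` is a hypothetical function, not AMALGAM or QUAST code.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together cover
    at least half of the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (in bp) for two assemblies of the same genome.
assemblies = {
    "assembler_A": [100, 80, 50, 30, 20],
    "assembler_B": [150, 60, 40, 30],
}

stats = {name: n50(lengths) for name, lengths in assemblies.items()}
for name, value in stats.items():
    print(f"{name}: N50 = {value}")
```

N50 alone says nothing about correctness, which is why a comparison between assemblies also has to look at the discrepancies themselves rather than at contiguity only.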

AMALGAM is intended to become the central element of future bioinformatics pipelines dedicated to microbial genome assembly in our MicroScope annotation platform, and beyond. Used in conjunction with read binning methods, it could probably bring us a step closer to generating Metagenome-Assembled Genomes (MAGs).

References:

[1] Treangen TJ. et al., MetAMOS: a modular and open source metagenomic assembly and analysis pipeline, Genome Biology 14 (1), R2 (2013)

[2] Bankevich A. et al., SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing, Journal of Computational Biology (2012)

[3] Peng Y. et al., IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth, Bioinformatics (2012)

[4] Jackman SD. et al., ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter, bioRxiv (preprint)

[5] Koren S. et al., Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation, bioRxiv (2016)

[6] Luo R. et al., SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler, GigaScience (2012)

[7] Gurevich A. et al., QUAST: quality assessment tool for genome assemblies, Bioinformatics (2013)
