Preprocessing tools

Raw sequence data need to be pretreated (quality control, paired-end assembly, sorting, ...) to facilitate downstream analyses. Metagenomic and metatranscriptomic datasets are complex, which raises not only sequencing costs but also data analysis costs. Sequence pretreatments can reduce this complexity and the risk of misinterpreting the data.

The necessary data treatments depend heavily on the data and on the technologies used to generate them. This guide and the proposed treatments are intended to bring the data to a quality suitable for downstream analyses. However, there is no “one-size-fits-all” solution.

Before any taxonomic or functional assignment, the sequences have to be preprocessed in several steps.

| Subsection | Tool name | Tool version | Subtools | Galaxy wrapper name | Galaxy wrapper revision |
|---|---|---|---|---|---|
| Assemble paired-end sequences | FastQ joiner [2] | 1.0.1 | | fastq_paired_end_joiner | 6a7f5da7c76d |
| | FastQ-join [1] | 1.1.2-806 | | fastq_join | 8ec3dfde378b |
| Control quality | FastQC (Documentation) | 0.67 | | fastqc | 3a458e268066 |
| | PRINSEQ [7] | 0.20.4 | | prinseq | 6b865dde1baa |
| VSEARCH Tool Suite | VSEARCH [6] | 1.9.7 | Alignment, Chimera detection | vsearch | 576963db5f1b |
| Cluster sequences | CD-HIT [3][5] | 4.6.4 | CD-HIT EST | cdhit | 54d811ad2b52 |
| | Format cd-hit outputs | | | format_cd_hit_output | 4015e9d6d277 |
| Sort rRNA/rDNA | SortMeRNA [4] | 2.1b | | sortmerna | 59252ca85c74 |
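To make the quality-control step more concrete, here is a minimal Python sketch of a read filter based on read length and mean Phred quality. It is only an illustration of the idea and not how FastQC or PRINSEQ are implemented. The function names (`read_fastq`, `mean_phred`, `quality_filter`), the thresholds, and the Phred+33 encoding are all assumptions made for this example.

```python
from io import StringIO
from typing import Iterable, Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or blank line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header, sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def quality_filter(
    records: Iterable[FastqRecord],
    min_length: int = 4,             # hypothetical threshold
    min_mean_quality: float = 20.0,  # hypothetical threshold
) -> List[FastqRecord]:
    """Keep reads that are long enough and have a sufficient mean quality."""
    return [
        record
        for record in records
        if len(record.sequence) >= min_length
        and mean_phred(record.quality) >= min_mean_quality
    ]


# Toy input: read1 passes, read2 has low quality, read3 is too short.
EXAMPLE = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\n########\n"
    "@read3\nAC\n+\nII\n"
)

kept = quality_filter(read_fastq(StringIO(EXAMPLE)))
print([record.header for record in kept])
```

In practice the thresholds would be chosen after inspecting a quality report for the dataset, since suitable values depend on the sequencing technology.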


[1] Erik Aronesty. Ea-utils: “Command-line tools for processing biological sequencing data”. 2011.
[2] Daniel Blankenberg, Assaf Gordon, Gregory Von Kuster, Nathan Coraor, James Taylor, Anton Nekrutenko, and the Galaxy Team. Manipulation of FASTQ data with Galaxy. Bioinformatics, 26(14):1783–1785, July 2010. doi:10.1093/bioinformatics/btq281.
[3] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, December 2012. doi:10.1093/bioinformatics/bts565.
[4] Evguenia Kopylova, Laurent Noé, and Hélène Touzet. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics, 28(24):3211–3217, December 2012. doi:10.1093/bioinformatics/bts611.
[5] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, July 2006. doi:10.1093/bioinformatics/btl158.
[6] Torbjørn Rognes, Frédéric Mahé, Tomas Flouri, Daniel McDonal, and Pat Schloss. Vsearch: VSEARCH 1.4.0. 2015.
[7] Robert Schmieder and Robert Edwards. Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6):863–864, March 2011. doi:10.1093/bioinformatics/btr026.