…inished, at a “permanent draft” stage, which is used for subsequent analyses. Before proceeding with such analyses, it is essential to evaluate the consensus error rate and correctness of those assemblies. Furthermore, given the numerous sequencing technologies now in use, it is critical to know the capabilities and limitations of each, and to design and evaluate sequencing projects on this basis. Here we present an evaluation of current sequencing technologies based on analysis of 133 microbial genomes sequenced during the last seven years at the Department of Energy-Joint Genome Institute (DOE-JGI). We use these data to evaluate the quality of the assembled product and, in particular, to compare the draft products resulting from automated assemblies with the finished genomes.

Figure 1. The distribution of projects among the 12 sequencing methods used. Dark green indicates the methods for which there are more than 5 sequenced projects; these were used in downstream analysis. doi:10.1371/journal.pone.0048837.g

Results and Discussion

Genomes and technologies surveyed

During the last 7 years, 133 microbial genomes were sequenced to completion at the DOE-JGI (Table S1). These sequencing projects were carried out using a variety of sequencing technologies, alone or in combination (Table 1 and Figure 1). Several projects specifically compared different variants of a method (e.g., Illumina vs Illumina+PacBio). Included are draft and finished genomes that were submitted to GenBank and that included only contigs that were >200 bp. This size threshold was used in compliance with NCBI rules for submission of data from sequencing projects. The projects selected span the full spectrum of GC percentage and phylogenetic placement (Table S1).

These projects were sequenced up to the end of 2011; however, the technologies and methods used are undergoing constant improvement, which yields significantly better results. For example, Illumina transitioned from V2 to V3 chemistry with significant improvement in the final product. Improvements in the software used to process these data have likewise been reflected in the quality of the end product. The purpose of this report is not to evaluate these differences thoroughly; rather, it focuses on the differences observed while transitioning from one technology to another, and on the resulting quality of the assembled and annotated product.

Quality of assembly

Two metrics were used to evaluate the quality of the produced assembly: the number of contigs in the draft assembly and the amount of missing DNA sequence, i.e., the number of bases in the finished assembly that are not included in the draft. In both cases, higher numbers indicate a worse assembly, which results in loss of information about the genome (e.g., missing genes and gene context information) and makes downstream analysis more difficult. Overall, NGS technologies yield fewer contigs than Sanger-based sequencing (Figure 2). The 454 technology alone produces better results than Sanger alone; combining Sanger with 454 reduces the number of scaffolds further. In comparison, standard Illumina yields more draft scaffolds, but the number is significantly reduced when long mate pair libraries are used or when Illumina is combined with 454, and more so when combined with PacBio sequence data. Each region of the finished genome that is missing from the draft assembly was identified as a gap. The number of gaps (gap occurrences) per genome (Figu.
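As an illustration of the two assembly-quality metrics defined under Quality of assembly (number of draft contigs, and bases of the finished genome missing from the draft), the following is a minimal Python sketch. It is not the pipeline used in this study. It assumes that draft contigs have already been aligned to the finished genome by some external tool and that the aligned regions are available as half-open intervals; the function names, the interval representation, and the treatment of the genome as linear (a gap spanning the origin of a circular chromosome would be counted twice) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Half-open interval [start, end) on the finished genome (illustrative convention).
Interval = Tuple[int, int]

# Contigs must be longer than this many bp (size threshold stated in the text).
MIN_CONTIG_LEN = 200


def count_contigs(contig_lengths: List[int], min_len: int = MIN_CONTIG_LEN) -> int:
    """Count draft contigs whose length exceeds min_len bp."""
    return sum(1 for n in contig_lengths if n > min_len)


def find_gaps(covered: List[Interval], genome_len: int) -> List[Interval]:
    """Return regions of the finished genome not covered by any draft alignment."""
    gaps: List[Interval] = []
    pos = 0  # first finished-genome position not yet known to be covered
    for start, end in sorted(covered):
        # Clamp each alignment to the bounds of the finished genome.
        start = min(max(start, 0), genome_len)
        end = min(max(end, 0), genome_len)
        if start > pos:
            gaps.append((pos, start))  # uncovered stretch before this alignment
        pos = max(pos, end)
    if pos < genome_len:
        gaps.append((pos, genome_len))  # uncovered tail of the genome
    return gaps


def missing_bases(gaps: List[Interval]) -> int:
    """Total number of finished-genome bases that fall inside gaps."""
    return sum(end - start for start, end in gaps)


if __name__ == "__main__":
    # Toy data, not taken from the study.
    gaps = find_gaps([(0, 300), (250, 600), (700, 950)], genome_len=1000)
    print(len(gaps), missing_bases(gaps))
    print(count_contigs([150, 200, 201, 5000]))
```

In this sketch, overlapping alignments are merged implicitly by tracking the rightmost covered position, so each uncovered region is reported once; the number of gaps is simply the length of the returned list.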