Transcriptome sequencing has long been the favored method for quickly and inexpensively obtaining a large number of gene sequences from an organism with no reference genome. Owing to the rapid increase in throughputs and decrease in costs of next-generation sequencing, RNA-Seq in particular has become the method of choice.
However, the very short reads e. We evaluated its performance on transcriptome datasets from rice and mouse. Using as our benchmarks the known transcripts from these well-annotated genomes (sequenced a decade ago), we assessed how SOAPdenovo-Trans and two other popular transcriptome assemblers handled such practical issues as alternative splicing and variable expression levels.
Our conclusion is that SOAPdenovo-Trans provides higher contiguity, lower redundancy and faster execution. Source code and user manual are available at http: Supplementary data are available at Bioinformatics online. Transcript sequences and gene expression levels can now be efficiently obtained using RNA-Seq on next-generation sequencing technologies, providing increased throughputs and decreased costs. Applications for RNA-Seq include discriminating expression levels of allelic variants and detecting gene fusions Maher et al.
To carry out these types of analyses requires an assembler that can reconstruct the transcripts from very short reads e. Assemblers such as Cufflinks Trapnell et al. In these situations, de novo assembly is required.
The challenge is not only to recover full-length transcripts but also to identify alternative splice forms in the presence of variable gene expression levels. These programs were intended to recover sequences for genomes of a known estimated size with a defined number of chromosomes.
In contrast, transcriptome assemblers must recover an unknown number of RNA sequences, typically on the order of tens of thousands. Further, transcript sequences are only a few kilobases in length, as compared with chromosomes, which can be hundreds of megabases in length. Adding to the complexity is that gene expression levels vary by many orders of magnitude, so that for any non-zero sequencing error rate the most highly expressed genes will always harbor many discrepant bases, making it impossible to define an absolute threshold for the number of sequencing errors allowed per assembly.
This then needs to be addressed. In recent years, some important changes have been introduced to improve transcriptome assembly. Oases enumerated all possible transcripts with the simplifying concept of assembly sub-graphs and then used a robust heuristic algorithm to traverse these graphs. Trinity introduced a new error removal model to account for variations in gene expression levels and then used a dynamic programming procedure to traverse their graphs.
However, there is a lot of room for improvement, e. Oases produces more redundant transcripts, possibly due to it lacking an effective error-removal model Lu et al. SOAPdenovo-Trans incorporates the error-removal model from Trinity and the robust heuristic graph traversal method from Oases. In addition, we use a strict transitive reduction method to simplify the scaffolding graphs, and provide more accurate results. To assess the impact of these changes, we evaluated all three assemblers on rice and mouse, which have established transcriptome data linked to genome annotations produced over the last decade.
The results here demonstrated that SOAPdenovo-Trans provides higher contiguity, lower redundancy and faster execution. However, SOAPdenovo2 was designed for genomes with uniform sequencing depth. Thus, its error-removal model is not applicable to RNA-Seq data. It also does not allow for alternative splicing. Adopting and improving on concepts from Trinity and Oases resolved these issues.
DBG are constructed from reads; sequencing errors are removed; and contigs are then constructed. Transcripts are generated by traversing through reliable paths for each graph. (B) Management of ambiguous contigs. (C) Linearizing contigs into scaffolds.
For global error removal, low-frequency k-mers, edges, arcs (direct linkages between contigs in the DBG) and tips are removed, and bubbles are pinched.
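The low-frequency k-mer step can be illustrated with a minimal Python sketch. It is not the SOAPdenovo-Trans implementation: the function names, the toy reads and the threshold `min_count` are assumptions chosen for illustration, and reverse-complement handling is ignored.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def remove_low_frequency(counts, min_count):
    """Global error removal: keep only k-mers seen at least min_count times."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# Toy input (hypothetical): three identical reads plus one read with a T->A error.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
kmer_counts = count_kmers(reads, k=4)
kept = remove_low_frequency(kmer_counts, min_count=2)
print(sorted(kept))
```

In this toy case the three k-mers created by the erroneous read each occur once and are dropped, while the correct k-mers survive.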
This is done in SOAPdenovo2 under the assumption that most are the result of sequencing errors. However, for the most highly expressed genes in a transcriptome, sequencing errors often generate k-mers that exceed any reasonable global error removal threshold. These cannot be corrected by global error removal. Finally, we used the same method as SOAPdenovo2 to generate contigs. This is important because transcripts are much shorter than chromosomes, so it is essential to use the information that may only be found in single-end reads.
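To see why a single global cutoff breaks down, the Python sketch below compares it with a depth-aware rule. The relative rule (keep an arc only if its support is a minimum fraction of the strongest arc leaving the same node) is an assumption used for illustration; the text here does not spell out the exact model adopted from Trinity, and all names and counts are hypothetical.

```python
def global_filter(arc_counts, min_count):
    """Keep arcs whose read support reaches one fixed, global cutoff."""
    return {arc: n for arc, n in arc_counts.items() if n >= min_count}

def relative_filter(arc_counts, min_fraction):
    """Keep an arc only if its support is at least min_fraction of the
    best-supported arc leaving the same source node (illustrative rule)."""
    best = {}
    for (src, _dst), n in arc_counts.items():
        best[src] = max(best.get(src, 0), n)
    return {arc: n for arc, n in arc_counts.items()
            if n >= min_fraction * best[arc[0]]}

# Hypothetical read support for three arcs.
arc_counts = {
    ("high_a", "high_b"): 10000,  # true arc, highly expressed gene
    ("high_a", "high_err"): 40,   # error arc in the same gene
    ("low_a", "low_b"): 6,        # true arc, lowly expressed gene
}

loose = global_filter(arc_counts, min_count=3)    # error arc survives
strict = global_filter(arc_counts, min_count=50)  # low-expression arc is lost
local = relative_filter(arc_counts, min_fraction=0.05)
print(sorted(local))
```

No single value of `min_count` removes the error arc while keeping the lowly expressed gene, whereas the relative rule does both on this toy input.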
The number of reads is then used to assign weights to these linkages, and insert sizes from the paired-ends are used to estimate the distances between linkages. This, however, is inappropriate for transcriptome assembly because of alternative splicing and variable gene expression levels.
Alternative splicing establishes multiple successive linkages from a unique contig. The data representation of this appears analogous to ambiguities in whole genome assembly. Variable gene expression levels make it impossible to define a contig as repetitive using a single depth constant. This removes not only sequencing errors but also short ambiguous contigs caused by repeats, which in turn obviates the need for the scaffolding module to resolve complicated ambiguities.
As a result, this increases its ability to identify alternative splicing events Fig. Unconditional removal of short contigs does result in the creation of many small gaps, but this is corrected in the final phase of our algorithm by a gap-filling module described in Section 2. Linearization of contigs into scaffolds also differs between genome and transcriptome assembly.
For genomes, after introducing paired-end reads with multiple tiers of insert sizes, a starting contig may have multiple successive contigs at different distances from the starting contig. The expectation is for these contigs to be linearly integrated into a single scaffold; however, for transcriptomes, conflicts may legitimately arise because multiple alternative splice forms share the same starting contig.
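The transitive reduction used to simplify scaffolding graphs can be sketched in Python as follows. This is a generic transitive reduction on an acyclic graph of contig links, not the authors' "strict" variant: it ignores link distances and weights, and the contig names are hypothetical.

```python
def transitive_reduction(edges):
    """Drop every edge (u, v) for which v is still reachable from u
    through some longer path. Assumes the graph is acyclic."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())

    def reachable(start, target, skip_edge):
        # Depth-first search that is not allowed to use skip_edge itself.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return sorted(e for e in set(edges)
                  if not reachable(e[0], e[1], skip_edge=e))

# c1 links to c3 directly (long insert) and also via c2 and via c4.
links = [("c1", "c2"), ("c2", "c3"), ("c1", "c3"), ("c1", "c4"), ("c4", "c3")]
reduced = transitive_reduction(links)
print(reduced)
```

The direct link from c1 to c3 is removed because it is implied by the shorter links, while the fork through c2 and c4, which could represent alternative splice forms, is preserved.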
Contigs were clustered into sub-graphs according to their linkage. Each sub-graph consists of a set of transcripts (alternative splice forms) that share common exons. SOAPdenovo-Trans traverses these sub-graphs using the algorithm from Oases to generate possible transcripts from linear, fork and bubble paths. For the most complex paths, only the top scoring transcripts are retained. Paired-end information was used to cluster semi-unmapped reads into the gap regions, and then these reads were locally assembled into a consensus.
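A simplified Python sketch of the clustering and traversal steps is given below. It is not the Oases algorithm: sub-graphs are taken to be connected components of the link graph, paths are enumerated exhaustively on an acyclic sub-graph, and the score (sum of link weights) and the `top_n` cutoff are assumptions made for illustration.

```python
from collections import defaultdict

def cluster_contigs(links):
    """Group contigs into sub-graphs: two contigs belong together if a
    chain of links connects them (connected components)."""
    neighbours = defaultdict(set)
    for a, b, _weight in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, clusters = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            members.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(members))
    return clusters

def top_paths(links, top_n):
    """Enumerate source-to-sink paths in an acyclic sub-graph and keep the
    top_n paths by total link weight (hypothetical scoring)."""
    succ = defaultdict(list)
    nodes, has_pred = set(), set()
    for a, b, weight in links:
        succ[a].append((b, weight))
        nodes.update((a, b))
        has_pred.add(b)
    paths = []

    def walk(node, path, score):
        if not succ[node]:  # sink reached: record the finished path
            paths.append((score, path))
            return
        for nxt, weight in succ[node]:
            walk(nxt, path + [nxt], score + weight)

    for source in sorted(nodes - has_pred):
        walk(source, [source], 0)
    paths.sort(key=lambda item: (-item[0], item[1]))
    return [path for _score, path in paths[:top_n]]

# Hypothetical links: (from_contig, to_contig, supporting_read_pairs).
links = [("e1", "e2", 30), ("e2", "e3", 25), ("e1", "e3", 8), ("x1", "x2", 12)]
clusters = cluster_contigs(links)
gene_links = [l for l in links if l[0] in clusters[0]]
isoforms = top_paths(gene_links, top_n=2)
print(clusters, isoforms)
```

Here the first sub-graph yields two candidate transcripts, one that includes contig e2 and one that skips it, which is the pattern an exon-skipping splice form would produce.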
In instances where multiple consensus sequences were assembled, we selected the sequence that had a length most consistent with the gap size. For our first benchmark test dataset, we used rice transcriptome data from Oryza sativa panicle at booting stage.
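The selection rule for competing gap consensus sequences can be written as a short Python sketch. The function name and the toy sequences are hypothetical, and breaking ties by taking the first candidate is an assumption, since the text does not say how ties are handled.

```python
def pick_gap_consensus(candidates, gap_size):
    """Return the candidate whose length is closest to the estimated gap
    size; on a tie, the first candidate in the list wins."""
    return min(candidates, key=lambda seq: abs(len(seq) - gap_size))

# Three locally assembled candidates for a gap estimated at 8 bases.
candidates = ["ACGT", "ACGTACGTAC", "ACGTACG"]
chosen = pick_gap_consensus(candidates, gap_size=8)
print(chosen)
```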
Paired-end sequences were generated on an Illumina GA platform Zhang et al. For our analysis, we used a large (L) and a small (S) dataset. The L dataset contained The second benchmark test dataset was mouse transcriptome data from Mus musculus dendritic cells.
Here, the L dataset contained The S dataset was down-sampled from the L dataset (extracting one of every three reads from the L dataset) and contained Trinity version r was run with minimum-assembled-contig-length-to-report set to The reference genomes and curated annotations were downloaded from the following two Web sites.
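Down-sampling by taking one of every three reads can be sketched in Python as below. The text does not say which read of each triple was kept or how pairs were handled; this sketch assumes the first record of each triple is kept and that each record is a complete read pair, so mates stay together. The read names are hypothetical.

```python
from itertools import islice

def downsample(records, step=3):
    """Keep one record out of every `step`, starting with the first."""
    return list(islice(records, 0, None, step))

# Each record is a (read1, read2) pair so that mates are kept together.
pairs = [(f"r{i}/1", f"r{i}/2") for i in range(1, 10)]
small = downsample(pairs, step=3)
print(small)
```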
Note that for rice, our transcriptome data came from the indica subspecies, but our reference genome came from the japonica subspecies. We chose the japonica genome as a reference because these annotations are more extensively manually curated than the indica counterparts. Ideally, we should have used japonica transcriptome data, but we used indica transcriptome data instead because there is little japonica data from the Illumina platform that is freely available.
The use of these different subspecies is not totally unreasonable because they differ on average by only a fraction of a percent Yu et al. We do, however, note that there are local regions of higher variability that will prevent some indica transcripts from aligning to the japonica genome.
When a transcript aligned to multiple genome loci, we selected the locus with the longest alignment. When multiple transcripts aligned to the same genome locus and we needed a single representative for our analysis, we selected the largest of these putative alternative splice forms.
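These two selection rules can be expressed as a small Python sketch. The record layout (`transcript`, `locus`, `aligned_len`) and the example values are hypothetical; only the rules themselves (longest alignment per transcript, largest transcript per locus) come from the text.

```python
def longest_alignment_per_transcript(alignments):
    """For each transcript, keep the alignment with the greatest aligned length."""
    best = {}
    for aln in alignments:
        current = best.get(aln["transcript"])
        if current is None or aln["aligned_len"] > current["aligned_len"]:
            best[aln["transcript"]] = aln
    return best

def representative_per_locus(best, transcript_len):
    """For each locus, keep the longest transcript assigned to it."""
    representative = {}
    for name, aln in best.items():
        locus = aln["locus"]
        if (locus not in representative
                or transcript_len[name] > transcript_len[representative[locus]]):
            representative[locus] = name
    return representative

# Hypothetical alignments: t1 hits two loci, t1 and t2 share geneA.
alignments = [
    {"transcript": "t1", "locus": "geneA", "aligned_len": 900},
    {"transcript": "t1", "locus": "geneB", "aligned_len": 300},
    {"transcript": "t2", "locus": "geneA", "aligned_len": 1200},
    {"transcript": "t3", "locus": "geneC", "aligned_len": 500},
]
transcript_len = {"t1": 1000, "t2": 1500, "t3": 600}

best = longest_alignment_per_transcript(alignments)
representative = representative_per_locus(best, transcript_len)
print(representative)
```

On this toy input t1 is assigned to geneA rather than geneB, and t2 is then chosen over t1 as the single representative for geneA.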
As both genomes were sequenced a decade ago, the annotation has been extensively curated, making these appropriate benchmarks to assess the assembly software. We chose to assess both plant and animal transcriptomes because most other studies only assessed animals (or even simpler organisms like yeast), and we wanted to be sure that our assembler could handle the difficulties created by plant data.
Plants have larger gene families and more transposable elements (TEs); some of these TEs are also highly expressed. We first assessed the computational demands of the three software programs with regard to peak memory and time (Table 1).
All assemblies were processed with 10 threads, on a computer with two Quad-core Intel 2.
Alignment of the assembled transcripts to the annotated genomes (Table 2) showed that SOAPdenovo-Trans produced the fewest transcripts, by more than a factor of 2 in the most extreme cases, even after removing assemblies that were shorter than bp.
One might naively attribute the differences in transcript numbers to alternative splice forms, but we would advise caution.
There could be, for example, non-overlapping fragments of the same isoform or redundant copies of the same isoform. Our analyses generated a successive reduction in the number of assemblies. First, we restricted our analyses to assemblies larger than bp. We then confined our analysis to assemblies that overlapped with annotated genes. Because multiple assemblies could align to the same genome locus, we generated two datasets: In choosing among the isoforms, whether for series-B or the genome annotations, we always used the longest available sequence.
In the case of the rice transcriptome, about We indicate here the percentage of the assembled transcripts that were not known to be TEs. The following analyses are focused only on those transcripts that aligned to genome loci with annotated genes. We used the terms series-A and series-B to denote the sets of transcripts that included or excluded putative alternative splice forms, respectively. Series-A includes all assembled transcripts, while series-B is a strict subset that includes only the largest assembled transcript for any given locus.
To properly assess the differences between assemblers, it is important to first understand how the rice and mouse assemblies differed from each other. Despite the fact that the rice and mouse datasets have similar amounts of raw input data, i. S and L datasets S: Given that many more rice genes had to be recovered from the same amount of sequence data, the read depths per gene were lower; as a result, the rice assemblies were not as high quality as the mouse assemblies.
Furthermore, we expected that, given no extensive assembly errors i.
We could eliminate most of the alignment failures by aligning the transcripts to combined genomes of both subspecies; however, to avoid the complications of having two genome annotations, we used only the alignments to the japonica genome.