A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments
Esnaola, Mikel (Centre de Recerca en Epidemiologia Ambiental)
Puig, Pedro (Universitat Autònoma de Barcelona. Departament de Matemàtiques)
González Buisán, David (Centre de Regulació Genòmica)
Castelo, Robert (Universitat Pompeu Fabra. Departament de Ciències Experimentals i de la Salut)
González, Juan Ramón (Universitat Autònoma de Barcelona. Departament de Matemàtiques)

Date: 2013
Abstract: Background: High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user. Results: Here we show that expression profiles produced by extensively replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that tweeDEseq yields P-values that are equally or more accurate than competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies. Finally, we compared the results with those obtained from microarrays in order to check for reproducibility. Conclusions: RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at http://www.bioconductor.org.
Rights: This document is subject to a Creative Commons license. Total or partial reproduction, distribution, public communication of the work, and the creation of derivative works are permitted, including for commercial purposes, provided that the authorship of the original work is acknowledged.
Language: English
Document: Article; research; published version
Published in: BMC Bioinformatics, Vol. 14, No. 254 (August 2013), p. 1-47, ISSN 1471-2105

DOI: 10.1186/1471-2105-14-254
PMID: 23965047


22 p, 4.2 MB

Additional files. Open data
22 p, 4.4 MB

The record appears in these collections:
Articles > Research articles
Articles > Published articles

Record created 2013-10-03, last modified 2024-01-16


