Identification of periodically expressed genes in high-throughput data

Cycle webpage

Periodicity is an important phenomenon in molecular biology and physiology. Prominent examples are the cell cycle and the circadian clock. Omics technologies such as microarrays, next-generation sequencing or mass spectrometry have enabled us to screen complete sets of transcripts or proteins for possible association with such fundamental periodic processes on a system-wide level. To assess the significance of the identified periodic expression, several detection approaches based on time series analysis and statistical modeling have been proposed. Most of these methods rely on data normality or on the extensive use of permutation tests. However, this neglects the fact that time series data generally exhibit considerable autocorrelation, i.e. correlation between successive measurements. Therefore, neither the assumption of data normality nor the assumptions underlying randomization may hold.
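To illustrate why randomization is problematic for autocorrelated data, the following minimal Python sketch simulates an AR(1) series and compares its lag-1 autocorrelation before and after shuffling. It is not part of the R scripts or the Bioconductor package described on this page; the function names, the coefficient 0.7, the series length and the seed are assumptions chosen purely for illustration.

```python
import random


def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def simulate_ar1(n, phi, sigma, rng):
    """Simulate x[t] = phi * x[t-1] + e[t] with e[t] ~ N(0, sigma^2)."""
    # Draw the first value from the stationary distribution of the process.
    x = [rng.gauss(0.0, sigma / (1 - phi ** 2) ** 0.5)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x


rng = random.Random(0)                        # illustrative seed
series = simulate_ar1(2000, 0.7, 1.0, rng)    # illustrative parameters
shuffled = series[:]
rng.shuffle(shuffled)                         # what a permutation test does

r_series = lag1_autocorrelation(series)
r_shuffled = lag1_autocorrelation(shuffled)
print(f"lag-1 autocorrelation, original: {r_series:.2f}")
print(f"lag-1 autocorrelation, shuffled: {r_shuffled:.2f}")
```

Shuffling keeps the distribution of values but removes the dependency between successive measurements, so a permutation-based background no longer resembles the data it is meant to represent.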

We showed that this failure can substantially interfere with significance testing, and that neglecting autocorrelation can lead to a considerable overestimation of the number of periodically expressed genes (Bioinformatics 2008). Our analysis shows that randomized and Gaussian background models neglect the dependency structure within the observed data. In contrast, autoregressive AR(1) background models gave a more accurate representation of the correlations between measurements. More importantly, the choice of background model has drastic effects on the number of genes detected as significantly periodically expressed. A study of yeast cell cycle expression data showed clearly that randomized and Gaussian background models tend to overestimate the number of significantly periodically expressed genes, whereas the more accurate AR(1) background led to a considerable reduction in the number of periodic genes. Most importantly, a subsequent assessment using benchmark datasets demonstrated that AR(1)-based models achieve superior accuracy in determining periodically expressed genes.
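The general idea of scoring periodicity and assessing it against different background models can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the published implementation: the function names, the standardization of each series, the lag-1 estimate of the AR(1) coefficient, the empirical p-value formula and the number of background series are all choices made for this example.

```python
import math
import random


def standardize(x):
    """Shift and scale a series to zero mean and unit (population) variance."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]


def fourier_score(x, period):
    """Amplitude of the Fourier component at `period` (in sampling steps)."""
    z = standardize(x)
    c = sum(v * math.cos(2 * math.pi * t / period) for t, v in enumerate(z))
    s = sum(v * math.sin(2 * math.pi * t / period) for t, v in enumerate(z))
    return math.hypot(c, s)


def lag1_coefficient(x):
    """Lag-1 autocorrelation, used here as the AR(1) coefficient estimate."""
    z = standardize(x)
    return sum(z[t] * z[t + 1] for t in range(len(z) - 1)) / len(z)


def ar1_background(x, rng):
    """One surrogate series drawn from an AR(1) model fitted to x."""
    phi = lag1_coefficient(x)
    innovation_sd = math.sqrt(1 - phi ** 2)  # unit stationary variance
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(len(x) - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, innovation_sd))
    return y


def permutation_background(x, rng):
    """One surrogate series obtained by shuffling the time points of x."""
    y = list(x)
    rng.shuffle(y)
    return y


def empirical_p_value(x, period, make_background, n_background=500, seed=0):
    """Share of background series scoring at least as high as x."""
    rng = random.Random(seed)
    observed = fourier_score(x, period)
    exceed = sum(
        fourier_score(make_background(x, rng), period) >= observed
        for _ in range(n_background)
    )
    return (1 + exceed) / (1 + n_background)


# Hypothetical expression profile: 24 time points, noisy signal of period 12.
rng = random.Random(1)
profile = [math.sin(2 * math.pi * t / 12) + rng.gauss(0.0, 1.0) for t in range(24)]
print("p (permutation background):", empirical_p_value(profile, 12, permutation_background))
print("p (AR(1) background):      ", empirical_p_value(profile, 12, ar1_background))
```

Passing the background generator as an argument keeps the score fixed and changes only the null model, which is the comparison the paragraph above describes. For a genome-wide screen the resulting p-values would still need a correction for multiple testing.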

Further information

Download

The identification of periodically expressed genes using Fourier analysis and the statistical assessment of significance using different background models are implemented in the following R scripts.

R/Bioconductor package

A fully documented R/Bioconductor package is also available. The current version of the package can be downloaded from the Bioconductor repository.

Contact

Questions and comments can be addressed to Matthias Futschik.