Clustering is an important tool in gene expression data analysis, at both the transcript and the protein level. This unsupervised classification technique is commonly used to reveal structures hidden in large gene expression data sets. The vast majority of clustering algorithms in use produce hard partitions of the data, i.e. each gene or protein is assigned to exactly one cluster. Hard clustering works well if clusters are well separated. However, this is generally not the case for gene expression time-course data, where gene/protein clusters frequently overlap. Additionally, hard clustering algorithms are often highly sensitive to noise.
To overcome the limitations of hard clustering, we have implemented soft clustering, which offers several advantages for researchers. First, it makes the internal structure of clusters accessible, i.e. it indicates how well each gene or protein is represented by its cluster. This can be used for a more targeted search for regulatory elements (see Publication). Second, the overall relation between clusters, and thus a global clustering structure, can be defined. Additionally, soft clustering is more robust to noise, so a priori pre-filtering of genes/proteins can be avoided. This prevents the exclusion of biologically relevant genes/proteins from the data analysis.
Q: Can I use Mfuzz for clustering RNA-Seq data? In the publication, only its application for clustering microarray data is presented.
A: Yes, you can use Mfuzz to analyse RNA-Seq data (as well as many other types of time-course data). However, you might need some additional preprocessing. For instance, starting from FPKMs, you might first exclude genes that show no expression (i.e. with FPKM equal to zero). Since it is common to log-transform the FPKMs before clustering (although it is not strictly required), you might need to add a pseudo-count to your FPKMs if some values are still zero. This avoids taking the logarithm of zero. After that, you can construct an ExpressionSet object and standardise the FPKM values using the standardise function.
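The preprocessing steps described above can be sketched as follows. This is only a minimal sketch under stated assumptions: the matrix fpkm is made-up illustration data, "no expression" is interpreted as FPKM equal to zero in every sample, and the pseudo-count of 1 and the log2 base are arbitrary choices that you may want to adapt to your data.

```r
# Hypothetical FPKM matrix: genes in rows, time points in columns.
fpkm <- matrix(
  c(0,  0,  0,  0,
    5, 10, 20, 40,
    0,  3,  6,  0,
    8,  8,  4,  2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("t", 1:4))
)

# 1. Exclude genes without any expression
#    (assumption: FPKM == 0 in all samples).
expressed <- fpkm[rowSums(fpkm > 0) > 0, , drop = FALSE]

# 2. Add a pseudo-count (assumed value: 1) so that the remaining zeros
#    can be log-transformed, then take log2.
pseudo <- 1
log_fpkm <- log2(expressed + pseudo)

# 3. Build the ExpressionSet and standardise it. This part runs only if
#    Biobase and Mfuzz are installed.
if (requireNamespace("Biobase", quietly = TRUE) &&
    requireNamespace("Mfuzz", quietly = TRUE)) {
  library(Biobase)
  library(Mfuzz)
  eset <- new("ExpressionSet", exprs = log_fpkm)
  eset.s <- standardise(eset)
}
```

In this toy example gene1 is removed, while gene3 keeps its two zero values, which become 0 after adding the pseudo-count and taking the logarithm.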
Q: I have a plain table of expression data - how can I use Mfuzz?
A: The table needs to be converted into an ExpressionSet object. For instance, if you have imported a table into the R environment (e.g. using read.table) and converted it to a matrix M, you can generate an ExpressionSet object by eset <- new("ExpressionSet",exprs=M). More details can be found by typing help(ExpressionSet).
If you have a table with expression values and gene names,
set as.is=TRUE in read.table so that text columns such as the gene names are kept as character strings rather than being converted to factors.
For example, if you have a tab-delimited table called data.tab with a header row and the gene names in the first column,
then
library(Biobase)  # provides the ExpressionSet class
ex <- read.csv("data.tab", sep="\t", as.is=TRUE, header=TRUE, row.names=1)
ex.m <- as.matrix(ex)
eset <- new("ExpressionSet", exprs=ex.m)
will produce an ExpressionSet object called eset, which can be used for Mfuzz.
Q: Can I use the raw (i.e. not normalised) expression data for Mfuzz clustering?
A: In principle, this is possible, but it is usually not advisable. Mfuzz assumes that the given expression data have been preprocessed
(including normalisation). The function standardise, which transforms the expression of each gene/protein to have a mean of zero and a standard deviation of one, does not replace the normalisation step. Note the difference: normalisation is carried
out to make different samples comparable, while standardisation (in Mfuzz)
is carried out to make genes/transcripts/proteins comparable.
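To illustrate what gene-wise standardisation means, here is a small base R sketch. It is not the code of the standardise function; it only reproduces the transformation described above (mean zero and standard deviation one per gene) on a made-up matrix M that is assumed to be normalised already.

```r
# Hypothetical, already normalised expression matrix (genes x samples).
M <- matrix(
  c( 1,  2,  3,  4,
    10, 20, 30, 40,
     5,  5,  6,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("t", 1:4))
)

# Gene-wise standardisation: subtract each row's mean and divide by its
# standard deviation. scale() works on columns, hence the double transpose.
M.s <- t(scale(t(M)))

rowMeans(M.s)       # mean of each gene (zero up to rounding error)
apply(M.s, 1, sd)   # standard deviation of each gene
```

geneA and geneB differ tenfold in absolute expression but follow the same pattern over time, so their standardised profiles are identical. This is why standardisation makes genes comparable, while it does nothing to correct differences between samples.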
Q: I have a time series experiment but the columns do not correspond to the temporal order. Does Mfuzz sort the columns?
A: No, it does not. It uses the column order of the expression matrix stored in the ExpressionSet object (i.e. exprs(eset)). So please make sure that the columns are in the correct temporal order before clustering.
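One way to bring the columns into temporal order before building the ExpressionSet is sketched below. The matrix M, its column names and the vector time.min of sampling times are hypothetical and only serve as an illustration.

```r
# Hypothetical expression matrix whose columns are out of temporal order.
M <- matrix(1:12, nrow = 3,
            dimnames = list(paste0("gene", 1:3),
                            c("t30min", "t0min", "t60min", "t10min")))

# Assumed sampling times (in minutes) of the columns, in their current order.
time.min <- c(30, 0, 60, 10)

# Reorder the columns by sampling time.
M.sorted <- M[, order(time.min)]
colnames(M.sorted)
```

If the data are already stored in an ExpressionSet, the same column indexing should work on the object itself, e.g. eset[, order(time.min)].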
Q: Where do I see the genes/proteins included in a cluster and the size of the clusters?
A: You can get this information from the fclust object that is produced by
the mfuzz function. For instance, after cl <- mfuzz(yeastF,c=20,m=1.25), just type
cl$size for the sizes of the clusters and cl$cluster for the cluster assignment of the individual genes/proteins. Note that this assignment is based on the highest membership value and thus amounts to a hard clustering, i.e. even a poorly clustered gene (e.g. with a maximum membership value of 0.1) will be assigned to a cluster. Alternatively, you can use the acore function to set a minimum membership value for the association of a gene/protein with a cluster. This ensures that the associated genes are well clustered and is better suited to soft clustering.
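The difference between the two kinds of assignment can be illustrated with a small base R sketch. The membership matrix below is made up and merely laid out like cl$membership; the threshold of 0.5 is an arbitrary choice. The last step only mimics the idea behind acore and is not its implementation.

```r
# Hypothetical membership matrix (genes x clusters); each row sums to 1.
membership <- matrix(
  c(0.90, 0.05, 0.05,
    0.40, 0.35, 0.25,
    0.10, 0.85, 0.05,
    0.34, 0.33, 0.33),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cluster", 1:3))
)

# Hard assignment: each gene goes to the cluster with its highest
# membership (the kind of information held in cl$cluster).
hard <- apply(membership, 1, which.max)

# Cluster sizes under the hard assignment (comparable to cl$size).
sizes <- tabulate(hard, nbins = ncol(membership))

# Soft alternative: keep a gene only if its highest membership reaches a
# minimum value (assumed threshold: 0.5).
min.mem <- 0.5
keep <- apply(membership, 1, max) >= min.mem
core <- split(names(hard)[keep], hard[keep])
core
```

Here gene2 and gene4 are assigned to cluster 1 by the hard assignment although their memberships are spread over all clusters; with the threshold they are left out. With Mfuzz itself, the corresponding call would be along the lines of acore(yeastF, cl, min.acore=0.5).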
Q: I want to highlight the cluster centre in the plots. How can I do this?
A: From Mfuzz version 2.29 on, the function mfuzz.plot2 includes the arguments centre,
centre.col and centre.lwd, which can be used to plot the centre of each cluster with a specific colour and line width.
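A call using these arguments might look like the sketch below. The preparation of yeastF follows the yeast example data shipped with Mfuzz and uses the default settings of filter.NA and fill.NA; the colour, the line width, the panel layout and the output file are arbitrary choices. The block is skipped if Mfuzz is not installed.

```r
plotted <- FALSE

if (requireNamespace("Mfuzz", quietly = TRUE)) {
  library(Mfuzz)

  # Example data from the package, preprocessed with default settings.
  data(yeast)
  yeastF <- filter.NA(yeast)
  yeastF <- fill.NA(yeastF)
  yeastF <- standardise(yeastF)
  cl <- mfuzz(yeastF, c = 20, m = 1.25)

  # Write the plot to a temporary PDF file; x11 = FALSE keeps
  # mfuzz.plot2 from opening its own window.
  out.file <- tempfile(fileext = ".pdf")
  pdf(out.file, width = 12, height = 10)
  mfuzz.plot2(yeastF, cl = cl, mfrow = c(4, 5), x11 = FALSE,
              centre = TRUE,          # draw the cluster centres
              centre.col = "black",   # colour of the centre line
              centre.lwd = 3)         # line width of the centre line
  dev.off()
  plotted <- TRUE
}
```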
Q: I would like to display the number of genes in the cluster plots. Can I do this?
A: There is no functionality within the Mfuzz package for this task (yet). But you can plot the numbers (or any other information) directly in the plots, using the text function of the R graphics package.
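As an illustration of the text function, the following sketch draws a single cluster panel with base R graphics and writes the number of genes into its top-left corner. It does not use the Mfuzz plotting functions; the profiles are random made-up data, and the label position and output file are arbitrary choices.

```r
# Hypothetical standardised profiles of one cluster (genes x time points).
set.seed(1)
profiles <- matrix(rnorm(5 * 6), nrow = 5)
n.genes <- nrow(profiles)

out.file <- tempfile(fileext = ".pdf")
pdf(out.file)

# One line per gene.
matplot(t(profiles), type = "l", lty = 1, col = "grey40",
        xlab = "Time", ylab = "Expression changes", main = "Cluster 1")

# par("usr") gives the limits of the plotting region as c(x1, x2, y1, y2);
# adj shifts the label slightly inside the top-left corner.
usr <- par("usr")
text(x = usr[1], y = usr[4], labels = paste("n =", n.genes),
     adj = c(-0.2, 1.5))

dev.off()
```

Keep in mind that text adds to the plot that was drawn last, so in a multi-panel figure the label has to be added right after the corresponding panel has been plotted.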
Soft clustering was implemented here using the fuzzy c-means algorithm. A software package termed Mfuzz for soft clustering has been developed based on the open-source statistical language R. It uses the cmeans function of the e1071 package. The current version can be downloaded from the Bioconductor repository (see below). A graphical interface for the Mfuzz package is included, but it may not offer all the functionality available from the command line.
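To give an idea of how fuzzy c-means arrives at graded memberships, the sketch below computes the textbook membership update for fixed cluster centres. It is not the code of cmeans or Mfuzz; the function name fcm_membership and the toy data are hypothetical, and the sketch assumes that no data point coincides exactly with a centre.

```r
# Textbook fuzzy c-means membership update for fixed cluster centres.
# x:       data matrix (objects x features)
# centres: matrix of cluster centres (clusters x features)
# m:       fuzzifier (> 1); larger values give softer memberships
fcm_membership <- function(x, centres, m = 2) {
  # Squared Euclidean distance of every object to every centre.
  d2 <- sapply(seq_len(nrow(centres)), function(j) {
    rowSums(sweep(x, 2, centres[j, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(x))

  # Membership is proportional to d^(-2 / (m - 1)), normalised per object.
  w <- d2^(-1 / (m - 1))
  w / rowSums(w)
}

# Three points on a line and two centres (made-up values).
x <- rbind(a = c(0, 0), b = c(1, 0), c = c(4, 0))
centres <- rbind(c(0.5, 0), c(4.5, 0))

round(fcm_membership(x, centres, m = 1.25), 3)  # close to a hard partition
round(fcm_membership(x, centres, m = 3), 3)     # much softer memberships
```

The memberships of each object sum to one, and the fuzzifier m controls how strongly they are concentrated on the nearest centre.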