Gasch and Eisen analyzed the expression response of yeast genes under different environmental conditions,
corresponding to 93 microarray experiments. They applied fuzzy k-means clustering to obtain groups of genes
with similar patterns of expression, and to allow genes to belong to several clusters (i.e., clusters can overlap).
By relaxing a membership threshold (k), more genes are associated with each of the 91 resulting centroids.
In other words, as k is relaxed, the clusters become larger and contain genes whose expression patterns are less
similar. Exploring different k thresholds makes it possible to analyze the gene expression programs at different levels of
detail.
We analyzed the Gasch and Eisen clusters at several membership thresholds (k) under the
Harbison-Compiled and Harbison-Conditions datasets (see below). The corresponding ChIPCodis results can
be explored below. P-values were corrected with the permutation-based method
(Boyle et al., Bioinformatics 2004, 20:3710-3715).
Dataset                   k < 0.1    k < 0.08    k < 0.06    k < 0.04
Harbison Compiled
Harbison 14 Conditions
Input clusters
The example presented in the manuscript (cluster #39 at k < 0.06) can be found
here. The corresponding heatmap of Conditions+TF_combinations versus genes
is shown below.
ChIPCodis methodology basics
The methodology of ChIPCodis is based on that of GeneCodis (Carmona-Saez et al., Genome Biology 2007, 8(1):R3).
Basically, ChIPCodis finds TFs that frequently co-occur in an input
list of genes and then statistically validates these TF associations.
The basic workflow is summarized in this figure:
Association Rules Discovery (ARD):
This is a technique that aims to identify associations and correlations
in a database of transactions.
Traditionally, it has been applied to purchasing transactions with the objective
of finding concurrently bought products. More details of this technique can be found
here.
In our case, each gene represents a transaction, and the TFs bound to its promoter are the "bought"
products. Based on the apriori algorithm
(Carmona-Saez et al., BMC Bioinformatics 2006),
TF combinations that frequently co-occur in the input list of genes (the database of transactions)
are identified.
One feature of the algorithm is worth illustrating. The implemented ARD method
extracts the largest combinations of TFs: combinations that are subsets of
larger combinations are filtered out if they are associated with the same set of genes.
For example, let us say that TF1, TF2, TF3 and TF4 are four different transcription factors
that jointly coordinate the transcription of a given set of genes. If we only evaluated
pairs of TFs, we would obtain 6 different combinations (TF1-TF2, TF1-TF3, TF1-TF4, TF2-TF3,
TF2-TF4, and TF3-TF4) that all come from a single pattern. In contrast, ChIPCodis
extracts only one pattern: TF1-TF2-TF3-TF4 (more details).
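The subset filtering described above can be sketched in a few lines of Python. This is a brute-force illustration on a toy example, not the apriori implementation used by ChIPCodis. The gene names, the `min_support` parameter and the function name `closed_tf_combinations` are assumptions made for this example.

```python
from itertools import combinations

def closed_tf_combinations(gene_to_tfs, min_support=2):
    """Return {TF combination: supporting genes}, keeping only the largest
    combination for each distinct set of supporting genes."""
    all_tfs = sorted(set().union(*gene_to_tfs.values()))
    frequent = {}
    # Brute-force enumeration; fine for a toy example (apriori prunes this search).
    for size in range(1, len(all_tfs) + 1):
        for combo in combinations(all_tfs, size):
            genes = frozenset(
                g for g, tfs in gene_to_tfs.items() if set(combo) <= tfs
            )
            if len(genes) >= min_support:
                frequent[frozenset(combo)] = genes
    # Drop a combination if a strict superset is supported by the same genes.
    return {
        combo: genes
        for combo, genes in frequent.items()
        if not any(
            combo < other and genes == other_genes
            for other, other_genes in frequent.items()
        )
    }

# Hypothetical input: three genes bound by the same four TFs, one gene by TF1 only.
example = {
    "geneA": {"TF1", "TF2", "TF3", "TF4"},
    "geneB": {"TF1", "TF2", "TF3", "TF4"},
    "geneC": {"TF1", "TF2", "TF3", "TF4"},
    "geneD": {"TF1"},
}
result = closed_tf_combinations(example)
```

In this toy input, every pair and triple of TFs is supported by the same three genes as TF1-TF2-TF3-TF4, so only the four-TF pattern survives. TF1 alone is also kept, because it is associated with a different (larger) set of genes.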
Under the "Harbison-Conditions" dataset (see below) each of the 14
experimental conditions is treated as an independent transaction database, in order to avoid
the extraction of rules mixing distinct conditions.
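As a rough illustration of treating each condition as its own transaction database, the sketch below groups binding records by condition before any rule extraction. The record layout `(condition, gene, TF)`, the placeholder names and the function name `split_by_condition` are assumptions made for this example, not the ChIPCodis data format.

```python
from collections import defaultdict

def split_by_condition(bindings):
    """Build one transaction database (gene -> set of TFs) per condition."""
    databases = defaultdict(dict)
    for condition, gene, tf in bindings:
        databases[condition].setdefault(gene, set()).add(tf)
    return dict(databases)

# Hypothetical binding records: (condition, gene, TF).
records = [
    ("condition1", "geneA", "TF1"),
    ("condition1", "geneA", "TF2"),
    ("condition2", "geneA", "TF3"),
    ("condition2", "geneB", "TF3"),
]
per_condition = split_by_condition(records)
```

Because geneA's TF1 and TF3 bindings fall into different databases, no extracted rule can combine them.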
Statistical assessment
The associations found with the ARD technique need to be statistically assessed.
For this purpose, ChIPCodis implements two statistical tests, the first based on the
hypergeometric distribution, and the second on the Chi-square test.
More details can be found here.
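For orientation, a standard one-sided hypergeometric enrichment test can be written as follows. This is a minimal sketch of the general technique. The exact formulation used by ChIPCodis is not given here, and the function name, parameter names and example counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(n_ref, n_ref_hits, n_list, n_list_hits):
    """Probability of observing at least n_list_hits annotated genes when
    drawing n_list genes without replacement from n_ref reference genes,
    of which n_ref_hits carry the TF combination."""
    upper = min(n_ref_hits, n_list)
    total = comb(n_ref, n_list)
    tail = sum(
        comb(n_ref_hits, i) * comb(n_ref - n_ref_hits, n_list - i)
        for i in range(n_list_hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 6000 reference genes, 50 bound by the TF combination,
# an input list of 40 genes, 5 of which are bound by the combination.
p_value = hypergeom_pvalue(6000, 50, 40, 5)
```

Integer binomial coefficients keep the computation exact until the final division, which is convenient for small p-values.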
To correct the obtained p-values for multiple hypothesis testing, two
approaches are implemented in ChIPCodis: FDR and permutations.
The permutation (simulation-based) correction implements the approach described in Boyle et al.
(Bioinformatics 2004, 20:3710-3715). Briefly, a gene list of the same size as the input list is
generated by randomly selecting genes from the set of genes defined as the reference distribution.
The process of extracting frequent sets of annotations is repeated, and p-values for the annotations
and combinations of annotations generated from this random list are calculated using the same
statistical test. This process is repeated 1000 times, and the corrected p-value for each set
of k annotations is calculated as the fraction of permutations having any annotation set of the
same size k with a p-value as good as or better than the observed p-value. Note that here k denotes
the number of annotations in a set, not the membership threshold used for the Gasch and Eisen clusters.
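The permutation procedure can be sketched as below. This is an illustrative reading of the description above, not the ChIPCodis code. The `score_list` callable (which is assumed to return a dictionary mapping each annotation set to its p-value for a given gene list), the toy scorer and all names are hypothetical.

```python
import random

def permutation_corrected_pvalues(observed, reference_genes, list_size,
                                  score_list, n_permutations=1000, seed=0):
    """Corrected p-value = fraction of random gene lists in which some
    annotation set of the same size reaches a p-value <= the observed one."""
    rng = random.Random(seed)
    reference_genes = sorted(reference_genes)
    hits = {combo: 0 for combo in observed}
    for _ in range(n_permutations):
        random_list = rng.sample(reference_genes, list_size)
        # Best (smallest) p-value per annotation-set size in this random list.
        best_by_size = {}
        for combo, p in score_list(random_list).items():
            size = len(combo)
            best_by_size[size] = min(p, best_by_size.get(size, float("inf")))
        for combo, p_obs in observed.items():
            if best_by_size.get(len(combo), float("inf")) <= p_obs:
                hits[combo] += 1
    return {combo: hits[combo] / n_permutations for combo in observed}

# Toy usage: a hypothetical scorer that always reports one single-TF set.
def toy_scorer(gene_list):
    return {frozenset({"TF1"}): 0.5}

observed = {frozenset({"TF1"}): 0.01, frozenset({"TF2"}): 0.5}
corrected = permutation_corrected_pvalues(
    observed,
    reference_genes=[f"gene{i}" for i in range(20)],
    list_size=5,
    score_list=toy_scorer,
    n_permutations=100,
)
```

In a realistic setting, `score_list` would rerun the frequent-set extraction and the statistical test on each random list, which is why this correction is much more expensive than FDR.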
ChIPCodis datasets
There are two groups of datasets: Harbison et al. and MacIsaac et al.
Harbison et al.: This group contains the data of
Harbison et al. (2004) Nature 431, 99-104.
Analyses can be based on the Growth under rich medium dataset, which is the only condition for which
the whole repertoire of yeast TFs was tested, or on different environmental conditions (Conditions dataset).
The Conditions dataset contains ChIP-on-chip experiments for 14 conditions
(see details), in each of which a subset of yeast TFs was
tested. It provides a dynamic view of which genes a TF binds and under which
conditions.
In addition, the Harbison et al. group of datasets includes the
Compiled dataset, which is a summary of all the condition-dependent data. In the Compiled dataset, a TF is said to
be bound to a gene if it was bound in at least one of the 14 experimental conditions. Hence, the Compiled dataset embodies a static view
of the binding of TFs to genes.
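Read this way, the Compiled dataset is a union of the per-condition bindings. The sketch below shows that operation on placeholder data. The nested-dictionary layout and the function name `compile_conditions` are assumptions made for this example.

```python
def compile_conditions(per_condition):
    """Union TF-gene bindings over all conditions: a TF counts as bound to a
    gene if it is bound in at least one condition."""
    compiled = {}
    for gene_to_tfs in per_condition.values():
        for gene, tfs in gene_to_tfs.items():
            compiled.setdefault(gene, set()).update(tfs)
    return compiled

# Hypothetical per-condition bindings: condition -> gene -> set of TFs.
bindings_by_condition = {
    "condition1": {"geneA": {"TF1"}},
    "condition2": {"geneA": {"TF2"}, "geneB": {"TF2"}},
}
compiled = compile_conditions(bindings_by_condition)
```

The compiled view pairs TF1 and TF2 on geneA even though they were bound in different conditions, which is the sense in which it is static.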
MacIsaac et al.: Based on the data of Harbison et al., MacIsaac et al.
(BMC Bioinformatics 2006, 7, 113)
searched for sequence motifs supporting the ChIP-on-chip positives in the corresponding gene promoters.
Hence, this dataset is more reliable, although smaller, than that of Harbison et al.
As implemented in ChIPCodis, the MacIsaac et al dataset
represents a static view of the binding of TFs, not considering the condition dependence
or dynamic nature of TF regulation.
If a TF truly binds a motif in the promoter of a gene,
it is likely to bind the same promoter in closely related species.
MacIsaac et al. exploited this evolutionary criterion, providing different datasets for which
no conservation, weak conservation or strong conservation of motifs across
closely related species is required.