TY - JOUR
T1 - Using mutual information to discover temporal patterns in gene expression data
AU - Chumakov, Sergei
AU - Ballesteros, Efren
AU - Rodriguez Sanchez, Jorge E.
AU - Chavez, Arturo
AU - Zhang, Meizhuo
AU - Pettit, B. Montgomery
AU - Fofanov, Yuriy
N1 - Copyright: Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2006
Y1 - 2006
N2 - Finding relations among gene expressions involves the definition of the similarity between experimental data. The simplest similarity measure is the correlation coefficient. It can identify only linear dependences; moreover, it is sensitive to experimental errors. An alternative measure, the Shannon mutual information (MI), is free from the above-mentioned weaknesses. However, calculating MI for continuous variables from a finite number of experimental points, N, involves an ambiguity that arises when one divides the range of values of the continuous variable into boxes: the distribution of experimental points among the boxes (and, therefore, MI) depends on the box size. An algorithm for calculating MI for continuous variables is proposed. We find the optimum box sizes for a given N from the condition of minimum entropy variation with respect to changes in the box sizes. We have applied this technique to the gene expression dataset from Stanford, containing microarray data at 18 time points from yeast Saccharomyces cerevisiae cultures (Spellman et al.,). We calculated MI for all pairs of time points. The MI analysis allowed us to identify time patterns related to different biological processes in the cell.
AB - Finding relations among gene expressions involves the definition of the similarity between experimental data. The simplest similarity measure is the correlation coefficient. It can identify only linear dependences; moreover, it is sensitive to experimental errors. An alternative measure, the Shannon mutual information (MI), is free from the above-mentioned weaknesses. However, calculating MI for continuous variables from a finite number of experimental points, N, involves an ambiguity that arises when one divides the range of values of the continuous variable into boxes: the distribution of experimental points among the boxes (and, therefore, MI) depends on the box size. An algorithm for calculating MI for continuous variables is proposed. We find the optimum box sizes for a given N from the condition of minimum entropy variation with respect to changes in the box sizes. We have applied this technique to the gene expression dataset from Stanford, containing microarray data at 18 time points from yeast Saccharomyces cerevisiae cultures (Spellman et al.,). We calculated MI for all pairs of time points. The MI analysis allowed us to identify time patterns related to different biological processes in the cell.
KW - Gene expression
KW - Mutual information
UR - http://www.scopus.com/inward/record.url?scp=33846527894&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33846527894&partnerID=8YFLogxK
U2 - 10.1063/1.2356392
DO - 10.1063/1.2356392
M3 - Conference article
AN - SCOPUS:33846527894
SN - 0094-243X
VL - 854
SP - 25
EP - 30
JO - AIP Conference Proceedings
JF - AIP Conference Proceedings
T2 - 9th Mexican Symposium on Medical Physics
Y2 - 18 March 2006 through 23 March 2006
ER -