TY - JOUR
T1 - Statistical properties of short subsequences in microbial genomes and their link to pathogen identification and evolution
AU - Zhang, Meizhuo
AU - Putonti, Catherine
AU - Chumakov, Sergei
AU - Gupta, Adhish
AU - Fox, George E.
AU - Graur, Dan
AU - Fofanov, Yuriy
PY - 2006
Y1 - 2006
N2 - Numerous sequencing projects have unveiled partial and full microbial genomes. The data produced far exceeds one person's analytical capabilities and thus requires the power of computing. A significant amount of work has focused on the diversity of statistical characteristics along microbial genomic sequences, e.g. codon bias, G+C content, and the frequencies of short subsequences (n-mers). Based upon the results of these studies, two observations were made: (1) there exists a correlation between regions whose statistical properties (e.g. codon bias) differ from the rest of the genomic sequence and evolutionarily significant regions, e.g. regions of horizontal gene transfer; and (2) because no two microbial genomes look statistically identical, statistical properties can be used to distinguish between genomic sequences. Recently, we conducted extensive analysis on the presence/absence of n-mers for many microbial genomes as well as several viral and eukaryotic genomes. This analysis revealed that the presence of n-mers in all genomes considered (in the range of n for which the condition M << 4^n holds, where M is the genome length) can be treated as a nearly random and independent process. Thus we hypothesize that one may use relatively small sets of randomly picked n-mers for differentiating between different microorganisms. We also analyzed the frequency of appearance of all 8- to 12-mers present in each of the 200+ publicly available microbial genomes. For nearly all of the genomes under consideration, we observed that some n-mers are present much more frequently than expected: from 50 to over a thousand copies. Upon closer inspection of these sequences, we found several cases in which an overrepresented n-mer exhibits a bias towards being located in either coding or non-coding regions.
Although the evolutionary reason for the conservation of such sequences remains unclear, in some cases it is plausible that sequences with a clear bias for non-coding regions owe that bias to their role in the DNA uptake/recombination process, to being parts of insertion sequences, or to serving as transcription factor recognition sites. Our analysis of the frequency of appearance of 6-mers for each microbial genome revealed regions that display unusual statistical properties with respect to their own genome. After inspection of the genes contained within these regions, we believe that such regions are likely to have been acquired into the genomic sequence through horizontal gene transfer.
AB - Numerous sequencing projects have unveiled partial and full microbial genomes. The data produced far exceeds one person's analytical capabilities and thus requires the power of computing. A significant amount of work has focused on the diversity of statistical characteristics along microbial genomic sequences, e.g. codon bias, G+C content, and the frequencies of short subsequences (n-mers). Based upon the results of these studies, two observations were made: (1) there exists a correlation between regions whose statistical properties (e.g. codon bias) differ from the rest of the genomic sequence and evolutionarily significant regions, e.g. regions of horizontal gene transfer; and (2) because no two microbial genomes look statistically identical, statistical properties can be used to distinguish between genomic sequences. Recently, we conducted extensive analysis on the presence/absence of n-mers for many microbial genomes as well as several viral and eukaryotic genomes. This analysis revealed that the presence of n-mers in all genomes considered (in the range of n for which the condition M << 4^n holds, where M is the genome length) can be treated as a nearly random and independent process. Thus we hypothesize that one may use relatively small sets of randomly picked n-mers for differentiating between different microorganisms. We also analyzed the frequency of appearance of all 8- to 12-mers present in each of the 200+ publicly available microbial genomes. For nearly all of the genomes under consideration, we observed that some n-mers are present much more frequently than expected: from 50 to over a thousand copies. Upon closer inspection of these sequences, we found several cases in which an overrepresented n-mer exhibits a bias towards being located in either coding or non-coding regions.
Although the evolutionary reason for the conservation of such sequences remains unclear, in some cases it is plausible that sequences with a clear bias for non-coding regions owe that bias to their role in the DNA uptake/recombination process, to being parts of insertion sequences, or to serving as transcription factor recognition sites. Our analysis of the frequency of appearance of 6-mers for each microbial genome revealed regions that display unusual statistical properties with respect to their own genome. After inspection of the genes contained within these regions, we believe that such regions are likely to have been acquired into the genomic sequence through horizontal gene transfer.
KW - Pathogen identification
KW - Short subsequences
KW - Statistical properties
UR - http://www.scopus.com/inward/record.url?scp=33846531154&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33846531154&partnerID=8YFLogxK
U2 - 10.1063/1.2356390
DO - 10.1063/1.2356390
M3 - Conference article
AN - SCOPUS:33846531154
SN - 0094-243X
VL - 854
SP - 13
EP - 18
JO - AIP Conference Proceedings
JF - AIP Conference Proceedings
T2 - 9th Mexican Symposium on Medical Physics
Y2 - 18 March 2006 through 23 March 2006
ER -