Deciphering the Subtype Differentiation History of SARS-CoV-2 Based on a New Breadth-First Searching Optimized Alignment Method Over a Global Data Set of 24,768 Sequences

Qianyu Lin, Yunchuanxiang Huang, Ziyi Jiang, Feng Wu, Lan Ma

Research output: Contribution to journal > Article

Abstract

SARS-CoV-2 has caused a worldwide pandemic. Existing research on coronavirus mutations is based on small data sets, and multiple sequence alignment of a global-scale data set has yet to be conducted. Statistical analysis of integral mutations and global spread is necessary and could help improve primer design for nucleic acid diagnosis and vaccine development. Here, we optimized multiple sequence alignment using a conserved sequence search algorithm to align 24,768 sequences from the GISAID data set. A phylogenetic tree was constructed using the maximum likelihood (ML) method. Coronavirus subtypes were analyzed via t-SNE clustering, and haplotype network analysis was combined with t-SNE clustering to analyze the origin and spread of the coronavirus. Overall, we identified 33 sense, 17 nonsense, 79 amino acid loss, and 4 amino acid insertion mutations in full-length open reading frames. Phylogenetic trees were successfully constructed and samples clustered into subtypes. The COVID-19 pandemic differed among countries and continents. Samples from the United States and western Europe were more diverse, and those from China and Asia mainly contained specific subtypes. Clades G/GH/GR are more likely to be the origin clades of SARS-CoV-2 compared with clades S/L/V. Conserved sequence searches can be used to segment long sequences, making large-scale multiple sequence alignment possible and facilitating more comprehensive gene mutation analysis. Mutation analysis of SARS-CoV-2 can inform primer design for nucleic acid diagnosis to improve virus detection efficiency. In addition, research into the characteristics of viral spread and relationships among geographic regions can help formulate health policies and limit the rise in imported cases.
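The abstract notes that conserved sequence searches can segment long sequences so that large-scale alignment becomes tractable. The Python sketch below illustrates that general idea only and is not the authors' breadth-first search algorithm: it assumes exact k-mer matches, picks anchors greedily, and uses hypothetical function names (`unique_kmers`, `conserved_anchors`, `split_at_anchors`) and toy sequences invented for illustration.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Return {kmer: start} for k-mers that occur exactly once in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def conserved_anchors(seqs, k):
    """Greedily pick k-mers that are unique in every sequence and collinear.

    Assumption: exact matches only; a real pipeline would tolerate mismatches.
    Returns a list of anchors, each a list of start positions (one per sequence).
    """
    tables = [unique_kmers(s, k) for s in seqs]
    shared = set(tables[0]).intersection(*tables[1:])
    anchors = []
    last_end = [0] * len(seqs)
    # Walk candidates in the order they appear along the first sequence.
    for kmer in sorted(shared, key=tables[0].get):
        pos = [t[kmer] for t in tables]
        # Keep the anchor only if it starts after the previous one ends everywhere.
        if all(p >= e for p, e in zip(pos, last_end)):
            anchors.append(pos)
            last_end = [p + k for p in pos]
    return anchors


def split_at_anchors(seqs, anchors, k):
    """Cut each sequence into the variable segments lying between anchors."""
    segments = []
    for idx, seq in enumerate(seqs):
        parts, start = [], 0
        for pos in anchors:
            parts.append(seq[start:pos[idx]])
            start = pos[idx] + k
        parts.append(seq[start:])
        segments.append(parts)
    return segments


if __name__ == "__main__":
    toy = ["CCGATTACATT", "CGATTACATTT", "CCCGATTACAT"]  # invented toy data
    found = conserved_anchors(toy, k=7)
    print(found)
    print(split_at_anchors(toy, found, k=7))
```

Each column of segments between two anchors can then be aligned independently with any standard aligner and the results concatenated, which is what keeps the per-alignment problem size small.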

Original language: English (US)
Article number: 591833
Journal: Frontiers in Genetics
Volume: 11
State: Published - Jan 11 2021

Keywords

  • SARS-CoV-2
  • haplotype network analysis
  • multiple sequence alignment
  • phylogenetic tree
  • t-SNE

ASJC Scopus subject areas

  • Molecular Medicine
  • Genetics
  • Genetics (clinical)
