Nucleotide amino acid k-mer vector: an alignment-free method for comparing genomic sequences

Bao, Xiaona; He, Lily; Cui, Jingan; Yau, Stephen S.-T.

doi:10.4310/CIS.2022.v22.n3.a2

Communications in Information and Systems

Volume 22 (2022)

Number 3

Special issue on bioinformatics and biophysics in honor of Professor Michael Waterman on his 80th birthday

Guest Editors: Fengzhu Sun (University of Southern California), Guowei Wei (Michigan State University), Stephen S.-T. Yau (Tsinghua University), and Shan Zhao (University of Alabama)

Pages: 317 – 337

DOI: https://dx.doi.org/10.4310/CIS.2022.v22.n3.a2

Authors

Xiaona Bao (School of Science, Beijing University of Civil Engineering and Architecture, Beijing, China)

Lily He (School of Science, Beijing University of Civil Engineering and Architecture, Beijing, China)

Jingan Cui (School of Science, Beijing University of Civil Engineering and Architecture, Beijing, China)

Stephen S.-T. Yau (Department of Mathematical Sciences, Tsinghua University, Beijing, China)

Abstract

Evolutionary analysis of genomic data is an important problem in bioinformatics, and a great deal of DNA data has become available. In evolutionary analysis, protein sequences are more meaningful than DNA sequences, and alignment-free methods based on k‑mers are widely used. However, the dimension of a k‑mer vector built from protein sequences is very high. This paper proposes a new Nucleotide Amino Acid K‑mer Vector (NAAKV) technique, which converts a DNA sequence into a pseudo amino acid sequence (PAAS). This transformation does not require locating the coding regions of the gene sequence, yet it still reflects nucleotide changes. Moreover, the amino acids in the PAAS are strongly correlated, so the number of k‑mer types is much lower than for a protein sequence, and the dimension of the vector is greatly reduced. To test NAAKV, we carry out phylogenetic analysis of several viruses and bacteria. The traditional k‑mer method and the alignment-based MUSCLE method are used for comparison on each dataset. The results suggest that NAAKV is accurate and time-efficient for phylogenetic analysis and genome classification.
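
The abstract does not spell out how a DNA sequence is converted into a PAAS, so the following Python sketch is only an illustration under a stated assumption: every overlapping nucleotide triplet (window step 1) is translated with the standard genetic code. That conversion rule and the function names (`pseudo_amino_acid_sequence`, `kmer_vector`, `reachable_kmers`) are hypothetical and are not taken from the paper. Under this assumption, neighbouring pseudo amino acids share two nucleotides, which is one way the correlation and the reduced number of k‑mer types described above could arise.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def pseudo_amino_acid_sequence(dna):
    """ASSUMPTION (not from the paper): translate every overlapping triplet.

    No coding region is located; the window simply slides one nucleotide
    at a time. Characters other than A, C, G, T raise a KeyError.
    """
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(len(dna) - 2))


def kmer_vector(seq, k):
    """Relative frequency of each k-mer that occurs in seq (sparse dict)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def reachable_kmers(k):
    """All k-mers that can occur in a sequence built by the rule above.

    A k-mer of overlapping triplets spans k + 2 nucleotides, so at most
    4 ** (k + 2) distinct k-mers are possible.
    """
    return {
        pseudo_amino_acid_sequence("".join(window))
        for window in product(BASES, repeat=k + 2)
    }


if __name__ == "__main__":
    paas = pseudo_amino_acid_sequence("ATGGCC")
    print(paas)                     # pseudo amino acid sequence
    print(kmer_vector(paas, 2))     # sparse 2-mer frequency vector
    # Compare the reachable 2-mers with all 21 ** 2 symbol pairs
    # (20 amino acids plus the stop symbol).
    print(len(reachable_kmers(2)), "of", 21 ** 2)
```

Storing the vector as a sparse dictionary keeps the sketch short; a fixed-length vector for distance computation could be obtained by indexing the entries over a sorted list of reachable k‑mers.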

Keywords

alignment-free, k-mer, NAAKV, evolutionary analysis, DNA sequence

2010 Mathematics Subject Classification

92D20

Xiaona Bao and Lily He contributed equally to this work and should be considered co-first authors.

This work is supported by the National Natural Science Foundation of China (NSFC) grants 12171275 and 11871093, the Tsinghua University Spring Breeze Fund (2020Z99CFY044), the Tsinghua University start-up fund, the Tsinghua University Education Foundation fund (042202008), the Promotion plan for young teachers’ scientific research ability of Beijing University of Civil Engineering and Architecture (X21026), and the R&D Program of Beijing Municipal Education Commission (KM202210016014).

Received 22 December 2021

Published 22 July 2022