Communications in Information and Systems

Volume 19 (2019)

Number 1

Assessment of kmer degeneration method for complicated genomes

Pages: 17 – 35

DOI: https://dx.doi.org/10.4310/CIS.2019.v19.n1.a2

Authors

Shuai Liu (Institute of Physical Science and Information Technology, Anhui University, Hefei, China; and Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China)

Shaojun Pei (Department of Mathematical Sciences, Tsinghua University, Beijing, China)

Stephen S.-T. Yau (Department of Mathematical Sciences, Tsinghua University, Beijing, China)

Qi Wu (Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China; and the State Key Laboratory of Mycology (SKLM), Institute of Microbiology, Chinese Academy of Sciences, Beijing, China)

Abstract

The kmer frequency is widely used in alignment-free sequence analysis methods. To better describe the overall statistical features of a complicated sequence, such as those of mammals, a longer length of kmer is required. However, the long length of kmer will cause exponential increasing of the types of kmer ($4^K$ types of kmer with length $K$), which results in an extremely intensive computational burden and makes k-mer method impractical. In this work, we propose a novel method of kmer degeneration (KD) to balance the kmer length and kmer type. The method only considers $N$ positions of nucleotides out of $K$ positions of $K$-mers and degenerates all other $(K-N)$ positions. Then the $K$-mers can be substituted by $(K)N$-mer. Therefore, the kmer types were reduced from $4^K$ to $4^N$ and remain the linkages among the nucleotides within the $K$-mer. We first show how $N$ can be determined for a given $K$. Then we assess which types of combinations of the $N$ positions from the $K$ positions are better for describing the sequence. Finally, to illustrate the utility of the method, we construct the phylogenetics tree of Carnivora with 16 genomes using our method, which is better than non-degenerated kmer with the same $N$ value.

This work is supported by National Natural Science Foundation of China grant (#91746119) and the Strategic Priority Program of the Chinese Academy of Sciences (XBD31020000).

Published 18 April 2019