Communications in Information and Systems
Volume 20 (2020)
Similarity analysis of protein sequences using a reduced $k$-mer amino acid model
Pages: 45 – 60
Based on the properties of their side chains, the 20 natural amino acids are divided into four groups, so that an original protein sequence can be represented by a reduced amino acid sequence containing only four residue types. For this reduced representation, we define the $k$-mer natural vector, a feature vector that characterizes the frequencies and positional information of the $k$-mers appearing in the reduced amino acid sequence, and use it for similarity analysis of protein sequences. The analysis can be performed easily and quickly, without requiring evolutionary models or human intervention. To demonstrate the utility of the new method, we apply it to real protein datasets; the results show that the approach describes the similarities of protein sequences accurately and is computationally more efficient than multiple sequence alignment. The reduced $k$-mer amino acid representation model is therefore a powerful tool for analyzing and annotating protein sequences.
similarity analysis, protein sequence, reduced amino acid model, $k$-mer natural vector, multiple sequence alignment
This work is partially supported by Scientific Research Funding of Suihua University (K1501009, 2017-XGYYWF-017), and by Natural Scientific Research Funding of Heilongjiang (LH2019A031).
Received 5 September 2019
Published 17 April 2020