A graph-theoretical approach to DNA similarity analysis

Nguyen, Dong Quan Ngoc; Xing, Lin; Le, Phuong Dong Tan; Lin, Lizhen

doi:10.4310/CIS.2022.v22.n3.a5

Contents Online

Communications in Information and Systems

Volume 22 (2022)

Number 3

Special issue on bioinformatics and biophysics in honor of professor Michael Waterman on his 80th birthday

Guest Editors: Fengzhu Sun (University of Southern California), Guowei Wei (Michigan State University), Stephen S.-T. Yau (Tsinghua University), and Shan Zhao (University of Alabama)

A graph-theoretical approach to DNA similarity analysis

Pages: 383 – 400

DOI: https://dx.doi.org/10.4310/CIS.2022.v22.n3.a5

Authors

Dong Quan Ngoc Nguyen (Department of Applied and Computational, Mathematics and Statistics, University of Notre Dame, Indiana, U.S.A.)

Lin Xing (Department of Applied and Computational, Mathematics and Statistics, University of Notre Dame, Indiana, U.S.A.)

Phuong Dong Tan Le (Department of Applied Mathematics, University of Waterloo, Ontario, Canada)

Lizhen Lin (Department of Applied and Computational, Mathematics and Statistics, University of Notre Dame, Indiana, U.S.A.)

Abstract

One of the very active research areas in bioinformatics is DNA similarity analysis. There are several approaches using alignment-based or alignment-free methods to analyze similarities/dissimilarities between DNA sequences. In this work, we introduce a novel representation of DNA sequences, using $n$‑ary Cartesian products of graphs for arbitrary positive integers $n$. Each of the component graphs in the representing Cartesian product of each DNA sequence contain combinatorial information of certain tuples of nucleotides appearing in the DNA sequence. We further introduce a metric space structure to the set of all Cartesian products of graphs that represent a given collection of DNA sequences in order to be able to compare different Cartesian products of graphs, which in turn signifies similarities/dissimilarities between DNA sequences. We test our proposed method on several datasets including Human Papillomavirus, Human rhinovirus, Influenza A virus, and Mammals. We compare our method to other methods in literature, which indicates that our analysis results are comparable in terms of time complexity and high accuracy, and in one dataset, our method performs the best in comparison with other methods.

Keywords

DNA similarity, graph representations, metric space

Full Text (PDF format)

Lizhen Lin and Dong Quan Ngoc Nguyen acknowledge the generous support of NSF grant DMS-2113642.

Received 21 February 2022

Published 22 July 2022