Analysis methods and algorithm design for biological sequence problems based on generalized k-mer vectors
K-mers can be used to describe biological sequences, and the k-mer distribution
is a tool for solving sequence analysis problems in bioinformatics. A k-mer vector can serve as
a representation of the k-mer distribution of a biological sequence. Problems such as
similarity calculation or sequence assembly can then be described in the k-mer vector space. This
formulation helps to identify new features of long-standing sequence-based problems in
bioinformatics and to develop new algorithms using the concepts and methods of linear space
theory. In this study, we define the k-mer vector space for generalized biological sequences and
explain the meaning of the corresponding vector operations in the biological context. We present
the vector/matrix form of several widely encountered sequence-based problems, including read
quantification, sequence assembly, and pattern detection, and discuss the advantages and
disadvantages of this formulation. We also implement a tool for the sequence assembly problem
based on the k-mer vector method, which demonstrates the practicability and convenience of this
algorithm design strategy.
Keywords:
vector space,
biological sequence,
k-mer,
algorithm design,
analysis method
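
As a rough illustration of the k-mer vector idea described in the abstract, the following minimal sketch builds a k-mer count vector for a sequence and compares two sequences in that vector space. It is not the authors' implementation: the function names `kmer_vector` and `cosine_similarity` are hypothetical, the DNA alphabet `ACGT` is an assumption, and cosine similarity is only one possible choice of similarity measure.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k, alphabet="ACGT"):
    """Return the k-mer count vector of seq.

    Coordinates follow the order produced by itertools.product over the
    alphabet (lexicographic for a sorted alphabet), so the vector has
    len(alphabet) ** k entries. The DNA alphabet is an assumption here.
    """
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Missing k-mers get a count of 0 because Counter returns 0 by default.
    return [counts["".join(word)] for word in product(alphabet, repeat=k)]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    a = kmer_vector("ACGTACGT", 2)
    b = kmer_vector("ACGTTTTT", 2)
    print(len(a), sum(a))
    print(cosine_similarity(a, b))
```

Because each sequence becomes an ordinary vector of counts, operations such as addition (pooling the k-mers of several reads) or inner products (comparing two sequences) can be expressed with standard linear algebra.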