Read/Search this Article
Abstract
ゲノム配列から,ある機能発現に関わる遺伝子を発見したいという要望がある.このような遺伝子は対象となるゲノム配列集合においてエラーを含んだ形で類出する配列パターンの形をとることがしばしばある.本研究では,エラーを含む文字列集合から,ある長さの頻出部分文字列パターンを高速に全列挙するアルゴリズムを提案する.エラーの影響から通常のパターン列挙と異なり,入力文字列には現れないパターンも列挙の対象となる.提案手法ではパターン頻出性の必要条件を利用して最小限の候補パターンをハッシュ格納することにより,高速な全列挙を実現する.
We propose a practically fast algorithm that enumerates all m-length substring patterns appearing in at least θ sequences among a given set of string sequences, where at most k errors are allowed for each appearance. The problem of enumerating such substring patterns is derived from the genome science, where frequent substring patterns allowing errors are candidates of a gene related to a certain function. From this context, some pattern should be enumerated even if it does not appear in any sequence, because it may match a sufficient number of sequences by allowing errors. In order to prevent overlooking such potential patterns, we propose a hash-based enumeration algorithm. The algorithm stores not only a substring pattern appearing in a sequence but also its neighboring patterns. By using several techniques/conditions to exclude non-frequent patterns, our algorithm achieves an efficient enumeration of frequent substring patterns allowing errors.
Journal
- IPSJ SIG technical reports [List of Volumes]
-
IPSJ SIG technical reports 2008(58), 63-66, 2008-06-19 [Table of Contents]
Information Processing Society of Japan (IPSJ)