Average-case linear-time similar substring searching by the q-gram distance
抄録
In this paper we consider the problem of similar substring searching in the q-gram distance. The q-gram distance d(q)(x, y) is a similarity measure between two strings x and y defined by the number of different q-grams between them. The distance can be used instead of the edit distance due to its lower computation cost, O(|x| + |Y|) vs. O(|x||Y|). and its good approximation for the edit distance. However, if this distance is applied to the problem of finding all similar strings, in a long text t, to a given pattern p, the total computation cost is sometimes not acceptable. Ukkonen already proposed two fast algorithms: one with an array and the other with a tree. When "similar" means k or less in dq, their time complexities are O(|t|k + |P|) and O(|t| log k + |p|). respectively. In this paper, we propose two algorithms of average-case complexity O(|t| + |p|). although their worst-case complexities are still O(|t|k + |P|) and O(|t| log k + |p|). respectively. The linearity of the average-case complexity is analyzed under the assumption of random sampling of t and the condition that q is larger than a threshold. The algorithms exploit the fact that similar substrings in t are often found at very close positions if the beginning positions of the substrings are close. In the second proposed algorithm, we adopted a doubly-linked list supported by an array and a search tree to search for a list element in O(log k) time. Experimental results support their theoretical average-case complexities. (C) 2014 Elsevier B.V. All rights reserved.
収録刊行物
-
- Theoretical Computer Science
-
Theoretical Computer Science 530 23-41, 2014-04-17
Elsevier
- Tweet
詳細情報 詳細情報について
-
- CRID
- 1050282813994818176
-
- NII論文ID
- 120005593436
-
- HANDLE
- 2115/58429
-
- ISSN
- 03043975
-
- 本文言語コード
- en
-
- 資料種別
- journal article
-
- データソース種別
-
- IRDB
- Crossref
- CiNii Articles
- KAKEN