Approximate Frequent Pattern Discovery in Compressed Space

Access this Article

Author(s)

Abstract

<p>A grammar compression is a restricted context-free grammar (CFG) that derives a single string deterministically. The goal of a grammar compression algorithm is to develop a smaller CFG by finding and removing duplicate patterns, which is simply a frequent pattern discovery process. Any frequent pattern can be obtained in linear time; however, a huge working space is required for longer patterns, and the entire string must be preloaded into memory. We propose an <i>online</i> algorithm to address this problem approximately within compressed space. For an input sequence of symbols, <i>a</i><sub>1</sub>,<i>a</i><sub>2</sub>,..., let <i>G<sub>i</sub></i> be a grammar compression for the string <i>a</i><sub>1</sub><i>a</i><sub>2</sub>…<i>a<sub>i</sub></i>. In this study, an online algorithm is considered one that can compute <i>G</i><sub><i>i</i>+1</sub> from (<i>G<sub>i</sub></i>,<i>a</i><sub><i>i</i>+1</sub>) without explicitly decompressing <i>G<sub>i</sub></i>. Here, let <i>G</i> be a grammar compression for string <i>S</i>. We say that variable <i>X</i> approximates a substring <i>P</i> of <i>S</i> within approximation ratio <i>δ</i> iff for any interval [<i>i</i>,<i>j</i>] with <i>P</i>=<i>S</i>[<i>i</i>,<i>j</i>], the parse tree of <i>G</i> has a node labeled with <i>X</i> that derives <i>S</i>[<i>l</i>,<i>r</i>] for a subinterval [<i>l</i>,<i>r</i>] of [<i>i</i>,<i>j</i>] satisfying |[<i>l</i>,<i>r</i>]|≥<i>δ</i>|[<i>i</i>,<i>j</i>]|. Then, <i>G</i> solves the frequent pattern discovery problem approximately within <i>δ</i> iff for any frequent pattern <i>P</i> of <i>S</i>, there exists a variable that approximates <i>P</i> within <i>δ</i>. Here, <i>δ</i> is called the approximation ratio of <i>G</i> for <i>S</i>. Previously, the best approximation ratio obtained by a polynomial time algorithm was Ω(1/lg<sup>2</sup>|<i>P</i>|). The main contribution of this work is to present a new lower bound Ω(1/<<sup>*</sup>|<i>S</i>|lg|<i>P</i>|) that is smaller than the previous bound when lg<sup>*</sup>|<i>S</i>|<lg|<i>P</i>|. Experimental results demonstrate that the proposed algorithm extracts sufficiently long frequent patterns and significantly reduces memory consumption compared to the offline algorithm in the previous work.</p>

Journal

  • IEICE Transactions on Information and Systems

    IEICE Transactions on Information and Systems E101.D(3), 593-601, 2018

    The Institute of Electronics, Information and Communication Engineers

Codes

Page Top