Scalable Detection of Frequent Substrings by Grammar-Based Compression

Access this Article

Search this Article

Author(s)

Abstract

A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string deterministically. In this framework of <i>grammar-based compression</i>, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on <i>edit-sensitive parsing</i> (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.

Journal

  • IEICE Transactions on Information and Systems

    IEICE Transactions on Information and Systems 96(3), 457-464, 2013-03-01

    The Institute of Electronics, Information and Communication Engineers

References:  18

Cited by:  1

  • An Encouragement of Compression  [in Japanese]

    SAKAMOTO Hiroshi

    The Journal of the Institute of Electronics, Information, and Communication Engineers 96(7), 501-506, 2013-07-01

    References (14)

Codes

  • NII Article ID (NAID)
    10031167431
  • NII NACSIS-CAT ID (NCID)
    AA10826272
  • Text Lang
    ENG
  • Article Type
    Journal Article
  • ISSN
    09168532
  • Data Source
    CJP  CJPref  J-STAGE 
Page Top