Automatic Extraction of Auxiliary Phrases from a Corpus

Bibliographic Information

Other Title
  • コーパスからの付属語的表現の自動抽出

Abstract

In this paper, we describe a method for automatically extracting Japanese auxiliary phrases from a corpus. An auxiliary phrase is a kind of idiomatic expression that functions like an auxiliary verb or a postpositional particle; typical examples are "にかんして" and "なければならない". It is generally advantageous to treat an auxiliary phrase as a single word, so when building a dictionary, auxiliary phrases need to be collected in the same way as ordinary words. Collecting them is difficult, however, because the boundary between auxiliary phrases and ordinary phrases is unclear; on close examination, the distinction rests on the subjective judgment of the system developer. As a result, selecting auxiliary phrases takes a great deal of time, and there is considerable doubt as to whether the collected phrases cover all the necessary ones and whether they were selected consistently. To overcome this problem, we propose a method that exploits the following heuristics about auxiliary phrases: (H1) An auxiliary phrase consists of hiragana characters; if kanji appear in it, each run of kanji has length 1. (H2) The characters immediately before and after an auxiliary phrase are restricted to a limited set of characters. (H3) The words that make up an auxiliary phrase are strongly connected to one another. First, for every N ≥ 4, we extract from the corpus all phrases of length N that consist only of hiragana characters and kanji runs of length 1. By (H1), every auxiliary phrase must be contained in the resulting set. Next, using (H2) and (H3), we remove the phrases that are not auxiliary phrases from this set. Finally, we remove redundant phrases by checking, for each phrase, whether a longer phrase in the set contains it. The phrases that remain are the auxiliary phrases we aim to extract. An advantage of this method is that it can be carried out easily even in a limited computing environment. We tested the method on one month of Asahi newspaper articles (about 9 Mbyte) and report the results.
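The candidate-generation step based on (H1) and the final removal of phrases contained in longer ones can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the Unicode ranges used to recognize hiragana and kanji, and the `max_len` cap are all assumptions made for illustration (the abstract applies the step for every N ≥ 4 and gives no upper bound). The (H2) and (H3) filters are left out because the abstract does not say how they are computed.

```python
from collections import Counter


def is_hiragana(ch):
    # Assumption: hiragana means the Unicode range U+3041..U+3096.
    return "\u3041" <= ch <= "\u3096"


def is_kanji(ch):
    # Assumption: kanji means the CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def satisfies_h1(phrase):
    """Check (H1): only hiragana and kanji, with no two kanji adjacent."""
    prev_kanji = False
    for ch in phrase:
        if is_kanji(ch):
            if prev_kanji:
                return False  # a kanji run longer than 1
            prev_kanji = True
        elif is_hiragana(ch):
            prev_kanji = False
        else:
            return False  # katakana, punctuation, digits, etc.
    return True


def extract_candidates(text, min_len=4, max_len=8):
    """Count every substring of length min_len..max_len that passes (H1).

    max_len is a hypothetical cap added to keep the sketch bounded.
    The check looks only at the window itself, not at its neighbours.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            if satisfies_h1(window):
                counts[window] += 1
    return counts


def drop_subsumed(phrases):
    """Remove any phrase that is contained in a longer phrase of the set."""
    phrases = set(phrases)
    return {p for p in phrases
            if not any(p != q and p in q for q in phrases)}


candidates = extract_candidates("彼に関して。")
remaining = drop_subsumed(candidates)
```

In the procedure described in the abstract, the (H2) and (H3) filters run between these two steps, so far fewer long candidates survive to absorb the shorter ones than in this unfiltered sketch.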

Citations (2)

References (4)

Details

  • CRID
    1390285697602021760
  • NII Article ID
    110002806721
  • NII Book ID
    AN10067140
  • DOI
    10.11517/jjsai.10.3_429
  • ISSN
    24358614
    21882266
  • Text Lang
    ja
  • Data Source
    • JaLC
    • CiNii Articles
  • Abstract License Flag
    Disallowed