Tools for building morphologically annotated corpora of Russian (形態情報注釈入りロシア語コーパス作成のためのツール) [in Japanese]

Abstract

This paper introduces two tools that I have developed for building morphologically annotated corpora of Russian, RusTagger and PostAno, and presents the annotation schemes used in them. RusTagger annotates texts with morphological information in XML. The <w>-</w> and <punc>-</punc> tags are added to texts to mark token (word or punctuation) boundaries. The following attributes are assigned to <w> tags: id, lemma, l_id, pos, subclass, infl-v, infl-n, amb, and unfound. The id attribute identifies the position of each word. The lemma (base form) of each word form is given as the value of the lemma attribute. The l_id attribute distinguishes homonyms. Part-of-speech information is provided hierarchically as the values of the pos and subclass attributes. The values of the infl-v and infl-n attributes provide information on the inflectional categories of verbal and nominal elements, respectively. The amb attribute is used to locate lexically ambiguous word forms. The unfound attribute shows that the relevant word form is absent from the dictionary that RusTagger uses. The attributes assigned to <punc> tags are pre and post, which show whether there is a space before and after the relevant punctuation mark, respectively. RusTagger can be customized by replacing the dictionary file and/or rewriting the files infl_v.map and infl_n.map. The program MakeDicFile enables users to define their own classification of morphological information in new dictionary files. The files infl-v.map and infl-n.map define the format in which the values of the infl-v and infl-n attributes are displayed. RusTagger can also annotate texts that are already partially marked up in XML. This function enables users to add <w>-</w> and <punc>-</punc> tags to texts that are marked up to the level of the sentence. PostAno is used to edit texts that have been marked up with RusTagger. It enables users to find word forms that are lexically or morphologically ambiguous, as well as word forms that are not registered in the dictionary that RusTagger uses. RusTagger and PostAno can be freely downloaded from the following sites:
RusTagger: http://www.otaru-uc.ac.jp/-hisanari/ja/software/rustagger/index.html
PostAno: http://www.otaru-uc.ac.jp/-hisanari/ja/software/postano/index.html
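
To make the annotation scheme above more concrete, the following is a minimal Python sketch of how a text annotated in this way might be read back. It is not RusTagger or PostAno code. The element names <w> and <punc> and the attribute names are taken from the abstract; the <s> sentence element, every attribute value (tag sets, "1"/"0" flags), and the function names tokens_needing_review and detokenize are assumptions made only for illustration. In particular, the sketch assumes that the mere presence of an amb or unfound attribute marks a word for review.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence. Attribute NAMES follow the abstract;
# the <s> wrapper and all attribute VALUES are invented for this sketch.
SAMPLE = """<s>
  <w id="1" unfound="1">Куздра</w>
  <w id="2" lemma="мыть" l_id="1" pos="v" subclass="main" infl-v="past.sg.f" amb="1">мыла</w>
  <w id="3" lemma="рама" l_id="1" pos="n" subclass="common" infl-n="acc.sg">раму</w>
  <punc pre="0" post="0">.</punc>
</s>"""


def tokens_needing_review(xml_text):
    """Return (id, word form, flag) for every <w> carrying amb or unfound.

    Assumption: an attribute that is present at all counts as a flag.
    """
    root = ET.fromstring(xml_text)
    flagged = []
    for w in root.iter("w"):
        for flag in ("amb", "unfound"):
            if w.get(flag) is not None:
                flagged.append((w.get("id"), w.text, flag))
    return flagged


def detokenize(xml_text):
    """Rebuild the plain text, using pre/post on <punc> to place spaces.

    Assumption: pre="1" / post="1" mean "a space was present".
    """
    root = ET.fromstring(xml_text)
    out = []
    space_pending = False  # should a space precede the next word?
    for tok in root:
        if tok.tag == "punc":
            if tok.get("pre") == "1" and out:
                out.append(" ")
            out.append(tok.text)
            space_pending = tok.get("post") == "1"
        elif tok.tag == "w":
            if space_pending:
                out.append(" ")
            out.append(tok.text)
            space_pending = True  # words are separated by a space by default
    return "".join(out)


if __name__ == "__main__":
    for word_id, form, flag in tokens_needing_review(SAMPLE):
        print(word_id, form, flag)
    print(detokenize(SAMPLE))
```

In this sketch the first function lists the word forms that a post-editing step would need to inspect, and the second shows why the pre and post attributes are sufficient to restore the original spacing around punctuation.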

Journal

  • ロシア語ロシア文学研究

    ロシア語ロシア文学研究 (36), 111-118, 2004

    日本ロシア文学会

Codes

  • NII Article ID (NAID)
    110001247122
  • NII NACSIS-CAT ID (NCID)
    AN10428431
  • Text Lang
    JPN
  • ISSN
    03873277
  • NDL Article ID
    7167706
  • NDL Source Classification
    ZK31 (Language and literature -- Foreign languages and foreign literature)
  • NDL Call No.
    Z12-233
  • Data Source
    NDL  NII-ELS  NDL-Digital 