ロシア語テキストへのメタ情報タグ付与に関する一考察

Bibliographic Information

Other Title
  • An Approach to the Determination of Metadata Tags for Russian Texts

Abstract

In this article, we examined the metadata attached to texts in the Russian National Corpus (RNC) and considered what kinds of metadata could be added so that users can collect the samples they need more efficiently. In the current RNC, morphological and semantic metadata is generally attached to the unit of meaning classified as “word”. Using this metadata, users can collect various samples by setting composite search criteria. However, no metadata is attached to the unit of meaning classified as “sentence”, so there is room for improvement in obtaining more meaningful search results. In this paper, we reviewed the possibility of adding metadata to the texts constituting the RNC by making use of linguistic features of Russian such as sentence classification. Russian sentences can be classified into five types according to their morphological and semantic features: 1) the definite personal sentence (Я люблю работать), 2) the indefinite personal sentence (Здесь говорят только по-русски), 3) the generalized personal sentence (Тише едешь, дальше будешь), 4) the impersonal sentence (Мне холодно!) and 5) the infinitive sentence (Тебе этого не понять!). If metadata classified in this way is attached to the “sentence” unit, it will be possible to investigate the grammatical behavior of words in a more detailed way. The difficulty in attaching such metadata is that sentence classification cannot be performed mechanically for all sentences; in general, the semantic context must be considered when classifying. For some sentence types, however, the classification can be determined unambiguously. First, if the sentence contains a grammatical subject (i.e. a nominative noun), it can be classified as a ‘definite personal sentence’.
On the other hand, if the sentence has no grammatical subject, the non-past and past tenses must be distinguished. In the non-past tense, the classification is unambiguous when the verb is in the first-person singular form; such a sentence can be marked mechanically as a ‘definite personal sentence’. In the past tense, the sentence can be classified as a ‘definite personal sentence’ only when the verb is in the masculine or feminine form. Whenever a sentence has an impersonal predicate, it is unambiguously classified as an ‘impersonal sentence’.
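
The mechanical rules described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `SentenceFeatures` fields, the label strings, and the order in which the rules are tested (impersonal predicate first) are assumptions made for the example, and the features are assumed to have been extracted beforehand from word-level morphological tags. Sentences that the rules do not cover return `None`, meaning that semantic context would still be needed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SentenceFeatures:
    """Hypothetical per-sentence features derived from word-level tags."""
    has_nominative_subject: bool
    has_impersonal_predicate: bool = False
    tense: Optional[str] = None    # "past" or "nonpast"
    person: Optional[int] = None   # 1, 2, 3 (non-past forms only)
    number: Optional[str] = None   # "sg" or "pl"
    gender: Optional[str] = None   # "m", "f", "n" (past singular forms only)


def classify_mechanically(f: SentenceFeatures) -> Optional[str]:
    """Return a sentence-type label when it is unambiguous, else None."""
    # Assumption: the impersonal-predicate rule is checked before the others.
    if f.has_impersonal_predicate:
        return "impersonal"
    # A grammatical (nominative) subject is present.
    if f.has_nominative_subject:
        return "definite personal"
    # No subject, non-past tense: only the first-person singular is unambiguous.
    if f.tense == "nonpast" and f.person == 1 and f.number == "sg":
        return "definite personal"
    # No subject, past tense: only masculine or feminine forms are unambiguous.
    if f.tense == "past" and f.gender in ("m", "f"):
        return "definite personal"
    # Everything else requires semantic context.
    return None


# "Я люблю работать": nominative subject present.
print(classify_mechanically(SentenceFeatures(has_nominative_subject=True)))
# "Мне холодно!": impersonal predicate, no nominative subject.
print(classify_mechanically(
    SentenceFeatures(has_nominative_subject=False, has_impersonal_predicate=True)))
# "Здесь говорят только по-русски": third-person plural, not decidable by these rules.
print(classify_mechanically(
    SentenceFeatures(has_nominative_subject=False, tense="nonpast", person=3, number="pl")))
```

Returning `None` rather than a default label keeps the undecidable sentences separate, so they can be passed on to manual or context-based classification.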
