A Web Corpus and Word Sketches for Japanese

Irena Srdanović Erjavec, Tomaž Erjavec, Adam Kilgarriff

doi:10.5715/jnlp.15.2_137

Abstract

Among the major world languages, Japanese lags behind in publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processing, ‘word sketching’ (one-page summaries of a word's grammatical and collocational behaviour), a distributional thesaurus, and robot use. We describe the steps taken to gather and process the corpus and to establish its validity in terms of the kinds of language it contains. We then describe the development of a shallow grammar for Japanese to enable word sketching. We believe that the Japanese web corpus as loaded into the Sketch Engine will be a useful resource for a wide range of Japanese researchers, learners, and NLP developers.

Published in

自然言語処理 (Journal of Natural Language Processing) 15 (2), 137-159, 2008

The Association for Natural Language Processing (一般社団法人 言語処理学会)

Details

CRID: 1390001204476556032

NII Article ID: 130004291947; 10021991917

NII Bibliographic ID: AN10472659

DOI: 10.5715/jnlp.15.2_137

ISSN: 21858314; 13407619; http://id.crossref.org/issn/13407619

NDL Bibliographic ID: 9571652

Web Site: https://ndlsearch.ndl.go.jp/books/R000000004-I9571652; http://www.jstage.jst.go.jp/article/jnlp1994/15/2/15_2_137/_pdf

Text language code: en

Data source type

JaLC
NDL
Crossref
CiNii Articles

Abstract license flag: use not permitted

Cited by: 2

References: 31