Various problems concerning the construction of a WWW Japanese-language corpus: the current state and future prospects of Japanese-language corpus research
This paper discusses problems that arise when Japanese-language text circulating on the World Wide Web (WWW) is used as a corpus. First, our review of previous research on Japanese-language corpora showed that the use of the WWW as a Japanese-language corpus has not yet been addressed sufficiently. We then examined all of the research papers presented over a two-year period at the national conventions of one Japanese academic society concerned with information and education. This showed that, although several research projects have used text-mining methods, almost none have dealt with WWW Japanese-language corpora. In light of these findings, we considered the problems that might arise in research on WWW Japanese-language corpora. The points that need to be considered include: 1) sample bias, 2) the self-images projected by authors, 3) verification of the validity of the content, 4) large numbers of submissions by the same people, 5) the fact that data management, including revision and updating of content, is often done at the individual level, and 6) plagiarism of written works and quotation from other sites. Although sample bias remains, the Internet gives us the first opportunity in history to accumulate vast quantities of personally published data, and it can now be used both quantitatively and qualitatively as a modern intellectual resource. Once suitable methods for analyzing this resource as an object of research have been clarified, it should be possible to pursue research toward constructing a WWW Japanese-language corpus.
愛知教育大学教育実践総合センター紀要 (12), 117-123, 2009-02