Classifying scientific texts in biology for focus species
In recent years, high-throughput methods have led to a massive expansion of the free-text literature on molecular biology. Automated text mining has developed as an application technology for organizing this wealth of published results into structured database entries. The scale of the task is considerable: presently, there are more than 10,000 species, and the genome of a single one, the marbled lungfish (Protopterus aethiopicus), contains 132.8 billion base pairs. A typical systems biology abstract mentions four to five genes on average. Recording and encoding this information manually would therefore take prohibitive amounts of time and human resources, and building intelligent tools to help authors and database curators integrate published results into databases has become a major goal of research in biomedical natural language processing. However, a text often admits multiple interpretations, which makes identifying the author’s intended meaning extremely challenging for automated natural language processing.
This dissertation presents its contribution through a series of three experiments on identifying the focus species of biological papers as an aid to classifying and summarizing experimental results. The focus species is the species about which the authors make their major claim when reporting their own results. I present a new method for identifying focus species, using novel features, in full-text papers and abstracts, together with a new knowledge model for species citations in biomedical papers. With this scheme, I developed a tool that gives authors and curators a high-throughput means of determining the focus species of experimental papers. Unlike previous studies, my approach does not consider target documents in isolation but makes use of a network of citation relationships, amplifying information that is only implicit in the target document.
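As a rough illustration of how a citation network can amplify evidence that is only implicit in a target document, the following sketch counts species mentions in the target text and adds down-weighted counts from the documents it cites. The toy lexicon `SPECIES_TERMS`, the substring matcher, the `neighbor_weight` parameter, and the function names are all hypothetical assumptions made for this example; they are not the features or the classifier used in the dissertation.

```python
from collections import Counter

# Hypothetical toy lexicon: species label -> surface terms.
# This is an assumption for illustration, not the lexicon used in the thesis.
SPECIES_TERMS = {
    "mouse": ["mouse", "mice", "mus musculus"],
    "fly": ["drosophila", "fruit fly"],
    "yeast": ["yeast", "saccharomyces cerevisiae"],
}


def count_mentions(text):
    """Count case-insensitive substring matches of each species' terms."""
    lowered = text.lower()
    counts = Counter()
    for species, terms in SPECIES_TERMS.items():
        counts[species] = sum(lowered.count(term) for term in terms)
    return counts


def focus_species(target_text, cited_texts, neighbor_weight=0.5):
    """Return the species with the highest combined score, or None.

    Score = mentions in the target
            + neighbor_weight * mentions in each cited text.
    The weighting scheme is an assumed simplification.
    """
    scores = Counter()
    for species, n in count_mentions(target_text).items():
        scores[species] += n
    for cited in cited_texts:
        for species, n in count_mentions(cited).items():
            scores[species] += neighbor_weight * n
    best, best_score = max(scores.items(), key=lambda item: item[1])
    return best if best_score > 0 else None


if __name__ == "__main__":
    target = "We studied a kinase in mouse and yeast."
    cited = ["Saccharomyces cerevisiae yeast strains were grown."]
    print(focus_species(target, cited))
```

In this toy case the target text mentions mouse and yeast equally often, and the cited document tips the decision toward yeast. A real system would replace the substring matcher with proper species name recognition and learn the weights from annotated data.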
The features explored under each thesis question are evaluated on gold-standard data sets constructed by external groups for community evaluation exercises. In the experiments, three model organisms are classified in full papers selected from the BioCreative 1b dataset, and four model organisms are classified in abstracts selected from the DECA corpus. Across the three experiments, I obtained a best F-score of 90.7% for classifying full papers using internal features, and I showed that, when only internal features are used, classification of full papers performs much better than classification of abstracts. Using external features from related publications, I obtained a best F-score of 91.14% for classifying abstracts. Finally, I developed a new typed citation scheme and showed that, among the four citation classes of background, method, results and data, the strongest relation for aiding focus species classification was the one relating author results to the target paper.
The thesis explores the general question "What features are most effective for resolving conflicting evidence about the focus organism in biomedical abstracts and full texts?" Since the question is potentially open-ended, I break it down into three specific sub-questions:
1. What level of classification performance is achievable using state-of-the-art lexical semantic features for focus species in full papers and abstracts?
2. For abstracts that are cited or archived in the PubMed database, do bibliographic features improve classification accuracy?
3. For abstracts that are cited, does a typed citation function improve classification accuracy, and which citation types prove the most useful?
This Ph.D. dissertation presents a method for identifying the focus species of full-text papers and abstracts and a new citation scheme for biomedical papers. It consists of seven chapters. Chapter 1 gives the introduction, and Chapter 2 presents the related work.
Chapter 3 describes the first experiment, on focus species classification for full-text papers. Chapter 4 describes the second experiment, on focus species classification for abstracts. Chapter 5 discusses the new citation scheme for biomedical papers and its application to focus species classification. Chapter 6 discusses the difficult cases for the task and online tools. Chapter 7 concludes the dissertation and discusses future work.
There is one set of experiments for each thesis question. Hypothesis 1 is explored in a series of experiments in Chapter 3. Based on the findings of these experiments, which showed the relative merits of various in-document lexical semantic features, I conducted the Hypothesis 2 experiments, which are reported in Chapter 4. Based on the findings of the experiments in Chapter 4, which showed the effectiveness of bibliographic features, I conducted the Hypothesis 3 experiments, which are reported in Chapter 5.
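For readers unfamiliar with the F-scores quoted in this abstract, the sketch below shows how a per-class F-score can be computed from gold and predicted species labels. It assumes the balanced F1 measure (the harmonic mean of precision and recall) by default; the abstract does not state which F-score variant or averaging scheme was used, so the function name `f_score`, the `beta` parameter, and the toy labels are illustrative assumptions only.

```python
def f_score(gold, predicted, label, beta=1.0):
    """F-beta score for one class label (beta=1.0 gives the balanced F1).

    gold and predicted are equal-length sequences of class labels.
    Returns 0.0 when there are no true positives.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)

    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Toy labels, invented for this example only.
    gold = ["mouse", "mouse", "fly", "yeast"]
    predicted = ["mouse", "fly", "fly", "yeast"]
    for species in ("mouse", "fly", "yeast"):
        print(species, round(f_score(gold, predicted, species), 3))
```

With several species classes, the per-class scores are usually combined by micro- or macro-averaging; which of these applies to the figures reported here is not specified in the abstract.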