Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures


Author

    • Wang, Yuxin (ワン, ユーシン)

Bibliographic Information

Title

Study on building a high-quality homepage collection from the Web considering page group structures

Alternative Title

Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures

Author Name

Wang, Yuxin

Alternative Author Name

ワン, ユーシン

Degree-Granting University

The Graduate University for Advanced Studies (SOKENDAI)

Degree

Doctor of Philosophy (Informatics)

Degree Number

甲第1000号

Date of Degree Conferral

2006-09-29

Notes and Abstract

Doctoral dissertation

This dissertation investigates a method for efficiently building a high-quality homepage collection from the web by considering page group structures. We mainly investigate researchers' homepages, and partly homepages of other categories.

A web page collection with a guaranteed high quality (i.e., high recall and high precision) is required for implementing high-quality web-based information services. Building such a collection demands a large amount of human work, however, because of the diversity, vastness, and sparseness of web pages. Even though many researchers have investigated methods for searching and classifying web pages, most of those methods are best-effort types and pay no attention to quality assurance. We therefore investigate a method for building a homepage collection efficiently while assuring a given high quality, with the expectation that the method will also be applicable to collecting various other categories of homepages.

This dissertation consists of seven chapters. Chapter 1 gives the introduction, and Chapter 2 presents the related work. Chapter 3 describes the objectives, the overall performance goal of the investigated system, and the scheme of the system. Chapters 4 and 5 discuss the two parts of our two-step-processing method in detail. Chapter 6 discusses the method for reducing the processing cost of the system, and Chapter 7 concludes the dissertation by summarizing it and discussing future work.

Chapter 3, taking into account the enormous size of the real web, introduces a two-step-processing method comprising rough filtering and accurate classification. The former narrows down the set of candidate pages efficiently with the required high recall, and the latter accurately classifies the candidate pages into three classes (assured positive, assured negative, and uncertain) while assuring the required precision and recall.

We present in detail the configuration, the experiments, and the evaluation of the rough filtering in Chapter 4. The rough filtering is a method for gathering researchers' homepages (or entry pages) by applying our original, simple, and effective local page group models, which exploit the mutual relations between the structure and the content of a logical page group. It aims at narrowing down the candidates with a very high recall. First, property-based keyword lists that correspond to researchers' common properties are created and grouped as either organization-related or non-organization-related. Next, four page group models (PGMs) that take into consideration the structure of an individual logical page group are introduced: PGM_Od models the out-linked pages in the same and lower directories, PGM_Ou models the out-linked pages in the upper directories, PGM_I models the in-linked pages in the same and upper directories, and PGM_U models the site-top and directory-entry pages in the same and upper directories.

Based on the PGMs, the keywords are propagated to a potential entry page from its surrounding pages to compose a virtual entry page. Finally, the virtual entry pages that score at least a threshold value are selected.
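As a rough illustration of the propagation-and-scoring idea, here is a minimal Python sketch. It assumes hypothetical structures (a `Page` record, a `pages` map keyed by URL, and property-based keyword lists) and sketches only the PGM_Od relation; it is not the dissertation's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str                                        # page URL
    text: str                                       # page text content
    out_links: list = field(default_factory=list)   # URLs this page links to

def directory(url: str) -> str:
    """Directory part of a URL, e.g. http://x/lab/index.html -> http://x/lab/."""
    return url.rsplit("/", 1)[0] + "/"

def virtual_entry_text(page: Page, pages: dict) -> str:
    """Compose a virtual entry page under the PGM_Od relation only:
    text is propagated from out-linked pages in the same or lower directories."""
    propagated = [pages[u].text for u in page.out_links
                  if u in pages
                  and directory(pages[u].url).startswith(directory(page.url))]
    return " ".join([page.text] + propagated)

def score(text: str, keyword_lists: dict) -> int:
    """One point per researcher property whose keyword list is matched."""
    return sum(any(kw in text for kw in kws) for kws in keyword_lists.values())

def rough_filter(pages: dict, keyword_lists: dict, threshold: int = 3) -> list:
    """Keep pages whose virtual entry page matches at least `threshold` properties."""
    return [p.url for p in pages.values()
            if score(virtual_entry_text(p, pages), keyword_lists) >= threshold]

# Hypothetical property-based keyword lists, keyed by property name.
keyword_lists = {"affiliation": ["university", "laboratory"],
                 "publications": ["journal", "proceedings"],
                 "contact": ["e-mail", "phone"]}
```

The modified PGMs described next restrict which keywords such a propagation step is allowed to carry over.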
Since applying the PGMs generally introduces a lot of noise, we introduce four modified PGMs incorporating two original techniques: keywords are propagated based on PGM_Od only when the number of out-linked pages in the same and lower directories is less than a threshold value, and only the organization-related keywords are propagated based on the other PGMs. The four modified PGMs are used in combination in order to utilize as many informative keywords as possible from the surrounding pages.

The effectiveness of the method is shown by comparing it with that of a single-page-based method through experiments using a 100GB web data set and a manually created sample data set. The results show that the output pages from the rough filtering are less than 23% of the pages in the 100GB data set when the four modified PGMs are used in combination, under the condition that the recall is more than 98%. Another experiment using a 1.36TB web data set with the same rough filtering configuration shows that the output pages are less than 15% of the pages in the corpus.

In Chapter 5 we present in detail the configuration, the experiments, and the evaluation of the accurate classification method. Using two types of component classifiers (a recall-assured classifier and a precision-assured classifier) in combination, we construct a three-way classifier that takes as input the candidate pages output by the rough filtering and classifies them into three classes: assured positive, assured negative, and uncertain. The assured positive output assures the precision, and the assured positive and uncertain outputs together assure the recall, so only the uncertain output needs to be manually assessed in order to assure the quality of the web data collection.

We first devise a feature set for building the high-performance component classifiers using a Support Vector Machine (SVM). We use textual features obtained from each page and its surrounding pages. After the surrounding pages are grouped according to connection type (in-link, out-link, or directory entry) and relative URL hierarchy (same, upper, or lower in the directory hierarchy), an independent feature subset is generated from each group. The feature subsets are then conceptually concatenated to compose the feature set of a classifier. We use two types of textual features (plain-text-based and tagged-text-based). The classifier using only the plain-text-based features of each page alone serves as the baseline. Various feature sets are tested in experiments using manually prepared sample data, and the classifiers are tuned by two methods, one offset-based and the other c-j-option-based. The results show that the performance gain obtained with the c-j-option-based tuning method is statistically significant at the 95% confidence level. The F-measures of the baseline and the two best-performing classifiers are 83.26%, 88.65%, and 88.58%, respectively, showing that the proposed method is clearly effective.
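The following is a minimal sketch of this per-group feature construction, using scikit-learn as a stand-in for the SVM tooling actually used; the group names, the `page_texts` mapping, and the class weights are illustrative assumptions, not the dissertation's configuration.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# One feature subset per surrounding-page group (connection type x URL hierarchy).
GROUPS = ["self", "out_same_lower", "out_upper", "in_same_upper", "dir_entry"]

def group_features(page_texts: dict) -> dict:
    """Bag-of-words features, prefixed by group name so the subsets stay
    independent when concatenated into one feature vector."""
    feats = {}
    for group in GROUPS:
        for token, count in Counter(page_texts.get(group, "").split()).items():
            feats[f"{group}:{token}"] = count
    return feats

# Hypothetical training data: group -> text mappings with binary labels.
train_pages = [
    {"self": "professor of informatics", "out_upper": "department faculty members"},
    {"self": "today lunch photo diary"},
]
train_labels = [1, 0]   # 1 = researcher homepage, 0 = other

vec = DictVectorizer()
X = vec.fit_transform([group_features(p) for p in train_pages])

# class_weight is used here in a role loosely analogous to the c-j-option-based
# tuning mentioned above: weighting the positive class trades precision for recall.
clf = LinearSVC(class_weight={1: 2.0, 0: 1.0})
clf.fit(X, train_labels)
```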
To examine the performance of classifiers with the above feature sets in more general cases, we also experimented with our method on the WebKB data set, a test collection commonly used for the web page classification task. It contains seven categories, four of which (course, faculty, project, and student) are used for comparing the performance. The experimental results show that our method outperforms all seven of the previous methods in terms of macro-averaged F-measure. We can therefore conclude that our method performs fairly well and is applicable not only to researchers' homepages in Japanese but also to other categories of homepages in other languages.

By tuning the well-performing classifiers independently, we then build a recall-assured classifier and a precision-assured classifier and compose a three-way classifier by using them in combination. We estimated the numbers of pages that would have to be manually assessed for required precision/recall levels of 99.5%/98%, 99%/95%, and 98%/90%, using the pages output from a 100GB data set by the rough filtering. The results show that, compared to the baseline, the manual assessment cost can be reduced to 77.6%, 57.3%, and 51.8%, respectively. We also analyzed example classification results, which demonstrate the effectiveness of the classifiers.

In Chapter 6 a cascaded structure of recall-assured classifiers, used in combination with the rough filtering, is proposed for reducing the computer processing cost. Estimates of the numbers of pages requiring feature extraction in the accurate classification show that the computer processing cost can be reduced to 27.5% for the 100GB data set and 18.3% for the 1.36TB data set.

In Chapter 7 we summarize our contributions. One of our unique contributions is that we point out the importance of assuring the quality of a web page collection and propose a framework for doing so. Another is that we introduce the idea of local page group models (PGMs) and demonstrate its effective use for filtering and classifying web pages.

We first presented a realistic framework for building a high-quality web page collection with a two-step process, the rough filtering followed by the accurate classification, in order to reduce the processing cost. In the rough filtering we contributed two original key techniques, used in the modified PGMs, for reducing the irrelevant keywords to be propagated: one is to introduce a threshold on the number of out-linked pages in the same and lower directories, and the other is to introduce keyword-list types and propagate only the organization-related keyword lists from the upper directories. In the accurate classification we contributed not only an original method that exploits features from the surrounding pages and concatenates them independently to improve web page classification performance, but also a way to use a recall-assured classifier and a precision-assured classifier in combination as a three-way classifier in order to reduce the number of pages requiring manual assessment under the given quality constraints.

We also discuss future work: finding a more systematic way to revise the property set and the property-based keywords for the rough filtering, investigating ways to estimate the likelihood of the component pages and incorporate it into the accurate classification, and further utilizing the information from the homepage collection in practical applications.
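To make the three-way decision rule concrete, here is a minimal sketch assuming two already tuned binary classifiers with a scikit-learn-style decision_function; the offsets and classifier objects are placeholders rather than the dissertation's tuned models.

```python
from enum import Enum

class Decision(Enum):
    ASSURED_POSITIVE = "assured positive"   # counted toward the assured precision
    ASSURED_NEGATIVE = "assured negative"   # safely discarded, preserving recall
    UNCERTAIN = "uncertain"                 # sent to manual assessment

def three_way(x, precision_clf, recall_clf,
              precision_offset: float = 0.0, recall_offset: float = 0.0) -> Decision:
    """Combine a precision-assured and a recall-assured classifier.

    `x` is a single-example feature matrix (1 x n_features). Pages accepted by
    the precision-assured classifier are assured positive; pages rejected by the
    recall-assured classifier are assured negative; everything else is uncertain
    and is assessed manually, so the collection meets the required quality.
    """
    if precision_clf.decision_function(x)[0] > precision_offset:
        return Decision.ASSURED_POSITIVE
    if recall_clf.decision_function(x)[0] <= recall_offset:
        return Decision.ASSURED_NEGATIVE
    return Decision.UNCERTAIN
```

The cascaded structure of Chapter 6 applies recall-assured classifiers in this discarding role ahead of the full accurate classification, which is how the number of pages requiring feature extraction is reduced.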

application/pdf

総研大甲第1000号


Codes

  • NII Article ID (NAID)
    500000375648
  • NII Author ID (NRID)
    • 8000000376819
  • Text Language Code
    • eng
  • NDL Bibliographic ID
    • 000008550898
  • Data Source
    • Institutional Repository
    • NDL ONLINE