Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures


Author

    • Wang, Yuxin (ワン, ユーシン)

Bibliographic Information

Title

Study on building a high-quality homepage collection from the Web considering page group structures

Alternative Title

Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures

Author Name

Wang, Yuxin

Alternative Author Name

ワン, ユーシン

Degree-Granting University

The Graduate University for Advanced Studies (SOKENDAI)

Degree

Doctor of Philosophy (Informatics)

Degree Number

甲第1000号

Date of Degree Conferral

2006-09-29

Notes and Abstract

Doctoral dissertation

This dissertation investigates a method for efficiently building a high-quality homepage collection from the web by considering page group structures. We mainly investigate researchers' homepages, and partly homepages of other categories.

A web page collection with a guaranteed high quality (i.e., high recall and high precision) is required for implementing high-quality web-based information services. Building such a collection demands a large amount of human work, however, because of the diversity, vastness, and sparseness of web pages. Even though many researchers have investigated methods for searching and classifying web pages, most of those methods are best-effort types and pay no attention to quality assurance. We therefore investigate a method for building a homepage collection efficiently while assuring a given high quality, with the expectation that the method will also be applicable to collecting various other categories of homepages.

This dissertation consists of seven chapters. Chapter 1 gives the introduction, and Chapter 2 presents the related work. Chapter 3 describes the objectives, the overall performance goal of the investigated system, and the scheme of the system. Chapters 4 and 5 discuss the two parts of our two-step-processing method in detail. Chapter 6 discusses the method for reducing the processing cost of the system, and Chapter 7 concludes the dissertation by summarizing it and discussing future work.

Chapter 3, taking into account the enormous size of the real web, introduces a two-step-processing method comprising rough filtering and accurate classification. The former narrows down the set of candidate pages efficiently with the required high recall, and the latter accurately classifies the candidate pages into three classes (assured positive, assured negative, and uncertain) while assuring the required precision and recall.

We present in detail the configuration, the experiments, and the evaluation of the rough filtering in Chapter 4. The rough filtering is a method for gathering researchers' homepages (or entry pages) by applying our original, simple, and effective local page group models, which exploit the mutual relations between the structure and the content of a logical page group. It aims at narrowing down the candidates with a very high recall. First, property-based keyword lists that correspond to researchers' common properties are created and grouped as either organization-related or non-organization-related. Next, four page group models (PGMs) that take into consideration the structure of an individual logical page group are introduced: PGM_Od models the out-linked pages in the same and lower directories, PGM_Ou models the out-linked pages in the upper directories, PGM_I models the in-linked pages in the same and upper directories, and PGM_U models the site-top and directory-entry pages in the same and upper directories.

Based on the PGMs, the keywords are propagated to a potential entry page from its surrounding pages to compose a virtual entry page. Finally, the virtual entry pages that score at least a threshold value are selected.
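As a rough illustration of the propagation-and-scoring idea, here is a minimal Python sketch. It assumes hypothetical structures (a `Page` record, a `pages` map keyed by URL, and property-based keyword lists) and sketches only the PGM_Od relation; it is not the dissertation's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str                                        # page URL
    text: str                                       # page text content
    out_links: list = field(default_factory=list)   # URLs this page links to

def directory(url: str) -> str:
    """Directory part of a URL, e.g. http://x/lab/index.html -> http://x/lab/."""
    return url.rsplit("/", 1)[0] + "/"

def virtual_entry_text(page: Page, pages: dict) -> str:
    """Compose a virtual entry page under the PGM_Od relation only:
    text is propagated from out-linked pages in the same or lower directories."""
    propagated = [pages[u].text for u in page.out_links
                  if u in pages
                  and directory(pages[u].url).startswith(directory(page.url))]
    return " ".join([page.text] + propagated)

def score(text: str, keyword_lists: dict) -> int:
    """One point per researcher property whose keyword list is matched."""
    return sum(any(kw in text for kw in kws) for kws in keyword_lists.values())

def rough_filter(pages: dict, keyword_lists: dict, threshold: int = 3) -> list:
    """Keep pages whose virtual entry page matches at least `threshold` properties."""
    return [p.url for p in pages.values()
            if score(virtual_entry_text(p, pages), keyword_lists) >= threshold]

# Hypothetical property-based keyword lists, keyed by property name.
keyword_lists = {"affiliation": ["university", "laboratory"],
                 "publications": ["journal", "proceedings"],
                 "contact": ["e-mail", "phone"]}
```

The modified PGMs described next restrict which keywords such a propagation step is allowed to carry over.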
Since applying the PGMs generally introduces a lot of noise, we introduce four modified PGMs incorporating two original techniques: keywords are propagated based on PGM_Od only when the number of out-linked pages in the same and lower directories is less than a threshold value, and only the organization-related keywords are propagated based on the other PGMs. The four modified PGMs are used in combination in order to utilize as many informative keywords as possible from the surrounding pages.

The effectiveness of the method is shown by comparing it with that of a single-page-based method through experiments using a 100GB web data set and a manually created sample data set. The results show that the output pages from the rough filtering are less than 23% of the pages in the 100GB data set when the four modified PGMs are used in combination, under the condition that the recall is more than 98%. Another experiment using a 1.36TB web data set with the same rough filtering configuration shows that the output pages are less than 15% of the pages in the corpus.

In Chapter 5 we present in detail the configuration, the experiments, and the evaluation of the accurate classification method. Using two types of component classifiers (a recall-assured classifier and a precision-assured classifier) in combination, we construct a three-way classifier that takes as input the candidate pages output by the rough filtering and classifies them into three classes: assured positive, assured negative, and uncertain. The assured positive output assures the precision, and the assured positive and uncertain outputs together assure the recall, so only the uncertain output needs to be manually assessed in order to assure the quality of the web data collection.

We first devise a feature set for building the high-performance component classifiers using a Support Vector Machine (SVM). We use textual features obtained from each page and its surrounding pages. After the surrounding pages are grouped according to connection type (in-link, out-link, or directory entry) and relative URL hierarchy (same, upper, or lower in the directory hierarchy), an independent feature subset is generated from each group. The feature subsets are then conceptually concatenated to compose the feature set of a classifier. We use two types of textual features (plain-text-based and tagged-text-based). The classifier using only the plain-text-based features of each page alone serves as the baseline. Various feature sets are tested in experiments using manually prepared sample data, and the classifiers are tuned by two methods, one offset-based and the other c-j-option-based. The results show that the performance gain obtained with the c-j-option-based tuning method is statistically significant at the 95% confidence level. The F-measures of the baseline and the two best-performing classifiers are 83.26%, 88.65%, and 88.58%, respectively, showing that the proposed method is clearly effective.
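The following is a minimal sketch of this per-group feature construction, using scikit-learn as a stand-in for the SVM tooling actually used; the group names, the `page_texts` mapping, and the class weights are illustrative assumptions, not the dissertation's configuration.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# One feature subset per surrounding-page group (connection type x URL hierarchy).
GROUPS = ["self", "out_same_lower", "out_upper", "in_same_upper", "dir_entry"]

def group_features(page_texts: dict) -> dict:
    """Bag-of-words features, prefixed by group name so the subsets stay
    independent when concatenated into one feature vector."""
    feats = {}
    for group in GROUPS:
        for token, count in Counter(page_texts.get(group, "").split()).items():
            feats[f"{group}:{token}"] = count
    return feats

# Hypothetical training data: group -> text mappings with binary labels.
train_pages = [
    {"self": "professor of informatics", "out_upper": "department faculty members"},
    {"self": "today lunch photo diary"},
]
train_labels = [1, 0]   # 1 = researcher homepage, 0 = other

vec = DictVectorizer()
X = vec.fit_transform([group_features(p) for p in train_pages])

# class_weight is used here in a role loosely analogous to the c-j-option-based
# tuning mentioned above: weighting the positive class trades precision for recall.
clf = LinearSVC(class_weight={1: 2.0, 0: 1.0})
clf.fit(X, train_labels)
```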
To examine the performance of classifiers with the above feature sets in more general cases, we also experimented with our method on the WebKB data set, a test collection commonly used for the web page classification task. It contains seven categories, four of which (course, faculty, project, and student) are used for comparing the performance. The experimental results show that our method outperforms all seven of the previous methods in terms of macro-averaged F-measure. We can therefore conclude that our method performs fairly well and is applicable not only to researchers' homepages in Japanese but also to other categories of homepages in other languages.

By tuning the well-performing classifiers independently, we then build a recall-assured classifier and a precision-assured classifier and compose a three-way classifier by using them in combination. We estimated the numbers of pages that would have to be manually assessed for required precision/recall levels of 99.5%/98%, 99%/95%, and 98%/90%, using the pages output from a 100GB data set by the rough filtering. The results show that, compared to the baseline, the manual assessment cost can be reduced to 77.6%, 57.3%, and 51.8%, respectively. We also analyzed example classification results, which demonstrate the effectiveness of the classifiers.

In Chapter 6 a cascaded structure of recall-assured classifiers, used in combination with the rough filtering, is proposed for reducing the computer processing cost. Estimates of the numbers of pages requiring feature extraction in the accurate classification show that the computer processing cost can be reduced to 27.5% for the 100GB data set and 18.3% for the 1.36TB data set.

In Chapter 7 we summarize our contributions. One of our unique contributions is that we point out the importance of assuring the quality of a web page collection and propose a framework for doing so. Another is that we introduce the idea of local page group models (PGMs) and demonstrate its effective use for filtering and classifying web pages.

We first presented a realistic framework for building a high-quality web page collection with a two-step process, the rough filtering followed by the accurate classification, in order to reduce the processing cost. In the rough filtering we contributed two original key techniques, used in the modified PGMs, for reducing the irrelevant keywords to be propagated: one is to introduce a threshold on the number of out-linked pages in the same and lower directories, and the other is to introduce keyword-list types and propagate only the organization-related keyword lists from the upper directories. In the accurate classification we contributed not only an original method that exploits features from the surrounding pages and concatenates them independently to improve web page classification performance, but also a way to use a recall-assured classifier and a precision-assured classifier in combination as a three-way classifier in order to reduce the number of pages requiring manual assessment under the given quality constraints.

We also discuss future work: finding a more systematic way to revise the property set and the property-based keywords for the rough filtering, investigating ways to estimate the likelihood of the component pages and incorporate it into the accurate classification, and further utilizing the information from the homepage collection in practical applications.
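To make the three-way decision rule concrete, here is a minimal sketch assuming two already tuned binary classifiers with a scikit-learn-style decision_function; the offsets and classifier objects are placeholders rather than the dissertation's tuned models.

```python
from enum import Enum

class Decision(Enum):
    ASSURED_POSITIVE = "assured positive"   # counted toward the assured precision
    ASSURED_NEGATIVE = "assured negative"   # safely discarded, preserving recall
    UNCERTAIN = "uncertain"                 # sent to manual assessment

def three_way(x, precision_clf, recall_clf,
              precision_offset: float = 0.0, recall_offset: float = 0.0) -> Decision:
    """Combine a precision-assured and a recall-assured classifier.

    `x` is a single-example feature matrix (1 x n_features). Pages accepted by
    the precision-assured classifier are assured positive; pages rejected by the
    recall-assured classifier are assured negative; everything else is uncertain
    and is assessed manually, so the collection meets the required quality.
    """
    if precision_clf.decision_function(x)[0] > precision_offset:
        return Decision.ASSURED_POSITIVE
    if recall_clf.decision_function(x)[0] <= recall_offset:
        return Decision.ASSURED_NEGATIVE
    return Decision.UNCERTAIN
```

The cascaded structure of Chapter 6 applies recall-assured classifiers in this discarding role ahead of the full accurate classification, which is how the number of pages requiring feature extraction is reduced.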

application/pdf

総研大甲第1000号


Codes

  • NII Article ID (NAID)
    500000375648
  • NII Author ID (NRID)
    • 8000000376819
  • Text Language Code
    • eng
  • NDL Bibliographic ID
    • 000008550898
  • Data Source
    • Institutional Repository
    • NDL ONLINE