How Intuitive Are Diversified Search Metrics? Concordance Test Results for the Diversity U-measures

Abstract

For the past few decades, ranked retrieval (e.g. web search) has been evaluated using rank-based evaluation metrics such as Average Precision and normalised Discounted Cumulative Gain (nDCG). These metrics discount the value of each retrieved relevant document based on its rank. The situation is similar with diversified search, which has gained popularity recently: diversity metrics such as α-nDCG, Intent-Aware Expected Reciprocal Rank (ERR-IA) and D♯-nDCG are also rank-based. These widely-used evaluation metrics regard the system output merely as a list of document IDs, ignoring all other features such as snippets and document full texts of various lengths. The recently-proposed U-measure framework of Sakai and Dou uses the amount of text read by the user as the foundation for discounting the value of relevant information, and can take into account the user's snippet reading and full text reading behaviours. The present study compares the diversity versions of U-measure (D-U and U-IA) with state-of-the-art diversity metrics in terms of how "intuitive" they are: given a pair of ranked lists, we quantify the ability of each metric to favour the more diversified and more relevant list by means of the concordance test. Our results show that while D♯-nDCG is the overall winner in terms of simultaneous concordance with diversity and relevance, D-U and U-IA statistically significantly outperform other state-of-the-art metrics. Moreover, in terms of concordance with relevance alone, D-U and U-IA significantly outperform all rank-based diversity metrics. These results suggest that D-U and U-IA are not only more realistic than rank-based metrics but also intuitive, i.e., that they measure what we want to measure.
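The concordance test described in the abstract can be sketched in a few lines: among all pairs of ranked lists on which a gold-standard measure (e.g. a simple diversity or relevance measure) expresses a strict preference, count the fraction of pairs on which the metric under test agrees. This is a minimal illustration of the idea, not the paper's exact protocol; the function name and the toy score vectors are assumptions for the example.

```python
from itertools import combinations

def concordance(metric_scores, gold_scores):
    """Fraction of list pairs, among those where the gold-standard
    measure has a strict preference, on which the evaluated metric
    agrees with that preference."""
    agree = total = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        g = gold_scores[i] - gold_scores[j]
        if g == 0:
            continue  # gold standard is indifferent: pair is ignored
        total += 1
        m = metric_scores[i] - metric_scores[j]
        if m * g > 0:  # same sign means the metric agrees with gold
            agree += 1
    return agree / total if total else 0.0

# Toy example: scores for four ranked lists (values are illustrative).
gold   = [0.9, 0.5, 0.5, 0.1]   # e.g. a simple diversity measure
metric = [0.8, 0.4, 0.6, 0.2]   # e.g. a diversity metric under test
print(concordance(metric, gold))  # agrees on every decided pair -> 1.0
```

In the paper's setting the test is run twice per metric, once against a diversity gold standard and once against a relevance gold standard, and "simultaneous concordance" requires agreement with both at once.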

Journal

  • IPSJ SIG Notes

    IPSJ SIG Notes 2013-IFAT-111(12), 1-6, 2013-07-15

    Information Processing Society of Japan (IPSJ)

Codes

  • NII Article ID (NAID)
    110009585854
  • NII NACSIS-CAT ID (NCID)
    AN10114171
  • Text Lang
    ENG
  • Data Source
    NII-ELS 