Abstract:
Interest in querying and ranking XML documents has grown in recent years. In this paper, we present a new framework for querying and ranking schema-less XML documents based on concise summaries of their structural and textual content. We introduce a novel data synopsis structure that summarizes the textual content of an XML document for efficient indexing. More importantly, we extend the traditional vector space model to rank XML documents effectively over the proposed data synopses. We conduct extensive experiments on XML benchmark data to demonstrate the advantages of the indexing scheme and the effectiveness of the ranking scheme. We also compare our framework with Lucene to show that our extended TF*IDF scoring function is effective.

Categories and Subject Descriptors: D.3.2 [Language Classifications]: Data-flow languages; H.3.1 [Content Analysis and Indexing]; I.7 [Document and Text Processing]: Markup languages; H.3.3 [Information Search and Retrieval]: Query Formulation

General Terms: XML, Information Retrieval, Data Processing

Keywords: XML, Query processing, Document ranking, Query synopses

Received: 11 June 2011, Revised 12 August 2011, Accepted 19 August 2011