OREGON STATE UNIVERSITY

Keyword search for data-centric XML collections with long text fields

Title: Keyword search for data-centric XML collections with long text fields
Publication Type: Conference Paper
Year of Publication: 2010
Authors: Termehchy, A., and M. Winslett
Secondary Authors: Manolescu, I., S. Spaccapietra, J. Teubner, M. Kitsuregawa, A. Leger, F. Naumann, A. Ailamaki, and F. Ozcan
Conference Name: Proceedings of the 13th International Conference on Extending Database Technology (EDBT’10)
Pagination: 537-548
Date Published: 03/2010
Publisher: ACM Press
Conference Location: Lausanne, Switzerland
ISBN Number: 9781605589459
Abstract

Users who are unfamiliar with database query languages can search XML data sets using keyword queries. Current approaches for supporting such queries are either for text-centric XML, where the structure is very simple and long text fields predominate; or data-centric, where the structure is very rich. However, long text fields are becoming more common in data-centric XML, and existing approaches deliver relatively poor precision, recall, and ranking for such data sets. In this paper, we introduce an XML keyword search method that provides high precision, recall, and ranking quality for data-centric XML, even when long text fields are present. Our approach is based on a new group of structural relationships called normalized term presence correlation (NTPC). In a one-time setup phase, we compute the NTPCs for a representative DB instance, then use this information to rank candidate answers for all subsequent queries, based on each answer's structure. Our experiments with 65 user-supplied queries over two real-world XML data sets show that NTPC-based ranking is always as effective as the best previously available XML keyword search method for data-centric data sets, and provides better precision, recall, and ranking than previous approaches when long text fields are present. As the straightforward approach for computing NTPCs is too slow, we also present algorithms to compute NTPCs efficiently.

DOI: 10.1145/1739041.1739106