Abstract:
Inverted files, equivalent to database indices, are used to speed up searching of both
Hyper Text Markup Language (HTML) and eXtensible Markup Language (XML) files on
the web. Searching XML files differs from searching HTML files in two ways: inverted files
for XML need to be compressed because of their large size, and query evaluation against
XML files requires keyword searching in both the structure and the values. XML
queries are often composed of multiple keywords with logical relations. XML queries
with conjunction, disjunction, ancestor-descendant, and preceding-following relations
among the multiple keywords have already been evaluated successfully. Multiple
keywords often appear in XML queries as a phrase. Phrase queries over a single XML
document have already been evaluated. However, no method exists to evaluate phrase queries
over a collection of XML documents, whether large or small. Additionally, a special type
of query, in which given keywords or phrases must not be present in the evaluated XML
documents, is also required in many applications. As far as our study could determine, no
method exists to evaluate these NOT queries either. XML document retrieval will not be complete
without evaluating these two important types of queries. New solutions are required to
process both phrase and NOT queries efficiently. In this thesis, we introduce methods
to evaluate both phrase and NOT queries, proposing the necessary changes to the inverted file
structure and the query processing algorithms. We use a pull parser to parse the XML
documents. We have developed a prototype query processor that is capable of creating
inverted files and evaluating all types of queries, including phrase and NOT queries. Our
experimental results using this prototype query processor show the effectiveness of our
proposed query evaluation methods.
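
To make the two query types concrete, the following is a minimal sketch of a positional inverted file over a small document collection, with a phrase query and a NOT query evaluated against it. It is an illustration under stated assumptions, not the inverted file structure or the algorithms proposed in this thesis: the function names (build_index, phrase_query, not_query) and the sample file names are hypothetical, each document is reduced to its plain text content, and XML structure and compression are ignored.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [word positions]} (hypothetical layout)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return ids of documents that contain the terms of `phrase` consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        # The phrase matches if term i occurs at position start + i.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def not_query(index, all_doc_ids, phrase):
    """Return ids of documents in which `phrase` does not occur."""
    return set(all_doc_ids) - phrase_query(index, phrase)

# Toy collection: text content only, element structure omitted (assumption).
docs = {
    "a.xml": "inverted file compression",
    "b.xml": "file inverted index",
    "c.xml": "query processing",
}
index = build_index(docs)
print(phrase_query(index, "inverted file"))
print(not_query(index, docs, "inverted file"))
```

In this sketch the NOT query is computed as the complement of the phrase result with respect to the whole collection, which is the simplest possible strategy and requires knowing every document identifier in advance.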