Abstract:
RDF data management has received a lot of attention in the past decade due to the
widespread growth of Semantic Web and Linked Open Data initiatives. RDF data is expressed in the form of triples (as Subject - Predicate - Object), with SPARQL used for
querying it. Many novel database systems such as RDF-3X, TripleBit, etc., which store RDF
in its native form or within traditional relational storage, have demonstrated their ability
to scale to large volumes of RDF content. However, it is increasingly becoming obvious
from the knowledge representation applications of RDF that it is equally important to integrate additional information with RDF triples, such as source, time and place of occurrence, uncertainty, etc. Consider an RDF fact (BarackObama, isPresidentOf, UnitedStates). While this fact is useful for finding information about the president of the United States, it does not provide sufficient information for answering many challenging questions, such as "What is the temporal validity of this fact?" or "Where did this fact come from?"
Annotations like confidence, geolocation, time, etc. can be modeled in RDF through a technique called reification, which is also a W3C recommendation. Reification retains the triple nature of RDF and associates annotations using blank nodes.
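As an illustration only, the following minimal Python sketch shows how the example fact could be reified with the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The helper name `reify`, the blank node label `_:r1`, the annotation properties `validFrom` and `confidence`, and the confidence value are assumptions made for this example and are not taken from the thesis.

```python
# Namespace of the standard RDF vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(triple, node, annotations):
    """Return the plain triples that describe `triple` through the node `node`.

    `annotations` maps (hypothetical) property names to values; each entry
    becomes one extra triple whose subject is the statement node.
    """
    s, p, o = triple
    triples = [
        (node, RDF + "type", RDF + "Statement"),
        (node, RDF + "subject", s),
        (node, RDF + "predicate", p),
        (node, RDF + "object", o),
    ]
    triples.extend((node, key, value) for key, value in annotations.items())
    return triples


fact = ("BarackObama", "isPresidentOf", "UnitedStates")
# The property names and the confidence value are illustrative assumptions.
reified = reify(fact, "_:r1", {"validFrom": "2009-01-20", "confidence": 0.95})

for t in reified:
    print(t)
```

Everything stays a triple: four triples describe the original fact, and each annotation adds one more triple attached to the same blank node.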
The focus of this thesis is on the database aspects of storing and querying RDF graphs that contain annotations, such as confidence, on RDF triples. We start by developing an RDF database, named RQ-RDF-3X, for efficiently querying RDF graphs containing annotations over native RDF triples. Next, we noticed that more than 62% of the facts in real-world RDF datasets like YAGO, DBpedia, etc. have numerical object values, which suggests the use of queries that add an ORDER BY clause to traditional graph pattern queries of SPARQL. State-of-the-art RDF processing systems such as Virtuoso, Jena, etc. handle such queries by first collecting all the results and then sorting them in memory based on the user-specified function, which limits their scalability. In order to efficiently retrieve the results of top-k queries, i.e., queries returning the top-k results ordered by a user-defined scoring function, we developed a top-k query processing database named Quark-X. In Quark-X, we propose indexing and query processing techniques for making top-k querying efficient.
Motivated by the importance of geo-spatial data in critical applications such as emergency response, transportation, agriculture, etc., and by its widespread use in knowledge bases such as YAGO, WikiData, LinkedGeoData, etc., we developed STREAK, an RDF data management system designed to support a wide range of queries with spatial filters, including complex joins and top-k queries, over spatially enriched databases. While developing STREAK, we realized that to make effective use of this rich data, it is crucial to efficiently evaluate queries combining topological and spatial operators (e.g., overlap, distance, etc.) with traditional graph pattern queries of SPARQL. While there have been research efforts on efficient processing of spatial data in RDF/SPARQL, very little effort has gone into building systems that can handle both complex SPARQL queries and spatial filters.
We describe the novel contributions of each of these engines below.
RQ-RDF-3X: RQ-RDF-3X presents extensions to triple-store style RDF storage engines to support reification and quads. In RQ-RDF-3X, we support triple annotations by assigning a unique identifier (R) to each (S, P, O) triple. Thus, the fundamental change required is to support an additional field (R) that holds the triple identifier. The inclusion of this additional field requires the query optimizer of the triple store being extended to be aware of the unique characteristics of the triple identifier (R). Additionally, it requires a careful rethinking of the indexing and query optimization approaches adopted by state-of-the-art triple stores. To achieve fast performance, we propose an efficient set of indices that enables RQ-RDF-3X to reduce query processing time by making use of merge joins. The set of indices is stored compactly using an efficient compression scheme. We demonstrate experimentally that RQ-RDF-3X achieves one to two orders of magnitude speed-up over both commercial and academic engines such as Virtuoso, RDF-3X, and Jena-TDB on the real-world datasets YAGO and DBpedia.
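To make the role of the triple identifier concrete, here is a minimal Python sketch of a merge join between facts stored as (R, S, P, O) quads and annotation triples whose subject is R. It assumes both inputs are already sorted on R, as a sorted index would provide. The function `merge_join`, the sample facts, and the annotation values are hypothetical; this is a generic sort-merge join and not the actual index layout, compression scheme, or optimizer of RQ-RDF-3X.

```python
def merge_join(left, right, key_left, key_right):
    """Join two lists that are already sorted on their join keys.

    Both inputs are only scanned forward, which is what makes merge joins
    attractive when sorted indices are available.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key_left(left[i]), key_right(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and key_right(right[j_end]) == kl:
                j_end += 1
            # Pair every left row with this key with the whole right run.
            while i < len(left) and key_left(left[i]) == kl:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


# Hypothetical quads (R, S, P, O), sorted on the triple identifier R.
quads = [
    (1, "BarackObama", "isPresidentOf", "UnitedStates"),
    (2, "BarackObama", "bornIn", "Honolulu"),
]
# Hypothetical annotations (R, property, value), also sorted on R.
annotations = [
    (1, "confidence", 0.9),
    (1, "source", "exampleSource"),
    (2, "confidence", 0.8),
]

joined = merge_join(quads, annotations, lambda q: q[0], lambda a: a[0])
for quad, annotation in joined:
    print(quad[1:], annotation[1:])
```

Because R is unique per fact, each annotation matches exactly one quad, so the join output has one row per annotation.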
Quark-X: Quark-X is an efficient top-k query processing framework for RDF quad stores. The contributions of Quark-X include novel in-memory synopsis indexes for predicates describing numerical objects. This is in the same spirit as building impact-layered indexes in information retrieval, but carefully redesigned for ranking in reified RDF. Additionally, Quark-X proposes a novel Rank-Hash Join (RHJ) algorithm designed to utilize the synopsis indexes by selectively performing range scans for facts containing numerical objects early on. This is crucial to the overall performance of SPARQL queries that involve a large number of joins. We show experimentally that Quark-X achieves one to two orders of magnitude speed-up over the baseline databases Virtuoso, Jena-TDB, SPARQLRANK and RDF-3X on the YAGO and DBpedia datasets.
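The difference between sorting all results and terminating early can be sketched in a few lines of Python. The example below assumes a list of facts with numerical objects that is already ordered by descending value (a stand-in for a score-ordered index) and a `matches` predicate standing in for the rest of the graph pattern. All names and data are hypothetical, and this simple scan is not the Rank-Hash Join or the synopsis index of Quark-X.

```python
from itertools import islice


def top_k_sort_all(facts, matches, k):
    """Baseline: collect every matching fact, sort in memory, then cut to k."""
    results = [f for f in facts if matches(f)]
    results.sort(key=lambda f: f[2], reverse=True)
    return results[:k]


def top_k_early_stop(facts_by_value_desc, matches, k):
    """Scan a value-ordered list and stop as soon as k matches are found."""
    return list(islice((f for f in facts_by_value_desc if matches(f)), k))


def counting(iterable, seen):
    """Yield items unchanged while recording which ones were actually read."""
    for item in iterable:
        seen.append(item)
        yield item


# Hypothetical facts (entity, predicate, numerical object).
facts = [
    ("e1", "hasPopulation", 500),
    ("e2", "hasPopulation", 900),
    ("e3", "hasPopulation", 100),
    ("e4", "hasPopulation", 700),
    ("e5", "hasPopulation", 300),
]
# Stand-in for an index ordered by descending numerical value.
index = sorted(facts, key=lambda f: f[2], reverse=True)

# Stand-in for the remaining graph pattern: only these entities qualify.
allowed = {"e1", "e4", "e5"}
matches = lambda f: f[0] in allowed

baseline = top_k_sort_all(facts, matches, 2)
scanned = []
early = top_k_early_stop(counting(index, scanned), matches, 2)

print(baseline == early, len(scanned), len(facts))
```

Both functions return the same two facts, but the early-stopping variant reads only a prefix of the ordered list instead of materializing and sorting every match.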
STREAK: STREAK is an efficient engine for processing top-k SPARQL queries with spatial filters. Spatial filters are used to evaluate distance relationships between entities in SPARQL queries. STREAK introduces various novel features, such as a careful identifier encoding strategy for spatial and non-spatial entities that reduces storage cost and enables early pruning, the use of a semantics-aware Quad-tree index that allows for early termination, and a clever use of adaptive query processing with zero plan-switch cost. For experimental evaluations, we focus on top-k distance join queries and demonstrate that STREAK outperforms popular spatial join algorithms as well as state-of-the-art commercial systems such as Virtuoso.