Distributed SPARQL Throughput Increase: On the effectiveness of Workload-driven RDF partitioning

Authors:

Cosmin Basca and Abraham Bernstein

Abstract:

The Web of Data (WoD) continues to grow steadily each year. At over 31 billion triples in 2011, this globally distributed data space poses several scalability challenges for querying. One critical factor when processing distributed SPARQL queries is the number and type of distributed joins required. Traditionally, query optimizers alleviate this issue by searching for an optimal query plan under a given, fixed data distribution. Discarding this fixed-partitioning assumption offers the opportunity to create a data distribution that minimizes the number of distributed joins. Recent research has focused on data- and query-driven partitioning strategies for both RDF and relational data. In this paper we propose a novel, naive workload-driven approach to data partitioning and investigate the impact of various critical factors on the number of resulting distributed joins. In a preliminary experiment we empirically compare our method to traditional partitioning strategies using a DBpedia query log of 400,000 queries and observe that it can produce up to 50% fewer distributed joins than an expert (manual) partitioning scheme, 45% fewer than partitioning based on hashing by subject, and up to 83% fewer than random assignment.
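
To make the notion of counting distributed joins concrete, the following is a minimal Python sketch and not the authors' method. It assumes a simplified model in which each query is reduced to the set of predicates it joins, every predicate lives on exactly one node, and a join counts as distributed when its two predicates sit on different nodes. The toy workload, the function names, the greedy co-location heuristic, and the per-node capacity are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations
import zlib

# Hypothetical toy workload: each query is the set of predicates whose
# triple patterns are joined together in that query.
WORKLOAD = [
    {"foaf:name", "foaf:knows"},
    {"foaf:name", "foaf:knows"},
    {"dbo:birthPlace", "dbo:country"},
    {"foaf:name", "dbo:birthPlace"},
]


def distributed_joins(workload, assignment):
    """Count joined predicate pairs whose predicates sit on different nodes."""
    return sum(
        1
        for query in workload
        for a, b in combinations(sorted(query), 2)
        if assignment[a] != assignment[b]
    )


def hash_assignment(predicates, nodes):
    """Workload-agnostic baseline: place each predicate by a stable hash."""
    return {p: zlib.crc32(p.encode()) % nodes for p in predicates}


def workload_assignment(workload, nodes, capacity):
    """Illustrative greedy heuristic: co-locate the most frequently joined
    predicate pairs first, keeping at most `capacity` predicates per node
    while room remains."""
    pair_counts = Counter(
        pair for query in workload for pair in combinations(sorted(query), 2)
    )
    assignment, load = {}, [0] * nodes

    def place(pred, preferred):
        # Use the partner's node if it has room, else the least loaded node
        # (which may exceed `capacity` if every node is already full).
        if preferred is not None and load[preferred] < capacity:
            node = preferred
        else:
            node = min(range(nodes), key=load.__getitem__)
        assignment[pred] = node
        load[node] += 1

    for (a, b), _count in pair_counts.most_common():
        for pred, partner in ((a, b), (b, a)):
            if pred not in assignment:
                place(pred, assignment.get(partner))

    # Predicates that never appear in a join still need a home.
    for query in workload:
        for pred in sorted(query):
            if pred not in assignment:
                place(pred, None)
    return assignment
```

Calling `distributed_joins(WORKLOAD, workload_assignment(WORKLOAD, nodes=2, capacity=2))` on this toy workload leaves a single distributed join, the one between `foaf:name` and `dbo:birthPlace`. The same counter can be applied to `hash_assignment` to compare against a workload-agnostic baseline. These toy numbers say nothing about the percentages reported in the abstract.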

Voting ID:

P11

Paper Download: