Introducing BARQ
Stardog is an enterprise knowledge graph platform. Stardog supports query answering via the standardized SPARQL query language; GraphQL through a mapping to SPARQL; SQL through the BI/SQL endpoint; and, more recently, natural language queries via Voicebox.
This blog post describes our initial efforts to improve performance of Stardog’s SPARQL query engine for new use cases. Stardog BARQ—Batch-Based Accelerated Query Engine—is our newest query executor. It improves performance of some commonly used algorithms for processing graph data, such as the merge join, by two orders of magnitude compared to the standard, row-based execution engine in Stardog 10, and can speed up some queries by 90x or more!
What’s the issue?
Stardog’s query engine has been designed around a sophisticated query optimizer, which uses advanced selectivity statistics to minimize disk IO over sorted RDF indexes. This works well for queries with selective patterns, enabling high throughput on OLTP workloads composed of a large number of queries, each of which processes only a small part of the source graph. Some popular SPARQL benchmarks, such as BSBM Explore, contain only such queries.
However, this approach hits its limits on so-called analytical, or CPU-bound, queries that process many intermediate results in memory. CPU-bound queries are increasingly common these days: they naturally occur when analyzing densely connected graphs, or simply when a query has no selective condition and has to traverse millions of edges. For these kinds of queries, performance can be less satisfying than our customers expect.
Last year we started thinking about how to improve performance for these OLAP-style workloads. We have made considerable progress in that direction, and Stardog now performs significantly better on industry-standard benchmarks such as LDBC's LSQB and BSBM Business Intelligence. We are able to improve performance by up to 5x for join-heavy queries in the LSQB benchmark.
Some Query Engine Basics
The query engine’s job is to generate the set of results for a SPARQL query according to the SPARQL specification. To give you some context, below is a very brief outline of how the query engine in Stardog 10 works.
Since its inception, Stardog's query engine has been based on two principles:
- Pull-based evaluation (aka the "Volcano model" or the "Iterator model"). Each operator in the execution plan pulls data from its argument (child) operators, as sketched after this list.
- Tuple-at-a-time evaluation (aka row-based). Each operator processes just enough argument tuples to produce a single output tuple.
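To make this concrete, here is a minimal sketch of what a pull-based, tuple-at-a-time operator interface might look like. The names (Operator, Solution, FilterOperator) are illustrative assumptions, not Stardog's actual internal API.

```java
// Hypothetical sketch of a Volcano-style, tuple-at-a-time operator interface.
interface Solution {
    long get(int varIndex); // value bound to a variable in this solution row
}

interface Operator {
    // Returns the next single solution, or null when the input is exhausted.
    // Each call typically triggers further next() calls on the child operators,
    // so every produced row pays the cost of several virtual method calls.
    Solution next();

    void close();
}

// A filter operator pulls one row at a time from its child and either passes it on or skips it.
final class FilterOperator implements Operator {
    private final Operator child;
    private final java.util.function.Predicate<Solution> condition;

    FilterOperator(Operator child, java.util.function.Predicate<Solution> condition) {
        this.child = child;
        this.condition = condition;
    }

    @Override
    public Solution next() {
        Solution s;
        while ((s = child.next()) != null) {
            if (condition.test(s)) {
                return s; // produce exactly one output row per call
            }
        }
        return null;
    }

    @Override
    public void close() {
        child.close();
    }
}
```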
The tuple-at-a-time model excels at executing specific, selective, OLTP-style queries, where the key to runtime performance is avoiding unnecessary disk and network IO. For example, what we call “needle-in-haystack” queries in the Voicebox context.
In particular, this model guarantees that, assuming the optimal query plan, the storage layer won't need to read data from disk that is not needed for final results. However, operating over single tuples becomes a computational bottleneck for less selective analytical queries, which have to join or aggregate a large number of intermediate results in RAM.
How Does Stardog Evaluate Queries Now?
The engine processes a query roughly through the following pipeline:
- Parsing the query into an abstract syntax tree (AST) and turning it into an (unoptimized) query plan.
- The cost-based optimizer executes a set of complex rules to find the best version of the query plan. The central part of the optimization process is determining the best order of joins.
- The translator turns the plan into a tree of executable operators (i.e., scans, joins, filters, etc.).
- The operators are executed by recursively pulling rows from the root operator to produce the final results.
The BARQ project focuses on changing the last step of this pipeline: query execution. Let's consider how the following SPARQL query would be executed in the current model.
The query is supposed to calculate each employee’s yearly bonus, based on their age, if they are over 30.
SELECT ?name ((?age - 30) * 50 AS ?bonus) {
?person a :People;
:age ?age;
:name ?name.
FILTER (?age > 30)
}
The query would be translated into the following executable operator tree. Note how the expression (?age - 30) * 50 is evaluated by the Bind operator. In the diagrams below, the arrows indicate calls made to pull data. Data flows from the bottom to the top of the diagram.
In Stardog 10.1 each operator performs exactly one logical operation (index scans read triples from indexes in the storage layer, filters evaluate SPARQL expressions, and so on) and produces one tuple at a time. The internal API is also designed to pass one tuple at a time from each operator to its parent operator.
This is inefficient when the number of solutions becomes very large. The interface requires the CPU to resolve the actual method via a virtual function table for every tuple. Moreover, the data is scattered in memory, which is a poor fit for the tiered memory caches of modern CPUs.
What is a Vectorized Query Engine?
Modern analytical databases (OLAP databases) are designed to answer complex and compute-heavy queries with large amounts of intermediate results. For these analytical queries the traditional tuple-at-a-time model offers only limited throughput.
OLAP databases tend to work on data in a columnar format. This means we pivot the above row-based format into one where each attribute is stored in contiguous columns. This dense storage is better suited for modern CPUs with deep instruction pipelines and tiered cache architectures. We can design operators that work directly on a single column, and such operators can be optimized much more effectively by modern compilers.
The key insight is that CPUs are very fast when doing simple tasks on data that can be accessed quickly. The BARQ execution model supports this by replacing all existing row-based Stardog operators with versions that work on batched solutions with columnar storage.
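As an illustration of why this matters, here is a hedged sketch of the bonus expression from the earlier query, (?age - 30) * 50, evaluated column-wise over a batch instead of row by row. The array layout and batch contents are illustrative assumptions, not Stardog internals.

```java
// Hypothetical column-wise evaluation of (?age - 30) * 50 over a batch of solutions.
// The ages of all solutions in a batch sit next to each other in one dense array,
// so the loop is a simple, branch-free pass that compilers can unroll and auto-vectorize.
final class BonusKernel {
    static void computeBonus(long[] ageColumn, long[] bonusColumn, int batchSize) {
        for (int i = 0; i < batchSize; i++) {
            bonusColumn[i] = (ageColumn[i] - 30) * 50;
        }
    }

    public static void main(String[] args) {
        long[] ages = {31, 45, 52, 38};       // one column of an illustrative batch
        long[] bonus = new long[ages.length]; // output column
        computeBonus(ages, bonus, ages.length);
        System.out.println(java.util.Arrays.toString(bonus)); // [50, 750, 1100, 400]
    }
}
```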
BARQ is inspired by systems like MonetDB/X100 (later commercialized as VectorWise, now part of Actian) and, more recently, Velox, in which executable operators consume and produce a batch of tuples at a time. This enables much higher throughput on CPU-bound query workloads.
The Road to BARQ
Let's consider the previous query now with a vectorized query engine. The operator tree has gained some new specialized operators to execute the bind expression. Every step in the expression is executed by its own specialized operator.
The API between the operators has also changed to account for solution batches. Each operator now pulls a SolutionBatch from its child operators. The SolutionBatch has a columnar layout, where each solution row is stored across several columns.
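Here is a minimal sketch of what such a columnar batch interface could look like. SolutionBatch is the name used above, but the methods, the fixed batch size, and the batched operator interface are assumptions for illustration, not Stardog's actual API.

```java
// Hypothetical columnar batch passed between vectorized operators.
// Each variable of the query gets its own dense column; a "row" is just
// the i-th entry of every column.
final class SolutionBatch {
    static final int MAX_BATCH_SIZE = 1024; // illustrative batch size

    private final long[][] columns; // columns[varIndex][rowIndex]
    private int rowCount;

    SolutionBatch(int numVars) {
        this.columns = new long[numVars][MAX_BATCH_SIZE];
    }

    long[] column(int varIndex) { return columns[varIndex]; }

    int rowCount() { return rowCount; }

    void setRowCount(int rowCount) { this.rowCount = rowCount; }
}

// Batched operators pull a whole batch per call instead of a single tuple,
// amortizing the per-call overhead over up to MAX_BATCH_SIZE rows.
interface BatchedOperator {
    SolutionBatch nextBatch(); // null when the input is exhausted
}
```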
Vectorized Algorithms
Let's take a look at the merge join algorithm in Stardog. The merge join is one of the fundamental algorithms Stardog uses to join the results of two SPARQL operators that are ordered by the join key (which is often the case for Basic Graph Patterns). The merge join produces an ordered multiset of solutions, matching every pair of inputs where the join key is the same. Let's consider the following simple plan; it shows a merge join over two scans that correspond to triple patterns in SPARQL:
`─ MergeJoin(?joinKey) [#2.2M]
+─ Scan[PSOC](?joinKey,:p1, ?var1) [#6.0M]
`─ Scan[PSOC](?joinKey,:p2, ?var2) [#2.2M]
Essentially, the merge join algorithm performs a linear scan over both inputs. Every time it finds a matching pair, it emits a row into the output. The merge join becomes very efficient once we realize that we can skip over input rows that cannot have a match: we can “skip ahead” on the input whose current join key value is smaller.
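For reference, here is a hedged sketch of that classic row-at-a-time merge join loop over two sorted inputs. The arrays and the emit callback are illustrative stand-ins, not Stardog's operators, and the step-by-step advance stands in for what would be an index skip-ahead in practice.

```java
// Hypothetical row-at-a-time merge join over two inputs sorted by the join key.
// left[i] and right[j] are join key values; duplicates form "groups" whose
// cross product is emitted one row at a time.
final class RowMergeJoin {
    static void join(long[] left, long[] right, java.util.function.BiConsumer<Integer, Integer> emit) {
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;              // advance the side with the smaller key ("skip ahead")
            } else if (left[i] > right[j]) {
                j++;
            } else {
                long key = left[i];
                int jStart = j;
                // emit the cross product of the two equal-key runs, row by row
                while (i < left.length && left[i] == key) {
                    for (j = jStart; j < right.length && right[j] == key; j++) {
                        emit.accept(i, j);
                    }
                    i++;
                }
            }
        }
    }

    public static void main(String[] args) {
        long[] left = {1, 2, 2, 2, 5};
        long[] right = {2, 2, 3, 5};
        join(left, right, (i, j) -> System.out.println(i + " x " + j)); // 6 pairs for key 2, 1 pair for key 5
    }
}
```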
This is, in theory, a performant algorithm, but due to the nature of the row-wise execution model it can be slower than we want. In a vectorized execution we want to decompose the algorithm and work directly on a batch of input rows (remember, in the old model we only see one row at a time) instead of in a row-wise fashion.
To do this we decompose the classic merge join into two phases: Probe and Build.
- Probe: Determine all the input groups that need to be materialized to the output.
    - A group is defined as a pair of ranges, one ordinal range for each input operator, such that the join key is the same in both ranges.
- Build: Take these groups and materialize them a column at a time.
    - The key point is that to perform the row materialization we only need to know the group lengths; we do not need to access any other information.
    - Each left input value in a column is expanded (repeated consecutively) as many times as the length of the right range.
    - The right column values in a range are repeated as many times as the length of the left range.
    - This turns the per-group cross product into a column-based operation, since we never need to look at more than one column at a time.
The graphic should illustrate what happens here. We see how, in the final cross phase, the group for the join key value “2” is built: each row of the left side (just “?var1” here) is expanded three times, and the whole range of the right side (just “?var2” here) is repeated twice.
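Below is a hedged sketch of the build phase for one such group, working one column at a time over the illustrative array-based layout used earlier; the names, layout, and sample values are assumptions, not Stardog's implementation.

```java
// Hypothetical column-at-a-time build phase of the vectorized merge join.
// A "group" is a pair of equal-key ranges: left rows [leftStart, leftStart+leftLen)
// and right rows [rightStart, rightStart+rightLen). Only the group lengths are needed
// to materialize the cross product; each output column is filled independently.
final class VectorizedBuild {
    // Expand one left column: every left value is repeated rightLen times in a row.
    static void expandLeftColumn(long[] leftCol, int leftStart, int leftLen,
                                 int rightLen, long[] outCol, int outStart) {
        int out = outStart;
        for (int i = 0; i < leftLen; i++) {
            long value = leftCol[leftStart + i];
            for (int k = 0; k < rightLen; k++) {
                outCol[out++] = value;
            }
        }
    }

    // Repeat one right column: the whole right range is copied leftLen times.
    static void repeatRightColumn(long[] rightCol, int rightStart, int rightLen,
                                  int leftLen, long[] outCol, int outStart) {
        int out = outStart;
        for (int i = 0; i < leftLen; i++) {
            System.arraycopy(rightCol, rightStart, outCol, out, rightLen);
            out += rightLen;
        }
    }

    public static void main(String[] args) {
        // The group for join key "2" from the text: 2 left rows, 3 right rows -> 6 output rows.
        long[] var1 = {10, 11};        // ?var1 values of the left range (illustrative)
        long[] var2 = {20, 21, 22};    // ?var2 values of the right range (illustrative)
        long[] outVar1 = new long[6];
        long[] outVar2 = new long[6];
        expandLeftColumn(var1, 0, 2, 3, outVar1, 0);   // 10,10,10,11,11,11
        repeatRightColumn(var2, 0, 3, 2, outVar2, 0);  // 20,21,22,20,21,22
        System.out.println(java.util.Arrays.toString(outVar1));
        System.out.println(java.util.Arrays.toString(outVar2));
    }
}
```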
Performance
As we show below, performance of the merge join alone can improve by as much as two orders of magnitude. Of course, this is merely a synthetic measurement since not all of the time in joins is spent in the cross phase. Any join will spend time fetching data from its sources; if there are no matching rows, the majority of the time is spent walking over the input. In cases where the graph is not as highly connected, the actual performance difference will be smaller.
We use the public LSQB benchmark to comparatively evaluate BARQ performance against the default execution engine in Stardog 10.1. LSQB has been specifically designed to stress join algorithms of graph databases even on small graphs. We use the 0.3 scale factor (7.5M edges) and an r4.2xlarge EC2 instance on AWS for tests.
With the BARQ merge joiner we are able to improve the runtime of query 6 by a factor of 5x and of query 9 by about 2x.
Looking at query 6 in detail, we can see where we improve the most. Below is the query; the basic pattern over the lp:Person_knows_Person predicate is a two-hop join which produces hundreds of millions of results (depending on the scale factor). This poses quite a challenge for the row-based merge join algorithm.
SELECT (COUNT(*) as ?count)
WHERE {
?person1 (lp:Person_knows_Person | ^lp:Person_knows_Person ) ?person2 .
?person2 (lp:Person_knows_Person | ^lp:Person_knows_Person ) ?person3 .
?person3 lp:Person_hasInterest_Tag ?tag .
FILTER ( ?person1 != ?person3 )
}
Since, as of now, BARQ and the row-based engine use exactly the same query plans, it’s easy to compare query profiles to understand the performance difference. Below is the profile for the row-based engine. We see that 38% of the runtime was spent in the top-most merge join MergeJoin(?person2), producing 288.1M solution rows. Another 30% was spent in the Filter block, to remove loops from the two-hop partial results.
Group(aggregates=[(COUNT(*) AS ?count)]) [#1], results: 1, wall time: 13710 ms (14.4%)
`─ Filter(?person1 != ?person3) [#46.1M], results: 285.5M, wall time: 29138 ms (30.7%)
`─ MergeJoin(?person2) [#92.2M], results: 288.1M, wall time: 36308 ms (38.2%)
+─ Union [#114K], results: 114K, wall time: 25 ms (0.0%)
│ +─ Scan[POSC](?person1, http://ldbcouncil.org/Person_knows_Person, ?person2), results: 57K, wall time: 18 ms (0.0%)
│ `─ Scan[PSOC](?person2, http://ldbcouncil.org/Person_knows_Person, ?person1), results: 57K, wall time: 20 ms (0.0%)
`─ Sort(?person2) [#2.6M], memory: {total=121M (50.0%)}, results: 121.7M (with gaps), wall time: 15432 ms (16.3%)
`─ Restriction(?person2, ?person3) [#2.6M]
`─ MergeJoin(?person3) [#2.6M], results: 2.6M, wall time: 241 ms (0.3%)
+─ Union [#114K], results: 114K, wall time: 15 ms (0.0%)
│ +─ Scan[POSC](?person2, http://ldbcouncil.org/Person_knows_Person, ?person3), results: 57K, wall time: 11 ms (0.0%)
│ `─ Scan[PSOC](?person3, http://ldbcouncil.org/Person_knows_Person, ?person2), results: 57K, wall time: 10 ms (0.0%)
`─ Scan[PSOC](?person3, http://ldbcouncil.org/Person_hasInterest_Tag, ?tag), results: 84K (with gaps), wall time: 16 ms (0.0%)
Now we can take a look at the same plan profile with BARQ enabled. We can see here that the top merge join MergeJoin(?person2) only uses 390ms compared to 36308ms, a stark 93x improvement! Similarly, the filter block reduced its runtime by about 6x. Some of this is due to the overhead of profiling in the legacy engine, but this only highlights how promising BARQ is for the future.
Group(aggregates=[(COUNT(*) AS ?count)]) [#1], results: 1, wall time: 238 ms (2.9%), batched
`─ Filter(?person1 != ?person3) [#100.9M], results: 285.5M, wall time: 5185 ms (62.6%), batched
`─ MergeJoin(?person2) [#201.8M], memory: {total=4.0M (3.8%); max=256K}, results: 288.1M, wall time: 398 ms (4.8%), batched
+─ Union [#114K], results: 114K, wall time: 3 ms (0.0%), batched
│ +─ Scan[POSC](?person1, http://ldbcouncil.org/Person_knows_Person, ?person2) [#57K], results: 57K, wall time: 9 ms (0.1%), batched
│ `─ Scan[PSOC](?person2, http://ldbcouncil.org/Person_knows_Person, ?person1) [#57K], results: 57K, wall time: 9 ms (0.1%), batched
`─ Sort(?person2) [#2.6M], memory: {total=101M (96.2%)}, results: 2.6M (with gaps), wall time: 2224 ms (26.8%), batched
`─ Restriction(?person2, ?person3) [#2.6M]
`─ MergeJoin(?person3) [#2.6M], results: 2.6M, wall time: 100 ms (1.2%), batched
+─ Union [#114K], results: 114K, wall time: 4 ms (0.0%), batched
│ +─ Scan[POSC](?person2, http://ldbcouncil.org/Person_knows_Person, ?person3) [#57K], results: 57K, wall time: 10 ms (0.1%), batched
│ `─ Scan[PSOC](?person3, http://ldbcouncil.org/Person_knows_Person, ?person2) [#57K], results: 57K, wall time: 10 ms (0.1%), batched
`─ Scan[PSOC](?person3, http://ldbcouncil.org/Person_hasInterest_Tag, ?tag) [#90K], results: 87K (with gaps), wall time: 21 ms (0.3%), batched
Future Work
Currently we are still working on fully vectorizing other parts of the query engine. In the near term we will vectorize the HashJoin, MINUS (aka anti-join), and DISTINCT operators. That should make BARQ more generally applicable and improve performance on more queries. For example, query 9 in the LSQB benchmark is a variation of query 6 above with the following additional filter:
FILTER NOT EXISTS { ?person1 (lp:Person_knows_Person | ^lp:Person_knows_Person ) ?person3 }
The condition filters out loops of length 3 and is automatically rewritten by the Stardog query optimizer as MINUS over a materialized hash table. That is the current computational bottleneck since it performs 285M row lookups in the table. Once the lookups are vectorized, query performance will improve far beyond the current factor of 2x. The same applies to the hash join, which is a more generally applicable join algorithm since it does not require the inputs to be sorted by the join key.
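To give a flavor of what vectorizing those lookups might look like, here is a small, purely illustrative sketch: instead of probing the hash table once per row inside an operator call, a whole key column of a batch is probed in one tight loop, producing a selection vector of surviving rows. The classes and method names are assumptions, not Stardog's planned implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of batched (vectorized) hash probing for an anti-join (MINUS):
// the whole key column of a batch is checked against the materialized hash table in one loop,
// producing a selection vector of the rows that survive the filter.
final class BatchedAntiJoinProbe {
    static int probe(Set<Long> hashTable, long[] keyColumn, int rowCount, int[] selection) {
        int selected = 0;
        for (int i = 0; i < rowCount; i++) {
            if (!hashTable.contains(keyColumn[i])) { // keep rows with no match on the right side
                selection[selected++] = i;
            }
        }
        return selected; // number of surviving rows; selection[0..selected) are their indices
    }

    public static void main(String[] args) {
        Set<Long> right = new HashSet<>(Set.of(2L, 5L));
        long[] keys = {1, 2, 3, 5, 7};
        int[] sel = new int[keys.length];
        int n = probe(right, keys, keys.length, sel);
        System.out.println(n + " rows survive"); // 3 rows survive (keys 1, 3, 7)
    }
}
```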
Enabling the BARQ Engine
BARQ is fully integrated into Stardog’s query engine pipeline. Once enabled, the query translator will pick the best execution model for each operator in the query plan given the expected number of intermediate results. It will prefer BARQ for supported operators that process many results and may use the row-based operators for very selective, or IO-bound, parts of the query plan. The execution models can be mixed and matched during the execution of a single query with very little integration cost.
You can learn more about BARQ in the Stardog docs.