Canonical URL
Do not index
Do not index
Data is naturally distributed and knowledge even more so. Stardog query federation—i.e., Virtual Graph capability—is one way we respond to the natural distribution of data in the enterprise. Efficient query federation requires reducing the number of requests to and, hence, data transferred from, upstream data sources when querying (federations of) Virtual Graphs.
One key approach to achieve this is leveraging Virtual Graph mappings during query planning to determine query plans that avoid contacting data sources that do not contribute to the final results of the query. This approach works well as long as there are sufficient constraints to rule out non-contributing data sources during planning. However, in practice, this is not always the case and, as a result, it is also crucial to minimize the data transferred from non-contributing sources during query execution.
In this blog post, we discuss the latter and present the
MultiWayServiceJoin
operator, introduced in Stardog 10.1. Managing Distributed Knowledge about Employees
Assume that the relevant information about employees isn’t localized in a single data source. The data exists in a Virtual Graph and a local database. In the local database, we keep data about the employees and the department they belong to. For example:
<http://example.com/emp/1> a emp:Employee .
<http://example.com/emp/1> rdfs:label "Alice" .
<http://example.com/emp/1> emp:memberOf <http://example.com/dept/2> .
The Virtual Graph provides data about departments and their meeting rooms with the following mappings.
MAPPING
FROM SQL { SELECT * FROM departments }
TO {
?subject a emp:Department ;
rdfs:label ?dept_name .
} WHERE {
BIND(template("http://example.com/dept/{dept_no}") AS ?subject)
};
MAPPING
FROM SQL { SELECT * from meeting_rooms }
TO {
?subject a emp:MeetingRoom ;
emp:dept ?dept ;
rdfs:label ?room_name .
} WHERE {
BIND(template("http://example.com/dept/{dept_no}") AS ?dept)
BIND(template("http://example.com/room/{room_id}") AS ?subject)
}
Finding the Names of all Employees
Now, let’s assume we have a simple query to retrieve the names (i.e., labels) of all employees. We use Virtual Transparency (VT) to query both local and VG data since we may not be sure where the data that we want to query resides. The query Q looks as follows:
SELECT *
FROM stardog:context:all
{
?person a emp:Employee . # A
?person rdfs:label ?name . # B
}
Note that the predicate
rdfs:label
is used both in the local data as well as in the Virtual Graph, while the class emp:Employee
only occurs in the local data. That means that it is sufficient to evaluate the first triple pattern A over the local data only. The second triple pattern B must be evaluated over both the local data and the meeting_rooms
table and the departments
table. This leads to the following decomposition:
As a result, a naive approach would retrieve all data for triple pattern B from the Virtual Graph and join the results with the local data locally. However, this can be costly and often we can do better based on how the virtual data is distributed. In our example, we notice that no solution for from the local data is compatible with any solution from the Virtual Graph for pattern B.
This can be determined by comparing the IRI templates in the mappings with the possible bindings for
?person
in the local data. In our example, we can see that <http://example.com/emp/1>
is not compatible with the two subject IRI templates template("http://example.com/room/{room_id}")
and template("http://example.com/dept/{dept_no}")
. This means, that the evaluation of our query could be simplified as follows:Consequently, it is possible to evaluate the query without retrieving any data from the Virtual Graph, reducing both the query execution time and the load on the remote data source. Yet, we cannot determine this during query planning because we do not know about all the IRIs in the local data. But there is hope. We can use the new
MultiWayServiceJoin
operator to avoid requesting unnecessary data from remote data source.The query decomposition above is a common decomposition which can be encountered when Voicebox is enabled. This is the case when Virtual Transparency is enabled and multiple mappings or Virtual Graphs share the same predicates.
The Regular ServiceJoin
In Stardog, a key operator to evaluate joins over remote data sources, such as Virtual Graphs, is the
ServiceJoin
operator. The operator is similar to a block-based bind join operator where the right operand is a Service
(e.g., a Virtual Graph or a SAPRQL service). In a nutshell, the
ServiceJoin SJ(L, S)
takes fixed-size batches of solutions from its left operand L
and probes these batches of solutions in the right service operand S
by constraining the query of S
. A key advantage of this operator is that during query execution it determines which solutions form the left operand actually need to be probed in the service. For example, when the service is a Virtual Graph, the operator leverages the mappings to filter solutions from the left operand which are guaranteed to be not compatible with the solution of the Virtual Graph. As a result, if there are no potentially compatible solutions, the remote source does not need to be contacted at all. In our example, the local solution
?person = <http://example.com/emp/1>
would not need to be probed in neither the meeting_rooms
nor the departments
table of the Virtual Graph. Distributing Joins over Unions
However, in our example query decomposition, we cannot use the
ServiceJoin
because it requires its right-hand side to be a single Service
. To overcome this, we can rewrite the query and distribute the join over the union to get the following equivalent decomposition.Since in all joins in this rewriting we now join with at most a single
Service
, we can use two ServiceJoin
operators to evaluate the query. This is great as it avoids requesting any data from the remote data source because we can determine during the execution that no solution from evaluating A over the local data is compatible with the solutions in the Virtual Graph. The drawback of this rewriting, however, is the fact we have to evaluate A over the local data three times and since physical query plans in Stardog are trees, it has to be evaluated three times during query execution. Moreover, with this approach, the number of times the pattern needs to be evaluated increases linearly with the number of
Service
patterns in the query. Introducing the MultiWayServiceJoin
In order to overcome this limitation, we introduce the
MultiWayServiceJoin MWSJ(L, S1 U ... U Sn)
which extends the ServiceJoin
by allowing the right operand to be the union of multiple Services S1, ..., Sn
. With this new operator, we can rewrite our initial query decomposition in such a way that we can leverage the MultiWayServiceJoin
during execution:In contrast to our prior rewriting, we only need to evaluate A over the local data at most two times regardless of the number of
Service
patterns in the original decomposition. At the same time, we get the benefit that we do not need to request solutions that are guaranteed to be incompatible from the remote data source. The resulting query plan with the MultiWayServiceJoin
for the example query is: Projection(?person, ?name) [#2]
`─ Union [#2]
+─ MultiWayServiceJoin [#1]
│ +─ Scan[POSC](?person, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, emp:Employee) [#1]
│ `─ Union [#18]
│ +─ VirtualGraphSql<virtual://example> [#10] {
│ │ +─ RelNode=
│ │ +─ JdbcTableScan(table=[[vgtest, public, departments]])
│ │ +─ Query=
│ │ +─ SELECT *
│ │ +─ FROM "vgtest"."public"."departments"
│ │ +─ Vars=
│ │ +─ ?person <- TEMPLATE(http://example.com/dept/{dept_no/0})
│ │ +─ ?name <- COLUMN($1)^^xsd:string
│ │ } [#10]
│ `─ VirtualGraphSql<virtual://example> [#8] {
│ +─ RelNode=
│ +─ LogicalProject(room_id=[$0], room_name=[$1])
│ +─ LogicalFilter(condition=[IS NOT NULL($1)])
│ +─ JdbcTableScan(table=[[vgtest, public, meeting_rooms]])
│ +─ Query=
│ +─ SELECT "room_id", "room_name"
│ +─ FROM "vgtest"."public"."meeting_rooms"
│ +─ WHERE "room_name" IS NOT NULL
│ +─ Vars=
│ +─ ?person <- TEMPLATE(http://example.com/room/{room_id/0})
│ +─ ?name <- COLUMN($1)^^xsd:string
│ } [#8]
`─ MergeJoin(?person) [#1]
+─ Scan[POSC](?person, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, emp:Employee) [#1]
`─ Scan[PSOC](?person, <http://www.w3.org/2000/01/rdf-schema#label>, ?name) [#1]
When executing this query plan over the example data, only the second operand of the outermost
Union
(MergeJoin(?person) ...
) produces solutions to the query because all relevant data about employees resides in the local database. At the same time, there are no requests made to the remote data source of the Virtual Graph, reducing execution time and load on the remote data source.Since the rewriting requires evaluating the pattern which is distributed over the join multiple times, our rewriting optimizer aims to only rewrite the query for cases where the rewriting in conjunction with the
MulitWayServiceJoin
is expected to be more performant. For example, if it is very costly to evaluate the pattern that is supposed to be distributed, the optimizer might opt for not distributing it. There are query hints to disable the distribute joins rewriter and the join operator.
A Small Benchmark
Based on our federated setting of the BSBM benchmark with two VGs and a local database (see our previous blog post for details), we created a small benchmark with three queries to better understand the performance impact of the rewritings in conjunction with the
MultiWayServiceJoin
operator. In all three queries, the query is rewritten such that the
MultiWayServiceJoin
is used during execution. The results clearly indicate the benefits and demonstrate the performance gains achieved by avoiding requesting unnecessary data from remote data sources.Of course, the queries specifically showcase the strength of the new operator and it might not always yield such substantial gains or cannot be used at all. Nonetheless, we want to share these results to provide some insights into the possible gains with the new
MultiWayServiceJoin
as another building block to make federated query in Stardog more performant. Wrapping Up
In this blog post, we introduce the new
MultiWayServiceJoin
operator developed to further improve query performance over federations of Virtual Graphs. The full potential of the new operator is reached when it is combined an optimizer that rewrites the queries during query planning. This can improve query execution substantially by reducing the number of requests to remote data sources, especially in the presence of multiple Virtual Graphs.