Improving VG Perf with Multiway Service Joins

Canonical URL

Do not index

Data is naturally distributed and knowledge even more so. Stardog query federation—i.e., Virtual Graph capability—is one way we respond to the natural distribution of data in the enterprise. Efficient query federation requires reducing the number of requests to and, hence, data transferred from, upstream data sources when querying (federations of) Virtual Graphs.

One key approach to achieve this is leveraging Virtual Graph mappings during query planning to determine query plans that avoid contacting data sources that do not contribute to the final results of the query. This approach works well as long as there are sufficient constraints to rule out non-contributing data sources during planning. However, in practice, this is not always the case and, as a result, it is also crucial to minimize the data transferred from non-contributing sources during query execution.

In this blog post, we discuss the latter and present the MultiWayServiceJoin operator, introduced in Stardog 10.1.

Managing Distributed Knowledge about Employees

Assume that the relevant information about employees isn’t localized in a single data source. The data exists in a Virtual Graph and a local database. In the local database, we keep data about the employees and the department they belong to. For example:

<http://example.com/emp/1> a emp:Employee .
<http://example.com/emp/1> rdfs:label "Alice" .
<http://example.com/emp/1> emp:memberOf <http://example.com/dept/2> .

The Virtual Graph provides data about departments and their meeting rooms with the following mappings.

MAPPING
FROM SQL { SELECT * FROM departments } 
TO {
  ?subject  a emp:Department ;
            rdfs:label ?dept_name .
} WHERE {
  BIND(template("http://example.com/dept/{dept_no}") AS ?subject)
};

MAPPING
FROM SQL { SELECT * from meeting_rooms }
TO {
  ?subject  a emp:MeetingRoom ;
            emp:dept ?dept ;
            rdfs:label ?room_name .
} WHERE {
  BIND(template("http://example.com/dept/{dept_no}") AS ?dept)
  BIND(template("http://example.com/room/{room_id}") AS ?subject)
}

Finding the Names of all Employees

Now, let’s assume we have a simple query to retrieve the names (i.e., labels) of all employees. We use Virtual Transparency (VT) to query both local and VG data since we may not be sure where the data that we want to query resides. The query Q looks as follows:

SELECT *
FROM stardog:context:all
{
    ?person a emp:Employee .   # A
    ?person rdfs:label ?name . # B
}

Note that the predicate rdfs:label is used both in the local data as well as in the Virtual Graph, while the class emp:Employee only occurs in the local data. That means that it is sufficient to evaluate the first triple pattern A over the local data only. The second triple pattern B must be evaluated over both the local data and the meeting_rooms table and the departments table.

This leads to the following decomposition:

As a result, a naive approach would retrieve all data for triple pattern B from the Virtual Graph and join the results with the local data locally. However, this can be costly and often we can do better based on how the virtual data is distributed. In our example, we notice that no solution for from the local data is compatible with any solution from the Virtual Graph for pattern B.

This can be determined by comparing the IRI templates in the mappings with the possible bindings for ?person in the local data. In our example, we can see that <http://example.com/emp/1> is not compatible with the two subject IRI templates template("http://example.com/room/{room_id}")and template("http://example.com/dept/{dept_no}"). This means, that the evaluation of our query could be simplified as follows:

Consequently, it is possible to evaluate the query without retrieving any data from the Virtual Graph, reducing both the query execution time and the load on the remote data source. Yet, we cannot determine this during query planning because we do not know about all the IRIs in the local data. But there is hope. We can use the new MultiWayServiceJoin operator to avoid requesting unnecessary data from remote data source.

💡

The query decomposition above is a common decomposition which can be encountered when Voicebox is enabled. This is the case when Virtual Transparency is enabled and multiple mappings or Virtual Graphs share the same predicates.

The Regular `ServiceJoin`

In Stardog, a key operator to evaluate joins over remote data sources, such as Virtual Graphs, is the ServiceJoin operator. The operator is similar to a block-based bind join operator where the right operand is a Service (e.g., a Virtual Graph or a SAPRQL service).

In a nutshell, the ServiceJoin SJ(L, S) takes fixed-size batches of solutions from its left operand L and probes these batches of solutions in the right service operand S by constraining the query of S. A key advantage of this operator is that during query execution it determines which solutions form the left operand actually need to be probed in the service.

For example, when the service is a Virtual Graph, the operator leverages the mappings to filter solutions from the left operand which are guaranteed to be not compatible with the solution of the Virtual Graph. As a result, if there are no potentially compatible solutions, the remote source does not need to be contacted at all. In our example, the local solution ?person = <http://example.com/emp/1> would not need to be probed in neither the meeting_rooms nor the departments table of the Virtual Graph.

Distributing Joins over Unions

However, in our example query decomposition, we cannot use the ServiceJoin because it requires its right-hand side to be a single Service. To overcome this, we can rewrite the query and distribute the join over the union to get the following equivalent decomposition.

Since in all joins in this rewriting we now join with at most a single Service, we can use two ServiceJoin operators to evaluate the query. This is great as it avoids requesting any data from the remote data source because we can determine during the execution that no solution from evaluating A over the local data is compatible with the solutions in the Virtual Graph.

The drawback of this rewriting, however, is the fact we have to evaluate A over the local data three times and since physical query plans in Stardog are trees, it has to be evaluated three times during query execution. Moreover, with this approach, the number of times the pattern needs to be evaluated increases linearly with the number of Service patterns in the query.

Introducing the `MultiWayServiceJoin`

In order to overcome this limitation, we introduce the MultiWayServiceJoin MWSJ(L, S1 U ... U Sn) which extends the ServiceJoin by allowing the right operand to be the union of multiple Services S1, ..., Sn. With this new operator, we can rewrite our initial query decomposition in such a way that we can leverage the MultiWayServiceJoin during execution:

In contrast to our prior rewriting, we only need to evaluate A over the local data at most two times regardless of the number of Service patterns in the original decomposition. At the same time, we get the benefit that we do not need to request solutions that are guaranteed to be incompatible from the remote data source. The resulting query plan with the MultiWayServiceJoin for the example query is:

Projection(?person, ?name) [#2]
`─ Union [#2]
   +─ MultiWayServiceJoin [#1]
   │  +─ Scan[POSC](?person, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, emp:Employee) [#1]
   │  `─ Union [#18]
   │     +─ VirtualGraphSql<virtual://example> [#10] {
   │     │  +─ RelNode=
   │     │  +─    JdbcTableScan(table=[[vgtest, public, departments]])
   │     │  +─ Query=
   │     │  +─    SELECT *
   │     │  +─    FROM "vgtest"."public"."departments"
   │     │  +─ Vars=
   │     │  +─    ?person <- TEMPLATE(http://example.com/dept/{dept_no/0})
   │     │  +─    ?name <- COLUMN($1)^^xsd:string
   │     │  } [#10]
   │     `─ VirtualGraphSql<virtual://example> [#8] {
   │        +─ RelNode=
   │        +─    LogicalProject(room_id=[$0], room_name=[$1])
   │        +─      LogicalFilter(condition=[IS NOT NULL($1)])
   │        +─        JdbcTableScan(table=[[vgtest, public, meeting_rooms]])
   │        +─ Query=
   │        +─    SELECT "room_id", "room_name"
   │        +─    FROM "vgtest"."public"."meeting_rooms"
   │        +─    WHERE "room_name" IS NOT NULL
   │        +─ Vars=
   │        +─    ?person <- TEMPLATE(http://example.com/room/{room_id/0})
   │        +─    ?name <- COLUMN($1)^^xsd:string
   │        } [#8]
   `─ MergeJoin(?person) [#1]
      +─ Scan[POSC](?person, <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, emp:Employee) [#1]
      `─ Scan[PSOC](?person, <http://www.w3.org/2000/01/rdf-schema#label>, ?name) [#1]

When executing this query plan over the example data, only the second operand of the outermost Union (MergeJoin(?person) ...) produces solutions to the query because all relevant data about employees resides in the local database. At the same time, there are no requests made to the remote data source of the Virtual Graph, reducing execution time and load on the remote data source.

Since the rewriting requires evaluating the pattern which is distributed over the join multiple times, our rewriting optimizer aims to only rewrite the query for cases where the rewriting in conjunction with the MulitWayServiceJoin is expected to be more performant.

For example, if it is very costly to evaluate the pattern that is supposed to be distributed, the optimizer might opt for not distributing it. There are query hints to disable the distribute joins rewriter and the join operator.

A Small Benchmark

Based on our federated setting of the BSBM benchmark with two VGs and a local database (see our previous blog post for details), we created a small benchmark with three queries to better understand the performance impact of the rewritings in conjunction with the MultiWayServiceJoin operator.

Mean execution time (log-scale) without (Stardog 10.0) and with the MultiWayServiceJoin (Stardog 10.1) — Mean execution time (log-scale) without (Stardog 10.0) and with the `MultiWayServiceJoin` (Stardog 10.1)

In all three queries, the query is rewritten such that the MultiWayServiceJoin is used during execution. The results clearly indicate the benefits and demonstrate the performance gains achieved by avoiding requesting unnecessary data from remote data sources.

Of course, the queries specifically showcase the strength of the new operator and it might not always yield such substantial gains or cannot be used at all. Nonetheless, we want to share these results to provide some insights into the possible gains with the new MultiWayServiceJoin as another building block to make federated query in Stardog more performant.

Wrapping Up

In this blog post, we introduce the new MultiWayServiceJoin operator developed to further improve query performance over federations of Virtual Graphs. The full potential of the new operator is reached when it is combined an optimizer that rewrites the queries during query planning. This can improve query execution substantially by reducing the number of requests to remote data sources, especially in the presence of multiple Virtual Graphs.

Improving VG Perf with Multiway Service Joins

Managing Distributed Knowledge about Employees

Finding the Names of all Employees

The Regular ServiceJoin

Distributing Joins over Unions

Introducing the MultiWayServiceJoin

A Small Benchmark

Wrapping Up

Stardog Voicebox is a fast, accurate AI Data Assistant that's 100% hallucination-free guaranteed.

Stay up-to-date with Labs innovations

The Regular `ServiceJoin`

Introducing the `MultiWayServiceJoin`