Steve Harris (theno23) wrote,
Steve Harris

RIP the worlds most complex SPARQL engine

Geek content follows.

So, I've been looking at the BSBM (SPARQL benchmark) qualification set and realised that there was a fundamental flaw in the join optimiser in 4store.

It was set up to try and minimise the number of intermediate binding tables, and joins. It worked very hard at it, and had no fallback to the naive execution of the SPARQL algebra. (mistake no. 1).

Consequently it worked far too hard at applying the optimisation, putting far too much effort into it, and there are some joins that it just wasn't capable of computing correctly. It was also mindbendingly complex. You can see the code with this over-optimisation in if you look at or earlier, particularly frontend/query.c and frontend/query-datatypes.c, but it's not for the feint of heart.

So that I don't forget how it worked, I'm going to briefly describe it below, I'm working on a paper describing the 4store optimiser, but it won't include this stuff for obvious reasons.

Imagine a query like

Where BnQm is quad m in block n. Sorted so that the most specific quad comes first.

Given that, the joins go

σ(t1 and t2) (B0Q0 [X] B0Q1 [X] B0Q2 =X] (B1Q0 [X](s1) D1Q0) =X] (B1Q0 [X](s2) D1Q0))

Where DnQm is an integer that used to identify the "source" of that value in the column, and sn and tn are expressions on the various columns that enforces the SPARQL algebra.

For the above case it's not too complex, but when you start throwing nested OPTIONALs and UNIONs into the mix it rapidly becomes extremely complex.

the advantage is this is that you never need intermediary binding tables, you're just always joining a new triple or (set of triples in some cases) onto the existing master table.

The two really tricky parts are tagging every value in a UNION branch with the correct DnQm value, and selecting that correctly in tn. The thing which made me finally give up was when faced with a triple in a OPTIONAL block that introduced no new variables. It would have been possible to rewrite it as a triple with a variable and a constraint, but at some point you have to decide that enough is enough.

The new join optimiser is much simpler, and much more correct (debugging the above in tricky cases is frighteningly complex), but I may re introduce this for some of the simpler cases, especially where the DnQm joins aren't required, which is the purpose of this post.

I'm going to dive back into tidying up the actually-SPARQL join optimiser now. Generally it will be a bit slower (but only noticeably on very large result sets), and UNIONs may actually be faster.

If you've read this far, I admire your perseverance :)

- Steve
Tags: 4store, garlik, rdf, sparql
  • Post a new comment


    Comments allowed for friends only

    Anonymous comments are disabled in this journal

    default userpic