INSERT/DELETE in 4store

The "delete" branch (so named because DELETE was much harder than INSERT to implement) of 4store http://github.com/garlik/4store/tree/delete has some basic support for adding a removing triples to/from graphs.

Examples:

PREFIX ex: <http://example.com/>
INSERT {
  GRAPH ex:mygraph {
    ex:a ex:p 12 .
    ex:a ex:q 23.0 .
    ex:a ex:r "foo"@en .
  }
}

PREFIX ex: <http://example.com/>
DELETE {
  GRAPH ex:mygraph {
    ex:a ex:p 12 .
  }
}

Caveats:

It behaves like a SPARQL 1.1 INSERT DATA, so no WHERE clause is allowed (the rasqal parser doesn't allow it, and the code doesn't support it even if it did).

Don't try to add bNodes, or you will just get an error.

Using variables will cause an error at best.
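
For comparison, a pattern-based update in the draft SPARQL Update style, like the sketch below, is exactly the sort of thing this branch can't do - both the WHERE clause and the variables will be rejected:

PREFIX ex: <http://example.com/>
DELETE {
  GRAPH ex:mygraph {
    ?s ex:p ?o .
  }
}
WHERE {
  GRAPH ex:mygraph {
    ?s ex:p ?o .
  }
}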

This code is not well tested; I pushed out the branch so that people can give it a go. If it works OK and doesn't screw up stores, then we can look at merging it into the mainline code.

It fragments one of the indexes, so if you add and delete lots of triples to a graph then that index will grow quite large. It shouldn't be a huge problem, and it can be fixed retrospectively, but it's worth looking out for.

It's not well optimised, so the performance isn't all it could be. INSERTing or DELETEing large numbers of triples is especially inefficient. It wouldn't be much work to improve, but it's exactly the sort of thing I usually make a mess of.

If you try to INSERT into or DELETE from the default graph:

INSERT { <x> <y> <z> }

It will end up using a graph called <default:>.
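
So if you want those triples back via a GRAPH pattern, something along these lines should find them (an untested sketch, but the graph name is the <default:> above):

SELECT ?s ?p ?o WHERE {
  GRAPH <default:> {
    ?s ?p ?o .
  }
}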

I guess DELETE could remove from all graphs, which would parallel what SELECT does, but then it would be wildly different from INSERT, which I wasn't keen on either. INSERTing into all graphs would be a bit... odd, and potentially very annoying.

- Steve

Vestigial fulltext indexing in 4store

Many people have requested fulltext indexing in 4store; at some point we'll probably use something like c-lucene, but for now I've done a bit of work on a sort-of entailment scheme. It's similar to what we do on qdos.com.

Imagine you have the file:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<en> rdfs:label "Basking fishING Whales frogs cows Indeterminate"@en .
<ru> rdfs:label "Корова Хайнак Морфология"@ru .
<it> rdfs:label "Rane mangiano le mosche, ma non può volare"@IT-gb .
<gibberish> rdfs:label "Gibber gibber gibber"@gibberish .

For now the code latches onto rdfs:label predicates, but it will be configurable when it's released.

When matching predicates are imported, you get the following data produced:
$ 4s-query text -f text 'SELECT * WHERE { ?x ?y ?z } ORDER BY ?x DESC(?y) ?z'
?x    ?y    ?z
<file:///tmp/en>    <http://www.w3.org/2000/01/rdf-schema#label> "Basking fishING Whales frogs cows Indeterminate"@EN
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "bask"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "cow"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "fish"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "frog"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "indetermin"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "whale"
<file:///tmp/gibberish>    <http://www.w3.org/2000/01/rdf-schema#label>    "Gibber gibber gibber"@GIBBERISH
<file:///tmp/it>    <http://www.w3.org/2000/01/rdf-schema#label>    "Rane mangiano le mosche, ma non può volare"@IT-GB
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "le"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "ma"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "mang"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "mosc"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "non"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "può"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "ran"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "vol"
<file:///tmp/ru>    <http://www.w3.org/2000/01/rdf-schema#label> "Корова Хайнак Морфология"@RU
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "коров"
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "морфолог"
<file:///tmp/ru>    <http://4store.org/fulltext#stem>	         "хайнак"

This means that you can run queries like:
$ 4s-query text -f text 'SELECT * WHERE { ?x <http://4store.org/fulltext#stem> "mang", "non", "può" }'
?x
<file:///tmp/it>

I've also got code that does just word tokenisation, and I'll add metaphones when I track down a decent C implementation.

One thing I haven't worked out is how to take natural language strings in SPARQL and stem them. Metaphones are easy, cos that's a standard algorithm, but there are tonnes of stemming algorithms, and if you use a different one it might not match.
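
One client-side workaround would be to stem the query terms before building the query, assuming you can match the store's stemmer - the stems above ("bask", "indetermin", "коров") look like Snowball/Porter output, so that's what this rough Python sketch assumes (stem_query is a made-up helper, and it needs NLTK installed):

from nltk.stem.snowball import SnowballStemmer

def stem_query(words, language="english"):
    # Stem the raw words client-side, then build a #stem query using the same
    # object-list syntax as the example above.
    stemmer = SnowballStemmer(language)
    stems = ", ".join('"%s"' % stemmer.stem(w.lower()) for w in words)
    return "SELECT ?x WHERE { ?x <http://4store.org/fulltext#stem> %s }" % stems

# e.g. stem_query(["Basking", "Whales"]) gives
# SELECT ?x WHERE { ?x <http://4store.org/fulltext#stem> "bask", "whale" }
print(stem_query(["Basking", "Whales"]))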

Something like:

SELECT ?x WHERE { ?x <http://4store.org/fulltext#unstemed> "Корова Хайнак Морфология"@RU }

would work, with some query rewriting, but it's a bit cheesy. When I work out how to push this to a public branch on github, people can try it.

Left Join in LaTeX

Once again I've been frustrated by the lack of a left join symbol in LaTeX. Wasysym has a bowtie-style join, but no left join.

So, I've finally made a decent one in EPS, which matches the computer modern join symbol to an acceptable level.

http://plugin.org.uk/misc/ljoin.eps

You can save it into your TeX build directory and then define a \LJoin macro for it with:
\newcommand{\LJoin}{\includegraphics[scale=1.0]{ljoin.eps}}

The scale is correct for whatever font size LNCS uses, but you might need to tweak it for other styles.
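
Putting it together, a minimal document looks something like this (you need graphicx for \includegraphics, and I'm assuming a latex/dvips workflow since the symbol is an EPS):

\documentclass{article}
\usepackage{graphicx}   % needed for \includegraphics

% left outer join symbol, drawn from ljoin.eps
\newcommand{\LJoin}{\includegraphics[scale=1.0]{ljoin.eps}}

\begin{document}
The left outer join of $A$ and $B$ is written $A \LJoin B$.
\end{document}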

Don't ask how I made it, it was a real horror story, and involved far too many random bits of software.

PS \bowtie and \Join look sufficiently different that it's annoying if you mix them, so use the right one.

RIP the world's most complex SPARQL engine

Geek content follows.

So, I've been looking at the BSBM (SPARQL benchmark) qualification set and realised that there was a fundamental flaw in the join optimiser in 4store.

It was set up to try and minimise the number of intermediate binding tables and joins. It worked very hard at that, and had no fallback to naive execution of the SPARQL algebra (mistake no. 1).

Consequently it put far too much effort into applying the optimisation, and there are some joins that it just wasn't capable of computing correctly. It was also mindbendingly complex. You can see the code with this over-optimisation if you look at http://github.com/garlik/4store/commit/2fe39024d148175fb579d9699e90b22c6afdab84 or earlier, particularly frontend/query.c and frontend/query-datatypes.c, but it's not for the faint of heart.

So that I don't forget how it worked, I'm going to briefly describe it below. I'm working on a paper describing the 4store optimiser, but it won't include this stuff for obvious reasons.

Imagine a query like
  WHERE {
    B0Q0
    B0Q1
    B0Q2
    OPTIONAL {
      B1Q0
      B1Q2
    }
  }

Where BnQm is quad m in block n, sorted so that the most specific quad comes first.

Given that, the joins go

σ(t1 and t2) (B0Q0 [X] B0Q1 [X] B0Q2 =X] (B1Q0 [X](s1) D1Q0) =X] (B1Q2 [X](s2) D1Q2))

Where DnQm is an integer used to identify the "source" of that value in the column, and sn and tn are expressions on the various columns that enforce the SPARQL algebra.
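
To make the tagging idea concrete, here's a toy sketch (Python, purely illustrative - it's nothing like the real C code in frontend/query.c) of joining one OPTIONAL quad onto the master table with a source tag, then filtering on the tags:

def left_join(master, quad_rows, source, shared):
    # Left-outer-join quad_rows onto master on the variables in `shared`,
    # tagging every value bound by this quad with `source` (the DnQm integer).
    out = []
    for row in master:
        matched = False
        for qrow in quad_rows:
            if all(row[v][0] == qrow[v] for v in shared):
                merged = dict(row)
                for var, val in qrow.items():
                    if var not in merged:
                        merged[var] = (val, source)
                out.append(merged)
                matched = True
        if not matched:
            out.append(dict(row))  # OPTIONAL semantics: keep row, vars unbound
    return out

# master table after the required quads B0Q0..B0Q2 have been joined (tag 0)
master = [{"x": ("ex:a", 0), "y": ("ex:b", 0)}]

# bindings produced by the OPTIONAL quad B1Q0, joined with source tag 1
master = left_join(master, [{"x": "ex:a", "z": "ex:c"}], source=1, shared=["x"])

# a t-style filter can then inspect the tags to enforce the algebra, e.g. only
# accept ?z bindings that really came from block 1
result = [r for r in master if "z" not in r or r["z"][1] == 1]
print(result)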

For the above case it's not too complex, but when you start throwing nested OPTIONALs and UNIONs into the mix it rapidly becomes extremely complex.

The advantage of this is that you never need intermediary binding tables; you're always just joining a new triple (or set of triples, in some cases) onto the existing master table.

The two really tricky parts are tagging every value in a UNION branch with the correct DnQm value, and selecting that correctly in tn. The thing which finally made me give up was being faced with a triple in an OPTIONAL block that introduced no new variables. It would have been possible to rewrite it as a triple with a variable and a constraint, but at some point you have to decide that enough is enough.

The new join optimiser is much simpler, and much more correct (debugging the above in tricky cases is frighteningly complex), but I may re-introduce this for some of the simpler cases, especially where the DnQm joins aren't required, which is the purpose of this post.

I'm going to dive back into tidying up the actually-SPARQL join optimiser now. Generally it will be a bit slower (but only noticeably on very large result sets), and UNIONs may actually be faster.

If you've read this far, I admire your perseverance :)

- Steve

Mail.app, iPhone and spam

So, I have this debilitating problem where my work email inbox gets so much spam that I can't read real mail on my phone unless I've used my Mac mail client (with a client-side spam filter) in the last couple of hours. My main machine is a laptop, so while it's asleep, no spam filtering gets done.

Work's ISP runs a spam filter, but it just modifies the subject line, which means the phone still downloads it.

I found a for-money, Mac-based solution, but I don't have any always-on Macs that I wanted to run the software on, and I'm not keen on paying money to get rid of spam anyway.

But I found a solution that keeps the phone relatively free of spam and doesn't mess with Mail.app. It involves running SpamAssassin on my home server and a Python script, ISBG.

You can follow the instructions on the ISBG page to get it set up; ignore how out-of-date it is, it works fine. The key thing is getting the exact IMAP folder ID for the Mail.app Junk folder (so you don't end up with excess spam folders). I had to brush up on my IMAP skills and connect with telnet to find the exact folder ID that Mail.app expects to find spam in. The key line is:

isbg.py --imaphost mail.host.com --spaminbox "INBOX.Junk (NameOfAccount)" --imapuser username --delete --expunge --nostats

NameOfAccount is the name the account appears under when you expand the Inbox virtual folder.

I've put it in a crontab to run every 10 mins (*/10 * * * *), and it seems to be doing the job...
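
For reference, the whole crontab entry is just the isbg line above on that schedule (the path to isbg.py here is hypothetical - use wherever you installed it):

*/10 * * * * /home/me/bin/isbg.py --imaphost mail.host.com --spaminbox "INBOX.Junk (NameOfAccount)" --imapuser username --delete --expunge --nostats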

EBooks and Anathem

While on holiday I wanted to read Anathem, having been reassured by elseware that it was good.

To cut a long story short, yes, it's good. Not Snow Crash good, but still well worth reading.

Now comes the rant:

Anyway, I went to buy the book, but it turns out it was only available in hardback, and it's massive. Great if you have stout doors you need kept open, but way bigger than I wanted to take on holiday, and bigger than I wanted to carry on the train to work, or even store for that matter.

However, after a bit of googling, it seems that you can get it as an "ebook" from the publisher. Can't grep dead trees and all that, so out with the credit card.

This turns out to be a mixed blessing. Ebook readers are fairly cheap now, so I got a Sony something-or-other from Waterstones. I've had a go on a couple of other models, and the Sony had nothing particularly wrong with it - Amazon's Kindle is probably better, but it's still useless in the civilised world, and I wanted something to read in a hurry.

Well, the hardware is great - page turns could be quicker, and it could have a touch more resolution and contrast, but it's easily as good to read as the average paperback. The size, convenience and capacity make it a great device overall. Give the technology time for a couple of revisions and it has the potential to do to the paperback what the iPod is doing to CDs. No question.

However, the PC-side software is appalling. Really, beyond unacceptable. Harper-Collins use Adobe-brand DRM of some sort, and Sony only support Windows (badly). So, while the Adobe software is cross-platform, you need access to a Windows box to generate some sort of key and copy it over to the hardware - that took me about 2 hours, but it's a long story and mostly Windows' fault. Once you've done that you can use Calibre (GPL ebook mangling software) to do the rest; it's better than the Sony software, but still pretty complicated.

I've now read Anathem, a Sherlock Holmes book and a bit of PG Wodehouse - all on one charge. Getting the books on the thing in a form that you want to read is a right PITA though. PDFs are generally too small to read unless you crop the margins, which you can't do if they have DRM on them... You can convert DRM'd PDFs to text, but if there's anything funny going on with layout, diagrams or tables (hello Mr. Stephenson), it just doesn't work that well. Project Gutenberg books work great though, as a counter example.

Stanza on the iPhone is much better at all this stuff, but the screen is too small and low-res to read on for any length of time.

As is typical, it's more painful to buy and use DRM'd ebooks legitimately than it is to rip dodgy ones off from BitTorrent. Publishers need to sort that out. Even though I'd bought the ebook of Anathem from the publishers, I ended up reading a dodgy one because it was actually easier to format it in a way that was readable. Learn from the iPod/iTunes thing, morons: Don't Make It Hard For Me To Give You Money.

Overall, it's much better than lugging dead trees around, but given how good the hardware is, it ought to be completely changing the publishing world.

A new kind of scary

Drove home from Liverpool last night. The motorways were fine, and the major London roads were OK, bar the usual sociopathic bus drivers - who were still pulling out without looking or any kind of warning, even with ice on the ground. Cocks.

However, the minor road off Kennington Lane to our road was ungritted, and had a thick layer of snow over sheet ice. I got very sideways at about 15mph with almost no throttle, trying to get down it. Almost slid into a Merc, but luckily didn't instinctively go for the brakes and managed to steer out of it.

Trying to get onto the drive was equally interesting; it must have all of a 5º slope, but I was spinning the wheels trying to get traction. I eventually managed to get up it, only slightly diagonally.

There are essentially no trains running today, and the tube line that goes that way is out, so I'm working from home.
Current Music: Last Train to Trancentral, The KLF

Panasonic G1 Mini Review


[Photo: P1000028_Effra-Outlet.jpg, originally uploaded by swh]

I got a Lumix G1 the other weekend, so I'll write a brief review of the novel things.

Viewfinder: the viewfinder is surprisingly good - in good light. In poor light it shimmers with noise, it's very hard to check focus, and it's quite unpleasant to use. OTOH in daylight it's probably the easiest camera to manually focus I've ever used, easier than a Leica. In poor light the screen is still OK for composition, for some reason. The screen is articulated, reasonably bright, and quite useful. One neat thing is that you can see a B&W image in the viewfinder, which is great if you want to shoot B&W with filters.

Auto focus: the auto focus is contrast based, and noticeably slower than my D300, but very accurate, and fast enough for most things. It's never going to be a good action camera anyway; the frame rate is too low.

Meter: because it's using the main sensor to meter, in theory it should be good, but I find it tends to overexpose and clip highlights. -1/3EV compensation pretty much sorts it out.

Noise/sensor issues: under ISO400 the noise is pretty good; over that it gets a bit questionable. High ISO images look good converted to B&W. Probably because the sensor is quite small, the dynamic range is not that great. High contrast scenes tend to clip at both ends.

Short sensor-mount distance: I don't know how much it's down to this, but the kit lens is really good, with well-controlled distortion and good contrast and sharpness. The short distance should make for decent wideangle lenses, at least in theory. The key thing about this is that it's about the only interchangeable lens system where you can adapt Leica M-mount lenses, and because of the way the manual focus works, it should be easy to focus them.

Edit to add: maybe it's not obvious from this review, but I really like the camera! I'm waiting to get an M adaptor so I can use my Leica lenses on it, but even with the kit lens it's nice to use.