Steve Harris (theno23) wrote,

Vestigial fulltext indexing in 4store

Many people have requested fulltext indexing in 4store. At some point we'll probably use something like c-lucene, but for now I've done a bit of work on a sort-of entailment scheme. It's similar to what we do on

Imagine you have the file:
@prefix rdfs: <> .

<en> rdfs:label "Basking fishING Whales frogs cows Indeterminate"@en .
<ru> rdfs:label "Корова Хайнак Морфология"@ru .
<it> rdfs:label "Rane mangiano le mosche, ma non può volare"@IT-gb .
<gibberish> rdfs:label "Gibber gibber gibber"@gibberish .
For now the code latches onto rdfs:label predicates, but that will be configurable when it is released.
When matching predicates are imported, you get the following data produced:
$ 4s-query text -f text 'SELECT * WHERE { ?x ?y ?z } ORDER BY ?x DESC(?y) ?z'
?x    ?y    ?z
<file:///tmp/en>    <>    "Basking fishING Whales frogs cows Indeterminate"@EN
<file:///tmp/en>    <>            "bask"
<file:///tmp/en>    <>            "cow"
<file:///tmp/en>    <>            "fish"
<file:///tmp/en>    <>            "frog"
<file:///tmp/en>    <>            "indetermin"
<file:///tmp/en>    <>            "whale"
<file:///tmp/gibberish>    <>    "Gibber gibber gibber"@GIBBERISH
<file:///tmp/it>    <>    "Rane mangiano le mosche, ma non può volare"@IT-GB
<file:///tmp/it>    <>            "le"
<file:///tmp/it>    <>            "ma"
<file:///tmp/it>    <>            "mang"
<file:///tmp/it>    <>            "mosc"
<file:///tmp/it>    <>            "non"
<file:///tmp/it>    <>            "può"
<file:///tmp/it>    <>            "ran"
<file:///tmp/it>    <>            "vol"
<file:///tmp/ru>    <>    "Корова Хайнак Морфология"@RU
<file:///tmp/ru>    <>            "коров"
<file:///tmp/ru>    <>            "морфолог"
<file:///tmp/ru>    <>            "хайнак"
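The import-time expansion shown above can be modelled roughly as follows. This is only a sketch in Python, not the 4store code (which is C): the predicate names `rdfs:label` and `ex:stem` are stand-ins because the real URIs aren't given here, `expand` is a made-up helper, and `crude_stem` is a toy suffix stripper for English rather than whatever stemmer the real code uses.

```python
import re

# Stand-in predicate names; the real URIs are not given in the post.
LABEL = "rdfs:label"
STEM = "ex:stem"  # hypothetical


def crude_stem(word):
    """Toy English stemmer: strips 'ing' or a trailing 's'. Not a real algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(triples, stem=crude_stem):
    """Yield every input triple, plus one stem triple per distinct stem of a label."""
    for s, p, o in triples:
        yield (s, p, o)
        if p == LABEL:
            # \w+ is Unicode-aware for str, so Cyrillic and accented words tokenise too.
            tokens = re.findall(r"\w+", o.lower())
            for st in sorted({stem(t) for t in tokens}):
                yield (s, STEM, st)


if __name__ == "__main__":
    data = [("en", LABEL, "Basking fishING Whales frogs cows")]
    for triple in expand(data):
        print(triple)
```

A real version would presumably pick the stemmer from the literal's language tag, which would explain why the @gibberish label above gets no stem triples at all.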
This means that you can run queries like:
$ 4s-query text -f text 'SELECT * WHERE { ?x <> "mang", "non", "può" }'
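The comma-separated object list in that query is SPARQL shorthand for three triple patterns sharing the same subject and predicate, so ?x only binds to subjects that carry all three stems. A small Python model of that AND behaviour, assuming an in-memory inverted index (`build_index` and `match_all` are made-up names, and this says nothing about how 4store actually evaluates the query):

```python
from collections import defaultdict


def build_index(stem_triples):
    """Map each stem to the set of subjects that carry it."""
    index = defaultdict(set)
    for subject, stem in stem_triples:
        index[stem].add(subject)
    return index


def match_all(index, stems):
    """Return the subjects that have every one of the given stems."""
    sets = [index.get(s, set()) for s in stems]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    idx = build_index([
        ("it", "mang"), ("it", "non"), ("it", "può"), ("it", "le"),
        ("en", "cow"), ("en", "frog"),
    ])
    print(match_all(idx, ["mang", "non", "può"]))
    print(match_all(idx, ["mang", "cow"]))
```

The second lookup comes back empty because no single subject has both "mang" and "cow".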
I've also got code that does just word tokenisation, and I'll add metaphones when I track down a decent C implementation.

One thing I haven't worked out is how to take natural language strings in SPARQL and stem them. Metaphones are easy, cos that's a standard algorithm, but there are tonnes of stemming algorithms, and if the query side uses a different one from the indexer, the stems might not match.

Something like
SELECT ?x WHERE { ?x <> "Корова Хайнак Морфология"@RU }
would work, with some query rewriting, but it's a bit cheesy. When I remember how to push this to a public branch on GitHub, people can try it.
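For what it's worth, the query rewriting step could look something like the Python sketch below: take the literal, tokenise and stem it, and emit a pattern over the stems. Everything here is an assumption for illustration: `rewrite_pattern` is a made-up helper, `<http://example.org/stem>` is a placeholder for the real stem predicate, and the `stem` callable has to be the same function the indexer used or the stems won't line up.

```python
import re

# Placeholder; the post does not give the real stem predicate URI.
STEM_PREDICATE = "<http://example.org/stem>"


def rewrite_pattern(subject_var, literal, stem, predicate=STEM_PREDICATE):
    """Turn a natural-language literal into a triple pattern over its stems."""
    tokens = re.findall(r"\w+", literal.lower())
    stems = sorted({stem(t) for t in tokens})
    if not stems:
        raise ValueError("literal contains no indexable words")
    # Tokens match \w+ only, so they cannot contain quotes that need escaping.
    objects = ", ".join('"%s"' % s for s in stems)
    return "%s %s %s" % (subject_var, predicate, objects)


if __name__ == "__main__":
    # Toy stemmer for the demo: drop trailing 's' characters.
    pattern = rewrite_pattern("?x", "Whales frogs", stem=lambda w: w.rstrip("s"))
    print("SELECT ?x WHERE { %s }" % pattern)
```

The language tag on the literal would be the natural way to choose which stemmer to pass in, but that part is left out here.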
Tags: 4store, fulltext, metaphones, rdf, search, sparql, stemming