Steve Harris (theno23) wrote,

Vestigial fulltext indexing in 4store

Many people have requested fulltext indexing in 4store. At some point we'll probably use something like c-lucene, but for now I've done a bit of work on a sort-of entailment scheme. It's similar to what we do on qdos.com.

Imagine you have the file:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<en> rdfs:label "Basking fishING Whales frogs cows Indeterminate"@en .
<ru> rdfs:label "Корова Хайнак Морфология"@ru .
<it> rdfs:label "Rane mangiano le mosche, ma non può volare"@IT-gb .
<gibberish> rdfs:label "Gibber gibber gibber"@gibberish .
For now the code latches onto rdfs:label predicates, but it will be configurable when it's released.
When triples with matching predicates are imported, the following data is produced:
$ 4s-query text -f text 'SELECT * WHERE { ?x ?y ?z } ORDER BY ?x DESC(?y) ?z'
?x    ?y    ?z
<file:///tmp/en>    <http://www.w3.org/2000/01/rdf-schema#label> "Basking fishING Whales frogs cows Indeterminate"@EN
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "bask"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "cow"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "fish"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "frog"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "indetermin"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "whale"
<file:///tmp/gibberish>    <http://www.w3.org/2000/01/rdf-schema#label>    "Gibber gibber gibber"@GIBBERISH
<file:///tmp/it>    <http://www.w3.org/2000/01/rdf-schema#label>    "Rane mangiano le mosche, ma non può volare"@IT-GB
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "le"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "ma"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "mang"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "mosc"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "non"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "può"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "ran"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "vol"
<file:///tmp/ru>    <http://www.w3.org/2000/01/rdf-schema#label> "Корова Хайнак Морфология"@RU
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "коров"
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "морфолог"
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "хайнак"
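To make the import step concrete, here is a rough Python sketch of the idea (it is not 4store's actual code). It assumes the stemmer is chosen by the primary language subtag and that labels in a language with no stemmer get no stem triples, which is what the output above suggests for `@gibberish`. The names `toy_english_stem`, `STEMMERS` and `stem_triples` are made up for illustration, and the toy stemmer is only a stand-in for a real stemming algorithm, so its stems will not always agree with the listing above.

```python
import re

STEM_PREDICATE = "http://4store.org/fulltext#stem"


def toy_english_stem(token):
    """Stand-in stemmer: strips "ing" or a trailing "s". Not a real algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical registry: primary language subtag -> stemming function.
STEMMERS = {"en": toy_english_stem}


def stem_triples(subject, label, lang):
    """Return sorted, de-duplicated (subject, predicate, stem) triples for one label."""
    # "IT-gb" -> "it": only the primary subtag picks the stemmer (an assumption).
    stem = STEMMERS.get(lang.split("-")[0].lower())
    if stem is None:
        return []  # unknown language: no stems are generated
    tokens = re.findall(r"\w+", label.lower())  # Unicode-aware word split
    return [(subject, STEM_PREDICATE, s) for s in sorted({stem(t) for t in tokens})]
```

Lowercasing before stemming is why "fishING" and "Gibber gibber gibber" collapse to a single entry each, and collecting the stems in a set is what removes the duplicates.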
This means that you can run queries like:
$ 4s-query text -f text 'SELECT * WHERE { ?x <http://4store.org/fulltext#stem> "mang", "non", "può" }'
?x
<file:///tmp/it>
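The comma-separated object list in that query is shorthand for three triple patterns on the same subject and predicate, so a subject only matches if it has all three stems. A small Python sketch of that behaviour, using the stems from the listing above as an in-memory index (the names `STEM_INDEX` and `subjects_with_all` are hypothetical):

```python
# Subject -> set of stems, copied from the query output shown earlier.
STEM_INDEX = {
    "file:///tmp/en": {"bask", "cow", "fish", "frog", "indetermin", "whale"},
    "file:///tmp/it": {"le", "ma", "mang", "mosc", "non", "può", "ran", "vol"},
    "file:///tmp/ru": {"коров", "морфолог", "хайнак"},
}


def subjects_with_all(stems, index=STEM_INDEX):
    """Return the subjects that carry every one of the requested stems."""
    wanted = set(stems)
    return sorted(s for s, have in index.items() if wanted <= have)


print(subjects_with_all(["mang", "non", "può"]))  # ['file:///tmp/it']
```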
I've also got code that does just word tokenisation, and I'll add metaphones when I track down a decent C implementation.

One thing I haven't worked out is how to take natural language strings in SPARQL and stem them. Metaphones are easy, cos that's a standard algorithm, but there are tonnes of stemming algorithms, and if the query side uses a different one from the indexer the stems might not match.

Something like
SELECT ?x WHERE { ?x <http://4store.org/fulltext#unstemed> "Корова Хайнак Морфология"@RU }
would work, with some query rewriting, but it's a bit cheesy. When I remember how to push this to a public branch on github, people can try it.
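For what it's worth, the rewriting could look something like the Python sketch below. This is only an illustration of the idea, not anything in 4store: it uses a regular expression rather than a real SPARQL parser, and the `rewrite` function and the `stemmers` mapping (primary language subtag to stemming function) are made-up names. The stems will only match if the functions passed in are the same ones used at import time.

```python
import re

UNSTEMMED = "http://4store.org/fulltext#unstemed"  # predicate spelled as in the query above
STEM = "http://4store.org/fulltext#stem"

# Matches: <...#unstemed> "some text"@lang
PATTERN = re.compile(
    r'<%s>\s*"((?:[^"\\]|\\.)*)"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)' % re.escape(UNSTEMMED)
)


def rewrite(query, stemmers):
    """Replace each unstemmed pattern with an object list of stems."""

    def repl(match):
        text = match.group(1)
        lang = match.group(2).split("-")[0].lower()
        stem = stemmers[lang]  # raises KeyError if the language has no stemmer
        stems = sorted({stem(tok) for tok in re.findall(r"\w+", text.lower())})
        return "<%s> %s" % (STEM, ", ".join('"%s"' % s for s in stems))

    return PATTERN.sub(repl, query)
```

Called with a query containing `"Frogs cows"@EN` and a stemmer that strips a trailing "s", this produces a pattern on the stem predicate with the objects `"cow", "frog"`.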
Tags: 4store, fulltext, metaphones, rdf, search, sparql, stemming