Steve Harris ([info]theno23) wrote,

Vestigial fulltext indexing in 4store

Many people have requested fulltext indexing in 4store. At some point we'll probably use something like c-lucene, but for now I've done a bit of work on a sort-of entailment scheme. It's similar to what we do on qdos.com.

Imagine you have the file:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<en> rdfs:label "Basking fishING Whales frogs cows Indeterminate"@en .
<ru> rdfs:label "Корова Хайнак Морфология"@ru .
<it> rdfs:label "Rane mangiano le mosche, ma non può volare"@IT-gb .
<gibberish> rdfs:label "Gibber gibber gibber"@gibberish .
For now the code latches onto rdfs:label predicates, but it will be configurable when it is released.
When matching predicates are imported, you get the following data produced:
$ 4s-query text -f text 'SELECT * WHERE { ?x ?y ?z } ORDER BY ?x DESC(?y) ?z'
?x    ?y    ?z
<file:///tmp/en>    <http://www.w3.org/2000/01/rdf-schema#label> "Basking fishING Whales frogs cows Indeterminate"@EN
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "bask"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "cow"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "fish"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "frog"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "indetermin"
<file:///tmp/en>    <http://4store.org/fulltext#stem>            "whale"
<file:///tmp/gibberish>    <http://www.w3.org/2000/01/rdf-schema#label>    "Gibber gibber gibber"@GIBBERISH
<file:///tmp/it>    <http://www.w3.org/2000/01/rdf-schema#label>    "Rane mangiano le mosche, ma non può volare"@IT-GB
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "le"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "ma"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "mang"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "mosc"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "non"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "può"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "ran"
<file:///tmp/it>    <http://4store.org/fulltext#stem>            "vol"
<file:///tmp/ru>    <http://www.w3.org/2000/01/rdf-schema#label> "Корова Хайнак Морфология"@RU
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "коров"
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "морфолог"
<file:///tmp/ru>    <http://4store.org/fulltext#stem>            "хайнак"
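
To make the shape of that output concrete, here is a minimal Python sketch of the import-time step: lowercase the label, split it into words, stem each word, and emit one stem triple per distinct stem. It is not the 4store code (which is C). The function names are made up for illustration, and toy_english_stem is a deliberately crude stand-in for a real stemmer, so it will not reproduce stems like "indetermin". The rule of picking the stemmer from the primary language subtag, and producing nothing for unknown languages, is an assumption read off the output above (@IT-gb gets Italian stems, @gibberish gets none).

```python
import re

STEM_PREDICATE = "http://4store.org/fulltext#stem"


def toy_english_stem(word):
    """Hypothetical stand-in for a real stemmer: strips "ing" or a plural "s"."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Assumed dispatch table: primary language subtag -> stemming function.
STEMMERS = {"en": toy_english_stem}


def stem_triples(subject, label, lang):
    """Return (subject, predicate, stem) triples for one label literal."""
    primary = lang.lower().split("-")[0]
    stemmer = STEMMERS.get(primary)
    if stemmer is None:
        # Unknown language: keep the label, generate no stems.
        return []
    words = re.findall(r"\w+", label.lower())
    stems = sorted({stemmer(word) for word in words})
    return [(subject, STEM_PREDICATE, stem) for stem in stems]


if __name__ == "__main__":
    for triple in stem_triples("file:///tmp/en", "Basking fishING Whales frogs cows", "EN"):
        print(triple)
```

Because the stems go into a set before being emitted, a label like "Gibber gibber gibber" in a supported language would yield a single stem triple rather than three.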
This means that you can run queries like:
$ 4s-query text -f text 'SELECT * WHERE { ?x <http://4store.org/fulltext#stem> "mang", "non", "può" }'
?x
<file:///tmp/it>
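
The comma-separated object list in that query is SPARQL shorthand for repeating the same subject and predicate, so a subject only matches if it has all of the listed stems. If you already have the stems in hand, building such a query is mechanical. A small Python sketch follows, with stem_query being a hypothetical helper name rather than anything shipped with 4store:

```python
STEM_PREDICATE = "http://4store.org/fulltext#stem"


def stem_query(stems):
    """Build a SPARQL query matching subjects that carry every given stem."""
    if not stems:
        raise ValueError("need at least one stem")
    # Escape backslashes and double quotes so each stem is a valid literal.
    literals = [
        '"{}"'.format(stem.replace("\\", "\\\\").replace('"', '\\"'))
        for stem in stems
    ]
    return "SELECT * WHERE {{ ?x <{}> {} }}".format(
        STEM_PREDICATE, ", ".join(literals)
    )


if __name__ == "__main__":
    print(stem_query(["mang", "non", "può"]))
```

The resulting string can be passed to 4s-query in the same way as the hand-written query above.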
I've also got code that does just word tokenisation, and I'll add metaphones when I track down a decent C implementation.

One thing I haven't worked out is how to take natural language strings in SPARQL and stem them. Metaphones are easy, cos that's a standard algorithm, but there are tonnes of stemming algorithms, and if you use a different one it might not match.

Something like
SELECT ?x WHERE { ?x <http://4store.org/fulltext#unstemed> "Корова Хайнак Морфология"@RU }
would work, with some query rewriting, but it's a bit cheesy. When I work out how to push this to a public branch on GitHub, people can try it.
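
For a feel of what that query rewriting might involve, here is a rough Python sketch that treats the query as text: it finds a pattern using the hypothetical unstemed predicate, stems the literal, and substitutes the equivalent stem pattern. This is only an assumption about how it could be done, not how 4store does it; a real version would work on the parsed query rather than with a regular expression. The stemmers argument is a made-up dispatch table, and, as noted above, the results only match if the stemmer is the same one used at import time.

```python
import re

UNSTEMMED_PREDICATE = "http://4store.org/fulltext#unstemed"
STEM_PREDICATE = "http://4store.org/fulltext#stem"

# Matches: ?var <...#unstemed> "some text"@lang
PATTERN = re.compile(
    r'(\?\w+)\s+<' + re.escape(UNSTEMMED_PREDICATE) + r'>\s+"([^"]*)"@([A-Za-z-]+)'
)


def rewrite(query, stemmers):
    """Replace natural-language patterns with stem patterns.

    stemmers maps a primary language subtag (e.g. "en") to a function
    that stems a single lowercase word.
    """

    def expand(match):
        var, text, lang = match.groups()
        stemmer = stemmers.get(lang.lower().split("-")[0])
        if stemmer is None:
            return match.group(0)  # unknown language: leave untouched
        stems = sorted({stemmer(word) for word in re.findall(r"\w+", text.lower())})
        if not stems:
            return match.group(0)
        objects = ", ".join('"{}"'.format(stem) for stem in stems)
        return "{} <{}> {}".format(var, STEM_PREDICATE, objects)

    return PATTERN.sub(expand, query)


if __name__ == "__main__":
    # Toy stemmer for the demo only: strips a trailing "s".
    toy = {"en": lambda w: w[:-1] if w.endswith("s") else w}
    print(rewrite(
        'SELECT ?x WHERE { ?x <http://4store.org/fulltext#unstemed> "frogs cows"@en }',
        toy,
    ))
```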
Tags: 4store, fulltext, metaphones, rdf, search, sparql, stemming

  • 4 comments

[info]elseware

September 28 2009, 08:39:28 UTC 2 years ago

I am most shocked by the fact that I appear to understand this on a Monday morning.

[info]theno23

September 28 2009, 09:45:25 UTC 2 years ago

Don't tell me you've started to believe in The Triples?

Also, I'm starting to wonder if it's a bad thing if your software is better at languages than you.

[info]theno23

November 16 2009, 18:02:46 UTC 2 years ago

the implementation details are now described at http://4store.org/trac/wiki/TextIndexing, for people looking for it, like me...

[info]theno23

February 25 2010, 21:17:24 UTC 2 years ago

Support for this is now in the mainline code, so there's no need to pull the branch.