Imagine you have the file:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<en> rdfs:label "Basking fishING Whales frogs cows Indeterminate"@en .
<ru> rdfs:label "Корова Хайнак Морфология"@ru .
<it> rdfs:label "Rane mangiano le mosche, ma non può volare"@IT-gb .
<gibberish> rdfs:label "Gibber gibber gibber"@gibberish .

For now the code latches onto rdfs:label predicates, but it will be configurable when it is released.
When triples with matching predicates are imported, the following data is produced:
$ 4s-query text -f text 'SELECT * WHERE { ?x ?y ?z } ORDER BY ?x DESC(?y) ?z'
?x ?y ?z
<file:///tmp/en> <http://www.w3.org/2000/01/rdf-schema#label> "Basking fishING Whales frogs cows Indeterminate"@EN
<file:///tmp/en> <http://4store.org/fulltext#stem> "bask"
<file:///tmp/en> <http://4store.org/fulltext#stem> "cow"
<file:///tmp/en> <http://4store.org/fulltext#stem> "fish"
<file:///tmp/en> <http://4store.org/fulltext#stem> "frog"
<file:///tmp/en> <http://4store.org/fulltext#stem> "indetermin"
<file:///tmp/en> <http://4store.org/fulltext#stem> "whale"
<file:///tmp/gibberish> <http://www.w3.org/2000/01/rdf-schema#label> "Gibber gibber gibber"@GIBBERISH
<file:///tmp/it> <http://www.w3.org/2000/01/rdf-schema#label> "Rane mangiano le mosche, ma non può volare"@IT-GB
<file:///tmp/it> <http://4store.org/fulltext#stem> "le"
<file:///tmp/it> <http://4store.org/fulltext#stem> "ma"
<file:///tmp/it> <http://4store.org/fulltext#stem> "mang"
<file:///tmp/it> <http://4store.org/fulltext#stem> "mosc"
<file:///tmp/it> <http://4store.org/fulltext#stem> "non"
<file:///tmp/it> <http://4store.org/fulltext#stem> "può"
<file:///tmp/it> <http://4store.org/fulltext#stem> "ran"
<file:///tmp/it> <http://4store.org/fulltext#stem> "vol"
<file:///tmp/ru> <http://www.w3.org/2000/01/rdf-schema#label> "Корова Хайнак Морфология"@RU
<file:///tmp/ru> <http://4store.org/fulltext#stem> "коров"
<file:///tmp/ru> <http://4store.org/fulltext#stem> "морфолог"
<file:///tmp/ru> <http://4store.org/fulltext#stem> "хайнак"
This means that you can run queries like:
$ 4s-query text -f text 'SELECT * WHERE { ?x <http://4store.org/fulltext#stem> "mang", "non", "può" }'
?x
<file:///tmp/it>
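The indexing step behind the output above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual 4store code: `toy_stem`, `STEMMERS` and `stem_triples` are made-up names, and the toy stemmer only strips a trailing "s", so it will not reproduce real stems such as "indetermin". The sketch assumes the stemmer is picked from the primary subtag of the language tag (so "IT-gb" is treated as Italian) and that labels in an unknown language, like "gibberish", get no stem triples, which is what the output above suggests.

```python
import re
from typing import Callable, Dict, Iterator, Tuple

STEM = "http://4store.org/fulltext#stem"


def toy_stem(word: str) -> str:
    """Placeholder stemmer (assumption): strips one trailing 's'.

    A real index would use a proper per-language stemming algorithm.
    """
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


# Hypothetical registry mapping a primary language subtag to a stemmer.
STEMMERS: Dict[str, Callable[[str], str]] = {"en": toy_stem}


def stem_triples(subject: str, label: str, lang: str) -> Iterator[Tuple[str, str, str]]:
    """Yield one (subject, fulltext#stem, stem) triple per distinct stem."""
    primary = lang.lower().split("-")[0]  # "IT-gb" -> "it"
    stemmer = STEMMERS.get(primary)
    if stemmer is None:
        return  # unknown language: keep the label, emit no stems
    # \w+ is Unicode-aware in Python 3, so Cyrillic and accented words survive.
    words = re.findall(r"\w+", label.lower())
    for stem in sorted({stemmer(w) for w in words}):
        yield (subject, STEM, stem)


if __name__ == "__main__":
    for triple in stem_triples("file:///tmp/en", "frogs cows frogs", "en"):
        print(triple)
```

Lowercasing and de-duplicating before emitting triples is why "fishING" and a repeated word each end up as a single stem per subject.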
I've also got code that does just word tokenisation, and I'll add metaphones when I track down a decent C implementation.

One thing I haven't worked out is how to take natural language strings in SPARQL and stem them. Metaphones are easy, cos that's a standard algorithm, but there are tonnes of stemming algorithms, and if the query side uses a different one from the indexer, the stems might not match.
Something like
SELECT ?x WHERE { ?x <http://4store.org/fulltext#unstemed> "Корова Хайнак Морфология"@RU }

would work, with some query rewriting, but it's a bit cheesy. When I remember how to push this to a public branch on GitHub people can try it.
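The query rewriting mentioned above could look something like this. It is only a sketch under assumptions: `rewrite_to_stem_query` is a hypothetical helper, not part of 4store, and the `stemmer` argument stands in for whatever algorithm the indexer used. As noted earlier, the rewrite only works if both sides stem the same way.

```python
import re
from typing import Callable

STEM = "http://4store.org/fulltext#stem"


def rewrite_to_stem_query(text: str, stemmer: Callable[[str], str]) -> str:
    """Turn a natural language string into a query over fulltext#stem.

    Hypothetical helper: `stemmer` must match the one used at index time.
    """
    words = re.findall(r"\w+", text.lower())
    stems = sorted({stemmer(w) for w in words})
    if not stems:
        raise ValueError("no words to search for")
    # \w+ tokens contain no quotes or backslashes, so no escaping is done here;
    # a stemmer that could emit such characters would need escaping added.
    objects = ", ".join('"%s"' % s for s in stems)
    return "SELECT ?x WHERE { ?x <%s> %s }" % (STEM, objects)


if __name__ == "__main__":
    # Identity "stemmer" used purely for demonstration.
    print(rewrite_to_stem_query("frogs Cows", lambda w: w))
```

The comma-separated object list is SPARQL shorthand for several triple patterns sharing a subject and predicate, so a subject matches only if it has every one of the stems.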
September 28 2009, 09:45:25 UTC 2 years ago
Also, I'm starting to wonder if it's a bad thing if your software is better at languages than you.