lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92 #9

plasticfist · 2021-12-20T22:40:37Z

First I have to say nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata ttl.

Details

wikidata's file latest-all.ttl

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

--- nearby lines from file, including problematic line, if my sed is correct

sed -n '3135042,3135046p;3135047q' latest-all.ttl

ref:8f38c16e1f141b68f172b65f48b0982234890e56 a wikibase:Reference ;
	pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;
	pr:P813 "2020-03-12T00:00:00Z"^^xsd:dateTime ;
	prv:P813 v:ae909ef12942e232eea24326bdd78c8e ;

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
gunzip and parse the file

Note: these are big files. Another strength of lightrdf is its ability to process large files without requiring the entire data set to fit in RAM.


parse the file 'revisions_lang=en_uris.ttl'
For each .ttl in this dataset, I'm simply dumping the triples to a file like this:

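import lightrdf

# input_file and f_triples (an open, writable text file) are defined elsewhere
parser = lightrdf.Parser()
triples = parser.parse(str(input_file), format="ttl", base_iri=None)
# write each triple as one tab-separated line
for (s, p, o) in triples:
    f_triples.write(f"{s}\t{p}\t{o}\n")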


Other notes: 
- if there is a way to use try/except to move past the offending line, I would like to know how that is done
- ubuntu 20.04, conda env pip install lightrdf


ozekik · 2021-12-21T16:30:01Z

Thank you for reporting!

The problem is that #Crew?oldid=2476206#Command_crew in <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> is, strictly speaking, an invalid IRI part: a # followed by a second, unescaped # (and therefore the document is, in a precise sense, invalid RDF).
Some libraries such as rdflib just ignore it, but Rio (the Rust RDF library behind lightrdf) is strict and raises an exception.

As the resume-after-exception feature is WIP in Rio, I think a possible workaround for now is to fix invalid IRIs before parsing, like:

sed -r 's/([^#]*)#/\1%23/2g' latest-all.ttl

(Use -i to replace in-place and gsed on Mac)
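
For anyone who would rather do the same patch in Python, here is a minimal sketch of the idea. It differs from the sed one-liner in one respect: sed counts # characters per line, so a line carrying two IRIs that each have a valid fragment would also be rewritten, while this version counts per <...> IRI. The function name escape_extra_hashes and the IRI_REF pattern are made up for illustration and are not part of lightrdf or Rio. The pattern also matches <...> text inside string literals, so treat it as a starting point only.

```python
import re

# Matches an IRI reference in angle brackets, as written in Turtle/N-Triples.
# Illustrative pattern, not a full implementation of the IRIREF grammar.
IRI_REF = re.compile(r"<([^<>\s]*)>")

def escape_extra_hashes(line: str) -> str:
    """Percent-encode every '#' after the first one inside each <...> IRI."""
    def fix(match):
        # partition() splits at the first '#'; anything after it is the fragment
        head, sep, tail = match.group(1).partition("#")
        return "<" + head + sep + tail.replace("#", "%23") + ">"
    return IRI_REF.sub(fix, line)

line = "\tpr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;"
print(escape_extra_hashes(line))
```

Applied line by line while streaming the dump, this keeps the first # as the fragment delimiter and turns later ones into %23.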

plasticfist · 2021-12-21T16:35:49Z

Thank you for the quick response, this is very helpful. I'm usually hesitant to manually patch source files, but I agree this might be the best fix for the moment (thank you for the sed as well).

I'm still looking at the DBpedia ttls; lightrdf throws an error with that dataset as well, which I can't make sense of. At first I thought the problem was that their .ttl files weren't actually in Turtle format, but as I review the spec, maybe they are Turtle (just a bare lazy dump with no prefixes?). I'm still looking, and trying to convert back and forth to other formats (e.g. with rapper).

ozekik · 2021-12-21T16:45:32Z

I understand that huge RDF datasets tend to be more or less malformed.
In my opinion, if an N-Triples file is available, it is easier than Turtle for finding and "patching" problems and tracking the changes.
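
To illustrate why a line-oriented format helps: N-Triples puts one triple on each line, so a suspicious line can be set aside on its own without affecting its neighbours. The sketch below is only an illustration of that idea. The names is_suspect and partition are hypothetical, and the check (more than one # in a single IRI) covers just the error from this issue, not full IRI validation.

```python
import re

# Illustrative pattern for <...> IRI references; not the full IRIREF grammar.
IRI_REF = re.compile(r"<([^<>\s]*)>")

def is_suspect(line: str) -> bool:
    """Flag a line if any single IRI on it contains more than one '#'."""
    return any(iri.count("#") > 1 for iri in IRI_REF.findall(line))

def partition(lines):
    """Split lines into (kept, rejected); rejected entries keep 1-based line numbers."""
    kept, rejected = [], []
    for number, line in enumerate(lines, start=1):
        if is_suspect(line):
            rejected.append((number, line))
        else:
            kept.append(line)
    return kept, rejected

sample = [
    "<http://example.org/a> <http://example.org/p#q> <http://example.org/b#c> .",
    "<http://example.org/a> <http://example.org/p> <http://example.org/b#c#d> .",
]
kept, rejected = partition(sample)
print(len(kept), [number for number, _ in rejected])
```

Writing the rejected lines and their line numbers to a separate file gives a record of what was set aside, which makes the patch easy to review later.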

plasticfist · 2021-12-21T19:01:50Z

Here is the (first) DBpedia issue, for reference (a .ttl file, but is it Turtle?):

../dbpedia/ttl/revisions_lang=en_uris.ttl
lightrdf.Error: error while parsing IRI 'http://dbpedia.org/resource/󠄀': Invalid IRI code point '󠄀' on line 19841225 at position 35

$ sed -n '19841223,19841227p;19841228q' revisions_lang=en_uris.ttl
<http://dbpedia.org/resource/𨳒> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𨳒?oldid=786024110&ns=0> .
<http://dbpedia.org/resource/𩧢> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𩧢?oldid=951071761&ns=0> .
<http://dbpedia.org/resource/󠄀> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄀?oldid=949255578&ns=0> .
<http://dbpedia.org/resource/󠄁> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄁?oldid=949255580&ns=0> .
<http://dbpedia.org/resource/󠄂> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄂?oldid=949255609&ns=0> .

I'm including a screen capture, because the terminal seems to give more information about the characters in these 5 lines.
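
Since these characters are invisible in most fonts, a small Python sketch can list the code points on a line instead of relying on how the terminal renders them. The helper name describe_non_ascii is made up for illustration. The sample assumes the offending character is U+E0100, which is what it appears to be here; if so, it falls in the U+E0000 to U+E0FFF range that the RFC 3987 ucschar rule leaves out, which would explain why a strict parser rejects it.

```python
import unicodedata

def describe_non_ascii(text):
    """Return 'U+XXXX NAME' for each non-ASCII character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
        if ord(ch) > 0x7F
    ]

# Assumed sample: U+E0100 is written as an escape so that it is visible here.
sample = "<http://dbpedia.org/resource/\U000E0100>"
for entry in describe_non_ascii(sample):
    print(entry)
```

Feeding it the output of the sed -n command above (one line at a time) shows exactly which code points each IRI contains.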

djstrong · 2023-04-01T21:51:46Z

I have tried with this sed solution while parsing Wikidata, but:
lightrdf.Error: error while parsing IRI 'http://archive.is/EKEWo#34.7%': Invalid IRI percent encoding '%' on line 49533684 at position 41
Another:
lightrdf.Error: error while parsing language tag 'zh-classical': A subtag may be eight characters in length at maximum on line 59030363 at position 69
:(
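
The first of these two errors looks like a % that is not followed by two hexadecimal digits, which could be patched before parsing in the same spirit as the # workaround. The following is a minimal sketch under that assumption; escape_stray_percents and both patterns are hypothetical names, not part of lightrdf or Rio. It does nothing about the language tag error, which is a separate kind of problem.

```python
import re

# Illustrative pattern for <...> IRI references; not the full IRIREF grammar.
IRI_REF = re.compile(r"<([^<>\s]*)>")
# A '%' that is not the start of a valid percent-encoded byte such as %20
STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def escape_stray_percents(line: str) -> str:
    """Replace stray '%' characters inside each <...> IRI with '%25'."""
    return IRI_REF.sub(
        lambda m: "<" + STRAY_PERCENT.sub("%25", m.group(1)) + ">",
        line,
    )

print(escape_stray_percents("<http://archive.is/EKEWo#34.7%>"))
```

Valid sequences such as %20 are left untouched, so running it more than once does not double-encode anything.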

ozekik added the "in progress" and "external" labels · Dec 21, 2021

rami3l mentioned this issue · Feb 6, 2023: "IDEA: improving Turtle performance with non-strict mode" (oxigraph/rio#81, open)
