
lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92 #9

Open
plasticfist opened this issue Dec 20, 2021 · 5 comments



plasticfist commented Dec 20, 2021

First, I have to say nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata ttl.

Details

wikidata's file latest-all.ttl

lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92

--- nearby lines from file, including problematic line, if my sed is correct

sed -n '3135042,3135046p;3135047q' latest-all.ttl

ref:8f38c16e1f141b68f172b65f48b0982234890e56 a wikibase:Reference ;
	pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;
	pr:P813 "2020-03-12T00:00:00Z"^^xsd:dateTime ;
	prv:P813 v:ae909ef12942e232eea24326bdd78c8e ;

wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
gunzip and parse the file

Note: these are big files. Another strength of lightrdf is its ability to process large files without requiring the entire data set to fit in RAM.


parse the file 'revisions_lang=en_uris.ttl'
For each .ttl in this dataset, I'm simply dumping the triples to a file like this:

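import lightrdf

# f_triples is an already-open, writable text file; input_file is the path to the .ttl file
parser = lightrdf.Parser()
triples = parser.parse(str(input_file), format="ttl", base_iri=None)
for (s, p, o) in triples:
    f_triples.write(f"{s}\t{p}\t{o}\n")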

Other notes: 
- if there is a way to use a try catch to move past the line I would like to know how that is done
- ubuntu 20.04, conda env pip install lightrdf
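
On the try/except question, here is a minimal sketch. It assumes the parser's iterator raises its error at the point where it reaches the bad line. The traceback suggests an exception named lightrdf.Error, but it is caught here as a plain Exception so the sketch does not rely on that name. `dump_triples` is a hypothetical helper, not part of lightrdf. It does not skip the bad line; it only stops cleanly and reports how many triples were written before the failure.

```python
def dump_triples(triples, out):
    """Write tab-separated triples to `out` until the iterator fails.

    Returns (count, error); error is None if the iterator finished normally.
    """
    count = 0
    try:
        for s, p, o in triples:
            out.write(f"{s}\t{p}\t{o}\n")
            count += 1
    except Exception as exc:  # assumption: the parser's error subclasses Exception
        return count, exc
    return count, None
```

With the snippet above, this would be called as `dump_triples(parser.parse(str(input_file), format="ttl", base_iri=None), f_triples)`.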


plasticfist changed the title from "lightrdf.Error: error while parsing IRI 'http://dbpedia.org/resource/󠄀': Invalid IRI code point '󠄀' on line 16065027 at position 35" to "lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92" on Dec 21, 2021
Owner

ozekik commented Dec 21, 2021

Thank you for reporting!

The problem is that #Crew?oldid=2476206#Command_crew in <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> is, strictly speaking, an invalid IRI part: the fragment that starts at the first # contains a second, unescaped # (and therefore the document is, in a precise sense, invalid RDF).
Some libraries, such as rdflib, just ignore it, but Rio (the Rust RDF library behind lightrdf) is strict and raises an exception.

As the resume-after-exception feature is WIP in Rio, I think a possible workaround for now is to fix invalid IRIs before parsing, like:

sed -r 's/([^#]*)#/\1%23/2g' latest-all.ttl

(Use -i to replace in-place and gsed on Mac)
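
One caveat with the sed one-liner is that it counts # per line, not per IRI. A line holding several IRIs that each contain a single # (common in N-Triples) would therefore also be rewritten. Below is a minimal Python sketch of the same idea applied to each `<...>` IRI separately. `escape_extra_hashes` and the `IRIREF` pattern are hypothetical names for illustration. The pattern is a rough approximation and would also match `<...>` text inside string literals.

```python
import re

# Rough match for an IRI reference such as <http://example.org/a#b>.
IRIREF = re.compile(r'<([^<>"\s]*)>')

def escape_extra_hashes(line):
    """Percent-encode every '#' after the first one inside each <...> IRI."""
    def fix(match):
        iri = match.group(1)
        # Keep the first '#' as the fragment separator, escape any later ones.
        head, sep, fragment = iri.partition("#")
        return "<" + head + sep + fragment.replace("#", "%23") + ">"
    return IRIREF.sub(fix, line)
```

This could be applied line by line while copying the dump to a patched file, which keeps memory use flat.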

Author

plasticfist commented Dec 21, 2021

Thank you for the quick response, this is very helpful. I'm usually hesitant to manually patch source files, but I agree this might be the best fix for the moment (thank you for the sed as well). I'm still looking at the dbpedia ttls; lightrdf throws an error with that dataset as well, which I can't make sense of. At first I thought the problem was that their .ttl files weren't actually in Turtle format, but as I start to review the spec, maybe they are Turtle (just a bare, lazy dump with no prefixes?). I'm still looking, and trying to convert back and forth to other formats (e.g. with rapper).

Owner

ozekik commented Dec 21, 2021

I understand that huge datasets in RDF tend to be more or less malformed.
In my opinion, if an ntriples file is available, it is easier than turtle to find and "patch" problems and track the changes.
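
Because N-Triples puts one triple on each line, a bad line can be set aside without affecting its neighbours. Below is a minimal streaming sketch of that idea. `quarantine` and `looks_valid` are hypothetical names, and `looks_valid` stands in for whatever per-line check is used (a regex, or an attempt to parse the single line).

```python
def quarantine(src, good_out, bad_out, looks_valid):
    """Copy lines from src to good_out; divert failing lines to bad_out.

    Diverted lines are prefixed with their 1-based line number so that
    every patch can be traced back to the original file.
    Returns the number of diverted lines.
    """
    bad = 0
    for number, line in enumerate(src, start=1):
        if looks_valid(line):
            good_out.write(line)
        else:
            bad_out.write(f"{number}\t{line}")
            bad += 1
    return bad
```

The three file arguments are ordinary open text files, so the input is never loaded into memory as a whole.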

@plasticfist
Author

here is the (first) dbpedia (ttl file, but turtle?) issue, for reference

../dbpedia/ttl/revisions_lang=en_uris.ttl
lightrdf.Error: error while parsing IRI 'http://dbpedia.org/resource/󠄀': Invalid IRI code point '󠄀' on line 19841225 at position 35

$ sed -n '19841223,19841227p;19841228q' revisions_lang=en_uris.ttl
<http://dbpedia.org/resource/𨳒> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𨳒?oldid=786024110&ns=0> .
<http://dbpedia.org/resource/𩧢> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/𩧢?oldid=951071761&ns=0> .
<http://dbpedia.org/resource/󠄀> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄀?oldid=949255578&ns=0> .
<http://dbpedia.org/resource/󠄁> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄁?oldid=949255580&ns=0> .
<http://dbpedia.org/resource/󠄂> <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/󠄂?oldid=949255609&ns=0> .

including a screen capture, because terminal seems to give more information about the characters in these 5 lines
image
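
Since these characters are hard to see in a terminal, a small Python sketch can list their code points. `describe_non_ascii` is a hypothetical helper. Which code point it reports for this file is not stated in the thread. If it falls in the variation-selector range U+E0100 to U+E01EF (an assumption), that would be consistent with the error, because as far as I can tell the `ucschar` ranges in RFC 3987 do not cover that range.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (code point, Unicode name) for each non-ASCII character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]
```

Running it on the subject IRI copied from the failing line shows exactly which character the parser rejects.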


djstrong commented Apr 1, 2023

I have tried with this sed solution while parsing Wikidata, but:
lightrdf.Error: error while parsing IRI 'http://archive.is/EKEWo#34.7%': Invalid IRI percent encoding '%' on line 49533684 at position 41
Another:
lightrdf.Error: error while parsing language tag 'zh-classical': A subtag may be eight characters in length at maximum on line 59030363 at position 69
:(
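
For the percent-encoding error, one possible pre-processing step (a sketch, not something confirmed in this thread) is to encode any % that is not followed by two hex digits as %25. `escape_stray_percent` is a hypothetical helper and should only be applied to IRI strings, not to whole lines that may contain literals. It does not address the language tag error.

```python
import re

# A '%' that does not start a valid %XX escape.
STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def escape_stray_percent(iri):
    """Encode stray '%' characters as %25, leaving valid %XX escapes alone."""
    return STRAY_PERCENT.sub("%25", iri)
```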


3 participants