lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92 #9
Comments
Thank you for reporting! The problem is that As the resume-after-exception feature is WIP in Rio, I think a possible workaround for now is to fix invalid IRIs before parsing, like:
(Use |
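As an illustration of the "fix invalid IRIs before parsing" workaround, here is a minimal Python sketch. It is not the maintainer's original command (which is not preserved above). It assumes that the only problem is an extra `#` inside the fragment of an IRI written between angle brackets, and that percent-encoding it as `%23` is acceptable for your use case. The helper names `fix_iri`, `fix_line`, and `fix_file` are hypothetical.

```python
import re

# Matches an IRI reference written as <...>. Assumption: angle brackets in the
# file only delimit IRIs (a '<...>' inside a string literal would also match).
IRI_RE = re.compile(r"<([^<>\s]*)>")


def fix_iri(iri: str) -> str:
    """Keep the first '#' (fragment delimiter) and percent-encode any later '#'."""
    head, sep, fragment = iri.partition("#")
    return head + sep + fragment.replace("#", "%23")


def fix_line(line: str) -> str:
    """Apply fix_iri to every <...> IRI found on one line."""
    return IRI_RE.sub(lambda m: "<" + fix_iri(m.group(1)) + ">", line)


def fix_file(src: str, dst: str) -> None:
    """Stream src to dst line by line, so the file never has to fit in RAM."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(fix_line(line))
```

With this sketch, the IRI from the error message becomes `https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206%23Command_crew`, and the cleaned copy written by `fix_file` can then be passed to the parser instead of the original dump.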
Thank you for the quick response, this is very helpful. I'm usually hesitant to manually patch source files, but I agree this might be the best fix for the moment (thank you for the sed as well). I'm still looking at the DBpedia TTL files; lightrdf throws an error with that dataset as well, which I can't make sense of. At first I thought the problem was that their .ttl files weren't actually in Turtle format, but as I review the spec, maybe they are Turtle (just a bare, lazy dump with no prefixes?). I'm still looking, and trying to convert back and forth to other formats (e.g. with rapper).
I understand that huge datasets in RDF tend to be more or less malformed.
Here is the (first) DBpedia issue, for reference (a .ttl file, but is it Turtle?): ../dbpedia/ttl/revisions_lang=en_uris.ttl
I'm including a screen capture, because the terminal seems to give more information about the characters in these 5 lines.
I have tried with this
First, I have to say nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata TTL dump.
Details
wikidata's file latest-all.ttl
lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92
--- nearby lines from file, including problematic line, if my sed is correct
sed -n '3135042,3135046p;3135047q' latest-all.ttl
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
gunzip and parse the file
Note: these are big files. Another strength of lightrdf is its ability to process large files without requiring the entire dataset to fit in RAM.
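For inspecting the lines around the reported error without first decompressing the whole dump to disk, the `sed -n` command above could be approximated in Python as follows. This is only a sketch: the helper name `lines_in_range` is hypothetical, and it assumes the dump is a UTF-8 text file compressed with gzip.

```python
import gzip
from itertools import islice


def lines_in_range(path: str, first: int, last: int) -> list[str]:
    """Return lines first..last (1-indexed, inclusive) of a gzip-compressed text file.

    The file is decompressed as a stream and reading stops after `last`,
    so memory use does not grow with the size of the dump.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return list(islice(f, first - 1, last))
```

For example, `lines_in_range("latest-all.ttl.gz", 3135042, 3135046)` would return the same five lines that the sed command prints from the decompressed file.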