New GND endpoint -> now named DNB (Deutsche Nationalbibliothek) #1180

Open
brunnerpaul opened this issue Dec 8, 2023 · 5 comments
Comments

@brunnerpaul

Many thanks, @hannahbast and @joka921, for the swift implementation of the GND dataset during the Wikidata Data Modelling days!

So far, all queries at https://qlever.cs.uni-freiburg.de/gnd result in the error
Unexpected token '<', "<!DOCTYPE "... is not valid JSON.

According to #1171 that might just be a generic error message.
Could you please have a look into what's the cause here?
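For anyone who hits the same message from a script: this wording is what a JSON parser reports when it is handed an HTML page (for example an error page from a proxy) instead of a JSON result. The sketch below is not taken from the QLever UI; `parse_sparql_response` is a hypothetical helper that shows how a client could detect that case and report something more useful.

```python
import json


def parse_sparql_response(body: str, content_type: str = "") -> dict:
    """Parse a response body as JSON, with a clearer error if it looks like HTML.

    Hypothetical helper for illustration only; not part of QLever.
    """
    stripped = body.lstrip()
    # An HTML page starts with '<' (e.g. "<!DOCTYPE html>"), which is never
    # the first character of a JSON document.
    if "html" in content_type.lower() or stripped.startswith("<"):
        raise RuntimeError(
            "Expected JSON but received what looks like an HTML page; "
            "the backend may be down or unreachable."
        )
    return json.loads(stripped)
```

Checking the `Content-Type` header first, and the first character of the body as a fallback, avoids passing an HTML page to the JSON parser at all.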

@hannahbast
Member

@brunnerpaul Thanks for the reminder! That just means that the backend is down (the suboptimal error message is a temporary bug in the QLever UI; we are currently working on a refactoring PR that addresses this). Here is what happened:

During the meeting last week, I set up an instance for a selection of files from https://data.dnb.de/opendata , which worked fine. After the meeting, I tried to set up an instance for all the data I could find on https://data.dnb.de/opendata, namely:

curl -L -C - --remote-name-all https://data.dnb.de/opendata/authorities-gnd_lds.nt.gz https://data.dnb.de/opendata/bib_lds.nt.gz https://data.dnb.de/opendata/dnb-all_lds.nt.gz https://data.dnb.de/opendata/dnb-all_ldsprov.nt.gz https://data.dnb.de/opendata/zdb_lds.nt.gz

Unfortunately, it turned out that some of these files are not formatted correctly, and like most SPARQL engines, QLever refuses to index data that is not formatted correctly. Two questions:

  1. Which files should we index? If only a subset of the above, why only a subset?

  2. Wouldn't the name dnb be more appropriate for the instance than gnd? The many abbreviations used on the site are really confusing (dnb, gnd, lds, zdb, ...)

@brunnerpaul
Author

  1. Which files should we index? If only a subset of the above, why only a subset?

I've added what I know about the specific datasets here:

https://data.dnb.de/opendata/authorities-gnd_lds.nt.gz
1.9G
Stable link to the current full dump of the GND in RDF (N-Triples) format

This is the most commonly used dataset AFAIK (authority files for persons, institutions, places, and thesauri) and should have the highest priority.

https://data.dnb.de/opendata/bib_lds.nt.gz
4.7M
Stable link to the current full dump of the address file (ISIL and library sigla directory) in RDF (N-Triples) format

This can be left out IMO, it’s kind of an address book of partner libraries.

https://data.dnb.de/opendata/dnb-all_lds.nt.gz
4.5G
Stable link to the current full dump of the DNB title data in RDF (N-Triples) format

This is bibliographic data: all the books of the German National Library, I think. It would be nice if QLever could offer it, because the file's size makes it difficult to process on smaller machines, but it has lower priority.

https://data.dnb.de/opendata/dnb-all_ldsprov.nt.gz
1.2G
Stable link to the current full dump of the metadata provenance for the DNB title data in RDF (N-Triples) format

If the bibliographic data is offered, this should also be offered. Also lower priority.

https://data.dnb.de/opendata/zdb_lds.nt.gz
549M
Stable link to the current full dump of the ZDB title data in RDF (N-Triples) format

More bibliographic data, magazines only. Also lower priority.

  2. Wouldn't the name dnb be more appropriate for the instance than gnd? The many abbreviations used on the site are really confusing (dnb, gnd, lds, zdb, ...)

Yes, good point. "DNB" (Deutsche Nationalbibliothek) as the data provider makes more sense as a name, especially if you want to add other datasets in the future. People working in GLAM mostly use the GND dataset (and call it "GND") but it would be good to keep the instance more general-purpose and have that reflected in the name "DNB".

@hannahbast
Member

Thanks, Paul, that was very helpful indeed.

I have now indexed all the files you listed except bib_lds.nt.gz, which contains malformed IRIs. Good to hear that it is not important and can be left out. The file dnb-all_ldsprov.nt.gz contains several invalid floating point literals, but QLever has an option to ignore those, which I enabled.
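For readers who want to find such lines in a dump before indexing, here is a rough sketch. It is not how QLever validates its input, and the thread does not say what exactly was malformed in these files; `check_line` is a hypothetical helper that applies two approximate checks based on the N-Triples grammar (characters forbidden inside `<...>`) and the XSD lexical form for `xsd:float` and `xsd:double`.

```python
import re

# A quoted literal, optionally followed by ^^<datatype IRI>.
LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"(?:\^\^<([^>]*)>)?')
# Anything between angle brackets, taken as an IRI candidate.
IRI = re.compile(r'<([^>]*)>')
# \uXXXX and \UXXXXXXXX escapes are allowed inside N-Triples IRIs.
UCHAR = re.compile(r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}')
# Characters the N-Triples grammar does not allow inside <...>.
BAD_IRI_CHAR = re.compile(r'[\x00-\x20<>"{}|^`\\]')
# Approximate lexical form of xsd:float / xsd:double.
XSD_FLOAT = re.compile(
    r'[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?|[+-]?INF|NaN'
)
FLOAT_TYPES = {
    "http://www.w3.org/2001/XMLSchema#float",
    "http://www.w3.org/2001/XMLSchema#double",
}


def check_line(line: str) -> list[str]:
    """Return a list of problems found in one N-Triples line (empty if none)."""
    problems = []
    for lexical, datatype in LITERAL.findall(line):
        if datatype in FLOAT_TYPES and not XSD_FLOAT.fullmatch(lexical):
            problems.append(f"invalid float literal: {lexical!r}")
    # Drop literals (and their datatype IRIs) so that '<' or '>' inside a
    # string is not mistaken for an IRI; datatype IRIs are not checked here.
    without_literals = LITERAL.sub(" ", line)
    for iri in IRI.findall(without_literals):
        if BAD_IRI_CHAR.search(UCHAR.sub("", iri)):
            problems.append(f"malformed IRI: <{iri}>")
    return problems
```

To scan a whole dump, the function could be applied line by line to a file opened with `gzip.open(path, "rt", encoding="utf-8")`. The regular expressions are deliberately loose, so a clean result does not guarantee that a strict parser will accept the file.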

The instance is now live under https://qlever.cs.uni-freiburg.de/dnb . A few interesting example queries would be welcome (you can just post them in reply to this issue if you have any).

I have also added a Qleverfile for whoever wants to host an instance themselves: https://github.com/ad-freiburg/qlever-control/blob/python-qlever/Qleverfiles/Qleverfile.dnb

@hannahbast
Member

@brunnerpaul Does it work for you now?

@hannahbast changed the title from "queries for new GND endpoint result in error" to "New GND endpoint -> now named DNB (Deutsche Nationalbibliothek)" on Dec 12, 2023
@brunnerpaul
Author

Works great, thanks a lot!

I'll put together some sample queries and post them here. I have a few queries that I can now combine into a single, more useful query, because QLever can process it in one go.
