Offer content type JSON Lines and format gzip #91

Closed
acka47 opened this issue Apr 23, 2018 · 8 comments · Fixed by #126

Comments

@acka47
Contributor

acka47 commented Apr 23, 2018

As in lobid-resources, see http://lobid.org/resources/api#content_types

@acka47
Contributor Author

acka47 commented Apr 24, 2018

And also support gzip compression via a request header (Accept-Encoding: gzip).

@acka47 acka47 changed the title Offer content type JSON Lines Offer content type JSON Lines and format gzip Apr 24, 2018
@fsteeg fsteeg added the ready label May 14, 2018
@fsteeg fsteeg self-assigned this May 14, 2018
@fsteeg fsteeg added working and removed ready labels Jun 13, 2018
fsteeg added a commit that referenced this issue Jun 19, 2018
fsteeg added a commit that referenced this issue Jun 20, 2018
fsteeg added a commit that referenced this issue Jun 20, 2018
fsteeg added a commit that referenced this issue Jun 20, 2018
@fsteeg fsteeg added review and removed working labels Jun 20, 2018
@fsteeg
Member

fsteeg commented Jun 20, 2018

Deployed to stage, see:

http://stage.lobid.org/gnd/search?q=ehrenfeld&format=bulk

Tested uncompressed request for all corporate bodies:

curl "http://stage.lobid.org/gnd/search?q=type:CorporateBody&format=bulk" > bulk.jsonl

This yields a 1.7 GB file. Took about:

  • 1:30 minutes on the same machine
  • 2:30 minutes on our local network
  • 6:45 minutes on our Eduroam WLAN

Tested same request, but compressed (handled by the Apache proxy):

curl --header "Accept-Encoding: gzip" "http://stage.lobid.org/gnd/search?q=type:CorporateBody&format=bulk" > bulk.gz

This yields a 174 MB file. Took about:

  • 1:30 minutes on the same machine
  • 1:30 minutes on our local network
  • 1:30 minutes on our Eduroam WLAN
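
To put the two sets of timings side by side, the effective transfer rates can be worked out from the figures above. This is a minimal sketch, assuming 1.7 GB is about 1700 MB and reading each duration as minutes:seconds; the helper name `throughput_mb_per_s` is made up for this example.

```shell
# Effective transfer rate in MB/s for a payload of SIZE_MB megabytes
# transferred in a duration given as M:SS.
throughput_mb_per_s() {
  LC_ALL=C awk -v mb="$1" -v t="$2" 'BEGIN {
    split(t, part, ":")   # part[1] = minutes, part[2] = seconds
    printf "%.1f\n", mb / (part[1] * 60 + part[2])
  }'
}

throughput_mb_per_s 1700 6:45   # uncompressed, Eduroam WLAN
throughput_mb_per_s 174 1:30    # compressed, any of the three setups
```

That works out to roughly 4.2 MB/s uncompressed over WLAN versus 1.9 MB/s compressed. Since the compressed request took the same time in all three setups, the network was probably not the limiting factor there, although the thread does not confirm this.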

See also documentation: http://stage.lobid.org/gnd/api#content_types

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Jun 20, 2018
@acka47
Contributor Author

acka47 commented Jun 21, 2018

You should also be able to get JSON Lines and gzip for a filter query. I tried

curl --header "Accept: application/x-jsonlines" "http://stage.lobid.org/gnd/search?filter=%2B%28type%3APlaceOrGeographicName%29" > geographika.jsonl

and am currently pulling the whole GND.

Furthermore, as we discussed online, we will use jsonl instead of bulk (and adjust this in lobid-resources at a later point as well).

@acka47
Contributor Author

acka47 commented Jun 21, 2018

Downloading the whole GND as gzip (1.5 GB compressed, 14 GB unzipped) took just 13 minutes. So this definitely works like a charm.
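
A download like this can be sanity-checked without unpacking the full 14 GB to disk. This is a minimal sketch, assuming the file is gzip-compressed JSON Lines with one record per line; the helper names `count_records` and `peek_records` are made up for this example.

```shell
# Count the records in a gzip-compressed JSON Lines file by streaming
# the decompressed data through wc instead of writing it to disk.
count_records() {
  gzip -dc "$1" | wc -l | tr -d ' '
}

# Print the first N records (default: 1) for a quick look at the data.
peek_records() {
  gzip -dc "$1" | head -n "${2:-1}"
}
```

For example, `count_records bulk.gz` would report the number of records in the compressed file from the earlier test, and `peek_records bulk.gz 3` would show its first three records.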

@acka47 acka47 assigned fsteeg and unassigned acka47 Jun 21, 2018
fsteeg added a commit that referenced this issue Jun 22, 2018
For consistency with `html`, `json`, etc.

See #91
fsteeg added a commit that referenced this issue Jun 22, 2018
@fsteeg
Member

fsteeg commented Jun 22, 2018

@fsteeg fsteeg assigned acka47 and unassigned fsteeg Jun 22, 2018
@acka47
Contributor Author

acka47 commented Jun 22, 2018

Looks good. One minor problem is left, though, which was actually there before this ticket: http://lobid.org/gnd/4074335-4.jsonl also gives back JSON and not JSON Lines, as does http://lobid.org/gnd/4074335-4.jsonfoo. We should only allow a colon : with more to follow after .json.
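
The rule proposed here can be sketched as a small shell function. This is only an illustration of strict format matching, not the project's actual implementation; the function name `resolve_format`, the `unsupported` result, and the `json:preferredName` value used below are assumptions made for the sketch.

```shell
# Map a requested format suffix to a supported format.
# "json" may be followed by a colon only if at least one more
# character comes after it; anything else is rejected rather than
# silently treated as JSON.
resolve_format() {
  case "$1" in
    html|jsonl)   printf '%s\n' "$1" ;;
    json|json:?*) printf 'json\n' ;;
    *)            printf 'unsupported\n' ;;
  esac
}
```

With this matching, `resolve_format json:preferredName` prints `json`, while `resolve_format jsonfoo` and `resolve_format json:` both print `unsupported` instead of falling back to JSON.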

fsteeg added a commit that referenced this issue Jun 22, 2018
Don't fall back to JSON if unsupported format was requested

See #91
@acka47
Contributor Author

acka47 commented Jun 22, 2018

+1
