[Question] What will happen if there is an update during download? #368

indrakaw · 2019-08-02T17:52:12Z

Let's say I'm scraping the JSON data from a Tumblr blog that has 500k+ posts.

gallery-dl -j https://kwwwsk.tumblr.com > kwwwsk.json

An update will eventually happen during the run, because the blog has scheduled posts and the scraping takes hours.

Will any content be skipped, i.e. posts that are missed and never returned by the API? I ask because, from what I can see, Tumblr paginates with an offset and a limit.

[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=14400&limit=50&reblog_info=true HTTP/1.1" 200 None


indrakaw · 2019-08-02T18:01:35Z

I wish this could be improved, for example by raising limit=50 to limit=500 or so.

indrakaw · 2019-08-02T19:02:12Z

I'm having a problem:

[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (1/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (2/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (3/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (4/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (5/5)

Edit:
The output was saved, but only partially:

$ tac kwwwsk.json | less

]
  ]
    "429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts"
    "HttpError",
  [
  ],
    }
      "type": "photo"
      "timestamp": 1373224748,
      "tags": [],
      "summary": "",
      "subcategory": "user",
      "state": "published",
      "source_url": "https://halfdry.tumblr.com/post/41362994761",
      "source_title": "halfdry",
      "slug": "",
      "should_open_in_legacy": true,
      "short_url": "https://tmblr.co/ZiZBVyp5EjX2",
      "recommended_source": null,
      "recommended_color": null,
      "reblogged_root_uuid": "t:6ojkl9CKJp3WqQvxyD28HA",
      "reblogged_root_url": "https://halfdry.tumblr.com/post/41362994761",
      "reblogged_root_title": "HALFDRY",
      "reblogged_root_name": "halfdry",
      "reblogged_root_id": "41362994761",
      "reblogged_root_following": false,
      "reblogged_root_can_message": true,
      "reblogged_from_uuid": "t:BhM0qYlu7B527SCYCua0aA",
      "reblogged_from_url": "https://aoky97-deactivated20150328.tumblr.com/post/54846827770",

Edit 2:
I'm glad I requested this feature before: #337

mikf · 2019-08-02T19:56:00Z

Tumblr's API has a rate limit of 1000 (?) requests per hour and there is even some special logic in place to wait until this limit has recovered, but I've managed to "break" it by changing stuff elsewhere. Sorry. I'll fix this ASAP.
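As a rough illustration of the "wait until the limit has recovered" idea, here is a minimal Python sketch. It is not gallery-dl's actual implementation: `RateLimitError`, `call_with_rate_limit` and the `wait_seconds` value are hypothetical names made up for this example, and the five attempts simply mirror the (1/5) to (5/5) counter in the log above.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error that carries how long the caller should wait."""

    def __init__(self, wait_seconds):
        super().__init__(f"rate limited, retry in {wait_seconds}s")
        self.wait_seconds = wait_seconds


def call_with_rate_limit(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch(); on RateLimitError, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except RateLimitError as exc:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(exc.wait_seconds)


# Demo with a fake request that is rate limited twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimitError(wait_seconds=60)
    return {"posts": []}


waits = []  # record the waits instead of really sleeping
result = call_with_rate_limit(flaky_fetch, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the demo instant; a real run would leave it at `time.sleep`.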

> I'm glad I requested this feature before: #337

Yes, you can use the timestamp of the last post you managed to scrape as date-max and it will begin from there.
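A small Python sketch of how that date-max value could be derived from a partial dump. The helper name `date_max_from_posts` and the sample list are made up for illustration; the date string uses the same `YYYY-MM-DDTHH:MM:SS` shape as the `-o date-max=...` commands elsewhere in this thread, and treating the timestamp as UTC is an assumption.

```python
from datetime import datetime, timezone


def date_max_from_posts(posts):
    """Format the oldest scraped post's timestamp as a date-max string."""
    oldest = min(post["timestamp"] for post in posts)
    # Assumption: the timestamp is Unix time and the date is read as UTC.
    moment = datetime.fromtimestamp(oldest, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


# Hypothetical sample; 1373224748 is the last timestamp in the partial output above.
posts = [{"timestamp": 1373300000}, {"timestamp": 1373224748}]
date_max = date_max_from_posts(posts)
print(f"-o date-max={date_max}")
```

Posts come back newest first, so the oldest timestamp seen is the point where the interrupted run stopped.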

Regarding your questions:

> Will any content be skipped?

No, if anything you will get duplicate content. All posts get moved "one ahead" and the last post in a list of 50 will reappear as the first one in the next list of 50 if a new one gets added at the beginning.
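The shifting described above can be reproduced with a small simulation. This is only a sketch of offset/limit paging over an in-memory list, not Tumblr's API or gallery-dl's code; `fetch_page`, `crawl` and the post ids are invented for the example.

```python
def fetch_page(posts, offset, limit=50):
    """Return one page from a newest-first list, like an offset/limit API."""
    return posts[offset:offset + limit]


def crawl(posts, limit=50, insert_after_page=1, new_post_id=10_000):
    """Page through posts; one new post is published after the given page."""
    seen = []
    offset = 0
    page_no = 0
    while True:
        page = fetch_page(posts, offset, limit)
        if not page:
            break
        seen.extend(page)
        offset += limit
        page_no += 1
        if page_no == insert_after_page:
            posts.insert(0, new_post_id)  # the new post lands at the front
    return seen


posts = list(range(200, 0, -1))  # ids 200..1, newest first
seen = crawl(posts, limit=50)

duplicates = sorted({p for p in seen if seen.count(p) > 1})
missing = sorted(set(range(1, 201)) - set(seen))
unique = list(dict.fromkeys(seen))  # drop repeats, keep order

print("duplicates:", duplicates)
print("missing:", missing)
print("unique posts:", len(unique))
```

In this run the last post of the first page shows up again as the first post of the second page, and none of the original 200 posts is missed. Deduplicating by post id afterwards removes the repeat.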

> I wish this could be improved, for example by raising limit=50 to limit=500 or so.

50 posts per API request is the maximum. Setting a higher number has no effect and still only returns data for 50 posts.
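For a sense of scale, a quick back-of-the-envelope calculation in Python. The 1000 requests per hour figure is the tentative number mentioned earlier in this comment, not a confirmed limit, and `estimate_crawl` is just an illustrative helper.

```python
import math


def estimate_crawl(total_posts, page_size=50, requests_per_hour=1000):
    """Return (number of API requests, hours needed at the given rate limit)."""
    requests = math.ceil(total_posts / page_size)
    hours = requests / requests_per_hour
    return requests, hours


requests, hours = estimate_crawl(500_000)
print(f"{requests} requests, about {hours:.0f} hours")
```

Under those assumptions, a blog with 500k posts needs 10,000 requests and roughly ten hours of rate-limited crawling.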

indrakaw · 2019-08-03T05:56:37Z

The "Limit Exceeded" error doesn't produce a non-zero exit code.

I was planning to run a loop and break on failure:

TUMBLOG=kwwwsk; \
for YEAR in {2019..2004}; do \
gallery-dl -o date-max=${YEAR}-01-01T00:00:00 -o date-min=$((YEAR - 1))-01-01T00:00:00 -vj https://${TUMBLOG}.tumblr.com > ${TUMBLOG}-${YEAR}.json || break; \
done

It doesn't work.
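One possible workaround, sketched in Python, is to inspect the JSON output itself instead of the exit status. The partial dump earlier in this thread ends with an entry of the form `["HttpError", "429: ..."]`; the sketch below assumes that regular entries start with an integer while error entries start with a string, which is an assumption about the `-j` output format rather than something confirmed here. `find_errors` is a hypothetical helper.

```python
import json


def find_errors(entries):
    """Return (name, message) pairs for entries shaped like ["HttpError", "..."]."""
    return [
        (entry[0], entry[1])
        for entry in entries
        if isinstance(entry, list)
        and len(entry) >= 2
        and isinstance(entry[0], str)
    ]


# Hypothetical, shortened sample modelled on the partial output in this thread.
sample = json.loads("""
[
  [2, {"type": "photo", "timestamp": 1373224748}],
  ["HttpError", "429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts"]
]
""")

errors = find_errors(sample)
for name, message in errors:
    print(f"{name}: {message}")
```

A loop over years could stop as soon as `find_errors` returns a non-empty list for the file it has just written.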

indrakaw closed this as completed Aug 3, 2019

mikf added a commit that referenced this issue Aug 3, 2019

fix rate limit handling for OAuth APIs (#368)

f4bc75e
