[Question] What will happen if there is an update during download? #368

Closed
indrakaw opened this issue Aug 2, 2019 · 4 comments

indrakaw commented Aug 2, 2019

Let's say I'm scraping the JSON data from a Tumblr blog that has 500k+ posts:

gallery-dl -j https://kwwwsk.tumblr.com > kwwwsk.json

The blog will eventually be updated during the run, because posts are scheduled and the scrape takes hours.

Will any content be skipped, i.e. posts that the API never returns? From what I can see, Tumblr paginates with offset and limit:

[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=14400&limit=50&reblog_info=true HTTP/1.1" 200 None
Author

indrakaw commented Aug 2, 2019

I wish the page size could be raised, e.g. from limit=50 to limit=500 or so.

Author

indrakaw commented Aug 2, 2019

I'm having a problem:

[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (1/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (2/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (3/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (4/5)                                             
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/kwwwsk.tumblr.com/posts?offset=70800&limit=50&reblog_info=true HTTP/1.1" 429 None                                                                       
[tumblr][debug] 429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts (5/5)

Edit:
The output was saved, but only partially:

$ tac kwwwsk.json | less
]
  ]
    "429: Limit Exceeded for url: https://api.tumblr.com/v2/blog/kwwwsk.tumblr.com/posts"                                                                 "HttpError",                                                             [
  ],
    }
      "type": "photo"
      "timestamp": 1373224748,
      "tags": [],
      "summary": "",
      "subcategory": "user",
      "state": "published",
      "source_url": "https://halfdry.tumblr.com/post/41362994761",
      "source_title": "halfdry",
      "slug": "",
      "should_open_in_legacy": true,
      "short_url": "https://tmblr.co/ZiZBVyp5EjX2",
      "recommended_source": null,                                                "recommended_color": null,                                                 "reblogged_root_uuid": "t:6ojkl9CKJp3WqQvxyD28HA",
      "reblogged_root_url": "https://halfdry.tumblr.com/post/41362994761",
      "reblogged_root_title": "HALFDRY",
      "reblogged_root_name": "halfdry",
      "reblogged_root_id": "41362994761",
      "reblogged_root_following": false,
      "reblogged_root_can_message": true,
      "reblogged_from_uuid": "t:BhM0qYlu7B527SCYCua0aA",
      "reblogged_from_url": "https://aoky97-deactivated20150328.tumblr.com/post/54846827770",

Edit 2:
I'm glad I requested this feature before: #337

Owner

mikf commented Aug 2, 2019

Tumblr's API has a rate limit of 1000 (?) requests per hour and there is even some special logic in place to wait until this limit has recovered, but I've managed to "break" it by changing stuff elsewhere. Sorry. I'll fix this ASAP.
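To illustrate the general idea of waiting for a rate limit to recover (this is a minimal sketch, not gallery-dl's actual code), a client can sleep and retry on a 429 instead of giving up after a fixed number of immediate attempts. The names `fetch`, `reset_after`, `max_waits`, and `default_wait` are all hypothetical, and the response is modeled as a plain tuple rather than a real HTTP response object.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical response shape: (status_code, seconds_until_reset_or_None, body)
Response = Tuple[int, Optional[float], object]


def fetch_with_rate_limit(
    fetch: Callable[[], Response],
    max_waits: int = 3,
    default_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call fetch(); on HTTP 429, wait for the limit to recover and retry."""
    waits = 0
    while True:
        status, reset_after, body = fetch()
        if status != 429:
            return body
        if waits >= max_waits:
            raise RuntimeError("rate limit did not recover after %d waits" % max_waits)
        # Prefer a server-provided reset time; otherwise fall back to a fixed wait.
        sleep(reset_after if reset_after is not None else default_wait)
        waits += 1
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a real client would take the reset time from whatever the server reports.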

> I'm glad I requested this feature before: #337

Yes, you can use the timestamp of the last post you managed to scrape as date-max and it will begin from there.
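A small sketch of how that resume point could be computed from a partial JSON dump. It assumes the layout visible in the output above: a top-level list whose entries are lists, some of which contain a metadata object with an integer `timestamp` field. The function name `resume_date_max` is hypothetical, and the output format assumes the `YYYY-MM-DDTHH:MM:SS` style used elsewhere in this thread, interpreted as UTC.

```python
from datetime import datetime, timezone


def resume_date_max(entries):
    """Return the oldest post timestamp as a date string, or None if none found."""
    timestamps = [
        item["timestamp"]
        for entry in entries
        if isinstance(entry, list)
        for item in entry
        if isinstance(item, dict) and isinstance(item.get("timestamp"), int)
    ]
    if not timestamps:
        return None
    # Posts arrive newest first, so the smallest timestamp is the last one scraped.
    oldest = datetime.fromtimestamp(min(timestamps), tz=timezone.utc)
    return oldest.strftime("%Y-%m-%dT%H:%M:%S")
```

Load the file with `json.load` and pass the resulting string as `date-max` on the next run. The post at that exact timestamp may be fetched again, so expect to deduplicate.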

Regarding your questions:

> Will any content be skipped?

No. If anything, you will get duplicate content: when a new post is added at the beginning, every post shifts one position, so the last post in one list of 50 reappears as the first post in the next list of 50.
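The shifting effect can be reproduced with a toy simulation of offset-based paging over a newest-first list. This is only an illustration with made-up integer post IDs and a page size of 5; `fetch_page` and `crawl` are hypothetical helpers, not part of gallery-dl or the Tumblr API.

```python
def fetch_page(posts, offset, limit):
    """Offset pagination over a newest-first list of post IDs."""
    return posts[offset:offset + limit]


def crawl(posts, limit, insert_after_page=None):
    """Page through posts; optionally publish one new post after a given page."""
    posts = list(posts)          # newest first
    next_id = max(posts) + 1     # ID of the post published mid-crawl
    seen = []
    offset = 0
    page_no = 0
    while True:
        page = fetch_page(posts, offset, limit)
        if not page:
            break
        seen.extend(page)
        offset += limit
        page_no += 1
        if page_no == insert_after_page:
            posts.insert(0, next_id)  # new post appears at the front
    return seen


original = list(range(10, 0, -1))              # IDs 10..1, newest first
seen = crawl(original, limit=5, insert_after_page=1)
duplicates = sorted({p for p in seen if seen.count(p) > 1})
missing = sorted(set(original) - set(seen))
print(duplicates, missing)
```

Every original post is still returned and one of them appears twice. The newly published post itself is not picked up by this run, since it sits before the current offset.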

> I wish the page size could be raised, e.g. from limit=50 to limit=500 or so.

50 posts per API request is the maximum. Setting a higher number has no effect and still only returns data for 50 posts.
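For a rough sense of scale, the figures in this thread can be combined into a back-of-the-envelope estimate. The 1000 requests per hour value is the uncertain number mentioned above, so treat the result as an order of magnitude only; `estimate_requests_and_hours` is a hypothetical helper.

```python
import math


def estimate_requests_and_hours(total_posts, per_request=50, requests_per_hour=1000):
    """Return (number of API requests, hours needed at the given rate limit)."""
    requests = math.ceil(total_posts / per_request)
    hours = requests / requests_per_hour
    return requests, hours


# 500k posts at 50 per request, under an assumed limit of 1000 requests/hour
requests, hours = estimate_requests_and_hours(500_000)
print(requests, hours)
```

Under those assumptions a 500k-post blog needs about 10,000 requests, i.e. around ten hours of wall-clock time even if nothing fails.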

indrakaw closed this as completed Aug 3, 2019
Author

indrakaw commented Aug 3, 2019

The "Limit Exceeded" error doesn't produce a non-zero exit code.

I was planning to run a loop and break on failure:

TUMBLOG=kwwwsk; \
for YEAR in {2019..2004}; do \
gallery-dl -o date-max=${YEAR}-01-01T00:00:00 -o date-min=$((YEAR - 1))-01-01T00:00:00 -vj https://${TUMBLOG}.tumblr.com > ${TUMBLOG}-${YEAR}.json || break; \
done

It doesn't work.
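One possible workaround, sketched under the assumption that a failed run ends up in the JSON output the way the partial dump earlier in this thread shows (a two-element list of strings such as `["HttpError", "429: ..."]`): inspect the output instead of relying on the exit code. `find_error_entries` is a hypothetical helper, and the entry shape is inferred from that one dump, so it may not hold for other versions or error types.

```python
def find_error_entries(entries):
    """Return entries shaped like ["ErrorName", "message"] from parsed -j output."""
    return [
        entry
        for entry in entries
        if isinstance(entry, list)
        and len(entry) == 2
        and all(isinstance(part, str) for part in entry)
    ]
```

A wrapper script could `json.load` each year's file after the run and stop the loop as soon as this returns a non-empty list.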
