Request: Add support for simply-hentai #89

Closed
ShyWest opened this issue May 26, 2018 · 10 comments
Comments

@ShyWest

ShyWest commented May 26, 2018

Would it be possible to add support for https://www.simply-hentai.com/? It's a hentai site similar to nhentai, hbrowse, and the like. I could try to do it myself, but there's no documentation on how to do it, and I would rather not submit a half-baked patch that will have to be reviewed and rewritten.

Every doujin/manga/gallery has its own main page containing a cover, the title, and metadata such as tags, language, author(s), number of pages, etc. Language info is not always present, and I believe one work can have several authors, but I can't find an example right now.

Depending on several factors, the URLs for these main pages can differ.

Each work has a page showing thumbnails for every page, and follows the structure (url)/all-pages, like this:
https://pokemon.simply-hentai.com/mao-friends-9bc39/all-pages

Each page can be viewed separately and their links follow the structure (url)/page/(page_id), like this: https://pokemon.simply-hentai.com/mao-friends-9bc39/page/4052558
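The two URL patterns above could be expressed as small helpers. This is only a sketch: the function names are made up for illustration and are not part of gallery-dl, and it assumes the gallery URL is the work's main page without a trailing path.

```python
def all_pages_url(gallery_url):
    """Return the thumbnail overview URL: (url)/all-pages."""
    return gallery_url.rstrip("/") + "/all-pages"


def page_url(gallery_url, page_id):
    """Return the URL of a single page: (url)/page/(page_id)."""
    return f"{gallery_url.rstrip('/')}/page/{page_id}"


def split_page_url(url):
    """Split a single-page URL into (gallery_url, page_id).

    Returns None if the URL does not follow the (url)/page/(page_id) pattern.
    """
    base, sep, page_id = url.rstrip("/").rpartition("/page/")
    if not sep or not page_id.isdigit():
        return None
    return base, int(page_id)
```

For example, `split_page_url` applied to the page link above yields the gallery URL and the numeric id 4052558, while a plain gallery URL yields `None`.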

There are also extra sections for GIF galleries and videos, which have URLs very similar to the previous ones, so some sort of detection would be needed to avoid trying to download a manga that isn't there.

Each work has an associated JSON file containing the URLs of the files themselves, following the structure (url)/all-pages.json, like this: https://pokemon.simply-hentai.com/mao-friends-9bc39/all-pages.json.

The content of said file is like this:

{
    "4052555": {
        "giant": "https://cdn2.sh-cdn.com/images/v2/vertical/giant_thumb/2017-09/Album/58880/4052555.jpg",
        "full": "https://cdn2.sh-cdn.com/images/v2/vertical/full/2017-09/58880/4052555.jpg",
        "path": "https://pokemon.simply-hentai.com/mao-friends-9bc39/page/4052555",
        "bookmarked": false
    },
    "4052558": {
        "giant": "https://cdn2.sh-cdn.com/images/v2/vertical/giant_thumb/2017-09/Album/58880/4052558.jpg",
        "full": "https://cdn2.sh-cdn.com/images/v2/vertical/full/2017-09/58880/4052558.jpg",
        "path": "https://pokemon.simply-hentai.com/mao-friends-9bc39/page/4052558",
        "bookmarked": false
    },
    ...
}

Each page is defined by an id, a giant thumb, the full image, the link to view said page, and whether or not it was bookmarked by the user. The giant thumb, although big, is smaller than the full page, so the full page (property full) is the one that should be downloaded. The only way to know the actual page number is by its position in the list.
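A minimal sketch of how the page order could be recovered from that file, assuming the JSON text keeps the pages in reading order (as described above). The function name is hypothetical; the sample data is the two entries quoted above, with the elided rest left out.

```python
import json

# The two entries shown above; the real file contains one entry per page.
SAMPLE = """
{
    "4052555": {
        "giant": "https://cdn2.sh-cdn.com/images/v2/vertical/giant_thumb/2017-09/Album/58880/4052555.jpg",
        "full": "https://cdn2.sh-cdn.com/images/v2/vertical/full/2017-09/58880/4052555.jpg",
        "path": "https://pokemon.simply-hentai.com/mao-friends-9bc39/page/4052555",
        "bookmarked": false
    },
    "4052558": {
        "giant": "https://cdn2.sh-cdn.com/images/v2/vertical/giant_thumb/2017-09/Album/58880/4052558.jpg",
        "full": "https://cdn2.sh-cdn.com/images/v2/vertical/full/2017-09/58880/4052558.jpg",
        "path": "https://pokemon.simply-hentai.com/mao-friends-9bc39/page/4052558",
        "bookmarked": false
    }
}
"""


def ordered_full_urls(all_pages_json):
    """Return [(page_number, page_id, full_url), ...] in file order.

    json.loads keeps the key order of the JSON text (dicts preserve
    insertion order in Python 3.7+), so the position in the file can
    serve as the page number.
    """
    pages = json.loads(all_pages_json)
    return [
        (number, page_id, entry["full"])
        for number, (page_id, entry) in enumerate(pages.items(), start=1)
    ]


for number, page_id, url in ordered_full_urls(SAMPLE):
    print(number, page_id, url)
```

Relying on key order is a design choice that only holds if the server emits the pages in reading order; sorting by id would be an alternative, but nothing above says the ids are guaranteed to be sequential.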

The main page for each work offers a download option, but it's just a list of filelockers to get an encrypted zip file.

As far as I know, the site doesn't offer an API.

Hope this info is useful.

@mikf
Owner

mikf commented May 26, 2018

> Hope this info is useful.

Why yes, this is very useful. That makes this a whole lot easier. Thanks.

Is it necessary (or would it be useful) to add login support or is everything available without being logged in?

@ShyWest
Author

ShyWest commented May 26, 2018

Everything is available to anonymous users. I wrote a custom script a while back and didn't run into any rate limits or throttling after downloading no fewer than one thousand pages. And that was before discovering the JSON file containing the full index, so I was crawling the whole thing. I did put one second of sleep between requests, though. I like to be nice to servers just in case.
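The one-second pause between requests could look roughly like this. It is a sketch, not the script mentioned above: `fetch` stands in for whatever function actually downloads a URL, and the `sleep` parameter exists only so the delay can be swapped out when testing.

```python
import time


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause before the very first request
            sleep(delay)
        results.append(fetch(url))
    return results
```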

You can bookmark and favorite works with an account, if you want to go the extra mile and add support for that. But the site is perfectly usable without one.

mikf added a commit that referenced this issue May 30, 2018
mikf added a commit that referenced this issue May 30, 2018
All videos hosted on their own servers seem to be dead,
but myhentai.tv embeds, which are most of the videos, work fine.
@mikf
Owner

mikf commented May 30, 2018

H-manga/galleries, single images and gifs, and even videos should work now.

I've noticed that the download speed for anything not cached by their CDN is incredibly slow and may even result in a read-timeout, but downloads still finish, given enough time, so I guess it's fine.

Videos hosted on their own servers are also all gone, but most of the videos listed are hosted on another service and they work just fine.

Anyway, notify me if you find anything that doesn't work the way it should.

@mikf mikf closed this as completed May 30, 2018
@ShyWest
Author

ShyWest commented May 31, 2018

Thanks, I have been trying it with a bunch of links and it's working nicely for the most part. The last time I crawled the site it didn't time out that often; I suppose they have changed the infrastructure of their CDN.

Anyway, I said "for the most part" because I found out that the all-pages.json file isn't as reliable as I thought. I found one link where the JSON file doesn't exist and the site gets stuck in an endless loop of redirects: https://original-work.simply-hentai.com/dolls-anzai-rina-hen-dolls-rina-anzais-story. Maybe gallery-dl should use a crawler strategy as a fallback in such cases?

Thanks again for your work and releasing it as Free Software with Linux support and all. It's appreciated.

@mikf
Owner

mikf commented Jun 1, 2018

I've tried the link you posted and it works just fine ... well, at least now it does. Maybe the site had some sort of hiccup when you tried or it needs some time to generate the all-pages.json file on demand?

I also tested the 10 newest and 10 oldest galleries to try to reproduce this problem, but to no avail, i.e. everything worked as it should.

@ShyWest
Author

ShyWest commented Jun 1, 2018

Nope, failed again for me, third day in a row. Removed my config file just in case, here's the log:

[gallery-dl][debug] Starting KeywordJob for 'https://original-work.simply-hentai.com/dolls-anzai-rina-hen-dolls-rina-anzais-story'
[simplyhentai][debug] Using SimplyhentaiGalleryExtractor for 'https://original-work.simply-hentai.com/dolls-anzai-rina-hen-dolls-rina-anzais-story'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): original-work.simply-hentai.com
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story HTTP/1.1" 200 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[urllib3.connectionpool][debug] https://original-work.simply-hentai.com:443 "GET /dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages.json HTTP/1.1" 301 None
[simplyhentai][error] HTTP request failed:  Exceeded 30 redirects.

I can't open the JSON file in my browser either; it gets redirected endlessly too. Same with wget. I tried other links and they work flawlessly. It's not gallery-dl's fault, but it's baffling.
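For what it's worth, a loop like the one in the log could be caught after the first repeated URL instead of after 30 requests. The sketch below is hypothetical and not how gallery-dl handles it: `fetch` is an assumed callable returning `(status, location)` for a URL without following redirects itself, and the log above doesn't show where each 301 actually points, so the self-redirect is an assumption.

```python
REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(url, fetch, max_redirects=30):
    """Follow redirects by hand and return the final URL.

    fetch(url) must return (status_code, location_or_None).
    Raises RuntimeError as soon as a URL repeats or the limit is exceeded.
    """
    seen = set()
    for _ in range(max_redirects + 1):
        if url in seen:
            raise RuntimeError(f"redirect loop detected at {url}")
        seen.add(url)
        status, location = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url  # not a redirect, so this is the final URL
        url = location
    raise RuntimeError(f"exceeded {max_redirects} redirects")
```

A caller could catch the `RuntimeError` and switch to a fallback, such as crawling the individual pages.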

@mikf mikf reopened this Jun 1, 2018
@mikf
Owner

mikf commented Jun 1, 2018

Since I don't have this infinite redirect problem, I kind of need to know what works and what doesn't on your side to fix this:

# should return the HTML version
$ wget --header='Accept: text/html' https://original-work.simply-hentai.com/dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages

# should get the same JSON data as all-pages.json would; or cause infinite redirects ...
$ wget --header='Accept: application/json' https://original-work.simply-hentai.com/dolls-anzai-rina-hen-dolls-rina-anzais-story/all-pages

@ShyWest
Author

ShyWest commented Jun 1, 2018

Yes, I can access it. It works as intended, thumbnails and all.

Response from the first command: https://pastebin.com/A8FSZE5Z

Response from the second command: https://pastebin.com/Gmp92KzX

That trick with the header worked. The server still refuses to serve me the json file using the proper URL.

@mikf
Owner

mikf commented Jun 1, 2018

I've changed the HTTP request to .../all-pages. Hopefully it works for all galleries now.

The Accept header thing is something I found by accident, basically: I wanted to get the thumbnail links from the .../all-pages page and convert them to their original form, but it served me JSON data instead of HTML.

As it turns out, the webserver only sends the HTML version if you send an Accept: text/html header, like a browser would, or JSON for Accept: */* and Accept: application/json, or a 404 Not Found otherwise.
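In Python, the same request could be sketched with the standard library as follows. This is an illustration of the Accept header behaviour described above, not the actual gallery-dl code; the function names are made up, and the server behaviour is only as observed in this thread.

```python
import json
import urllib.request


def all_pages_request(gallery_url, accept="application/json"):
    """Build a request for (url)/all-pages with an explicit Accept header.

    "application/json" (or "*/*") is reported to return the JSON index,
    "text/html" the thumbnail page.
    """
    url = gallery_url.rstrip("/") + "/all-pages"
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_all_pages(gallery_url, timeout=30):
    """Download and decode the JSON index of a gallery (performs a network request)."""
    request = all_pages_request(gallery_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Requesting .../all-pages with the header avoids the .../all-pages.json URL entirely, which is the one that ended up in the redirect loop.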

@ShyWest
Author

ShyWest commented Jun 3, 2018

Go figure. Can't decide whether that's clever or obscure web design. The last patch seems to work fine on my end. Thank you for your time, again.

@mikf mikf closed this as completed Jun 8, 2018
@mikf mikf added the nsfw label Jul 17, 2019