add extractors for fantia and fanbox #1459
A huge thank you for adding extractors for both sites!
I hope I'm not too nitpicky about some of this ...
Don't worry about it, it's helpful.
ok, I think everything above is addressed
Do you plan to support saving video links (maybe as metadata) in Fanbox? I think for some posts the If you are interested I can share a dump with paid contents stripped, but so far I found no public sources.
@abslamp Sure, I didn't know fanbox had this feature. If you can share the whole post response (with the paid urls removed) then I can add it.
@thatfuckingbird Thanks for considering this! URL: https://www.fanbox.cc/@ayumasayu/posts/1737774 Formatted JSON with paid contents removed:
Please note that it does not provide the whole URL, but breaks it into 2 parts. For example, for
In the actual post, the video is shown as an embedded YouTube video.
@abslamp Hmm yea I would just add the "video" object to the metadata, if present. Problem is, in this example gallery-dl would produce no files for this post, so there would be no metadata written at all in the end... Maybe I could yield the YouTube URL and worst case it could be captured with the 'write unsupported URLs to file' option, but that won't work for other serviceProviders....
@thatfuckingbird You can use this page for testing. But I don't suggest adding it to the tests, because I will remove this page later. https://mkx-bot.fanbox.cc/posts/2130801 From the creator panel I found that currently fanbox supports YouTube, Vimeo, and SoundCloud only.
@abslamp Thanks, I will look at it in the coming days.
I've been playing around with this a bit more and found a few things in regards to URL patterns and fetching results. In summary:
- it should be possible to save a lot of requests to https://api.fanbox.cc/post.info
- use self.request(url, params=params) instead of building URLs yourself
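To illustrate the second point, here is a minimal sketch of why a params dict is preferable to string concatenation. It uses only the standard library; `build_url` and `API_ROOT` are hypothetical names for illustration, and in the extractor itself the dict would simply be handed to `self.request(url, params=params)` as suggested above.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.fanbox.cc"  # taken from the URLs quoted in this thread


def build_url(endpoint, params):
    """Join an endpoint and a params dict, percent-encoding the values.

    Hypothetical helper: it only mimics what an HTTP client does
    when it is given a params dict.
    """
    return "{}/{}?{}".format(API_ROOT, endpoint, urlencode(params))


# Hand-built string: works for simple values, but nothing gets escaped.
manual = API_ROOT + "/post.listCreator?creatorId=" + "xub" + "&limit=" + str(10)

# Params dict: identical result for simple values, escaping handled for us.
from_params = build_url("post.listCreator", {"creatorId": "xub", "limit": 10})
```

Both variants produce the same URL for plain values; the dict form keeps working when a value contains characters such as spaces or `&`.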
gallery_dl/extractor/fanbox.py
Outdated
def _pagination(self, url):
    headers = {"Origin": self.root}

    while url:
        url = text.ensure_http_scheme(url)
        body = self.request(url, headers=headers).json()["body"]
        for item in body["items"]:
            yield self._get_post_data(item["id"])

        url = body["nextUrl"]
The items returned from https://api.fanbox.cc/post.listCreator?creatorId=USER&limit=10 appear to be, at least for xub.fanbox.cc, more or less the same as the single-item results from https://api.fanbox.cc/post.info?postId=ID (*). It's only missing comments, imageForShare, and the entries about next/previous posts.
If that's true in general (posts with videos, embeds, etc), we don't need to fetch data from /post.info for every post and can use the items returned from /post.listCreator directly.
(*) Diff "/post.listCreator" - "/post.info" for post 2059366
"creatorId": "xub",
- "hasAdultContent": true,
- "commentList": {
- "items": [],
- "nextUrl": null
- },
- "nextPost": {
- "id": "2085876",
- "title": "Skeb Commission",
- "publishedDatetime": "2021-04-03 04:56:12"
- },
- "prevPost": {
- "id": "2009099",
- "title": "メスガキ〇〇★〇〇〇〇〇をわからせる絵",
- "publishedDatetime": "2021-03-12 19:01:08"
- },
- "imageForShare": "https://pixiv.pximg.net/c/1200x630_90_a2_g5/fanbox/public/images/post/2059366/cover/JuTXNtvo1BRN93cLW371vVd6.jpeg"
+ "hasAdultContent": true
}
Seems like you are right, at least I couldn't find any posts where listCreator didn't have all the content. Updated the code to use that directly, it's easy enough to change it back if it turns out it's needed.
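A rough sketch of the resulting pagination loop, which yields the list items directly instead of issuing one /post.info request per post. It is not the code from this PR: `fetch`, `fake_fetch` and `FAKE_PAGES` are hypothetical stand-ins so the example runs without network access, and only the "items" and "nextUrl" keys come from the snippet under review.

```python
def paginate(fetch, url):
    """Yield every item from a chain of pages linked by "nextUrl".

    `fetch` is a hypothetical callable that returns the decoded
    JSON "body" for a URL.
    """
    while url:
        body = fetch(url)
        # Use the list items as they are; no extra request per post.
        yield from body["items"]
        url = body["nextUrl"]


# Made-up two-page response used only to exercise the loop.
FAKE_PAGES = {
    "page1": {"items": [{"id": "1"}, {"id": "2"}], "nextUrl": "page2"},
    "page2": {"items": [{"id": "3"}], "nextUrl": None},
}
calls = []


def fake_fetch(url):
    calls.append(url)
    return FAKE_PAGES[url]


post_ids = [item["id"] for item in paginate(fake_fetch, "page1")]
```

With this shape, three posts cost two requests (one per page) rather than two list requests plus three detail requests.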
@thatfuckingbird @abslamp There should also be something useful in the test data from PixivUtil2: https://github.com/Nandaka/PixivUtil2/tree/master/test
@thatfuckingbird SoundCloud and Vimeo links added. Seems that SoundCloud is not working correctly.
Co-authored-by: Mike Fährmann <[email protected]>
Added support for embedded videos and addressed the comments above. Looking at the pixivutil test json files, I realized that article type posts and the imageMap/fileMap things are not handled correctly in the fanbox downloader. I also want to do another round of testing with the new changes. Going to do these soon.
docs/configuration.rst
Outdated
extractor.fanbox.videos
-----------------------
Type
    ``bool`` or ``string``
Default
    ``true``
Description
    Control behavior on videos embedded from external sites.
    Recognizes embeds from YouTube, Vimeo and SoundCloud.

    * ``true``: Extract video URLs (videos are not downloaded, as
      gallery-dl does not support these sites natively)
    * ``"ytdl"``: Download videos and let `youtube-dl`_ handle all of
      video extraction and download
    * ``false``: Ignore videos
I think this option should be simplified to just
- true: Download embedded media with youtube-dl (what is currently happening for "ytdl")
- false: Ignore videos
with just "Download embedded media from YouTube, Vimeo, and SoundCloud with youtube-dl" as description.
The current true would download the HTML content of those external sites and I don't think anyone would want that.
Hmm, true shouldn't download html, instead the links show up in the unsupported URLs (the --write-unsupported flag), though I can remove it, I don't particularly mind either way.
You are right, they would land in the "unsupported" file. I didn't realize it's using Message.Queue.
But that wouldn't use youtube-dl to download them, either. It's Message.Url to download and Message.Queue to potentially spawn a new extractor.
I would still only have true or false as options (because it's easier).
It'd need to use Message.Url for ytdl:… URIs if you want to keep all three.
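The distinction drawn above can be sketched as a small dispatch function. This is only an illustration under assumptions: `URL` and `QUEUE` are plain string stand-ins for gallery-dl's Message.Url and Message.Queue, and `embed_message` is a hypothetical helper, not part of the PR.

```python
URL = "Url"      # stand-in for Message.Url: download this URL
QUEUE = "Queue"  # stand-in for Message.Queue: hand the URL to another extractor


def embed_message(url, mode):
    """Decide how an embedded media URL is emitted for a three-way option.

    Hypothetical helper mirroring the true / "ytdl" / false values
    discussed in this review.
    """
    if mode == "ytdl":
        # A "ytdl:" prefixed URI has to go out as a download message.
        return (URL, "ytdl:" + url)
    if mode:
        # Queued URLs either spawn a matching extractor or are
        # reported as unsupported.
        return (QUEUE, url)
    # false: ignore the embed entirely.
    return None
```

Reducing the option to true/false would collapse this function to a single branch, which is the simplification being proposed.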
OK, almost done. Looked at pixivutil and based on the examples found there I added support for the remaining post types:
- "entry" type posts just contain a "html" entry with raw html in it, have to extract images from that (I followed what pixivutil does here for exactly what type of image URLs are downloaded)
- "article" type posts seem to be similar in purpose but a newer format, it has paragraphs (+styling info, etc.) as json objects. This is where fileMap/imageMap is used. Pixivutil actually parses the paragraphs from the json and generates a HTML with the written post content, but I think it is better if gallery-dl just saves the whole article json into the metadata, then it can be postprocessed later by the user. Fortunately files/images can simply be downloaded based on the contents of imageMap/fileMap.
Handling embeds was also extended, turns out there are other types of embeds, some which gallery-dl can handle itself (e.g. twitter). I renamed the "videos" option to "embeds" but the idea is mostly the same.
@abslamp You can delete the test posts now, thanks for the help. Fortunately some of the posts that pixivutil tests use are public so I added some tests for the above.
Only one problem remains, that is the handling of Fanbox embeds (a fanbox post embedded in another). For some reason, if I yield a fanbox URL while parsing a fanbox post, it gets ignored (written into the unsupported urls file) despite the same code working with yielding twitter URLs (and if I manually run gallery-dl with the yielded Fanbox url then it is recognized correctly).
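The imageMap/fileMap approach for "article" posts described in the comment above could look roughly like the following. This is a sketch, not the PR's code: `article_downloads` is a hypothetical helper, the sample data is made up, and the per-entry keys "originalUrl" and "url" are assumptions about the response layout that should be checked against a real post.

```python
def article_downloads(body):
    """Collect download URLs from an article post body.

    Assumes imageMap entries carry "originalUrl" and fileMap entries
    carry "url"; both key names are assumptions, not confirmed here.
    """
    urls = []
    for image in (body.get("imageMap") or {}).values():
        urls.append(image["originalUrl"])
    for file in (body.get("fileMap") or {}).values():
        urls.append(file["url"])
    return urls


# Made-up article body; the paragraph blocks stay untouched so the
# whole JSON can still be written to the metadata as-is.
sample_body = {
    "blocks": [{"type": "p", "text": "hello"}],
    "imageMap": {"img1": {"id": "img1", "originalUrl": "https://example.org/img1.png"}},
    "fileMap": {"f1": {"id": "f1", "url": "https://example.org/f1.zip"}},
}
found = article_downloads(sample_body)
```

Nothing here parses the paragraph blocks, which matches the idea of leaving the article text for the user to post-process from the saved metadata.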
You need to specify the expected final_post["_extractor"] = FanboxCreatorExtractor When there's no
Thanks, that did it. I think everything's done now.
@thatfuckingbird
Posts from 'https://api.fanbox.cc/post.listCreator' do not contain a 'body' with all images anymore. #1459 (comment)
Implemented post and user extractors for Fantia and Fanbox. Both use cookies for auth.
Doing it in 1 PR because these are pretty similar.
I added tests for both, however I don't know how to pass cookies to test_results.py so I didn't manage to run the Fantia tests.
The tests use freely available content, but registration (and for Fantia, subscription to the free plan) is likely needed.
Note that the test urls contain NSFW content (it's hard to find freely available + sfw on these sites).
Closes #849.
Closes #1260.
Closes #739.