[kemono.party - Patreon] Inconsistencies downloading main files vs attachments #1899

Closed
ghost opened this issue Sep 28, 2021 · 10 comments

@ghost

ghost commented Sep 28, 2021

[This might look like a wall of text, but I don't think it's actually that much information. Thanks in advance.]

I am attempting to download some files from kemono.party, but the behaviour of the downloader seems inconsistent depending on whether the target post has its content uploaded as files or attachments, and which ones are duplicates (because of course that's still a problem on kemono.party). I am using gallery-dl 1.18.4-dev.

Target URLs [no nudity, but NSFW]:

https://kemono.party/patreon/user/4577256/post/53549884 (no content, 5 attachments, file 1 is duplicated)
https://kemono.party/patreon/user/4577256/post/52864412 (no content, 3 attachments, file 2 is duplicated)
https://kemono.party/patreon/user/4577256/post/50117542 (2 inline files in content, no attachments)

It might be worth noting that link 2 doesn't have any images listed under "content" on the page, but if you look at the image URLs you can see that the first image is under hostname/files/etc and the others are hostname/attachments/etc

The JSON for my gallery-dl config file:

"kemonoparty":
{
	<cookie data>
	"filename": {
		"service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
		""		      : "{id}-{num}.{extension}"
	},
	"image-filter": "extension != 'psd'"
}

I have configured it this way to force all Patreon attachment filenames to use underscores instead of spaces, which protects against duplicate files with slightly different filenames. It has worked for me for several months.
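For readers unfamiliar with the trick, here is a minimal Python sketch of why the underscore replacement helps. It only imitates what the `{filename:R /_/}` format spec does to a name; the file names are made up for illustration, and the `seen` set stands in for gallery-dl skipping a file that already exists on disk under the same name.

```python
def normalize(filename: str) -> str:
    """Replace every space with an underscore, imitating {filename:R /_/}."""
    return filename.replace(" ", "_")


# Hypothetical names: the first two are the same upload stored twice
# with slightly different spellings.
names = ["page 01", "page_01", "page 02"]

seen = set()
unique = []
for name in names:
    key = normalize(name)
    if key in seen:
        # Same normalized name: the second copy would collide with the
        # first one on disk and be skipped.
        continue
    seen.add(key)
    unique.append(key)

print(unique)  # ['page_01', 'page_02']
```

The drawback described below follows directly: once the name comes from `{filename}` instead of `{num}`, files sort by name rather than by their position in the post.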

When using this config, I downloaded all images except for animation 1 from link 2, and there were no duplicates, but because of the filenames the order of each picture was jumbled. I tried to change the JSON to download everything and put them in the correct order:

"kemonoparty":
{
	<cookie data>
	"filename": {
		"service == 'patreon'": "{id}-{num}.{extension}",
		""		      : "{id}-{num}.{extension}"
	},
	"image-filter": "extension != 'psd'"
}

This config put the filenames in order, but it still didn't download the picture that was missing with the first config, and it downloaded the duplicate animation from link 2.

I tried to find usable keywords/filters for the filename with gallery-dl -K [link 2], but that did not help: according to gallery-dl, the num (index) of each picture in that link starts at 1 with the duplicate animations. Even when I remove the distinction between Patreon and other services (or remove the filename block entirely), gallery-dl does not download the first animation.

In summary:

  • What is causing the first animation in link 2 not to be downloaded?
  • Does gallery-dl distinguish between inline content, "files" content, and "attachment" content when downloading from a Patreon service on kemono.party?
  • Have I simply configured something wrong?

For reference, here is the command and verbose output when using the second config.

gallery-dl --verbose --dest . -o directory=[] -i targets.txt
[gallery-dl][debug] Version 1.18.4-dev
[gallery-dl][debug] Python 3.8.5 - Windows-7-6.1.7601-SP1
[gallery-dl][debug] requests 2.24.0 - urllib3 1.25.10
[1/3] https://kemono.party/patreon/user/4577256/post/53549884
[gallery-dl][debug] Starting DownloadJob for 'https://kemono.party/patreon/user/4577256/post/53549884'
[kemonoparty][debug] Using KemonopartyPostExtractor for 'https://kemono.party/patreon/user/4577256/post/53549884'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): kemono.party:443
[urllib3.connectionpool][debug] https://kemono.party:443 "GET /api/patreon/user/4577256/post/53549884 HTTP/1.1" 200 None
# .\53549884-1.png
# .\53549884-2.png
# .\53549884-3.png
# .\53549884-4.png
[2/3] https://kemono.party/patreon/user/4577256/post/52864412
[gallery-dl][debug] Starting DownloadJob for 'https://kemono.party/patreon/user/4577256/post/52864412'
[kemonoparty][debug] Using KemonopartyPostExtractor for 'https://kemono.party/patreon/user/4577256/post/52864412'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): kemono.party:443
[urllib3.connectionpool][debug] https://kemono.party:443 "GET /api/patreon/user/4577256/post/52864412 HTTP/1.1" 200 None
# .\52864412-1.gif
# .\52864412-2.gif
[3/3] https://kemono.party/patreon/user/4577256/post/50117542
[gallery-dl][debug] Starting DownloadJob for 'https://kemono.party/patreon/user/4577256/post/50117542'
[kemonoparty][debug] Using KemonopartyPostExtractor for 'https://kemono.party/patreon/user/4577256/post/50117542'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): kemono.party:443
[urllib3.connectionpool][debug] https://kemono.party:443 "GET /api/patreon/user/4577256/post/50117542 HTTP/1.1" 200 None
# .\50117542-1.png
# .\50117542-2.png
@mikf
Owner

mikf commented Sep 30, 2021

What is causing the first animation in link 2 not to be downloaded?

The patreon-skip-file option. (#1689, 4864748)
In all Patreon posts on kemono that I've seen until now, it was always the main file that was a duplicate of another attachment file, but that doesn't seem to always hold true. (#1751)

Does gallery-dl distinguish between inline content, "files" content, and "attachment" content when downloading from a Patreon service on kemono.party?

There's a type metadata field that is either "file", "attachment", or "inline".
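As a rough illustration of how that field could be used, the sketch below groups a post's files by `type`. The entries are hypothetical stand-ins for the per-file metadata that `gallery-dl -K` prints; only the three `type` values come from this thread. Presumably the same field can also be referenced in expression options such as the `image-filter` shown in the config above, but that is an assumption worth checking against the documentation.

```python
from collections import defaultdict

# Hypothetical per-file metadata; only the "type" values are taken
# from the thread ("file", "attachment", "inline").
files = [
    {"num": 1, "type": "file", "filename": "cover"},
    {"num": 2, "type": "attachment", "filename": "page_01"},
    {"num": 3, "type": "attachment", "filename": "page_02"},
    {"num": 4, "type": "inline", "filename": "sketch"},
]


def group_by_type(entries):
    """Map each type value to the filenames that carry it, in post order."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["type"]].append(entry["filename"])
    return dict(groups)


by_type = group_by_type(files)
print(by_type["attachment"])  # ['page_01', 'page_02']
```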

Have I simply configured something wrong?

You haven't. It's just that every attempt at fixing this "duplicate files for patreon posts" issue has failed so far, including the current "ignore main file if there are attachments".
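To make the failure mode concrete, here is a small Python sketch of an "ignore main file if there are attachments" rule. It is not gallery-dl's actual code; `select_files`, the `post` layout, and the file names are all hypothetical. It only shows why the rule works for the usual layout and breaks for a post like link 2.

```python
def select_files(post, skip_main_file=True):
    """Return the files to download, dropping the main file when the
    post also has attachments (the heuristic described above)."""
    main = post.get("file")
    attachments = post.get("attachments", [])
    selected = []
    if main and not (skip_main_file and attachments):
        selected.append(main)
    selected.extend(attachments)
    return selected


# Usual layout: the main file repeats one of the attachments.
typical = {"file": "a.png", "attachments": ["a.png", "b.png"]}

# Layout like link 2 (names made up): the main file is unique and the
# duplicate sits among the attachments.
unusual = {"file": "anim1.gif", "attachments": ["anim2.gif", "anim2.gif"]}

print(select_files(typical))  # ['a.png', 'b.png']
print(select_files(unusual))  # ['anim2.gif', 'anim2.gif']
```

In the second case the unique main file is lost and the duplicate is kept, which matches the symptoms reported for link 2.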

@TestPolygon

TestPolygon commented Oct 3, 2021

BTW, for new files, the SHA-256 hash taken from the URL can be used to determine whether the files are the same or merely have the same name.
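A minimal sketch of that idea in Python, assuming the hash appears as a 64-character lowercase hex file name at the end of the URL path. The exact path layout of the new format is not given in this thread, so the "new format" URL below is a made-up example; the old-format URL is one quoted later in the thread.

```python
import re
from urllib.parse import urlparse

# Assumption: the last path segment is "<64 hex chars>" plus an
# optional extension.
HASH_RE = re.compile(r"/([0-9a-f]{64})(?:\.[^/]*)?$")


def sha256_from_url(url):
    """Return the hash embedded in the URL, or "" if there is none."""
    match = HASH_RE.search(urlparse(url).path)
    return match.group(1) if match else ""


def same_file(url_a, url_b):
    """True only when both URLs carry a hash and the hashes are equal."""
    hash_a = sha256_from_url(url_a)
    return bool(hash_a) and hash_a == sha256_from_url(url_b)


fake_hash = "ab" * 32  # placeholder, not a real digest
new_url = f"https://kemono.party/data/{fake_hash}.png"  # hypothetical layout
old_url = "https://kemono.party/data/files/gumroad/trylsc/IURjT/reward8.jpg"

print(sha256_from_url(new_url) == fake_hash)  # True
print(sha256_from_url(old_url) == "")         # True
```

Files served under the old URL format carry no hash, so they cannot be compared this way.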

@ghost
Author

ghost commented Oct 4, 2021

Ah, I see. Thanks for clearing that up. I suppose I'll just have to download everything and manually remove duplicates, then.

The patreon-skip-file option. (#1689, 4864748) In all Patreon posts on kemono that I've seen until now, it was always the main file that was a duplicate of another attachment file, but that doesn't seem to always hold true. (#1751)

Yeah. I think I made the issue that led to that option being included, actually. Heh.

Does gallery-dl distinguish between inline content, "files" content, and "attachment" content when downloading from a Patreon service on kemono.party?

There's a type metadata field that is either "file", "attachment", or "inline".

That's good to know. There may be something I use that for.

Have I simply configured something wrong?
You haven't. It's just that every attempt at fixing this "duplicate files for patreon posts" issue has failed so far, including the current "ignore main file if there are attachments".

Well, for what it's worth, the "ignore main file if there are attachments" approach does filter out the vast, vast majority of duplicates and it's mostly solved kemono's data duplication. I just seem to have found an artist or a post that happens to store data differently.

BTW, for new files, the SHA-256 hash taken from the URL can be used to determine whether the files are the same or merely have the same name.

Is there a download comparison option in gallery-dl that does that? I've looked through some of the comparison options in the config documentation but I don't remember seeing something like that.

@TestPolygon

It's the new URL format introduced 4 days ago. Currently, not all files use it.

@skyvory

skyvory commented Oct 12, 2021

There are some cases where the images aren't posted in the 'files' area but in the 'content' area, and the downloader skips those. The images aren't links, just inline.

@mikf
Owner

mikf commented Oct 13, 2021

@TestPolygon

Currently, not all files use it.

And they still do not, even more than a week later. Maybe these changes only got applied to patreon posts.

$ gallery-dl -g https://kemono.party/gumroad/user/trylsc/post/IURjT
https://kemono.party/data/files/gumroad/trylsc/IURjT/reward8.jpg
https://kemono.party/data/attachments/gumroad/trylsc/IURjT/$3.zip

@skyvory inline images are supposed to be supported, unless the URLs in newer posts got changed and aren't picked up by gallery-dl.

$ gallery-dl -g https://kemono.party/fanbox/user/7356311/post/802343
https://kemono.party/data/inline/fanbox/uaozO4Yga6ydkGIJFAQDixfE.jpeg

@ghost
Author

ghost commented Oct 13, 2021

@mikf
For the particular artist that I wanted to download, another factor may be that the inline images are links to an outside source (Imgur) instead of being direct uploads to Kemono. I'm not exactly sure how Patreon allows creators to upload images to posts, but if we look at https://kemono.party/patreon/user/4577256/post/53013824, and right click > view image/open image in new tab, we stay on kemono.party.

For my artist, if you look at https://kemono.party/patreon/user/4577256/post/53013824 (mostly SFW, some minor nudity) and right click > view image/open image in new tab, you are redirected to an Imgur page.

I'm not sure if this is something gallery-dl accounts for when crawling kemono patreon posts. From some minor testing, it doesn't seem to recognize that these embedded/inline images are even there.

In any event, the workaround that I'm using now is simple but somewhat tedious using JDownloader 2:

  • Go to kemono Patreon post and view page source
  • Copy block of elements and URLs
  • Paste into JDownloader's LinkGrabber feature
  • JDownloader gets the images from Imgur with no trouble
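The manual steps above could be approximated with a short script. The following is only a sketch using Python's standard library: it pulls `<img>` sources out of a block of post HTML and keeps those hosted outside kemono.party. The sample HTML and its URLs are invented for illustration, and the script does not fetch anything.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def external_images(html, own_host="kemono.party"):
    """Return image URLs whose host is neither empty (relative) nor own_host."""
    parser = ImageSrcParser()
    parser.feed(html)
    return [src for src in parser.sources
            if urlparse(src).netloc not in ("", own_host)]


# Hypothetical post content: one externally hosted image, one local one.
sample = (
    '<p>Preview:</p>'
    '<img src="https://i.imgur.com/example1.png">'
    '<img src="/inline/example2.jpg">'
    '<a href="https://example.com/page">link</a>'
)

print(external_images(sample))  # ['https://i.imgur.com/example1.png']
```

The resulting list can then be handed to any downloader that understands the external host.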

@valdearg

valdearg commented Oct 17, 2021

Not sure if this is the best place, apologies. But I noticed with this URL that the main attachment 404s and the inline image isn't available to download:

https://kemono.party/patreon/user/7453087/post/33060907

Not too sure how that differs from the one posted earlier, which does come through as an inline post. Most likely because it has both a file and an inline image?

https://kemono.party/fanbox/user/7356311/post/802343

@mikf
Owner

mikf commented Oct 18, 2021

@valdearg fixed in db857b4. The inline image URL there started with https://kemono.party/ instead of the expected /inline.
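For illustration only (this is not the code from db857b4), a Python sketch of accepting both spellings of an inline image source: the relative `/inline/...` form and the absolute form that starts with the site root. The function name and sample paths are hypothetical.

```python
from urllib.parse import urljoin, urlparse

ROOT = "https://kemono.party"


def normalize_inline(src, root=ROOT):
    """Return an absolute URL for an inline image hosted on the site,
    or None if src points elsewhere or is not under /inline/."""
    absolute = urljoin(root + "/", src)  # absolute inputs pass through unchanged
    parsed = urlparse(absolute)
    if parsed.netloc != urlparse(root).netloc:
        return None  # hosted on another site
    if not parsed.path.startswith("/inline/"):
        return None
    return absolute


relative = normalize_inline("/inline/fanbox/example.jpeg")
absolute = normalize_inline("https://kemono.party/inline/fanbox/example.jpeg")
external = normalize_inline("https://i.imgur.com/example.png")

print(relative == absolute)  # True
print(external)              # None
```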

@valdearg

You're amazing! Thanks, that's got it!

mikf added a commit that referenced this issue Nov 17, 2021
Extract the SHA-256 file hash from URLs
and skip files with the same hash in the same post.

- provide a 'hash' metadata field (empty string if not available)
- remove 'patreon-skip-file' option
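
The behaviour described in this commit message can be pictured with the following Python sketch. It is not the committed code: `skip_duplicates` and the sample entries are hypothetical, and it only assumes what the message states, namely that each file has a `hash` field that is an empty string when no hash is available.

```python
def skip_duplicates(files):
    """Keep the first file for each hash within one post.

    Files with an empty hash are always kept, since nothing is known
    about their content.
    """
    seen = set()
    kept = []
    for entry in files:
        file_hash = entry.get("hash", "")
        if file_hash:
            if file_hash in seen:
                continue  # same content already selected in this post
            seen.add(file_hash)
        kept.append(entry)
    return kept


# Hypothetical post: the main file and the third attachment share a hash,
# and two old-format files have no hash at all.
post_files = [
    {"type": "file", "filename": "anim1", "hash": "aa" * 32},
    {"type": "attachment", "filename": "anim2", "hash": "bb" * 32},
    {"type": "attachment", "filename": "anim1_copy", "hash": "aa" * 32},
    {"type": "attachment", "filename": "old1", "hash": ""},
    {"type": "attachment", "filename": "old2", "hash": ""},
]

print([f["filename"] for f in skip_duplicates(post_files)])
# ['anim1', 'anim2', 'old1', 'old2']
```

Unlike the earlier main-file heuristic, this does not depend on which slot the duplicate occupies.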
@mikf mikf closed this as completed Dec 3, 2022