[patreon] Patreon post embedded images (type == 'content') always have the same filename #1954

Closed
JessiAuro opened this issue Oct 15, 2021 · 9 comments

@JessiAuro

Currently, I'm using this string for filename:

"filename": "{date} - {id} - {title} - {type}_{filename[0:50]}.{extension}"

Whenever gallery-dl encounters a post with multiple inline images, it seems to report the filename as "1" for every single image, resulting in only the first image being downloaded with the above string.

The images do appear to have filenames, or at least manually downloading them with a web browser results in unique names.

If this is intentional behavior (if extracting the filename isn't possible), it should probably report num as the filename.

To work around this, I've revised my config:

"filename": {
    "type == 'content'": "{date} - {id} - {title} - {type}_{num:>03}.{extension}",
    "":                  "{date} - {id} - {title} - {type}_{filename[0:50]}.{extension}"
},
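
To illustrate why the first template kept only one image and why the workaround helps, here is a minimal Python sketch. It emulates the two templates with plain f-strings rather than gallery-dl's own formatter, and the metadata values are invented for the example.

```python
# Illustrative only: plain Python f-strings stand in for gallery-dl's
# formatter, and the metadata below is invented for this example.

def name_by_filename(meta):
    # Mirrors "{date} - {id} - {title} - {type}_{filename[0:50]}.{extension}"
    return (f"{meta['date']} - {meta['id']} - {meta['title']} - "
            f"{meta['type']}_{meta['filename'][0:50]}.{meta['extension']}")


def name_by_num(meta):
    # Mirrors "{date} - {id} - {title} - {type}_{num:>03}.{extension}"
    return (f"{meta['date']} - {meta['id']} - {meta['title']} - "
            f"{meta['type']}_{meta['num']:>03}.{meta['extension']}")


# Three inline images from one hypothetical post, all reported as filename "1"
images = [
    {"date": "2021-10-15", "id": 12345, "title": "Example", "type": "content",
     "filename": "1", "extension": "jpg", "num": n}
    for n in (1, 2, 3)
]

by_filename = {name_by_filename(m) for m in images}
by_num = {name_by_num(m) for m in images}

print(len(by_filename))  # 1: every image maps to the same path
print(len(by_num))       # 3: one distinct path per image
```

Because all three images resolve to the same path under the first template, only the first one gets written; the `{num:>03}` variant yields `001`, `002`, `003`.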
@JessiAuro JessiAuro changed the title Patreon post embedded images (type == 'content') always have the same filename [patreon] Patreon post embedded images (type == 'content') always have the same filename Oct 15, 2021
@mikf
Owner

mikf commented Oct 16, 2021

> The images do appear to have filenames, or at least manually downloading them with a web browser results in unique names.

I used https://www.patreon.com/posts/19987002 as an example when implementing embedded images, and those images have no filename other than 1.jpg when downloaded in a browser (their Content-Disposition header is just inline, no filename="…" or anything).

Fetching the original filename is already done for image files when necessary, so this shouldn't be hard to also do for content files.
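
For reference, the difference between a bare `inline` header and one that carries a filename can be checked with a few lines of standard-library Python. This is only a sketch of how such a header value can be parsed; it is not taken from gallery-dl's code, and the helper name and the sample filename are made up.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() looks up the "filename" parameter of this header
    return msg.get_filename()


# A bare "inline" header carries no filename at all
print(filename_from_content_disposition("inline"))

# A header with an explicit filename parameter (hypothetical value)
print(filename_from_content_disposition('inline; filename="cover art.png"'))
```

The first call returns `None`, which matches the case where a browser falls back to the last URL path segment such as 1.jpg.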

@JessiAuro
Author

> (their Content-Disposition header is just inline, no filename="…" or anything)

Just checked and filename="etc" does seem to be set for the posts I'm looking at, though I'm using Firefox if that affects the page rendering.


Would it be possible to implement some kind of file-unique identifier as an alternative to {num}? Perhaps something like {file_id} or {file_hash}.

Most URLs seem to contain some kind of unique identifier for files that could populate this (or be fed into a hash function of some kind). Take the embedded images mentioned above, for example:

<img data-media-id="<<probably unique?>>" src="https://c10.patreonusercontent.com/3/<<...>>/patreon-media/p/post/<<post_id>>/<<probably unique?>>/1.jpg?token-time=<...>&amp;token-hash=<<...>>">

Or, for a clearer example, attachment URLs:

<a href="https://www.patreon.com/file?h=<<...>>&amp;i=<<probably unique?>>" class="<<...>>">...</a>

These could probably only be guaranteed to be unique within a specific extractor, but that should be good enough.

Having a unique file identifier would be nice for ensuring all unique assets are downloaded, even in cases where posts/galleries are updated in a manner that would throw off a simple index like {num}.
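
A rough Python sketch of the idea, under stated assumptions: the `i` query parameter is treated as the attachment's identifier, and for image URLs the path is assumed to stay stable while the token parameters vary. Neither assumption is confirmed anywhere in this thread, `file_identifier` is a hypothetical helper, and the URLs in the usage lines are placeholders.

```python
import hashlib
import html
from urllib.parse import parse_qs, urlsplit


def file_identifier(url):
    """Return an identifier that does not depend on a file's position in a post."""
    # URLs copied from HTML source contain "&amp;", so decode entities first
    parts = urlsplit(html.unescape(url))
    query = parse_qs(parts.query)

    # Attachment links: ASSUME the "i" query parameter identifies the file
    if "i" in query:
        return query["i"][0]

    # Otherwise hash the path only, so that the token-time/token-hash
    # query parameters do not affect the result
    return hashlib.sha1(parts.path.encode("utf-8")).hexdigest()


# Placeholder URLs shaped like the examples above
attachment = "https://www.patreon.com/file?h=111&amp;i=222"
image_a = "https://example.invalid/p/post/111/abc/1.jpg?token-time=1&token-hash=x"
image_b = "https://example.invalid/p/post/111/abc/1.jpg?token-time=2&token-hash=y"

print(file_identifier(attachment))
print(file_identifier(image_a) == file_identifier(image_b))
```

The attachment case returns the `i` value directly, and the two image URLs map to the same digest because they differ only in their token parameters.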

@mikf
Owner

mikf commented Oct 18, 2021

content images now have a better filename if available: 6695ef2


> Would it be possible to implement some kind of file-unique identifier as an alternative to {num}? Perhaps something like {file_id} or {file_hash}.

There is {hash}, which is the "unique identifier" you are talking about, but I think it is not 100% guaranteed to have a value.

@JessiAuro
Author

> There is {hash}, which is the "unique identifier" you are talking about, but I think it is not 100% guaranteed to have a value.

Ah okay, I haven't seen/noticed it show up with the extractors I'm using, so I'm guessing it hasn't been implemented by many. I'll have to tinker with it some more. Closing issue.

@mikf
Owner

mikf commented Oct 18, 2021

Oh, you meant a unique identifier for all sites?
{hash} exists only on patreon, *booru sites have {md5}, but I don't think something like that exists for other sites. Your best bet should be the archive_fmt value of each extractor (-E), which is supposed to be unique.

@JessiAuro
Author

Ah, I see for Patreon it's {id}_{num}. The problem with that is {num} can change if the original post changes, whereas something derived from a unique id shouldn't.

Though this is all dependent on whether or not it's possible to get a unique id for every kind of file on Patreon.
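
To make the concern concrete, here is a small Python simulation. It does not use gallery-dl; it only models an archive as a set of key strings, and the post id and file ids are invented.

```python
def keys_by_position(post_id, file_ids):
    # Emulates an "{id}_{num}" key: num is the 1-based position in the post
    return {fid: f"{post_id}_{num}" for num, fid in enumerate(file_ids, 1)}


def keys_by_file_id(post_id, file_ids):
    # Emulates a key derived from a per-file identifier instead
    return {fid: f"{post_id}_{fid}" for fid in file_ids}


def to_download(archive, keys):
    # A file is downloaded only if its key is not already archived
    return [fid for fid, key in keys.items() if key not in archive]


before = ["aaa", "bbb"]          # files in the post at the first run
after = ["new", "aaa", "bbb"]    # the author later inserts a file at the top

for make_keys in (keys_by_position, keys_by_file_id):
    archive = set(make_keys(1, before).values())
    print(make_keys.__name__, to_download(archive, make_keys(1, after)))
# keys_by_position ['bbb']
# keys_by_file_id ['new']
```

With position-based keys the new file is skipped (its key `1_1` is already archived) and an old file is fetched again; with id-based keys only the new file is fetched.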

@JessiAuro
Author

Also, is it possible to reference archive_fmt in a format string or do I just have to copy its value?

@mikf
Owner

mikf commented Oct 18, 2021

> Ah, I see for Patreon it's {id}_{num}. The problem with that is {num} can change if the original post changes, whereas something derived from a unique id shouldn't.

{id}_{num} is obviously not ideal, but it is kept for backwards compatibility. {hash} was added quite a bit after the initial archive format was set, and invalidating all already existing archive keys is something I'd rather avoid.
You can set a custom archive format with archive-format and archive-prefix, by the way.

> Also, is it possible to reference archive_fmt in a format string or do I just have to copy its value?

Not possible. You'll have to copy (and possibly adjust) the format string replacement fields yourself.
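
As a sketch of what such an override might look like, the Python snippet below prints a config fragment that sets archive-format for the patreon extractor. The choice of `{id}_{hash}` is an assumption for illustration only: as noted earlier in the thread, {hash} is not guaranteed to have a value, and entries already archived under the old format would no longer match.

```python
import json

# Hypothetical override: key archive entries by post id plus {hash}
# instead of the default {id}_{num}. Adjust the fields to taste.
config = {
    "extractor": {
        "patreon": {
            "archive-format": "{id}_{hash}",
        }
    }
}

# Print the fragment so it can be merged into an existing config file
print(json.dumps(config, indent=4))
```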

@JessiAuro
Author

> You can set a custom archive format with archive-format and archive-prefix, by the way.

Awesome, I'll have to tinker with that.

> backwards compatibility.

It would probably be a good idea to add some kind of version to the database, so you can make changes like this while properly handling older versions. Or maybe just include the format string used to create the id for each row in the database, assuming it wouldn't be too computationally intensive to check against that.
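
A minimal sketch of the second idea, storing the format string next to each entry. The table layout and helper names here are hypothetical and are not gallery-dl's actual archive schema; the entries are invented.

```python
import sqlite3


def open_archive(path=":memory:"):
    con = sqlite3.connect(path)
    # Hypothetical schema: each entry remembers the format that produced it
    con.execute(
        "CREATE TABLE IF NOT EXISTS archive ("
        "entry TEXT NOT NULL, format TEXT NOT NULL, "
        "PRIMARY KEY (entry, format))"
    )
    return con


def add(con, entry, fmt):
    con.execute(
        "INSERT OR IGNORE INTO archive (entry, format) VALUES (?, ?)",
        (entry, fmt),
    )


def contains(con, candidates):
    """Check (entry, format) pairs for one file, one pair per known format."""
    for entry, fmt in candidates:
        row = con.execute(
            "SELECT 1 FROM archive WHERE entry = ? AND format = ?",
            (entry, fmt),
        ).fetchone()
        if row is not None:
            return True
    return False


con = open_archive()
# An entry written under the old format
add(con, "patreon123_1", "{id}_{num}")

# A later run computes keys under both the new and the old format
candidates = [("patreon123_abc", "{id}_{hash}"), ("patreon123_1", "{id}_{num}")]
print(contains(con, candidates))  # True: matched via the old format
```

The cost is one extra lookup per known format for files that are not yet archived, which is the overhead the comment above is wondering about.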
