Custom filename transformations (was: Filename: I want to restrict characters, but permit spaces) #5042

Open
keybounce opened this issue Feb 23, 2015 · 14 comments

@keybounce

Right now, restricting filenames to reasonable characters also prevents spaces.

Yea, I know, most unix scripts/etc. can't handle spaces, but most modern GUI tools can.

@phihag
Contributor

phihag commented Feb 23, 2015

If you are using a modern GUI tool, why are you restricting characters in the first place?

@keybounce
Author

Because I want readable file names.

The code of modern GUI programs may be 8-bit smart, but I want to be able to read filenames.

And, as long as the only issue is spaces, most of what I want to do from the command line works -- bash escapes spaces. Just as long as it's either a real program, or a well-written script (and I've gotten quite good with the whole "$var" syntax all over everything ... yuck, why wasn't this in the shell design from day one?)

@phihag
Contributor

phihag commented Feb 24, 2015

Can you elaborate on why filenames would get more readable when you pass in --restrict-filenames? I'd very much prefer the first filename:

$ youtube-dl --get-filename F59zpvPg3i0
新年快樂 - 小虎隊、憂歡派對 (1989)-F59zpvPg3i0.mp4
$ youtube-dl --get-filename F59zpvPg3i0 --restrict-filenames
1989-F59zpvPg3i0.mp4

As it stands, this issue is lacking context, and without context, we usually close issues. So please do provide a more detailed example of why you'd wanna restrict filenames in the first place.

@keybounce
Author

Here is a directory listing

keybounceMBP:Etho michael$ ls
total 5993416
257872 Etho Plays Minecraft - Episode 390 Connected Houses.mp4
224464 Etho Plays Minecraft - Episode 391 River Terraforming.mp4
267000 Etho Plays Minecraft - Episode 392 Book Matrix.mp4
242904 Etho Plays Minecraft - Episode 394 Flying Sheep Farm.mp4
257096 Etho Plays Minecraft - Episode 395 Weird Style.mp4
239696 Etho Plays Minecraft - Episode 396 Hyper Speed Piggy.mp4
261672 Etho Plays Minecraft - Episode 397 Life Changer.mp4
502456 Etho's Modded Minecraft #13 - Bandit Camp.mp4
636848 Etho's Modded Minecraft #14 - Template vs. Blueprint.mp4
418792 Etho's Modded Minecraft #15 - Strange Voices In My Head.mp4
213968 Etho's Modded Minecraft 10 Death Mountain.mp4
252088 Etho's Modded Minecraft 11 Coke Oven Factory.mp4
183384 Etho's Modded Minecraft 12 Digital Miner.mp4
278968 Etho's Modded Minecraft 2 Tropical Fishing Huts.mp4
236544 Etho's Modded Minecraft 3 Favorite Tool.mp4
223208 Etho's Modded Minecraft 4 Smart NPCs.mp4
252792 Etho's Modded Minecraft 5 Mining Ship.mp4
276272 Etho's Modded Minecraft 6 Drilling Machine.mp4
235400 Etho's Modded Minecraft 7 Piston Power.mp4
287200 Etho's Modded Minecraft 8 Messy Closet.mp4
244792 Etho's Modded Minecraft 9 Steampunk City.mp4

Modded minecraft episodes 13-15 were downloaded with youtube-dl, as 480p (thank you for decoding the dash data and fetching it); the rest were from a firefox extension ("Download YouTube Videos as MP4") that fetches the 360p feed. The name change is significant, and breaks programs like mplayer that auto-play the next one, or even the sorting of files in Finder or the command line.
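
To illustrate why the mixed naming breaks ordering, here is a minimal Python sketch. It assumes a plain code-point comparison, in which "#" (U+0023) sorts ahead of every digit, which matches the listing above. The `episode_key` helper is hypothetical and is not part of youtube-dl, mplayer, or Finder.

```python
import re

names = [
    "Etho's Modded Minecraft 9 Steampunk City.mp4",
    "Etho's Modded Minecraft #13 - Bandit Camp.mp4",
    "Etho's Modded Minecraft 10 Death Mountain.mp4",
    "Etho's Modded Minecraft 2 Tropical Fishing Huts.mp4",
]


def episode_key(name):
    """Hypothetical sort key: the first run of digits in the name, as an integer."""
    match = re.search(r"\d+", name)
    # Names without any number sort first; the name itself breaks ties.
    return (int(match.group()) if match else -1, name)


# Plain string sort: "#13" lands before "10", and "10" before "2".
plain_order = sorted(names)

# Numeric sort on the episode number: 2, 9, 10, 13.
episode_order = sorted(names, key=episode_key)
```

A consistent naming scheme would make the plain sort sufficient; the key function only works around the inconsistency after the fact.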

@yan12125
Collaborator

yan12125 commented Oct 12, 2017

Other similar ideas:

  1. Keep ampersands and parentheses (@active8, --restrict-filenames problem/suggestion #4549)
  2. Strip emojis (@sayem314, Strip emojis from filename #14474)

In my opinion 2. is still a valid request even in 2017. Quite a few emojis are not in the Basic Multilingual Plane (BMP) of Unicode. In other words, applications have to handle code points above U+FFFF (for example, as UCS-4/UTF-32) to process them correctly. Besides Android, Konsole/QTerminal don't work well, either. They use Qt's QString, which stores 16-bit code units (UTF-16) internally.

Update: iTerm2 goes crazy with some emojis, too 😆
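
A small Python sketch of the BMP point, in case it helps: it lists the characters of a title that lie above U+FFFF and counts how many 16-bit units each needs in UTF-16. The helper names `outside_bmp` and `utf16_units` are made up for this illustration.

```python
def outside_bmp(text):
    """Return the characters of text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


title = "ok \U0001F606 \u00e9"           # contains U+1F606 and an accented e
astral = outside_bmp(title)               # only the emoji is outside the BMP
units = [utf16_units(ch) for ch in astral]  # each needs a surrogate pair
```

Software that treats every 16-bit unit as one character sees such an emoji as two separate halves, which is one plausible source of the display problems mentioned above.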

@yan12125 changed the title from "Filename: I want to restrict characters, but permit spaces" to "Custom filename transformations (was: Filename: I want to restrict characters, but permit spaces)" on Oct 12, 2017
@vlakoff

vlakoff commented Aug 29, 2018

On some videos I'm downloading, I encounter emojis (example), or UTF-8 accents (example).

Because of these characters, I'm encountering the following issues:

  • Error when copying the file to my phone (USB cable)
  • "File not found" errors on some Batch scripts

So, I searched a bit and found the --restrict-filenames option. But it unnecessarily replaces spaces with underscores, which is much more visually bloated. It also removes valid ASCII characters such as &, !, etc.

I suggest adding a --ascii-filenames option, that would just produce fully ASCII filenames, without doing any other transformation.
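
A rough Python sketch of what such an ASCII-only transformation could look like. Note that `--ascii-filenames` is only proposed in this comment, and `ascii_filename` is a hypothetical helper, not youtube-dl code. It assumes that decomposing accented letters (NFKD) and dropping whatever is still non-ASCII is acceptable; spaces, `&` and `!` are kept.

```python
import os
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def ascii_filename(name):
    """Hypothetical ASCII-only filename: keeps spaces and ASCII punctuation."""
    stem, ext = os.path.splitext(name)
    # Collapse the whitespace runs left behind by removed characters.
    stem = " ".join(to_ascii(stem).split())
    return stem + to_ascii(ext)


example = ascii_filename("Caf\u00e9 & Cr\u00e8me! \U0001F606.mp4")
```

Characters without an ASCII decomposition (CJK text, most emojis) are simply removed here, so a title made up only of such characters would leave an empty stem; a real option would need a fallback for that case.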

@dantheman213

Any update on this issue?

@Kochise

Kochise commented Oct 31, 2019

Files are not downloaded if there is a # in the filename. Would it be possible to have an option that restricts only illegal characters?
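
As a sketch of the "only illegal characters" idea in Python: the function below replaces just the characters that Windows reserves in file names (`< > : " / \ | ? *` and control characters) and leaves everything else alone. `replace_illegal` and its `extra` parameter are hypothetical; "#" is not reserved on Windows, so it would have to be passed in explicitly if a particular tool chokes on it.

```python
# Characters Windows does not allow in file names (control characters aside).
WINDOWS_RESERVED = '<>:"/\\|?*'


def replace_illegal(name, extra="", replacement="_"):
    """Replace reserved and control characters; keep spaces and everything else."""
    forbidden = set(WINDOWS_RESERVED + extra)
    return "".join(
        replacement if (ch in forbidden or ord(ch) < 32) else ch
        for ch in name
    )


default_result = replace_illegal('What? A: "B" #1.mp4')
with_hash = replace_illegal('What? A: "B" #1.mp4', extra="#")
```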

@frgmntdmmrs

Chipping in to also say that I'd love to see this feature to strip down emojis or add filters to output file names. Sometimes clips being downloaded have emojis in their titles and it tends to cause weird issues with other software, such as not detecting the file or crashing.

@shillshocked

How is this issue solvable?

@nestukh

nestukh commented Apr 3, 2020

To strip most emojis from the final filename, add this --exec switch (it is a single line):

--exec "python -B -c \"\$(printf %b 'import os,sys,re,shutil; shutil.move(sys.argv[1],re.sub(re.compile(\"([\\U0001F1E0-\\U0001F1FF\\U0001F300-\\U0001F5FF\\U0001F600-\\U0001F64F\\U0001F680-\\U0001F6FF\\U0001F700-\\U0001F77F\\U0001F780-\\U0001F7FF\\U0001F800-\\U0001F8FF\\U0001F900-\\U0001F9FF\\U0001FA00-\\U0001FA6F\\U0001FA70-\\U0001FAFF\\U00002702-\\U000027B0\\U000024C2-\\U0001F251])\"), r\"\", sys.argv[1]))')\" {}"

This was tested on GNU/Linux.
Other emoji ranges are listed at https://en.wikipedia.org/wiki/Emoji#Unicode_blocks

Credits:
Code derived from and original work by: https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1#gistcomment-3208085

Extra:
--exec runs the command with sh. To run the same thing directly from a bash command line, with the filename in $VARIABLE (again, it is a single line):

set +H; python -B -c "$(printf %b 'import os,sys,re,shutil; shutil.move(sys.argv[1],re.sub(re.compile("([\\U0001F1E0-\\U0001F1FF\\U0001F300-\\U0001F5FF\\U0001F600-\\U0001F64F\\U0001F680-\\U0001F6FF\\U0001F700-\\U0001F77F\\U0001F780-\\U0001F7FF\\U0001F800-\\U0001F8FF\\U0001F900-\\U0001F9FF\\U0001FA00-\\U0001FA6F\\U0001FA70-\\U0001FAFF\\U00002702-\\U000027B0\\U000024C2-\\U0001F251])"), r"", sys.argv[1]))')" "$VARIABLE"; set -H
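
For readability, the same idea can be written as a small standalone Python module. This is only a sketch under a few assumptions: the names `EMOJI_RE`, `strip_emojis` and `rename_without_emojis` are invented here, the ranges are copied from the one-liner above, and its last range (U+24C2 to U+1F251) is deliberately left out because it also spans CJK and many other non-emoji characters.

```python
import os
import re

# Emoji blocks taken from the one-liner in this comment, minus its last range.
EMOJI_RE = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # includes the regional indicator (flag) symbols
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # geometric shapes extended
    "\U0001F800-\U0001F8FF"  # supplemental arrows-C
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001FA00-\U0001FA6F"  # chess symbols
    "\U0001FA70-\U0001FAFF"  # symbols and pictographs extended-A
    "\U00002702-\U000027B0"  # part of the dingbats block
    "]+"
)


def strip_emojis(name):
    """Remove every character matched by EMOJI_RE from name."""
    return EMOJI_RE.sub("", name)


def rename_without_emojis(path):
    """Rename the file at path so its base name has no emojis; return the new path."""
    directory, filename = os.path.split(path)
    new_path = os.path.join(directory, strip_emojis(filename))
    if new_path != path:
        os.rename(path, new_path)
    return new_path
```

Only the base name is rewritten, so emojis in directory names are left untouched. Saved as a script, a function like `rename_without_emojis` could be called with the path that --exec substitutes for {}.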

@nestukh

nestukh commented Apr 3, 2020

P.S.
If you also use %(upload_date)s in the output template, e.g.
--output "%(upload_date)s %(uploader)s - %(title)s [%(id)s].%(ext)s"
and want to convert the date from YYYYMMDD (20200403) to a more readable format such as YYYY-MM-DD (2020-04-03), the --exec to use is:

--exec "python -B -c \"\$(printf %b 'import os,sys,re,shutil; shutil.move(sys.argv[1],re.sub(r\"^(\d{4})(\d{2})(\d{2})\", r\"\\\1-\\\2-\\\3\",re.sub(re.compile(\"([\\U0001F1E0-\\U0001F1FF\\U0001F300-\\U0001F5FF\\U0001F600-\\U0001F64F\\U0001F680-\\U0001F6FF\\U0001F700-\\U0001F77F\\U0001F780-\\U0001F7FF\\U0001F800-\\U0001F8FF\\U0001F900-\\U0001F9FF\\U0001FA00-\\U0001FA6F\\U0001FA70-\\U0001FAFF\\U00002702-\\U000027B0\\U000024C2-\\U0001F251])\"), r\"\", sys.argv[1])))')\" {}"

The ^ makes sure that the pattern only matches at the beginning of the filename. Remove it if your template differs from the example above.

It's a simple substitution that can also be done with tools like rename (in place of sed):
--exec "rename 's/^(\d{4})(\d{2})(\d{2})/\$1-\$2-\$3/' {}"
However, you cannot pass multiple --exec options with the {} placeholder to change the same file several times. Also, rename is not installed by default, while the Python code above can use the same local virtualenv that youtube_dl is installed in.
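
Written out in plain Python, the date substitution from this comment looks roughly like the sketch below. `DATE_PREFIX` and `dash_date` are names chosen for this illustration, and the pattern assumes the file name starts with %(upload_date)s, as in the template above.

```python
import re

# Eight leading digits, captured as year, month and day.
DATE_PREFIX = re.compile(r"^(\d{4})(\d{2})(\d{2})")


def dash_date(filename):
    """Turn a leading YYYYMMDD into YYYY-MM-DD; other names are returned unchanged."""
    return DATE_PREFIX.sub(r"\1-\2-\3", filename)


converted = dash_date("20200403 Uploader - Title [id].mp4")
untouched = dash_date("Title 20200403 [id].mp4")  # date not at the start
```

Because of the ^ anchor, only a date at the very start of the name is rewritten, which is why the second call leaves the name as it is.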

@lepermagpie

I appreciate the comprehensive scripts, nestukh. Can this be easily replicated on Windows too, just by replacing the appropriate syntax in the original script?

@nestukh

nestukh commented Jun 9, 2020

Probably yes. You will also need no changes under WSL (Windows Subsystem for Linux), MSYS2, Cygwin, or other minimal GNU/Linux layers on Windows.
