feature request: cat --filename with optional argument to trim extensions etc from filenames #1356

Closed
janxkoci opened this issue Aug 18, 2023 · 11 comments

@janxkoci

janxkoci commented Aug 18, 2023

The cat --filename parameter is very useful and I use it all the time now as I work on summarizing information from many files.

But I noticed I often want to remove parts of the filename afterwards, especially the file extension(s) or subdirectory paths. This often involves passing through put and using gsub() and related functions, or even deeper dives into the DSL. I kinda wish there was a more sed-like way to do this, and two approaches come to mind:

  1. A new verb with a simple sed-like functionality to do quick find-and-replace operations on a data column. Something like replace "old,new" -f filename would be enough for many use cases, especially if you allow e.g. "unwanted," (i.e. an empty replacement string to delete stuff). Could also support regex for extra power.
  2. An optional string argument to cat --filename "string" that trims the "string" from values in the filename column. Again, could support regex for extra power.

The first approach is more generic and can be used on any column, while the second approach is probably easier to design, implement, and maybe even use, and covers the main use case I have in mind. Of course, nothing stops you from doing both 😉

As always, thanks for considering this feature and extra huge thanks for making such an amazing tool and sharing it with the world! 🤩

PS: don't pay too much attention to the code in the link - it shows quite a bizarre way to implement the "find-and-replace" functionality using just a print statement. Rather, I meant to show how simple the required functionality could be.

@aborruso
Contributor

aborruso commented Aug 18, 2023

Hi @janxkoci,

A new verb with a simple sed-like functionality to do quick find-and-replace operations on a data column.

In a way, I think you already have it. Let me explain with an example.

Using sed and running

echo "/home/aborruso/git/file.csv" | sed -r 's|.+/||;s|\..+||'

you get file.

In Miller, put with the sub function gives you search and replace with regex support.
Running

echo a="/home/aborruso/git/file.csv" | mlr --oxtab put '$b=sub($a,".+/","");$b=sub($b,"\..+","")'

you get

a /home/aborruso/git/file.csv
b file

You can also call your system commands. On Linux you can use, for example, basename and sed:

echo a="/home/aborruso/git/file.csv" | mlr --oxtab put '$b=system("basename ".$a."| sed -r s/[.].+//")'

And you get

a /home/aborruso/git/file.csv
b file
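
For the specific case of a path plus extension, a minimal sketch that avoids external commands altogether is POSIX shell parameter expansion. The helper name strip_name is hypothetical and is not part of Miller. Like the sed example above, it drops everything from the first dot, so file.tar.gz would become file.

```shell
# strip_name: hypothetical helper, not part of Miller.
# Drops the directory part and everything from the first dot onward.
strip_name() {
  name=${1##*/}                 # remove everything up to the last slash
  printf '%s\n' "${name%%.*}"   # remove everything from the first dot
}

strip_name /home/aborruso/git/file.csv   # prints: file
```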

@janxkoci
Author

janxkoci commented Aug 18, 2023

Ciao @aborruso, yes I know I can use put but the regexes sometimes get very verbose. I was even playing with things like the =~ operator and capturing the matches, but it's a lot of typing for often very basic stuff (I don't mind doing it for complex stuff).

The thing is that I've spent time fiddling with regexes and capturing, but after a good half hour it stops being fun, so I opened the CSV in sublime-text and fixed the format in 3 seconds. But I'd like to eventually automate this. And if you open the link and see how simple the usage could be for these basic cases - why not have it? 🙂

PS: the reason I don't just use sed is that I often want to restrict the replacement to a single column, which is where I reach for awk or Miller.
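
As a rough sketch of that per-column use case, the awk function below applies a regex replacement to one named column and leaves the rest of each record alone. The helper name replace_in_col is hypothetical, and the code assumes simple CSV input with a header line and no quoted or embedded commas.

```shell
# replace_in_col: hypothetical helper, not part of Miller or awk.
# Usage: replace_in_col COLUMN REGEX REPLACEMENT < input.csv
# Assumes simple CSV: header line, comma-separated, no quoted commas.
replace_in_col() {
  awk -F, -v OFS=, -v col="$1" -v pat="$2" -v rep="$3" '
    NR == 1 {
      # locate the target column by its header name
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      print
      next
    }
    c { gsub(pat, rep, $c) }   # rewrite only the chosen column
    { print }
  '
}

# The pattern uses [.] instead of a backslash escape, because awk -v
# applies its own escape processing to assigned values.
printf 'filename,note\ndata/a.csv,from a.csv\n' |
  replace_in_col filename '[.]csv$' ''
```

In this example only the filename column changes (data/a.csv becomes data/a), while the .csv inside the note column is left untouched, which is the part that plain sed makes awkward.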

@johnkerl
Owner

@janxkoci this is spot-on -- a general theme of Miller is offering the flexibility to do whatever in the DSL, and then also for certain oft-occurring patterns, to package those up in a low-keystroking verb. This is a great idea! :)

@aborruso
Contributor

And if you open the link and see how simple the usage could be for these basic cases - why not have it? 🙂

Sure, you are free to want whatever is most comfortable for you.
I opened the link, but I can't figure out how it works, so I can't judge why it would be more comfortable.

In your example you write replace "old,new" -f filename. That looks to me like a search and replace not by field, but across every row of the file's entire content.
If so, why not use sed and pipe its stdout to mlr? Doing it that way is a bit like opening the file in sublime-text and fixing the format in 3 seconds.

And I agree that awk and Miller are the better tools for search and replace by field.

That said, I'm sure I don't understand your suggestion.

But as always, @johnkerl understood it.

@janxkoci
Author

janxkoci commented Aug 18, 2023

Exactly @johnkerl - having the power is nice, but as mentioned in the link, most uses of sed are just a simple s/foo/bar/g, and mlr put '$somefield = gsub($somefield,"foo","bar")' (or the awk equivalent) is way longer than what could simply be mlr replace "foo,bar" -f somefield. In fact, I have been missing a Miller verb for sed-like replacements from the beginning; I just never cared enough to say anything. But over the last few days/weeks I have been doing these replacements way too much 😅

@aborruso Fair enough, here is the important part from the usage function in the code: "usage: awksed pat repl [files...]". It's dead simple to use, that's it.

PS: sorry I missed this:

In your example you say replace "old,new" -f filename. It seems to me a search and replace not by field, but by rows in the entire content of the file.

Here, -f filename stands for the column named filename, as that is the default name assigned by cat --filename.

@johnkerl changed the title from "feature request - cat --filename with optional argument to trim extensions etc from filenames" to "feature request: cat --filename with optional argument to trim extensions etc from filenames" on Aug 19, 2023
@johnkerl
Owner

@janxkoci @aborruso I think the easiest would be three verbs: sub, gsub, and ssub. Each with a -f flag for which field(s) to operate on. E.g. mlr ssub -f FILENAME ".csv" "" ...
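
To illustrate why three variants are useful, here is a sketch of the distinction using awk rather than Miller itself. It assumes the proposed verbs follow the Miller DSL functions of the same names: sub replaces the first regex match, gsub replaces every regex match, and ssub replaces the first occurrence of a literal string with no regex processing. The *_like helper names are hypothetical.

```shell
# Hypothetical helpers that mimic the three behaviors with awk.
# Arguments: STRING PATTERN REPLACEMENT

# First regex match only.
sub_like() {
  awk -v s="$1" -v re="$2" -v new="$3" 'BEGIN { sub(re, new, s); print s }'
}

# Every regex match.
gsub_like() {
  awk -v s="$1" -v re="$2" -v new="$3" 'BEGIN { gsub(re, new, s); print s }'
}

# First literal occurrence; no character in PATTERN is special.
ssub_like() {
  awk -v s="$1" -v old="$2" -v new="$3" 'BEGIN {
    i = index(s, old)
    if (i) s = substr(s, 1, i - 1) new substr(s, i + length(old))
    print s
  }'
}

sub_like  "my_csv.csv" ".csv" ""   # prints: my.csv  (the dot matched "_")
gsub_like "my_csv.csv" ".csv" ""   # prints: my
ssub_like "my_csv.csv" ".csv" ""   # prints: my_csv
```

With a pattern like ".csv", the literal variant is the safer choice for trimming extensions, because the unescaped dot in the regex variants matches any character.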

@johnkerl
Owner

@janxkoci #1361 is up, with examples

@janxkoci
Author

I think it looks great 😃👍

@janxkoci
Author

And thanks so much 😁

@johnkerl
Owner

My pleasure! :)

@aborruso
Contributor

Really great. Thank you @janxkoci and @johnkerl !
