Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Ability to use extended regex syntax in sed #444

Closed
hdwalters opened this issue Sep 2, 2024 · 13 comments · Fixed by #453
Closed

[Feature] Ability to use extended regex syntax in sed #444

hdwalters opened this issue Sep 2, 2024 · 13 comments · Fixed by #453
Labels
enhancement New feature or request stdlib

Comments

@hdwalters
Copy link
Contributor

I note that standard library function replace_regex() transpiles to the following sed command:

echo "${source}" | sed -e "s/${pattern}/${replacement}/g"

sed basic regex syntax is horrible, as you have to unexpectedly (IMHO) escape regex tokens like capturing parentheses (...), and it doesn't support the "one or more instances" token +. I therefore use sed -E, and it would be nice to have that option.

I believe the aim of Amber is to be as widely applicable as possible, and I understand that some Bash distributions do not support sed -E or the equivalent sed -r, so we can't necessarily just change it in the standard library. I would therefore like to propose a breaking change:

pub fun replace_regex(source: Text, pattern: Text, replacement: Text, extended: Bool): Text {
    unsafe {
        if extended:
            return $echo "{source}" | sed -E -e "s/{pattern}/{replacement}/g"$
        else:
            return $echo "{source}" | sed -e "s/{pattern}/{replacement}/g"$
    }
}

It may even be possible to provide default parameters to Amber functions, but I do not believe this is currently implemented.

Thoughts?

@hdwalters hdwalters added the enhancement New feature or request label Sep 2, 2024
@Mte90
Copy link
Member

Mte90 commented Sep 2, 2024

We have the support for default value for function parameters in the alpha in development.
I thin that if we proceed with this we need a way to support all the sed available around.

@hdwalters
Copy link
Contributor Author

hdwalters commented Sep 2, 2024

if we proceed with this we need a way to support all the sed available around.

I realise there may be sed implementations which do not support the -E option, but can you name any which use a different option for the same functionality? AIUI -E is pretty universal where extended regexes are supported.

I would respectfully suggest that if anyone does have a non standard sed, they can just pass false in the new parameter, and use the standard regex syntax.

@hdwalters
Copy link
Contributor Author

We have the support for default value for function parameters in the alpha in development.

Great! Can you point me at an example please?

@Mte90
Copy link
Member

Mte90 commented Sep 2, 2024

https://github.com/amber-lang/amber/blob/master/src/std/env.ab#L4

@hdwalters
Copy link
Contributor Author

hdwalters commented Sep 2, 2024

So this is not a breaking change after all:

pub fun replace_regex(source: Text, pattern: Text, replacement: Text, extended: Bool = false): Text {
    unsafe {
        if extended:
            return $echo "{source}" | sed -E -e "s/{pattern}/{replacement}/g"$
        else:
            return $echo "{source}" | sed -e "s/{pattern}/{replacement}/g"$
    }
}

@b1ek
Copy link
Member

b1ek commented Sep 3, 2024

iirc, sed's syntax differs a bit with gnu and bsd versions. im pretty sure alpine has its own stripped down sed too. the bsd version is used on all macOS systems, and gnu is on most linux ones

@b1ek
Copy link
Member

b1ek commented Sep 3, 2024

yup, i was right about alpine:

➜  ~ docker run -it alpine:3.20 ash 
/ # sed --help
BusyBox v1.36.1 (2024-06-10 07:11:47 UTC) multi-call binary.

Usage: sed [-i[SFX]] [-nrE] [-f FILE]... [-e CMD]... [FILE]...
or: sed [-i[SFX]] [-nrE] CMD [FILE]...

	-e CMD	Add CMD to sed commands to be executed
	-f FILE	Add FILE contents to sed commands to be executed
	-i[SFX]	Edit files in-place (otherwise write to stdout)
		Optionally back files up, appending SFX
	-n	Suppress automatic printing of pattern space
	-r,-E	Use extended regex syntax

If no -e or -f, the first non-option argument is the sed command string.
Remaining arguments are input files (stdin if none).
/ # 

@b1ek
Copy link
Member

b1ek commented Sep 3, 2024

i think we could check sed version like this:

version=$(sed --version 2>&1)

if echo $version | grep 'Copyright (C) ' | grep ' Free Software Foundation, Inc.' 2>&1 > /dev/null; then
    echo is gnu
fi
if echo $version | grep 'This is not GNU sed version ' 2>&1 > /dev/null; then
    echo is alpine (busybox)
fi
if echo $version | grep 'sed: illegal option -- -' 2>&1 > /dev/null; then
    echo is bsd
fi

@hdwalters
Copy link
Contributor Author

hdwalters commented Sep 3, 2024

My proposal is to default to -E in the absence of more definitive knowledge, so I do not think we need to test explicitly for Alpine sed, as it will fall under that default. This applies to BSD sed as well, as the current version supports -E (see https://man.freebsd.org/cgi/man.cgi?sed).

So I think all we really need to test for is "GNU or not GNU"; we might as well simplify the generated Bash output, especially as Mte90 requested that we test the flavour each time replace_regex() is called.

I am happy to search for \bCopyright\b.+\bFree Software Foundation\b, but I suspect that \bGNU sed\b will be slightly faster, with little risk of false positives.

@hdwalters
Copy link
Contributor Author

hdwalters commented Sep 3, 2024

I do also wonder how standardised the actual extended regex syntax is across flavours, but quite honestly it's going to be a losing game trying to correct for any disparities, and I think we can reasonably leave that to Amber script authors (it might be handy to provide a way of testing the OS type, but that's a topic for another day).

@Mte90 Mte90 transferred this issue from amber-lang/amber Sep 3, 2024
@Ph0enixKM Ph0enixKM transferred this issue from another repository Sep 3, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Amber Project Sep 3, 2024
@hdwalters
Copy link
Contributor Author

Tested on MacOS:

xxx@xxx-MacBook-Air-2 ~ % echo 'abc123def' | sed -E -e 's/([0-9]+)/[\1]/'
abc[123]def

@b1ek
Copy link
Member

b1ek commented Sep 4, 2024

This applies to BSD sed as well, as the current version supports -E (see https://man.freebsd.org/cgi/man.cgi?sed).

they do not support the same regexes, and their engines differ a little bit

@hdwalters
Copy link
Contributor Author

they do not support the same regexes, and their engines differ a little bit

I suspected as much, but I really don't think there's a lot we can do about masking those differences. After all, script authors would have to tailor the regex syntax if they were calling sed from a Bash script.

@Mte90 Mte90 added the stdlib label Sep 5, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to 🏁 Done in Amber Project Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request stdlib
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants