string replace should have options for both literal string replacement and regex replacement #346

Open
samer1977 opened this issue May 22, 2024 · 10 comments

Comments

@samer1977

No doubt regex replace is very powerful, but sometimes you just want a simple literal string replace. I ran into a problem when I wanted to replace a literal string that contains regex characters. I could not find an easy way to do that other than escaping every regex reserved character first, which gets cumbersome and inefficient.
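To illustrate the difference being asked for, here is a minimal sketch in Python (not JSLT). The sample text and variable names are made up for illustration. It only shows the general behavior of literal versus regex replacement, not how JSLT's replace is implemented.

```python
import re

text = "price (USD): 1+1=2"   # hypothetical sample input
needle = "(USD)"              # literal string that contains regex characters

# Literal replacement: no character in the needle has special meaning.
literal = text.replace(needle, "[usd]")

# Regex replacement of the same literal only works after escaping the needle.
escaped = re.sub(re.escape(needle), "[usd]", text)

# Without escaping, the parentheses form a group, so only "USD" is matched
# and the surrounding parentheses are left behind.
unescaped = re.sub(needle, "[usd]", text)

print(literal)
print(escaped)
print(unescaped)
```

The escaping step is the part that becomes tedious when it has to be written by hand in a language that offers no escape helper.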

@catull

catull commented May 22, 2024

@samer1977 Can you give an example ?

@samer1977
Author

samer1977 commented May 23, 2024

OK, I was trying to solve the problem in #342
using a recursive function call like this:

def capture-many(json, regex, key)
   let c = capture($json, $regex)
   let res = if ($c == {}) []
             else [$c] + capture-many(replace($json, get-key($c, $key), ""), $regex, $key)
   $res

capture-many(.body,"<img src=\"(?<url>[a-z])\">","url")

To do that, I get each capture, store it in an array, then purge the captured value from the JSON by replacing it with an empty string, and recurse until no capture is left. I understand there are limitations: each capture has to be different, and you can only capture one key-value pair, but that would have worked for this scenario.

The above function would work on a simpler scenario like this:

{ "body" : "<div class="intercom-container"><img src="image1">. <div class="intercom-container"><img src="image2">" }

However, once you introduce more complex strings like URLs, it won't work, because replace treats the captured value as a regex.
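A minimal Python sketch (not JSLT) of why a URL breaks this approach. The URL is a made-up example. The point is only that a `?` or `.` inside the captured value changes meaning once the value is reused as a regex pattern.

```python
import re

# Hypothetical input: the captured URL contains "." and "?".
body = '<img src="https://example.com/a.png?w=100">'
url = "https://example.com/a.png?w=100"

# Used as a regex, "g?" means "an optional g", so the pattern no longer
# matches the literal text and nothing is removed.
as_regex = re.sub(url, "", body)

# Used as a literal, the URL is found and removed as intended.
as_literal = body.replace(url, "")

print(as_regex == body)   # the regex version leaves the body untouched
print(as_literal)
```

In a recursive function like capture-many, a replace that removes nothing would presumably mean the next call captures the same value again.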

@catull

catull commented May 23, 2024

Sorry to be a nag, can you give us a challenging example ?
Are you talking about a URL that contains any one of these characters:

  • '*'
  • '(', ')', '[', ']', '{' or '}'
  • '&', ''
    ?

@samer1977
Author

@catull

catull commented May 23, 2024

How about this.

input

{ "body": "<div class=\"intercom-container\"><img src=\"image1\"></img></div><div class=\"intercom-container\"><img src=\"image2\"></img></div><div class=\"intercom-container\"><img src=\"image3\" /><div class=\"intercom-container\"><img src=\"image4\"/><img src=\"^&*image5\"/></div>"
}

You have different image tags

<img src="image1"></img>
<img src="image2"></img>
<img src="image3" />
<img src="image4"/>
<img src="^&*image5"/>

Here's the transformation:

[ for (split (.body, "<img ")[1:])
  capture (., "^src=\"(?<url>[^\"']+)\"")
]

No need to use recursion.

The trick is to split up the input with the "separator" <img .

Yes, there can be any funny characters in the src-attribute, even regexp "reserved" characters.
We are capturing only what is in between src=" and the ending double quote.

The result is:

[ {
    "url" : "image1"
  }, {
    "url" : "image2"
  }, {
    "url" : "image3"
  }, {
    "url" : "image4"
  }, {
    "url" : "^&*image5"
  }
]
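For readers who want to try the split-then-capture idea outside JSLT, here is a rough Python equivalent using a shortened input with double-quoted src attributes. It is a sketch, not a translation of JSLT's split and capture functions. Note that Python spells named groups `(?P<url>...)` rather than `(?<url>...)`.

```python
import re

# Shortened, hypothetical version of the input body.
body = ('<div class="intercom-container"><img src="image1"></img></div>'
        '<div class="intercom-container"><img src="image3" />'
        '<img src="^&*image5"/></div>')

# Same pattern as in the transformation, with Python's named-group syntax.
pattern = re.compile(r'^src="(?P<url>[^"\']+)"')

urls = []
# Drop the first chunk: it is the text before the first "<img ".
for chunk in body.split("<img ")[1:]:
    m = pattern.match(chunk)
    if m:
        urls.append({"url": m.group("url")})

# The pattern requires double quotes, so a single-quoted src does not match.
single_quoted = pattern.match("src='image6'/>")

print(urls)
print(single_quoted)
```

The regex characters in `^&*image5` cause no trouble here because the captured value is never reused as a pattern.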

@catull

catull commented May 23, 2024

How did I come to this solution ?

I first only used this

split (.body, "<img ")

This gave me

[
    "<div class=\"intercom-container\">",
    "src=\"image1\"></img></div><div class=\"intercom-container\">",
    "src=\"image2\"></img></div><div class=\"intercom-container\">",
    "src=\"image3\" /><div class=\"intercom-container\">",
    "src=\"image4\"/>",
    "src=\"^&*image5\"/></div>"
]

As you can see, the first element in the array does not have a "src=" at the beginning.
So it has to be excluded, thus changing the transformation to

split (.body, "<img ")[1:]

Now that all elements start with "src=", the regexp just becomes basically anything between the quotes:

  capture (., "^src=\"(?<url>[^\"']+)\"")

And now you wrap array processing around it resulting in "[ for ..... capture (...) ]".

@catull

catull commented May 23, 2024

First of all, your input is not properly formatted; you should get plenty of errors in the sandbox alone.
You are not properly escaping the double quotes in the body attribute.

Here's how it should be:

{ "body": "<div class=\"intercom-container\"><img src=\"image1\">. <div class=\"intercom-container\"><img src=\"image2\">"
}

Second, your regex is not right.
All you need to express is that you want to capture every character that is not a double quote, starting after <img src=" and stopping before the next double quote.

[image: screenshot with the parts of the regex highlighted in color]

You want to exclude anything that is blue, only capture the orange string.
The part in purple - (?<url> and ) after the '+' - is only there to tell regexp that you have a capturing group named url.

So, this below should work.

def capture-many (json, regex, key)
   let c = capture ($json, $regex)
   let res = if ($c == {}) [] 
             else [$c] + capture-many (replace($json, get-key($c,$key), ""),$regex,$key)

   $res

capture-many (.body, "<img src=\"(?<url>[^\"]+)\">", "url")
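As a point of comparison, here is a hedged Python sketch of the same recursive idea, with the removal step done as a literal replacement. The function name `capture_many` and the sample inputs are illustrative only. This is not the JSLT implementation, and Python's named-group syntax `(?P<url>...)` differs from the one used above.

```python
import re

def capture_many(text, regex, key):
    """Collect named captures one at a time, removing each captured value
    from the text before looking for the next one."""
    m = re.search(regex, text)
    if m is None:
        return []
    captured = m.groupdict()
    # Literal removal: regex characters in the captured value are harmless.
    remaining = text.replace(captured[key], "")
    return [captured] + capture_many(remaining, regex, key)

body = ('<div class="intercom-container"><img src="image1">. '
        '<div class="intercom-container"><img src="image2">')
simple = capture_many(body, r'<img src="(?P<url>[^"]+)">', "url")

# A value containing "?" and "." is still removed correctly.
tricky = capture_many('<img src="a.png?w=1"><img src="b.png">',
                      r'<img src="(?P<url>[^"]+)">', "url")

print(simple)
print(tricky)
```

Because the literal replace removes every occurrence of the captured value, repeated values are only reported once, which matches the limitation mentioned earlier in the thread.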

A few words of advice.

  1. Use the playground -> https://www.garshol.priv.no/jslt-demo
  2. Use a proper JSON editor, it would have pointed out the quotation problems to you.
  3. Learn about regex, think in positive / negative - what is in, what is out.
  4. Specifically, regexp character classes.

Instead of [a-z], which only captures lower-case letters of the alphabet, you have to look at it differently.

What is in: anything orange above, that means, a sequence of 1 or more characters EXCEPT for a double quote.
What is out: <img src=" at the beginning, and "...... at the end.

The "in" part is expressed as [^"]+, this is called a negated character class.
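A small Python sketch of the difference between the two character classes, using a made-up tag. Python writes the named group as `(?P<url>...)`, and the sample URL is an assumption for illustration.

```python
import re

tag = '<img src="https://example.com/Image_1.png">'   # hypothetical input

# [a-z] matches exactly one lower-case letter, so a full URL cannot match.
narrow = re.search(r'<img src="(?P<url>[a-z])">', tag)

# [^"]+ matches one or more characters that are not a double quote.
negated = re.search(r'<img src="(?P<url>[^"]+)">', tag)

print(narrow)
print(negated.group("url"))
```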

A good resource is https://www.regular-expressions.info/charclass.html

Good luck.

@samer1977
Author

OK! Thanks for your detailed answer. I appreciate it; at least it's detailed. I will take everything you said into consideration and try to be more careful when posting data/code. I did not pay much attention to what I was pasting because I made it clear early on that this is all based on #342, and that should have been your source. No excuse though, I will try to do better next time. I'm using all of the above and I understand regex very well, but sorry, I'm still human.

@catull

catull commented May 23, 2024

We are all here to learn from each other.

catull pushed a commit to catull/jslt that referenced this issue May 26, 2024
catull pushed a commit to catull/jslt that referenced this issue May 26, 2024
This PR addresses issues schibsted#346 and schibsted#347.
Added documentation.
catull pushed a commit to catull/jslt that referenced this issue May 26, 2024
This PR addresses issues schibsted#346 and schibsted#347.
Added support for Regexp pattern "Predefined character classes".
@catull

catull commented May 27, 2024

I created a PR, see #350.
