Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behavior in version 7 #434

Closed
atarom opened this issue Oct 26, 2024 · 5 comments
Closed

Strange behavior in version 7 #434

atarom opened this issue Oct 26, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@atarom
Copy link

atarom commented Oct 26, 2024

test.txt
Hi, thanks for this excellent tool, I have been using it for some time without problems, but today when updating to version 7 I encountered this problem with some files like the attached one

$ ugrep '^addr:housenumber=' test.txt | ugrep -v '^addr:housenumber='
=286C
enumber=895

In previous versions I haven't had that problem, I have tried egrep and it works correctly with the same file

$ ugrep --version
ugrep 7.0.0 x86_64-pc-linux-gnu +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd,7z,tar/pax/cpio/zip
License: BSD-3-Clause; ugrep user manual: <https://ugrep.com>
Written by Robert van Engelen and others: <https://github.com/Genivia/ugrep>
Ugrep utilizes the RE/flex regex library: <https://github.com/Genivia/RE-flex>

I only have one machine and my knowledge is very limited, can someone tell me if with version 7 and with the attached file, get the same wrong result as me?

Cheers

@genivia-inc
Copy link
Member

Found the problem in the buffer window shifting in v7, after checking your example (thanks for sharing!). It's not the regex engine, but the alignment of the shifted window is off in the buffer internally. This was changed to accommodate hex line contexts, including after buffer shifting. Will release an update soon.

@genivia-inc genivia-inc added the bug Something isn't working label Oct 26, 2024
@atarom
Copy link
Author

atarom commented Oct 27, 2024

test2.txt.zip
Hi, thank you very much for solving the previous bug so quickly. Today, with new data, I got this similar error with the current version 7.01

$ ugrep -v '^b5m:urlOrto=' test2.txt | ugrep '^drid'
drid.es/vgn-ext-templating/v/index.jsp?vgnextchannel=9e4c43db40317010VgnVCM100000dc0ca8c0RCRD&vgnextoid=924b7b05a8350410VgnVCM2000000c205a0aRCRD
$ ugrep --version
ugrep 7.0.1 x86_64-pc-linux-gnu +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd,7z,tar/pax/cpio/zip
License: BSD-3-Clause; ugrep user manual: <https://ugrep.com>
Written by Robert van Engelen and others: <https://github.com/Genivia/ugrep>
Ugrep utilizes the RE/flex regex library: <https://github.com/Genivia/RE-flex>

and I don't have any line that start with "drid"

@genivia-inc
Copy link
Member

GNU grep hits ^drid too:

$ ugrep -v '^b5m:urlOrto=' test2.txt | ggrep '^drid'
drid.es/vgn-ext-templating/v/index.jsp?vgnextchannel=9e4c43db40317010VgnVCM100000dc0ca8c0RCRD&vgnextoid=924b7b05a8350410VgnVCM2000000c205a0aRCRD

Also with PCRE2 matching the same happens:

$ ugrep -vP '^b5m:urlOrto=' test2.txt | ggrep -E '^drid'
drid.es/vgn-ext-templating/v/index.jsp?vgnextchannel=9e4c43db40317010VgnVCM100000dc0ca8c0RCRD&vgnextoid=924b7b05a8350410VgnVCM2000000c205a0aRCRD

but it does not return a match with ugrep 6.5, so something seems still off with the -v and ^ combination in the first regex with v7. I was sure was fixed after more testing on our end to verify the updated logic on the position of the begin-of-line assertion. Apparently not.

@genivia-inc
Copy link
Member

Looks like a problem with the buffer-shift-event-handler logic, which depends on the buffer's shifted location at the begin-of-line. Because we keep a bit more data in the buffer to accommodate context hexlines, the handler may not always get a fix on the correct line to use, This affects option -v and likely also options -ABC when files are large.

The solution would be to revert back to the old logic, except when we have hex context lines because we don't use -v and -ABC in that case anyway.

@genivia-inc
Copy link
Member

I've tested the change over back to the old logic and added the hex context lines update mentioned above. Both work like a charm. To make sure, I tested with your examples but also with different internal buffer sizes and diff'ed the results of them to spot any deviations that can be attributed to problems with the buffer shift logic and associated event handlers or the hex context lines logic for --hexdump.

Will release an update very soon.

genivia-inc added a commit that referenced this issue Oct 27, 2024
fix #434 option -v issue with the ^ anchor caused by the v7 updated --hexdump hex context support #428
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants