Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hexdump stops prematurely at LF #428

Closed
misutoneko opened this issue Sep 24, 2024 · 7 comments
Closed

Hexdump stops prematurely at LF #428

misutoneko opened this issue Sep 24, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@misutoneko
Copy link

New user here, hi :D

I've only tested ugrep for a short while, but I think I've found a bug.
To cut to the chase here's an example:

echo -e "Hello World\nThis part left unseen" |ugrep -U --hexdump=1A1 "\x57\x6f\x72"

Which will give you this:

00000000  48 65 6c 6c 6f 20 57 6f  |Hello Wo|
00000008  72 6c 64 0a -- -- -- --  |rldJ----|

So the problem is that the output doesn't continue all the way to the context boundary, it stops at the first sight of LF (0x0a).
OK sure, greps are usually line-oriented, but imho the byte values in context shouldn't matter when dealing with the binary stuff?
Tested with v3.7.2 and v6.5.0.

Btw hexdumping is THE main attraction of ugrep for me :D
I have high hopes that ugrep will help me to replace my silly old 600+ line bash/awk script that I'm now using for dumping...

@misutoneko
Copy link
Author

misutoneko commented Sep 26, 2024

Here's a workaround:

echo -e "Hello World\nThis part left unseen" \
|ugrep -y -U --color=always --hexdump=1A1 "\x57\x6f\x72" \
|grep -A1 --color=never 31m

The -y in the first ugrep is passthru.
The second (u)grep finds the lines based on the color code and prints them with the desired context added.
Note that the search pattern 31m is ok for demo purposes, but it should be refined for RL usage.

EDIT:
Actually that -y might eat all performance so it's better to restrict the first ugrep:

echo -e "Hello World\nThis part left unseen" \
|ugrep -U --color=always -A1 --hexdump=1A1 "\x57\x6f\x72" \
|ugrep --color=never -A1 31m

@genivia-inc
Copy link
Member

As you point out, the hexdump feature is line-oriented, like grep in general is a line-oriented tool. The reason is the efficient buffering with a window that shifts out at line boundaries so it is possible to quickly search several GB of input without impacting memory resources.

A way to avoid this is to match newline \n byte as part of the pattern, but there are also reasons not to do that.

Perhaps this should be moved to the ugrep discussion so we can close this eventually.

@misutoneko
Copy link
Author

Yeah I had a hunch it might be something like that.
And I agree, if this is ever fixed it should be done in a way that doesn't compromise speed.
The workaround, well, works for me well enough (it could break at the next version update though).

I guess only the repo owner can move to Discussions? Didn't spot any way to do that...

@genivia-inc
Copy link
Member

I agree that there are opportunities to improve the hexdump feature with better context control. I haven't done that, because I didn't want to mess with the search engine.

What I would like to do is improve the hexdump context to always show the given number of hex lines before and after a match, regardless of the presence of newlines. That needs a change to the search engine so that the before hex lines (i.e. bytes) are not shifted out of the window when searching large files.

If will put that on the TODO list.

@genivia-inc genivia-inc added the enhancement New feature or request label Oct 19, 2024
@genivia-inc
Copy link
Member

The upcoming v7 release will produce hex before and after context lines regardless of LF boundaries. The before hex context is guaranteed to display, as long as the before hex size specified is not crazy large like hundreds of hex lines (in that case, the hex before context may get truncated by chance if the file is several MB).

@misutoneko
Copy link
Author

Thank you! Works like a charm from what I can see :D

@genivia-inc
Copy link
Member

The v7.0.2 update improves hex context lines when these run into each other. I want to avoid overlap, which can be confusing. So it looks a lot cleaner in the update, at least it does to me :)

genivia-inc added a commit that referenced this issue Oct 27, 2024
fix #434 option -v issue with the ^ anchor caused by the v7 updated --hexdump hex context support #428
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants