Suggestion: Make current and max buffer size long's #129

Closed
jamespharvey20 opened this issue Jun 19, 2022 · 5 comments

Comments

@jamespharvey20

Right now, everything regarding the buffer size is an int. I'm trying to grep two block devices. One succeeds, and the other fails with:

grep: exceeded PCRE's line length limit

I was excited to learn of pcre2grep, but found that, like the original PCRE that grep ships with, pcre2 only uses an int for the buffer size.

Although this doesn't appear to be a very common issue, I can definitely find others who have needed to go beyond this line length limit with grep, or even sed.

It would also be nice if the error saying that line x of file <file> is too long for the internal buffer also gave the line's length. That could be done by continuing to read into the buffer until the line is complete and tracking how long it is.
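
A minimal sketch of that idea in C, assuming a hypothetical helper named `measure_line` (it is not part of pcre2grep). It reads through a small fixed-size chunk and keeps the running total in a `size_t`, so the line never has to fit in the buffer to be measured:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from pcre2grep.
   Returns the number of bytes before the next '\n' (or before EOF if no
   newline is found). The stream is read in fixed-size chunks, so the file
   position afterwards is at the end of the last chunk read, not at the
   newline; that is enough for reporting a length in an error message. */
static size_t measure_line(FILE *in)
{
    char chunk[4096];
    size_t total = 0;
    size_t got;

    while ((got = fread(chunk, 1, sizeof chunk, in)) > 0) {
        const char *nl = memchr(chunk, '\n', got);
        if (nl != NULL) {
            /* Count only the bytes that precede the newline. */
            return total + (size_t)(nl - chunk);
        }
        total += got;
    }
    return total;
}
```

The chunk size of 4096 is arbitrary; the point is only that the counter, not the buffer, has to be wide enough for the line.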

Regardless of whether this suggestion is implemented, it would be nice to have --help and the online documentation give the range of acceptable values, and to have the command-line parsing code report an error if an out-of-range number is given. Right now, specifying a --max-buffer-size= that is too large gives no error.
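
As an illustration of the kind of range check meant here, a rough sketch in C. The function name `parse_buffer_size` and its `max_allowed` parameter are hypothetical, and the sketch only handles plain decimal values; it is not how pcre2grep actually parses its options:

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from pcre2grep.
   Parses a plain decimal byte count into *out.
   Returns 0 on success, or -1 if the text is not a number, is zero,
   or is larger than max_allowed. */
static int parse_buffer_size(const char *text, size_t max_allowed, size_t *out)
{
    char *end;
    unsigned long long value;

    /* strtoull accepts a leading '-' and wraps the result, so insist on
       a digit as the first character. */
    if (text == NULL || !isdigit((unsigned char)text[0]))
        return -1;

    errno = 0;
    value = strtoull(text, &end, 10);
    if (errno == ERANGE || *end != '\0')
        return -1;                      /* overflow or trailing garbage */
    if (value == 0 || value > max_allowed)
        return -1;                      /* outside the accepted range */

    *out = (size_t)value;
    return 0;
}
```

A caller could pass `INT_MAX` or `SIZE_MAX` as `max_allowed`, depending on the type used for the buffer size, and print the accepted range when the function returns -1.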

@PhilipHazel
Collaborator

Which release of PCRE2, please? I cannot find that error message in pcre2grep, nor would it start a message with "grep". However, your point about buffer size being int is valid. I'm amazed that buffers larger than that are wanted, but then I have a history of underestimating things. I will look at changing pcre2grep to size_t in due course.

@jamespharvey20
Author

jamespharvey20 commented Jun 19, 2022

Sorry, I should have been clearer. I started out using GNU grep, and encountered the error I gave. So, that's not part of the PCRE2 source. It's part of the GNU grep source, for when it sees a line that is too long to give to PCRE(1). I came across PCRE2, then learned about pcre2grep, and moved on to using it for what I'm trying to do. Using pcre2grep instead of GNU grep seems like the way for me to go, considering that GNU grep's bundled PCRE(1) is no longer developed.

I'm looking through two 1TB block devices. The first has no issue with an INT_MAX line length, but the second does. I just got "lucky" that somewhere on that second block device there is a string of more than 2.1 billion characters without a newline. :-(
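
For context on the 2.1 billion figure: on platforms where int is 32 bits, INT_MAX is 2147483647. A small sketch of the check involved, using a hypothetical helper `fits_in_int` that is not part of pcre2grep or GNU grep:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
   Returns 1 if a line of `length` bytes can be counted in an int,
   and 0 if it would overflow. */
static int fits_in_int(size_t length)
{
    return length <= (size_t)INT_MAX;
}
```

A line one byte longer than INT_MAX already fails this check, while a size_t counter on a 64-bit system has room for far more than a 1TB device holds.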

EDIT: Oh, but I'm using pcre2 10.40.

@PhilipHazel
Collaborator

Are there NULs (zero bytes) in your data? You could try pcre2grep --newline=NUL if so. (Or CR if there are CRs.)

I will have to read the pcre2grep code (because I've forgotten how it works) to see where line lengths come into it. The data itself is read in much larger chunks. But all grepish programs are line based. It sounds as though you need something that does not depend on line structure. The underlying 10.40 library can handle strings up to size_t but I don't know of any general utility program that does what you want. Maybe there's scope for some kind of "pcre2grep-binary" program.

@jamespharvey20
Author

That was a good idea. I tried it, but the block device also has a sequence of more than INT_MAX bytes that doesn't contain a NUL or CR character.

I wound up being able to perform the search I needed in two stages. First, I used the extended grep pattern library to do an initial search for the string(s) I was looking for, with a wildcard character at the end. Then I piped that output to PCRE to search for the string(s) that ended with either a NUL or a Unix newline character. I didn't run into a line length limitation with the extended grep pattern library.

So, I still think a solution here would be beneficial to others, but I managed to work around the issue and got what I needed. If PCRE2 implements size_t line lengths, I'd be happy to test it out.

@PhilipHazel
Collaborator

I have just committed a patch to the GitHub repo that changes all (I hope) the relevant variables in pcre2grep to PCRE2_SIZE, which is defined as size_t. As far as I can tell, this is all working OK, but if you could test on your ginormous data sets that would be great.
