Suggestion: Make current and max buffer size long's #129
Comments
Which release of PCRE2, please? I cannot find that error message in pcre2grep, nor would it start a message with "grep". However, your point about buffer size being int is valid. I'm amazed that buffers larger than that are wanted, but then I have a history of underestimating things. I will look at changing pcre2grep to size_t in due course.
Sorry, I should have been clearer. I started out using GNU grep and encountered the error I gave, so it is not part of the PCRE2 source. It is part of the GNU grep source, for when it sees a line too long to give to PCRE(1). I then ran across PCRE2 and learned about pcre2grep. I'm looking through two 1TB block devices. The first has no issue with an INT_MAX line length, but the second does. I just got "lucky" that somewhere on that second block device there is a string of more than 2.1 billion characters without a newline. :-( EDIT: Oh, but I'm using pcre2 10.40.
Are there NULs (zero bytes) in your data? You could try pcre2grep --newline=NUL if so. (Or CR if there are CRs.) I will have to read the pcre2grep code (because I've forgotten how it works) to see where line lengths come into it. The data itself is read in much larger chunks. But all grepish programs are line based. It sounds as though you need something that does not depend on line structure. The underlying 10.40 library can handle strings up to size_t but I don't know of any general utility program that does what you want. Maybe there's scope for some kind of "pcre2grep-binary" program.
That was a good idea. I tried it, but there's also a sequence of bytes on the block device longer than INT_MAX that doesn't contain a NUL or CR character. I wound up being able to perform the search I needed by using the extended grep pattern library to do an initial search for the string(s) I was looking for, with a wildcard character at the end, then piping that to PCRE to search for the string(s) that ended with either a NUL or Unix newline character. The extended grep pattern library doesn't have a line length limitation that I ran into. So, I still think a solution here would be beneficial to others, but I managed to work around the issue and got what I needed. If PCRE2 implements size_t line lengths, I'd be happy to test it out.
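For anyone wanting to check in advance whether a given delimiter (newline, NUL, CR) keeps every "line" under INT_MAX, a small measuring pass is one option. The following is only a sketch: `longest_run` is a hypothetical helper, not part of pcre2grep or grep, and it reads byte by byte for clarity rather than speed.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from pcre2grep): return the length, in bytes,
   of the longest run in the stream that contains no 'delim' byte.
   The count is kept in size_t so it does not wrap at INT_MAX on
   platforms where size_t is wider than int. */
static size_t longest_run(FILE *f, int delim)
{
    size_t longest = 0;
    size_t current = 0;
    int c;

    while ((c = getc(f)) != EOF) {
        if (c == delim) {
            /* End of a run: remember it if it is the longest so far. */
            if (current > longest)
                longest = current;
            current = 0;
        } else {
            current++;
        }
    }

    /* The final run may not be terminated by a delimiter. */
    if (current > longest)
        longest = current;
    return longest;
}
```

Calling it once per candidate delimiter (for example `'\n'`, `'\0'` and `'\r'`) and comparing each result with INT_MAX would show whether any --newline setting can help for that device.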
I have just committed a patch to the GitHub repo that changes all (I hope) the relevant variables in pcre2grep to PCRE2_SIZE, which is defined as size_t. As far as I can tell, this is all working OK, but if you could test on your ginormous data sets that would be great.
Right now, everything regarding the buffer size is an int. I'm trying to grep two block devices. One succeeds, and the other fails with a line-length error.
I was excited to learn of pcre2grep, but found that, like the original PCRE that grep ships with, PCRE2 only uses an int for the buffer size.
Although this doesn't appear to be a very common issue, I can definitely find others who have needed to go beyond this line length limit with grep, or even sed.
It could also be nice, when outputting the error "line x of file <file> is too long for the internal buffer", for pcre2grep to give the line's length, by continuing to read into the buffer until the line is complete and tracking how long it is.
Regardless of whether this suggestion is implemented, it would be nice to have --help and the online documentation give ranges of acceptable values, and to have the command line parsing code give an error if an out-of-range number is given. Right now, specifying a --max-buffer-size= that is too large gives no error.
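To illustrate the kind of range check being asked for, here is a minimal sketch in C. It is not pcre2grep's actual option parser: `parse_buffer_size` and its `min`/`max` parameters are hypothetical, and it accepts plain decimal digits only, leaving out any size-suffix handling.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not pcre2grep's real code: parse a decimal
   buffer size and reject anything outside [min, max].
   Returns 0 and stores the value in *out on success, -1 on error. */
static int parse_buffer_size(const char *arg, size_t min, size_t max,
                             size_t *out)
{
    char *end;
    unsigned long long v;

    /* Require a leading digit: strtoull would otherwise accept
       leading whitespace and a sign, silently wrapping "-1". */
    if (arg == NULL || *arg < '0' || *arg > '9')
        return -1;

    errno = 0;
    v = strtoull(arg, &end, 10);
    if (errno == ERANGE)      /* too large even for unsigned long long */
        return -1;
    if (*end != '\0')         /* trailing junk such as "12abc" */
        return -1;
    if (v < min || v > max)   /* outside the documented range */
        return -1;

    *out = (size_t)v;
    return 0;
}
```

A caller could print the accepted range in its error message when this returns -1, which would also cover the request to document the range in --help.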