Refactor slash-escaping logic with SBuf #1540
Conversation
storeAppendPrintf(entry, "%s %.*s %s",
                  k, key().length(), key().rawContent(), ConfigParser::QuoteString(SBufToString(v->value())));
out << k << ' ' << key() << ' ' << Format::DquoteString(v->value());
out.flush();
Please do not flush StoreEntry-based streams explicitly. I am not aware of any proof that such flushing is required, but, if such flushing is required, it needs to be implemented by the stream (rather than relying on us to remember to flush when needed and to remove no-longer-needed flushing as we refactor the calling code).
out.flush();
To avoid memory reallocations std::basic_stream perform writes to the provided output buffer in batches. A small internal store of bytes not yet added to the buffer are only guaranteed to have written after a flush(), or std::endl, or stream destructor occurs.

When the StoreEntry is shared by a PackableStream and a call to another dump function, removing these flush() will result in a heisen-bug where squid.conf syntax can have a few bytes out of order. Like so:

note key "valuuuuuuuuuuuuuuu..acl1 acl2 ue"
> To avoid memory reallocations std::basic_stream perform writes to the provided output buffer in batches [using] A small internal store of bytes not yet added to the buffer

There is no such thing as std::basic_stream. I assume you meant std::basic_ostream. What makes you think that basic_ostream has its own "small internal store of bytes", outside of its streambuf control?

> A small internal store of bytes not yet added to the buffer are only guaranteed to have written after a flush(), or std::endl, or stream destructor occurs.
AFAICT, the above describes how std::basic_streambuf works: When a streambuf has a controlled sequence (a.k.a. the "buffer"), then it may not write bytes into the associated character sequence (a.k.a. the "sink") until that internal buffer overflows, sync() is called, or another buffer-flushing event occurs. This optimization is entirely under streambuf control, which makes a lot of sense because streambuf (children) know what kind of optimizations (if any) work well with their sink.
For example, std::basic_filebuf (a basic_streambuf child) does not immediately write every character into the file (i.e. its sink), of course. Instead it maintains an internal buffer that is flushed to the file as needed. AFAICT, GCC implementations set that internal buffer size to the system BUFSIZ, which can be as large as 8192 bytes.
In PackableStream context, streambuf is our AppendingStreamBuf. AppendingStreamBuf does not have a controlled sequence or "buffer". It is a pass-through streambuf that always writes directly to the sink (i.e. the packable object like StoreEntry).
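For readers following along, here is a minimal, self-contained sketch of such a pass-through stream buffer. It is illustrative only and is not Squid's actual AppendingStreamBuf; the std::string sink stands in for the Packable/StoreEntry, and all names are made up for the example.

```cpp
#include <ostream>
#include <streambuf>
#include <string>

// Illustrative sketch (not Squid code): a pass-through streambuf with no put
// area. It never calls setp(), so pbase()/pptr()/epptr() stay null and every
// write is forwarded straight to the sink.
class PassThroughBuf : public std::streambuf
{
public:
    explicit PassThroughBuf(std::string &sink): sink_(sink) {}

protected:
    // with no put area, overflow() is reached for every single character
    int_type overflow(int_type ch) override {
        if (ch != traits_type::eof())
            sink_.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }

    // bulk writes also go directly to the sink
    std::streamsize xsputn(const char *s, std::streamsize n) override {
        sink_.append(s, static_cast<std::string::size_type>(n));
        return n;
    }

private:
    std::string &sink_; // stands in for the packable object (e.g., StoreEntry)
};

int main()
{
    std::string entry;        // the shared sink
    PassThroughBuf buf(entry);
    std::ostream os(&buf);

    os << "note key ";          // stream-assisted write
    entry.append("\"value\" "); // direct write to the same sink
    os << "acl1 acl2\n";        // appended in call order; nothing to flush
    return entry.empty() ? 1 : 0;
}
```

Because nothing is held inside the stream layer, interleaving stream-assisted writes with direct sink writes (as main() does) cannot reorder bytes, which is the point being made about flush() above.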
> When the StoreEntry is shared by a PackableStream and a call to another dump function, removing these flush() will result in a heisen-bug where squid.conf syntax can have a few bytes out of order. Like so:
> note key "valuuuuuuuuuuuuuuu..acl1 acl2 ue"

Have you actually seen PR code produce that output after the out.flush() call was removed? Or are you sketching an example of output you would expect to see (in some cases) if out.flush() is removed?
> AFAICT, the above describes how std::basic_streambuf works:

Sorry for the wrong type name. I meant the std::basic_streambuf layer inside AppendingStreamBuf.
Are you thinking that AppendingStreamBuf::overflow() flushes the StoreEntry write? That appears not to be the case. All it does is call lowAppend() via sync() to try and push as many characters as possible into the end of the buffer without reallocating.
> In PackableStream context, streambuf is our AppendingStreamBuf. AppendingStreamBuf does not have a controlled sequence or "buffer". It is a pass-through streambuf that always writes directly to the sink (i.e. the packable object like StoreEntry).

The "buffer" I refer to is the put area piece of memory shown in this diagram of what a std::basic_streambuf's internal pointers are pointing to. The memory area we control is labelled stream at the bottom of the diagram.
https://upload.cppreference.com/mwiki/images/7/75/std-streambuf.svg
>> When the StoreEntry is shared by a PackableStream and a call to another dump function, removing these flush() will result in a heisen-bug where squid.conf syntax can have a few bytes out of order. Like so: note key "valuuuuuuuuuuuuuuu..acl1 acl2 ue"
>
> Have you actually seen PR code produce that output after the out.flush() call was removed? Or are you sketching an example of output you would expect to see (in some cases) if out.flush() is removed?

Sort of. I have actually seen that output from code writing to PackableStream when writing to the stream was separated around a dump_foo() call - just not in this particular PR.
I suspect that although our AppendingStreamBuf does write-through on sync() and overflow(), it does not take control of pbase(), pptr() and epptr() away from the parent class implementation to ensure that they are pointing at the real underlying memory buffer. That would be worth looking into later for performance to see if a data copy can be avoided.
> Are you thinking that AppendingStreamBuf::overflow() flushes the StoreEntry write?
AppendingStreamBuf::overflow() calls StoreEntry::append() (via lowAppend()). Other writers in this context also call StoreEntry::append() or its wrappers, directly or otherwise. A single common sink (i.e. the StoreEntry) without intermediate buffering preserves write order. There are no caches/buffers to "flush" in this context.
> that appears not to be the case. All it does is call lowAppend() via sync() to try and push as many characters as possible into the end of the buffer without reallocating.
I assume that when you say "buffer" you mean AppendingStreamBuf::buf_. That buf_ is the StoreEntry object. There are no other write buffers between out and entry in this context.
>> In PackableStream context, streambuf is our AppendingStreamBuf. AppendingStreamBuf does not have a controlled sequence or "buffer". It is a pass-through streambuf that always writes directly to the sink (i.e. the packable object like StoreEntry).
>
> The "buffer" I refer to is the put area piece of memory shown in this diagram of what a std::basic_streambuf's internal pointers are pointing to.
Exactly. AppendingStreamBuf does not have "put area" or "controlled sequence". It only has the "sink". Those "internal pointers" are nil. For example, pptr() and pbase() return nullptr. For them to be non-nil, AppendingStreamBuf would have to call setp(). It never does. By default, there is no controlled sequence or "buffer".
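To illustrate the contrast, here is a minimal sketch (again illustrative only, not proposed Squid code) of what a streambuf with a put area looks like. The setp() call is what makes pbase()/pptr() non-null; only then does the stream layer hold bytes that a flush()/sync() has to push out.

```cpp
#include <ostream>
#include <streambuf>
#include <string>

// Illustrative sketch: a streambuf that *does* create a put area via setp().
// Bytes accumulate in store_ and reach the sink only on sync() (triggered by
// flush(), std::endl, or destruction) or when the put area overflows.
class BufferedBuf : public std::streambuf
{
public:
    explicit BufferedBuf(std::string &sink): sink_(sink) {
        setp(store_, store_ + sizeof(store_)); // this makes pbase()/pptr() non-null
    }
    ~BufferedBuf() override { sync(); }

protected:
    int sync() override {
        sink_.append(pbase(), static_cast<std::string::size_type>(pptr() - pbase()));
        setp(store_, store_ + sizeof(store_)); // empty the put area again
        return 0;
    }

    int_type overflow(int_type ch) override {
        sync(); // drain the full put area into the sink
        if (ch != traits_type::eof())
            sputc(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }

private:
    char store_[64];    // the internal store of not-yet-written bytes
    std::string &sink_; // the associated character sequence (the sink)
};

int main()
{
    std::string sink;
    BufferedBuf buf(sink);
    std::ostream os(&buf);
    os << "held in the put area"; // sits in store_, not yet in the sink
    os.flush();                   // sync() pushes it into the sink
    return sink.empty() ? 1 : 0;
}
```

If AppendingStreamBuf indeed never calls setp(), it behaves like an unbuffered pass-through rather than like this sketch, and there is nothing for flush() to do.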
> The memory area we control is labelled stream at the bottom of the diagram. https://upload.cppreference.com/mwiki/images/7/75/std-streambuf.svg
Yes, that is StoreEntry. It is not really a "memory area", but it is the "associated character sequence" (a.k.a. the "sink").
>> When the StoreEntry is shared by a PackableStream and a call to another dump function, removing these flush() will result in a heisen-bug where squid.conf syntax can have a few bytes out of order. Like so: note key "valuuuuuuuuuuuuuuu..acl1 acl2 ue"
>>
>> Have you actually seen PR code produce that output after the out.flush() call was removed? Or are you sketching an example of output you would expect to see (in some cases) if out.flush() is removed?
>
> Sort of. I have actually seen that output from code writing to PackableStream when writing to the stream was separated around a dump_foo() call - just not in this particular PR.
Sigh. Whatever you were seeing in some other context does not relate to this change request. Obviously, I cannot explain what you were seeing in the context I do not know about.
> I suspect that although our AppendingStreamBuf does write-through on sync() and overflow(), it does not take control of pbase(), pptr() and epptr() away from the parent class implementation to ensure that they are pointing at the real underlying memory buffer.
The parent class implementation (i.e. std::streambuf) does not (and must not) create a buffer. It waits for the specific stream buffer implementation (e.g., AppendingStreamBuf) to call setp() when/if internal buffering is enabled. Only a specific stream implementation knows enough to enable buffering and configure the right buffer size. Our AppendingStreamBuf never calls setp(), leaving pbase() and all those other "buffer" pointers nil, resulting in unbuffered I/O.
BTW, nil controlled sequence pointers are nothing unusual. This is what happens when the user disables buffering on, say, an ofstream by calling pubsetbuf(0,0) which calls std::basic_filebuf::setbuf(0, 0): "On unbuffered streams, all output is directly written to the file (both pptr and pbase are always null pointers, forcing output operations to call overflow)."
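A short sketch of that standard-library analogue, for reference (the file name is just an example):

```cpp
#include <fstream>

int main()
{
    std::ofstream out;
    // Request unbuffered output before any I/O takes place; for basic_filebuf
    // this must happen before the file is opened to be honoured portably.
    out.rdbuf()->pubsetbuf(nullptr, 0);
    out.open("example.txt"); // example file name only
    out << "each write goes straight to the file; pbase() and pptr() stay null\n";
    return out.good() ? 0 : 1;
}
```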
> That would be worth looking into later for performance to see if a data copy can be avoided.
Yes, but not until we eliminate StoreEntry as a dump() parameter (at least). As you know, we are making progress with converting various dump() implementations to streams, but it will take time. Until then, we should not introduce the risk of forgetting to write os.flush(). After that, there will still be no need to write flush() in dump() contexts, but for a different reason -- there will be no transition from stream-assisted to direct-StoreEntry calls in those contexts (because there will be no StoreEntry to transition to).
src/format/Quoting.cc
if (!tok.atEnd()) {
    char c = tok.buf()[0];
    quotedStr.append('\\');
    quotedStr.append(c);
    tok.skip(c);
}
"Extra" EOF checks and raw access to being-parsed buffer are one of the primary causes of parsing bugs. We should enhance Tokenizer and refactor this code to arrive at something like this:
if (const auto c = tok.skipOne()) {
quotedStr.append('\\');
quotedStr.append(*c);
}
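A hypothetical sketch of the parameterless skipOne() suggested above, written only in terms of calls the original snippet already uses (atEnd(), buf(), skip()). The Parser namespace, the header path, and the std::optional return type are assumptions, and a matching declaration would still need to be added to the class:

```cpp
#include <optional>
#include "parser/Tokenizer.h" // assumed header path

// Hypothetical sketch, not existing Squid code: consume and return the next
// unparsed character, or return nothing at the end of input.
std::optional<char>
Parser::Tokenizer::skipOne()
{
    if (atEnd())
        return std::nullopt;
    const char c = buf()[0]; // peek at the next unparsed character
    skip(c);                 // consume exactly that character
    return c;
}
```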
Sure. Out of scope for this PR though.
I disagree. There is a way to implement this logic without improving Tokenizer and without introducing problems flagged in this change request. The suggested improvement is just the best way forward IMO (and can be done here or in a separate PR).
There are many ways to implement this. I have chosen the one used for minimal complexity given the APIs currently in Squid.
Don't get me wrong, I agree it's worth having and will do it as a follow-up PR.
Opened PR #1545 for this
> I have chosen the one used for minimal complexity given the APIs currently in Squid.
We should not add risky code (and then copy it all over the place) when it is easy to add a safe one. If we had a Tokenizer fairy that would quickly create the missing methods for us and clean up after our rushed PRs, OR if adding a missing Tokenizer method were difficult, it would be OK to cut this corner, but that fairy does not exist. We have to do the legwork required to move the code forward rather than fix one problem while adding two new ones.
Recent experience reminded us how difficult it is to write correct parsing code -- we spent three or four review iterations fixing fresh PR bugs because we cut a few corners earlier. We should insist on principles that make repeated bugs less likely. Avoiding intermediate EOF checks and avoiding raw buffer accesses are examples of such principles.
auto *line = SBufToCstring(t.first);
ConfigParser::SetCfgLine(line);
const SBuf found(ConfigParser::NextToken());
CPPUNIT_ASSERT_EQUAL(t.second, found);
We should not feed squid.conf parser invalid input and then expect it to return something reasonable. I know this is essentially moved code, so I do not insist on removing or improving this part of the test, but it is conceptually wrong.
Er, that is exactly what we should expect of these particular functions. They are run on arbitrary garbage, and are expected to pass a specific type of "token" to the callers which are performing validation on whether those tokens are correct syntax or actual garbage.
I strongly disagree. While these squid.conf parsing functions may indeed see "arbitrary garbage" as input, they should not be expected to produce any output to the caller unless that input is valid squid.conf. For example, ConfigParser::NextToken() has the right to throw on an invalid squid.conf line. This code gives it an invalid squid.conf line.
Which of the given quoted strings are "invalid"?
All of them may be seen in a valid squid.conf for reasons such as:
a) a parameter to an ACL looking for this weirdness in a header field, or
b) a string passed as-is to a helper command line, or
c) a string being dumped as custom logformat blob, or
d) some weird multi-escaped string sequence being passed to a regex.
them's just the ones that I can think of in a few seconds.
Branch updated from 57aa774 to 6a82cc4.
Using logic originally in ConfigParser.
This method only needs to format notes as quoted strings.
Co-authored-by: Alex Rousskov <[email protected]>
Branch updated from 55d10e3 to ff76790.
Move the configuration parser logic for slash-escaping quoted
strings to libformat and use SBuf instead of char* for the
string management.