-
-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplified istream handing #367 #764
Simplified istream handing #367 #764
Conversation
@pjkundert There is a test case failing that breaks all Travis runs, see https://travis-ci.org/nlohmann/json/builds/282450885?utm_source=github_status&utm_medium=notification. |
I ran a quick benchmark parsing file #include "json.hpp"
#include <fstream>
using json = nlohmann::json;
int main()
{
std::ifstream f("benchmarks/files/jeopardy/jeopardy.json");
json j;
f >> j;
}
compiled with The |
0174733
to
8d5a5f0
Compare
o Use std::streambuf I/O instead of std::istream; does not maintain (unused) istream flags. o Further simplify get/unget handling. o Restore original handling of NUL in input stream; ignored during token_string escaping.
Performance now within about 10%, while eliminating all over-buffering and seeking. Since we do not need to maintain the std::istream state flags, I used the underlying streambuf I/O API for performance. The original code was also clearing the state flags before returning, so that has been maintained. I noticed that the 'chars_read' (used to indicate file position where an error occurs) are incremented even when an EOF is received; this has been maintained (as it caused too many unit tests to fail when fixed). It is not a huge deal, as it will only cause error messages to contain an error location position one beyond the last character in the input. Runs clean under valgrind, and on all Linux platforms, and on Windows under Visual Studio 2015; 2017 fails for some reason I do not have the tooling to investigate. |
The VS2017 problem is fixed upstream. Try to run That should suffice! |
@theodelrieu is right about MSVC - it would be great if you could rebase so we can let MSVC 2017 check the code. I do not understand why the performance is worse (about 7% with my naive benchmark). |
src/json.hpp
Outdated
/// the start position of the current token | ||
std::size_t start_pos = 0; | ||
/// raw input token string (for error messages) | ||
std::vector<char> token_string = std::vector<char>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant initialization, default constructor is fine here.
I believe some profiling would be useful for that matter. I'm not very experienced in them though. |
@pjkundert Do you know whether this PR could also be adjusted to fix #714? |
8d5a5f0
to
46dbc46
Compare
o Use std::streambuf I/O instead of std::istream; does not maintain (unused) istream flags. o Further simplify get/unget handling. o Restore original handling of NUL in input stream; ignored during token_string escaping.
Performance now exceeds that of the develop branch, in at least some benchmarks. Removed some unnecessary attempts to NUL-terminate the yytext string (since yytext may contain NULs anyway). Converted yytext to a simple std::string, so we could return it directly via std::move(), and avoid many redundant std::string creations. Let me know how this affects #714 ; since I use the underlying std::streambuf I/O, and only look for EOF (and do not set the iostream's failbit), it will almost certainly alter the behaviour you are seeing. |
src/json.hpp
Outdated
is.setstate(flags); | ||
|
||
return result; | ||
int c = is.rdbuf()->sbumpc(); // Avoided for performance: int c = is.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to not return char
instead of int
? That's what I assumed from the function name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::streambuf I/O deals in ints, returning either traits_type::eof() (-1, usually) or the buffer item, converted to an int. I'm not certain why; perhaps to not assume that the underlying data can only be 8-bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I was thinking about our interface: do we want to expose int or char for callers of get_character
?
I believe we should static _cast
between int/char while dealing with the std::streambuf
, and only exposing char
src/json.hpp
Outdated
is.setstate(flags); | ||
|
||
return result; | ||
int c = is.rdbuf()->sbumpc(); // Avoided for performance: int c = is.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to not return char
here? That's what I'd assume from the function name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we need to return values in the range [-1,255], we must continue to use int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implies that there is another issue -- we (presently) allow EOF to be pushed into the token_string, which is a char vector. Must prevent this. Also allows simplifying the get_token_string. Furthermore, a NUL in the token_string should be escaped (not presently handled; they are ignored for the purposes of error messages, which is not correct).
@@ -2640,7 +2582,7 @@ class lexer | |||
const auto x = std::strtoull(yytext.data(), &endptr, 10); | |||
|
|||
// we checked the number format before | |||
assert(endptr == yytext.data() + yylen); | |||
assert(endptr == yytext.data() + yytext.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could yytext.end()
be used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but we're testing the output of a function that was originally passed yytext.data(), so this is probably clearer.
src/json.hpp
Outdated
return next_unget ? (next_unget = false, current) | ||
: (current = ia->get_character()); | ||
int c = current = ia->get_character(); | ||
token_string.push_back(static_cast<char>(c)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you only use current
here? By the way I think we should have a convention about data members, either explicitly use this->
, or an underscore prefix etc...
But that's not in the PR's scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of neither namings...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not really care about which naming to use, but I get confused each time, is it a local variable? A global one? A symbol in the namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sorry, I'm ok with only using current
- I meant I don't like adding an underscore prefix or this->
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, it's just hard to know what the symbol is without performing a search in the current file.
src/json.hpp
Outdated
} | ||
|
||
/// add a character to yytext | ||
void add(int c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question about int
vs char
here.
// yytext cannot be returned as char*, because it may contain a null | ||
// byte (parsed as "\u0000") | ||
return std::string(yytext.data(), yylen); | ||
return std::move( yytext ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could lead to nasty bugsif called from a lvalue, could you provide lvalue and rvalue overloads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have to look up the caller, but maybe we can also return a const reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can return a const ref for lvalues and move for rvalues. I remember someone on reddit discussing the rvalue support in the Library, which could be far better.
But that's not an emergency at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so; but, on review -- I noticed that the get_string function was (incorrectly) returning a 'const std::string', preventing any move re-use of the returned std::string anyway! Removing the const appears to have resulted in (another) pretty dramatic speed-up. New push forthcoming...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well simply calling get_string
twice would become problematic. It adds more boilerplate to add those overloads, but the code will always be correct. We should use std::move(lexer).get_string()
to be explicit about this side effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This get_string method is being used for keys and values; both of which are ultimately moved into a std::map key, or into a BasicJsonType value; there are no lvalue users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, appears to improve performance by close to 10% on some benchmarks. Nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, calling get_string twice would (surprisingly) result in an empty string. Perhaps renaming it to move_string()
would result in less surprising behavior? I'm not sure I understand your std::move(lexer).get_string()
recommendation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get your point. Let's hope nobody will get bitten by that then :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen your last answer sorry. You can overload on lvalues and rvalues:
class lexer {
std::string const& get_string() & {
return _content;
}
std::string get_string() && {
return std::move(_content);
}
};
The first function can only be called on lvalues, while the latter can only be called on rvalues.
Hence the std::move(lexer).get_string()
, which will call the second overload, while lexer.get_string()
will call the first one.
I've used this pattern on classes holding huge data, and I would call the rvalue overload at the very end of the processing, while calling the lvalue one during processing.
But a move_string
is fine with me too.
src/json.hpp
Outdated
std::vector<char> yytext = std::vector<char>(1024, '\0'); | ||
/// current index in yytext | ||
std::size_t yylen = 0; | ||
std::string yytext = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant initialization here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; but we're not doing that with any of the other members; I was just trying to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's up to personal preference, not a big deal anyway ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are relying on the optimizer to eliminate the call to strlen() here. They should eliminate the call and just do default initialization, but better to not make it have to do that. I would run the benchmarks with and without this and see if there's any difference.
Sorry for the double comment... |
I re-ran the simple benchmark with the latest commit: it is still slower: 1.24 seconds vs. 1.14 seconds parsing jeopardy.json... |
I guess it's compiler dependent. On g++ 7.2, its substantially faster on that benchmark. |
Also with GCC 8.0.0 20170919 I measure worse runtimes, but the margin became smaller. |
Further performance improvements. Since EOF is the sole -'ve value allowed through the get() interface, we can use less costly comparisons to detect its presence. Further 1-2% performance improvement. |
Just to refocus the debate -- we're bringing the parser into compliance with expected behavior on std::istream; if this requires a small performance penalty, I believe this is acceptable. The present behavior on std::istream >> is extremely surprising and non-compliant... |
The performance improvement in detecting EOF is probably due to the code re-using a "-'ve value" CPU flag left over from a previous move, instead of issuing a new comparison against a specific value. Changing that specific value to const or even constexpr wouldn't help. |
The token_string is used solely in formatting error messages; NUL values can't typically exist in well-formed JSON, so they will typically terminate parsing in an error. We want them to be output as the last character read -- so they must be escaped in the error message, which is now done. As for EOF: no, they cannot appear in token_string, because it is a char vector containing values in the range [-128,127] (or [0,255] if interpreted as unsigned char). We specifically detect -'ve values, and avoid introducing them into token_string. |
There are still more places where it compares against |
All remaining usages of std::char_traits::eof() are either A) |
Right now, compiler warnings due to -Weffc++ prevent the default-initialization of some members, as this is not possible using |
Is this not supported by 4.9.0? The README currently lists the minimum g++ version as 4.9. |
g++-4.9 (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4 is successfully used by Travis. I don't know about g++ 4.9.1 though. |
GCC's -Weffc++ is IMHO effectively obsolete, and should not be used, unless you are trying to ensure your code meets outdated advice from more than ten years ago. There are numerous bug reports about -Weffc++ in GCC's bugzilla and nobody is interested in fixing them. Among its problems are that the suggestions it makes are based on the first edition of Effective C++ and those items were heavily revised for the second edition. Some advice is simply inappropriate in modern code (C++11 or later) and some warnings are simply poorly implemented in GCC (e.g requiring a mem-initializer for all member variables even if they have default constructors that do the right thing). Changing modern projects to avoid unhelpful -Weffc++ warnings is a huge mistake. |
src/json.hpp
Outdated
@@ -1535,17 +1489,18 @@ class input_buffer_adapter : public input_adapter_protocol | |||
{ | |||
if (JSON_LIKELY(cursor < limit)) | |||
{ | |||
return *(cursor++) & 0xFF; | |||
return reinterpret_cast<int>(std::char_traits<char>::to_int_type(*(cursor++))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to fail on MSVC, see https://ci.appveyor.com/project/nlohmann/json/build/2302/job/n1ti8cx5qqafk43i.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a static_cast
should suffice here
I'm afraid I could still not reproduce this. I shall try to run some benchmarks in the weekend. |
OK, I simplified the handling of characters a bit more, removing unnecessary casts; the std::char_traits::to_int_type already handles all signed int/unsigned char conversions properly, and returns a signed value that can be handled by normal conversions to the int return type of get_character. Also, removed the unnecessary changes to member initializers; I left the brace initializers in place, to avoid any warnings should someone decide to use -Weffc++ on their code. Benchmarks remain improved on g++, essentially the same on clang++. |
@pjkundert I think I understand why my Good job on the body details BTW :) |
src/json.hpp
Outdated
@@ -1397,120 +1397,77 @@ constexpr T static_const<T>::value; | |||
/// abstract input adapter interface | |||
struct input_adapter_protocol | |||
{ | |||
virtual int get_character() = 0; | |||
virtual std::string read(std::size_t offset, std::size_t length) = 0; | |||
virtual int get_character() = 0; // returns characters in range [0,255], or eof() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be std::char_traits<char>::int_type
instead of int
so that we know that there will be no type changes later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be an acceptable change, I think. It's typically that type anyway, and being explicit about it keeps things obvious; that all implementation of the adapter class need to be consistent in producing std::char_traits<char>::int_type
, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A using int_type = ...
would make it less verbose then the full thing.
contains the number of bytes in the string. | ||
scanning, bytes are escaped and copied into buffer yytext. Then the function | ||
returns successfully, yytext is *not* null-terminated (as it may contain \0 | ||
bytes), and yytext.size() is the number of bytes in the string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be more accurate to say that it's null terminated but may also contain embedded nulls?
This currently sounds like there isn't guaranteed to be at least one null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a change in behavior, I see now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I assume that it used to be NUL-terminated, in the mists of history, and was probably being returned as a char*
' Since the token can contain NULs (supplied by escaping), any assumptions about NUL and termination are incorrect. So it was removed.
std::vector<char> yytext = std::vector<char>(1024, '\0'); | ||
/// current index in yytext | ||
std::size_t yylen = 0; | ||
std::string yytext { }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither of these need the { }
at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. However, since the -Weffc++ flag still exists (and arbitrary client code may still use it), and it is cost and risk free for us to still comply with its constraints, I decided to include the default initializer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I'd say that there is no cost. There is a cost whenever you do something unnecessary to appease a broken tool. It's not a runtime or build-time cost, more of a cognitive burden. If we intend to keep it there to support the warning, anyone who maintains the code should know why it's there, which means that there should probably be a comment.
Anyway, it's all philosophical. I'm not going to lose any sleep over it.
@@ -5205,7 +5162,7 @@ class binary_reader | |||
@brief get next character from the input | |||
|
|||
This function provides the interface to the used input adapter. It does | |||
not throw in case the input reached EOF, but returns | |||
not throw in case the input reached EOF, but returns a -'ve valued |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove this "negative valued" bit, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
o We assume the same character int_type as the unerlying std::istream o There are no assumptions on the value of eof(), other than that it will not be a valid unsigned char value. o To retain performance, we do not allow swapping out the underlying std::streambuf during our use of the std::istream for parsing.
o For some unknown reason, the complexity of the benchmark platform prevented some C++ compilers from generating optimal code, properly reflective of the real performance in actual deployment. o Added the json_benchmarks_simple target, which performs the same suite of tests as json_benchmarks. o Simplified the benchmark platform, and emit an "Average" TPS (Transactions Per Second) value reflective of aggregate parse/output performance.
Is there anything left required before merging this? |
I'll have another look soon - I want to have another look whether this also fixes #714. About anything else: are we happy with this approach? I could not measure performance improvement, but I understand that less code is preferable here, and the caching always felt like a hack (though it performed better in my bechmarks, and also with the simple example of parsing jeopardy.json). |
Can one of the admins verify this patch? |
I personally have no more issues with this PR. |
Can we merge this PR? |
Thanks so much for all the patience! |
Simplified cached_input_stream_adapter<> to input_stream_adapter, to avoid redundant buffering (already buffered in underlying std::streambuf), and thus to not pre-load (and discard) a large buffer of input from the istream when parsing a smaller JSON record. Handles a leading Byte Order Mark using standard istream putback/unget.
Also, instead of trying to re-read the previous token on error (which involved seeking, which may or may not be available on an istream), simply collects the developing token into a std::string. The space underlying this string is preserved between tokens, so this collection devolves into simple copying, and is therefore quite efficient.
This also opens the door to further simplifications in the future.
Added a simple unit test to confirm operation. Did not test effect on json::parse. I feel that the approach suggested by @nlohmann (std::parse demands that the entire buffer be a single JSON object, but operator>> scans just a single upcoming JSON object) is correct... Perhaps this pull request may help move in that direction.