Improving NDJSON-style performance? #484
I was really impressed with the DuckDB JSON parsing performance outlined in this blog post, particularly querying an NDJSON file with 172049 records in about 3.7 seconds on my macOS system:

SELECT id, type FROM read_ndjson_auto('2023-02-08-0.json.gz')

My attempt with jsoncons, using JMESPath as the query language (JSON Pointer and JSONPath aren't flexible enough, I don't think...), is:

#include "jsoncons/json.hpp"
#include "jsoncons_ext/jmespath/jmespath.hpp"
#include <zlib.h>
#include <iostream>

const int buffer_size = 1048576; // 2^20

int main(int argc, char *argv[])
{
    auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
        "{id: id, type: type}");
    char buffer[buffer_size];
    gzFile in_file = gzopen(argv[1], "rb");
    int n_lines = 0;
    while (true) {
        const char *line = gzgets(in_file, buffer, buffer_size);
        if (line == nullptr)
            break;
        ++n_lines;                                  // 0.77s
        const auto p = jsoncons::json::parse(line); // 7.75s
        const auto q = expr.evaluate(p);            // 8.42s
    }
    gzclose(in_file);
    std::cout << n_lines << std::endl;
    return 0;
}

The timings in the comments are the cumulative run times on my system as each step is added. Can I improve the jsoncons parsing performance? Parsing NDJSON files seems well suited to parallel processing; is that possible?
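One variation I'm wondering about (untested on my side, so purely an assumption that it could be cheaper): skipping the JMESPath evaluation and reading the two members straight from the parsed value. A minimal sketch of what I mean, replacing the parse and evaluate lines in the loop above:

// Hypothetical replacement for the parse/evaluate lines: read "id" and "type"
// directly from the parsed object instead of evaluating a JMESPath expression
// for every line.
const auto p = jsoncons::json::parse(line);
if (p.contains("id") && p.contains("type"))
{
    jsoncons::json q;            // a default-constructed json is an empty object
    q["id"] = p.at("id");
    q["type"] = p.at("type");
}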
Replies: 1 comment
If you read the lines into memory up front, you can easily parallelize it, e.g.

#include "jsoncons/json.hpp"
#include "jsoncons_ext/jmespath/jmespath.hpp"
#include <algorithm>
#include <concurrent_vector.h> // Microsoft PPL library
#include <execution>
#include <iostream>
#include <string>
#include <vector>
int main(int argc, char* argv[])
{
    std::vector<std::string> lines = {
        R"({"name": "Seattle", "state" : "WA"})",
        R"({ "name": "New York", "state" : "NY" })",
        R"({ "name": "Bellevue", "state" : "WA" })",
        R"({ "name": "Olympia", "state" : "WA" })"
    };

    auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
        R"([@][?state=='WA'].name)");

    concurrency::concurrent_vector<std::string> result;

    auto f = [&](const std::string& line)
    {
        const auto p = jsoncons::json::parse(line);
        const auto q = expr.evaluate(p);
        if (!q.empty())
            result.push_back(q.at(0).as<std::string>());
    };

    std::for_each(std::execution::par, lines.begin(), lines.end(), f);

    for (const auto& s : result)
    {
        std::cout << s << "\n";
    }

    return 0;
}
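A more portable variant of the same idea (just a sketch, not something I've benchmarked) would be to read the gzipped NDJSON into a std::vector<std::string> first and collect results under a std::mutex instead of the Microsoft-specific concurrent_vector; the buffer size and JMESPath expression are simply carried over from the question above. Note that with GCC/libstdc++ the std::execution::par overloads typically require linking against Intel TBB.

#include "jsoncons/json.hpp"
#include "jsoncons_ext/jmespath/jmespath.hpp"
#include <zlib.h>
#include <algorithm>
#include <execution>
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

int main(int argc, char* argv[])
{
    // Read every line of the gzipped NDJSON file up front; the gzip stream
    // itself can only be decompressed sequentially.
    const int buffer_size = 1048576; // 2^20
    std::vector<char> buffer(buffer_size);
    std::vector<std::string> lines;
    gzFile in_file = gzopen(argv[1], "rb");
    while (gzgets(in_file, buffer.data(), buffer_size) != nullptr)
    {
        lines.emplace_back(buffer.data());
    }
    gzclose(in_file);

    auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
        "{id: id, type: type}");

    // Each line is parsed and evaluated independently, so the per-line work
    // can run under a parallel execution policy; only the push_back needs
    // synchronization.
    std::mutex m;
    std::vector<jsoncons::json> results;
    std::for_each(std::execution::par, lines.begin(), lines.end(),
        [&](const std::string& line)
        {
            const auto p = jsoncons::json::parse(line);
            auto q = expr.evaluate(p);
            std::lock_guard<std::mutex> lock(m);
            results.push_back(std::move(q));
        });

    std::cout << results.size() << std::endl;
    return 0;
}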