Improving NDJSON-style performance? #484
I was really impressed with the DuckDB JSON parsing performance outlined in this blog post, particularly querying an NDJSON file with 172049 records in about 3.7 seconds on my macOS system:

SELECT id, type FROM read_ndjson_auto('2023-02-08-0.json.gz')

My attempt with jsoncons, using JMESPath as the query language (JSON Pointer and JSONPath aren't flexible enough, I don't think...), is:

#include "jsoncons/json.hpp"
#include "jsoncons_ext/jmespath/jmespath.hpp"
#include <zlib.h>
#include <iostream>

const int buffer_size = 1048576; // 2^20

int main(int argc, char *argv[])
{
    auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
        "{id: id, type: type}");
    char buffer[buffer_size];
    gzFile in_file = gzopen(argv[1], "rb");
    int n_lines = 0;
    while (true) {
        const char *line = gzgets(in_file, buffer, buffer_size);
        if (line == nullptr)
            break;
        ++n_lines;                                  // 0.77s
        const auto p = jsoncons::json::parse(line); // 7.75s
        const auto q = expr.evaluate(p);            // 8.42s
    }
    gzclose(in_file);
    std::cout << n_lines << std::endl;
    return 0;
}

The timings in the comments are the cumulative run times on my system as each step is added. Can I improve the jsoncons parsing performance? Parsing NDJSON files seems well suited to parallel processing; is that possible?
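One variation I'm wondering about (untested on my side, so purely an assumption that it could be cheaper): skipping the JMESPath evaluation and reading the two members straight from the parsed value. A minimal sketch of what I mean, replacing the parse and evaluate lines in the loop above:

// Hypothetical replacement for the parse/evaluate lines: read "id" and "type"
// directly from the parsed object instead of evaluating a JMESPath expression
// for every line.
const auto p = jsoncons::json::parse(line);
if (p.contains("id") && p.contains("type"))
{
    jsoncons::json q;            // a default-constructed json is an empty object
    q["id"] = p.at("id");
    q["type"] = p.at("type");
}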
Replies: 1 comment
If you read the lines into memory up front, you can easily parallelize it, e.g.

#include "jsoncons/json.hpp"
#include "jsoncons_ext/jmespath/jmespath.hpp"
#include <algorithm>
#include <concurrent_vector.h> // Microsoft PPL library
#include <execution>
#include <iostream>
#include <string>
#include <vector>
int main(int argc, char* argv[])
{
    std::vector<std::string> lines = {
        R"({"name": "Seattle", "state" : "WA"})",
        R"({ "name": "New York", "state" : "NY" })",
        R"({ "name": "Bellevue", "state" : "WA" })",
        R"({ "name": "Olympia", "state" : "WA" })"
    };

    auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
        R"([@][?state=='WA'].name)");

    concurrency::concurrent_vector<std::string> result;

    auto f = [&](const std::string& line)
    {
        const auto p = jsoncons::json::parse(line);
        const auto q = expr.evaluate(p);
        if (!q.empty())
            result.push_back(q.at(0).as<std::string>());
    };

    std::for_each(std::execution::par, lines.begin(), lines.end(), f);

    for (const auto& s : result)
    {
        std::cout << s << "\n";
    }

    return 0;
}
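A more portable variant of the same idea (just a sketch, not something I've benchmarked) would be to read the gzipped NDJSON into a std::vector<std::string> first and collect results under a std::mutex instead of the Microsoft-specific concurrent_vector; the buffer size and JMESPath expression are simply carried over from the question above. Note that with GCC/libstdc++ the std::execution::par overloads typically require linking against Intel TBB.

#include "jsoncons/json.hpp"
#include "jsoncons_ext/jmespath/jmespath.hpp"
#include <zlib.h>
#include <algorithm>
#include <execution>
#include <iostream>
#include <mutex>
#include <string>
#include <vector>

int main(int argc, char* argv[])
{
    // Read every line of the gzipped NDJSON file up front; the gzip stream
    // itself can only be decompressed sequentially.
    const int buffer_size = 1048576; // 2^20
    std::vector<char> buffer(buffer_size);
    std::vector<std::string> lines;
    gzFile in_file = gzopen(argv[1], "rb");
    while (gzgets(in_file, buffer.data(), buffer_size) != nullptr)
    {
        lines.emplace_back(buffer.data());
    }
    gzclose(in_file);

    auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
        "{id: id, type: type}");

    // Each line is parsed and evaluated independently, so the per-line work
    // can run under a parallel execution policy; only the push_back needs
    // synchronization.
    std::mutex m;
    std::vector<jsoncons::json> results;
    std::for_each(std::execution::par, lines.begin(), lines.end(),
        [&](const std::string& line)
        {
            const auto p = jsoncons::json::parse(line);
            auto q = expr.evaluate(p);
            std::lock_guard<std::mutex> lock(m);
            results.push_back(std::move(q));
        });

    std::cout << results.size() << std::endl;
    return 0;
}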