Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic Oximeter query language #5273

Merged
merged 20 commits into from
Mar 29, 2024
Merged

Add basic Oximeter query language #5273

merged 20 commits into from
Mar 29, 2024

Conversation

bnaecker
Copy link
Collaborator

  • Add basic grammar for an Oximeter query language. Includes support for numeric, string, boolean, UUID, timestamp, IP address, and duration literals. Queries are constructed in a pipeline of "table operations", each of which operates on a set of timeseries and produces another set.
  • Implement temporal alignment, currently supporting one method that generates output samples from the mean of the inputs over the alignment period.
  • Add basic subquery support, for fetching multiple timeseries and joining them
  • Implement filtering on fields and timestamps, both in the DB as much as possible, and the query pipeline; and implement filtering on data values in code.
  • Implement group-by support, where we can currently reduce values within a group by summing or computing the mean.
  • Add public Nexus API endpoints for listing timeseries schema, and running an OxQL query. Both are currently restricted to fleet readers, until a more thorough authz process is fleshed out.
  • This also reorganizes the internals of the oximeter_db::client module, which were starting to get too unwieldy and littered with conditional compilation directives.

@bnaecker bnaecker force-pushed the oximeter-query-language branch 5 times, most recently from 0365db0 to e1f882c Compare March 18, 2024 16:30
Copy link
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben! This is a massive chunk of work and I can't wait to use OxQL. A few comments/questions from reading through the code - but in general I found this very approachable due to the nice structure of the code and lots of great comments and tests.

nexus/src/app/metrics.rs Show resolved Hide resolved
nexus/src/external_api/http_entrypoints.rs Show resolved Hide resolved
oximeter/db/Cargo.toml Outdated Show resolved Hide resolved
oximeter/db/src/bin/oxdb/oxql.rs Outdated Show resolved Hide resolved
oximeter/db/src/bin/oxdb/oxql.rs Outdated Show resolved Hide resolved
oximeter/db/src/oxql/point.rs Outdated Show resolved Hide resolved
oximeter/db/src/oxql/point.rs Outdated Show resolved Hide resolved
oximeter/db/src/oxql/query/ast.rs Outdated Show resolved Hide resolved
oximeter/db/src/oxql/table.rs Outdated Show resolved Hide resolved
oximeter/db/src/oxql/table.rs Show resolved Hide resolved
@bnaecker
Copy link
Collaborator Author

Ok, I believe I've addressed everything except for fixing the group_by to be time- rather than index-based. I'll do that tomorrow, and hopefully we can get this integrated soon! Thanks for all the input @rcgoodfellow, it's been very helpful.

@bnaecker
Copy link
Collaborator Author

@rcgoodfellow Thanks for your patience while I tidied up the grouping operations. I've added an exhaustive test for the various kinds of inputs we expect to handle, at the end of oximeter/db/src/oxql/ast/table_ops/group_by.rs. Please let me know if you've got any other comments or concerns at this point, and thanks for your input!

@bnaecker bnaecker requested a review from rcgoodfellow March 22, 2024 03:01
Copy link
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben! 🚢 🚢 🚢

@bnaecker
Copy link
Collaborator Author

Hmmm the test_update_uninitialized test has failed. That should be unrelated, but it's hard to tell, since it seems to end unceremoniously in a server (Nexus) panic. I don't think I'm running any OxQL queries in Nexus during the integration tests, but I'm going to see if I can repro locally.

@bnaecker
Copy link
Collaborator Author

I've also found a couple of big holes in the logic being used to fetch data from the DB. The code is trying to be clever, in order to minimize the amount of data we get, but it's not currently fetching the right datasets for more complicated logical conditions. I'll probably spend a day or two working on that and adding a bunch more tests before soliciting more feedback. (In addition to trying to flush out the failed update test.)

@jgallagher
Copy link
Contributor

Hmmm the test_update_uninitialized test has failed. That should be unrelated, but it's hard to tell, since it seems to end unceremoniously in a server (Nexus) panic. I don't think I'm running any OxQL queries in Nexus during the integration tests, but I'm going to see if I can repro locally.

This is probably #4949

@bnaecker
Copy link
Collaborator Author

Thanks @jgallagher, I hadn't yet looked at existing flakes!

@bnaecker
Copy link
Collaborator Author

bnaecker commented Mar 27, 2024

Alright, I think the latest commit (5dde3c3) is in a good place. I'd like to explain what was going on, since this was one of the most mind-bending bits of code I've written in a while.

The problem

Suppose we get an OxQL query like this:

get physical_data_link:bytes_sent
    | filter link_name == "net0" || (link_name == "net1" && timestamp > @now() - 5m)

That means, select all the data for the link "net0", and only the last 1 minute for the one named "net1". When you ran this on the original PR commits, that's not what you'd get. Instead, if you check out 9d02d73, you will see this:

0x〉get physical_data_link:bytes_sent | filter (link_name == "net0") || (link_name == "net1" && timestamp > @now() - 1m)
Query summary
 ID: cd38e370-ad7a-4d79-8ecc-efb534e68745
 Total duration: 66.01312ms

 SQL queries:
  ID: 27769627-a279-410b-85ef-1e010abb195a
  Duration: 51.551714ms
  Read: 1686 rows (155672 bytes)

  ID: 3c9936c0-f2a1-47ca-a1ba-7a248ab177fb
  Duration: 11.581639ms
  Read: 41101 rows (858834 bytes)

physical_data_link:bytes_sent

 hostname: shale
 link_name: net0
 rack_id: 95abe461-b2a7-4e6c-9d91-752da5c1f237
 serial: shale
 sled_id: 736d8883-c2ee-4d4e-a085-cd2e0a28c457
   [2024-03-26 01:26:49.056689808, 2024-03-27 03:58:14.751963363]: [8078906]
   [2024-03-27 03:58:14.751963363, 2024-03-27 03:58:24.753054459]: [710]
   [2024-03-27 03:58:24.753054459, 2024-03-27 03:58:34.756708736]: [796]
   [2024-03-27 03:58:34.756708736, 2024-03-27 03:58:44.758934400]: [796]
   [2024-03-27 03:58:44.758934400, 2024-03-27 03:58:54.760457923]: [710]
   [2024-03-27 03:58:54.760457923, 2024-03-27 03:59:04.763356168]: [956]

 hostname: shale
 link_name: net1
 rack_id: 95abe461-b2a7-4e6c-9d91-752da5c1f237
 serial: shale
 sled_id: 736d8883-c2ee-4d4e-a085-cd2e0a28c457
   [2024-03-26 01:26:49.079063864, 2024-03-27 03:58:14.753352986]: [0]
   [2024-03-27 03:58:14.753352986, 2024-03-27 03:58:24.755635891]: [0]
   [2024-03-27 03:58:24.755635891, 2024-03-27 03:58:34.757309856]: [0]
   [2024-03-27 03:58:34.757309856, 2024-03-27 03:58:44.759533210]: [0]
   [2024-03-27 03:58:44.759533210, 2024-03-27 03:58:54.762080450]: [0]
   [2024-03-27 03:58:54.762080450, 2024-03-27 03:59:04.763954108]: [0]

That is, we're selecting the last 1 minute of everything. What gives!

The cause

When we run an OxQL query, specifically the get and filter table operations, we need to go fetch some data from ClickHouse. There are two parts to this: getting the right timeseries keys, and getting the measurements.

To get the keys, we take the parts of the filter that apply to the fields, and run them against all the joined field tables for that timeseries. So the query above returns two keys: (14331077795809521820,14801769476763106306). One of those is the "net0" timeseries and the other the "net1", though I have no idea which.

Armed with a set of keys, we turn to fetching the measurements. Easy, just add the timestamp filters to an IN clause that selects just the timeseries with the keys above. Something like this:

SELECT timeseries_key, start_time, timestamp, datum
FROM oximeter.measurements_cumulativeu64
WHERE greater(timestamp, '2024-03-27 03:58:13.436067166') AND
    timeseries_name = 'physical_data_link:bytes_sent' AND timeseries_key IN (14331077795809521820,14801769476763106306)
ORDER BY timeseries_key, start_time, timestamp
LIMIT 1000001
FORMAT JSONEachRow

This seems right, and it is, if the filtering clause only has conjunctions. But if you have disjunctions in there, then this is wrong -- we're running just one query against the fields, and then using all those keys together in one query that selects a time range. But the query at hand explicitly asks for different time ranges for the two keys!

The solution

This involved a long foray in to Boolean algebra, and more recursion than I care to admit. Basically, you need to rewrite the query into disjunctive normal form (DNF), which means it's expressed as only disjunctions of conjunctions: (a && b) || (c && d) || .... Each of these forms a possibly-independent query, which we need to treat as resolving to independent sets of timeseries keys.

So instead of one query for the fields, we run as many as there are terms in the DNF-reduced version of the query. We then keep track of that filter expression and the keys it resolves, and add them together into the measurements query.

That is, instead of doing one clause in the measurements query that looks like WHERE <timestamp filter> AND timeseries_key IN (key_set), we have many, each joined by an OR. That is critical, since the DNF form of the query is a bunch of disjunctions too. So you'd get something like this, for the same query:

SELECT timeseries_key, start_time, timestamp, datum
FROM oximeter.measurements_cumulativeu64
WHERE timeseries_name = 'physical_data_link:bytes_sent' AND (
    timeseries_key IN (14801769476763106306) OR
    (greater(timestamp, '2024-03-27 04:09:17.907475882') AND timeseries_key IN (14331077795809521820))
)
ORDER BY timeseries_key, start_time, timestamp
LIMIT 1000001
FORMAT JSONEachRow

So each of those consistent key groups is a separate conjunction of disjunctions itself in the SQL query, keeping the actual filter that applies to the measurements (timestamps, in this case) matched up to the timeseries keys that go with it.

And when you run the query on the corrected code, you get this:

0x〉get physical_data_link:bytes_sent | filter (link_name == "net0" && timestamp > @now() - 2m) || (link_name == "net1" && timestamp > @now() - 1m)
Query summary
 ID: 38e3cc46-fcc9-428c-8110-d97c4f4dfd55
 Total duration: 415.191922ms

 SQL queries:
  ID: 11679626-63d3-4bdf-b559-95ab57672b6e
  Duration: 277.639921ms
  Read: 1686 rows (155672 bytes)

  ID: 5e6eadf1-bbfd-4c73-a7aa-8f6c1c80a2de
  Duration: 46.856707ms
  Read: 1686 rows (155672 bytes)

  ID: f1da5a87-705e-4f00-91f4-551a66d22177
  Duration: 81.766857ms
  Read: 49293 rows (899851 bytes)


physical_data_link:bytes_sent

 hostname: shale
 link_name: net0
 rack_id: 95abe461-b2a7-4e6c-9d91-752da5c1f237
 serial: shale
 sled_id: 736d8883-c2ee-4d4e-a085-cd2e0a28c457
   [2024-03-26 01:26:49.056689808, 2024-03-27 04:39:25.385006828]: [8283667]
   [2024-03-27 04:39:25.385006828, 2024-03-27 04:39:35.389550820]: [796]
   [2024-03-27 04:39:35.389550820, 2024-03-27 04:39:45.391453469]: [796]
   [2024-03-27 04:39:45.391453469, 2024-03-27 04:39:55.393628342]: [710]
   [2024-03-27 04:39:55.393628342, 2024-03-27 04:40:05.398551691]: [870]
   [2024-03-27 04:40:05.398551691, 2024-03-27 04:40:15.400553991]: [882]
   [2024-03-27 04:40:15.400553991, 2024-03-27 04:40:25.402692234]: [710]
   [2024-03-27 04:40:25.402692234, 2024-03-27 04:40:35.406985832]: [922]
   [2024-03-27 04:40:35.406985832, 2024-03-27 04:40:45.409420660]: [710]
   [2024-03-27 04:40:45.409420660, 2024-03-27 04:40:55.411197736]: [796]
   [2024-03-27 04:40:55.411197736, 2024-03-27 04:41:05.415510815]: [956]

 hostname: shale
 link_name: net1
 rack_id: 95abe461-b2a7-4e6c-9d91-752da5c1f237
 serial: shale
 sled_id: 736d8883-c2ee-4d4e-a085-cd2e0a28c457
   [2024-03-26 01:26:49.079063864, 2024-03-27 04:40:25.403247973]: [0]
   [2024-03-27 04:40:25.403247973, 2024-03-27 04:40:35.407377548]: [0]
   [2024-03-27 04:40:35.407377548, 2024-03-27 04:40:45.409914278]: [0]
   [2024-03-27 04:40:45.409914278, 2024-03-27 04:40:55.411805986]: [0]
   [2024-03-27 04:40:55.411805986, 2024-03-27 04:41:05.416077544]: [0]

Note that I shortened the two time ranges a bit so things would fit on the page. But the important parts are:

  • We get the right data! The last 2 minutes for one timeseries, and the last one minute for the other.
  • We're now running three SQL queries. That's because the filter expression is a disjunction of 2 conjunctions, so we're splitting those into two field queries and a measurements query, exactly like we should be.

Tests

I added several more tests in response to this issue. The most relevant one is here, which tests this exactly query: selecting all of one timeseries, and part of another.

Copy link
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the hard work on this Ben! Things look like they're coming together nicely. Just a few comments/questions since my previous review.

oximeter/db/src/client/oxql.rs Show resolved Hide resolved
oximeter/db/src/client/oxql.rs Show resolved Hide resolved
oximeter/db/src/oxql/ast/grammar.rs Show resolved Hide resolved
oximeter/db/src/oxql/ast/table_ops/filter.rs Show resolved Hide resolved
oximeter/db/src/oxql/ast/table_ops/filter.rs Outdated Show resolved Hide resolved
// This makes me really nervous, so I'm adding an escape hatch that we
// only allow a few iterations. If we've not simplified within that,
// we'll just declare the expression too complicated to handle.
const EXPR_COMPLEXITY_LIMIT: usize = 6;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 seems perhaps overconservative. I could imagine fairly basic queries that need more than 6 de Morgan transforms. I'd consider increasing this by several orders of magnitude.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I've been thinking about this. I don't actually think this is the right way to impose a limit, because the function is recursive! We need to pass in the current level, and just refuse to recurse more than some number of levels deep.

This value prevents handling expressions like this: a && (b || c || d .. ), where there are more than 6 items in the right sub-expression. Each one requires a separate call to simplify_to_dnf() to resolve, but at the same layer of the recursive call stack. That I agree can be much larger, though I'm not sure by how much. The recursive limit should be a separate thing, governing how deep the parse tree get when trying to simplify it. I don't have a good intuition for a limit there either. But both being O(100) seems like way more than enough, and still reasonably safe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, O(100) seems good.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welp, I've found a new way you can DOS the query parser! Just make an expression that's a few right-nested ANDs, such as a == 0 && (a == 0 && (a == 0 && ...))) and BOOM, exponentially long parse times. Fun.

I think that means I need to add a stricter check in the top-level Query::new() method, which does the parsing internally. That could do something very dumb, like check the number of && or parentheses, and bail because it looks suspicious.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd check what's going on with the PEG/packrat parser there. I would not expect that kind of behavior for this grammar.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also surprised. I've asked it to print debug information about the rules as it matches them. It doesn't seem obviously wrong, but it's also possible I've written them very poorly in a way that requires much more work than I expected. I'll keep digging.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've a solution, and a small amount more insight. The Rust crate we're using here, peg, is not a caching / packrat parser by default. But you can make it that way by decorating rules you expect caching to help. I don't think I quite appreciated the degree to which this helps us. When you nest 4 right-deep AND expressions, this goes from 8.16s down to 0.1s when we cache only one rule, the factor() rule that matches pretty high-level things like !(foo == "bar"). So I'll definitely be adding that!

What I don't quite get is why the rule is being called so many times at the exact same locations. Again in that 4-level nesting case, it's being called 139 times for the most deeply-nested expression. I mean, it sort of makes sense that this would be called frequently: factor() itself as a rule shows up in several places, and is nested right inside itself. But this seems way beyond excessive. With caching, it's down to 7 times at most.

Another mark for writing our own parser. I'd like to defer that, since this PR is already way too big. But I'm on board.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added caching and new tests for complexity limits in f27906b

oximeter/db/src/oxql/ast/table_ops/filter.rs Show resolved Hide resolved
// If this is an XOR, rewrite it to a disjunction of conjunctions.
//
// If it is not, return a clone of self.
fn rewrite_xor_to_disjunction(&self) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious - do we have to expand xor into more primitive boolean operators b/c we can't express it in the SQL directly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can express it in SQL, but it's not easy to tell which parts of the query are independent in that case. The goal of the DNF craziness is to have a bunch of disjunctions, each of which can be run separately to get the group of timeseries keys consistent with it. So we rewrite a ^ b into (a && !b) || (!a && b) so that we have two such groups.

I could also be missing something though, and this isn't needed!

Copy link
Contributor

@rcgoodfellow rcgoodfellow Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the need to transform a query in a distributed conjunctive form into a distributed disjunctive form. But if there is no distribution in play because the operation can be executed as a primitive in the target execution environment (SQL in this case) do we really need a transformation that factors a disjunctive operation out of what is essentially a primitive operation from the perspective of the executor?

Copy link
Collaborator Author

@bnaecker bnaecker Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope I'm not misunderstanding your question, but I think we do need to rewrite XORs.

The purpose of the conversion to DNF is not because ClickHouse has any limitations in its SQL engine. All of the logical expressions OxQL supports can be written directly in ClickHouse SQL. We're doing the transformation so that the OxQL layer can figure out how to merge the data from the field tables with those from the measurements table, without first doing a complete join between them. I tried that with the SQL prototype, and it falls over pretty quickly as the data gets larger.

The example in one of my long-winded comments above is relevant here. Suppose you want to select all of one timeseries, and part of another. (That's not an XOR, but it is an OR, so has similar properties.) To do that, you have two choices:

  • Run a complete inner join between the field tables and the measurements table, and then just select whatever was written in the OxQL query. No transformation needed! This is how the SQL prototype from a few months works, and it's great, until the data gets to even moderate size. At that point, the JOIN is enormous -- it takes up a huge amount of memory to duplicate all the fields for every measurement, and then we have to pay the cost to move all those over the network.
  • Somehow select the subset you care about from both the field tables and measurement table, and only transfer those.

That latter choice is what I'm going for here. But to do that, you need to know which records from the field table the query implies, and which time ranges those records correspond to. Those time ranges can be different for different subsets of the field table, as they are if you want to select all of one timeseries and part of another. The code needs to store a mapping from (list of keys -> time range).

The disjunctive normal form is about knowing how to define each list of keys. Every disjunction is independent, so it is logically correct to independently fetch the keys for each of those, and map that to the time range that corresponds to it.

It's true that an XOR is a primitive in the target environment. It's also logically equivalent to an OR, meaning there could be two independent time ranges at play here. So I'm rewriting the XOR to an OR so that we can make sure we don't accidentally merge the wrong data.

I think there's a good optimization opportunity here! The conversion to DNF is really only needed if there is more than one time range at play in the query. If not, then we can do the original version of the query without any conversion. I certainly think we should do that at a later time, but it is an optimization and I wanted to get it correct first.

I hope that answers the question!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I'm starting to slowly come around. Maybe the thing that caused me to look at this a bit sideways was the fact that we're unconditionally factoring out the xor. But I think I see the issue where the xor is a part of a more complex expression and we need to fully normalize everything to ensure we get the filtering correct. Sorry for the confusion, still trying to wrap my head around the semantics.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all, it's subtle and took me a long time to be able to even describe the problem! I'll draw up a figure too, a visual explanation might be more helpful.

oximeter/db/src/oxql/ast/table_ops/filter.rs Show resolved Hide resolved
oximeter/db/src/oxql/ast/table_ops/filter.rs Show resolved Hide resolved
Copy link
Contributor

@rcgoodfellow rcgoodfellow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben! Latest additions all LGTM

bnaecker added 11 commits March 29, 2024 21:35
- Add basic grammar for an Oximeter query language. Includes support for
  numeric, string, boolean, UUID, timestamp, IP address, and duration
  literals. Queries are constructed in a pipeline of "table operations",
  each of which operates on a set of timeseries and produces another
  set.
- Implement temporal alignment, currently supporting one method that
  generates output samples from the mean of the inputs over the
  alignment period.
- Add basic subquery support, for fetching multiple timeseries and
  joining them
- Implement filtering on fields and timestamps, both in the DB as much
  as possible, and the query pipeline; and implement filtering on data
  values in code.
- Implement group-by support, where we can currently reduce values
  within a group by summing or computing the mean.
- Add public Nexus API endpoints for listing timeseries schema, and
  running an OxQL query. Both are currently restricted to fleet readers,
  until a more thorough authz process is fleshed out.
- This also reorganizes the internals of the `oximeter_db::client`
  module, which were starting to get too unwieldy and littered with
  conditional compilation directives.
- Typos and comment cleanup.
- Keep SQL prototype as default feature
- Consts for common strings
- Some DCE
- Fixes to point cast method, and added tests for all supported casts
- Add query ID to the OxQL `QueryResult` object
- Add the query ID to USDT probes
- Add the query ID to a child query logger, which is used in all future
  operations for the query, so we have that ID associated with all
  messages.
- Clarify that duration constants may be approximate.
This Implements two new limits on query complexity:

- Maximum amount by which we allow an alignment query to upsample the
  data, and
- Maximum number of total measurement values fetched from ClickHouse

The first is easy to check during the `align` table operation, and I've
added tests for it.

The second is checked by continually updating the total number of rows
we receive from the database on each measurement query. This also
changes the way we _ensure_ that limit. Previously, we ran the
measurements query twice: once to check that it didn't return too much
data, and the second to fetch the actual results.

That's an obvious TOCTOU, and inefficient, but it was simple. Instead,
this commit changes way we do that check, by applying a limit clause to
the next measurement query. The limit we use is one more than the
remainder in our allotment. That way, we can tell if the query is too
big, without running the query multiple times or returning far more data
than we want to allow. We don't know how _much_ bigger the actual result
set is than what we want, but we do know that it is bigger.
During early development, I implemented a short-cut for group_by
operations, which grouped values by _index_ not timestamp. That worked
because the inputs were aligned and usually from timeseries with exactly
overlapping points. It's not really correct, however. We _want_ to group
data points in a timeseries both by their group identity and _also_ by
lining up those that occur at the same time.

This commit implements correct grouping by finding matching output
timestamps in aligned input timeseries, and averaging or summing only
those at the same time. Care is taken to not include missing points when
computing the averages (or sums).

This also found a bug in the alignment code. That used the
`Iterator::sum` implementation when averaging values within a time
window. That method doesn't return `None` if there is no data, it always
returns the zero value for the input. That caused the grouping code to
convert missing values to zeros, which when you're summing is fine, but
not when averaging.
- Add XOR support. This requires a richer return type from the functions
  applying a filter to a field, since we need to know if it's true,
  false, or dont-care. The latter was not representable before.
- Support negated filtering _expressions_, not just atoms
- Year and month duration literals are capitalized, the rest are
  lowercase
@bnaecker bnaecker force-pushed the oximeter-query-language branch from 7c080d6 to 6f31b08 Compare March 29, 2024 21:36
@bnaecker bnaecker enabled auto-merge (squash) March 29, 2024 21:36
@bnaecker
Copy link
Collaborator Author

Thanks for all your feedback and testing @rcgoodfellow! I've fixed up conflicts with main and set to automerge.

@bnaecker bnaecker merged commit 17510a6 into main Mar 29, 2024
23 checks passed
@bnaecker bnaecker deleted the oximeter-query-language branch March 29, 2024 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants