Warn / Error on bad schema #16

Open
d70-t opened this issue Sep 19, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

d70-t commented Sep 19, 2023

Is your feature request related to a problem? Please describe.

I tried building a custom schema (and custom message types, but that's irrelevant here) for FDB. I came up with a schema similar to:

[a[b?[c]]]
[a[b[d]]]

I.e. if c is present in the key, b is optional, but if d is present, b is required. I expected this to work; however, the following sample code fails:

#include <fdb5/api/FDB.h>
#include <eckit/runtime/Main.h>

int main(int argc, char** argv) {
    eckit::Main::initialise(argc, argv);
    auto fdb = fdb5::FDB{};
    fdb.archive(fdb5::Key({{"a", "1"}, {"b", "2"}, {"c", "4"}}), "foo", 4);  // works fine
    fdb.archive(fdb5::Key({{"a", "1"}, {"b", "2"}, {"d", "4"}}), "bar", 4);  // crashes
}

when compiled & run as follows:

g++ -o badschema badschema.cpp -lfdb5 -leckit && ./badschema

The reported error is:

terminate called after throwing an instance of 'eckit::SeriousBug'
  what():  SeriousBug: Key::get() failed for [c] in {a=1,b=2,d=4}  in  (/src/fdb/src/fdb5/database/Key.cc +192 get)

I.e. FDB tries to get the value for "c" in the second archive call, I guess because it tries to match the key against the first (instead of the second) schema rule.

I've been in contact with @simondsmart, who suggested using the schema

[a[b?[c][d]]]

instead. However, the test code fails with the same error. This schema also doesn't encode that b is required when d is present.

Describe the solution you'd like

I'd like to see both schemas above work with the provided example code.

Describe alternatives you've considered

If it's impossible to make those schemas work, FDB should generate an understandable error message explaining that the schema is invalid, instead of accepting some keys for storage and then crashing on others.

Additional context

No response

Organisation

MPIM

d70-t added the enhancement (New feature or request) label on Sep 19, 2023
@simondsmart
Contributor

Sorry for the very slow response. I got pulled aside on some joyous internal distractions.

We need to be very careful about describing things as "bad schema" rather than "behaviour that I didn't expect".

The three levels of the schema mean different things. When using the filesystem backend:

  1. Identifies the directory the data is stored in
  2. Identifies the subsets of data that will be collocated in the same data (and index) files
  3. Identifies the values that can vary within one collocated dataset

These have consequences.

For the first and second levels, the hierarchical search pattern exists for practical and technical reasons. Matching is done from the top of the schema downwards. Once a rule matches, further matching stops. This allows us to specify schemas starting with the most specific rules at the top (for instance "[ class=od, expver, stream=oper/dcda/scda, date, time, domain?") with more generic rules further down ("class, expver, stream, date, time, domain?"). A consequence of this is that we can't directly duplicate rules without making them more generic as we go.
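
The top-down, first-match behaviour described above can be illustrated with a toy model. This is only a sketch and not the fdb5 implementation: the names `Keyword`, `Rule`, `matches` and `first_match` are hypothetical, and a value constraint is reduced to a single allowed value (the real schema allows lists such as stream=oper/dcda/scda).

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy model only: NOT the fdb5 implementation or API.
using Key = std::map<std::string, std::string>;

struct Keyword {
    std::string name;
    bool optional = false;       // models the trailing '?' in a schema
    std::string required_value;  // empty means "any value"; models e.g. class=od
};

using Rule = std::vector<Keyword>;

// A rule matches when every non-optional keyword is present in the key
// and every value constraint on a present keyword is satisfied.
bool matches(const Rule& rule, const Key& key) {
    for (const auto& kw : rule) {
        auto it = key.find(kw.name);
        if (it == key.end()) {
            if (!kw.optional) return false;
            continue;
        }
        if (!kw.required_value.empty() && it->second != kw.required_value) return false;
    }
    return true;
}

// Walk the rules from the top; the first match wins and the search stops.
std::optional<std::size_t> first_match(const std::vector<Rule>& rules, const Key& key) {
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (matches(rules[i], key)) return i;
    }
    return std::nullopt;
}
```

In this model, with a specific rule (class=od, expver) listed before a generic one (class, expver), a key with class=od selects index 0 and a key with class=rd falls through to index 1. If the generic rule were listed first, it would shadow the specific one, because it also matches class=od keys.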

Once we get to the third level of the schema, this level identifies the meaning of the values stored in the index. We are no longer identifying which data file/index we are using. As such, having multiple alternatives at this level has no meaning, so we can only have one option there.

The error which is thrown in this case is correct. Your key has matched on [a] and [b]. Given that, it is required (by the schema) to supply key c. Key c is not supplied. And as such, this fails. To be able to use this hierarchy of keys, with (IIRC) the need for b to be optional, you would need to use the schema

[ a [ b? [ c?, d? ]]]
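
To illustrate why a third-level rule of [c] rejects the key {a=1,b=2,d=4} while [c?, d?] accepts it, here is a minimal sketch. It is a toy model, not the fdb5 code path: `Keyword` and `third_level_values` are hypothetical names, and representing an absent optional keyword as an empty value is an assumption made for the example.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model only: NOT the fdb5 implementation or API.
using Key = std::map<std::string, std::string>;

struct Keyword {
    std::string name;
    bool optional = false;  // models the trailing '?' in a schema
};

// Collect the values of the third-level keywords in schema order.
// A missing optional keyword contributes an empty value (an assumption of
// this sketch); a missing required keyword is an error, loosely analogous
// to the Key::get() failure reported above.
std::vector<std::string> third_level_values(const std::vector<Keyword>& rule, const Key& key) {
    std::vector<std::string> values;
    for (const auto& kw : rule) {
        auto it = key.find(kw.name);
        if (it != key.end()) {
            values.push_back(it->second);
        } else if (kw.optional) {
            values.push_back("");
        } else {
            throw std::runtime_error("missing required keyword: " + kw.name);
        }
    }
    return values;
}
```

With the rule [c], the key {a=1,b=2,d=4} throws because c is required and absent. With the rule [c?, d?], the same key yields an empty value for c and "4" for d.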

I presume, however, that this schema with a, b, c, d is a reduction of a real problem. I would very much suggest that you elaborate on what you are trying to do, as I suspect that the suggested schema is unlikely to be useful for a realistic problem; it just looks a bit weird. If you can let me know which keys you are trying to archive with, and crucially what the write pattern and the distribution of values amongst those keys are, then we can settle on something a bit more optimal.
