Validate is not correctly reading online schemas/schematrons when they are not provided via command-line #879
Comments
@rgdeen I was able to replicate. We will add to the backlog. As a note, we highly recommend using the online schemas and the latest version of validate wherever possible to ensure accurate validation of products. |
Is this a problem?
Is the expectation that if the file(s) given by the user are not found, it should be a warning? It is already caught here:
Note: if the schematron exists and is valid, it just never happens. |
@al-niessner actually the 3.5 example fails as I initially reported. The issue is that the validation for collection_type is in the PDS core sch. So when validate is run with --schematron it loads the given sch's but, since there is no PDS4_PDS_1G00.sch in the --schematron list (or in my case, a mis-named one), it does not notice this and (apparently) loads NO sch for the PDS core. So the invalid label passes... because no rules are loaded saying it's bad. This may be unique to the core DD, I don't know. Perhaps because it's the default namespace it doesn't "notice" there's no sch for it? In any case it should at least be a warning if an expected sch or xsd is just flat out missing. |
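A minimal sketch of the kind of check described in this comment, written in Python rather than validate's own Java. It collects the namespace URIs a label actually uses and reports any that no supplied rule set covers. The function names, the sample label, and the `covered` set are hypothetical; how validate would work out which namespaces a given `.sch` covers is left as an input here and is not taken from the validate code.

```python
import xml.etree.ElementTree as ET

PDS_CORE_NS = "http://pds.nasa.gov/pds4/pds/v1"
IMG_NS = "http://pds.nasa.gov/pds4/img/v1"

# Illustrative label only: core is the default namespace, img is prefixed.
LABEL = f"""<Product_Collection xmlns="{PDS_CORE_NS}" xmlns:img="{IMG_NS}">
  <Collection><collection_type>Bogus</collection_type></Collection>
  <img:Imaging/>
</Product_Collection>"""


def namespaces_used(label_xml):
    """Return the namespace URI of every element in the label."""
    used = set()
    for elem in ET.fromstring(label_xml).iter():
        # ElementTree spells qualified tags as "{uri}localname".
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            used.add(elem.tag[1:].split("}", 1)[0])
    return used


def missing_rule_sets(label_xml, covered):
    """Namespaces used by the label that no supplied schematron covers."""
    return sorted(namespaces_used(label_xml) - set(covered))


# Assumption: only the IMG schematron was passed via --schematron.
for ns in missing_rule_sets(LABEL, {IMG_NS}):
    print(f"WARNING: no schematron loaded for namespace {ns}")
```

With only IMG covered, the core namespace is reported, which is the warning requested here; the label would still be validated exactly as it is today.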
@al-niessner, @rgdeen is correct. My rerun of this issue with the latest version of validate shows that it is still not doing this properly. Per my example above, if I give it only the IMG schemas/schematrons via the command line, it validates successfully:
But it shouldn't; the output should be the same as:
For example, this is what should happen (this is not what happens now):
|
Thanks. I misunderstood the bit at the top and thought that it was saying 3.5 was doing it right. So the question is why 1G is overriding 1B instead of living next to it. If my memory is correct, it revolves around people trying to override the schema with one of their own for the next batch of improvements. Let me dig into it now that I understand it better. |
@al-niessner correct. We sometimes want to “overwrite” with newer versions of a schema or schematron |
Okay, it is back to here: validate/src/main/java/gov/nasa/pds/tools/label/LabelValidator.java Lines 554 to 593 in cbefe2f
The variable useLabelSchematron comes from the command line. If we remove useLabelSchematron, then validate will always load what is in the label. If you are developing a new schematron that is not deployed, you will get a cannot-find error. If the label points at the old one, it will run and presumably fail for the same reasons the schematron is being updated. Either way it is a false negative, because the failure is due to the CLI design, not the label, schema, or schematron. Do we add a new CLI flag that says merge online and CLI schematrons, with CLI preference? Other than this test case, does this make sense for any real-world use case? I guess fixing false positives in the schema/schematron would benefit from the merge. Fixing false negatives cannot use the merge because the ones in the label will always generate the false positive until the fix is published. Can we get rid of the CLI flags for schema and schematron and force the label to have the correct references? If the desire is to use a test schema/schematron, then change the reference from https://schematron to file:///fixed_schematron. This way validate always uses the label and all confusion in validate is gone? |
I think the default functionality for
We can't make this assumption for many reasons, e.g.
|
@al-niessner ☝️ |
I think there are two ways to go: merge with local overriding, or just warn for missing ones. Probably merge is easier for end users as long as you pull the right IM major version. Warning is okay though if you don't want to do the merge. Wonder if merge should be an option, with warning if not? |
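The two options in this comment can be sketched as follows. This is an illustration in Python, not validate's implementation: `resolve_schematrons`, the `merge` flag, the namespace-keyed dictionaries, and the example locations are all hypothetical.

```python
def resolve_schematrons(label_refs, cli_refs, merge):
    """Pick the schematrons to apply.

    label_refs / cli_refs map a namespace id (e.g. "pds", "img") to a
    schematron location. Returns (chosen, warnings).
    """
    if merge:
        # Merge, with the local (CLI) copy overriding the label reference.
        return {**label_refs, **cli_refs}, []
    # No merge: use only what was given, but warn about anything missing.
    warnings = [
        f"no schematron supplied for namespace '{ns}'"
        for ns in sorted(label_refs)
        if ns not in cli_refs
    ]
    return dict(cli_refs), warnings


# Placeholder locations, for illustration only.
label_refs = {
    "pds": "https://example.invalid/PDS4_PDS_1G00.sch",
    "img": "https://example.invalid/PDS4_IMG_1G00_1810.sch",
}
cli_refs = {"img": "file:///ldd/PDS4_IMG_1G00_1810.sch"}

merged, _ = resolve_schematrons(label_refs, cli_refs, merge=True)
local_only, warnings = resolve_schematrons(label_refs, cli_refs, merge=False)
```

In the merge case the core schematron still comes from the label while IMG comes from the local copy; in the warn-only case the core is not loaded, but the omission is reported.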
@rgdeen we talked about this some more, and this is actually a huge can of worms trying to unravel this. We should throw some sort of warning when the PDS core schematron is not provided, but there are lots of curveballs that have come with trying to do this check that I think is going to be 1 step forward 2 steps back. In the end, to ensure your validation is correct, we recommend using what is in the label. |
If this isn't critical, I am going to close this as wontfix since there is a known workaround (use what is in the label). |
Hmmm... well, is it "critical"? Here's the scenario, you decide.
The VGR LDD only just got released. Until release happens, you have no choice but to use local dictionaries. I thought I had a complete copy of all the LDDs in my local directory (I did, just misnamed the core ones). So I went all the way through development of the data products thinking validation was passing. I'm ready to hand off to the node (to myself ;-) ) and so I ran validate without the options. Only at that point did it notice the errors from the core. Fortunately, the errors were minor and easily fixed with a Velocity rerun, but I could easily see having some major structural issues that are uncaught by the data provider. Who in this case really couldn't run with the web versions, as the VGR DD was not released until the project was basically done (because we kept updating it with problems).
So it kind of violates the principle of least surprise when running validate, that a simple error like misnaming the LDD (which happened because the Mac insists on adding .xml to everything when downloading and I forgot to tell it not to in this case) means you haven't really validated like you thought you had. Validate is so picky otherwise, should it be that fragile, really?
Here's another interesting tidbit. If I leave out DISP from the dictionary set, I get an error:
FAIL: file:/tmp/V1NA_5566630_RAW.xml
However... if I leave out IMG I get no such error... I think because MARS2020 and MSSS_CAM_MH have an xs:element entry referring to img:SomeClass. The first error could lead people to believe that validate does tell you if an LDD is missing (I sure thought so)... but it's not just core but other cases also where it doesn't. So how can someone during data product development ever be sure they got all the LDD's downloaded? There's no way to tell.
I'll leave it up to y'all to decide if it's "critical" or not... I really don't know how bad the worms are. But it did have non-trivial effects in this case. |
@rgdeen understood. We can try to support including the core in every validation, but the difficulty comes in the corner cases. We could enable mapping based upon filename, but that assumes the filenames are all the same, which has only become consistent in the last few years. Otherwise, reading the schematron and schemas on their own does not give a clear understanding of the namespace they pertain to. The namespace is in there, but so are potentially other namespaces. Which namespace is this one? e.g.
Which namespace pertains to this one? Option 1:
But this blows up a lot of existing pipelines. Option 2: |
@rgdeen We are moving this to the icebox for the time being. there are some significant technical challenges in the validate code involved in making this change, and we have bigger fish to fry. We will revisit this at a later time. Apologies for the inconvenience. |
oh interesting... the mapping of "img:" to "http://pds.nasa.gov/pds4/img/v1" is actually in the product label, not the schema. I guess in principle that means a product label could map a LDD to something non-canonical (e.g. "image:" instead of "img:") but hopefully everyone would reject that. If it were even noticed though, and I guess that's the crux of the problem. We really shouldn't condone LDD's that don't match the naming conventions, so if someone used "image:" for img or called it MY_DICT.xsd that should trigger a warning. Which leads to option 2 above, assume the filenames have the namespace in them because if they don't that's a bad thing. One thing to consider: this should be just a warning. So if there's a name mismatch or something, it'll do exactly what it does now... continue to validate, just issue an additional warning, which could be ignored. The point is to make sure people don't inadvertently omit a ldd (or core dd), not to "fix" the problem by falling back to another DD. Anyway, icebox is fine until you figure out a good way to deal with it. |
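A rough sketch of the warning-only, filename-based check suggested in this comment, in Python. The filename pattern is an assumption generalized from names such as `PDS4_PDS_1G00.sch` in this thread, and, as noted earlier in the discussion, older dictionaries may not follow it. The function names and the `1810` LDD version are hypothetical.

```python
import re
from pathlib import PurePath

# Assumed convention: PDS4_<NAMESPACE>_<IM version>[_<LDD version>].xsd|.sch
LDD_NAME = re.compile(
    r"PDS4_(?P<ns>[A-Z0-9_]+?)_(?P<im>[0-9][0-9A-Z][0-9]{2})"
    r"(?:_(?P<ldd>[0-9]{4}))?\.(?P<ext>xsd|sch)"
)


def namespace_from_filename(path):
    """Return the lowercase namespace id encoded in the filename, or None."""
    match = LDD_NAME.fullmatch(PurePath(path).name)
    return match.group("ns").lower() if match else None


def filename_warnings(label_prefixes, supplied_files):
    """Warn about unrecognized filenames and namespaces with no file."""
    warnings, found = [], set()
    for path in supplied_files:
        ns = namespace_from_filename(path)
        if ns is None:
            warnings.append(f"{path}: name does not match the expected pattern")
        else:
            found.add(ns)
    for prefix in sorted(set(label_prefixes) - found):
        warnings.append(f"no file supplied for namespace '{prefix}'")
    return warnings


# The misnamed core schematron from the original report, plus an IMG file.
for line in filename_warnings(
    {"pds", "img"},
    ["/ldd/PDS4_PDS_1G00.sch.xml", "/ldd/PDS4_IMG_1G00_1810.sch"],
):
    print("WARNING:", line)
```

Validation itself would proceed unchanged; the check only adds warnings, so a pipeline that relies on non-standard filenames keeps working and can ignore them.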
thanks @rgdeen
We discussed this as well. Heading into next release, we will work with the SWG to prioritize issues, and we can discuss some design options for this one if it bubbles up as a priority. In the end, hopefully the issue will eventually be caught when a data provider starts validating with released LDDs. Obviously not ideal when you are trying to test a pipeline, but any issues like you identified with your labels above should eventually get caught. |
Checked for duplicates
No - I haven't checked
🐛 Describe the bug
I have been doing bundle validation of my VGR PDART delivery using a local copy of all the DDs. Just before final delivery, I decided to try it with the delivered version of the DDs (i.e. let validate pick up the DD's from the web). Much to my surprise, it caught an error in <collection_type> that has been there since the beginning!
Looking into it, I discovered that my local DD copies had, for the PDS core only, PDS4_PDS_1G00.xsd.xml and PDS4_PDS_1G00.sch.xml ... i.e., a spurious .xml extension. This means the PDS core DD was not available in my local directory.
Jordan said he thought it was supposed to default back to the web if a DD was not found, but that's apparently not the case, as the bad <collection_type> should have been flagged in this case. I suspect it simply had no PDS core xsd/sch files to work with!
I'm surprised this didn't cause other issues... but there should at least be an info message printed if a LDD is referenced but is not available. Perhaps it does... I seem to remember errors along those lines in the past... but if so that behavior does not seem to extend to the PDS core DD.
🕵️ Expected behavior
I expected it to notify me if the core xsd/sch was unavailable. At least a warning.
📜 To Reproduce
Command line used was:
for the local version, with the --schematron and --schema parameters removed for the use-online-DD version.
Here is another concrete example with latest version of validate (see test data for the data used):
But it fails as expected when running without those LDDs:
🖥 Environment Info
RHE Linux:
📚 Version of Software Used
Validate 2.1.4
🩺 Test Data / Additional context
test.tar.gz
🦄 Related requirements
⚙️ Engineering Details