issue #1: command-line option to support content validation for every N products #587
Conversation
@jordanpadams @nutjob4life @tloubrieu-jpl Ready for review, but please read my review comments first.
👍
@al-niessner sorry, the requirement is not very clear here.
In essence, what this requirement is trying to accomplish is to run validation on every target input and follow the rules as expected, but perform the content validation only on every Nth product. So I think we will want to do this product-count-vs-everyN check within TableValidator and ArrayValidator, just before we validate the data contents.
For TableValidator, I think that is here:
@jordanpadams @nutjob4life @tloubrieu-jpl Are ArrayValidator and TableValidator the only two content validators? In other words, I think the requirement is still unclear. After all, everything in an XML file that is not the actual tag characters is content.
@al-niessner basically, we want to skip the IO for opening the actual data product. So anywhere that happens in the code, we want to skip it for all but every Nth product. I am not positive, but I believe all content validation / IO of data products occurs through
Now that is clear. Thanks. Funny that comparing the XML "what it should be" to the data file "what it is" is probably the most important check. Just out of curiosity, has anyone done a profile to see what takes so long? I mean, the tiny b.xml with a 1000-byte data file takes 7 seconds. I doubt it is the file IO.
from an archival perspective, definitely. the label describes the bits, and if it's wrong, you can't read/use/analyze those bits. the other stuff is pretty much the lead-up to performing this check. the referential integrity stuff is important for search and provenance.
no... my guess is the schema/schematron validation and/or the downloading of those schemas/schematrons to actually perform that validation. we have a ticket in the backlog to hopefully fix the latter with caching
@al-niessner feel free to have a poke around if you are interested
@al-niessner we also have some performance metrics from a while back here: https://nasa-pds.github.io/validate/operate/index.html#Performance . @galenhollins did some performance analysis a few years back, but I can't seem to track down where that may be documented beyond those metrics.
I am finding this to be the time sink: validate/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/LabelValidationRule.java, line 184 (at commit 125d84a).
Is it what you expected?
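For what it's worth, a minimal way to confirm a hot spot like this is a nanosecond timer around the suspect call. The helper below is a generic sketch, not part of validate; the `validateLabel` name in the comment is just the call from the thread that you would wrap.

```java
import java.util.function.Supplier;

// Generic timing probe (a sketch, not validate code). Wrap the suspect call,
// e.g. Probe.timed("validateLabel", () -> rule.validateLabel(target)).
final class Probe {
  static <T> T timed(String label, Supplier<T> call) {
    long start = System.nanoTime();
    try {
      return call.get();
    } finally {
      long ms = (System.nanoTime() - start) / 1_000_000;
      System.err.printf("%s took %d ms%n", label, ms);
    }
  }

  // Convenience overload for void calls.
  static void timed(String label, Runnable call) {
    timed(label, () -> { call.run(); return null; });
  }
}
```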
@al-niessner hmmm probably? but now I am interested to know where within that label validation the time sink is. the download of the schemas to perform the validation? the schematron validation? something else?
@jordanpadams @nutjob4life @tloubrieu-jpl FYI so that we are all in the same place. It seems there is randomness in the timing that can vary significantly, and it is all in LabelValidationRule.validateLabel(). Will track it down now that it is becoming clear that the IO is NOT a computational heavyweight. The first file always takes a long time in LabelValidationRule.validateLabel(). If the same file is repeated, the time is smaller. Sometimes the times for other files are small too, but the first always takes 3+ seconds. Other files after the first can, but do not always, require another large chunk of time. Ran validate with seven input files but only four originals. Here are the relevant timing results in the order they were executed:
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- /tmp/JAD_L50_LRS_ELC_ANY_DEF_2017142_V01.xml (note: this checks a 280 MB table)
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github529/success/m0154651923f6_2p_cif_gbl.xml
Per #1 (comment), developing the every-Nth-file requirement and closing this PR. Will create a new ticket to track performance improvements for label validation.
@jordanpadams @nutjob4life @tloubrieu-jpl Found a bug in the large table that prevented validate from processing each row. It increases the overall processing time substantially (40 seconds for the table processing), but 20 seconds are still spent doing the other XML. Moving on to investigate why processing a row takes so much time and why LabelValidationRule takes so long.
Note: when the table bug was fixed, it errored on each row because the date-time format was not correct. Altered pds4-jparser to allow the table's date-time format and processed again. Interestingly, they process in about the same time (less than a second difference), allowing the recording of errors to be ruled out as the cause of the delay.
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- /tmp/JAD_L50_LRS_ELC_ANY_DEF_2017142_V01.xml (note: this checks a 280 MB table)
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github529/success/m0154651923f6_2p_cif_gbl.xml
@al-niessner very interesting. thanks for keeping us updated. interested to see what comes out of this
@jimmie @jordanpadams @nutjob4life @tloubrieu-jpl The table and array time is just a multiplication problem. The 280 MB file takes so long because it has 2000+ fields and 2000+ rows. Despite each check taking microseconds, the multiplication wins and the whole table takes just shy of a minute. Regular expressions have a similar problem, and they solve it with a "compile"-like approach that converts the regex string into something more efficient (I do not know what) and then processes the data faster. Since the fields are constant, rather than re-interpreting them on every iteration, maybe do a compile-like trick that gives a faster comparison and takes the multiplication out of the problem. I guess it would be fun to take the fields, write the code that checks those fields without iteration, compile it, then apply it to each row. Sounds fun really, but back to doing every N instead.
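The "compile once" idea could look roughly like the sketch below: build a flat list of per-field check closures once from the constant field definitions (precompiling any regexes, exactly as java.util.regex.Pattern does), then run only that list against each row. All names here are hypothetical; this is not the validate code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: field definitions are constant, so "compile" them into
// a flat list of predicates once; each row then pays only for the checks.
final class CompiledRowChecker {
  private static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");
  private final List<Predicate<String>> checks = new ArrayList<>();

  void addNumericField() {
    checks.add(v -> NUMERIC.matcher(v).matches()); // pattern compiled once
  }

  void addMaxLengthField(int max) {
    checks.add(v -> v.length() <= max);
  }

  // Per-row cost is a straight walk over precompiled predicates, with no
  // re-interpretation of the field metadata.
  boolean check(String[] row) {
    for (int i = 0; i < checks.size(); i++) {
      if (!checks.get(i).test(row[i])) {
        return false;
      }
    }
    return true;
  }
}
```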
@al-niessner I'm sure we could develop some super-performant algorithm here, but right now, yes, all we do is rows × columns validation and it just takes that long. Instead of improving that algorithm, it may be simpler to eventually parallelize the row validation (or chunks of rows) by farming it out to X workers. let's dig a little further on performance improvements through Thursday, but if we don't find the smoking gun by then, I think we should jump back into #1 and then onto #519
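A sketch of that chunked-worker idea, under the assumption that per-row checks are independent; `RowCheck` and all the plumbing are hypothetical stand-ins, not validate's actual row validation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of farming chunks of rows out to X workers.
final class ParallelRows {
  interface RowCheck {
    void check(int rowIndex);
  }

  static void validate(int rowCount, int workers, RowCheck check) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      int chunk = (rowCount + workers - 1) / workers; // ceiling division
      List<Future<?>> pending = new ArrayList<>();
      for (int start = 0; start < rowCount; start += chunk) {
        final int lo = start;
        final int hi = Math.min(start + chunk, rowCount);
        pending.add(pool.submit(() -> {
          for (int r = lo; r < hi; r++) {
            check.check(r);
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // surfaces any exception thrown by a worker
      }
    } finally {
      pool.shutdown();
    }
  }
}
```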
@jimmie @jordanpadams @tloubrieu-jpl Update on performance: the large time delays in label validation come from downloading from the net. Repeat requests to the same URL are not downloaded a second time, but the URL is tested over and over, adding a quarter second per file that uses the same URL. It could be bigger for more complex checks.
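A minimal sketch of memoizing that per-URL test so the network round trip happens at most once per URL; the probe logic and all names here are assumptions for illustration, not validate's code.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache the result of testing a schema/schematron URL so
// repeat references to the same URL do not repeat the quarter-second hit.
final class UrlCheckCache {
  private final Map<String, Boolean> resolved = new ConcurrentHashMap<>();

  boolean isReachable(String url) {
    return resolved.computeIfAbsent(url, this::probe);
  }

  private boolean probe(String url) {
    try {
      URLConnection conn = new URL(url).openConnection();
      conn.setConnectTimeout(5_000);
      conn.connect();
      return true; // the network round trip happens at most once per URL
    } catch (IOException e) {
      return false;
    }
  }
}
```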
@al-niessner good to know! if we provide the schemas/schematrons locally (I think the …)… if so, then the caching ticket we have in the queue may help solve this?
Ready for review. It does improve things, but not as expected. I did every 100 rows and the time improved by 10x as (maybe) expected -- the tables are square and only one axis was changed, so a square-root improvement.
Ready again. Hopefully this is where you wanted it to skip. Note: had to do some wacky check for null that I do not understand. It works just fine on my command line and again in my cucumber tests. I have no idea why sonatype was failing with the null. Maybe somebody is using a property sheet that does not have everyN in it yet and does not go through ValidateLauncher? I have no idea, but it is curious.
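For that null case, one defensive pattern is to fall back to 1, i.e. content-validate everything when the property is absent. A sketch under that assumption; `everyNOrDefault` and the boxed-Integer parameter are illustrative, not the actual RuleContext code.

```java
// Hypothetical default: a property sheet without everyN behaves as before,
// i.e. every product gets content validation.
static int everyNOrDefault(Integer fromPropertySheet) {
  return (fromPropertySheet == null) ? 1 : fromPropertySheet;
}
```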
Always do first one found
@al-niessner closer, but I think we still need to go one level deeper in the code execution. we still want to validate the data object definition, e.g. here in TableValidator. the only thing we want to skip is the content validation specifically, e.g. here in TableValidator. per the comments regarding Tables and Arrays, I believe everything we validate content-wise comes down to tables vs. arrays. I guess we could include our PDF validation in this check as well, but it's the large data files we are most worried about spot-checking here.
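Reading that as: the definition pass always runs, and only the IO-heavy content pass is gated. A sketch of that shape; every name here is hypothetical, not the actual TableValidator.

```java
// Hypothetical shape of the requested placement: definitions are always
// validated; only the data-content pass is gated by the every-N check.
final class SpotCheckedTable {
  private final int everyN;
  private long seen;

  SpotCheckedTable(int everyN) {
    this.everyN = Math.max(1, everyN); // everyN of 1 validates everything
  }

  void validate(Object tableDefinition) {
    validateDefinition(tableDefinition);  // always runs
    if (++seen % everyN == 0) {
      validateContents(tableDefinition);  // only every Nth data object
    }
  }

  private void validateDefinition(Object def) { /* label/definition checks */ }

  private void validateContents(Object def) { /* row-by-row data IO checks */ }
}
```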
Moved. Put it at the top of the content checks for table and array (they are independent), so right now it is every N tables or every N arrays. It will require a shared static (singleton) to do every N (array or table), so let me know which every-N you really want.
@al-niessner sorry for the confusion. so I think we want the
To be clear: you want every N for (table or array). Correct?
Added a global counter so that tables and arrays are counted together.
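Presumably something like the following: one static counter consulted by both the table and array content checks, so "every N" spans both kinds of data object. A sketch under that assumption, not the actual implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shared counter: tables and arrays both call shouldCheck, so a
// single running count decides which data objects get content validation.
final class ContentCheckCounter {
  private static final AtomicLong SEEN = new AtomicLong();

  static boolean shouldCheck(int everyN) {
    return SEEN.incrementAndGet() % everyN == 0;
  }
}
```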
🛠 Lift Auto-fix
Some of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.
```
# Download the patch
curl https://lift.sonatype.com/api/patch/github.com/NASA-PDS/validate/587.diff -o lift-autofixes.diff
# Apply the patch with git
git apply lift-autofixes.diff
# Review the changes
git diff
```
Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command:
```
curl https://lift.sonatype.com/api/patch/github.com/NASA-PDS/validate/587.diff | git apply
```
Once you're satisfied, commit and push your changes in your project.
@al-niessner to go back to one of the questions above:
@jordanpadams @nutjob4life @tloubrieu-jpl Now does every Nth table or array.
🗒️ Summary
Added a command-line option to run content validation on every Nth product rather than on all of them.
⚙️ Test Data and/or Report
Tested using
♻️ Related Issues
#1