issue #1: command-line option to support content validation for every N products #587

Merged
merged 13 commits into from
Feb 9, 2023

Conversation

al-niessner
Contributor

🗒️ Summary

Added a command line option to process every Nth file rather than all of them.

⚙️ Test Data and/or Report

Tested using

validate --everyN 2 src/test/resources/github531/success/b.xml src/test/resources/github531/fail/b.xml src/test/resources/github531/success/b.xml src/test/resources/github531/fail/b.xml

>< snip ><

  Product Validation Summary:
    2          product(s) passed
    0          product(s) failed
    0          product(s) skipped

♻️ Related Issues

#1

@al-niessner al-niessner requested a review from a team as a code owner January 5, 2023 22:03
@al-niessner al-niessner self-assigned this Jan 5, 2023
@al-niessner
Contributor Author

@jordanpadams @nutjob4life @tloubrieu-jpl

Ready for review but read my review comments

Member

@nutjob4life nutjob4life left a comment

👍

Member

@jordanpadams jordanpadams left a comment

@al-niessner sorry the requirement is not very clear here.

in essence what this requirement is trying to accomplish is to run validation on every target input, and follow the rules as expected, but skip content validation only every N products. So I think we will want to do this check for product count vs. everyN within TableValidator and ArrayValidator just before we validate the data contents.

For Table Validator, I think that is here:

https://github.com/NASA-PDS/validate/blob/main/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/TableValidator.java#L97

@jordanpadams jordanpadams changed the title issue 1: added a new command line option issue #1: added a new command line option Jan 6, 2023
@jordanpadams jordanpadams changed the title issue #1: added a new command line option issue #1: support content validation for every N products Jan 6, 2023
@jordanpadams jordanpadams changed the title issue #1: support content validation for every N products issue #1: command-line option to support content validation for every N products Jan 6, 2023
@al-niessner
Contributor Author

> @al-niessner sorry the requirement is not very clear here.
>
> in essence what this requirement is trying to accomplish is to run validation on every target input, and follow the rules as expected, but skip content validation only every N products. So I think we will want to do this check for product count vs. everyN within TableValidator and ArrayValidator just before we validate the data contents.
>
> For Table Validator, I think that is here:
>
> https://github.com/NASA-PDS/validate/blob/main/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/TableValidator.java#L97

@jordanpadams @nutjob4life @tloubrieu-jpl

Are ArrayValidator and TableValidator the only two kinds of content? In other words, I think the requirement is still unclear. After all, everything in an XML file that is not the actual tag characters is content.

@jordanpadams
Member

@al-niessner basically, we want to skip the IO for opening the actual data product. so anywhere that happens in the code, we want to skip it for every N products. I am not positive, but I believe all content validation / IO of data products occurs through DataDefinitionAndContentValidationRule, which then calls those 2 classes. definitely check that my understanding is correct, because if not, we should try to encapsulate the content validation better so we can more easily maintain "content validation" vs. "referential integrity" vs. "XML validation" vs. etc.

@al-niessner
Contributor Author

> @al-niessner basically, we want to skip the IO for opening the actual data product. so anywhere that happens in the code, we want to skip it for every N products. I am not positive, but I believe all content validation / IO of data products occurs through DataDefinitionAndContentValidationRule, which then calls those 2 classes. definitely check that my understanding is correct, because if not, we should try to encapsulate the content validation better so we can more easily maintain "content validation" vs. "referential integrity" vs. "XML validation" vs. etc.

@jordanpadams

Now that is clear. Thanks. Funny that comparing the XML "what it should be" to the data file "what it is" is probably the most important check.

Just out of curiosity, has anyone done a profile to see what takes so long? I mean the tiny b.xml with a 1000 byte data file takes 7 seconds. I doubt it is the file IO.

@jordanpadams
Member

@al-niessner

> Now that is clear. Thanks. Funny that comparing the XML "what it should be" to the data file "what it is" is probably the most important check.

from an archival perspective, definitely. the label describes the bits, and if it's wrong, you can't read/use/analyze those bits. the other stuff is pretty much the lead-up to perform this check. the referential integrity stuff is important for search and provenance.

> Just out of curiosity, has anyone done a profile to see what takes so long? I mean the tiny b.xml with a 1000 byte data file takes 7 seconds. I doubt it is the file IO.

no... my guess is the schema/schematron validation and/or the downloading of those schemas/schematrons to actually perform that validation. we have a ticket in the backlog to hopefully fix the latter with caching

@jordanpadams
Member

@al-niessner feel free to have a poke around if you are interested

@jordanpadams
Member

@al-niessner we also have some performance metrics from a while back here: https://nasa-pds.github.io/validate/operate/index.html#Performance . @galenhollins did some performance analysis a few years back, but I can't seem to track down where that may be documented beyond those metrics.

@al-niessner
Contributor Author

@jordanpadams

I am finding this to be the time sink:

Is it what you expected?

@jordanpadams
Member

@al-niessner hmmm probably? but now I am interested to know where within that label validation is the time sink? the download of the schemas to perform the validation? the schematron validation? something else?

@al-niessner
Contributor Author

@jordanpadams @nutjob4life @tloubrieu-jpl

FYI, so that we are all on the same page: there seems to be randomness in the timing, it can vary significantly, and it is all in LabelValidationRule.validateLabel(). Will track it down now that it is becoming clear that the IO is NOT a computational heavyweight.

The first file always takes a long time in LabelValidationRule.validateLabel(). If the same file is repeated, the time is smaller. Sometimes other files are small too but the first always takes 3+ seconds. Other files after the first can, but do not always, require another large chunk of time. Ran validate with seven input files but only four originals. Here are the relevant timing results in the order they were executed:

src/test/resources/github499/success/M7_217_044546_N.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
2855ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
14ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
12ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
2ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
112ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
1ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github531/success/b.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
158ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
4ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
4ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
53ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
1ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

/tmp/JAD_L50_LRS_ELC_ANY_DEF_2017142_V01.xml

Note: this checks a 280 MB table

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
2144ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
804ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
3ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
2ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
40ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
1ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github499/success/M7_217_044546_N.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
146ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
3ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
4ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
63ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github531/success/b.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
91ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
2ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
3ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
36ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github499/success/M7_217_044546_N.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
117ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
4ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
4ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
39ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github529/success/m0154651923f6_2p_cif_gbl.xml

1ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
3033ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
2ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
2ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
312ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 
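
Per-rule timings like the ones listed above can be gathered by wrapping each rule call in a small timer. The following is only a minimal sketch: `RuleTimer` and its methods are hypothetical names, not part of validate, and the thread does not say how the numbers above were actually collected.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of validate): records wall-clock time per named step.
final class RuleTimer {
  // LinkedHashMap keeps the steps in the order they were first timed.
  private final Map<String, Long> elapsedMs = new LinkedHashMap<>();

  // Runs the step and adds its elapsed milliseconds to the total for that name.
  void time(String name, Runnable step) {
    long start = System.nanoTime();
    try {
      step.run();
    } finally {
      long ms = (System.nanoTime() - start) / 1_000_000L;
      elapsedMs.merge(name, ms, Long::sum);
    }
  }

  Map<String, Long> results() {
    return elapsedMs;
  }

  // Formats one "<ms>ms <name>" line per step, similar to the listing above.
  String report() {
    StringBuilder sb = new StringBuilder();
    elapsedMs.forEach((name, ms) -> sb.append(ms).append("ms ").append(name).append('\n'));
    return sb.toString();
  }
}
```

A caller would wrap each rule invocation, for example `timer.time("validateLabel()", () -> rule.run())`, where `rule` stands for whatever object performs the step, and print `timer.report()` once per input file.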

@jordanpadams
Member

Per #1 (comment), developing the every nth file requirement and closing this PR. Will create new ticket to track performance improvements for label validation.

@al-niessner
Contributor Author

@jordanpadams @nutjob4life @tloubrieu-jpl

Found a bug in the large table that prevented validate from processing each row. With that fixed, the overall processing time increases substantially (40 seconds for the table processing), but 20 seconds are still spent doing the other XML. Moving on to investigate why processing the rows takes so much time and why LabelValidationRule takes so long.

Note: once the table bug was fixed, validate errored on each row because the date-time format was not correct. I altered pds4-jparser to allow the table's date-time format and processed again. Interestingly, both runs take about the same time (less than a second difference), which rules out the recording of errors as the cause of the delay.

src/test/resources/github499/success/M7_217_044546_N.xml

1ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
8888ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
15ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
14ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
2ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
141ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
1ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github531/success/b.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
187ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
5ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
6ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
55ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

/tmp/JAD_L50_LRS_ELC_ANY_DEF_2017142_V01.xml

Note: this checks a 280 MB table

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
2618ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
803ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
4ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
39571ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github499/success/M7_217_044546_N.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
774ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
3ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
3ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
46ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github531/success/b.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
128ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
3ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
5ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
59ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github499/success/M7_217_044546_N.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
162ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
3ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
3ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
37ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
0ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

src/test/resources/github529/success/m0154651923f6_2p_cif_gbl.xml

0ms gov.nasa.pds.tools.validate.rule.RegisterTargets.registerTargets() 
0ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.checkLabelExtension() 
4160ms gov.nasa.pds.tools.validate.rule.pds4.LabelValidationRule.validateLabel() 
3ms gov.nasa.pds.tools.validate.rule.pds4.FileReferenceValidationRule.validateFileReferences() 
5ms gov.nasa.pds.tools.validate.rule.pds4.ContextProductReferenceValidationRule.checkContextReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.LocalIdentifierReferencesRule.validateLocalIdentifiers() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerFileReferences() 
0ms gov.nasa.pds.tools.validate.rule.pds4.RegisterTargetReferences.registerDocumentFileReferences() 
1ms gov.nasa.pds.tools.validate.rule.pds4.RegisterLabelIdentifiers.registerIdentifiers() 
282ms gov.nasa.pds.tools.validate.rule.pds4.DataDefinitionAndContentValidationRule.validate() 
1ms gov.nasa.pds.tools.validate.rule.RecordValidationResults.record() 

@jordanpadams
Member

@al-niessner very interesting. thanks for keeping us updated. interested to see what comes out of this

@al-niessner
Contributor Author

@jimmie @jordanpadams @nutjob4life @tloubrieu-jpl

The table and array time is just a multiplication problem. The 280 MB file takes so long because it has 2000+ fields and 2000+ rows. Despite each check taking microseconds, the multiplication wins and the whole table takes just shy of a minute. Regular expressions have a similar problem, and they solve it with a "compile"-like approach that converts the regex string into something more efficient (I do not know what) and then processes data faster. Since the fields are constant, rather than iterating over them each time, maybe a compile-like trick could give faster comparisons and remove the multiplication from the problem. I guess it would be something fun, like taking the fields, writing the code to check those fields without iteration, compiling it, then applying it to each row. Sounds fun really, but back to doing every N instead.
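
One way to picture the "compile once, apply per row" idea is to resolve each field's check a single time and reuse the resulting list for every row. This is a minimal sketch under stated assumptions: `FieldKind` and `CompiledRowChecker` are hypothetical names, the three field kinds are invented for illustration, and this is not how validate or pds4-jparser represent fields. It still performs rows × fields checks; it only moves the per-field decision out of the inner loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical field kinds for illustration; real PDS4 field types are richer.
enum FieldKind { INTEGER, REAL, TEXT }

final class CompiledRowChecker {
  private static final Pattern INT_PATTERN = Pattern.compile("[+-]?\\d+");
  private static final Pattern REAL_PATTERN =
      Pattern.compile("[+-]?(\\d+\\.?\\d*|\\.\\d+)([eE][+-]?\\d+)?");

  private final List<Predicate<String>> checks = new ArrayList<>();

  // "Compile" step: pick each field's check once, not once per row.
  CompiledRowChecker(List<FieldKind> fields) {
    for (FieldKind kind : fields) {
      switch (kind) {
        case INTEGER:
          checks.add(s -> INT_PATTERN.matcher(s).matches());
          break;
        case REAL:
          checks.add(s -> REAL_PATTERN.matcher(s).matches());
          break;
        default:
          checks.add(s -> true); // TEXT: anything goes in this sketch
          break;
      }
    }
  }

  // Returns the index of the first field that fails its check, or -1 if the row passes.
  int firstInvalidField(String[] row) {
    if (row.length != checks.size()) {
      throw new IllegalArgumentException("row has " + row.length + " fields, expected " + checks.size());
    }
    for (int i = 0; i < row.length; i++) {
      if (!checks.get(i).test(row[i].trim())) {
        return i;
      }
    }
    return -1;
  }
}
```

The checker is built once per table and then called for every row, so any type lookup or pattern compilation happens once per field rather than once per cell.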

@jordanpadams
Member

@al-niessner I'm sure we could develop some super performant algorithm here, but right now, yes, all we do is rows X columns validation and it just takes that long. Instead of improving that algorithm, it may even be simpler to eventually parallelize the row validation (or chunks of rows) by farming it out to X workers.

let's dig a little further on performance improvements through Thursday, but if we don't find the smoking gun by then, I think we should jump back into #1 and then onto #519
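
The "chunks of rows farmed out to X workers" idea mentioned above could look roughly like the sketch below. It assumes rows can be checked independently of one another; `ChunkedRowValidation` and `invalidRows` are hypothetical names, and nothing here reflects how validate actually reads or validates tables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class ChunkedRowValidation {
  // Returns the indices of rows that fail the check, in ascending order.
  static List<Integer> invalidRows(
      List<String> rows, Predicate<String> rowIsValid, int workers, int chunkSize) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      List<Future<List<Integer>>> pending = new ArrayList<>();
      for (int start = 0; start < rows.size(); start += chunkSize) {
        final int from = start;
        final int to = Math.min(start + chunkSize, rows.size());
        // Each task validates one contiguous chunk of rows.
        pending.add(pool.submit(() -> {
          List<Integer> bad = new ArrayList<>();
          for (int i = from; i < to; i++) {
            if (!rowIsValid.test(rows.get(i))) {
              bad.add(i);
            }
          }
          return bad;
        }));
      }
      // Collect results in submission order so the indices stay sorted.
      List<Integer> all = new ArrayList<>();
      for (Future<List<Integer>> f : pending) {
        all.addAll(f.get());
      }
      return all;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while validating rows", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("row validation failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

This sketch holds all rows in memory, which a 280 MB table may not allow; a real version would more likely hand each worker a byte range or a batch of records read from the file.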

@al-niessner
Contributor Author

@jimmie @jordanpadams @tloubrieu-jpl

Update on performance: the large time delays in label validation come from downloading from the net. Repeated URLs are not downloaded a second time, but they are tested over and over, adding a quarter second per file that uses the same URL. It could be more for more complex checks.

@jordanpadams
Member

@al-niessner good to know! if we provide the schemas/schematrons locally (I think the -S and -x flags?), does this speed up?

if so, then the caching ticket we have in the queue may help solve this?
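
If the repeated quarter second comes from re-processing a schema or schematron that was already fetched, which is an assumption and not something this thread establishes, then caching the processed object by URL is one possible shape for the fix. A minimal sketch, where `ResourceCache` is a hypothetical name and the loader stands in for whatever fetch-and-compile step is expensive:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache (not validate's implementation): loads each URL at most once.
final class ResourceCache<T> {
  private final Map<String, T> loaded = new ConcurrentHashMap<>();
  private final Function<String, T> loader;

  ResourceCache(Function<String, T> loader) {
    this.loader = loader;
  }

  // Returns the cached value; the loader runs only on the first request for a URL.
  T get(String url) {
    return loaded.computeIfAbsent(url, loader);
  }
}
```

In practice the cached value might be a compiled `javax.xml.validation.Schema`, which the JDK documents as thread-safe, so one instance could serve every label that references the same URL.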

@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Ready for review. It does improve things but not as expected. I tried every 100 rows and the time improved by 10x, as (maybe) expected: tables are square and I changed only one axis, so a square-root improvement.

@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Ready again. Hopefully this is where you wanted it to skip.

Note: had to do some wacky check for null that I do not understand. It works just fine on my command line and again in my cucumber. I have no idea why sonatype was failing with the null. Maybe somebody is using a property sheet that does not have everyN in it yet and does not go through ValidateLauncher? I have no idea but it is curious.

@jordanpadams
Member

@al-niessner closer, but I think we still need to go one level deeper in the code execution.

we still want to validate the data object definition, e.g. here in TableValidator. the only thing we want to skip is the content validation specifically, e.g. here in TableValidator.

per the comments regarding Table and Arrays, I believe everything we validate content-wise comes down to tables vs. arrays. I guess we could include our PDF validation in this check as well, but it's the large data files we are most worried about spot checking here.

@al-niessner
Contributor Author

@jordanpadams

Moved. Put it at the top of the content checks for table and array (they are independent) so that it is every N tables or every N arrays. It will require a shared static (singleton) to do every N (array or table), so let me know which every N you really want.

@jordanpadams
Member

@al-niessner sorry for the confusion. so I think we want the everyN count to be at the product level (within DataDefinitionAndContentValidationRule.java) where we maybe set a flag or something, which is then passed down to the ArrayValidator, TableValidator, x Validator to skip validateDataObjectContents.
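
The product-level design described above could be sketched as follows. Every name here is hypothetical: `DataObjectCheck` stands in for ArrayValidator and TableValidator, `ProductContentGate` stands in for the counting logic in DataDefinitionAndContentValidationRule, and the choice that the Nth, 2Nth, ... products get content validation is an assumption the thread does not settle.

```java
import java.util.List;

// Hypothetical stand-in for ArrayValidator / TableValidator; the real classes differ.
interface DataObjectCheck {
  void validateDefinition();

  void validateContents();
}

// Hypothetical product-level rule: counts products and decides once per product.
final class ProductContentGate {
  private final int everyN;
  private long productCount = 0;

  ProductContentGate(int everyN) {
    if (everyN < 1) {
      throw new IllegalArgumentException("everyN must be >= 1");
    }
    this.everyN = everyN;
  }

  // Validates one product; returns true if content validation ran for it.
  boolean validateProduct(List<DataObjectCheck> dataObjects) {
    productCount++;
    // Assumption: the Nth, 2Nth, ... products get content validation.
    boolean checkContents = productCount % everyN == 0;
    for (DataObjectCheck obj : dataObjects) {
      obj.validateDefinition(); // definitions are always validated
      if (checkContents) {
        obj.validateContents(); // the expensive data-file pass is the only part skipped
      }
    }
    return checkContents;
  }
}
```

Because the decision is made once per product, every table and array inside a selected product gets its contents checked, and none inside a skipped product do.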

@al-niessner
Contributor Author

@jordanpadams

To be clear: you want every N for (table or array). Correct?

@al-niessner
Contributor Author

@jordanpadams

Added a global counter so that tables and arrays are counted together.
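
A global counter shared by the table and array paths might look like the sketch below. `EveryNCounter` is a hypothetical name and this is not the code merged in this PR; it only illustrates counting both kinds of data object together with one thread-safe static.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shared counter; illustrative only, not validate's actual code.
final class EveryNCounter {
  private static final AtomicLong SEEN = new AtomicLong();

  private EveryNCounter() {}

  // Called from both the table and the array path; true for the Nth, 2Nth, ... data object.
  static boolean shouldValidateContents(int everyN) {
    return SEEN.incrementAndGet() % everyN == 0;
  }

  // Resets the count, e.g. between independent validation runs or tests.
  static void reset() {
    SEEN.set(0);
  }
}
```

Note that this counts data objects rather than products, so a product holding several tables or arrays advances the count more than once.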

@sonatype-lift
Contributor

sonatype-lift bot commented Jan 30, 2023

🛠 Lift Auto-fix

Some of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.[1]

# Download the patch
curl https://lift.sonatype.com/api/patch/github.com/NASA-PDS/validate/587.diff -o lift-autofixes.diff

# Apply the patch with git
git apply lift-autofixes.diff

# Review the changes
git diff

Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command:

curl https://lift.sonatype.com/api/patch/github.com/NASA-PDS/validate/587.diff | git apply

Once you're satisfied, commit and push your changes in your project.

Footnotes

  1. You can preview the patch by opening the patch URL in the browser.

@jordanpadams
Member

@al-niessner to go back to one of the questions above:

> Wondered why it seemed to be there already. Be more clear because I am not sure I understand what you are trying to avoid. There are 100 XML files. You want to process all of them. Of the 100 XML files, 20 have data files associated with them that need to have their content checked. You want to do everyN on just the 20, correct?

  • if everyN=20, and target=100 XML files (or "products")
    • validate the XML of all 100 products
    • validate the content of the associated data product files for every 20th product we come across

@al-niessner
Contributor Author

@jordanpadams @nutjob4life @tloubrieu-jpl

Now does every N table or array.

3 participants