issue #1: command-line option to support content validation for every N products #587
Conversation
@jordanpadams @nutjob4life @tloubrieu-jpl Ready for review, but please read my review comments first.
👍
@al-niessner sorry, the requirement is not very clear here.
In essence, what this requirement is trying to accomplish is to run validation on every target input and follow the rules as expected, but perform the content validation only on every Nth product. So I think we will want to do this product-count-vs-everyN check within TableValidator and ArrayValidator, just before we validate the data contents.
For TableValidator, I think that is here:
@jordanpadams @nutjob4life @tloubrieu-jpl Are ArrayValidator and TableValidator the only two content validators? In other words, I think the requirement is still unclear. After all, everything in an XML file that is not the actual tag characters is content.
@al-niessner basically, we want to skip the IO for opening the actual data product. So anywhere that happens in the code, we want to skip it for all but every Nth product. I am not positive, but I believe all content validation / IO of data products occurs through
Now that is clear. Thanks. Funny that comparing the XML "what it should be" to the data file "what it is" is probably the most important check. Just out of curiosity, has anyone done a profile to see what takes so long? I mean, the tiny b.xml with a 1000-byte data file takes 7 seconds. I doubt it is the file IO.
from an archival perspective, definitely. the label describes the bits, and if it's wrong, you can't read/use/analyze those bits. the other stuff is pretty much the lead-up to performing this check. the referential integrity stuff is important for search and provenance.
no... my guess is the schema/schematron validation and/or the downloading of those schemas/schematrons to actually perform that validation. we have a ticket in the backlog to hopefully fix the latter with caching
@al-niessner feel free to have a poke around if you are interested
@al-niessner we also have some performance metrics from a while back here: https://nasa-pds.github.io/validate/operate/index.html#Performance . @galenhollins did some performance analysis a few years back, but I can't seem to track down where that may be documented beyond those metrics.
I am finding this to be the time sink: validate/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/LabelValidationRule.java, line 184 (at commit 125d84a).
Is it what you expected?
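For what it's worth, a minimal way to confirm a hot spot like this is a nanosecond timer around the suspect call. The helper below is a generic sketch, not part of validate; the `validateLabel` name in the comment is just the call from the thread that you would wrap.

```java
import java.util.function.Supplier;

// Generic timing probe (a sketch, not validate code). Wrap the suspect call,
// e.g. Probe.timed("validateLabel", () -> rule.validateLabel(target)).
final class Probe {
  static <T> T timed(String label, Supplier<T> call) {
    long start = System.nanoTime();
    try {
      return call.get();
    } finally {
      long ms = (System.nanoTime() - start) / 1_000_000;
      System.err.printf("%s took %d ms%n", label, ms);
    }
  }

  // Convenience overload for void calls.
  static void timed(String label, Runnable call) {
    timed(label, () -> { call.run(); return null; });
  }
}
```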
@al-niessner hmmm probably? but now I am interested to know where within that label validation the time sink is. the download of the schemas to perform the validation? the schematron validation? something else?
@jordanpadams @nutjob4life @tloubrieu-jpl FYI so that we are all in the same place. It seems there is randomness in the timing that can vary significantly, and it is all in LabelValidationRule.validateLabel(). Will track it down now that it is becoming clear that the IO is NOT a computational heavyweight. The first file always takes a long time in LabelValidationRule.validateLabel(). If the same file is repeated, the time is smaller. Sometimes the times for other files are small too, but the first always takes 3+ seconds. Other files after the first can, but do not always, require another large chunk of time. Ran validate with seven input files but only four originals. Here are the relevant timing results in the order they were executed:
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- /tmp/JAD_L50_LRS_ELC_ANY_DEF_2017142_V01.xml (note: this checks a 280 MB table)
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github529/success/m0154651923f6_2p_cif_gbl.xml
Per #1 (comment), developing the every-Nth-file requirement and closing this PR. Will create a new ticket to track performance improvements for label validation.
@jordanpadams @nutjob4life @tloubrieu-jpl Found a bug in the large table that prevented validate from processing each row. It increases the overall processing time substantially (40 seconds for the table processing), but 20 seconds are still spent doing the other XML. Moving on to investigate why processing a row takes so much time and why LabelValidationRule takes so long.
Note: when the table bug was fixed, it errored on each row because the date-time format was not correct. Altered pds4-jparser to allow the table's date-time format and processed again. Interestingly, they process in about the same time (less than a second difference), allowing the recording of errors to be ruled out as the cause of the delay.
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- /tmp/JAD_L50_LRS_ELC_ANY_DEF_2017142_V01.xml (note: this checks a 280 MB table)
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github531/success/b.xml
- src/test/resources/github499/success/M7_217_044546_N.xml
- src/test/resources/github529/success/m0154651923f6_2p_cif_gbl.xml
@al-niessner very interesting. thanks for keeping us updated. interested to see what comes out of this
@jimmie @jordanpadams @nutjob4life @tloubrieu-jpl The table and array time is just a multiplication problem. The 280 MB file takes so long because it has 2000+ fields and 2000+ rows. Despite each check taking microseconds, the multiplication wins and the whole table takes just shy of a minute. Regular expressions have a similar problem, and they solve it with a "compile"-like approach that converts the regex string into something more efficient (I do not know what) and then processes the data faster. Since the fields are constant, rather than re-interpreting them on every iteration, maybe do a compile-like trick that gives a faster comparison and takes the multiplication out of the problem. I guess it would be fun to take the fields, write the code that checks those fields without iteration, compile it, then apply it to each row. Sounds fun really, but back to doing every N instead.
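The "compile once" idea could look roughly like the sketch below: build a flat list of per-field check closures once from the constant field definitions (precompiling any regexes, exactly as java.util.regex.Pattern does), then run only that list against each row. All names here are hypothetical; this is not the validate code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: field definitions are constant, so "compile" them into
// a flat list of predicates once; each row then pays only for the checks.
final class CompiledRowChecker {
  private static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?");
  private final List<Predicate<String>> checks = new ArrayList<>();

  void addNumericField() {
    checks.add(v -> NUMERIC.matcher(v).matches()); // pattern compiled once
  }

  void addMaxLengthField(int max) {
    checks.add(v -> v.length() <= max);
  }

  // Per-row cost is a straight walk over precompiled predicates, with no
  // re-interpretation of the field metadata.
  boolean check(String[] row) {
    for (int i = 0; i < checks.size(); i++) {
      if (!checks.get(i).test(row[i])) {
        return false;
      }
    }
    return true;
  }
}
```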
@al-niessner I'm sure we could develop some super-performant algorithm here, but right now, yes, all we do is rows × columns validation and it just takes that long. Instead of improving that algorithm, it may be simpler to eventually parallelize the row validation (or chunks of rows) by farming it out to X workers. let's dig a little further on performance improvements through Thursday, but if we don't find the smoking gun by then, I think we should jump back into #1 and then onto #519
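A sketch of that chunked-worker idea, under the assumption that per-row checks are independent; `RowCheck` and all the plumbing are hypothetical stand-ins, not validate's actual row validation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of farming chunks of rows out to X workers.
final class ParallelRows {
  interface RowCheck {
    void check(int rowIndex);
  }

  static void validate(int rowCount, int workers, RowCheck check) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      int chunk = (rowCount + workers - 1) / workers; // ceiling division
      List<Future<?>> pending = new ArrayList<>();
      for (int start = 0; start < rowCount; start += chunk) {
        final int lo = start;
        final int hi = Math.min(start + chunk, rowCount);
        pending.add(pool.submit(() -> {
          for (int r = lo; r < hi; r++) {
            check.check(r);
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // surfaces any exception thrown by a worker
      }
    } finally {
      pool.shutdown();
    }
  }
}
```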
@jimmie @jordanpadams @tloubrieu-jpl Update on performance: the large time delays in label validation come from downloading from the net. Repeat requests to the same URL are not downloaded a second time, but the URL is tested over and over, adding a quarter second per file that uses the same URL. It could be bigger for more complex checks.
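A minimal sketch of memoizing that per-URL test so the network round trip happens at most once per URL; the probe logic and all names here are assumptions for illustration, not validate's code.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache the result of testing a schema/schematron URL so
// repeat references to the same URL do not repeat the quarter-second hit.
final class UrlCheckCache {
  private final Map<String, Boolean> resolved = new ConcurrentHashMap<>();

  boolean isReachable(String url) {
    return resolved.computeIfAbsent(url, this::probe);
  }

  private boolean probe(String url) {
    try {
      URLConnection conn = new URL(url).openConnection();
      conn.setConnectTimeout(5_000);
      conn.connect();
      return true; // the network round trip happens at most once per URL
    } catch (IOException e) {
      return false;
    }
  }
}
```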
@al-niessner good to know! if we provide the schemas/schematrons locally (I think the …)… if so, then the caching ticket we have in the queue may help solve this?
Ready for review. It does improve things, but not as expected. I did every 100 rows and the time improved by 10x as (maybe) expected -- the tables are square and only one axis was changed, so a square-root improvement.
Ready again. Hopefully this is where you wanted it to skip. Note: had to do some wacky check for null that I do not understand. It works just fine on my command line and again in my cucumber tests. I have no idea why sonatype was failing with the null. Maybe somebody is using a property sheet that does not have everyN in it yet and does not go through ValidateLauncher? I have no idea, but it is curious.
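For that null case, one defensive pattern is to fall back to 1, i.e. content-validate everything when the property is absent. A sketch under that assumption; `everyNOrDefault` and the boxed-Integer parameter are illustrative, not the actual RuleContext code.

```java
// Hypothetical default: a property sheet without everyN behaves as before,
// i.e. every product gets content validation.
static int everyNOrDefault(Integer fromPropertySheet) {
  return (fromPropertySheet == null) ? 1 : fromPropertySheet;
}
```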
Always do first one found
@al-niessner closer, but I think we still need to go one level deeper in the code execution. we still want to validate the data object definition, e.g. here in TableValidator. the only thing we want to skip is the content validation specifically, e.g. here in TableValidator. per the comments regarding Tables and Arrays, I believe everything we validate content-wise comes down to tables vs. arrays. I guess we could include our PDF validation in this check as well, but it's the large data files we are most worried about spot-checking here.
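Reading that as: the definition pass always runs, and only the IO-heavy content pass is gated. A sketch of that shape; every name here is hypothetical, not the actual TableValidator.

```java
// Hypothetical shape of the requested placement: definitions are always
// validated; only the data-content pass is gated by the every-N check.
final class SpotCheckedTable {
  private final int everyN;
  private long seen;

  SpotCheckedTable(int everyN) {
    this.everyN = Math.max(1, everyN); // everyN of 1 validates everything
  }

  void validate(Object tableDefinition) {
    validateDefinition(tableDefinition);  // always runs
    if (++seen % everyN == 0) {
      validateContents(tableDefinition);  // only every Nth data object
    }
  }

  private void validateDefinition(Object def) { /* label/definition checks */ }

  private void validateContents(Object def) { /* row-by-row data IO checks */ }
}
```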
Moved. Put it at the top of the content checks for table and array (they are independent), so right now it is every N tables or every N arrays. It will require a shared static (singleton) to do every N (array or table), so let me know which every-N you really want.
@al-niessner sorry for the confusion. so I think we want the
To be clear: you want every N for (table or array). Correct?
Added a global counter so that tables and arrays are counted together.
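Presumably something like the following: one static counter consulted by both the table and array content checks, so "every N" spans both kinds of data object. A sketch under that assumption, not the actual implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shared counter: tables and arrays both call shouldCheck, so a
// single running count decides which data objects get content validation.
final class ContentCheckCounter {
  private static final AtomicLong SEEN = new AtomicLong();

  static boolean shouldCheck(int everyN) {
    return SEEN.incrementAndGet() % everyN == 0;
  }
}
```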
🛠 Lift Auto-fix
Some of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.
```
# Download the patch
curl https://lift.sonatype.com/api/patch/github.com/NASA-PDS/validate/587.diff -o lift-autofixes.diff
# Apply the patch with git
git apply lift-autofixes.diff
# Review the changes
git diff
```
Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command:
```
curl https://lift.sonatype.com/api/patch/github.com/NASA-PDS/validate/587.diff | git apply
```
Once you're satisfied, commit and push your changes in your project.
@al-niessner to go back to one of the questions above:
@jordanpadams @nutjob4life @tloubrieu-jpl Now does every Nth table or array.
🗒️ Summary
Added a command-line option to run content validation on every Nth product rather than on all of them.
⚙️ Test Data and/or Report
Tested using
♻️ Related Issues
#1