-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New repairkit sweeper for updating metadata loaded with older versions of Harvest to use lists instead of single-value strings #54
Conversation
Sonatype Lift is retiringSonatype Lift will be retiring on Sep 12, 2023, with its analysis stopping on Aug 12, 2023. We understand that this news may come as a disappointment, and Sonatype is committed to helping you transition off it seamlessly. If you’d like to retain your data, please export your issues from the web console. |
@al-niessner I'll do a code-review this afternoon, but in the meantime could you please lint and push that commit? EDIT Functional tests too, please - no way it works as-is. The testing mocks aren't necessarily intuitive, so ping/email me if you want it broken down |
I do not know what CR is and I am letting sonatype give me pylint complaints. |
Never mind. I condensed the code a little and mypy seems happy. |
Wow. I spent more time doing whitespace than developing working code, at least with the test db. Congratulations on making more tedious hurdles than I do. There it is in all of glorious ugliness. I would recommend commenting out the write updates bit until you have run it once against a larger DB just to see what the repairs look like. My db does not have any problems, so hard to check that. I did add a functional test for allarrays.repair() and it does create a dictionary that looks like it will work given all documentation, provenance etc. Of course, never really know until opensearch does something with it. Oh yeah, sonatype is having problems but python, tox, etc are all happy with the import so do not know what you want to do with sonatype except maybe ignore it. |
@al-niessner re linting, that should never/rarely be a manual process. Run the linter with Re functional tests, the expectation is that an end-to-end test exists (for ends "mocked json doc input" and "update output json sent to OpenSearch", respectively) that covers at least the happy path, and preferably any obvious-ish edge cases. In this case, that would be
At that point we can feel safe that if everything outside the sweeper is skookum, so shall the sweeper behaviour be. |
@al-niessner reviewing the codebase and test-supporting code, the primary purpose of the Db mock and functional tests was to confirm correctness for ancestry, because of its relatively complex behavioural specification. Since that is not at all the case here, your unit tests are close-enough to good-enough. There are some necessary changes to make before merge - currently |
It is git and github. Make the changes you need/want right here. I like bulk too, but with scripts like these I would go more inefficient rather than an miles of runtime lost due to tripping inches before the finish line. If you do each document and something breaks before the finish line, then you rerun the script it has to do less work on the next iteration. Experience tells me that live data, like what in our dbs, will surprise you constantly; meaning you will crash for the unknown unknowns more often than not. Also, running it now will alleviate the broken documents pressure. You can then add efficiency while it is running making the whole process even more efficient. |
This is amenable to broad try-catch-log-continue, but I agree with the rest of the reasoning.. Approving now. |
@al-niessner just realised I forgot
|
Yes. At least in the interim until we can verify everyone has upgraded to the new tools. It may only need to run once...
+1 @al-niessner can you run it and show the output? I would also like to see the log outputs. |
Requested run against test DB:
|
@al-niessner did you check the db state after to confirm that the expected changes were made? EDIT I'd expect to see logs from both the repairkit and the write calls themselves, neither of which are present in those logs. |
Building off of NASA-PDS/registry-api#368
Then after repairkit:
Output from repairkit:
|
Quality noice 👍 👍 EDIT @al-niessner don't forget to add it to sweepers_driver.py prior to merge. @jordanpadams do you want @sjoshi-jpl and I to deploy this version to delta->prod once merged? |
Hi @al-niessner , I think you still need to add the repair kit to the sweepers_driver there registry-sweepers/docker/sweepers_driver.py Line 111 in 1a92b53 Thanks, Thomas |
as requested
🗒️ Summary
Script that looks for current problems from historical harvest runs.
⚙️ Test Data and/or Report
None
♻️ Related Issues
Closes NASA-PDS/registry-api#349