
Implement checksums for CircleCI integration tests #59

Open
rmarkello opened this issue May 17, 2018 · 22 comments
Labels: enhancement (issues describing possible enhancements to the project) · testing (issues related to improving testing in the project)

@rmarkello (Member)

The current mechanism for integration testing with CircleCI is to run the full tedana pipeline and then compare the content of the output files to previously generated files using np.allclose. This procedure ensures that the results of the tedana pipeline haven't changed in any appreciable way, which is critical as refactoring and updating occurs.

However, I was recently inspired by @WhitakerLab's BrainNetworksInPython repository, which uses Python's hashlib library to implement checksum comparisons of its generated outputs. I think this procedure could be applied with great success to the tedana repository!

Rather than having CircleCI re-download the previously generated outputs from Dropbox each time, they could be hashed once and kept in a JSON file in the tests/data folder. The outputs generated during testing could then be hashed and compared against this reference. This would also make testing locally a lot easier (no need to download the 1.9GB output directory!).
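As a rough sketch of this idea (the helper names, the example output filename, and the tests/data/checksums.json path are hypothetical, not existing tedana code), the comparison might look like:

```python
import hashlib
import json
from pathlib import Path


def sha256sum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_outputs(output_dir, reference_json="tests/data/checksums.json"):
    """Compare checksums of generated outputs against a stored reference."""
    with open(reference_json) as fobj:
        reference = json.load(fobj)  # e.g., {"some_output.nii.gz": "ab12..."}
    mismatches = []
    for fname, expected in reference.items():
        observed = sha256sum(Path(output_dir) / fname)
        if observed != expected:
            mismatches.append(fname)
    return mismatches
```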

@KirstieJane (Member)

Eeeeeexcellent - I’m glad that’s helpful! I’m tagging @Islast because she did all that work so she could potentially pass on any advice etc ✨

@Islast commented May 18, 2018

I'm thrilled that you like our method @rmarkello 😄
My warning is that, because hashing doesn't allow you to use tolerances the way np.allclose does, you have to be very careful about random states (unless tedana is deterministic). We have a different hash for each environment we test in.
This is manageable, but it often feels like our continuous integration is more effective at alerting us to dependency updates interfering with the random state or the formatting of written outputs than it is at checking for behavioural changes in BrainNetworksInPython.

@rmarkello (Member Author)

This point is well taken, @Islast! Thank you so much for raising it (and thanks for pinging her in, @KirstieJane!).

While tedana is not deterministic, we do try to ensure consistent seeding for test purposes. However, you're completely right: given the sheer number of floating point operations that occur on the data in tedana, we might need numpy's ability to set tolerance thresholds!

I'm hopeful that with testing occurring inside a Docker container, things will be more-or-less consistent from run-to-run—enough so to at least give this a shot. Indeed, in some respects being made aware of dependency changes or alterations to output formats may be a feature of using the hashing approach, not a bug!

@pbellec commented May 19, 2018

Very cool. We have a similar testing strategy in NIAK and this idea will help a lot. BTW in our case, with proper seeding and a container we exactly replicate outputs.

@emdupre (Member) commented May 21, 2018

This is a great idea, @rmarkello !

I think I'd like to wait on implementing it until we resolve #14 (which is yielding issues in #18), since the ideal workflow will allow developers to test locally by just downloading a JSON file of the hash values, rather than the current 2GB of Dropbox data. This is definitely something for our 0.1.0 milestone, though!

emdupre added this to the 0.1.0 milestone on May 21, 2018
emdupre added the enhancement and testing labels on May 21, 2018
@tsalo (Member) commented Jan 11, 2019

Lately we've been making a bunch of breaking changes, so we've been fine with just checking for the existence of specific output files. However, once the final breaking PRs (#164 and #152) are merged in, I want to work on refactoring a lot of the more confusing code (e.g., fitmodels_direct and the component selection). I think we should try implementing this before that point, because I would like to make sure that the contents of the files are correct after cleaning up the code.

@jbteves (Collaborator) commented Apr 20, 2019

@tsalo @KirstieJane is this formally going to be part of the GSoC student's work if accepted?

@tsalo (Member) commented Apr 22, 2019

If they're able to do it, then definitely. It will have to come after unit tests though (since they're the low-hanging fruit).

@jbteves (Collaborator) commented Apr 22, 2019

Oh, no, totally, unit testing is higher priority. I'm just trying to make sure that there's a path forward for this issue. It seems like we should just see if the student can get to it this summer, and if not then we'll have to revisit it ourselves. Seem okay?

@effigies (Contributor)

Just a quick note that recent nibabel now has a nib-diff CLI tool that lets you set np.allclose-style tolerances on neuroimages, in case you do need something looser than a checksum.

@jbteves (Collaborator) commented Oct 29, 2019

Added to the sprint agenda.

@emdupre (Member) commented Nov 8, 2019

Is this still being considered, @rmarkello? I didn't see it in #418?

@rmarkello (Member Author)

Still being considered, for sure! We're going to need to add regression tests to actually compare the outputs of the integration tests to pre-generated data, but it doesn't (didn't) make sense to do that when so many PRs were going to totally mess with the actual data generation! All our tests would consistently be failing 😢

Once #435, #436, and @handwerkerd's eventual PR to overhaul the decision tree are integrated we can do this. In the interim I can open a PR with an example of what this would look like, though?

@stale (bot) commented Feb 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana! 🎉

stale bot added the stale label on Feb 8, 2020
@jbteves (Collaborator) commented Feb 13, 2020

@rmarkello do you still think this is workable?

stale bot removed the stale label on Feb 13, 2020
@rmarkello (Member Author)

I do, especially if we use nib-diff rather than checksums. That said, I've been totally delinquent on tedana, so I'm not sure about the status of some of the more major changes (e.g., the decision tree overhaul). I think it would probably be best to hold off on implementing these until then, because we know those are going to be breaking changes. But if others want to see these done sooner, that's fine; we'll just have to acknowledge those PRs will fail all the tests until we update the data to compare against!

@jbteves (Collaborator) commented Feb 14, 2020

Forgive me if this is incorrect, but doesn't nib-diff require that both files be in memory? In other words, don't you still have to download or upload a large amount of data to use it?

@rmarkello (Member Author)

You're totally right, they would both have to be in memory. I think checksums are just gonna be a real pain with floating point data, though, so I think we need to do something that's a bit more np.allclose-like. Since the test data are all down-sampled pretty aggressively, do you think it would be a problem to load two images in memory at a time? My thought is that it would never be more than two (the new image and the comparison / baseline).

As it stands, we are (or used to be? at one point we definitely were) downloading the comparison output data on CircleCI every time (and, if I remember correctly, just not using it)!

@jbteves (Collaborator) commented Feb 14, 2020

We're definitely uploading comparison output to CircleCI every time, and not using it. I thought the point of the checksums was to reduce the download/upload need, but I forgot about the upload of artifacts to CI rather than just checksums.
I don't think there's a memory constraint problem, especially with the truncated data. You would have to go pretty extreme to get a memory issue.
I wonder, then, about simply running np.allclose on the floating-point data. You could add this to the integration test file pretty straightforwardly, correct? What's the advantage of a checksum over that strategy?
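For illustration, a minimal np.allclose-style check on a single pair of images could look like the following sketch (the file paths are placeholders and the tolerances arbitrary, not values from the tedana test suite):

```python
import nibabel as nib
import numpy as np

# Hypothetical paths: the newly generated output and the stored baseline.
test_img = nib.load("outputs/some_output.nii.gz")
ref_img = nib.load("baseline/some_output.nii.gz")

# Compare voxel data with tolerances instead of requiring an exact (hashed) match.
assert np.allclose(
    test_img.get_fdata(), ref_img.get_fdata(), rtol=1e-5, atol=1e-8
)
```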

@rmarkello (Member Author)

At this point I'm definitely not recommending checksums! The benefit of nib-diff over a straight np.allclose is that the former also checks things like the file header, etc., to make sure you're not altering other things besides the actual data content.

Finally, it's worth noting we don't have to call nib-diff from the command line—we could just import the relevant functions (nibabel.cmdline.diff.diff) and use those in the integration tests!
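A sketch of that programmatic route, assuming nibabel.cmdline.diff.diff accepts a list of file paths and returns a mapping of differing fields (empty when the images match); the exact signature, return format, and any keyword arguments for np.allclose-style data tolerances should be verified against the installed nibabel version:

```python
from nibabel.cmdline.diff import diff

# Hypothetical file names; in the integration test these would be the
# freshly generated output and the stored baseline image.
differences = diff(
    ["outputs/some_output.nii.gz", "baseline/some_output.nii.gz"]
)

# Assumption: an empty result means the headers and data were judged
# identical; otherwise the mapping lists each differing field.
assert not differences, f"Images differ in: {list(differences)}"
```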

@jbteves (Collaborator) commented Feb 14, 2020

I'm following this now. I think we'd discussed it briefly at the hackathon, but unfortunately my memory is trash and I didn't write it down, so we're having this conversation a second time. 🤦‍♂️
Okay, awesome, thanks @rmarkello . Other stuff is higher priority but I think we have a pretty good way forward for this.

@tsalo (Member) commented Aug 6, 2020

It seems like we've converged on nib-diff over other possible checks. I noted that in #452. Once we have tests working for T2* estimation and optimal combination, I think we should at least look at doing it for the full integration tests.

stale bot removed the stale label on Aug 6, 2020
stale bot added the stale label on Nov 5, 2020
tsalo removed the stale label on Feb 5, 2021
ME-ICA deleted a comment from stale bot on Nov 19, 2022
ME-ICA deleted a comment from stale bot on Nov 19, 2022