Saving processing parameters & Nexus output, etc #318
This is a big macro-task, so here's the list of individual tasks as I see it currently:
Additionally:
Thinking on this some more: At the moment, geometry and phase parameters are linked in a .par file.
I propose something like the following:
We could write a check in ... The phases colfile could be formatted like so:
I think this approach meets all our requirements. Consistent phase ID support is, I believe, required for me to merge my ... What do you think?
Sounds promising. A few quick thoughts:
- When we load an "old" parameters file, we should call "unitcell_from_parameters" on the parameters object, so that old files come with one phase (only) by default.
- Phases need a "name", for example "austenite", "L3", etc., rather than an "id". The integer id will annoy people: a phase map of integers is going to need a dictionary that gives the phase names back anyway, and if someone adds, removes or edits a phase while working, future they will not want to mess about renumbering.
- HDF5 does not seem to be the right solution to me? Something like a toml, json or ini file could be better.
- Ideally, we will want versioning for parameters, as you often want to compare the effect of changing something.
- It would be great to add a small database of common unit cells into ImageD11/data. At least silicon, CeO2, LaB6 and the common metals and rocks. If it is just unit cell + space group number, this can be quite compact (see the sketch just below).
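A compact unit-cell database could be as simple as a dict keyed by phase name. A minimal sketch (the lattice parameters and space group numbers here are approximate values, only for illustration, and the variable name is made up):

```python
# Hypothetical sketch of a small built-in unit-cell database for ImageD11/data.
# Each entry is (a, b, c, alpha, beta, gamma) in Angstrom/degrees plus a space
# group number; values shown are approximate and purely illustrative.
COMMON_CELLS = {
    "Si":   {"cell": [5.431, 5.431, 5.431, 90.0, 90.0, 90.0], "spacegroup": 227},
    "CeO2": {"cell": [5.411, 5.411, 5.411, 90.0, 90.0, 90.0], "spacegroup": 225},
    "LaB6": {"cell": [4.157, 4.157, 4.157, 90.0, 90.0, 90.0], "spacegroup": 221},
}
```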
100% agree, this will nicely act as a dict key
Let's go with json as it's built-in to Python.

```json
{
    "versions": {
        "v1": {
            "geometry": "geometry.par",
            "phases": "phases.col"
        },
        "v2": {
            "geometry": "geometry.par",
            "phases": "phases.col"
        }
    }
}
```

Or we could remove the phases colfile altogether and use the JSON directly (I prefer this actually).

```json
{
    "versions": {
        "v1": {
            "geometry": "geometry.par",
            "phases": {
                "ferrite": "ferrite.par",
                "austenite": "austenite.par"
            }
        },
        "v2": {
            "geometry": "geometry.par",
            "phases": {
                "ferrite": "ferrite.par",
                "austenite": "austenite.par",
                "epsilon": "epsilon.par"
            }
        }
    }
}
```
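As a rough illustration, reading such a file could look something like this (a minimal sketch against the draft schema above; the function name is hypothetical, and each per-phase .par file would still be loaded with the existing parameter reader):

```python
import json

# Sketch only: read the proposed "versions" JSON and return the geometry file
# and the per-phase .par files for one version. Not an existing ImageD11 function.
def load_versioned_pars(path, version="v1"):
    with open(path, "r") as f:
        doc = json.load(f)
    entry = doc["versions"][version]
    geometry_file = entry["geometry"]   # e.g. "geometry.par"
    phase_files = entry["phases"]       # e.g. {"ferrite": "ferrite.par", ...}
    return geometry_file, phase_files
```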
Great idea! They'll be tiny files anyway.
For versioning, it might be a separate issue. I think we need something like an md5 in there to validate the version of a parameter file (does this match what is on disk?). Then just save a backup when we save a new version. In VMS (or using GSAS) you get files with names like geometry.par_00, geometry.par_01, etc.
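A minimal sketch of that backup-on-save idea (the function name is made up; this is not ImageD11 behaviour, just an illustration of the numbered-backup pattern):

```python
import os
import shutil

# Sketch of the VMS/GSAS-style backups mentioned above: before overwriting a
# parameter file, copy the current version to name_00, name_01, ...
def backup_then_write(path, text):
    if os.path.exists(path):
        n = 0
        while os.path.exists("%s_%02d" % (path, n)):
            n += 1
        shutil.copy2(path, "%s_%02d" % (path, n))
    with open(path, "w") as f:
        f.write(text)
```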
There is some related code in the old "project" folder: It was expecting diffractometer geometry to come as well.
Something like this?

```json
{
    "geometry": "geometry.par",
    "phases": {
        "ferrite": "ferrite.par",
        "austenite": "austenite.par"
    },
    "phase_hashes": {
        "ferrite": "MD5HASH1",
        "austenite": "MD5HASH2"
    }
}
```
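For the "does this match what is on disk?" question, a sketch of the hash check against that draft schema could be as simple as this (illustrative only; these are not existing ImageD11 functions):

```python
import hashlib
import json

# md5 of a file's bytes, for comparing against the hash recorded in the JSON.
def file_md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Check each phase .par file on disk against its recorded hash (draft schema above).
def check_phase_hashes(json_path):
    with open(json_path, "r") as f:
        doc = json.load(f)
    for name, parfile in doc["phases"].items():
        ok = file_md5(parfile) == doc["phase_hashes"][name]
        print(name, "OK" if ok else "MODIFIED on disk")
```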
A few updates on this. From my perspective we have sorted the requirements for how the multiphase files live on disk.

```json
{
    "geometry": {
        "file": "geometry.par",
        "hash": "MD5HASHGEO"
    },
    "phases": {
        "ferrite": {
            "file": "ferrite.par",
            "hash": "MD5HASH1"
        },
        "austenite": {
            "file": "austenite.par",
            "hash": "MD5HASH2"
        }
    }
}
```

I've now focused on what our in-code requirements are, which I have outlined:

Therefore, I have arranged things in the following way: Added a new class to ...
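As a purely hypothetical sketch of the kind of container this could mean (the class and attribute names here are illustrative, not the actual class that was added), it would mirror the JSON layout above: one geometry file plus a dict of named phases.

```python
import json

# Hypothetical container: one geometry parameter file and a dict of named phase
# files, mirroring the on-disk JSON schema drafted above. Names are illustrative.
class MultiPhaseParameters:
    def __init__(self, geometry_file, phase_files):
        self.geometry_file = geometry_file     # path to the geometry .par file
        self.phase_files = dict(phase_files)   # {"ferrite": "ferrite.par", ...}

    @classmethod
    def from_json(cls, path):
        with open(path, "r") as f:
            doc = json.load(f)
        geometry = doc["geometry"]["file"]
        phases = {name: entry["file"] for name, entry in doc["phases"].items()}
        return cls(geometry, phases)
```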
Sorry, I'm a bit out of the loop for this week. In general, I would prefer to avoid touching xfab, because it means both packages need to upgrade together. Having something in ImageD11 only is just easier while developing. For now I'm wondering where/how these JSON files will live on disk. It seems a good fit for holding a whole analysis project, and also as the drivers for an ewoks pipeline. Probably becomes clear with some examples? We might use the same phases and parameters for processing a series of different samples...
In order to import the new json transparently when calling ...

I worry that extending the capabilities of the json too far at this stage will slow this development down, and I have future code (TensorMap) waiting to be merged that needs well-established phases to work well. If you can write a schema/example of a json file that is expandable in the way you want (so we can add features to it in the future), I can modify the current json parsing accordingly?
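For example, an expandable layout might simply leave room for extra top-level sections alongside geometry and phases (the "processing" block and its contents here are assumptions for illustration, not part of the current schema):

```json
{
    "geometry": {
        "file": "geometry.par",
        "hash": "MD5HASHGEO"
    },
    "phases": {
        "ferrite": {
            "file": "ferrite.par",
            "hash": "MD5HASH1"
        }
    },
    "processing": {
        "peaksearch": {
            "threshold": 100
        }
    }
}
```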
Yes please! It simplifies the install from git script and debugging broken environments (only one thing to add to sys.path). For now, it just means copying the parameters.py file back into ImageD11 and/or monkeypatching inside ImageD11.parameters. In the longer term, we can merge the changes back into xfab if that makes sense.

Another random comment: I am not keen on "Json" in a class name. The class/object can be serialised into another object storage format, so the class/object (dict of phases?) does not depend on the choice of format. We should be able to load/save from different formats and get the same result back. Try to merge something and we can go on from there...
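A small sketch of that format-agnostic idea (function names made up for illustration): the in-memory object is just a dict of phases, and serialisation lives in separate helpers, so another format (toml, ini, ...) could be added later without renaming anything.

```python
import json

# The phases dict itself knows nothing about JSON; only these helpers do.
def save_phases(phases, path):
    if path.endswith(".json"):
        with open(path, "w") as f:
            json.dump(phases, f, indent=2)
    else:
        raise ValueError("unsupported format: " + path)

def load_phases(path):
    if path.endswith(".json"):
        with open(path, "r") as f:
            return json.load(f)
    raise ValueError("unsupported format: " + path)
```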
Significant progress made in #322 for this: the new schema can be modified or extended to contain more processing parameters (right now it's just geometry and phases...)
As discussed with @jadball, it would be good to have a way to save the data processing steps done together with the output. Also to clean up the command line arguments (etc).
One idea was:
Historically, peaksearch.py was supposed to print the options used on stdout. Ideally, you would want that to be a "reproducible" block of code for re-running the job. The old Tk gui had something for this in help->history. For most of our jobs there is:
The main problem to "fix" is making a convenient way to compare processing runs with different sets of parameters. But without needing to repeat steps that are already done.
The NXprocess has an NXnote, which seems flexible enough to hold just about anything. Within sinograms this could mean setting up the files as a 'chain' with the "sequence_index" (https://manual.nexusformat.org/classes/base_classes/NXprocess.html). They can also be entries (== hdf group) in one big file. For example:
When you load grains, the code would work back along the chain -> columnfile -> peakstable -> sparsefile -> dataset -> masterfile, etc.
This means changing a bunch of things. The dataset object could be simplified so that it does not need to handle every kind of output or processing that comes later. Each new output that comes up just has its own output file, and a pointer to locate the input data that was used to create it.
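A rough sketch of what one link in such a chain could look like with h5py (the field names follow the NXprocess/NXnote base classes linked above; the group names, file names and values are made up for illustration):

```python
import h5py

# Record one processing step as an NXprocess group, with a sequence_index for the
# chain, an NXnote holding the options used, and a pointer back to the input file.
with h5py.File("peaks_table.h5", "a") as f:
    grp = f.require_group("entry/peaksearch")
    grp.attrs["NX_class"] = "NXprocess"
    grp["program"] = "ImageD11 peaksearch"
    grp["sequence_index"] = 2                  # position in the processing chain
    note = grp.require_group("parameters")
    note.attrs["NX_class"] = "NXnote"
    note["type"] = "application/json"
    note["data"] = '{"threshold": 100}'        # illustrative options blob
    grp["input_data"] = "sparse_pixels.h5"     # pointer to the step before this one
```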