Critical bugs with reporting results #209

nsheff · 2024-02-17T12:40:20Z

I just found 2 critical bugs with pypiper result reporting. Actually it may just be 2 outcomes from the same critical bug.

When you report a result it doesn't overwrite the result if there's already one there. This means there's no way to update something if an incorrect result was entered. In the past, results would be reported as many times as you ran the pipeline. All of these results are recorded. Now, the pipeline only reports the first result. If you re-report that result, it's simply lost.

When I'm re-running a pipeline to correct something, I get these:

These results exist for 'demo4': seqcol_digest
These results exist for 'demo4': Time
These results exist for 'demo4': Success

And they're not updated. So, that's the first problem: reporting of results is limited to a single value now, and multiple reports is not recorded.

The second issue is related to it:

The result is not only not reported to the results.yaml file, it's also not reported to screen! So it doesn't even show up in the log file anymore. Results should be reported to screen every time a result report is requested.

Solutions

The second issue should be easy to solve, just make sure pypiper is always printing the result. I'm not sure if this should happen in pypiper or in pipestat, but somewhere it has to happen.

The first problem is harder to solve, because it's a consequence of the switch from a simple tab-delimited stats.tsv file pypiper used to use, to the pipestat file. In pipestat, results are recorded per sample, not per pipeline run, which means there's going to be issues when you re-run a sample. The best I can think of is this:

In the short term, at the very least, the last reported result should be the one used, not the first reported result, as is currently happening. I guess this is as simple as forcing updates whenever you report an existing result.
In the longer term, pipestat should offer the option to include a history of results, and these should be stored somehow in the file (and database). This may not actually be too hard to implement; just add a 'history' function, and when something is overwritten, just move the old values into the history in a way that is an array, rather than a single value. Then, pipestat could offer a clear history function to remove old stuff, if desired, but otherwise, repeated reports of the same result will simply add to the history.

The text was updated successfully, but these errors were encountered:

donaldcampbelljr · 2024-02-19T14:57:57Z

I believe the first issue is related to #201
I solved it by adding a force_overwrite flag to both report_result and report_object

donaldcampbelljr · 2024-04-04T18:22:21Z

I switched force_overwrite to default to True and confirmed that it will overwrite the results and these interactions are saved in log.md and only the most recent is in the results.yaml file (as expected).

import pypiper

pm = pypiper.PipelineManager(name="TEST_OVERWRITING",outfolder="/home/drc/PythonProjects/pipestat/opt_dependencies",
                             pipestat_record_identifier="pypiperRecordIdentifier1", pipestat_schema="sample_output_schema.yaml", 
                             pipestat_pipeline_type="sample", pipestat_results_file="results.yaml")
print("pm made")
pm.report_result(key="number_of_things", value=100)
print("\ndone reporting a result 1st time")
pm.report_result(key="number_of_things", value=200)
print("\ndone reporting a result 2nd time")
pm.report_result(key="number_of_things", value=300)
print("\ndone reporting a result 3rd time")
pm.complete()

Log excerpt:

pm made

> `number_of_things`	100	_RES_

> `pipestat_created_time`	2024-04-04 14:16:54	_RES_

> `pipestat_modified_time`	2024-04-04 14:16:54	_RES_

done reporting a result 1st time
These results exist for 'pypiperRecordIdentifier1': number_of_things
Overwriting existing results: number_of_things

> `number_of_things`	200	_RES_

> `pipestat_modified_time`	2024-04-04 14:16:54	_RES_

done reporting a result 2nd time
These results exist for 'pypiperRecordIdentifier1': number_of_things
Overwriting existing results: number_of_things

> `number_of_things`	300	_RES_

> `pipestat_modified_time`	2024-04-04 14:16:54	_RES_

done reporting a result 3rd time

Results file:

test_pipe:
  project: {}
  sample:
    pypiperRecordIdentifier1:
      number_of_things: 300
      pipestat_created_time: '2024-04-04 14:16:54'
      pipestat_modified_time: '2024-04-04 14:16:54'

Next steps will be to add history enhancements to Pipestat via pepkit/pipestat#177

donaldcampbelljr · 2024-04-19T15:34:50Z

Closed with 0.14.1 release and pipestats 0.9.0 release.

nsheff added bug priority-high labels Feb 17, 2024

nsheff mentioned this issue Feb 17, 2024

Does looper destroy also clear pipestat results? pepkit/looper#469

Closed

nsheff mentioned this issue Feb 22, 2024

Should force_overwrite default to true? pepkit/pipestat#161

Closed

donaldcampbelljr added a commit that referenced this issue Apr 4, 2024

#209 Switch force_overwrite to default to True

04abc1f

donaldcampbelljr mentioned this issue Apr 19, 2024

v0.14.1 release #219

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Critical bugs with reporting results #209

Critical bugs with reporting results #209

nsheff commented Feb 17, 2024

donaldcampbelljr commented Feb 19, 2024

donaldcampbelljr commented Apr 4, 2024

donaldcampbelljr commented Apr 19, 2024

Critical bugs with reporting results #209

Critical bugs with reporting results #209

Comments

nsheff commented Feb 17, 2024

Solutions

donaldcampbelljr commented Feb 19, 2024

donaldcampbelljr commented Apr 4, 2024

donaldcampbelljr commented Apr 19, 2024