Specify an input glob pattern #7

Closed
ccunningham101 opened this issue Mar 23, 2022 · 4 comments · Fixed by #16

Comments

@ccunningham101

Currently, if you produce measures files in different subdirectories within the same project, for example:

output/subdir1/measures_*.csv
output/subdir2/measures_*.csv

you need to specify the action twice, because --input-dir neither resolves wildcards nor recurses into subdirectories.

Do we think the action should be specified once for each subdirectory, or should the action have the ability to look across multiple subdirectories?

@iaindillingham
Member

Indeed, we need to specify the action twice! But we can change that. Before opening Vim and making it so, it would be good to think about the cases where we'd have measure files in different subdirectories. What are the cases where we'd want to create deciles charts with the same configuration [1]:

  • for all measure files in all subdirectories?
  • for some, but not all, measure files? These could be in the same subdirectory or in different subdirectories, but not necessarily in all subdirectories.

Have you encountered these cases in the wild? Or can you imagine encountering them there?

Footnotes

  1. At present, the only configuration is the --output-dir argument. However, we may wish to introduce more configuration in the future.

@ccunningham101
Author

My current Depression QOF project has two study populations and two study definitions: the entire population, and those with learning disabilities and autism. I currently keep the outputs separate, in /output/qof and /output/lda.
Arguably these could/should be different repositories with different project YAML files, but they have a number of shared variables, so I started the project with them in the same repository.

In my case, we would like decile charts for any variable that was grouped by practice (but not for the demographic subgroups) in both subdirectories.
If we had good automatic extraction of decile chart titles, then there would be less need to manually specify the configuration for each decile chart.

iaindillingham changed the title from "Should one action map to one subdirectory?" to "Specify an input glob pattern" on Apr 5, 2022
@iaindillingham
Member

I've renamed this issue to "Specify an input glob pattern", following #4. However, I haven't addressed the question: "Should one action map to one subdirectory?" 🙂

I think that one invocation of deciles-charts should map to one input glob pattern and one output subdirectory. I think that the input glob pattern shouldn't recurse. Why?

One input glob pattern and one output subdirectory allows deciles-charts to read a subset of the measure tables and write the corresponding deciles charts to the output subdirectory. This is an improvement over current behaviour, where deciles-charts reads every measure table in the input directory: if you want some, but not all, deciles charts with current behaviour, then too bad!
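
To make the "subset" point concrete, here is a minimal Python sketch of a non-recursive glob pattern picking out only the practice-level measure tables in one subdirectory. The file names and the helper `select_measure_tables` are hypothetical, made up for illustration; this is not how deciles-charts itself is implemented.

```python
import glob
import tempfile
from pathlib import Path


def select_measure_tables(pattern: str) -> list[str]:
    """Return the sorted paths matching a non-recursive glob pattern (hypothetical helper)."""
    # With the default recursive=False, "*" does not match across directory separators.
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical measure tables, loosely modelled on the layout discussed in this thread.
    for rel in [
        "output/qof/measure_qof_practice_rate.csv",
        "output/qof/measure_qof_age_rate.csv",
        "output/lda/measure_qof_practice_rate.csv",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    # One pattern, one subdirectory: only practice-level tables under output/qof match.
    matched = select_measure_tables(str(root / "output/qof/measure_*practice*.csv"))
    names = [Path(p).relative_to(root).as_posix() for p in matched]
    print(names)
```

The table under output/lda and the age-level table under output/qof are both left out, which is the "some, but not all" case that a directory argument cannot express.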

However, if the input glob pattern recursed, then we'd need to consider what we wrote to the output subdirectory. Using antidepressant-prescribing-lda as an example:

>>> import glob
>>> glob.glob("output/**/measure_*practice*.csv", recursive=True)
['output/lda/joined/measure_new_antidepressant_tricyclic_practice_rate.csv',
 'output/lda/joined/measure_new_antidepressant_ssri_practice_rate.csv',
 'output/lda/joined/measure_antidepressant_other_practice_rate.csv',
 'output/lda/joined/measure_antidepressant_maoi_practice_rate.csv',
 'output/lda/joined/measure_depression_practice_rate.csv',
 'output/lda/joined/measure_antidepressant_ssri_practice_rate.csv',
 'output/lda/joined/measure_new_depression_practice_rate.csv',
 'output/lda/joined/measure_qof_practice_rate.csv',
 'output/lda/joined/measure_new_antidepressant_maoi_practice_rate.csv',
 'output/lda/joined/measure_new_antidepressant_any_practice_rate.csv',
 'output/lda/joined/measure_antidepressant_tricyclic_practice_rate.csv',
 'output/lda/joined/measure_antidepressant_any_practice_rate.csv',
 'output/lda/joined/measure_new_antidepressant_other_practice_rate.csv',
 'output/qof/joined/measure_qof_practice_rate.csv']

We'd expect a deciles chart for each of the above measure tables. Would we expect them to be written as siblings in the output subdirectory? Or would we expect subdirectories within the output subdirectory? If the latter, then how would we handle collisions? Determining the subdirectories within the output subdirectory and handling collisions means writing more code and makes the behaviour less clear to the user. For these reasons, I think that the input glob pattern shouldn't recurse.
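
The collision concern can be seen in the listing above: measure_qof_practice_rate.csv appears under both output/lda/joined and output/qof/joined. The sketch below, in Python, groups a few of those paths by file name. It assumes, purely for illustration, that each chart would be named after its input file's base name; the thread does not say how deciles-charts names its outputs, and `find_collisions` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Three of the paths returned by the recursive glob in the listing above.
matched = [
    "output/lda/joined/measure_depression_practice_rate.csv",
    "output/lda/joined/measure_qof_practice_rate.csv",
    "output/qof/joined/measure_qof_practice_rate.csv",
]


def find_collisions(paths):
    """Map each base name shared by more than one input path to those paths."""
    by_name = defaultdict(list)
    for path in paths:
        # Assumption: outputs written as siblings would be keyed by base name only.
        by_name[PurePosixPath(path).name].append(path)
    return {name: sources for name, sources in by_name.items() if len(sources) > 1}


collisions = find_collisions(matched)
for name, sources in collisions.items():
    print(f"{name} comes from {len(sources)} inputs: {sources}")
```

Under that naming assumption, writing every chart into a single output subdirectory would leave two inputs competing for one output name, so one chart would overwrite the other unless extra disambiguation logic were added.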

@ccunningham101
Author

Agreed!

iaindillingham added a commit that referenced this issue Apr 6, 2022
This replaces the `--input-dir` argument with the `--input-files`
argument. Whereas the former accepts a path to a directory, the latter
accepts a glob pattern.

This commit addresses the substantive issue, but some tidying up would
be worthwhile.

Closes #7
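
For readers who want a feel for the change described in the commit message, here is a rough Python sketch of a command line that takes a glob pattern through `--input-files` and a directory through `--output-dir`. Only the two argument names come from this thread; the parser layout, the helper names `parse_args` and `expand_pattern`, and the example paths are assumptions, and the referenced commit may well do this differently.

```python
import argparse
import glob
from pathlib import Path


def parse_args(argv):
    """Parse a hypothetical deciles-charts style command line."""
    parser = argparse.ArgumentParser()
    # A glob pattern (not a directory), as described in the commit message.
    parser.add_argument("--input-files", required=True, help="glob pattern for measure tables")
    parser.add_argument("--output-dir", required=True, type=Path, help="where charts are written")
    return parser.parse_args(argv)


def expand_pattern(pattern):
    """Expand the pattern without recursion, returning sorted paths."""
    # recursive defaults to False, matching the decision reached in this thread.
    return sorted(Path(p) for p in glob.glob(pattern))


# Example paths are illustrative only.
args = parse_args(
    ["--input-files", "output/qof/measure_*practice*.csv", "--output-dir", "output/qof/charts"]
)
input_paths = expand_pattern(args.input_files)
print(args.input_files, args.output_dir, len(input_paths))
```

Keeping the pattern as a plain string and expanding it in a separate step makes it easy to report a clear error when nothing matches, though that check is left out of this sketch.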