
Better understanding the dataflow within a pipeline #4119

Open
iaindillingham opened this issue Feb 13, 2024 · 7 comments

Comments

@iaindillingham
Member

In a recent Bristol-Cambridge-Oxford meeting, a researcher said that it was hard to determine when an action was last run, especially when a pipeline contains many actions. The researcher said this was important because, without that information, it is hard to tell whether a downstream action, let's call it B, needs to be rerun after an upstream action, let's call it A, has been rerun. In other words, it wasn't clear whether the timestamp of A was after the timestamp of B, in which case the outputs from B may not reflect the latest outputs from A.

I've summarized the issue as "Better understanding the dataflow within a pipeline", but I should emphasize that what's hard to determine is when the dataflow is invalid with respect to the dependency graph represented by the pipeline.
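To make "invalid with respect to the dependency graph" concrete, here is a minimal sketch of the check a user currently has to do by hand. The `NEEDS` mapping, the timestamps, and the `stale_edges` function are all hypothetical illustrations, not part of job-server or job-runner; the sketch assumes we already know each action's upstream actions and its last successful completion time.

```python
from datetime import datetime

# Hypothetical pipeline: each action maps to the upstream actions it needs.
NEEDS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
}

# Hypothetical "last successfully completed" timestamps for each action.
LAST_COMPLETED = {
    "A": datetime(2024, 2, 12, 9, 0),
    "B": datetime(2024, 2, 10, 14, 30),
    "C": datetime(2024, 2, 11, 8, 15),
}


def stale_edges(needs, last_completed):
    """Return (upstream, downstream) pairs where upstream finished later."""
    stale = []
    for downstream, upstreams in needs.items():
        for upstream in upstreams:
            up = last_completed.get(upstream)
            down = last_completed.get(downstream)
            if up is None or down is None:
                # One side has never completed, so there is nothing to compare.
                continue
            if up > down:
                stale.append((upstream, downstream))
    return sorted(stale)


print(stale_edges(NEEDS, LAST_COMPLETED))
```

With these made-up timestamps, only the edge from A to B is reported. Note that this sketch compares direct edges only: C is not flagged, even though it was produced from B's outputs before B was brought up to date, so a fuller check would probably need to propagate staleness downstream.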

This discussion also surfaced a related issue: the researcher asked about the difference between an action and a job. (Conceptually, an action is the class; a job is the instance of the class.)

@Jongmassey
Contributor

> Conceptually, an action is the class; a job is the instance of the class.

Until fairly recently, I was under the impression that a job is a collection of actions that have been run, but I now know that this is a JobRequest.
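For anyone else who finds the three terms easy to confuse, the distinction as described in this thread could be sketched roughly as follows. These dataclasses are illustrative assumptions only: the field names and status values are made up, and they are not the actual job-server models.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Action:
    """A named step in the pipeline (the "class" in the analogy)."""

    name: str
    needs: tuple = ()


@dataclass
class Job:
    """One run of an action (the "instance" in the analogy)."""

    action: Action
    status: str  # hypothetical values, e.g. "succeeded" or "failed"
    completed_at: Optional[datetime] = None


@dataclass
class JobRequest:
    """One request to run some actions; it groups the jobs that result."""

    requested_actions: list
    jobs: list = field(default_factory=list)


a = Action("A")
b = Action("B", needs=("A",))

# A first request runs both actions, producing one job per action.
request = JobRequest(requested_actions=[a, b])
request.jobs.append(Job(a, "succeeded", datetime(2024, 2, 12, 9, 0)))
request.jobs.append(Job(b, "failed"))

# Rerunning B later creates a new job, in a new request, for the same action.
rerun = JobRequest(
    requested_actions=[b],
    jobs=[Job(b, "succeeded", datetime(2024, 2, 13, 10, 0))],
)
```

In this sketch, action B has two jobs across two job requests, which is why "when was B last run?" is a question about jobs rather than about the action itself.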

@iaindillingham
Member Author

@benbc also pointed out that this issue is related to opensafely-core/job-runner#196.

@lucyb
Contributor

lucyb commented Feb 29, 2024

This could potentially be a fairly big piece of work. Is there something smaller we can do to help the user in question for now?

@iaindillingham
Member Author

I think a candidate solution would be to display a table on the workspace detail page that contained the set of jobs associated with the workspace in one column, and their last successfully completed time in another column. This table should be sortable by either column. Sorting by last successfully completed time would help the user determine whether action B successfully completed after action A, meaning the dataflow matched the dependency graph, or whether action A successfully completed after action B, meaning it didn't.
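As a rough sketch of the data behind such a table, assuming job records are available as simple `(action, status, completed_at)` tuples (a made-up shape for illustration, with made-up status values), the rows and the two sort orders might look like this:

```python
from datetime import datetime
from operator import itemgetter

# Hypothetical job records: (action name, status, completed_at).
JOBS = [
    ("A", "succeeded", datetime(2024, 2, 10, 9, 0)),
    ("B", "succeeded", datetime(2024, 2, 10, 14, 30)),
    ("A", "succeeded", datetime(2024, 2, 12, 9, 0)),
    ("B", "failed", datetime(2024, 2, 12, 11, 0)),
]


def last_success_table(jobs):
    """Return one (action, last successful completion time) row per action."""
    latest = {}
    for action, status, completed_at in jobs:
        if status != "succeeded":
            continue  # Failed jobs do not count as a completed run.
        if action not in latest or completed_at > latest[action]:
            latest[action] = completed_at
    return list(latest.items())


rows = last_success_table(JOBS)

# The two sort orders the table would offer.
by_action = sorted(rows, key=itemgetter(0))
by_time = sorted(rows, key=itemgetter(1), reverse=True)

for action, completed_at in by_time:
    print(action, completed_at.isoformat())
```

Sorted by time, most recent first, A appears above B in this example: A last succeeded after B did, so if B depends on A, B's outputs may be out of date. B's later failed run is deliberately ignored.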

From a user's perspective, this solution would involve them remembering an edge in the dependency graph -- the edge between vertices A and B in the example -- relating this to the table, and then inferring the dataflow. That seems reasonable for small dependency graphs, but the Bristol-Cambridge-Oxford group are not known for their small dependency graphs 🙂.

From a technical perspective, this solution would involve joining Workspace to JobRequest to Job, which may be expensive. It would also involve displaying all actions associated with the workspace, including those actions that no longer exist (i.e. have been removed from project.yaml), which may be confusing. (Whilst it could involve displaying some actions, it's not clear which.)
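To give a feel for the join, here is a self-contained sketch using an in-memory SQLite database. The table names, column names, and status values are simplified assumptions for illustration and do not reflect the real job-server schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, heavily simplified schema and sample data.
conn.executescript(
    """
    CREATE TABLE workspace (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE job_request (
        id INTEGER PRIMARY KEY,
        workspace_id INTEGER REFERENCES workspace(id)
    );
    CREATE TABLE job (
        id INTEGER PRIMARY KEY,
        job_request_id INTEGER REFERENCES job_request(id),
        action TEXT,
        status TEXT,
        completed_at TEXT
    );
    INSERT INTO workspace VALUES (1, 'example-workspace');
    INSERT INTO job_request VALUES (1, 1), (2, 1);
    INSERT INTO job VALUES
        (1, 1, 'A', 'succeeded', '2024-02-10T09:00:00'),
        (2, 1, 'B', 'succeeded', '2024-02-10T14:30:00'),
        (3, 2, 'A', 'succeeded', '2024-02-12T09:00:00'),
        (4, 2, 'B', 'failed',    '2024-02-12T11:00:00');
    """
)

# Workspace -> JobRequest -> Job, keeping the latest success per action.
QUERY = """
    SELECT job.action, MAX(job.completed_at) AS last_succeeded_at
    FROM workspace
    JOIN job_request ON job_request.workspace_id = workspace.id
    JOIN job ON job.job_request_id = job_request.id
    WHERE workspace.id = ? AND job.status = 'succeeded'
    GROUP BY job.action
    ORDER BY last_succeeded_at DESC
"""

rows = conn.execute(QUERY, (1,)).fetchall()
for action, last_succeeded_at in rows:
    print(action, last_succeeded_at)
```

Because the query groups by whatever action names appear in the job rows, it returns every action that has ever succeeded in the workspace, including any that have since been removed from project.yaml. Filtering those out would need the current pipeline definition, which this sketch does not have.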

This solution may require special handling of the run_all job. This job exists in the DB, but I'm unclear whether each instance was expanded into its constituent jobs. I assume so, but it would be good to check.

This solution would require a sortable table, but a sortable table already exists as a UI component.

Ultimately, this solution isn't small (although it isn't large, either!) and it should involve user-testing; it would be sensible to create a mock-up, before working on the implementation, to facilitate this. However, this solution could be a first step to actually visualizing the dependency graph and the dataflow, which may satisfy opensafely-actions/.github#7.

@iaindillingham
Member Author

@lucyb and I discussed this issue on Slack [1] and agreed to move it to Later. Although there are several candidate solutions, it's clear we need to know more about the problem to be able to choose a candidate solution with confidence. Indeed, it may be that the separation of actions, jobs, and job requests needs rethinking.

Footnotes

  1. https://bennettoxford.slack.com/archives/C069SADHP1Q/p1709309916338619

@LFISHER7
Contributor

LFISHER7 commented Mar 4, 2024

For when you think about this again, #3566 highlights a slightly different (and I think more common) use case for surfacing action-specific logs.

@iaindillingham
Member Author

That's really useful, thanks @LFISHER7.

4 participants