Add data flow diagram for various ETL steps in pipelines #4465
Conversation
Full-stack documentation: https://docs.openverse.org/_preview/4465. Please note that GitHub Pages takes a little time to deploy newly pushed code; if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
Excited to spend more time reviewing this! A tiny error detail I noticed to begin with: the API pulls from ES and does dead link filtering on the ES hits (meaning it only has ES's data) + Redis (which is both read from and written to during this process), not the API DB. The API only pulls the API DB results just before passing them to the serializer, after all of the ES-side filtering and processing.

The change to the graph would be to say that the API pulls from ES, filters and processes the ES results, and then only pulls the final presented subset from the API DB. There's a cyclical relationship between the API DB and ES in that one informs the way the other is used: deciding what data to load into ES to begin with, but then also what results to present from the API DB, with the API DB results being those that are presented, not the ES hits.

Additionally, it would be good to add the fact that Redis is both written to and read from during dead link filtering (which is basically itself an ETL process, I guess; we're at the very least generating data, if not extracting it, and we save the specific status code, not just a dead/not-dead boolean).

Really awesome, though, very excited to dig into this more, and to get a more cohesive view of how each part of our data lifecycle works with the other parts.
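For concreteness, here is a very rough sketch of that flow in Python. This is not the actual API code; the helper names, Redis key format, and index name are all hypothetical, and the placeholders stand in for the real ORM and serializer steps.

```python
# Hypothetical sketch of the flow described above, not the actual API code:
# query ES, filter dead links using the ES hits plus a Redis cache of status
# codes (reading and writing Redis along the way), then pull only the final,
# presented subset of records from the API DB and serialize those.
import requests
from elasticsearch import Elasticsearch
from redis import Redis

es = Elasticsearch("http://localhost:9200")
redis = Redis()


def link_status(url: str) -> int:
    """Return the cached HTTP status for a URL, fetching and caching it if missing."""
    cached = redis.get(f"status:{url}")
    if cached is not None:
        return int(cached)
    status = requests.head(url, timeout=2).status_code
    redis.set(f"status:{url}", status)  # the specific code is stored, not a dead/not-dead boolean
    return status


def fetch_from_api_db(identifiers: list[str]) -> list[dict]:
    # Placeholder for something like Image.objects.filter(identifier__in=identifiers)
    return []


def serialize(record: dict) -> dict:
    # Placeholder for the serializer step that shapes the API response
    return record


def search(query: str) -> list[dict]:
    hits = es.search(index="image", query={"match": {"title": query}})["hits"]["hits"]
    live = [h["_id"] for h in hits if link_status(h["_source"]["url"]) < 400]
    # The API DB is only read now, for the final presented subset
    return [serialize(record) for record in fetch_from_api_db(live)]
```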
Okay! This LGTM aside from the issue I commented on before (the slight inaccuracy of leaving Redis out of the API picture, and clarifying where results come from, plus adding serialization as a pre-client transformation step). Some of that is being a bit nit-picky, but I think it's worth especially clarifying what data is sent in API responses, including what transformations the API itself may do to them in serialization. No need to outline the specifics of those transformations here. As you've done with the other sections, just mentioning that it happens, with a link to the relevant code (the serializers), is more than enough and will prevent documentation drift 👍
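As a purely illustrative example of the kind of serializer-level transformation being referred to, assuming a DRF-style serializer (the class and field names here are hypothetical, not the actual Openverse serializer fields):

```python
# Illustrative only: a DRF-style serializer showing how the API itself can
# rename, derive, or rewrite fields just before records are sent to the client.
from rest_framework import serializers


class ExampleMediaSerializer(serializers.Serializer):
    id = serializers.UUIDField(source="identifier")
    thumbnail = serializers.SerializerMethodField()

    def get_thumbnail(self, obj) -> str:
        # Derive a proxied thumbnail URL rather than exposing the upstream URL directly.
        return f"/v1/images/{obj.identifier}/thumb/"
```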
With documentation drift in mind, though, I suppose we'll need to update this document once the three mentioned projects are complete!
Otherwise, this is great. It does not answer every question I have, but it's also purely descriptive, which I think is good for now. I personally would really like a prescriptive version of this document in the future, a "vision" of the data lifecycle, but I think that will come with time as we all discuss and refine what our goal is with the data lifecycle and each part of it.
Staci and I had an interesting conversation earlier this week about the difference it makes when we approach problems in Openverse (generally, not just with data eng. stuff) with a "keep the changes minimal" mindset. While that has some benefits (e.g., it makes some things go faster), it has the effect of us treating the way things are as the way they should be, and eliminating chances for us to talk about what we want (our vision) for these systems in concrete terms. That, in turn, stops us from making moves towards that vision. It also, I believe, introduces significant opportunities for miscommunication and misalignment.

It is probably the case that all of us have some ideas of how things should be (each of us to varying degrees for each area, depending on our areas of focus in Openverse), but because we very rarely have that clearly defined, we end up talking past each other, or getting confused about one choice or another. I often feel like we (meaning the whole team as individuals) end up ever so slightly (or sometimes even more) misaligned in our visions, leading to spinning wheels in some places. Here, for example, we see a convergence of three understandings of the intention of each piece of our data lifecycle that were planned independently of each other, with only really offhand mentions of the others in planning, rather than true integration. Ideally, these three projects would all work together to move our data engineering and data lifecycle towards our shared vision. But that presupposes the shared vision is defined, and indeed shared between us all!
Anyway, just musing on the fact that this document is a step in that direction, and I'm really glad for that. The conclusion Staci and I came to, with respect to how we've planned projects to aim for minimal changes, is that if we had a grander vision overall, we could make decisions in project planning that set us up for better situations in the future. Especially with the ingestion server removal project, we're pulling things more in line with treating Airflow as the coordinator of movement in our data infrastructure. Using the EC2 workers as workers is standard practice (I've learned this recently), but something like the ingestion server never was! It was doing Airflow's job, and way more tediously. Moving towards a more Airflow-y way of coordinating our Catalog -> API DB -> ES ETL was not an explicit intention (we instead wanted to simplify data refresh to make it easier to work on, without ever explicitly saying that the reason it would be simpler is that it would be more in line with how Airflow should be used to coordinate these things!), but it is a significant and meaningful realignment and readjustment of how the ETL pipelines are thought about. That's excellent, because it means we are making prescriptive changes towards a vision of "how things should be for Openverse", by saying "Airflow is the place for this".
That's powerful! I'm excited for the possibilities that brings up, especially with respect to other things we can use Airflow for aside from data engineering task management. My view of Airflow has personally been expanded by learning more about it and speaking with Staci this week.
TL;DR: This is great, looking forward to more discussions about these things, and I'm very excited to learn more about all of this before and through those discussions with y'all. Always exciting to see what we come up with when we give ourselves permission to think beyond the minimal constraints of a specific single need.
@AetherUnbound this is so awesome! The structure of the document itself is lovely, with the graphic linking off to the descriptive sections. ➕ to Sara's suggestions, but I'll approve now to streamline things.
A --> DL(Dead link filtering):::ETL
DL --> CL[Client]
DL --> CL[Client]
Obviously from a computing and data flow standpoint this name makes sense, but we don't refer to the API as a "client" anywhere else, really. Is there a way we could expand this (maybe a parenthetical including "Django API") so that others can "connect the dots" between this graphic and our stack as understood by developers?
Ah, see I meant for "Client" here to be someone using the API (which could be the Gutenberg integration, a WP plugin, etc). I'll try to clarify that!
I actually don't have the Django API on here anywhere, and I think I should! I'll expand on it a bit more 😄
It's interesting to hear you say this. I feel like I've been trying to communicate this point and push for exactly this way of thinking about Airflow for a while 🙂 It was part of the reason I wanted to see if we could use the …

What I'm hearing, though, is that it might have been useful to make that intention explicit in some form of policy or "vision" document. That might have helped make it clearer to communicate "this is what we want to use Airflow for", rather than me saying it over and over on the occasions where an opportunity to leverage it cropped up organically. That's helpful feedback, and hopefully that can also be part of the things we make explicit.
@AetherUnbound the changes look great 👍
I think the specific thing that didn't connect for me is that Airflow isn't even a data engineering tool, per se. It's just a task scheduler, and you can do anything with it. The data engineering happens with other tools, like Postgres or Python scripts. It didn't click for me until I read about data engineering tools generally and realised that Airflow doesn't have much to do with it aside from being a task coordinator.

With respect to the ingestion server removal, I did not understand the difference. I still don't think ECS is a good option (FARGATE's pricing structure is actively hostile to resource cost/utilisation optimisations and wildly expensive), but AWS Batch with spot instances would be a tremendous change and I'd support going in that direction.

Yeah, for whatever reason it never clicked that Airflow is just a task coordinator 😕. Sorry! I haven't been that involved in data things until the ingestion server removal project, so I'm not sure I encountered that idea before, and when reviewing that IP I did not understand the idea of removing the API from the indexer workers. We can still do that if we want, and swap out the ASG for AWS Batch, even without spot instances, and it would be an improvement and a move towards a more Airflow-y model for sure. I think I was also specifically hung up on ECS FARGATE, which as I said above I think is a specifically bad solution, but it's possible to structure the containerised workload such that the execution environment doesn't matter, so swapping ECS for Kubernetes or AWS Batch becomes more or less trivial.

I apologise for not understanding that; something just didn't click until I learned more about data engineering as a whole and Airflow in particular in the past two weeks. A synchronous conversation about these things would probably have helped a lot! I'm realising there are a lot more important concerns tied up between data engineering and infrastructure than I knew before, which is why I am spending a large amount of time outside of work learning about these things more, as well.
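As a concrete (and entirely hypothetical) illustration of the "Airflow is just a task coordinator" framing, assuming the Airflow 2.x TaskFlow API: the DAG below does nothing but sequence work that actually happens in other systems. The DAG and task names are illustrative, not any actual Openverse DAG.

```python
# Hypothetical sketch, not an actual Openverse DAG: Airflow only schedules and
# sequences the steps; the real data movement happens in Postgres, Elasticsearch,
# or an external batch job that these tasks merely kick off and wait on.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def example_data_refresh_coordinator():
    @task
    def copy_data() -> str:
        # e.g. run the Postgres-side copy; Airflow just records success or failure
        return "temp_table_name"

    @task
    def reindex(temp_table: str) -> None:
        # e.g. submit an external indexing job (AWS Batch, an indexer worker, ...)
        # and poll it until it completes
        ...

    reindex(copy_data())


example_data_refresh_coordinator()
```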
Co-authored-by: sarayourfriend <[email protected]> Co-authored-by: zack <[email protected]>
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @sarayourfriend. Excluding weekend days, this PR was ready for review 9 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s). @AetherUnbound, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
LGTM!
This looks great and is so nice to have! The only thing that struck me as potentially confusing is placing the temp table within the ingestion server visually, when this is really a temp table in the API database that later gets promoted. Not sure how best to reflect that via the diagram, but maybe we could add a section explaining what the temp table is, and have it be clickable like the other elements? Not blocking.
This is fantastic @AetherUnbound :)
This was definitely a tough thing to convey 😅 We already had enough connections to the API database that I wanted to try and have it be separate. I'll add a note for it at the bottom of the diagram, thanks!
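For anyone reading along, a rough sketch of the "temp table that later gets promoted" idea, using illustrative table names and a hypothetical connection string; the real logic lives in the data refresh code, not here:

```python
# Illustrative sketch of the promotion step discussed above, not the real data
# refresh code: a temp table is created and populated in the API database, and
# only once it is ready is it swapped into place of the live table.
import psycopg2

conn = psycopg2.connect("dbname=openledger")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # Create and fill the temp table alongside the live one...
    cur.execute("CREATE TABLE temp_import_image (LIKE image INCLUDING ALL);")
    # ... (loading, index creation, etc. happens here) ...
    # ...then "promote" it by renaming, so the live table name points at the new data.
    cur.execute("ALTER TABLE image RENAME TO image_old;")
    cur.execute("ALTER TABLE temp_import_image RENAME TO image;")
```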
Not sure where or how to document this, but I just remembered the provider occurrence tallying we aggregate on a weekly basis into Redis: openverse/api/api/utils/tallies.py, line 26 (at 0891970).

It doesn't fit into any current ETL flow (like dead link detection does), but it strikes me that it could be useful to have a place where we've described the data we're generating overall, regardless of whether it's currently part of, or intended to be part of, an entire ETL pipeline. Provider occurrence rate might be a useful relevancy metric, for example, and it'd be a shame if we forgot about it because we aren't referring to it regularly (that I know of). Just mentioning it in this PR because of its tangential relevance to the subject, not because I think it should be included in this documentation. I needed to write it down somewhere, and this was the easiest and most sensible place in the immediate.
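A minimal sketch of the kind of weekly tallying being described, assuming redis-py; the key format and function name are hypothetical, so see api/api/utils/tallies.py for the actual implementation:

```python
# Illustrative only: count how often each provider appears in returned results,
# bucketed by ISO week, so the counts can later be read back as an aggregate.
from datetime import date

from redis import Redis

redis = Redis()


def tally_provider_occurrences(results: list[dict]) -> None:
    year, week, _ = date.today().isocalendar()
    key = f"provider_occurrences:{year}-{week}"
    for result in results:
        # ZINCRBY keeps a per-provider counter inside a single weekly sorted set
        redis.zincrby(key, 1, result["provider"])
```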
Fixes
Fixes #4455 by @AetherUnbound
Description
This PR adds a document which visualizes and describes the current and proposed data flows for the project. It also enumerates the ETL steps at each stage to make it easy to understand which operations are happening at which step.
I didn't include any transformations that happen between the client and the frontend because I wasn't aware of them, but please feel free to point them out and I can add them!
Testing Instructions
Check out the document preview and make sure it makes sense.
Checklist
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main) or a parent feature branch.
I ran the DAG documentation generator (just catalog/generate-docs for catalog PRs) or the media properties generator (just catalog/generate-docs media-props for the catalog or just api/generate-docs for the API) where applicable.
Developer Certificate of Origin