-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data flow diagram for various ETL steps in pipelines #4465
Merged
Merged
Changes from 4 commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
d90c655
Add first pass at document and diagram
AetherUnbound 97704d6
Fill out other details
AetherUnbound 76aba4f
Add proposed steps next
AetherUnbound b86161f
Fix docs warning
AetherUnbound c33c7a6
Update graphs to include Redis more explicitly
AetherUnbound 40fdffb
Add the new documentation sections for this relationship
AetherUnbound b7bd983
Remove accidental connection
AetherUnbound 3c1b4bb
Title spacing and typo
AetherUnbound 9a67ef9
Add note about temporary tables
AetherUnbound File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
# Extract-Transform-Load flows | ||
|
||
The diagrams below show how data flows through Openverse, and the various | ||
transformation steps that might happen at each stage. | ||
|
||
**Guide**: | ||
|
||
- **Blue boxes** represent places where data sits at rest. | ||
- **Purple boxes** represent transformation steps. Each box links to a longer | ||
description below. | ||
|
||
## Current flow | ||
|
||
```{mermaid} | ||
flowchart TD | ||
|
||
%% Graph definitions | ||
subgraph AF [Airflow] | ||
P{{Provider}} --> M(MediaStore):::ETL | ||
M --> T{{TSV}} | ||
T --> IC1(Index constraints):::ETL | ||
IC1 --> C[(Catalog DB)] | ||
C --> B(Batched update):::ETL --> C | ||
end | ||
|
||
subgraph IS [Ingestion Server] | ||
C --> ING(Deleted media removal):::ETL | ||
ING --> TT[(Temp Table)] | ||
TT --> CF(Cleaning & filtering):::ETL --> TT | ||
TT --> A[(API DB)] | ||
A --> IC2(Index constraints):::ETL --> A | ||
A --> ESF(Index creation & deleted media filtering):::ETL | ||
ESF --> ES[(Elasticsearch)] | ||
end | ||
|
||
A --> DL(Dead link filtering):::ETL | ||
DL --> CL[Client] | ||
CL --> F[Frontend] | ||
|
||
%% Style definitions | ||
style IS fill:#ffc83d | ||
style AF fill:#00c7d4,margin-left:12em | ||
classDef ETL fill:#fa7def | ||
|
||
%% Reference definitions | ||
click M "#mediastore-transformations" | ||
click IC1 "#index-constraints-catalog" | ||
click B "#batched-update" | ||
click ING "#deleted-media-removal" | ||
click CF "#cleaning-filtering" | ||
click IC2 "#index-constraints-api" | ||
click ESF "#es-index-creation-filtering" | ||
click DL "#dead-link-filtering" | ||
``` | ||
|
||
### MediaStore transformations | ||
|
||
These transformations occur as part of the provider ingestion scripts as records | ||
are being saved from the provider's API response into a TSV. The exhaustive list | ||
of shared transformations can be found | ||
[in the `MediaStore` definition](https://github.com/WordPress/openverse/tree/main/catalog/dags/common/storage/media.py), | ||
as well as individual transformations for | ||
[images](https://github.com/WordPress/openverse/tree/main/catalog/dags/common/storage/image.py) | ||
and | ||
[audio](https://github.com/WordPress/openverse/tree/main/catalog/dags/common/storage/image.py). | ||
|
||
A non-exhaustive list includes: | ||
|
||
- Filtering out tags based on licenses or tags which add no useful search | ||
information | ||
- Removing integer values that would exceed the Postgres integer maximum | ||
- Validating filetype and collapsing common types | ||
- Formatting tags appropriately (see | ||
[the media properties definition](catalog.md#tags)) | ||
- Enriching `meta_data` the license URLs | ||
- Validating URLs | ||
- Determining `source` is not explicitly provided | ||
|
||
### Index constraints (catalog) | ||
|
||
As the data from a saved provider TSV gets inserted into the database, it also | ||
must be compliant with the existing indices. The indices for each primary media | ||
table can be found here: | ||
[image](https://github.com/WordPress/openverse/tree/main/docker/upstream_db/0003_openledger_image_schema.sql), | ||
[audio](https://github.com/WordPress/openverse/tree/main/docker/upstream_db/0006_openledger_audio_schema.sql). | ||
In general, these include: | ||
|
||
- Uniqueness on `provider` + `foreign_identifier` combination | ||
- Uniqueness on `identifier` | ||
- Uniqueness on `url` | ||
|
||
### Batched update | ||
|
||
Updates against the catalog database may also be initiated using the | ||
[batched update DAG](/catalog/reference/DAGs.md#batched-update-dag). These | ||
updates may be automated (such as the popularity refresh process) or manual | ||
(such as data correction). | ||
|
||
### Deleted media removal | ||
|
||
Presently, we | ||
[filter out deleted media from each of the primary media tables](https://github.com/WordPress/openverse/blob/b4ab20cdb220f442949d9c06a99ced5e80c1e1e1/ingestion_server/ingestion_server/queries.py#L173-L182) | ||
while the `SELECT {columns} FROM {upstream_table}` query is being run. | ||
|
||
### Cleaning & filtering | ||
|
||
We also iterate through the copied data in batches and update the values in the | ||
new table based on several cleaning/filtering steps (all of which can be seen in | ||
the ingestion server's | ||
[cleanup definitions](https://github.com/WordPress/openverse/tree/main/ingestion_server/ingestion_server/cleanup.py)). | ||
At present, these operations include: | ||
|
||
- Ensuring URLs have a scheme | ||
- Filtering tags out (similar to | ||
[the `MediaStore` transformations](#mediastore-transformations)) | ||
- Filtering machine-generated tags below a certain confidence value | ||
|
||
### Index constraints (API) | ||
|
||
Once the ingestion server data is copied and cleaned/filtered, the API indices | ||
are applied to the new table. There are several more indices on the API than on | ||
the catalog database, many of which are for query optimization. These include | ||
(all are `btree` unless specified): | ||
|
||
- Category | ||
- Foreign identifier | ||
- Last synced with source | ||
- Provider | ||
- Source | ||
|
||
### ES index creation & filtering | ||
|
||
An additional filtering step happens when the records are indexed from the API | ||
database into Elasticsearch. In addition to the filtering of | ||
[sensitive terms using two separate indices](/catalog/reference/DAGs.md#create_filtered_media_type_index) | ||
and | ||
[deleted media](https://github.com/WordPress/openverse/blob/b1a1f6eb711a3d6999b2d2bbe4b8cc1d1435c3fa/ingestion_server/ingestion_server/indexer.py#L121), | ||
there are also transformations that occur when constructing the ES documents. | ||
The | ||
[exhaustive list can be seen in the `elasticsearch_models` module](https://github.com/WordPress/openverse/tree/main/ingestion_server/ingestion_server/elasticsearch_models.py), | ||
which includes: | ||
|
||
#### Common | ||
|
||
- Computing the authority boost | ||
- Computing the popularity | ||
- Parsing a description from the media, if available | ||
- Parsing tags | ||
- Determining the `mature` field | ||
|
||
#### Images | ||
|
||
- Computing the aspect ratio | ||
- Computing the size | ||
|
||
#### Audio | ||
|
||
- Computing the duration | ||
|
||
### Dead link filtering | ||
|
||
When records are retrieved from Elasticsearch, they also go through a process | ||
called "dead link filtering". This operation attempts to remove links to media | ||
which the API may have trouble accessing in order to prevent dead or invalid | ||
links from being surfaced in API results. The full set of dead link logic can be | ||
found in the | ||
[`check_dead_link` module](https://github.com/WordPress/openverse/tree/main/api/api/utils/check_dead_links/__init__.py). | ||
|
||
## Proposed | ||
|
||
The projects #430, #431, and #3925 all intent to modify the above process. Below | ||
AetherUnbound marked this conversation as resolved.
Show resolved
Hide resolved
|
||
is the diagram of what this process might look like after all steps are taken. | ||
Most blocks reference the same sections above, with the exception being | ||
[Deleted media & tag filtering](#deleted-media--tag-filtering). | ||
|
||
```{mermaid} | ||
flowchart TD | ||
|
||
%% Graph definitions | ||
subgraph AF [Airflow] | ||
P{{Provider}} --> M(MediaStore):::ETL | ||
M --> T{{TSV}} | ||
T --> IC1(Index constraints):::ETL | ||
IC1 --> C[(Catalog DB)] | ||
C --> B(Batched update):::ETL --> C | ||
C --> ING(Deleted media & tag filtering):::ETL | ||
ING --> A[(API DB)] | ||
A --> IC2(Index constraints):::ETL --> A | ||
A --> ESF(Index creation & deleted media filtering):::ETL | ||
ESF --> ES[(Elasticsearch)] | ||
end | ||
|
||
A --> DL(Dead link filtering):::ETL | ||
DL --> CL[Client] | ||
CL --> F[Frontend] | ||
|
||
%% Style definitions | ||
style AF fill:#00c7d4,margin-left:12em | ||
classDef ETL fill:#fa7def | ||
|
||
%% Reference definitions | ||
click M "#mediastore-transformations" | ||
click IC1 "#index-constraints-catalog" | ||
click B "#batched-update" | ||
click ING "#deleted-media-tag-filtering" | ||
click IC2 "#index-constraints-api" | ||
click ESF "#es-index-creation-filtering" | ||
click DL "#dead-link-filtering" | ||
``` | ||
|
||
### Deleted media & tag filtering | ||
|
||
The data normalization project (#430) intends to remove the | ||
[cleanup steps](#cleaning--filtering) from the ingestion server. Subsequent | ||
ingestion server removal project (#3925) will remove the ingestion server | ||
entirely. This will be fleshed out more as part of #4456, but the _filtering_ | ||
steps (e.g. demographic tags, tags below a confidence level, etc.) will be | ||
present within the pipeline prior to the API indexing values into Elasticsearch. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,14 @@ | ||
# Media properties | ||
|
||
This part of the documentation documents media properties at each layer of the | ||
Openverse stack. | ||
Openverse stack. It also includes a document describing the entire data flow and | ||
the various Extract-Transform-Load (ETL) steps that happen at each touch point. | ||
|
||
```{toctree} | ||
:titlesonly: | ||
|
||
catalog | ||
api | ||
frontend | ||
etls | ||
``` |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously from a computing and data flow standpoint this name makes sense, but we don't refer to the API as a "client" anywhere else, really. Is there a way we could expand this (maybe a parenthetical including "Django API") so that others can "connect the dots" between this graphic and our stack as understood by developers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, see I meant for "Client" here to be someone using the API (which could be the Gutenberg integration, a WP plugin, etc). I'll try to clarify that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually don't have the Django API on here anywhere, and I think I should! I'll expand on it a bit more 😄