Missing example DAGs/system tests for Google services #8280

Closed
21 of 56 tasks
mik-laj opened this issue Apr 13, 2020 · 23 comments
Labels
good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@mik-laj
Member

mik-laj commented Apr 13, 2020

Description

Hello,

We have a rule that every GCP operator should have an example DAG and a system test. This is true in many cases, but there are some exceptions.
https://github.com/apache/airflow/blob/master/tests/always/test_project_structure.py#L155-L162

  • airflow/providers/google/ads/operators/ads_to_gcs.py
  • airflow/providers/google/cloud/operators/text_to_speech.py
  • airflow/providers/google/cloud/operators/gcs_to_bigquery.py
  • airflow/providers/google/cloud/operators/adls_to_gcs.py
  • airflow/providers/google/cloud/operators/sql_to_gcs.py
  • airflow/providers/google/cloud/operators/s3_to_gcs.py
  • airflow/providers/google/cloud/operators/translate_speech.py
  • airflow/providers/google/cloud/operators/bigquery_to_mysql.py
  • airflow/providers/google/cloud/operators/speech_to_text.py
  • airflow/providers/google/cloud/operators/cassandra_to_gcs.py
  • airflow/providers/google/cloud/operators/bigquery_to_bigquery.py
  • airflow/providers/google/cloud/operators/mysql_to_gcs.py
  • airflow/providers/google/cloud/operators/mssql_to_gcs.py
  • airflow/providers/google/cloud/operators/bigquery_to_gcs.py
  • airflow/providers/google/cloud/operators/local_to_gcs.py
  • airflow/providers/google/cloud/operators/sheets_to_gcs.py
  • airflow/providers/google/suite/operators/gcs_to_sheets.py

We also lack examples for individual operators.
https://github.com/apache/airflow/blob/master/tests/always/test_project_structure.py#L164-L235

  • airflow.providers.google.cloud.operators.tasks.CloudTasksQueueDeleteOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksQueueResumeOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksQueuePauseOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksQueuePurgeOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksTaskGetOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksTasksListOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksTaskDeleteOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksQueueGetOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksQueueUpdateOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.tasks.CloudTasksQueuesListOperator (Add more operators to example DAG for Cloud Tasks #13235)
  • airflow.providers.google.cloud.operators.dataproc.DataprocInstantiateInlineWorkflowTemplateOperator
  • airflow.providers.google.cloud.operators.dataproc.DataprocInstantiateWorkflowTemplateOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPGetStoredInfoTypeOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPReidentifyContentOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPCreateDeidentifyTemplateOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPCreateDLPJobOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPUpdateDeidentifyTemplateOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPDeidentifyContentOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPGetDLPJobTriggerOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPListDeidentifyTemplatesOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPGetDeidentifyTemplateOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPListInspectTemplatesOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPListStoredInfoTypesOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPUpdateInspectTemplateOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPDeleteDLPJobOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPListJobTriggersOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPCancelDLPJobOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPGetDLPJobOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPGetInspectTemplateOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPListInfoTypesOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPDeleteDeidentifyTemplateOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPListDLPJobsOperator
  • airflow.providers.google.cloud.operators.dlp.CloudDLPRedactImageOperator
  • airflow.providers.google.cloud.operators.datastore.CloudDatastoreDeleteOperationOperator
  • airflow.providers.google.cloud.operators.datastore.CloudDatastoreGetOperationOperator
  • airflow.providers.google.cloud.sensors.gcs.GCSObjectExistenceSensor
  • airflow.providers.google.cloud.sensors.gcs.GCSObjectUpdateSensor
  • airflow.providers.google.cloud.sensors.gcs.GCSObjectsWtihPrefixExistenceSensor
  • airflow.providers.google.cloud.sensors.gcs.GCSUploadSessionCompleteSensor
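
A check of this kind could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in test_project_structure.py: the helper name `find_operators_without_example` and the sample DAG source are made up, and the check simply looks for class names that are actually referenced (not merely imported) in the example source.

```python
import ast


def find_operators_without_example(operator_names, example_sources):
    """Return operator class names not referenced in any example DAG source."""
    used = set()
    for source in example_sources:
        for node in ast.walk(ast.parse(source)):
            # Covers both `Operator(...)` and `module.Operator(...)` references.
            if isinstance(node, ast.Name):
                used.add(node.id)
            elif isinstance(node, ast.Attribute):
                used.add(node.attr)
    return sorted(set(operator_names) - used)


# Hypothetical example DAG source; it is only parsed, never executed.
EXAMPLE_SOURCE = '''
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

gcs_object_exists = GCSObjectExistenceSensor(
    task_id="gcs_object_exists", bucket="my-bucket", object="data.csv"
)
'''

missing = find_operators_without_example(
    ["GCSObjectExistenceSensor", "GCSObjectUpdateSensor"],
    [EXAMPLE_SOURCE],
)
print(missing)
```

Because the source is parsed rather than imported, a sketch like this would not need Airflow or GCP credentials to run.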

If you decide to work on this ticket, you don't have to do all the work yourself. A PR can cover just a single operator, and that's OK.

These example DAGs are key to ensuring high-quality integration.

  • If used in system tests, they prevent regression and facilitate testing.
  • If used in the documentation, they allow us to learn about operators in a real example. Users can easily do CTRL + C, CTRL + V, which makes it easier to write new DAGs.

If you haven't used GCP yet, you will get $300 after creating an account, which will allow you to get to know these services better.

Working on this task will give you a better understanding of GCP services and teach you the testing methods required by the community. If anyone is interested in this task, I am willing to provide all the necessary tips and information.

Are you wondering how to start contributing to this project? Start by reading our contributor guide

Related Issues

N/A

@mik-laj mik-laj added provider:google Google (including GCP) related issues kind:feature Feature Requests good first issue labels Apr 13, 2020
@SlowSnowFox

I would be happy to give it a try. I've only worked with airflow and AWS so far but I'm sure I should be able to come up with at least some examples.

@mik-laj mik-laj added kind:bug This is a clearly a bug and removed kind:feature Feature Requests labels Apr 13, 2020
@mik-laj
Member Author

mik-laj commented Apr 22, 2020

@SlowSnowFox I'm glad you want to work on it. Have you encountered any difficulties? Can I help you?

@SlowSnowFox

Hey, sorry for the late reply. I'll be able to make a pull request for some of them on the weekend. Is it OK if I just package all the examples I've made in one pull request and list the specific tests in the commit message?

@mik-laj
Member Author

mik-laj commented Apr 23, 2020

Hey.
That sounds good. We'll see how it works out. In case of problems, I will show you how to quickly split it into multiple PRs.
Best regards,

@joppevos
Contributor

Happy to try the gcs_to_bigquery and bigquery_to_gcs examples if they are not being worked on yet.

@mik-laj
Member Author

mik-laj commented Apr 24, 2020

Go ahead. Get started. I look forward to your example. I would be happy if you also added a system test, because it will allow us to check the example more easily.

@joppevos
Contributor

joppevos commented Apr 24, 2020

@mik-laj great! correct me if I'm wrong, but it seems there is already a gcs_to_bigquery example over here. https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/example_dags/example_gcs_to_bq.py

@mik-laj
Member Author

mik-laj commented Apr 24, 2020

Fantastic. It is now enough to change the file name and add a system test.
We want operator and sample file names to match.

airflow/providers/google/cloud/example_dags/example_gcs_to_bq.py
airflow/providers/google/cloud/operators/gcs_to_bigquery.py

This is the main requirement for now.
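
The naming rule above can be illustrated with a small sketch. It assumes the convention is `operators/<name>.py` to `example_dags/example_<name>.py`, as the two paths in this comment suggest; the helper names `expected_example_path` and `operators_missing_example` are hypothetical and are not taken from the Airflow test suite.

```python
from pathlib import PurePosixPath


def expected_example_path(operator_path):
    """Map .../operators/<name>.py to .../example_dags/example_<name>.py."""
    path = PurePosixPath(operator_path)
    return str(path.parent.parent / "example_dags" / f"example_{path.name}")


def operators_missing_example(operator_paths, example_paths):
    """Return operator modules whose matching example DAG file is absent."""
    existing = set(example_paths)
    return [p for p in operator_paths if expected_example_path(p) not in existing]


operators = ["airflow/providers/google/cloud/operators/gcs_to_bigquery.py"]
examples = ["airflow/providers/google/cloud/example_dags/example_gcs_to_bq.py"]

print(expected_example_path(operators[0]))
print(operators_missing_example(operators, examples))
```

With the current file name, `gcs_to_bigquery.py` is reported as missing an example; renaming the example file to the expected path would make the list empty.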

@ephraimbuddy
Contributor

There's an example DAG and system tests for text_to_speech.py, speech_to_text.py and translate_speech.py.
The examples are all in https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/example_dags/example_speech.py
and the system tests are also all in https://github.com/apache/airflow/blob/master/tests/providers/google/cloud/operators/test_speech_system.py

@mik-laj
Member Author

mik-laj commented May 7, 2020

@ephraimbuddy Would you like to split this file? Each module should have a separate sample DAG and a separate system test. This is important because of code maintenance. When everything adheres to one principle, it is easier to make changes automatically.

@ephraimbuddy
Contributor

OK. I'll separate them into different files. Thanks for the explanation.

@irvifa

irvifa commented Jul 6, 2020

I saw that many PRs have been merged. Is this issue already closed then?

@mik-laj
Member Author

mik-laj commented Jul 6, 2020

@irvifa Some examples are still missing. I updated the first post.
https://github.com/apache/airflow/blob/master/tests/test_project_structure.py#L125

@irvifa

irvifa commented Jul 11, 2020

@mik-laj can you please review my PR: #9760
I was wondering how I can make sure the e2e test works when I can't create an instance or DB in the example.

irvifa added a commit to irvifa/airflow that referenced this issue Jul 11, 2020
Previously there's already example of how to run export from CloudSQL
to GCS described in https://airflow.readthedocs.io/en/stable/_modules/airflow/contrib/example_dags/example_gcp_sql.html.
However, based on apache#8280 the test itself
is not available yet.
@mik-laj
Member Author

mik-laj commented Nov 11, 2020

We have a lot of DAG examples, but sometimes these DAGs don't contain all operators. More information in the first post.

@rachael-ds
Contributor

Happy to work on the documentation/example dags for the Data Loss Prevention operator :)

@mik-laj
Member Author

mik-laj commented Jan 27, 2021

@rachael-ds I assigned you to this ticket 🐈

@Bowrna
Contributor

Bowrna commented Nov 23, 2021

@mik-laj I am a newbie here and I am trying to add an mssql_to_gcs example DAG. GCP is also new to me; I have created an account and enabled a service account. I have to add/edit the connection details in the GCP connection to check whether my example DAG is working. I have already created an example DAG with mssql. When configuring connections for GCP, I had to add the keyfile path, keyfile secret name, and keyfile JSON, but I am not sure how to get these values. Are there any docs that would help me understand this part? Thanks.

@potiuk
Member

potiuk commented Nov 23, 2021

Here are the relevant GCP instructions @Bowrna https://cloud.google.com/docs/authentication/production

@Bowrna
Contributor

Bowrna commented Nov 23, 2021

@potiuk I have followed the instructions and enabled the service account. I have the JSON key after enabling the service account. But I am not sure what values I have to give for the keyfile path, keyfile secret name, and keyfile JSON in the Google Cloud connections form. I only have the JSON generated by enabling the service account.

@potiuk
Member

potiuk commented Nov 24, 2021

In Breeze you can put the files in the "files" dir and they will be visible inside as "/files/*"; in the connection you should then specify the path to that file :). I think you can specify either JSON or "Keyfile + Secret" - you do not have to specify all three. I think this page has a good explanation of what is in the key. You can also, as an exercise, look at the unit tests of GcpBaseHook - it should have tests for all the different authentication options and should show you which combinations are valid.
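
As a rough sketch of the key-file option, the connection can also be expressed as a URI in an environment variable. The assumptions here: the `extra__google_cloud_platform__key_path` and `extra__google_cloud_platform__project` extra names follow the form used in the Google provider connection docs for Airflow 2.x at the time (check the docs for your installed version), and `/files/gcp-key.json` and `my-project` are made-up placeholders. The helper `build_gcp_connection_uri` is hypothetical.

```python
from urllib.parse import urlencode


def build_gcp_connection_uri(key_path, project_id=None):
    """Build a google-cloud-platform connection URI pointing at a key file."""
    extras = {"extra__google_cloud_platform__key_path": key_path}
    if project_id:
        extras["extra__google_cloud_platform__project"] = project_id
    # urlencode percent-encodes "/" so the path survives inside the query string.
    return "google-cloud-platform://?" + urlencode(extras)


# Hypothetical key location: a file dropped into Breeze's "files" directory.
uri = build_gcp_connection_uri("/files/gcp-key.json", project_id="my-project")
print(f"export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT='{uri}'")
```

Only the key path is set here; the keyfile JSON and secret name fields are left empty, in line with the point that not all three need to be filled in.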

@Bowrna
Contributor

Bowrna commented Nov 25, 2021

In Breeze you can put the files in the "files" dir and they will be visible inside as "/files/*"; in the connection you should then specify the path to that file :). I think you can specify either JSON or "Keyfile + Secret" - you do not have to specify all three. I think this page has a good explanation of what is in the key. You can also, as an exercise, look at the unit tests of GcpBaseHook - it should have tests for all the different authentication options and should show you which combinations are valid.

Thanks @potiuk. Checking the unit tests would be a great way to understand this configuration.

@eladkal
Contributor

eladkal commented Apr 21, 2022

Closed in favor of AIP 47
https://github.com/apache/airflow/projects/15
We have a dedicated issue for each example DAG.

@eladkal eladkal closed this as completed Apr 21, 2022
10 participants