- Start: 17 January 2022
- Registration link: https://airtable.com/shr6oVXeQvSI5HuWD
- Register in DataTalks.Club's Slack
- Join the #course-data-engineering channel
- Subscribe to our public Google Calendar (works from desktop only)
- The videos are published to DataTalks.Club's YouTube channel in the course playlist
- Leaderboard
Note: This is preliminary and may change
- Course overview
- Introduction to GCP
- Docker and docker-compose
- Running Postgres locally with Docker
- Setting up infrastructure on GCP with Terraform
- Preparing the environment for the course
- Homework
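The Docker/Postgres setup above boils down to one compose file. A minimal sketch of what it looks like — the image tag, credentials, database name, and volume path here are placeholders, not the course's exact values:

```yaml
services:
  pgdatabase:
    image: postgres:13            # any recent Postgres tag works
    environment:
      POSTGRES_USER: root         # placeholder credentials
      POSTGRES_PASSWORD: root
      POSTGRES_DB: ny_taxi        # hypothetical database name
    volumes:
      - ./postgres_data:/var/lib/postgresql/data   # persist data between runs
    ports:
      - "5432:5432"
```

Run it with `docker-compose up -d` and connect with any Postgres client on localhost:5432.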
- Data Lake
- Workflow orchestration
- Setting up Airflow locally
- Ingesting data to GCP with Airflow
- Ingesting data to local Postgres with Airflow
- Moving data from AWS to GCP (Transfer service)
- Homework
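Orchestration aside, the core of an ingestion task is a chunked load into a database. A minimal sketch of that step, using Python's stdlib sqlite3 as a stand-in for the course's Postgres instance (table, column names, and batch size are made up for illustration):

```python
import csv
import io
import sqlite3

# Stand-in for a Postgres connection; in the course you would connect
# to the Dockerized Postgres instead (e.g. via psycopg2 or SQLAlchemy).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, fare REAL)")

# Hypothetical CSV payload; normally this would be a downloaded file.
data = io.StringIO("trip_id,fare\n1,12.5\n2,7.0\n3,30.1\n")

reader = csv.DictReader(data)
batch, batch_size = [], 2  # tiny batch size just for illustration

def flush(rows):
    # Insert one batch and commit, so memory use stays bounded
    conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    conn.commit()

for row in reader:
    batch.append((int(row["trip_id"]), float(row["fare"])))
    if len(batch) >= batch_size:
        flush(batch)
        batch = []
if batch:  # flush the final partial batch
    flush(batch)

count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
```

In Airflow, a loop like this would live inside a task, with the download and load steps wired together as a DAG.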
- Data Warehouse
- BigQuery
- Partitioning and clustering
- BigQuery best practices
- Internals of BigQuery
- Integrating BigQuery with Airflow
- BigQuery Machine Learning
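As a taste of the partitioning and clustering topics: BigQuery lets you declare both directly in DDL. The dataset, table, and column names below are hypothetical:

```sql
-- Partition pruning and clustering cut the data BigQuery scans per query
CREATE TABLE mydataset.trips_partitioned
PARTITION BY DATE(pickup_datetime)   -- one partition per day
CLUSTER BY vendor_id                 -- co-locate rows with the same vendor
AS
SELECT * FROM mydataset.trips_raw;
```

Queries that filter on `pickup_datetime` then scan only the matching partitions instead of the whole table.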
- Basics of analytics engineering
- dbt (data build tool)
- BigQuery and dbt
- Postgres and dbt
- dbt models
- Testing and documenting
- Deployment to the cloud and locally
- Visualising the data with Google Data Studio and Metabase
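A dbt model is just a SELECT statement in a .sql file, which dbt compiles into a table or view in the warehouse. A minimal sketch with hypothetical model and column names:

```sql
-- models/staging/stg_trips.sql (hypothetical path and names)
-- {{ ref('...') }} is how dbt wires models into a dependency graph
select
    vendor_id,
    cast(pickup_datetime as timestamp) as pickup_datetime,
    fare_amount
from {{ ref('raw_trips') }}
where fare_amount > 0
```

Running `dbt run` builds this model, and everything it depends on, in BigQuery or Postgres.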
- Batch processing
- What is Spark
- Spark DataFrames
- Spark SQL
- Internals: GroupBy and joins
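To preview the "Internals: GroupBy and joins" topic: Spark's groupBy first hash-partitions rows by key so that all rows with the same key land in the same task (the shuffle), and then each task aggregates locally. A pure-Python illustration of that idea — this is not Spark itself, and the data and partition count are made up:

```python
from collections import defaultdict

rows = [("green", 5.0), ("yellow", 7.5), ("green", 2.5), ("yellow", 1.0)]
num_partitions = 2  # stand-in for the number of parallel tasks

# Shuffle phase: route each row to a partition by hashing its key,
# so all rows sharing a key end up in the same partition.
partitions = defaultdict(list)
for key, value in rows:
    partitions[hash(key) % num_partitions].append((key, value))

# Aggregation phase: each partition sums its own keys independently;
# no cross-partition coordination is needed after the shuffle.
totals = {}
for part in partitions.values():
    for key, value in part:
        totals[key] = totals.get(key, 0.0) + value
```

Joins work the same way: both sides are hash-partitioned on the join key, so matching rows meet in the same partition.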
Goal:
Instructor: Ankush
- Basics
- What is Kafka
- Internals of Kafka: brokers
- Partitioning of Kafka topics
- Replication of Kafka topics
- Consumer-producer
- Schemas (Avro)
- Streaming
- Kafka streams
- Kafka connect
- Alternatives (PubSub/Pulsar)
Duration: 1.5h
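On the schemas topic: an Avro schema is a JSON document that fixes the field names and types of every message on a topic. A minimal example with hypothetical record and field names:

```json
{
  "type": "record",
  "name": "TripEvent",
  "namespace": "com.example.trips",
  "fields": [
    {"name": "trip_id", "type": "int"},
    {"name": "vendor_id", "type": "string"},
    {"name": "fare", "type": "double"}
  ]
}
```

Producers and consumers validate messages against this schema (usually via a schema registry), so incompatible changes are caught before they corrupt a topic.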
- Putting everything we have learned into practice
Duration: 2-3 weeks
- Upcoming buzzwords
- Delta Lake/Lakehouse
- Databricks
- Apache Iceberg
- Data mesh
- ksqlDB
- Streaming analytics
- MLOps
Duration: 30 mins
- Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
- Google Cloud Storage (GCS): Data Lake
- BigQuery: Data Warehouse
- Terraform: Infrastructure-as-Code (IaC)
- Docker: Containerization
- SQL: Data Analysis & Exploration
- Airflow: Pipeline Orchestration
- dbt: Data Transformation
- Spark: Distributed Processing
- Kafka: Streaming
To get the most out of this course, you should be comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.
Prior experience with data engineering is not required.
- Ankush Khanna (https://linkedin.com/in/ankushkhanna2)
- Sejal Vaidya (https://linkedin.com/in/vaidyasejal)
- Victoria Perez Mola (https://www.linkedin.com/in/victoriaperezmola/)
- Alexey Grigorev (https://linkedin.com/in/agrigorev)
For this course you'll need to have the following software installed on your computer:
- Docker and Docker Compose
- Python 3 (e.g. via Anaconda)
- Google Cloud SDK
- Terraform
See Week 1 for more details about installing these tools
You can ask any questions in the #course-data-engineering channel in DataTalks.Club's Slack.
Please follow these recommendations when asking for help
- Q: I registered but haven't received a confirmation email. Is that normal? A: Yes. Confirmations are sent manually, not automatically, so you will receive an email eventually
- Q: At what time of day will the live sessions happen? A: Office hours will be on Mondays at 17:00 CET. Everything will be recorded, so you can watch it whenever it's convenient for you
- Q: Will there be a certificate? A: Yes, if you complete the project
- Q: I'm 100% not sure I'll be able to attend. Can I still sign up? A: Yes, please do! You'll receive all the updates and then you can watch the course at your own pace.
- Q: Do you plan to run a ML engineering course as well? A: Glad you asked. We do :)
Big thanks to other communities for helping us spread the word about the course:
Check them out - they are cool!