Lecturer: Riccardo Tommasini, PhD
Special thanks to Emanuele Della Valle and Marco Brambilla from Politecnico di Milano for letting me "steal" some of their great slides.
- Data Engineer
- [[Data Lifecycle]]
- Data Collection (Taught not tested)
- Data Processing
- [[Data Analysis]] (Extra Points - Suggestion for Data Science Project in Spring)
- Data Processing (from ETL to Data Pipelines, Introduce Big Data)
- Data Ingestion (Files, HDFS, MongoDB)
- [[Cleansing | Data Pre-Processing ]] (Python)
- Data Transformation (Airflow)
- [[Data Modeling]]
- [[ systems/Apache Hadoop | Parallel Processing ]]
- [[Data Serving]]
- [[Data Visualisation]]
- [[Querying]]
- Streaming Data Pipelines
- [[Apache Kafka | Streaming Data Ingestion ]]
- [[Streaming Data Pre-Processing | Cleansing]] (Java)
- Streaming Data Transformation (Java/SQL)
- [[Event Sourcing | Data Modelling for Data Streams]]
Date | Title | Material | Mandatory Reads | Extras |
---|---|---|---|---|
01/09 | Course Intro | Slides - pdf slide 45-109 | ||
03/09 | Data Modeling | Slides - pdf slide 1-44 | Chp 4 p111-127, Chp 5 p151-156, Chp 6 p199-205 of [3] | |
10/09 | DM for Relational Databases | Slides - pdf slide 45-109 | Chp 2, 6, and 7 (Normal Forms) of [1] | Relational Model |
10/09 | DM for Data Warehouse | Slides - pdf slide 109-118 | pdf video | Chp 2 of [2] |
17/09 | DM for Big Data | Slides - pdf | Chp 2 of [3], video | paper |
17/09 | Key Value Stores | Slides 1 Slides 2 pdf | nosql | |
24/09 | Column Oriented Databases | Slides 1 Slides 2 pdf | nosql | |
24/09 | Document Databases | Slides 1 Slides 2 pdf | nosql | |
01/10 | Graph Databases | Slides 1 Slides 2 pdf1 pdf2 | Chp 3 and 5 of [5] | book |
08/10 | Data Ingestion | Slides 1 Slide 2 Slide 3 Slide 4 | ||
15/10 | Part 1 Recap | Slides 1 pdf | ||
22/10 | Midterm | |||
29/10 | Data Engineering Pipelines (Part1) | Slides 1 slide 2 pdf | ||
05/11 | Data Engineering Pipelines (Part2) | Slides 1 Slides 2 Slides 3 | Chp 10 of [3], R. Chang Pt 2, R. Chang Pt 3 | |
12/11 | Streaming Data (Part 1) | Slide 1 Slide 2 | Chp 11 of [3], Streaming 101, Streaming 102 | |
19/11 | Data Journey | Slides | ||
26/11 | Streaming Data (Part 2) | Slide 1 Slide 2 | ||
03/12 | Data Wrangling (Part 1) | |||
10/12 | Data Wrangling (Part 2) | |||
Date | Title | Material | Reads | Videos | Branch | Notes |
---|---|---|---|---|---|---|
07-08/09 | Docker | Slides | | Video GP1 Video GP2 | Lab Branch | QA GP2 only |
14-15/09 | Modeling and Querying Relational Data with Postgres | Slides | Chp 32 of [1]§ | Video | Homework 1 | |
21-22/09 | Modeling and Querying Key Value Data with Redis | Slides | | Video | Homework 2 | |
28-29/09 | Modeling and Querying Document Data with MongoDB | Slides | | Video | Homework 3 | |
05-06/10 | Modeling and Querying Graph Data with Neo4J | Slides | CypherManual | Video | Homework 4 | |
19-20-26-27/10 | Data Ingestion with Apache Kafka | Slides | | Video 1 Video 2 Video 3 Video 4 | Homework 5 | |
10-11/11 | Apache Airflow Data Pipelines | Slides | | Video 1 Video 2 | Homework 6 | |
16-17/11 | Stream Processing with Kafka Streams | Slides | | Video 1 Video 2 | Homework 7 | |
23-24/11 | Stream Processing with KSQL | Slides | | Video 1 Video 2 | Homework 7 | |
07-08/12 | Data Cleansing | Slides | | Video 1 Video 2 | Homework 8 | |
14-15/12 | Data Augmentation | Slides | | Video 1 Video 2 | Homework 8 | |
- Modeling and Querying RDF data: SPARQL
- Domain Driven Design: a summary
- Event Sourcing: a summary
- Data Pipelines with Luigi
- Data Pipelines with Apache NiFi
- Data Processing with Apache Flink
- What is (Big) Data?
- The Role of Data Engineer
- Data Modeling
- Data Replication
- Data Partitioning
- Transactions
- Relational Data
- NoSQL
- Document
- Graph
- Data Warehousing
- Star and Snowflake schemas
- Data Vault
- (Big) Data Pipelines
- Big Data Systems Architectures
- ETL and Data Pipelines
- Best Practices and Anti-Patterns
- Batch vs Streaming Processing
- Data Cleansing
- Data Augmentation
- [1] Database System Concepts, 7th Edition - Avi Silberschatz, Henry F. Korth, S. Sudarshan (McGraw-Hill, ISBN 9780078022159)
- [2] The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition - Ralph Kimball, Margy Ross
- [3] Designing Data-Intensive Applications - Martin Kleppmann
- [4] Designing Event-Driven Systems
- [5] Graph Databases [[slides/Slides]]