Within this repository, I will document my evolution as a software engineer. Maintaining a record of my progress is invaluable. When I first embarked on my programming journey, I envisioned a way to reflect on my past decisions. This repository stands as a testament to that vision, a chronicle of my growth in the world of software engineering.
Today I was working on a Spark issue related to parallelism. I was working with a data pipeline that reads data from a Parquet file, applies some modifications to it, and finally saves the result back to the same location in overwrite mode. The problem was that an error happened while saving the data to that location: while Spark was still reading from the files, other workers were already removing and rewriting data in the same location, like a race condition. The solution was to cache the data and then write it from that cache back to the location. This issue was crazy because I had been working on it for more than 6 days.
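A minimal sketch of the workaround, assuming a DataFrame read from Parquet (the path and the transformation are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("overwrite-same-path").getOrCreate()

path = "/data/events"  # hypothetical source location

df = spark.read.parquet(path)

# Hypothetical modification applied to the data.
cleaned = df.filter(F.col("status") == "active")

# Cache and materialize the result BEFORE overwriting the source,
# so the write no longer depends on re-reading the files it deletes.
cleaned.cache()
cleaned.count()

cleaned.write.mode("overwrite").parquet(path)
```

Note that if the cached data is evicted from memory, Spark may still try to recompute it from the deleted source files, so writing to a temporary location first and then swapping is another option.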
Today I completed the Airflow 101 Learning Path. I completed this course to prepare for the Airflow Fundamentals exam. I learned the basics of Airflow and some advanced concepts. I think I am ready for the exam, and I will take it in the next few days.
Today I continued studying Airflow sensors. I created example DAGs to test the sensor functionality. I learned that, by default, sensors have a timeout of 7 days to wait for the condition to become true, which can be a problem for some use cases; the default poke_interval is 60 seconds. Sensors are a good feature for waiting until something happens before performing another action. I want to create a pipeline that uses sensors, but for the moment I don't have an idea of what to build with them.
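A minimal sketch of a sensor with an explicit poke_interval and timeout, assuming a FileSensor waiting on a hypothetical file path and filesystem connection:

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="sensor_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Checks for the file every 30 seconds and fails after 1 hour,
    # instead of using the 7-day default timeout.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/data.json",   # hypothetical file to wait for
        fs_conn_id="fs_default",     # hypothetical filesystem connection
        poke_interval=30,
        timeout=60 * 60,
    )
```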
On Saturday I continued studying the Airflow 101 course. I learned how Airflow sensors work: sensors are a way to wait for a condition to be true before continuing with the next task, like waiting for something to happen before continuing with the work, for example waiting for a file to be created in a directory, or waiting for an HTTP request to return a 200 status code. I also learned some DAG debugging techniques for the most common problems in Airflow, like a DAG not running because the scheduler is down or busy processing other tasks, because the start date is in the future, or because the schedule interval is not correct.
Today I learned how to share data between tasks in Airflow using XComs. XCom values are like having two functions where function B uses the value returned by function A. XComs are a good way to share data between tasks, but they are not recommended for sharing large amounts of data. The maximum size of an XCom value depends on the database you use as the metadata database; for example, with SQLite you can share values of up to 2 GB. Also, the values shared as XComs need to be JSON serializable.
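A minimal sketch of sharing a small value between two tasks via XCom with the TaskFlow API (the task names and the payload are made up):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None, catchup=False)
def xcom_example():
    @task
    def extract() -> dict:
        # The returned value is pushed to XCom, so it must be JSON serializable.
        return {"user_id": 42, "name": "Ada"}

    @task
    def load(record: dict) -> None:
        # Airflow pulls the XCom value and passes it in as an argument.
        print(f"Saving record {record['user_id']}: {record['name']}")

    load(extract())


xcom_example()
```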
Today I practiced multiple ways to filter a DataFrame with PySpark and Python. I needed to do this because I had to remove some unwanted records from a DataFrame. I learned how to filter using PySpark operations like `where` and `filter`, using PySpark functions like `col` together with equality comparisons, etc. I also created a temp view in Spark to apply the filter using SQL syntax, and later discovered that if you use `df.filter()`, you can pass the SQL query as a string argument to the `filter` method. PySpark's flexibility to perform these operations is good, but having multiple ways to do the same operation is sometimes a challenge, because you are always looking for new ways to do your operations.
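A minimal sketch of the filtering variants mentioned above (the DataFrame, column names, and values are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter-examples").getOrCreate()

df = spark.createDataFrame(
    [(1, "active"), (2, "deleted"), (3, "active")],
    ["id", "status"],
)

# 1. where() with a column expression.
active_where = df.where(col("status") == "active")

# 2. filter() with a column expression (filter and where are aliases).
active_filter = df.filter(col("status") == "active")

# 3. filter() with a SQL expression passed as a string.
active_string = df.filter("status = 'active'")

# 4. A temp view queried with plain SQL syntax.
df.createOrReplaceTempView("records")
active_sql = spark.sql("SELECT * FROM records WHERE status = 'active'")
```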
Today I learned how the left anti join works in SQL and Spark. I needed to clean some data in a Spark DataFrame using a JOIN, because to filter one DataFrame I needed another DataFrame containing the IDs of the records to remove. I had problems understanding how the left anti join works, but with practice and drawing I was able to complete the task spark-tutorial. In other topics, I continued working on my Airflow certification today.
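A minimal sketch of a left anti join used to drop the records whose IDs appear in a second DataFrame (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("left-anti-join").getOrCreate()

records = spark.createDataFrame(
    [(1, "keep"), (2, "remove me"), (3, "keep")],
    ["id", "value"],
)
ids_to_remove = spark.createDataFrame([(2,)], ["id"])

# left_anti keeps only the rows of `records` whose id has NO match
# in `ids_to_remove`.
cleaned = records.join(ids_to_remove, on="id", how="left_anti")
cleaned.show()
```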
On Monday I finished a DAG that generates a random user every 5 minutes and saves the user in a PostgreSQL database; for the moment the database has 31920 records. I plan to save the records in a CSV file and upload the file to Hugging Face to share the data. In other topics, I completed 5 practice tests for the IBM Cloud Advocate certification, achieving a general score of 90%. I am happy with my current knowledge of IBM Cloud.
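A minimal sketch of how such a DAG could look, assuming a hypothetical users table and a Postgres connection named postgres_default; the random-user generation is simplified:

```python
import random
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(start_date=datetime(2023, 1, 1), schedule="*/5 * * * *", catchup=False)
def random_user_pipeline():
    @task
    def generate_user() -> dict:
        # Stand-in for the real random-user generation.
        return {"name": random.choice(["Ada", "Linus", "Grace"]), "age": random.randint(18, 90)}

    @task
    def save_user(user: dict) -> None:
        hook = PostgresHook(postgres_conn_id="postgres_default")  # hypothetical connection id
        hook.run(
            "INSERT INTO users (name, age) VALUES (%s, %s)",
            parameters=(user["name"], user["age"]),
        )

    save_user(generate_user())


random_user_pipeline()
```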
I am having problems keeping my two Airflow environments synchronized. I run Airflow with pip in my development environment (personal laptop), and I run Airflow with Docker on my server. For an unknown reason, when I configured a Git repository to keep the two environments in sync, I ran into problems with the Docker environment. I am going to debug the issue tomorrow and maybe consider running both environments the same way.
Today I learned how the BashOperator works in Airflow. The BashOperator is used to execute Bash commands or scripts on the host machine where Airflow is installed. This means that if, for example, you want to save the content of an HTTP request to a file on disk, you can create a script that does this and saves the file in the /tmp directory. I also created a DAG to understand how the BashOperator works. The first task executes a bash script that makes an HTTP request and gets a JSON file, the second task checks that the file was saved correctly, and the third and final task uploads the content of the JSON file into a MongoDB collection. I could not configure MongoDB with the Mongo provider, so I will have to continue configuring the connection tomorrow.
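A minimal sketch of the first two tasks with the BashOperator (the URL and file path are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bash_operator_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Download a JSON payload to the host's /tmp directory.
    fetch_json = BashOperator(
        task_id="fetch_json",
        bash_command="curl -sS https://example.com/api/users -o /tmp/users.json",
    )

    # Fail if the file was not created or is empty.
    check_file = BashOperator(
        task_id="check_file",
        bash_command="test -s /tmp/users.json && head -c 200 /tmp/users.json",
    )

    fetch_json >> check_file
```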
Today, I continued studying Airflow. I refreshed my knowledge of Airflow concepts: a DAG Run is an instance of an entire DAG that was executed, and a Task Instance is an instance of a Task that was executed; remember that a DAG can be composed of one or more Tasks. Tomorrow I plan to continue my Airflow course and finish the prerequisite course so I can start with the exam.
Today, I continued doing practice exams for the AWS certification. In other topics, I wrote a bash script that performs an action every ten minutes in an infinite while loop. The idea was to avoid manually executing the command that triggers another process every time the process completes; this way, the script attempts to execute the process every ten minutes.
Today, I continued studying for the IBM Cloud Advocate certification. At the moment, I am focusing on answering quizzes for this certification. I needed to revisit this study because the exam is scheduled for the last days of November or the first days of December.
I feel tired—why? Because I am working on achieving four certifications simultaneously:
- IBM Cloud Advocate
- AWS Cloud Practitioner
- Airflow Fundamentals
- Airflow DAG Authoring
The last two were not originally planned in my schedule, but I received a coupon to take both for free (with a 30-day expiration). I have organized a schedule to study for these certifications on specific days, both to avoid burnout and as a study technique. I'll share more details in the future if this strategy proves effective. But please avoid working toward multiple software certifications at once 😅.
Today I continued studying Airflow; I am working on the next Airflow course. I got the Airflow Fundamentals exam for free with a promotion, so I am studying Airflow to get the Airflow Fundamentals certification. I will need to pause my other certifications because this promotion has a 30-day expiration, so I only have 30 days to study and prepare for the exam.
Today I continued with the AWS practice exams; for the moment I am getting better results. The key is to memorize the service definitions so I can match a use case with a definition. For the Software Engineering at Google book, I continued reading the testing part; the book has 4 chapters dedicated to testing, so it looks like Google loves well-tested software.
Today I created a new GitHub repository to store my Dockerfile and docker-compose files. The idea behind this repository is to have all my container files in one place, because I have several Dockerfiles on my personal computer and my two Linux servers. I love working with Docker and setting up my own resources like databases, message brokers, etc., in containerized environments. You can find the repository at the following link. Also, I continued working on configuring my PostgreSQL database. I was working on setting up a separate container to create a backup of my database every day and store the backup on my server's disk drive.
Today I worked on setting up a PostgreSQL database with Docker. I created a docker-compose file to deploy PostgreSQL with special configurations, and I was learning about the best ways to deploy and manage this database within a container, with configurations for performance, security, and health checks. I created and configured the PostgreSQL database to save some data from my Airflow DAG into it.
Today I continued working on my Udemy course of practice exams for the AWS Certified Cloud Practitioner exam. I was also working on my Airflow DAG; for the moment I am configuring the DAG to load the extracted data into a PostgreSQL database.
Today I discovered some amazing things reading the Software Engineering at Google book. I have been reading the chapter on testing at Google; it looks like testing at Google is a mature process, and every engineer working at Google is required to follow a testing methodology. I discovered the Google Testing Blog, a good place to find useful information about testing. I also learned the Beyonce rule for testing: "If you liked it, then you shoulda put a test on it"; this rule is amazing xD. In other topics, I continued with my DAG development in Airflow.
Today, I continued working through the Udemy course and completed practice test number 3. Unfortunately, I didn’t pass, scoring 61% (the minimum passing score is 70%). I struggled particularly with questions related to Billing & Pricing and Cloud Concepts. The challenge with AWS is that it has so many services, making it difficult to remember all of them. Since I don’t work with AWS services in my daily activities, it’s harder for me to recall the most common service usages and descriptions. I’ll need to keep practicing and studying to feel fully prepared for the exam.
Today, reading the Software Engineering at Google book, I reviewed an important topic in software engineering: documentation. Documentation focused on software products and code is one of the most important aspects of good code; when you write good documentation for your API, it is easy for a user to understand how your code works and what it does.
Today I continued working on the 6 Practice Exam | AWS Certified Cloud Practitioner Udemy course. I completed two of the six practice tests: the first with an 82% grade and the second with 72%. The minimum to pass the exam is 70%. Really, this exam is a little crazy; AWS has a lot of services to study. The idea is to study the most important services and have some luck, like in a coding interview where you know how to solve a good number of problems and hope to find those problems in your interview.
During this day I only studied for a short time for my AWS certification; for the moment I am only focusing on solving practice exams to prepare for the real one, since I completed the required courses. I have also been reading the Software Engineering at Google book; for the moment I am in Chapter 10, Documentation.
Day 1 - 50 HTML - CSS - Javascript here
Day 51 - 100 Javascript here
Day 101 - 150 Javascript here
Day 151 - 200 Javascript here
Day 201 - 250 Javascript and start React JS here
Day 251 - 300 Javascript - React JS here
Day 301 - 350 Javascript - React JS and start Node JS here
Day 351 - 400 Start learn MERN stack here
Day 401 - 450 MERN stack - Docker - Typescript - TDD here
Day 451 - 500 Object Oriented Programming here
Day 501 - 550 Data Structures and Algorithms here
Day 551 - 600 Java
Day 601 - 650 Java here
Day 651 - 700 Java and Spring Boot here
Day 701 - 750 Java | Spring Boot | MySQL here
Day 751 - 800 Java | Spring Boot | MySQL here
Day 801 - 850 Java | Spring Boot | OCI here
Day 851 - 900 Java | Design Patterns here
Day 901 - 950 Java | Design Patterns here
Day 951 - 1000 Java | Design Patterns here
Day 1001 - 1050 Java | New Job here
Day 1051 - 1100 Java | Python | Cloud Computing here
Day 1101 - 1150 Java | Containers | Cloud Native Development | OpenShift here
Day 1151 - 1200 Java | Containers | Python | Security here
Day 1201 - 1250 Java | Python | Security | English here
Day 1251 - 1300 Java | Python | Security | English | Soft-Skills here
Day 1301 - 1350 Java | Python | Spark | Microservices here
Day 1351 - 1400 Java | Python | Design Patterns | Microservices here
Day 1401 - 1450 Java | Design Patterns | CI/CD here
Day 1451 - 1500 Java | Design Patterns | CI/CD here
Day 1501 - 1550 Java | Python | Data Analytics | CI/CD here
Day 1551 - 1600 Java | Python | Data Analytics | CI/CD here
Day 1601 - 1700 Java | Python | ReactJS | Data Structures and Algorithms here
Day 1701 - 1750 Java | Go | DSA | A Common-Sense Guide to DSA Book here
Day 1751 - 1800 Java | Python | DSA | AWS Cloud Practitioner | Data Engineering here
Day 1801 - 1850 Java | Python | Airflow | AWS Cloud Practitioner | Data Engineering here