I'm Mahdi, a PM building products in the data space. Before transitioning to product, I spent seven years designing and building petabyte-scale data platforms, wearing different hats along the way (data engineer, tech lead, data architect, and MLOps engineer). I'm passionate about open-source projects and enjoy working with data and designing scalable solutions. You can also read my content on Medium and via the Data Espresso newsletter.
- Apache Spark: I used it daily for nearly four years (so we know each other pretty well).
- dbt: The tool I currently work with the most. I mainly focus on defining and implementing standards, frameworks, and automation to better leverage dbt at scale. (Article from the Zendesk Engineering blog)
- AWS Ecosystem: Worked with it for two years on various data and ML projects (mostly Glue, EMR, Athena, ECS, SageMaker, and the AWS CI/CD stack).
- GCP Ecosystem: Currently using it on a daily basis (mostly BigQuery and GKE).
- Hadoop: Worked with Hadoop data lakes for two and a half years (it was the ecosystem that first introduced me to distributed systems and the paradigms/concepts behind them).
- Other notable projects/tools: Apache Superset, Apache Airflow, Apache Zeppelin, Apache Hive, Dremio, Databricks, Jupyter, and D3.js.
- Languages I'm fluent in: Python, Java, and SQL.
- Other languages I've used in the past: C++, C#, JavaScript (Angular, Node.js), and HTML+CSS.
- IaC: Terraform and CloudFormation.
- End-to-End Batch Data Pipeline with Spark: A series of four projects that I authored for Manning Publications as part of their liveProjects platform. The series walks through the steps of building an end-to-end Big Data pipeline, with learners using Apache Spark, Delta Lake, and Apache Superset.
- Building an End-to-End Open-Source Modern Data Platform: Proposes a comprehensive design (accompanied by the necessary Infrastructure-as-Code) for building a modern data platform using only open-source projects and the resources offered by cloud providers.
- Writing design docs for data pipelines: Exploring what design docs for data components are, why they matter, and how to write them.
- Navigating Your Data Platform’s Growing Pains: A Path from Data Mess to Data Mesh: A set of strategies and guiding principles to effectively scale your data platform while maximizing its business impact.
- A Simple (Yet Effective) Approach to Implementing Unit Tests for dbt Models: Proposes an innovative unit testing approach for dbt models, relying on standards and dbt best practices.
- Creating Notebook-based Dynamic Dashboards: A design (accompanied by a POC) in which notebooks are leveraged to generate dynamic dashboards that support a Google-like metadata search engine.
- Data Innovation Summit 2023: The Data Engineer's Guide to Data Quality Testing: The Fun, Easy, and Scalable Way
- Big Data Expo 2022: A Practical Case Study for Data Engineers: Performing Data Quality at Scale
- The Modern Data Show (S01E02): The third wave of data technologies