My solutions to the mid-chapter and end-of-chapter exercises in Jonathan Rioux's Data Analysis with Python and PySpark.
To prepare for tasks where data becomes too large to work with locally, I wanted to learn at least one distributed computing framework reasonably well, and Spark remains a popular choice in the data science and data analytics community. All three major cloud providers (Amazon Web Services, Google Cloud Platform, and Microsoft Azure) offer managed Spark clusters as part of their platforms, making it easy to get a fully provisioned cluster up and running quickly.
PySpark is the Python entry point to Spark's computational model. It provides access not only to the core Spark API but also to dedicated functionality for scaling out regular Python code and pandas transformations.
Having used PySpark on Databricks in a product management role to answer questions about the datasets underpinning our product, I wanted to refresh my skills while approaching Spark from an analyst's perspective.