Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. The Docker images defined in this repository are used to build the pre-built container images that run Spark jobs on Amazon SageMaker via the SageMaker Python SDK. The pre-built images are available in Amazon Elastic Container Registry (Amazon ECR), and this repository serves as a reference for anyone who wants to build a customized Spark container for use in Amazon SageMaker.
For the list of available Spark images, see Available SageMaker Spark Images.
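If you prefer to resolve an image URI programmatically, recent versions of the SageMaker Python SDK can look up the pre-built Spark processing image for a given region and version. The sketch below assumes the SDK's `image_uris` helper supports the `spark` framework identifier; the region and version values are illustrative.

```python
from sagemaker import image_uris

# Resolve the ECR URI of a pre-built SageMaker Spark processing image.
# Region and version are placeholders; adjust to match your account setup.
uri = image_uris.retrieve(
    framework="spark",
    region="us-east-1",
    version="3.1",
    image_scope="processing",
)
print(uri)
```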
This project is licensed under the Apache-2.0 License.
The simplest way to get started with the SageMaker Spark Container is to use the pre-built images via the SageMaker Python SDK.
For details, see the Amazon SageMaker Processing section of the SageMaker Python SDK documentation.
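As a minimal sketch of that workflow, the example below submits a PySpark script to SageMaker Processing using the pre-built Spark image. The S3 paths, IAM role ARN, and script name are placeholders you would replace with your own values.

```python
from sagemaker.spark.processing import PySparkProcessor

# Configure a processor backed by the pre-built SageMaker Spark image.
spark_processor = PySparkProcessor(
    base_job_name="sm-spark-example",
    framework_version="3.1",  # selects the pre-built Spark image version
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role ARN
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Submit a PySpark application; arguments are passed through to the script.
spark_processor.run(
    submit_app="./code/preprocess.py",  # placeholder PySpark script
    arguments=[
        "--input", "s3://my-bucket/input/",    # placeholder S3 input
        "--output", "s3://my-bucket/output/",  # placeholder S3 output
    ],
    spark_event_logs_s3_uri="s3://my-bucket/spark-event-logs",  # optional: persist Spark event logs
)
```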
To get started building and testing the SageMaker Spark container, you will need to set up a local development environment. See the instructions in DEVELOPMENT.md.
To contribute to this project, please read through CONTRIBUTING.md.