---
title: Development Setup and Building From Source
hide_title: true
sidebar_label: Development Setup
description: SynapseML Development Setup
---

# SynapseML Development Setup
1. Install JDK 11
   - You may need an Oracle login to download.
2. Install Apache Spark
   - Download and install Apache Spark with version >= 3.2.0. (SynapseML v0.11.1 only supports Spark versions >= 3.2.0.)
   - Extract the downloaded archive (with the 7-Zip app on Windows or `tar` on Linux) and remember the location of the extracted files; we take `C:\bin\spark-3.2.0-bin-hadoop3.2` or `~/bin/spark-3.2.0-bin-hadoop3.2/` as examples here.
   - On Windows, run the following commands to set the environment variables used to locate Apache Spark. Make sure to run the command prompt in administrator mode.

     ```cmd
     setx /M HADOOP_HOME C:\bin\spark-3.2.0-bin-hadoop3.2\
     setx /M SPARK_HOME C:\bin\spark-3.2.0-bin-hadoop3.2\
     setx /M PATH "%PATH%;%HADOOP_HOME%;%SPARK_HOME%bin"
     ```

     Warning: don't run the `setx /M PATH` command if your PATH is already long, as it will truncate your path to 1024 characters and potentially remove entries!
   - On Linux, add the following to your `.bashrc`:

     ```bash
     export SPARK_HOME=~/bin/spark-3.2.0-bin-hadoop3.2/
     export PATH="$SPARK_HOME/bin:$PATH"
     ```
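The Linux setup above can be sanity-checked with a short sketch, assuming the example install location `~/bin/spark-3.2.0-bin-hadoop3.2/` used throughout this page:

```shell
# Sketch: confirm the environment variables resolve as expected.
# Assumes the example install location from above.
export SPARK_HOME=~/bin/spark-3.2.0-bin-hadoop3.2/
export PATH="$SPARK_HOME/bin:$PATH"
echo "SPARK_HOME=$SPARK_HOME"
# Once Spark is actually extracted there, this should print the version banner:
# spark-submit --version
```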
3. Fork the repository on GitHub
   - See how to here: Fork a repo - GitHub Docs
4. Clone your fork

   ```bash
   git clone https://github.com/<your GitHub handle>/SynapseML.git
   ```

   - This command automatically adds your fork as the default remote, called `origin`.
5. Add another Git remote to track the original SynapseML repo. It's recommended to call it `upstream`:

   ```bash
   git remote add upstream https://github.com/microsoft/SynapseML.git
   ```

   - See more about Git remotes here: Git - Working with remotes
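The resulting two-remote layout can be sketched in a throwaway repo; `your-handle` below is a placeholder for your GitHub handle, and the scratch directory is hypothetical:

```shell
# Demonstrate the origin/upstream remote layout in a scratch repository.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q .
# "origin" points at your fork ("your-handle" is a placeholder).
git remote add origin https://github.com/your-handle/SynapseML.git
# "upstream" tracks the original SynapseML repo.
git remote add upstream https://github.com/microsoft/SynapseML.git
git remote -v
```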
6. Go to the directory where you cloned the repo (for instance, `SynapseML`) with `cd SynapseML`.
7. Run sbt to compile and grab datasets:

   ```bash
   sbt setup
   ```
8. Configure IntelliJ
   - Install the Scala plugin during initialization
   - Open the SynapseML directory from IntelliJ
   - If the project doesn't automatically import, click on `build.sbt` and import the project
9. Prepare your Python environment
   - Install Miniconda
     - Note: if you want to run conda commands from IntelliJ, you may need to select the option to add conda to PATH during installation.
   - Create and activate the `synapseml` conda environment by running `conda env create -f environment.yml` from the `synapseml` directory.

   :::note
   If you're using a Windows machine, remove the `horovod` requirement from the environment.yml file, because Horovod installation only supports Linux or macOS. Horovod is used only for the `synapse.ml.dl` namespace.
   :::
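One way to apply the Windows workaround without hand-editing is to filter the `horovod` line into a copy of the file. This is a sketch: the `environment.yml` contents below are a hypothetical stand-in, not the repo's real file, and `environment-win.yml` is an invented name:

```shell
# Sketch: build the conda env from a copy of environment.yml with the
# horovod line filtered out. File contents here are hypothetical.
set -e
cd "$(mktemp -d)"
cat > environment.yml <<'EOF'
name: synapseml
dependencies:
  - python=3.8
  - pip:
      - horovod==0.25.0
      - pyspark==3.2.3
EOF
# Drop any line mentioning horovod into a Windows-friendly copy.
grep -vi horovod environment.yml > environment-win.yml
cat environment-win.yml
# then: conda env create -f environment-win.yml
```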
10. On Windows, install WinUtils
    - Download WinUtils.exe and copy it into the `bin` directory of your Spark installation, e.g. `C:\Users\user\AppData\Local\Spark\spark-3.3.2-bin-hadoop3\bin`
11. Install pre-commit
    - This repository uses the pre-commit tool to manage git hooks and enforce linting/coding styles.
    - The hooks are configured in `.pre-commit-config.yaml`.
    - To use the hooks, run the following commands:

      ```bash
      pip install pre-commit
      pre-commit install
      ```

    - Now `pre-commit` should automatically run on every `git commit` operation to find and fix linting issues.
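For orientation, a pre-commit configuration has the general shape below. This is a hypothetical minimal example, not the repo's actual `.pre-commit-config.yaml`; the real file defines SynapseML's own hook set:

```yaml
# Hypothetical minimal .pre-commit-config.yaml for illustration only.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace   # strip trailing spaces
      - id: end-of-file-fixer     # ensure files end with a newline
```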
:::note
If you'll be regularly contributing to the SynapseML repo, keep your fork synced with the upstream repository. Read this GitHub doc to learn more techniques for doing so.
:::
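The sync workflow can be sketched offline with local repositories standing in for GitHub; `upstream-repo`, `fork`, and the committer identity below are all hypothetical stand-ins:

```shell
# Sketch of syncing a fork, simulated entirely with local repos.
set -e
work=$(mktemp -d)
cd "$work"
# "upstream-repo" stands in for microsoft/SynapseML.
git init -q -b main upstream-repo
git -C upstream-repo -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "initial upstream commit"
# Your clone of the fork.
git clone -q upstream-repo fork
# Upstream moves ahead after you cloned.
git -C upstream-repo -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "new upstream commit"
cd fork
git remote add upstream ../upstream-repo
git fetch -q upstream        # grab the latest upstream history
git merge -q upstream/main   # fast-forward your local main
git log --oneline -n 2
```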
To use secrets in the build, you must be part of the synapseml keyvault and Azure subscription. If you're MSFT internal and would like to be added, reach out to [email protected]
The following `sbt` commands are available when developing SynapseML (run them as `sbt <command>` from the repo root):

| Command | Description |
| --- | --- |
| `compile`, `test:compile`, `it:compile` | Compiles the main, test, and integration test classes respectively |
| `test` | Runs all synapseml tests |
| `scalastyle` | Runs scalastyle check on main |
| `test:scalastyle` | Runs scalastyle check on test |
| `unidoc` | Generates documentation for scala sources |
| `createCondaEnv` | Creates a conda environment `synapseml` from `environment.yml` if it doesn't already exist. This env is used for python testing. Activate this env before using python build commands. |
| `cleanCondaEnv` | Removes the `synapseml` conda env |
| `packagePython` | Compiles scala, runs python generation scripts, and creates a wheel |
| `generatePythonDoc` | Generates documentation for generated python code |
| `installPipPackage` | Installs generated python wheel into existing env |
| `testPython` | Generates and runs python tests |
| `getDatasets` | Downloads all datasets used in tests to target folder |
| `setup` | Combination of `compile`, `test:compile`, `it:compile`, `getDatasets` |
| `package` | Packages the library into a jar |
| `publishBlob` | Publishes jar to SynapseML's Azure blob-based Maven repo. (Requires Keys) |
| `publishLocal` | Publishes library to the local Maven repo |
| `publishDocs` | Publishes scala and python doc to SynapseML's Azure storage account. (Requires Keys) |
| `publishSigned` | Publishes the library to the Sonatype staging repo |
| `sonatypeRelease` | Promotes the published Sonatype artifact |