michelou/spark-examples

Playing with Spark on Windows

This repository gathers Spark code examples coming from various websites and books.
It also includes several build scripts (Bash scripts, batch files, Make scripts) for experimenting with Spark on a Windows machine.

Ada, Akka, C++, COBOL, Dafny, Dart, Deno, Docker, Erlang, Flix, Go, GraalVM, Haskell, Kafka, Kotlin, LLVM, Modula-2, Node.js, Rust, Scala 3, Spring, Standard ML, TruffleSqueak, WiX Toolset and Zig are other topics we are continuously monitoring.

Read the document "What is Apache Spark™?" from the Spark documentation to learn more about the Spark ecosystem.

Project dependencies

This project depends on two external software packages for the Microsoft Windows platform:

Optionally one may also install the following software:

Installation policy
When possible we install software from a Zip archive rather than via a Windows installer. In our case we defined C:\opt\ as the installation directory for optional software tools (similar to the /opt/ directory on Unix).

For instance, our development environment looks as follows (January 2025) [2]:

C:\opt\apache-maven\                       ( 10 MB)
C:\opt\ConEmu\                             ( 26 MB)
C:\opt\Git\                                (391 MB)
C:\opt\gradle\                             (140 MB)
C:\opt\jdk-temurin-11.0.25_9\              (306 MB)
C:\opt\jdk-temurin-17.0.13_11\             (304 MB)
C:\opt\jdk-temurin-21.0.5_11\              (329 MB)
C:\opt\msys64\                             (2.8 GB)
C:\opt\sbt\                                (135 MB)
C:\opt\scala-2.13.15\                      ( 24 MB)
C:\opt\spark-3.5.4-bin-hadoop3\            (423 MB)
C:\opt\spark-3.5.4-bin-hadoop3-scala2.13\  (432 MB)
C:\opt\VSCode\                             (381 MB)

🔎 Git for Windows provides a BASH emulation used to run git from the command line (as well as over 250 Unix commands like awk, diff, file, grep, more, mv, rmdir, sed and wc).

Directory structure

This project has the following directory structure:

bin\
docs\
examples\{README.md, HelloWorld, etc.}
README.md
QUICKREF.md
RESOURCES.md
setenv.bat

where

We also define a virtual drive (e.g. drive K:) in our working environment in order to shorten the real path of our project directory (see article "Windows command prompt limitation" from Microsoft Support).

🔎 We use the Windows external command subst to create virtual drives; for instance:

> subst K: %USERPROFILE%\workspace\spark-examples

In the next section we give a brief description of the batch files present in this project.

Batch/Bash commands

setenv.bat [3]

We execute command setenv.bat once to set up our development environment; it makes external tools such as mvn.cmd, sbt.bat or sh.exe directly available from the command prompt.

> setenv
Tool versions:
   java 11.0.25, sbt 1.10.7, scalac 2.13.15, spark-shell 3.5.4,
   gradle 8.12, mvn 3.9.9, make 4.4.1,
   git 2.47.1, diff 3.10, bash 5.2.37(1)

> where mvn sbt sh
C:\opt\apache-maven\bin\mvn
C:\opt\apache-maven\bin\mvn.cmd
C:\opt\Git\bin\sh.exe
C:\opt\Git\usr\bin\sh.exe
C:\opt\sbt\bin\sbt
C:\opt\sbt\bin\sbt.bat
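
The same check can be run from a Bash prompt (for instance the one shipped with Git for Windows). The following is only a minimal sketch: `check_tools` is a hypothetical helper that is not part of this project, and it relies on the shell builtin `command -v` rather than on the Windows `where` command.

```shell
#!/usr/bin/env bash
# Sketch: report which command-line tools are reachable from PATH.
# check_tools is a hypothetical helper, not part of setenv.bat.

check_tools() {
    local missing=0
    local tool location
    for tool in "$@"; do
        # command -v prints the resolved location and fails if the tool is unknown.
        if location=$(command -v "$tool" 2>/dev/null); then
            printf '%-8s %s\n' "$tool" "$location"
        else
            printf '%-8s (not found)\n' "$tool"
            missing=1
        fi
    done
    # Non-zero status if at least one tool could not be resolved.
    return "$missing"
}

# Same tools as in the "where mvn sbt sh" query above.
check_tools mvn sbt sh || echo "Some tools are missing; run setenv first." >&2
```

Unlike `where`, `command -v` reports only the first match on the PATH, so the duplicate entries listed above (e.g. both `sh.exe` locations) would not appear.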

Footnotes

[1] Scala 2.13 Support

Spark 3.2.0 and newer add support for Scala 2.13 (see PR#34218).

[2] Downloads

In our case we downloaded the following installation files (see section 1):
apache-maven-3.9.9-bin.zip                         ( 10 MB)
ConEmuPack.230724.7z                               (  5 MB)
gradle-8.12-bin.zip                                (118 MB)
msys2-x86_64-20240727.exe                          ( 86 MB)
OpenJDK11U-jdk_x64_windows_hotspot_11.0.25_9.zip   (194 MB)
OpenJDK17U-jdk_x64_windows_hotspot_17.0.13_11.zip  (191 MB)
OpenJDK21U-jdk_x64_windows_hotspot_21.0.5_11.zip   (191 MB)
PortableGit-2.47.1-64-bit.7z.exe                   ( 41 MB)
sbt-1.10.7.zip                                     ( 17 MB)
scala-2.13.15.zip                                  ( 21 MB)
spark-3.5.4-bin-hadoop3.tgz                        (285 MB)
spark-3.5.4-bin-hadoop3-scala2.13.tgz              (292 MB)
VSCode-win32-x64-1.96.2.zip                        (131 MB)
winutils-master.zip                                ( 24 MB)
Note: If they are not yet present, our batch file setenv.bat also installs the winutils tools for Windows to avoid the "no native library" and "access0" errors.
> setenv -verbose
Assign drive J: to path "%USERPROFILE%\workspace-perso\spark-examples"
Download Zip file to directory "%TEMP%"
Uncompress Zip file to directory "%TEMP%"
Copy files from "%TEMP%\winutils-master\hadoop-3.3.6\bin" to directory "C:\opt\spark-3.5.4-bin-hadoop3-scala2.13\bin"
Tool versions:
   java 11.0.25, sbt 1.10.7, scalac 2.13.8, spark-shell 3.5.4,
   gradle 8.12, mvn 3.9.9, make 4.4.1,
   git 2.47.1, diff 3.10, sh 5.2.37(1)
Tool paths:
   [...]
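
The copy step reported in the verbose output above can be sketched in Bash as follows. This is an illustration under stated assumptions, not the logic of setenv.bat: `install_winutils` is a hypothetical function, the download and uncompress steps are left out, and we assume that the presence of `winutils.exe` in the target directory means the tools are already installed.

```shell
#!/usr/bin/env bash
# Sketch: copy the winutils binaries into the Spark bin directory once.
# install_winutils is a hypothetical helper, not part of setenv.bat.

install_winutils() {
    local src_bin=$1    # e.g. the hadoop-3.3.6/bin directory of the unpacked archive
    local dest_bin=$2   # e.g. the bin directory of the Spark installation

    # Assumption: winutils.exe in the target means nothing is left to do.
    if [ -f "$dest_bin/winutils.exe" ]; then
        echo "winutils already installed in $dest_bin"
        return 0
    fi
    if [ ! -f "$src_bin/winutils.exe" ]; then
        echo "Error: winutils.exe not found in $src_bin" >&2
        return 1
    fi
    mkdir -p "$dest_bin"
    cp "$src_bin"/* "$dest_bin"/
    echo "Copied files from $src_bin to $dest_bin"
}
```

From a Git Bash prompt the call might look like `install_winutils "$TEMP/winutils-master/hadoop-3.3.6/bin" "/c/opt/spark-3.5.4-bin-hadoop3-scala2.13/bin"`; the Unix-style paths are assumed equivalents of the Windows paths shown in the log.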

[3] setenv.bat usage

Batch file setenv.bat sets specific environment variables that enable us to use command-line developer tools more easily.
It is similar to the setup scripts described on the page "Visual Studio Developer Command Prompt and Developer PowerShell" of the Visual Studio online documentation.
For instance we can quickly check that the two scripts Launch-VsDevShell.ps1 and VsDevCmd.bat are indeed available in our Visual Studio 2019 installation:
> where /r "C:\Program Files (x86)\Microsoft Visual Studio" *vsdev*
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\Launch-VsDevShell.ps1
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\VsDevCmd.bat
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\vsdevcmd\core\vsdevcmd_end.bat
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\Common7\Tools\vsdevcmd\core\vsdevcmd_start.bat
Concretely, in our GitHub projects which depend on Visual Studio (e.g. michelou/cpp-examples), setenv.bat invokes VsDevCmd.bat (resp. vcvarsall.bat for older Visual Studio versions) to set up the Visual Studio tools on the command prompt.

mics/January 2025  
