diff --git a/README.md b/README.md
index bb276cefa6..8fcd639473 100644
--- a/README.md
+++ b/README.md
@@ -29,24 +29,41 @@ For more information, see our [paper](http://www.arxiv.org/abs/1902.10345).
See an example SDFG [in the standalone viewer (SDFV)](https://spcl.github.io/dace/sdfv.html?url=https://spcl.github.io/dace/examples/gemm.sdfg).
-Tutorials
----------
+Quick Start
+-----------
+
+Install DaCe with pip: `pip install dace`
+
+Using DaCe in Python is as simple as adding a `@dace` decorator:
+```python
+import dace
+import numpy as np
+
+@dace
+def myprogram(a):
+    for i in range(a.shape[0]):
+        a[i] += i
+    return np.sum(a)
+```
+
+Calling `myprogram` with any NumPy array or `__{cuda_}array_interface__`-supporting tensor (e.g., PyTorch, Numba) will generate data-centric code, compile it, and run it. From there, you can _optimize_ (interactively or automatically), _instrument_, and _distribute_ your code. Compilation produces a shared library (DLL/SO file) that can readily be used from any C ABI-compatible language (C/C++, FORTRAN, etc.).
+
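To sanity-check what the compiled program returns, it can help to compare against a plain NumPy version. The sketch below is an illustration only: `reference` is a hypothetical helper (not part of DaCe), and it assumes a one-dimensional floating-point input, as in the snippet above.

```python
import numpy as np

def reference(a):
    # Same effect as the loop in `myprogram`: add each index to its element, in place
    a += np.arange(a.shape[0], dtype=a.dtype)
    # Then reduce the updated array to a single value
    return np.sum(a)

a = np.ones(4)
result = reference(a)
print(result)  # 1 + 2 + 3 + 4 → 10.0
```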
+For more information on how to use DaCe, see the [samples](samples) or tutorials below:
* [Getting Started](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/getting_started.ipynb)
+* [Benchmarks, Instrumentation, and Performance Comparison with Other Python Compilers](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/benchmarking.ipynb)
* [Explicit Dataflow in Python](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/explicit.ipynb)
* [NumPy API Reference](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/numpy_frontend.ipynb)
* [SDFG API](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/sdfg_api.ipynb)
* [Using and Creating Transformations](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/transformations.ipynb)
* [Extending the Code Generator](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/codegen.ipynb)
-Installation and Dependencies
------------------------------
-
-To install: `pip install dace`
+Dependencies
+------------
Runtime dependencies:
* A C++14-capable compiler (e.g., gcc 5.3+)
- * Python 3.6 or newer
+ * Python 3.7 or newer (Python 3.6 is supported but not actively tested)
* CMake 3.15 or newer
Running
diff --git a/tutorials/benchmarking.ipynb b/tutorials/benchmarking.ipynb
new file mode 100644
index 0000000000..4e7e048bca
--- /dev/null
+++ b/tutorials/benchmarking.ipynb
@@ -0,0 +1,1511 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "6d9dce90",
+ "metadata": {},
+ "source": [
+ "# Benchmarking\n",
+ "\n",
+ "In this tutorial we will compare DaCe with other popular Python-accelerating libraries. The NumPy results may be somewhat faster than those shown here if an optimized build is installed (for example, one compiled with Intel MKL).\n",
+ "\n",
+ "Table of Contents:\n",
+ "* [Dependencies](#Dependencies)\n",
+ "* [Simple programs](#Simple-programs-with-multiple-operators)\n",
+ "* [Loops](#Loops)\n",
+ " * [Varying sizes](#Varying-sizes)\n",
+ "* [Auto-parallelization](#Auto-parallelization)\n",
+ "* [Example: 3D Heat Diffusion](#3D-Heat-Diffusion)\n",
+ "* [Benchmarking and Instrumentation API](#Benchmarking-and-Instrumentation-API)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d0a07e84",
+ "metadata": {},
+ "source": [
+ "TL;DR DaCe is fast:\n",
+ "\n",
+ "![performance](performance.png \"performance\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "93f7d90c",
+ "metadata": {},
+ "source": [
+ "## Dependencies\n",
+ "\n",
+ "First, let's make sure we have all the frameworks ready to go:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "dbb480ef",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "...\n"
+ ]
+ }
+ ],
+ "source": [
+ "%pip install jax jaxlib\n",
+ "%pip install numba\n",
+ "%pip install pythran\n",
+ "# Your library here"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "b2b10c42",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "...\n"
+ ]
+ }
+ ],
+ "source": [
+ "# MKL for performance\n",
+ "%conda install mkl mkl-include mkl-devel\n",
+ "\n",
+ "# matplotlib to draw the results\n",
+ "%pip install matplotlib"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "927781f2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Setup code for plotting\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "def barplot(title, labels=False):\n",
+ "    x = ['numpy'] + list(sorted(TIMES.keys() - {'numpy'}))\n",
+ "    bars = [np.median(TIMES[key].timings) for key in x]\n",
+ "    yerr = [np.std(TIMES[key].timings) for key in x]\n",
+ "    color = [('#86add9' if 'dace' in key else 'salmon') for key in x]\n",
+ "\n",
+ "    p = plt.bar(x, bars, yerr=yerr, color=color)\n",
+ "    plt.ylabel('Runtime [s]'); plt.xlabel('Implementation'); plt.title(title)\n",
+ "    if labels:\n",
+ "        plt.gca().bar_label(p)\n",
+ "    pass"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "317721fd",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Setup code for benchmarked frameworks\n",
+ "import numpy as np\n",
+ "import jax\n",
+ "import numba\n",
+ "import dace"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "46238b6e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Pythran loads in a separate cell\n",
+ "%load_ext pythran.magic"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0e44fb94",
+ "metadata": {},
+ "source": [
+ "## Simple programs with multiple operators\n",
+ "\n",
+ "Let's start with a basic program with three different operations. This example program was taken from the [JAX README](https://github.com/google/jax#compilation-with-jit):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "d9828ae7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def slow_f(x):\n",
+ "    return x * x + x * 2.0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "74047bc8",
+ "metadata": {},
+ "source": [
+ "First, let's measure the performance of NumPy as-is on this function:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "afe8910d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "68.6 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "a = np.random.rand(5000, 5000)\n",
+ "\n",
+ "TIMES = {}\n",
+ "\n",
+ "TIMES['numpy'] = %timeit -o slow_f(a)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0d5afed2",
+ "metadata": {},
+ "source": [
+ "Now we can construct Just-In-Time (JIT) compiled versions of this function, for each framework:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "f66e04d1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "jax_f = jax.jit(slow_f)\n",
+ "numba_f = numba.jit(slow_f)\n",
+ "dace_f = dace.program(auto_optimize=True)(slow_f)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "8b6f4f7b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%pythran\n",
+ "#pythran export pythran_f(float64[:,:])\n",
+ "def pythran_f(x):\n",
+ "    return x * x + x * 2.0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "de148d29",
+ "metadata": {},
+ "source": [
+ "Before measuring, we run each function once as a warm-up, so that JIT compilation happens outside the timed runs:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "99491394",
+ "metadata": {
+ "scrolled": false
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1.29 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
+ "323 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
+ "1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# On your marks...\n",
+ "%timeit -r 1 -n 1 jax_f(a).block_until_ready()\n",
+ "%timeit -r 1 -n 1 numba_f(a)\n",
+ "%timeit -r 1 -n 1 dace_f(a)\n",
+ "%timeit -r 1 -n 1 pythran_f(a)\n",
+ "pass\n",
+ "# ...get set..."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "067febc9",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "43.6 ms ± 4.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# ...Go!\n",
+ "TIMES['jax'] = %timeit -o jax_f(a).block_until_ready()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "e7f811ff",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "27.8 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['numba'] = %timeit -o numba_f(a)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "e6d98ce6",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "31.3 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['pythran'] = %timeit -o pythran_f(a)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "9db35692",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "25.7 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['dace_jit'] = %timeit -o dace_f(a)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4bb9be46",
+ "metadata": {},
+ "source": [
+ "You can also precompile the program for faster runtimes (be aware that the compiled program reuses its return-value buffer, so the returned array is overwritten by subsequent calls!):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "a3c0702e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Either provide type annotations on the `@dace.program`, or call `compile` with sample arguments\n",
+ "cprog = dace_f.compile(a)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "d0754f47",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "21.5 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['dace'] = %timeit -o cprog(a)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78d0830a",
+ "metadata": {},
+ "source": [
+ "We can now plot the results:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "01ae5917",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "barplot('Simple program, multiple operators')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "16be4882",
+ "metadata": {},
+ "source": [
+ "## Loops\n",
+ "\n",
+ "Here we test how well the Python-compiling frameworks mitigate interpreter overhead. Let's take another example, this time from Numba's [5 minute guide](https://numba.readthedocs.io/en/stable/user/5minguide.html):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "c7134a92",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def go_fast(a):\n",
+ "    trace = 0.0\n",
+ "    for i in range(a.shape[0]):\n",
+ "        trace += np.tanh(a[i, i])\n",
+ "    return a + trace"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "844c1c84",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "b = np.random.rand(1000, 1000)\n",
+ "\n",
+ "TIMES = {}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "69ef66f2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1.94 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['numpy'] = %timeit -o go_fast(b)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "1b6aef84",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "numba_fast = numba.jit(go_fast)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "e74804c4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import jax.numpy as jnp\n",
+ "\n",
+ "@jax.jit\n",
+ "def jax_fast(a):\n",
+ "    trace = 0.0\n",
+ "    for i in range(a.shape[0]):\n",
+ "        trace += jnp.tanh(a[i, i])\n",
+ "    return a + trace"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "f88a24c6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "N = dace.symbol('N')\n",
+ "\n",
+ "@dace.program(auto_optimize=True)\n",
+ "def dace_fast(a: dace.float64[N, N]):\n",
+ "    trace = 0.0\n",
+ "    for i in range(N):\n",
+ "        trace += np.tanh(a[i, i])\n",
+ "    return a + trace"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "e6f18b89",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "%%pythran\n",
+ "from numpy import tanh\n",
+ "\n",
+ "#pythran export pythran_fast(float64[:,:])\n",
+ "def pythran_fast(a):\n",
+ "    trace = 0.0\n",
+ "    for i in range(a.shape[0]):\n",
+ "        trace += tanh(a[i, i])\n",
+ "    return a + trace"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "e7e5ab60",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "DaCe compilation time: 0.5581727027893066 seconds\n"
+ ]
+ }
+ ],
+ "source": [
+ "import time\n",
+ "start = time.time()\n",
+ "csdfg = dace_fast.compile(b)\n",
+ "print('DaCe compilation time:', time.time() - start, 'seconds')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "a67d01ac",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "11.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
+ "147 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "%timeit -r 1 -n 1 jax_fast(b).block_until_ready()\n",
+ "%timeit -r 1 -n 1 numba_fast(b)\n",
+ "%timeit -r 1 -n 1 pythran_fast(b)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7e0ffab9",
+ "metadata": {},
+ "source": [
+ "Note that the slow first JAX run is due to its inspector/executor model: the Python loop is unrolled while the function is traced, so compilation time depends on the size of the array."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "97657722",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2.28 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['jax'] = %timeit -o jax_fast(b).block_until_ready()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "6696626d",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "970 µs ± 130 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['numba'] = %timeit -o numba_fast(b)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "98a80c82",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "673 µs ± 54.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['pythran'] = %timeit -o pythran_fast(b)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "7a741c90",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "668 µs ± 56.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['dace'] = %timeit -o csdfg(b, N=b.shape[0])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "fc4c6fb2",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "barplot('Loops')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e888f1f6",
+ "metadata": {},
+ "source": [
+ "### Varying sizes\n",
+ "\n",
+ "Since the DaCe program was defined symbolically, the input array size can be changed without recompilation:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "fbdc52c3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sizes = [np.random.randint(700, 5000) for _ in range(10)]\n",
+ "arrays = [np.random.rand(n, n) for n in sizes]\n",
+ "\n",
+ "def vary_size(call):\n",
+ "    for a in arrays:\n",
+ "        call(a)\n",
+ "\n",
+ "def vary_size_dace(call):\n",
+ "    for a, n in zip(arrays, sizes):\n",
+ "        call(a, N=n)\n",
+ " \n",
+ "def vary_size_jax(call):\n",
+ "    for a in arrays:\n",
+ "        call(a).block_until_ready()\n",
+ " \n",
+ "TIMES = {}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "id": "2aa26e86",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "155 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
+ "125 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
+ "124 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
+ "114 ms ± 8.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n",
+ "334 ms ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+ ]
+ }
+ ],
+ "source": [
+ "TIMES['numpy'] = %timeit -o vary_size(go_fast)\n",
+ "TIMES['numba'] = %timeit -o vary_size(numba_fast)\n",
+ "TIMES['pythran'] = %timeit -o vary_size(pythran_fast)\n",
+ "TIMES['dace'] = %timeit -o vary_size_dace(csdfg)\n",
+ "TIMES['jax'] = %timeit -o vary_size_jax(jax_fast)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "id": "144b470a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "barplot('Loop - Varying sizes')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "16405894",
+ "metadata": {},
+ "source": [
+ "## Auto-parallelization\n",
+ "\n",
+ "DaCe can use data-centric dependency analysis to not only track and reduce data movement, but also automatically extract parallel regions in code. Here we look at a simple program and how it is run in parallel. We use the `auto_optimize` flag in the `dace.program` decorator to automatically apply optimization heuristics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "id": "eb5b28ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def element_update(a):\n",
+ "    return a * 5\n",
+ "\n",
+ "def someforloop(A):\n",
+ "    for i in range(A.shape[0]):\n",
+ "        for j in range(A.shape[1]):\n",
+ "            A[i, j] = element_update(A[i, j])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "id": "d80217b2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "a = np.random.rand(1000, 1000)\n",
+ "daceloop = dace.program(auto_optimize=True)(someforloop)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f2ba2545",
+ "metadata": {},
+ "source": [
+ "Here it is compared with NumPy and with Numba's similar capability:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "id": "8420d1f0",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "446 ms ± 41.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ ":4: NumbaWarning: \u001b[1m\n",
+ "Compilation is falling back to object mode WITH looplifting enabled because Function \"someforloop\" failed type inference due to: \u001b[1mUntyped global name 'element_update':\u001b[0m \u001b[1m\u001b[1mCannot determine Numba type of \u001b[0m\n",
+ "\u001b[1m\n",
+ "File \"\", line 7:\u001b[0m\n",
+ "\u001b[1mdef someforloop(A):\n",
+ "