diff --git a/.github/workflows/dev.yml b/.github/workflows/dev.yml
index 8bb35f1ef871..39c449c50a8e 100644
--- a/.github/workflows/dev.yml
+++ b/.github/workflows/dev.yml
@@ -64,7 +64,7 @@ jobs:
           # if you encounter error, try rerun the command below with --write instead of --check
           # and commit the changes
           npx prettier@2.3.2 --check \
-            {ballista,datafusion,datafusion-examples,docs,python}/**/*.md \
+            '{ballista,datafusion,datafusion-examples,docs,python}/**/*.md' \
             README.md \
             DEVELOPERS.md \
-            ballista/**/*.{ts,tsx}
+            'ballista/**/*.{ts,tsx}'
diff --git a/ballista/README.md b/ballista/README.md
index 0a8db63a1a6c..eeb4273ee893 100644
--- a/ballista/README.md
+++ b/ballista/README.md
@@ -19,8 +19,8 @@

 # Ballista: Distributed Compute with Apache Arrow and DataFusion

-Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
-DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
+Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow and
+DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and
 Java) to be supported as first-class citizens without paying a penalty for serialization costs.

 The foundational technologies in Ballista are:
@@ -37,23 +37,23 @@ redundancy in the case of a scheduler failing.

 # Getting Started

-Fully working examples are available. Refer to the [Ballista Examples README](../ballista-examples/README.md) for
+Fully working examples are available. Refer to the [Ballista Examples README](../ballista-examples/README.md) for
 more information.
 ## Distributed Scheduler Overview

-Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
+Ballista uses the DataFusion query execution framework to create a physical plan and then transforms it into a
 distributed physical plan by breaking the query down into stages whenever the partitioning scheme changes.
-Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
+Specifically, any `RepartitionExec` operator is replaced with an `UnresolvedShuffleExec` and the child operator
 of the repartition operator is wrapped in a `ShuffleWriterExec` operator and scheduled for execution.

-Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
-and each task represents one *input* partition that will be executed. The resulting batches are repartitioned
-according to the shuffle partitioning scheme and each *output* partition is streamed to disk in Arrow IPC format.
+Each executor polls the scheduler for the next task to run. Tasks are currently always `ShuffleWriterExec` operators
+and each task represents one _input_ partition that will be executed. The resulting batches are repartitioned
+according to the shuffle partitioning scheme and each _output_ partition is streamed to disk in Arrow IPC format.

-The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
-tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
+The scheduler will replace `UnresolvedShuffleExec` operators with `ShuffleReaderExec` operators once all shuffle
+tasks have completed. The `ShuffleReaderExec` operator connects to other executors as required using the Flight
 interface, and streams the shuffle IPC files.

 # How does this compare to Apache Spark?
diff --git a/docs/user-guide/src/distributed/docker-compose.md b/docs/user-guide/src/distributed/docker-compose.md
index 14989e58034d..9ada1baa11a9 100644
--- a/docs/user-guide/src/distributed/docker-compose.md
+++ b/docs/user-guide/src/distributed/docker-compose.md
@@ -24,7 +24,7 @@ demonstrates how to start a cluster using a single process that acts as both a s
 volume mounted into the container so that Ballista can access the host file system.

 ```yaml
-version: '2.2'
+version: "2.2"
 services:
   etcd:
     image: quay.io/coreos/etcd:v3.4.9
diff --git a/docs/user-guide/src/distributed/kubernetes.md b/docs/user-guide/src/distributed/kubernetes.md
index 4b80d1731943..ef4accaf3799 100644
--- a/docs/user-guide/src/distributed/kubernetes.md
+++ b/docs/user-guide/src/distributed/kubernetes.md
@@ -129,16 +129,16 @@ spec:
         ballista-cluster: ballista
     spec:
       containers:
-      - name: ballista-scheduler
-        image:
-        command: ["/scheduler"]
-        args: ["--bind-port=50050"]
-        ports:
-          - containerPort: 50050
-            name: flight
-        volumeMounts:
-          - mountPath: /mnt
-            name: data
+        - name: ballista-scheduler
+          image:
+          command: ["/scheduler"]
+          args: ["--bind-port=50050"]
+          ports:
+            - containerPort: 50050
+              name: flight
+          volumeMounts:
+            - mountPath: /mnt
+              name: data
       volumes:
         - name: data
           persistentVolumeClaim:
@@ -245,10 +245,10 @@ spec:
   minReplicaCount: 0
   maxReplicaCount: 5
   triggers:
-  - type: external
-    metadata:
-      # Change this DNS if the scheduler isn't deployed in the "default" namespace
-      scalerAddress: ballista-scheduler.default.svc.cluster.local:50050
+    - type: external
+      metadata:
+        # Change this DNS if the scheduler isn't deployed in the "default" namespace
+        scalerAddress: ballista-scheduler.default.svc.cluster.local:50050
 ```

 And then deploy it into the cluster:
@@ -261,4 +261,4 @@ If the cluster is inactive, Keda will now scale the number of executors down to
 you launch a query. Please note that Keda will perform a scan once every 30 seconds, so it might take a bit to scale
 the executors.
-Please visit Keda's [documentation page](https://keda.sh/docs/2.3/concepts/scaling-deployments/) for more information.
\ No newline at end of file
+Please visit Keda's [documentation page](https://keda.sh/docs/2.3/concepts/scaling-deployments/) for more information.