Overview
Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built on an architecture that allows other programming languages to be supported as first-class citizens without paying a penalty for serialization costs.
The foundational technologies in Ballista are:
- Apache Arrow memory model and compute kernels for efficient processing of data.
- Apache Arrow Flight Protocol for efficient data transfer between processes.
- Google Protocol Buffers for serializing query plans.
- Docker for packaging up executors along with user-defined code.
Architecture
The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.
How does this compare to Apache Spark?
Although Ballista is largely inspired by Apache Spark, there are some key differences.
- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
Status
Ballista is at the proof-of-concept phase currently but is under active development by a growing community.
Deployment
Ballista is packaged as Docker images. Refer to the following guides to create a Ballista cluster:
- Create a cluster using Docker
- Create a cluster using Docker Compose
- Create a cluster using Kubernetes
Deploying a standalone Ballista cluster
Start a Scheduler
Start a scheduler using the following syntax:
docker run --network=host \
-d ballistacompute/ballista-rust:0.4.1 \
/scheduler --port 50050
Run docker ps to check that the process is running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
59452ce72138 ballistacompute/ballista-rust:0.4.1 "/scheduler --port 5…" 6 seconds ago Up 5 seconds affectionate_hofstadter
Run docker logs CONTAINER_ID to check the output from the process:
$ docker logs 59452ce72138
[2021-02-14T18:32:20Z INFO scheduler] Ballista v0.4.1 Scheduler listening on 0.0.0.0:50050
Start executors
Start one or more executor processes. Each executor process will need to listen on a different port.
docker run --network=host \
-d ballistacompute/ballista-rust:0.4.1 \
/executor --external-host localhost --port 50051
Use docker ps to check that both the scheduler and executor(s) are now running:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0746ce262a19 ballistacompute/ballista-rust:0.4.1 "/executor --externa…" 2 seconds ago Up 1 second naughty_mclean
59452ce72138 ballistacompute/ballista-rust:0.4.1 "/scheduler --port 5…" 4 minutes ago Up 4 minutes affectionate_hofstadter
Use docker logs CONTAINER_ID to check the output from the executor(s):
$ docker logs 0746ce262a19
[2021-02-14T18:36:25Z INFO executor] Running with config: ExecutorConfig { host: "localhost", port: 50051, work_dir: "/tmp/.tmpVRFSvn", concurrent_tasks: 4 }
[2021-02-14T18:36:25Z INFO executor] Ballista v0.4.1 Rust Executor listening on 0.0.0.0:50051
[2021-02-14T18:36:25Z INFO executor] Starting registration with scheduler
The external host and port will be registered with the scheduler. The executors will discover other executors by requesting a list of executors from the scheduler.
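The registration and discovery flow described above can be sketched with a toy in-memory scheduler. This is an illustrative model only, not Ballista's actual implementation; the class and method names here are hypothetical:

```python
# Illustrative model of executor registration/discovery.
# NOT Ballista's actual code; all names are hypothetical.

class ToyScheduler:
    """Keeps a registry of executors, keyed by executor id."""

    def __init__(self):
        self.executors = {}

    def register_executor(self, executor_id, host, port):
        # An executor registers its *external* host and port so that
        # other executors can reach it over the network.
        self.executors[executor_id] = (host, port)

    def list_executors(self):
        # Executors discover their peers by asking the scheduler
        # for the full list of registered executors.
        return list(self.executors.values())


scheduler = ToyScheduler()
scheduler.register_executor("exec-1", "localhost", 50051)
scheduler.register_executor("exec-2", "localhost", 50052)
print(scheduler.list_executors())
```

Note that this is why each executor must be started with an `--external-host` that other nodes can resolve: the scheduler hands that address out verbatim to peers.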
Using etcd as backing store
NOTE: This functionality is currently experimental
Ballista can optionally use etcd as a backing store for the scheduler.
docker run --network=host \
-d ballistacompute/ballista-rust:0.4.1 \
/scheduler --port 50050 \
--config-backend etcd \
--etcd-urls etcd:2379
Please refer to the etcd web site for installation instructions. Etcd version 3.4.9 or later is recommended.
Installing Ballista with Docker Compose
Docker Compose is a convenient way to launch a cluster when testing locally. The following Docker Compose example demonstrates how to start a cluster using a single process that acts as both a scheduler and an executor, with a data volume mounted into the container so that Ballista can access the host file system.
version: '2.0'
services:
etcd:
image: quay.io/coreos/etcd:v3.4.9
command: "etcd -advertise-client-urls http://etcd:2379 -listen-client-urls http://0.0.0.0:2379"
ports:
- "2379:2379"
ballista-executor:
image: ballistacompute/ballista-rust:0.4.1
command: "/executor --bind-host 0.0.0.0 --port 50051 --local"
environment:
- RUST_LOG=info
ports:
- "50050:50050"
- "50051:50051"
volumes:
- ./data:/data
With the above content saved to a docker-compose.yaml file, the following command can be used to start the single node cluster.
docker-compose up
The scheduler listens on port 50050 and this is the port that clients will need to connect to.
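Before pointing a client at the cluster, it can be useful to verify that the scheduler port is actually reachable. The following is a generic TCP reachability check, not part of Ballista:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example: check the scheduler port before connecting a client.
# port_is_open("localhost", 50050)
```

If this returns False, check that the container is running and that port 50050 is published to the host.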
Deploying Ballista with Kubernetes
Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that you are already comfortable with managing Kubernetes deployments.
The k8s deployment consists of:
- k8s stateful set for one or more scheduler processes
- k8s stateful set for one or more executor processes
- k8s service to route traffic to the schedulers
- k8s persistent volume and persistent volume claims to make local data accessible to Ballista
Limitations
Ballista is at an early stage of development and therefore has some significant limitations:
- There is no support for shared object stores such as S3. All data must exist locally on each node in the cluster, including where any client process runs (until #473 is resolved).
- Only a single scheduler instance is currently supported unless the scheduler is configured to use etcd as a backing store.
Create Persistent Volume and Persistent Volume Claim
Copy the following yaml to a pv.yaml file and apply it to the cluster to create a persistent volume and a persistent volume claim so that the specified host directory is available to the containers. This is where any data should be located so that Ballista can execute queries against it.
apiVersion: v1
kind: PersistentVolume
metadata:
name: data-pv
labels:
type: local
spec:
storageClassName: manual
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/mnt"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: data-pv-claim
spec:
storageClassName: manual
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 3Gi
To apply this yaml:
kubectl apply -f pv.yaml
You should see the following output:
persistentvolume/data-pv created
persistentvolumeclaim/data-pv-claim created
Deploying Ballista Scheduler and Executors
Copy the following yaml to a cluster.yaml file.
apiVersion: v1
kind: Service
metadata:
name: ballista-scheduler
labels:
app: ballista-scheduler
spec:
ports:
- port: 50050
name: scheduler
clusterIP: None
selector:
app: ballista-scheduler
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: ballista-scheduler
spec:
serviceName: "ballista-scheduler"
replicas: 1
selector:
matchLabels:
app: ballista-scheduler
template:
metadata:
labels:
app: ballista-scheduler
ballista-cluster: ballista
spec:
containers:
- name: ballista-scheduler
image: ballistacompute/ballista-rust:0.4.1
command: ["/scheduler"]
args: ["--port=50050"]
ports:
- containerPort: 50050
name: flight
volumeMounts:
- mountPath: /mnt
name: data
volumes:
- name: data
persistentVolumeClaim:
claimName: data-pv-claim
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: ballista-executor
spec:
serviceName: "ballista-scheduler"
replicas: 2
selector:
matchLabels:
app: ballista-executor
template:
metadata:
labels:
app: ballista-executor
ballista-cluster: ballista
spec:
containers:
- name: ballista-executor
image: ballistacompute/ballista-rust:0.4.1
command: ["/executor"]
args: ["--port=50051", "--scheduler-host=ballista-scheduler", "--scheduler-port=50050", "--external-host=$(MY_POD_IP)"]
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
ports:
- containerPort: 50051
name: flight
volumeMounts:
- mountPath: /mnt
name: data
volumes:
- name: data
persistentVolumeClaim:
claimName: data-pv-claim
To apply this yaml:
$ kubectl apply -f cluster.yaml
You should see the following output:
service/ballista-scheduler created
statefulset.apps/ballista-scheduler created
statefulset.apps/ballista-executor created
You can also check status by running kubectl get pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
busybox 1/1 Running 0 16m
ballista-scheduler-0 1/1 Running 0 42s
ballista-executor-0 1/1 Running 2 42s
ballista-executor-1 1/1 Running 0 26s
You can view the scheduler logs with kubectl logs ballista-scheduler-0:
$ kubectl logs ballista-scheduler-0
[2021-02-19T00:24:01Z INFO scheduler] Ballista v0.4.1 Scheduler listening on 0.0.0.0:50050
[2021-02-19T00:24:16Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
[2021-02-19T00:24:17Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
Deleting the Ballista cluster
Run the following kubectl command to delete the cluster.
kubectl delete -f cluster.yaml
Configuration
The Rust executor and scheduler can be configured using toml files, environment variables, and command line arguments. The specification for config options can be found in rust/ballista/src/bin/[executor|scheduler]_config_spec.toml.
Those files fully define Ballista's configuration. If there is a discrepancy between this documentation and the files, assume those files are correct.
To get a list of command line arguments, run the binary with --help.
There is an example config file at ballista/rust/ballista/examples/example_executor_config.toml.
The order of precedence for arguments is: default config file < environment variables < specified config file < command line arguments.
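This precedence rule can be modeled with Python's ChainMap, where the first map containing a key wins, so the highest-precedence source goes first. The key names and values below are made up for illustration; this is a sketch of the rule, not Ballista's config loader:

```python
from collections import ChainMap

# Hypothetical config values from each source, lowest to highest precedence.
default_config_file = {"port": 50050, "bind_host": "0.0.0.0"}
environment_vars = {"port": 50055}
specified_config_file = {"bind_host": "127.0.0.1"}
command_line_args = {"port": 50060}

# ChainMap resolves lookups left to right, so the highest-precedence
# source (command line arguments) is listed first.
effective = ChainMap(
    command_line_args,
    specified_config_file,
    environment_vars,
    default_config_file,
)

print(effective["port"])       # command line arguments win
print(effective["bind_host"])  # falls back to the specified config file
```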
The executor and scheduler will look for the default config file at /etc/ballista/[executor|scheduler].toml.
To specify a config file, use the --config-file argument.
Environment variables are prefixed by BALLISTA_EXECUTOR or BALLISTA_SCHEDULER for the executor and scheduler respectively. Hyphens in command line arguments become underscores. For example, the --scheduler-host argument for the executor becomes BALLISTA_EXECUTOR_SCHEDULER_HOST.
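The flag-to-environment-variable mapping can be expressed as a small helper. This function is illustrative only and not part of Ballista:

```python
def flag_to_env_var(flag: str, process: str) -> str:
    """Map a command line flag to its Ballista environment variable name.

    Hyphens become underscores, the name is upper-cased, and a
    BALLISTA_EXECUTOR_ or BALLISTA_SCHEDULER_ prefix is added.
    """
    name = flag.lstrip("-").replace("-", "_").upper()
    return f"BALLISTA_{process.upper()}_{name}"


print(flag_to_env_var("--scheduler-host", "executor"))
# -> BALLISTA_EXECUTOR_SCHEDULER_HOST
```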
Clients
Ballista Rust Client
The Rust client supports a DataFrame API as well as SQL. See the TPC-H Benchmark Client for an example.
Ballista Python Client
Python bindings are available. See this example for more information.
Frequently Asked Questions
What is the relationship between Apache Arrow, DataFusion, and Ballista?
Apache Arrow is a library which provides a standardized memory representation for columnar data. It also provides "kernels" for performing common operations on this data.
DataFusion is a library for executing queries in-process using the Apache Arrow memory model and computational kernels. It is designed to run within a single process, using threads for parallel query execution.
Ballista is a distributed compute platform designed to leverage DataFusion and other query execution libraries.