Overview

Ballista is a distributed compute platform primarily implemented in Rust and powered by Apache Arrow. It is built on an architecture that allows other programming languages to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

  • Apache Arrow memory model and compute kernels for efficient processing of data.
  • Apache Arrow Flight Protocol for efficient data transfer between processes.
  • Google Protocol Buffers for serializing query plans.
  • Docker for packaging up executors along with user-defined code.

Architecture

The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.

Ballista Architecture Diagram

How does this compare to Apache Spark?

Although Ballista is largely inspired by Apache Spark, there are some key differences.

  • The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
  • Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
  • The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
  • The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.

Status

Ballista is currently at the proof-of-concept phase but is under active development by a growing community.

Deployment

Ballista is packaged as Docker images. Refer to the following guides to create a Ballista cluster:

Deploying a standalone Ballista cluster

Start a Scheduler

Start a scheduler using the following syntax:

docker run --network=host \
  -d ballistacompute/ballista-rust:0.4.1 \
  /scheduler --port 50050

Run docker ps to check that the process is running:

$ docker ps
CONTAINER ID   IMAGE                                 COMMAND                  CREATED         STATUS         PORTS     NAMES
59452ce72138   ballistacompute/ballista-rust:0.4.1   "/scheduler --port 5…"   6 seconds ago   Up 5 seconds             affectionate_hofstadter

Run docker logs CONTAINER_ID to check the output from the process:

$ docker logs 59452ce72138
[2021-02-14T18:32:20Z INFO  scheduler] Ballista v0.4.1 Scheduler listening on 0.0.0.0:50050

Start executors

Start one or more executor processes. Each executor process will need to listen on a different port.

docker run --network=host \
  -d ballistacompute/ballista-rust:0.4.1 \
  /executor --external-host localhost --port 50051 
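
If you run more than one executor on the same host, give each executor its own port. For example, a second executor could be started as follows (a sketch reusing the same flags as above; port 50052 is an arbitrary free port):

docker run --network=host \
  -d ballistacompute/ballista-rust:0.4.1 \
  /executor --external-host localhost --port 50052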

Use docker ps to check that both the scheduler and executor(s) are now running:

$ docker ps
CONTAINER ID   IMAGE                                 COMMAND                  CREATED         STATUS         PORTS     NAMES
0746ce262a19   ballistacompute/ballista-rust:0.4.1   "/executor --externa…"   2 seconds ago   Up 1 second              naughty_mclean
59452ce72138   ballistacompute/ballista-rust:0.4.1   "/scheduler --port 5…"   4 minutes ago   Up 4 minutes             affectionate_hofstadter

Use docker logs CONTAINER_ID to check the output from the executor(s):

$ docker logs 0746ce262a19
[2021-02-14T18:36:25Z INFO  executor] Running with config: ExecutorConfig { host: "localhost", port: 50051, work_dir: "/tmp/.tmpVRFSvn", concurrent_tasks: 4 }
[2021-02-14T18:36:25Z INFO  executor] Ballista v0.4.1 Rust Executor listening on 0.0.0.0:50051
[2021-02-14T18:36:25Z INFO  executor] Starting registration with scheduler

The external host and port will be registered with the scheduler. The executors will discover other executors by requesting a list of executors from the scheduler.

Using etcd as backing store

NOTE: This functionality is currently experimental.

Ballista can optionally use etcd as a backing store for the scheduler.

docker run --network=host \
  -d ballistacompute/ballista-rust:0.4.1 \
  /scheduler --port 50050 \
  --config-backend etcd \
  --etcd-urls etcd:2379

Please refer to the etcd web site for installation instructions. Etcd version 3.4.9 or later is recommended.
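
For local testing, etcd itself can be run as a Docker container. The following is a sketch based on the same etcd image and flags used in the Docker Compose example later in this guide; adjust the advertised client URL to match your network:

docker run --network=host \
  -d quay.io/coreos/etcd:v3.4.9 \
  etcd -advertise-client-urls http://localhost:2379 \
  -listen-client-urls http://0.0.0.0:2379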

Installing Ballista with Docker Compose

Docker Compose is a convenient way to launch a cluster when testing locally. The following Docker Compose example demonstrates how to start a cluster consisting of an etcd instance and a single Ballista process that acts as both a scheduler and an executor, with a data volume mounted into the container so that Ballista can access the host file system.

version: '2.0'
services:
  etcd:
    image: quay.io/coreos/etcd:v3.4.9
    command: "etcd -advertise-client-urls http://etcd:2379 -listen-client-urls http://0.0.0.0:2379"
    ports:
      - "2379:2379"
  ballista-executor:
    image: ballistacompute/ballista-rust:0.4.1
    command: "/executor --bind-host 0.0.0.0 --port 50051 --local"
    environment:
      - RUST_LOG=info
    ports:
      - "50050:50050"
      - "50051:50051"
    volumes:
      - ./data:/data


With the above content saved to a docker-compose.yaml file, the following command can be used to start the single-node cluster.

docker-compose up

The scheduler listens on port 50050 and this is the port that clients will need to connect to.
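
Standard Docker Compose commands can be used to check on the running cluster (a sketch; the service name matches the compose file above):

# list the running services
docker-compose ps

# tail the logs for the combined scheduler/executor process
docker-compose logs ballista-executor

# stop the cluster when finished
docker-compose down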

Deploying Ballista with Kubernetes

Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that you are already comfortable with managing Kubernetes deployments.

The k8s deployment consists of:

  • k8s stateful set for one or more scheduler processes
  • k8s stateful set for one or more executor processes
  • k8s service to route traffic to the schedulers
  • k8s persistent volume and persistent volume claims to make local data accessible to Ballista

Limitations

Ballista is at an early stage of development and therefore has some significant limitations:

  • There is no support for shared object stores such as S3. All data must exist locally on each node in the cluster, including on the node where any client process runs (until #473 is resolved).
  • Only a single scheduler instance is currently supported unless the scheduler is configured to use etcd as a backing store.

Create Persistent Volume and Persistent Volume Claim

Copy the following yaml to a pv.yaml file and apply it to the cluster to create a persistent volume and a persistent volume claim so that the specified host directory is available to the containers. This is where any data should be located so that Ballista can execute queries against it.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi

To apply this yaml:

kubectl apply -f pv.yaml

You should see the following output:

persistentvolume/data-pv created
persistentvolumeclaim/data-pv-claim created
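
You can also confirm that the volume and claim were created and bound using standard kubectl commands (resource names as defined in the yaml above):

kubectl get pv data-pv
kubectl get pvc data-pv-claim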

Deploying Ballista Scheduler and Executors

Copy the following yaml to a cluster.yaml file.

apiVersion: v1
kind: Service
metadata:
  name: ballista-scheduler
  labels:
    app: ballista-scheduler
spec:
  ports:
    - port: 50050
      name: scheduler
  clusterIP: None
  selector:
    app: ballista-scheduler
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ballista-scheduler
spec:
  serviceName: "ballista-scheduler"
  replicas: 1
  selector:
    matchLabels:
      app: ballista-scheduler
  template:
    metadata:
      labels:
        app: ballista-scheduler
        ballista-cluster: ballista
    spec:
      containers:
      - name: ballista-scheduler
        image: ballistacompute/ballista-rust:0.4.1
        command: ["/scheduler"]
        args: ["--port=50050"]
        ports:
          - containerPort: 50050
            name: flight
        volumeMounts:
          - mountPath: /mnt
            name: data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-pv-claim
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ballista-executor
spec:
  serviceName: "ballista-scheduler"
  replicas: 2
  selector:
    matchLabels:
      app: ballista-executor
  template:
    metadata:
      labels:
        app: ballista-executor
        ballista-cluster: ballista
    spec:
      containers:
        - name: ballista-executor
          image: ballistacompute/ballista-rust:0.4.1
          command: ["/executor"]
          args: ["--port=50051", "--scheduler-host=ballista-scheduler", "--scheduler-port=50050", "--external-host=$(MY_POD_IP)"]
          env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP            
          ports:
            - containerPort: 50051
              name: flight
          volumeMounts:
            - mountPath: /mnt
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-pv-claim

To apply this yaml:

$ kubectl apply -f cluster.yaml

This should show the following output:

service/ballista-scheduler created
statefulset.apps/ballista-scheduler created
statefulset.apps/ballista-executor created

You can also check status by running kubectl get pods:

$ kubectl get pods
NAME                   READY   STATUS    RESTARTS   AGE
busybox                1/1     Running   0          16m
ballista-scheduler-0   1/1     Running   0          42s
ballista-executor-0    1/1     Running   2          42s
ballista-executor-1    1/1     Running   0          26s

You can view the scheduler logs with kubectl logs ballista-scheduler-0:

$ kubectl logs ballista-scheduler-0
[2021-02-19T00:24:01Z INFO  scheduler] Ballista v0.4.1 Scheduler listening on 0.0.0.0:50050
[2021-02-19T00:24:16Z INFO  ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
[2021-02-19T00:24:17Z INFO  ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
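
The executor logs can be inspected in the same way, using the pod names shown by kubectl get pods:

kubectl logs ballista-executor-0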

Deleting the Ballista cluster

Run the following kubectl command to delete the cluster.

kubectl delete -f cluster.yaml
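
If you also want to remove the persistent volume and claim created earlier, delete them the same way:

kubectl delete -f pv.yaml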

Configuration

The Rust executor and scheduler can be configured using TOML files, environment variables, and command-line arguments. The specification for config options can be found in rust/ballista/src/bin/[executor|scheduler]_config_spec.toml.

Those files fully define Ballista's configuration. If there is a discrepancy between this documentation and the files, assume those files are correct.

To get a list of command-line arguments, run the binary with --help.
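
For example, to see the executor's options when using the published Docker image (a sketch; the binary path matches the docker run examples earlier in this guide):

docker run ballistacompute/ballista-rust:0.4.1 /executor --help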

There is an example config file at ballista/rust/ballista/examples/example_executor_config.toml.

The order of precedence for arguments is: default config file < environment variables < specified config file < command line arguments.

The executor and scheduler will look for the default config file at /etc/ballista/[executor|scheduler].toml. To specify a config file, use the --config-file argument.
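
For example, a custom config file could be mounted into the container and passed explicitly (a sketch; my-executor.toml and the /opt/ballista path are hypothetical names chosen for illustration):

docker run --network=host \
  -v "$(pwd)/my-executor.toml:/opt/ballista/executor.toml" \
  -d ballistacompute/ballista-rust:0.4.1 \
  /executor --config-file /opt/ballista/executor.toml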

Environment variables are prefixed with BALLISTA_EXECUTOR or BALLISTA_SCHEDULER for the executor and scheduler respectively, and hyphens in command-line arguments become underscores. For example, the --scheduler-host argument for the executor becomes BALLISTA_EXECUTOR_SCHEDULER_HOST.
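
For example, the same setting could be supplied through the environment when launching the executor container (a sketch based on the mapping described above):

docker run --network=host \
  -e BALLISTA_EXECUTOR_SCHEDULER_HOST=localhost \
  -d ballistacompute/ballista-rust:0.4.1 \
  /executor --port 50051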

Clients

Ballista Rust Client

The Rust client supports a DataFrame API as well as SQL. See the TPC-H Benchmark Client for an example.

Ballista Python Client

Python bindings are available. See this example for more information.

Frequently Asked Questions

What is the relationship between Apache Arrow, DataFusion, and Ballista?

Apache Arrow is a library that provides a standardized memory representation for columnar data. It also provides "kernels" for performing common operations on this data.

DataFusion is a library for executing queries in-process using the Apache Arrow memory model and computational kernels. It is designed to run within a single process, using threads for parallel query execution.

Ballista is a distributed compute platform designed to leverage DataFusion and other query execution libraries.