Ballista A modern distributed compute platform

Ballista 0.4.0

With this release, Ballista was re-implemented from scratch to take advantage of the many changes in Apache Arrow 3.0.0, especially some major refactoring in the DataFusion query engine that made it easier for projects such as Ballista to extend DataFusion’s functionality.

Please refer to the user guide for installation instructions.

Functionality

SQL

This is the first release of Ballista to support SQL joins, making it usable for a much wider range of queries than were supported in previous releases.

With this release it is now possible to run TPC-H queries 1, 3, 5, 6, 10, and 12 against a distributed cluster.

Ballista generally supports any query that DataFusion supports.

Deployment

Ballista now consists of separate scheduler and executor processes, and can be deployed as a standalone cluster and also supports deployment to Kubernetes.

The Ballista scheduler process can optionally use etcd as a backing store to maintain cluster state.

Benchmark Results

Here are some quick and dirty benchmark results, comparing performance of a single node Ballista cluster (one scheduler and one executor) with Apache Spark running in local mode, configured with 8 threads. The data set was the TPC-H SF=100 in Parquet format, with each table consisting of 8 partitions.

These are not ideal benchmarks but at least give a very rough idea of relative performance. Ballista is faster than Spark with some queries, and slower with others.

There are a number of optimizations that are yet to be implemented, and performance will improve with future 0.4.x releases.

Query Ballista 0.4.0 Apache Spark 3.0.1 Ballista as % of Spark
1 14.3 25.3 56.6%
3 108.5 45.5 238.3%
5 74.9 82.9 90.4%
6 3.9 9.7 40.6%
10 167.8 39.4 426.3%
12 13.9 28.0 49.7%

Please note that these are not official TPC-H benchmark results.