Ballista: A modern distributed compute platform

Release Notes

Ballista 0.4.1

This is a minor point release and contains the following changes:

  • Fixes a bug when running with multiple executors
  • Updates the pyo3 dependency used to build the Python wheel

Ballista 0.4.0

With this release, Ballista was re-implemented from scratch to take advantage of the many changes in Apache Arrow 3.0.0, especially some major refactoring in the DataFusion query engine that made it easier for projects such as Ballista to extend DataFusion’s functionality.

Please refer to the user guide for installation instructions.

Functionality

SQL

This is the first release of Ballista to support SQL joins, making it usable for a much wider range of queries than were supported in previous releases.

With this release it is now possible to run TPC-H queries 1, 3, 5, 6, 10, and 12 against a distributed cluster.

Ballista generally supports any query that DataFusion supports.

Deployment

Ballista now consists of separate scheduler and executor processes. It can be deployed as a standalone cluster or to Kubernetes.

The Ballista scheduler process can optionally use etcd as a backing store to maintain cluster state.

Benchmark Results

Here are some quick and dirty benchmark results, comparing the performance of a single-node Ballista cluster (one scheduler and one executor) with Apache Spark running in local mode, configured with 8 threads. The data set was TPC-H at scale factor 100 (SF=100) in Parquet format, with each table consisting of 8 partitions.

These are not ideal benchmarks, but they give a very rough idea of relative performance. Ballista is faster than Spark on queries 1, 5, 6, and 12, and slower on queries 3 and 10.

There are a number of optimizations that are yet to be implemented, and performance will improve with future 0.4.x releases.

Query   Ballista 0.4.0   Apache Spark 3.0.1   Ballista as % of Spark
1       14.3             25.3                 56.6%
3       108.5            45.5                 238.3%
5       74.9             82.9                 90.4%
6       3.9              9.7                  40.6%
10      167.8            39.4                 426.3%
12      13.9             28.0                 49.7%

Please note that these are not official TPC-H benchmark results.
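
The last column of the table above is the Ballista time divided by the Spark time, expressed as a percentage, so a value below 100% means Ballista was faster. The following is a minimal sketch of that arithmetic. The names `percent_of_spark`, `TIMINGS`, and `faster_queries` are hypothetical and not part of Ballista. Because the sketch uses the rounded timings from the table, its results can differ from the published column by a few tenths of a percent; this assumes the published figures were computed from unrounded timings.

```rust
/// Ballista's time as a percentage of Spark's time.
/// Values below 100.0 mean Ballista was faster for that query.
pub fn percent_of_spark(ballista: f64, spark: f64) -> f64 {
    ballista / spark * 100.0
}

/// Timings copied from the table above:
/// (query number, Ballista 0.4.0, Apache Spark 3.0.1).
pub const TIMINGS: [(u32, f64, f64); 6] = [
    (1, 14.3, 25.3),
    (3, 108.5, 45.5),
    (5, 74.9, 82.9),
    (6, 3.9, 9.7),
    (10, 167.8, 39.4),
    (12, 13.9, 28.0),
];

/// Query numbers for which Ballista's time is lower than Spark's.
pub fn faster_queries() -> Vec<u32> {
    TIMINGS
        .iter()
        .filter(|(_, ballista, spark)| ballista < spark)
        .map(|(query, _, _)| *query)
        .collect()
}
```

Calling `faster_queries()` lists the queries where Ballista came out ahead, and `percent_of_spark(14.3, 25.3)` reproduces the query 1 ratio to within rounding.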

Ballista 0.3.0

The goal of the 0.3.0 release is to provide a minimum viable product of distributed compute in Rust. It is now possible to run a query that is very close to TPC-H query 1 on a distributed cluster with reasonable performance. Performance and scalability are comparable to Apache Spark (within the range of 2x slower to 2x faster based on initial benchmarks).

Performance tuning will be one of the main areas of focus for the 0.4.0 release.

Please refer to the user guide for installation instructions.

This release supports the following operators:

  • Projection
  • Selection
  • Hash Aggregate
  • CSV Table Scan
  • Parquet Table Scan
  • In-memory Table Scan
  • Shuffle
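
To illustrate how a hash aggregate can work together with a shuffle across partitions, here is a minimal single-process sketch: each partition is aggregated into its own hash map, and the partial results are then merged by key. The function names and the `(String, f64)` row shape are hypothetical and chosen only for illustration; this is not Ballista's operator API.

```rust
use std::collections::HashMap;

/// Aggregate one partition: SUM(value) grouped by key.
pub fn partial_sum(partition: &[(String, f64)]) -> HashMap<String, f64> {
    let mut acc = HashMap::new();
    for (key, value) in partition {
        *acc.entry(key.clone()).or_insert(0.0) += *value;
    }
    acc
}

/// Merge per-partition results into a final result. This stands in for
/// the step where partial aggregates are brought together after a shuffle.
pub fn merge_sums(partials: Vec<HashMap<String, f64>>) -> HashMap<String, f64> {
    let mut merged = HashMap::new();
    for partial in partials {
        for (key, value) in partial {
            *merged.entry(key).or_insert(0.0) += value;
        }
    }
    merged
}
```

Splitting the work this way means each partition can be aggregated independently, and only the much smaller partial results need to be exchanged.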

This release supports the following expressions:

  • Column references
  • Literal values
  • Aggregate expressions (MIN, MAX, SUM, AVG, COUNT)
  • Basic math expressions (+, -, *, /)
  • Comparison expressions (<, <=, =, !=, >=, >)
  • Aliased expressions
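
The expression kinds listed above compose into trees. The following is a minimal sketch, assuming a hypothetical `Expr` enum evaluated against a single row of `f64` columns. It is not Ballista's or DataFusion's actual expression type, and aggregate expressions are left out because they operate across rows rather than on one row.

```rust
/// Result of evaluating an expression against one row.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Value {
    Number(f64),
    Boolean(bool),
}

/// Basic math and comparison operators.
#[derive(Debug, Clone, Copy)]
pub enum Op {
    Plus, Minus, Multiply, Divide,
    Lt, LtEq, Eq, NotEq, GtEq, Gt,
}

/// Hypothetical expression tree: column references, literal values,
/// binary math/comparison expressions, and aliases.
#[derive(Debug, Clone)]
pub enum Expr {
    Column(usize),
    Literal(f64),
    Binary(Box<Expr>, Op, Box<Expr>),
    Alias(Box<Expr>, String),
}

/// Evaluate an expression against a row. Returns None if a column index
/// is out of range or an operand is not numeric.
pub fn evaluate(expr: &Expr, row: &[f64]) -> Option<Value> {
    match expr {
        Expr::Column(i) => row.get(*i).map(|v| Value::Number(*v)),
        Expr::Literal(v) => Some(Value::Number(*v)),
        // An alias only renames the result; it does not change the value.
        Expr::Alias(inner, _) => evaluate(inner, row),
        Expr::Binary(left, op, right) => {
            let (l, r) = match (evaluate(left, row)?, evaluate(right, row)?) {
                (Value::Number(l), Value::Number(r)) => (l, r),
                _ => return None,
            };
            Some(match op {
                Op::Plus => Value::Number(l + r),
                Op::Minus => Value::Number(l - r),
                Op::Multiply => Value::Number(l * r),
                Op::Divide => Value::Number(l / r),
                Op::Lt => Value::Boolean(l < r),
                Op::LtEq => Value::Boolean(l <= r),
                Op::Eq => Value::Boolean(l == r),
                Op::NotEq => Value::Boolean(l != r),
                Op::GtEq => Value::Boolean(l >= r),
                Op::Gt => Value::Boolean(l > r),
            })
        }
    }
}
```

For example, an aliased tree for `column0 * 2 > column1` evaluated against the row `[3.0, 5.0]` yields `Some(Value::Boolean(true))`.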

This release contains the following improvements to the Rust project compared to 0.3.0-alpha-2:

  • Query execution no longer uses async and this has allowed us to remove the dedicated thread in ParquetScanExec.

Known issues:

  • The scheduler is still extremely simple and inefficient.
  • Distributed query performance is not optimized yet.

Ballista 0.3.0-alpha-2

This is the second 0.3.0 alpha release and improves overall stability and UX.

This release contains the following improvements to the Rust project:

  • It is now possible to specify configuration options such as Parquet and CSV reader batch sizes when submitting queries.
  • Executors now use a simpler and more efficient worker thread model.
  • The scheduler is slightly more efficient and uses a stateful client when communicating with executors, reducing the overhead of creating new network connections.
  • The TPC-H example now accepts command-line arguments.
  • The HashAggregateExec operator is re-implemented using async/await, removing the overhead of a dedicated thread.
  • Explicit projections are now supported.
  • Improvements to Docker packaging.
  • Fixed a bug where the client-supplied schema was ignored.

This release contains the following improvements to the JVM project:

  • Serde code added for new protobuf messages involved in distributed query execution.

Known issues:

  • The scheduler is still extremely simple and inefficient.
  • Distributed query performance is not optimized yet.

Ballista 0.3.0-alpha-1

Ballista 0.3.0-alpha-1 is the first release with true distributed compute capabilities. It is now possible to deploy Ballista to Kubernetes, or as a standalone cluster using etcd for discovery, and use a DataFrame API to build and submit queries to the cluster for execution.

A brief demo (an asciicast recording) shows Ballista executing TPC-H query 1 in Kubernetes.

The goal of the alpha release is to make it easier for contributors to start testing the new distributed scheduler and provide feedback. Performance is not optimized yet and this release supports a limited number of operators and expressions. A more capable 0.3.0 release will be available in August 2020.

Ballista 0.2.5

This release contains the following improvements to the Rust project:

  • Parquet data sources are now supported.
  • The Rust executor Docker image is now a multi-stage build, resulting in a relatively small final image.

This release contains the following improvements to the JVM project:

  • General code clean-up in the Kotlin code base, including renaming some classes for consistency.
  • The Spark module now contains a utility to convert the NYC Taxi data set from CSV to Parquet.

This release also contains the following improvements:

  • General improvements to release scripts and benchmarks.

Release artifacts are now available on Maven Central, Docker Hub, and crates.io.

Benchmark results can be found here.

Ballista 0.2.4

This release contains the following improvements to the Rust project:

  • It is now possible to execute queries locally (in-process) using Context::local().
  • The Rust executor is 30x faster after removing the use of musl. See this blog post for more information.

This release contains the following improvements to the JVM project:

  • Fixed a bug in the protobuf module where SUM and AVG aggregates were not encoded correctly.

This release contains the following improvements to the Spark project:

  • The Spark project is now a multi-module Gradle project.

This release also contains the following improvements:

  • Local mode benchmarks are now dockerized for Rust, JVM, and Spark.

Release artifacts are now available on Maven Central, Docker Hub, and crates.io.

Ballista 0.2.3

This release contains the following improvements to the JVM query engine:

  • The CSV data source is now much faster thanks to some simple optimizations. As a result, the JVM executor now has similar performance to the Rust and Spark executors for an aggregate query against one month’s worth of NYC Taxi data.
  • The CAST expression is now supported for casting between numeric and string data types.

This release also contains the following improvements:

  • The Rust integration test now runs the same aggregate query against all three executors (JVM, Rust, Spark) and they all produce identical results.
  • A packaging issue was resolved that was preventing Java artifacts from being published to Maven Central.

Release artifacts are now available on Maven Central, Docker Hub, and crates.io.

Ballista 0.2.2

This release contains improvements to the JVM query engine:

  • Arithmetic expressions +, -, *, and / are now implemented for all numeric types.
  • Comparison expressions are now implemented for all numeric types plus strings.
  • AND and OR expressions are implemented.
  • The SUM aggregate expression is implemented for all numeric types.
  • The CSV data source now supports all numeric types.
  • There is an experimental new fuzzer module for generating random values, expressions, and query plans.