Ballista A modern distributed compute platform

Release Notes

Ballista 0.3.0

The goal of the 0.3.0 release is to provide a minimum viable product of distributed compute in Rust. It is now possible to run a query that is very close to TPC-H query 1 on a distributed cluster with reasonable performance. Performance and scalability is comparable to Apache Spark (within the range of 2x slower to 2x faster based on initial benchmarks).

Performance tuning will be one of the main areas of focus for the 0.4.0 release.

Please refer to the user guide for installation instructions.

This release supports the following operators:

  • Projection
  • Selection
  • Hash Aggregate
  • CSV Table Scan
  • Parquet Table Scan
  • In-memory Table Scan
  • Shuffle

This release supports the following expressions:

  • Column references
  • Literal values
  • Aggregate expressions (MIN, MAX, SUM, AVG, COUNT)
  • Basic math expressions (+, -, *, /)
  • Comparison expressions (<. <=, =, !=, >=, >)
  • Aliased expressions

This release contains the following improvements to the Rust project compared to 0.3.0-alpha-2:

  • Query execution no longer uses async and this has allowed us to remove the dedicated thread in ParquetScanExec.

Known issues:

  • The scheduler is still extremely simple and inefficient.
  • Distributed query performance is not optimized yet.

Ballista 0.3.0-alpha-2

This is the second 0.3.0 alpha release and improves overall stability and UX.

This release contains the following improvements to the Rust project:

  • It is now possible to specify configuration options such as parquet and csv reader batch sizes when submitting queries.
  • Executors now use a simpler and more efficient worker thread model.
  • The scheduler is slightly more efficient and uses a stateful client when communicating with executors, reducing the overhead of creating new network connections.
  • The TPC-H example now accepts command-line arguments.
  • The HashAggregateExec operator is re-implemented using async/await, removing the overhead of a dedicated thread.
  • Explicit projections are now supported.
  • Improvements to Docker packaging.
  • Fixed bug where client-supplied schema was ignored.

This release contains the following improvements to the JVM project:

  • Serde code added for new protobuf messages involved in distributed query execution.

Known issues:

  • The scheduler is still extremely simple and inefficient.
  • Distributed query performance is not optimized yet.

Ballista 0.3.0-alpha-1

Ballista 0.3.0-alpha-1 is the first release with true distributed compute capabilities. It is now possible to deploy Ballista to Kubernetes, or as a standalone cluster using etcd for discovery, and use a DataFrame API to build and submit queries to the cluster for execution.

Here is a brief demo of Ballista executing TPCH query 1 in Kubernetes.

asciicast

The goal of the alpha release is to make it easier for contributors to start testing the new distributed scheduler and provide feedback. Performance is not optimized yet and this release supports a limited number of operators and expressions. A more capable 0.3.0 release will be available in August 2020.

Ballista 0.2.5

This release contains the following improvements to the Rust project:

  • Parquet data sources are now supported.
  • Rust executor Docker image is now a multi-stage build, resulting in a relatively small final image.

This release contains the following improvements to the JVM project:

  • General code clean-up in the Kotlin code base, including renaming some classes for consistency.
  • Spark module now contains utility to convert NYC Taxi data set from CSV to Parquet.

This release also contains the following improvements:

  • General improvements to release scripts and benchmarks.

Release artifacts are now available on Maven Central, Docker Hub, and crates.io.

Benchmark results can be found here.

Ballista 0.2.4

This release contains the following improvements to the Rust project:

  • It is now possible to execute queries locally (in-process) using Context::local().
  • Rust executor 30x performance improvement after removing use of musl. See this blog post for more information.

This release contains the following improvements to the JVM project:

  • Fixed a bug in the protobuf module where SUM and AVG aggregates were not encoded correctly.

This release contains the following improvements to Spark project:

  • The Spark project is now a multi-module gradle project.

This release also contains the following improvements:

  • Local mode benchmarks are now dockerized for Rust, JVM, and Spark.

Release artifacts are now available on Maven Central, Docker Hub, and crates.io.

Ballista 0.2.3

This release contains the following improvements to the JVM query engine:

  • The CSV data source is now much faster due to some simple optimizations being implemented. As a result of these improvements, the JVM executor now has similar performance to the Rust and Spark executors for an aggregate query against one month’s worth of NYC Taxi data.
  • The CAST expression is now supported for casting between numeric and string data types.

This release also contains the following improvements:

  • The Rust integration test now runs the same aggregate query against all three executors (JVM, Rust, Spark) and they all produce identical results.
  • A packaging issue was resolved that was preventing Java artifacts from being published to Maven Central.

Release artifacts are now available on Maven Central, Docker Hub, and crates.io.

Ballista 0.2.2

This release contains improvements to the JVM query engine:

  • Arithmetic expressions +, -, *, and / are now implemented for all numeric types.
  • Comparison expressions are now implemented for all numeric types plus strings.
  • AND and OR expressions are implemented.
  • SUM aggregate expression is implemented for all numeric types.
  • CSV data source now supports all numeric types.
  • There is an experimental new fuzzer module for generating random values, expressions, and query plans.