Ballista 0.4.1
22 Feb 2021This is a minor point release and contains the following changes:
- Fixes a bug when running with multiple executors
- Update of pyo3 dependency for building Python wheel
This is a minor point release and contains the following changes:
With this release, Ballista was re-implemented from scratch to take advantage of the many changes in Apache Arrow 3.0.0, especially some major refactoring in the DataFusion query engine that made it easier for projects such as Ballista to extend DataFusion’s functionality.
Please refer to the user guide for installation instructions.
This is the first release of Ballista to support SQL joins, making it usable for a much wider range of queries than were supported in previous releases.
With this release it is now possible to run TPC-H queries 1, 3, 5, 6, 10, and 12 against a distributed cluster.
Ballista generally supports any query that DataFusion supports.
Ballista now consists of separate scheduler and executor processes, and can be deployed as a standalone cluster and also supports deployment to Kubernetes.
The Ballista scheduler process can optionally use etcd as a backing store to maintain cluster state.
Here are some quick and dirty benchmark results, comparing performance of a single node Ballista cluster (one scheduler and one executor) with Apache Spark running in local mode, configured with 8 threads. The data set was the TPC-H SF=100 in Parquet format, with each table consisting of 8 partitions.
These are not ideal benchmarks but at least give a very rough idea of relative performance. Ballista is faster than Spark with some queries, and slower with others.
There are a number of optimizations that are yet to be implemented, and performance will improve with future 0.4.x releases.
Query | Ballista 0.4.0 | Apache Spark 3.0.1 | Ballista as % of Spark |
---|---|---|---|
1 | 14.3 | 25.3 | 56.6% |
3 | 108.5 | 45.5 | 238.3% |
5 | 74.9 | 82.9 | 90.4% |
6 | 3.9 | 9.7 | 40.6% |
10 | 167.8 | 39.4 | 426.3% |
12 | 13.9 | 28.0 | 49.7% |
Please note that these are not official TPC-H benchmark results.
The goal of the 0.3.0 release is to provide a minimum viable product of distributed compute in Rust. It is now possible to run a query that is very close to TPC-H query 1 on a distributed cluster with reasonable performance. Performance and scalability is comparable to Apache Spark (within the range of 2x slower to 2x faster based on initial benchmarks).
Performance tuning will be one of the main areas of focus for the 0.4.0 release.
Please refer to the user guide for installation instructions.
This release supports the following operators:
This release supports the following expressions:
This release contains the following improvements to the Rust project compared to 0.3.0-alpha-2:
async
and this has allowed us to remove the dedicated thread
in ParquetScanExec
.Known issues:
This is the second 0.3.0 alpha release and improves overall stability and UX.
This release contains the following improvements to the Rust project:
This release contains the following improvements to the JVM project:
Known issues:
Ballista 0.3.0-alpha-1 is the first release with true distributed compute capabilities. It is now possible to deploy Ballista to Kubernetes, or as a standalone cluster using etcd for discovery, and use a DataFrame API to build and submit queries to the cluster for execution.
Here is a brief demo of Ballista executing TPCH query 1 in Kubernetes.
The goal of the alpha release is to make it easier for contributors to start testing the new distributed scheduler and provide feedback. Performance is not optimized yet and this release supports a limited number of operators and expressions. A more capable 0.3.0 release will be available in August 2020.
This release contains the following improvements to the Rust project:
This release contains the following improvements to the JVM project:
This release also contains the following improvements:
Release artifacts are now available on Maven Central, Docker Hub, and crates.io.
Benchmark results can be found here.
This release contains the following improvements to the Rust project:
Context::local()
.This release contains the following improvements to the JVM project:
SUM
and AVG
aggregates were not encoded correctly.This release contains the following improvements to Spark project:
This release also contains the following improvements:
Release artifacts are now available on Maven Central, Docker Hub, and crates.io.
This release contains the following improvements to the JVM query engine:
CAST
expression is now supported for casting between numeric and string data types.This release also contains the following improvements:
Release artifacts are now available on Maven Central, Docker Hub, and crates.io.
This release contains improvements to the JVM query engine:
+
, -
, *
, and /
are now implemented for all numeric types.AND
and OR
expressions are implemented.SUM
aggregate expression is implemented for all numeric types.