Ballista A modern distributed compute platform

This Week in Ballista

This Week in Ballista #10

Welcome to “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Ballista has been donated to the Apache Arrow project

We missed a few editions of this newsletter while we worked through the process of donating the project to Apache Arrow.

The good news is that we are now through that process and the Ballista Rust codebase has now been donated to Apache Arrow via this pull request.

We can now resume development by creating pull requests against the Apache Arrow repo and an initial list of issues has been created here.

DataFusion and Ballista to move to new repository soon

DataFusion and Ballista are tightly coupled and there is an opportunity now to have a common scheduler that can scale queries seamlessly across cores in DataFusion and across nodes in Ballista.

DataFusion and Ballista have different release cycle requirements than the core Arrow crate and for this reason there is a proposal to move these components out of the Apache Arrow repo and into a new top-level DataFusion repository. The project will continue to operate under the governance of the Apache Arrow PMC and discussions will happen on the Apache Arrow mailing lists.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the ASF Slack Channel to chat with the core contributors in #arrow-rust.

This Week in Ballista #9

Welcome to “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Proposal: Donate Ballista to Apache Arrow project

This week’s update is a little different. Andy Grove has started a discussion about donating Ballista to Apache Arrow. Here is his proposal, which is also filed as a GitHub issue here. There is a discussion on the Apache Arrow mailing list about this as well.

The Ballista project has recently reached a point where I believe that the basic architecture has been proven to work. The project has also suddenly become very popular and generated a lot of interest (more than 2k stars on GitHub).

For these reasons, I think that the project has grown too large for me to continue maintaining as a personal project and I think it is now time to move the code to a foundation to ensure its continued success.

Given the deep dependencies on Apache Arrow (the core Arrow, DataFusion, and Parquet crates) and the fact that there is already some overlap between Arrow and Ballista committers, I believe that the obvious choice would be to donate the project to Apache Arrow.

Some of the benefits of donating the project to Arrow are:

  • It will be easier to add new features to Ballista and DataFusion if they are in the same repository. Ballista essentially extends DataFusion and it is often necessary to touch both code bases when implementing new functionality.
  • DataFusion would benefit from having a scheduler (rather than trying to eagerly evaluate the entire query plan) and it would probably make sense to push some parts of the Ballista scheduler down a level in the stack so that the same approach is used to scale across cores in DataFusion and to scale across nodes in Ballista.
  • Apache Arrow has a strong community anbd there is a team of committers that understand Arrow and DataFusion that can help with PR reviews so I will no longer be a bottleneck. Companies are also more likely to commit resources to contributing to an Apache project compared to a personal project.

If you have opinions (for or against) this donation, please comment on the GitHub issue.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover. We would really like some help improving the user guide as well.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #8

Welcome to “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Progress This Week

The main changes in the past week are:

  • The executor can now be configured to execute a fixed number of concurrent tasks.
  • The client has been optimized to stream results from executors without loading the data set into memory.
  • Basic statistics (row count, number of batches, number of bytes) are now available to the scheduler for each query stage that has been executed. This will allow dynamic query plan optimizations to be implemented in the future.
  • lto is enabled by default for the cargo release target.

There has also been some extensive refactoring to make the code base easier to work with.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover. We would really like some help improving the user guide as well.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #7

Welcome to the 7th edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Focus

Now that 0.4.0 has been released, we have been working on a PR to decouple planning from execution.

There are a number of incremental improvements planned for performance and scalability for future 0.4.x releases.

Work has started on designing a Web UI for the scheduler process. This will make it much easier to debug cluster setup issues and to monitor running queries.

Ballista 0.4.1

Ballista 0.4.1 is a patch release to fix a bug when running with multiple executors.

Recent Talk

Andy Grove gave a presentation on Ballista and Apache Arrow at the New York Open Statistical Programming Meetup. The video is now available online.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover. We would really like some help improving the user guide as well.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #6

Welcome to the 6th edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Ballista 0.4.0 Now Available!

We have now released version 0.4.0 and this is a huge milestone for the project because this is the first ever release that can truly be used to run a wide range of distributed queries and demonstrates that the basic architecture is sound.

This release supports running a cluster in standalone mode and also supports Kubernetes deployments.

Please refer to the release notes for more information about the functionality in this release and some preliminary benchmarks comparing performance with Apache Spark. The results are mixed but some of the benchmark queries run with 2x the performance of Spark, which is promising.

However, there is much more work to do to make Ballista scalable and performant with large data sets. The most important optimizations are related to the join support. Here are the relevant GitHub issues:

In addition to these larger bodies of work, there are still numerous smaller bug fixes and incremental improvements that are being planned for subsequent 0.4.x releases and we are now moving to a weekly release schedule, with releases every Sunday to coincide with this newsletter.

Upcoming Talk

Andy Grove will be giving a presentation on Ballista and Apache Arrow at the New York Open Statistical Programming Meetup this Wednesday (2/24). See the event page for more details.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover. We would really like some help improving the user guide as well.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #5

Welcome to the fifth edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

0.4.0-alpha-1 Release

Distributed query execution is now working well enough to support TPC-H queries 1, 3, 5, 6, 10, and 12 and the first 0.4.0 release is now available on Dockerhub, making it simple to try out Ballista locally by setting up a standalone cluster or by using Docker Compose. See the Ballista User Guide for instructions.

The full 0.4.0 release will be published once the remaining issues are resolved.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #4

Welcome to the fourth edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Distributed Query Execution

Distributed query execution is now implemented and has been tested successfully end to end but there is still a little more work to do before this is enabled by default and able to support the benchmarks.

The source for the query planner is here. This is a simple implementation that will be optimized once the benchmark queries are functionally working.

The query planner works by translating a DataFusion physical query plan into a Ballista physical query plan by breaking the plan up into query stages whenever partitioning changes between a parent and child operator. A query stage is represented by the QueryStageExec operator.

A query stage can be executed by distributing the execution across the available executors in the cluster. The output from each partition is streamed to disk on the executor in Arrow IPC format. Future query stages then retrieve these results using a ShuffleReaderExec operator.

Finally, a CollectExec operator can be executed to retrieve the final result partitions in the client and coalesce them into a single partition.

All of these pieces work and now the following work needs to happen to make this fully usable:

  • The scheduler already has the ability to receive a logical plan and execute a distributed query using the available executors in the cluster, and the Ballista DataFrame::collect method now needs to be updated to use this mechanism instead of sending the logical plan to one executor to be executed in-process with DataFusion (#485).
  • There is still a little more work to do in the serde module to add support for all of the operators and expressions required to support the benchmark queries (multiple issues).
  • The User Guide needs updating to reflect the changes in 0.4.0 (#486).

At the current rate of progress, it is very likely that there will be a 0.4.0 alpha release before the end of the month and at this point it will be easier to have more contributors working on different areas of the project.

Follow @BallistaCompute

There is now a @BallistaCompute Twitter account. Follow this account to receive notifications when “This Week in Ballista” is posted each week. Eventually there will be an RSS feed and support for email notifications as well.

Help Wanted

There has been an effort to start cleaning up the GitHub issues and more issues are now labeled with help wanted.

There are a mix of small short term issues and larger longer term initiatives.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.

This Week in Ballista #3

Welcome to the third edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Status

There have been some notable developments this week:

There are now separate scheduler and executor binaries, with support for the following deployment models:

  • Local Mode: Single process containing scheduler and executor. This is the simplest way to run Ballista and is primarily intended for local development testing.
  • Standalone Mode: Executors connect to the scheduler process
  • Etcd Mode: Executors connect to the scheduler process and the scheduler uses etcd for state management

Thanks @edrevo for taking the lead on this work.

Preliminary work on distributed query execution is now checked in, although not working yet. End-to-end testing is now at the point where fragments of the physical plan are being sent to executors for execution. This work will likely continue over the next 2-3 weekends and the hope is that distributed execution will be fully working again sometime in March 2021, with support for several of the TPC-H queries.

Current Focus

There is still a need to continue with implementing serde for the physical plan so that a wide range of queries can be supported in distributed mode once the distributed planner is complete.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.

This Week in Ballista #2

Welcome to the second edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Status

With the release of Apache Arrow 3.0.0 fast approaching, the Ballista Rust subproject was recently started from scratch so that it could take advantage of the many changes that have been implemented in DataFusion.

There has been much progress over the past week and it is now possible to run several queries from the TPC-H Benchmark against a single Rust executor. Ballista also now supports SQL joins for the first time.

Here is a quick demonstration of the current capabilities. First, we start a Rust executor in one terminal:

$ RUST_LOG=info ./target/release/executor

[2021-01-24T15:52:54Z INFO  executor] Running with config: ExecutorConfig { discovery_mode: Standalone, 
  host: "localhost", port: 50051, etcd_urls: "localhost:2379", concurrent_tasks: 4 }
[2021-01-24T15:52:54Z INFO  ballista::executor] Running in standalone mode
[2021-01-24T15:52:54Z INFO  executor] Ballista v0.4.0-SNAPSHOT Rust Executor listening on 0.0.0.0:50051

Once the executor has started, we can run the benchmark client in another terminal and have it send the logical plan for one of the TPC-H queries to the executor. For reference, this is the SQL for query 5 that was used in this demonstration:

select
    n_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue
from
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
where
        c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey
  and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'ASIA'
  and o_orderdate >= date '1994-01-01'
  and o_orderdate < date '1995-01-01'
group by
    n_name
order by
    revenue desc;

Here is the output from running the query:

./target/debug/tpch benchmark --host localhost --port 50051 --query 5 --path data --format tbl --iterations 1
Running benchmarks with the following options: BenchmarkOpt { host: "localhost", port: 50051, query: 5, debug: false, 
  iterations: 1, batch_size: 32768, path: "data", file_format: "tbl" }
+-----------+--------------------+
| n_name    | revenue            |
+-----------+--------------------+
| INDIA     | 152776.4509        |
| CHINA     | 140090.8413        |
| INDONESIA | 53419.768299999996 |
+-----------+--------------------+
Query 5 iteration 0 took 7021.5 ms
Query 5 avg time: 7021.45 ms

Please refer to the TPC-H Benchmark project for full instructions on generating test data and running these benchmarks.

Current Focus

There are two main areas of focus now:

  • There is very early work on re-implementing the distributed query execution capabilities of the 0.3.0 release. This work will likely take a few weekends to get working again.
  • The JVM project has been a bit neglected and now needs some work to bring it up to date with the changes that have been made to the ballista.proto file which defines the serialization format for query plans.

Help Wanted

There are no specific asks this week but the the following links can be used to find issues where help is wanted:

Code contributions are great but there are many ways to contribute to the project, including code reviews, writing documentation, project management, testing, and benchmarking.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.

This Week in Ballista #1

Welcome to the first edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform based on Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Status & Focus

With the release of Apache Arrow 3.0.0 fast approaching, the Ballista Rust subproject was started from scratch so that it can take advantage of the many changes that have been implemented in DataFusion, particularly the extensible physical query plans. Previously, Ballista had forked physical operators and expressions from DataFusion but this is no longer required, so the Ballista codebase will be much smaller now and it will be easier to benefit from the ongoing improvements in DataFusion. Ballista also no longer provides its own DataFrame or Logical Plan and instead supports execution of DataFusion logical plans.

The focus over the past week has been implementing the serde module so that logical query plans can be serialized to Google Protocol Buffer format, enabling clients to send DataFusion logical query plans to Ballista executors. There is a minimal example demonstrating this functionality.

There has also been an effort to port over code from 0.3.0 as we work towards getting end-to-end functionality working again.

Rust Help Wanted

We are currently looking for help with these specific items for the Rust implementation:

  • Implementing more serde code for serializing logical plans and expressions #341

See the full list of Rust issues where help is needed.

JVM Help Wanted

We are currently looking for help with these specific items for the JVM implementation:

  • Update Kotlin code to use latest protobuf definition #374

See the full list of JVM issues where help is needed.

Other Help Wanted

Code contributions are great but there are many ways to contribute to the project, including code reviews, writing documentation, project management, testing, and benchmarking.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.