Ballista: A modern distributed compute platform

This Week in Ballista

This Week in Ballista #8

Welcome to “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Progress This Week

The main changes in the past week are:

  • The executor can now be configured to execute a fixed number of concurrent tasks.
  • The client has been optimized to stream results from executors without loading the data set into memory.
  • Basic statistics (row count, number of batches, number of bytes) are now available to the scheduler for each query stage that has been executed. This will allow dynamic query plan optimizations to be implemented in the future.
  • LTO (link-time optimization) is now enabled by default for the Cargo release profile.
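
As a rough illustration of the first item above, one common way to cap concurrency is a fixed pool of worker threads that pull tasks from a shared queue. This is a minimal sketch using only the Rust standard library, not Ballista's executor code. The function name run_with_limit and the peak counter are hypothetical and exist only to make the limit observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Runs every task on a pool of exactly `concurrent_tasks` worker threads and
/// returns the highest number of tasks that were running at the same time.
/// (Hypothetical helper for illustration; not part of Ballista.)
pub fn run_with_limit<F>(concurrent_tasks: usize, tasks: Vec<F>) -> usize
where
    F: FnOnce() + Send + 'static,
{
    assert!(concurrent_tasks > 0, "need at least one worker");

    let (tx, rx) = mpsc::channel::<F>();
    let rx = Arc::new(Mutex::new(rx));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let workers: Vec<_> = (0..concurrent_tasks)
        .map(|_| {
            let rx = Arc::clone(&rx);
            let running = Arc::clone(&running);
            let peak = Arc::clone(&peak);
            thread::spawn(move || loop {
                // The lock is held only while taking the next task off the queue.
                let task = match rx.lock().unwrap().recv() {
                    Ok(task) => task,
                    Err(_) => break, // queue closed and fully drained
                };
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                task();
                running.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    // Queue all tasks, then close the channel so idle workers can exit.
    for task in tasks {
        tx.send(task).unwrap();
    }
    drop(tx);

    for worker in workers {
        worker.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Because there are only `concurrent_tasks` worker threads, the returned peak can never exceed that limit, regardless of how many tasks are queued.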

There has also been some extensive refactoring to make the code base easier to work with.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover. We would really like some help improving the user guide as well.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #7

Welcome to the 7th edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Focus

Now that 0.4.0 has been released, we have been working on a PR to decouple planning from execution.

There are a number of incremental improvements planned for performance and scalability for future 0.4.x releases.

Work has started on designing a Web UI for the scheduler process. This will make it much easier to debug cluster setup issues and to monitor running queries.

Ballista 0.4.1

Ballista 0.4.1 is a patch release to fix a bug when running with multiple executors.

Recent Talk

Andy Grove gave a presentation on Ballista and Apache Arrow at the New York Open Statistical Programming Meetup. The video is now available online.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover. We would really like some help improving the user guide as well.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #6

Welcome to the 6th edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Ballista 0.4.0 Now Available!

We have now released version 0.4.0. This is a huge milestone for the project: it is the first release that can truly be used to run a wide range of distributed queries, and it demonstrates that the basic architecture is sound.

This release supports running a cluster in standalone mode and also supports Kubernetes deployments.

Please refer to the release notes for more information about the functionality in this release and some preliminary benchmarks comparing performance with Apache Spark. The results are mixed but some of the benchmark queries run with 2x the performance of Spark, which is promising.

However, there is much more work to do to make Ballista scalable and performant with large data sets. The most important optimizations are related to the join support. Here are the relevant GitHub issues:

In addition to these larger bodies of work, numerous smaller bug fixes and incremental improvements are planned for subsequent 0.4.x releases. We are now moving to a weekly release schedule, with releases every Sunday to coincide with this newsletter.

Upcoming Talk

Andy Grove will be giving a presentation on Ballista and Apache Arrow at the New York Open Statistical Programming Meetup this Wednesday (2/24). See the event page for more details.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover. We would really like some help improving the user guide as well.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #5

Welcome to the fifth edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

0.4.0-alpha-1 Release

Distributed query execution is now working well enough to support TPC-H queries 1, 3, 5, 6, 10, and 12, and the first 0.4.0 alpha release is now available on Docker Hub. This makes it simple to try out Ballista locally by setting up a standalone cluster or by using Docker Compose. See the Ballista User Guide for instructions.

The full 0.4.0 release will be published once the remaining issues are resolved.

Help Wanted

Please try out Ballista with your own queries and data sets and file issues for any bugs or missing features that you discover.

We are also looking for help with these issues.

Community

Follow the @BallistaCompute Twitter account to receive notifications when new editions of “This Week in Ballista” are published.

Join the Ballista Discord Channel to chat with the core contributors.

This Week in Ballista #4

Welcome to the fourth edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Distributed Query Execution

Distributed query execution is now implemented and has been tested successfully end to end, but there is still a little more work to do before it is enabled by default and able to support the benchmarks.

The source for the query planner is here. This is a simple implementation that will be optimized once the benchmark queries are functionally working.

The query planner translates a DataFusion physical query plan into a Ballista physical query plan, breaking the plan up into query stages wherever the partitioning changes between a parent and child operator. Each query stage is represented by the QueryStageExec operator.
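
The stage-splitting rule can be sketched on a toy plan tree. The following is an illustrative model only, not the actual planner source: the Operator, Partitioning, and Stage types and the plan_stages function are hypothetical stand-ins. The rule is the one described above: a child with the same partitioning as its parent joins the parent's stage, and a child with different partitioning starts a new stage.

```rust
/// Toy partitioning schemes (hypothetical; not DataFusion's types).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Partitioning {
    RoundRobin(usize),
    Hash(usize),
    Single,
}

/// Toy physical operator: a label, its output partitioning, and its inputs.
#[derive(Debug)]
pub struct Operator {
    pub name: &'static str,
    pub partitioning: Partitioning,
    pub children: Vec<Operator>,
}

/// One query stage: the operators it runs and the stages it reads from.
#[derive(Debug, PartialEq)]
pub struct Stage {
    pub id: usize,
    pub operators: Vec<&'static str>,
    pub inputs: Vec<usize>,
}

/// Splits a plan into stages, returned in dependency order
/// (every stage appears after the stages it reads from).
pub fn plan_stages(root: &Operator) -> Vec<Stage> {
    let mut stages = Vec::new();
    build_stage(root, &mut stages);
    stages
}

/// Builds the stage rooted at `root` and returns its id.
fn build_stage(root: &Operator, stages: &mut Vec<Stage>) -> usize {
    let mut operators = Vec::new();
    let mut inputs = Vec::new();
    collect(root, &mut operators, &mut inputs, stages);
    let id = stages.len();
    stages.push(Stage { id, operators, inputs });
    id
}

fn collect(
    op: &Operator,
    operators: &mut Vec<&'static str>,
    inputs: &mut Vec<usize>,
    stages: &mut Vec<Stage>,
) {
    operators.push(op.name);
    for child in &op.children {
        if child.partitioning == op.partitioning {
            // Same partitioning: the child runs in the same stage.
            collect(child, operators, inputs, stages);
        } else {
            // Partitioning changes: the child becomes a separate stage.
            inputs.push(build_stage(child, stages));
        }
    }
}
```

For a plan consisting of a sort (single partition) over an aggregate (4 hash partitions) over a filter and a scan (8 partitions each), this yields three stages: the scan and filter together, then the aggregate, then the sort.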

A query stage can be executed by distributing the execution across the available executors in the cluster. The output from each partition is streamed to disk on the executor in Arrow IPC format. Future query stages then retrieve these results using a ShuffleReaderExec operator.

Finally, a CollectExec operator can be executed to retrieve the final result partitions in the client and coalesce them into a single partition.
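
The data flow between stages can be modeled in a few lines. This is a simplified in-memory sketch under stated assumptions, not Ballista code: the ShuffleStore type is hypothetical, rows are plain integers, and a HashMap stands in for the Arrow IPC files that executors write to disk.

```rust
use std::collections::HashMap;

/// Identifies one output partition of one query stage: (stage id, partition index).
type PartitionId = (usize, usize);

/// In-memory stand-in for the per-executor files holding stage output.
/// (Hypothetical type for illustration only.)
#[derive(Default)]
pub struct ShuffleStore {
    partitions: HashMap<PartitionId, Vec<i64>>,
}

impl ShuffleStore {
    /// Write side: a stage stores the output of one finished partition.
    pub fn write(&mut self, stage: usize, partition: usize, rows: Vec<i64>) {
        self.partitions.insert((stage, partition), rows);
    }

    /// Read side: a later stage fetches one partition of an earlier stage.
    pub fn read(&self, stage: usize, partition: usize) -> Option<&[i64]> {
        self.partitions
            .get(&(stage, partition))
            .map(|rows| rows.as_slice())
    }

    /// Collect: gathers partitions 0..num_partitions of a stage, in order,
    /// into a single partition. Returns None if any partition is missing.
    pub fn collect(&self, stage: usize, num_partitions: usize) -> Option<Vec<i64>> {
        let mut out = Vec::new();
        for partition in 0..num_partitions {
            out.extend_from_slice(self.read(stage, partition)?);
        }
        Some(out)
    }
}
```

Returning None when a partition is missing makes it explicit that a stage cannot be read until all of its input partitions have been written.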

All of these pieces work. The following tasks remain to make this fully usable:

  • The scheduler can already receive a logical plan and execute a distributed query using the available executors in the cluster. The Ballista DataFrame::collect method now needs to be updated to use this mechanism instead of sending the logical plan to one executor to be executed in-process with DataFusion (#485).
  • There is still a little more work to do in the serde module to add support for all of the operators and expressions required to support the benchmark queries (multiple issues).
  • The User Guide needs updating to reflect the changes in 0.4.0 (#486).

At the current rate of progress, it is very likely that there will be a 0.4.0 alpha release before the end of the month, at which point it will be easier to have more contributors working on different areas of the project.

Follow @BallistaCompute

There is now a @BallistaCompute Twitter account. Follow this account to receive notifications when “This Week in Ballista” is posted each week. Eventually there will be an RSS feed and support for email notifications as well.

Help Wanted

There has been an effort to start cleaning up the GitHub issues and more issues are now labeled with help wanted.

There is a mix of small short-term issues and larger, longer-term initiatives.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.

This Week in Ballista #3

Welcome to the third edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Status

There have been some notable developments this week:

There are now separate scheduler and executor binaries, with support for the following deployment models:

  • Local Mode: A single process containing both the scheduler and an executor. This is the simplest way to run Ballista and is primarily intended for local development testing.
  • Standalone Mode: Executors connect to the scheduler process.
  • Etcd Mode: Executors connect to the scheduler process, and the scheduler uses etcd for state management.

Thanks @edrevo for taking the lead on this work.

Preliminary work on distributed query execution is now checked in, although it is not working yet. End-to-end testing is now at the point where fragments of the physical plan are being sent to executors for execution. This work will likely continue over the next 2-3 weekends, and the hope is that distributed execution will be fully working again sometime in March 2021, with support for several of the TPC-H queries.

Current Focus

Serde support for the physical plan still needs to be completed so that a wide range of queries can be supported in distributed mode once the distributed planner is complete.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.

This Week in Ballista #2

Welcome to the second edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform powered by Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Status

With the release of Apache Arrow 3.0.0 fast approaching, the Ballista Rust subproject was recently started from scratch so that it could take advantage of the many changes that have been implemented in DataFusion.

There has been much progress over the past week and it is now possible to run several queries from the TPC-H Benchmark against a single Rust executor. Ballista also now supports SQL joins for the first time.

Here is a quick demonstration of the current capabilities. First, we start a Rust executor in one terminal:

$ RUST_LOG=info ./target/release/executor

[2021-01-24T15:52:54Z INFO  executor] Running with config: ExecutorConfig { discovery_mode: Standalone, 
  host: "localhost", port: 50051, etcd_urls: "localhost:2379", concurrent_tasks: 4 }
[2021-01-24T15:52:54Z INFO  ballista::executor] Running in standalone mode
[2021-01-24T15:52:54Z INFO  executor] Ballista v0.4.0-SNAPSHOT Rust Executor listening on 0.0.0.0:50051

Once the executor has started, we can run the benchmark client in another terminal and have it send the logical plan for one of the TPC-H queries to the executor. For reference, this is the SQL for query 5 that was used in this demonstration:

select
    n_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue
from
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
where
        c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey
  and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'ASIA'
  and o_orderdate >= date '1994-01-01'
  and o_orderdate < date '1995-01-01'
group by
    n_name
order by
    revenue desc;

Here is the output from running the query:

./target/debug/tpch benchmark --host localhost --port 50051 --query 5 --path data --format tbl --iterations 1
Running benchmarks with the following options: BenchmarkOpt { host: "localhost", port: 50051, query: 5, debug: false, 
  iterations: 1, batch_size: 32768, path: "data", file_format: "tbl" }
+-----------+--------------------+
| n_name    | revenue            |
+-----------+--------------------+
| INDIA     | 152776.4509        |
| CHINA     | 140090.8413        |
| INDONESIA | 53419.768299999996 |
+-----------+--------------------+
Query 5 iteration 0 took 7021.5 ms
Query 5 avg time: 7021.45 ms

Please refer to the TPC-H Benchmark project for full instructions on generating test data and running these benchmarks.
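
To make the aggregate in query 5 concrete, here is a small in-memory sketch of the sum(l_extendedprice * (1 - l_discount)) expression, grouped by n_name and ordered by revenue descending. It assumes the joins and filters have already been applied and each input row carries only the three columns needed. The Row type and revenue_by_nation function are hypothetical and are not part of Ballista or DataFusion.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

/// One already-joined, already-filtered row (hypothetical shape).
pub struct Row {
    pub n_name: &'static str,
    pub l_extendedprice: f64,
    pub l_discount: f64,
}

/// Computes sum(l_extendedprice * (1 - l_discount)) per nation,
/// ordered by revenue descending.
pub fn revenue_by_nation(rows: &[Row]) -> Vec<(String, f64)> {
    // group by n_name
    let mut totals: BTreeMap<&str, f64> = BTreeMap::new();
    for row in rows {
        *totals.entry(row.n_name).or_insert(0.0) +=
            row.l_extendedprice * (1.0 - row.l_discount);
    }

    let mut result: Vec<(String, f64)> = totals
        .into_iter()
        .map(|(name, revenue)| (name.to_string(), revenue))
        .collect();

    // order by revenue desc
    result.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(Ordering::Equal));
    result
}
```

Summing f64 values in this way is also why a result such as 53419.768299999996 can appear in the output above: binary floating-point addition does not always produce exactly representable decimal values.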

Current Focus

There are two main areas of focus now:

  • There is very early work on re-implementing the distributed query execution capabilities of the 0.3.0 release. This work will likely take a few weekends to get working again.
  • The JVM project has been somewhat neglected and now needs work to bring it up to date with the changes made to the ballista.proto file, which defines the serialization format for query plans.

Help Wanted

There are no specific asks this week, but the following links can be used to find issues where help is wanted:

Code contributions are great but there are many ways to contribute to the project, including code reviews, writing documentation, project management, testing, and benchmarking.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.

This Week in Ballista #1

Welcome to the first edition of “This Week in Ballista”, a weekly newsletter that summarizes activity in the Ballista Distributed Compute project.

Ballista is a modern distributed compute platform based on Apache Arrow and primarily implemented in Rust, but designed to provide first-class support for other programming languages, including Python, C++, and Java.

Current Status & Focus

With the release of Apache Arrow 3.0.0 fast approaching, the Ballista Rust subproject was started from scratch so that it can take advantage of the many changes that have been implemented in DataFusion, particularly the extensible physical query plans. Previously, Ballista had forked physical operators and expressions from DataFusion but this is no longer required, so the Ballista codebase will be much smaller now and it will be easier to benefit from the ongoing improvements in DataFusion. Ballista also no longer provides its own DataFrame or Logical Plan and instead supports execution of DataFusion logical plans.

The focus over the past week has been implementing the serde module so that logical query plans can be serialized to Google Protocol Buffer format, enabling clients to send DataFusion logical query plans to Ballista executors. There is a minimal example demonstrating this functionality.

There has also been an effort to port over code from 0.3.0 as we work towards getting end-to-end functionality working again.

Rust Help Wanted

We are currently looking for help with these specific items for the Rust implementation:

  • Implementing more serde code for serializing logical plans and expressions #341

See the full list of Rust issues where help is needed.

JVM Help Wanted

We are currently looking for help with these specific items for the JVM implementation:

  • Update Kotlin code to use latest protobuf definition #374

See the full list of JVM issues where help is needed.

Other Help Wanted

Code contributions are great but there are many ways to contribute to the project, including code reviews, writing documentation, project management, testing, and benchmarking.

Join the Community

There is a growing community in the Ballista Discord Channel. This is a great place to ask questions and learn more about the project.