Ballista: A modern distributed compute platform

Benchmark Specifications

Data

The benchmarks use the NYC Taxi data set, specifically the Yellow Taxi data for 2019. The data set is 7.3 GB in CSV format.

Query 1

Query 1 is a simple aggregate query executed against the NYC Taxi 2019 data set in CSV format. Here is the query expressed in SQL:

SELECT 
  passenger_count, 
  MIN(fare_amount), 
  MAX(fare_amount), 
  SUM(fare_amount) 
FROM tripdata 
GROUP BY passenger_count
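
To make the shape of the result concrete, here is a minimal sketch that runs the same aggregate over a handful of made-up rows using Python's built-in sqlite3 module. This is not how the benchmark itself is run: the sample rows, the two-column table definition, and the run_query_1 helper are assumptions for illustration only, and an ORDER BY has been added so the output order is deterministic.

```python
import sqlite3

# Query 1 as given above, plus an ORDER BY (an addition for this sketch)
# so that the result rows come back in a predictable order.
QUERY = """
SELECT
  passenger_count,
  MIN(fare_amount),
  MAX(fare_amount),
  SUM(fare_amount)
FROM tripdata
GROUP BY passenger_count
ORDER BY passenger_count
"""


def run_query_1(rows):
    """Load (passenger_count, fare_amount) rows into an in-memory table
    and return the aggregated result as a list of tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema: only the two columns the query touches.
    conn.execute(
        "CREATE TABLE tripdata (passenger_count INTEGER, fare_amount REAL)"
    )
    conn.executemany("INSERT INTO tripdata VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Made-up sample data, not taken from the NYC Taxi data set.
sample_rows = [(1, 5.0), (1, 12.5), (2, 7.0), (2, 3.0), (2, 20.0)]

for row in run_query_1(sample_rows):
    print(row)
```

Each output row holds one passenger_count value followed by the minimum, maximum, and total fare for that group, so the result has one row per distinct passenger count regardless of how large the input is.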