Merged
152 commits
1f352ad
Refactor: split benchmark.sh into install/start/check/stop/load/query…
alexey-milovidov May 7, 2026
1b050f1
Merge branch 'main' into refactor/per-system-script-interface
alexey-milovidov May 7, 2026
0001b85
Refactor mongodb and polars to the new per-system layout
alexey-milovidov May 7, 2026
98c67c4
clickhouse-datalake{,-partitioned}: create the table once in load, re…
alexey-milovidov May 7, 2026
b5d60e8
clickhouse/query: drop the cat shim — clickhouse-client reads stdin n…
alexey-milovidov May 7, 2026
b95012a
clickhouse/start: drop idempotency check; let bench_check_loop verify…
alexey-milovidov May 7, 2026
94794b5
Add missing change
alexey-milovidov May 7, 2026
fb44010
Merge branch 'main' into refactor/per-system-script-interface
alexey-milovidov May 8, 2026
eb9821f
trino{,-partitioned,-datalake,-datalake-partitioned}: refactor to per…
alexey-milovidov May 8, 2026
d509fa0
presto{,-partitioned,-datalake,-datalake-partitioned}: refactor to pe…
alexey-milovidov May 8, 2026
61515ff
datafusion-vortex{,-partitioned}: refactor to per-system layout
alexey-milovidov May 9, 2026
00de41d
quickwit: refactor to per-system layout
alexey-milovidov May 9, 2026
4a44596
gizmosql/util.sh: fixed PID-file path
alexey-milovidov May 9, 2026
f8636fa
Merge branch 'main' into refactor/per-system-script-interface
alexey-milovidov May 9, 2026
53de4b8
lib/benchmark-common.sh: silence start/stop output
alexey-milovidov May 9, 2026
f1ba3af
duckdb*/install: symlink duckdb into /usr/local/bin
alexey-milovidov May 9, 2026
5929597
byconity/install: install docker compose v2 plugin if missing
alexey-milovidov May 9, 2026
e2669c4
concurrent: ClickBench QPS benchmark with N persistent connections
alexey-milovidov May 9, 2026
055a3ac
{chdb,chdb-parquet-partitioned,hyper,hyper-parquet,sail,sail-partitio…
alexey-milovidov May 9, 2026
e2e4eb3
lib/benchmark-common.sh: surface the actual ./check failure on timeout
alexey-milovidov May 9, 2026
9633ba5
cloudberry/install: fail fast on non-RHEL hosts
alexey-milovidov May 9, 2026
c288eab
cloud-init.sh.in: export HOME so child install/query scripts find it
alexey-milovidov May 9, 2026
0312937
monetdb/install: actually initialize and start monetdbd
alexey-milovidov May 9, 2026
7a21c35
paradedb{,-partitioned}/install: bump image tag to a published one
alexey-milovidov May 9, 2026
f6e0172
selectdb/install: switch to download.velodb.io Apache Doris build
alexey-milovidov May 9, 2026
f86b41c
greenplum: switch to woblerr/greenplum Docker image
alexey-milovidov May 9, 2026
af13b01
cloud-init: forward operator-side env vars (YT_PROXY, YT_TOKEN, CHYT_…
alexey-milovidov May 9, 2026
98c68a2
hologres/benchmark.sh: drop yum dep on host, use a docker psql shim
alexey-milovidov May 9, 2026
3971980
cloudberry: run the upstream RHEL build inside a docker container
alexey-milovidov May 9, 2026
18fd6fc
paradedb-partitioned: mark as historical
alexey-milovidov May 9, 2026
db1a2a1
paradedb: rework around pg_search after pg_lakehouse removal
alexey-milovidov May 9, 2026
3aebf66
monetdb: fix the password and dbfarm-init paths so ./check actually p…
alexey-milovidov May 9, 2026
fdd10f4
databend: stagger meta/query startup, raise fd limit and check timeout
alexey-milovidov May 9, 2026
9db2d3a
byconity: give the docker-compose chain time to settle
alexey-milovidov May 9, 2026
51cc62a
move download-hits-* scripts into lib/
alexey-milovidov May 9, 2026
558053e
spark-velox: add per-system-script-interface entry
alexey-milovidov May 9, 2026
96337ba
{tablespace,tembo-olap}/queries.sql: add trailing newline
alexey-milovidov May 9, 2026
f236e7b
cloud-init: add a 16 GB swapfile
alexey-milovidov May 9, 2026
3101589
druid, pinot: bump to currently-published versions
alexey-milovidov May 9, 2026
87717c0
sail{,-partitioned}/install: --ignore-installed past Ubuntu 24.04 pac…
alexey-milovidov May 9, 2026
c28bb40
siglens/install: set HOME and GOPATH before `go mod tidy`
alexey-milovidov May 9, 2026
4aaba80
datafusion-vortex{,-partitioned}/install: idempotent submodule update
alexey-milovidov May 9, 2026
fd64aef
elasticsearch/load.py: bump bulk request timeout 30s -> 300s
alexey-milovidov May 9, 2026
2e23b0e
tidb/start: wait for tiflash to register before returning
alexey-milovidov May 9, 2026
11ca02f
byconity: bump image from 0.1.0-GA to 1.0.1-hotfix1
alexey-milovidov May 9, 2026
2611f51
cockroachdb, sirius: bump BENCH_CHECK_TIMEOUT past 300 s default
alexey-milovidov May 9, 2026
591350d
duckdb-vortex{,-partitioned}: stamp HOME so vcpkg / extension cache work
alexey-milovidov May 9, 2026
3b14156
README: document the systems whose upstream distribution has gone away
alexey-milovidov May 9, 2026
4c7d1d5
Revert "README: document the systems whose upstream distribution has …
alexey-milovidov May 9, 2026
55f2e31
{vertica,oxla,kinetica,heavyai,infobright}/README.md: note these are …
alexey-milovidov May 9, 2026
de34928
{vertica,oxla,kinetica,heavyai,infobright}/README.md: say "dead" plainly
alexey-milovidov May 9, 2026
8189771
kinetica: fetch kisql binary from raw.githubusercontent.com
alexey-milovidov May 9, 2026
9907949
heavyai: switch to omnisci/core-os-cpu:v5.10.2 docker image
alexey-milovidov May 9, 2026
faf36df
collect-results.sh: ORDER BY time DESC before LIMIT 1 BY
alexey-milovidov May 9, 2026
89b085a
Remove the `concurrent/` directory
alexey-milovidov May 9, 2026
10e05ba
tembo-olap: mark the 2024-02-09 result historical
alexey-milovidov May 9, 2026
b79e224
cloud-init: bump per-run timeout 20000s -> 36000s
alexey-milovidov May 9, 2026
33097f8
byconity/load: drop redundant pigz; lib already decompresses .gz
alexey-milovidov May 9, 2026
5f462f9
druid/install: bump JDK from 11 to 17 to satisfy the version bump
alexey-milovidov May 9, 2026
c5d5c1f
sail{,-partitioned}/install: pass --ignore-installed to get-pip.py too
alexey-milovidov May 9, 2026
ca4c4cb
tidb/start: poll information_schema.cluster_info, not tikv_store_status
alexey-milovidov May 9, 2026
910907e
pinot, sirius: bump BENCH_CHECK_TIMEOUT to fit cold-start
alexey-milovidov May 9, 2026
4eaadfc
datafusion-vortex{,-partitioned}: install libclang-dev; bump v0.34 ->…
alexey-milovidov May 9, 2026
244ea36
questdb/data-size: glob-free du; the materialized view was rejecting …
alexey-milovidov May 9, 2026
c8af85e
lib/benchmark-common.sh: pull last numeric line, not last line
alexey-milovidov May 9, 2026
9aab622
heavyai: drop the omnisql positional db arg; bump check timeout to 900s
alexey-milovidov May 9, 2026
c8125e5
Rename selectdb -> velodb; mark old results historical
alexey-milovidov May 9, 2026
e548292
bytehouse: tag cloud results historical, document the status
alexey-milovidov May 9, 2026
0195f86
velodb: drop the "historical" tag from old results
alexey-milovidov May 9, 2026
5702c3d
cloud-init: make the global benchmark timeout configurable
alexey-milovidov May 9, 2026
25e6eb2
cloud-init: rework BENCHMARK_TIMEOUT as a @timeout@ render placeholder
alexey-milovidov May 9, 2026
cc1204f
{cloud-init,run-benchmark}: drop YT_-related runtime_env forwarding
alexey-milovidov May 9, 2026
8713287
Add some results
alexey-milovidov May 9, 2026
1a03827
velodb/results/20260509: fix the stale "SelectDB" system field
alexey-milovidov May 9, 2026
8447eb4
selectdb: disable BE / FE caches so warm runs are actually warm
alexey-milovidov May 9, 2026
f26f504
lib/benchmark-common.sh: detect silent partial loads after ./load
alexey-milovidov May 9, 2026
15dc0e4
lib/benchmark-common.sh: don't cache data-size between load and main
alexey-milovidov May 9, 2026
5ea85fc
Add more results
alexey-milovidov May 9, 2026
2920580
lib/benchmark-common.sh: actually clear caches before cold runs
alexey-milovidov May 9, 2026
02f9c9e
umbra: harden against silently-misattributed query failures
alexey-milovidov May 9, 2026
42e887c
spark-velox: drop the "Velox" tag from result and template
alexey-milovidov May 9, 2026
64045a2
umbra/results/20260509: remove partial-load result
alexey-milovidov May 9, 2026
332f219
{kinetica,presto*}: fix two systemic causes of all-null query timings
alexey-milovidov May 9, 2026
4ef9095
spark/install: pin pyspark to 3.5.5 (4.0 broke the shared query.py)
alexey-milovidov May 9, 2026
67b6b00
drill/query: actually execute queries (use --run= instead of stdin)
alexey-milovidov May 9, 2026
dbbb1a1
Refactor pg_duckdb*, ursa, yugabytedb to per-system-script-interface
alexey-milovidov May 9, 2026
3349102
mariadb-columnstore: refactor + switch bulk loader from LOAD DATA INF…
alexey-milovidov May 9, 2026
98e28d1
duckdb-memory: convert to a long-lived FastAPI server
alexey-milovidov May 10, 2026
1599cf6
Tag dataframe / Python-HTTP-server entries as "in-memory" everywhere
alexey-milovidov May 10, 2026
2880a63
Add more results
alexey-milovidov May 10, 2026
e2bdcbc
Merge branch 'refactor/per-system-script-interface' of github.com:Cli…
alexey-milovidov May 10, 2026
999b3ec
{clickhouse,clickhouse-tencent}/install: force async_load_databases=f…
alexey-milovidov May 10, 2026
7d255c8
{clickhouse,clickhouse-tencent}/install: switch the override to YAML
alexey-milovidov May 10, 2026
4aeb7f6
{clickhouse,clickhouse-tencent}/create.sql: enable mark + PK cache pr…
alexey-milovidov May 10, 2026
82e6fe0
Revert "{clickhouse,clickhouse-tencent}/create.sql: enable mark + PK …
alexey-milovidov May 10, 2026
b355245
More results
alexey-milovidov May 10, 2026
99327f4
Remove "serverless" tag from local-storage systems
alexey-milovidov May 10, 2026
b71273a
More results
alexey-milovidov May 10, 2026
44c6baf
Merge branch 'refactor/per-system-script-interface' of github.com:Cli…
alexey-milovidov May 10, 2026
7c1f7a3
{clickhouse,clickhouse-tencent}/install: eager-load PK and column sizes
alexey-milovidov May 10, 2026
f0c5963
{clickhouse,clickhouse-tencent}/create.sql: disable auto_statistics_t…
alexey-milovidov May 10, 2026
a82d445
More results
alexey-milovidov May 10, 2026
b805405
lib/benchmark-common.sh: handle Spark progress bar in timing parse
alexey-milovidov May 10, 2026
81d2bc6
duckdb-dataframe: drop SQL whitelist, install pytz
alexey-milovidov May 10, 2026
b290cb2
mariadb-columnstore: rewrite Q29 without REGEXP_REPLACE
alexey-milovidov May 10, 2026
0145e3d
sqlite: rewrite Q29 without REGEXP_REPLACE
alexey-milovidov May 10, 2026
129e1cb
{databend,octosql}/install: pick arch-matched binary
alexey-milovidov May 10, 2026
38120c2
fail fast on arm64 for systems with no arm64 support
alexey-milovidov May 10, 2026
302b31c
pg_ducklake/load: bump duckdb.memory_limit so CTAS doesn't OOM
alexey-milovidov May 10, 2026
ad83140
elasticsearch: single-node discovery so 9.4 bootstrap actually starts
alexey-milovidov May 10, 2026
f73956b
{pandas,polars-dataframe}: eval Python expressions, drop SQL whitelist
alexey-milovidov May 10, 2026
c5b15b6
{presto,presto-datalake,presto-partitioned,presto-datalake-partitione…
alexey-milovidov May 10, 2026
4729051
cratedb/check: wait for shard recovery, not just wire protocol
alexey-milovidov May 10, 2026
321e026
run-benchmark: retry run-instances on InsufficientInstanceCapacity
alexey-milovidov May 10, 2026
1acbc67
run-benchmark: also retry on quota errors
alexey-milovidov May 10, 2026
ba4a5b6
Merge branch 'main' into refactor/per-system-script-interface
alexey-milovidov May 10, 2026
c18a16e
More results
alexey-milovidov May 10, 2026
546b5e9
gizmosql/install: insulate the one-line installer from an unset HOME
alexey-milovidov May 10, 2026
65ba6ac
cedardb/results/20260510: record timed-out runs with partial timings
alexey-milovidov May 10, 2026
84a86f4
cedardb/results/20260510: record timed-out runs with partial timings
alexey-milovidov May 10, 2026
80d85bd
Add error markers
alexey-milovidov May 10, 2026
d6342fd
Merge branch 'refactor/per-system-script-interface' of github.com:Cli…
alexey-milovidov May 10, 2026
51cb9c6
Merge branch 'refactor/per-system-script-interface' of github.com:Cli…
alexey-milovidov May 10, 2026
e0e741e
Merge branch 'refactor/per-system-script-interface' of github.com:Cli…
alexey-milovidov May 10, 2026
519378c
generate-results: skip entries with {"error": ...}
alexey-milovidov May 10, 2026
4dfad38
index.html: tighten selector UI
alexey-milovidov May 10, 2026
15fedf9
index.html: avoid render() during theme bootstrap
alexey-milovidov May 10, 2026
9c7e07f
More results
alexey-milovidov May 10, 2026
e29ce9e
Drop "lukewarm-cold-run" tag from post-refactor results & templates
alexey-milovidov May 10, 2026
5aae012
firebolt{,-parquet,-parquet-partitioned}: actually wait for the cluster
alexey-milovidov May 10, 2026
739d225
Mark Parquet/data-lake/Spark/Sail/Polars systems as stateless
alexey-milovidov May 10, 2026
7fc81ac
index.html: stateless exclusion + tag hover highlights
alexey-milovidov May 10, 2026
caa8dc6
data.generated.js: regenerate after upstream merge
alexey-milovidov May 10, 2026
105ac07
Update results
alexey-milovidov May 10, 2026
06a1efb
Strip "lukewarm-cold-run" from new post-refactor results & firebolt t…
alexey-milovidov May 10, 2026
e8e5c7b
gizmosql/util.sh: stop_gizmosql actually waits; start_gizmosql retries
alexey-milovidov May 10, 2026
aec0a9c
{pandas,polars-dataframe,chdb-dataframe,duckdb-{dataframe,memory},daf…
alexey-milovidov May 10, 2026
31b0340
Reapply "stateless" tag to new Parquet/data-lake/Spark/Sail/Polars re…
alexey-milovidov May 10, 2026
c7b6c4e
Untrack accidentally-committed -old/template.json files
alexey-milovidov May 10, 2026
b282aa4
Rename BENCH_RESTARTABLE -> BENCH_DURABLE; reload non-durable between…
alexey-milovidov May 10, 2026
b54703f
Mark GlareDB systems as stateless
alexey-milovidov May 10, 2026
23e8cd0
New results
alexey-milovidov May 10, 2026
e7fae8f
Merge origin/main into refactor/per-system-script-interface
alexey-milovidov May 10, 2026
cc2655d
lib/benchmark-common.sh: bench_load syncs unconditionally, inside the…
alexey-milovidov May 10, 2026
62e95a1
Tag DuckDB-derivative systems with "DuckDB derivative"
alexey-milovidov May 10, 2026
3ce1279
Tag pg_mooncake and crunchy-bridge-for-analytics as DuckDB derivatives
alexey-milovidov May 10, 2026
1676c25
index.html: per-row remove control
alexey-milovidov May 10, 2026
2c1dad4
index.html: show measurement date in summary rows
alexey-milovidov May 10, 2026
527a3e6
Ensure every result entry has a date
alexey-milovidov May 10, 2026
f923d77
Add concurrent-QPS sustained-throughput test
alexey-milovidov May 11, 2026
6d5ee0a
New results
alexey-milovidov May 11, 2026
f26184d
Add changelog entry
alexey-milovidov May 11, 2026
21 changes: 21 additions & 0 deletions .gitignore
@@ -5,3 +5,24 @@
*.parquet
hits.csv
hits.tsv

# Per-system runtime artifacts produced by benchmark.sh
result.csv
log.txt
load_out.txt
server.log
server.pid
arc_token.txt
data-size.txt
.doris_home
.sirius_env

# Per-system data files
hits.db
mydb
hits.hyper
hits.vortex
*.vortex

# Python venvs created by install scripts
myenv/
15 changes: 14 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,19 @@

Changes in the benchmark methodology or presentation, as well as major news.

### 2026-05-11
Unified the benchmark scripts across systems by providing a common interface as a set of per-system scripts: `install`, `start`, `check`, `stop`, `load`, `query`, and `data-size`. The dataset download scripts were made common as well, and a general benchmark runner in `lib/` ensures that different systems get equal treatment. This makes it easier to add more ways of testing, different datasets, and scenarios to the benchmark, and it simplifies supporting all 88 systems presented. Note: embedded systems, such as SQLite and the Python duckdb module, are wrapped in a Python HTTP server so that the benchmark can run each query separately.
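
For illustration, a minimal sketch of how such a runner can drive the per-system scripts from a system's directory. This is not the actual code in `lib/benchmark-common.sh` (which also handles timeouts, retries, cache flushing, and result formatting); `queries.sql` and `result.csv` follow the repository's conventions, and the loop structure is simplified:

```bash
#!/bin/bash
# Hypothetical driver for the per-system script interface (simplified sketch).
set -e

./install                               # install the system (idempotent)
./start                                 # start the server
until ./check >/dev/null 2>&1; do       # wait until it accepts queries
    sleep 1
done

./load                                  # create the table and load the hits dataset
./data-size                             # print the on-disk size in bytes

# ./query reads one SQL statement from stdin and reports its runtime.
while read -r sql; do
    for run in 1 2 3; do
        echo "$sql" | ./query >/dev/null 2>>result.csv
    done
done < queries.sql

./stop
```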

Databases are now restarted before measuring the cold run of each query, as requested in [#667](https://github.com/ClickHouse/ClickBench/issues/667) and [#793](https://github.com/ClickHouse/ClickBench/issues/793). This prevents unfair measurements and closes a loophole for systems that do excessive in-process caching without flushing it before the cold run. Flushing the OS page cache before the cold run is also unified, so that all benchmark entries follow the same rules. Notes: for stateless systems (such as query engines on top of Parquet) the restart is a no-op; for non-durable and in-memory systems, the restart before each query also requires reloading the data, and that loading time is included in the cold query measurement.
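
A sketch of what the per-query cold-run preparation amounts to under these rules (illustrative only; the real sequence lives in `lib/benchmark-common.sh`, the semantics of `BENCH_DURABLE` is simplified, and `$QUERY` stands for the current query):

```bash
#!/bin/bash
# Sketch: make the first (cold) run of each query actually cold.
set -e

./stop                                                   # restart the server to drop in-process caches
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null    # flush the OS page cache
./start
until ./check >/dev/null 2>&1; do sleep 1; done

# Non-durable and in-memory systems lose their data on restart, so the reload
# is repeated here and its time counts toward the cold measurement.
if [ "$BENCH_DURABLE" != "yes" ]; then
    ./load
fi

echo "$QUERY" | ./query                                  # cold run; warm runs follow without a restart
```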

Introduced a new measurement: QPS and error rate under a concurrent workload (10 connections for 10 minutes), to demonstrate the advantage of the refactoring. Currently, the metric is not exposed in the benchmark.
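
A rough sketch of such a measurement, assuming ten workers that loop over `./query` for the duration (the real harness keeps persistent connections; here every request spawns a fresh `./query` process for simplicity, and the query text is a placeholder):

```bash
#!/bin/bash
# Sketch: sustained-throughput test with N workers for a fixed duration.
CONNECTIONS=10
DURATION=600                                  # 10 minutes
END=$(( $(date +%s) + DURATION ))

worker() {
    local ok=0 err=0
    while [ "$(date +%s)" -lt "$END" ]; do
        if echo "SELECT COUNT(*) FROM hits" | ./query >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            err=$((err + 1))
        fi
    done
    echo "$ok $err" > "worker_$1.txt"
}

for i in $(seq 1 "$CONNECTIONS"); do
    worker "$i" &
done
wait

# Aggregate QPS and error rate across all workers.
awk -v dur="$DURATION" '{ ok += $1; err += $2 }
    END { printf "QPS: %.2f  error rate: %.2f%%\n", ok / dur, 100 * err / (ok + err) }' worker_*.txt
```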

Re-ran 88 systems on every machine. Fixed the queries that use regexps for MariaDB and SQLite. Added ARM64 versions for some systems: databend, octosql, octeryx. Switched MariaDB to the faster data loader. An attempt to rerun CedarDB exposed a bug. Added new systems: Trino, Presto, Quickwit. Added a generic runner for pandas and polars. Fixed issues with the Spark variants. Cleaned up some tags. Some systems were found dead: vertica, kinetica, singlestore, heavyai.

Improved the website: the important selectors (open-source, hardware, tuned) are moved to the top and shown horizontally; they also filter the visible options in the other selectors. Hovering the mouse pointer over a system highlights its tags. Added a button on the diagram to remove a system from the report. Added the measurement date to the diagram (as requested in [#639](https://github.com/ClickHouse/ClickBench/issues/639)). Shortened some cloud machine names to reduce clutter. The report methodology (aggregation of the measurements) and the default selection remain unchanged.

(Alexey Milovidov)

### 2026-05-08
Refactored directory structure to keep every historical result - they are organized in directories `system/results/YYYYMMDD/*.json` for each date. Compared to using git history, this unifies the format and structure of the results, making them ready for analysis. You can analyze it with clickhouse-local: `ch "SELECT * FROM '*/results/*/*.json'"` or export the data: `ch "SELECT * FROM '*/results/*/*.json' ORDER BY _path INTO OUTFILE 'results.parquet'"` (Alexey Milovidov).

@@ -56,7 +69,7 @@ The systems on the main chart are distinguished by color (systems from the same

Added the "open-source" and "proprietary" tags, so that you can list only open-source databases. For the reference, Umbra, Hyper, and CedarDB are proprietary.

Removed pointless tags, that some systems attribute to themself. One system misattributed itself as "mysql-compatible", two others added tags with their names, another reported two programming languages, a few systems reported an "analytical" tag, which is pointless, and one system didn't report itself as "ClickHouse-derivative" while being based on the ClickHouse interfaces and architecture.
Removed pointless tags that some systems attribute to themselves. One system misattributed itself as "mysql-compatible", two others added tags with their names, another reported two programming languages, a few systems reported an "analytical" tag, which is pointless, and one system didn't report itself as "ClickHouse-derivative" while being based on the ClickHouse interfaces and architecture.

Some systems provided bogus results on the loading time or data size. For example, one system reported data size 1000 times less, and we didn't notice that. This was corrected. The comparison on the loading time will not include stateless systems that don't require data loading.

207 changes: 4 additions & 203 deletions arc/benchmark.sh
@@ -1,204 +1,5 @@
#!/bin/bash
# Arc ClickBench Complete Benchmark Script (Go Binary Version)
set -e

# ============================================================
# 1. INSTALL ARC FROM .DEB PACKAGE
# ============================================================
echo "Installing Arc from .deb package..."

# Fetch latest Arc version from GitHub releases
echo "Fetching latest Arc version..."
ARC_VERSION=$(curl -s https://api.github.com/repos/Basekick-Labs/arc/releases/latest | grep -oP '"tag_name": "v\K[^"]+')
if [ -z "$ARC_VERSION" ]; then
    echo "Error: Could not fetch latest Arc version from GitHub"
    exit 1
fi
echo "Latest Arc version: $ARC_VERSION"

ARCH=$(uname -m)
if [ "$ARCH" = "aarch64" ] || [ "$ARCH" = "arm64" ]; then
    DEB_URL="https://github.com/Basekick-Labs/arc/releases/download/v${ARC_VERSION}/arc_${ARC_VERSION}_arm64.deb"
    DEB_FILE="arc_${ARC_VERSION}_arm64.deb"
else
    DEB_URL="https://github.com/Basekick-Labs/arc/releases/download/v${ARC_VERSION}/arc_${ARC_VERSION}_amd64.deb"
    DEB_FILE="arc_${ARC_VERSION}_amd64.deb"
fi

echo "Detected architecture: $ARCH -> $DEB_FILE"

if [ ! -f "$DEB_FILE" ]; then
    wget -q "$DEB_URL" -O "$DEB_FILE"
fi

sudo dpkg -i "$DEB_FILE" || sudo apt-get install -f -y
echo "[OK] Arc installed"

# ============================================================
# 2. PRINT SYSTEM INFO (Arc defaults)
# ============================================================
CORES=$(nproc)
TOTAL_MEM_KB=$(grep MemTotal /proc/meminfo | awk '{print $2}')
TOTAL_MEM_GB=$((TOTAL_MEM_KB / 1024 / 1024))
MEM_LIMIT_GB=$((TOTAL_MEM_GB * 80 / 100)) # 80% of system RAM

echo ""
echo "System Configuration:"
echo " CPU cores: $CORES"
echo " Connections: $((CORES * 2)) (cores × 2)"
echo " Threads: $CORES (same as cores)"
echo " Memory limit: ${MEM_LIMIT_GB}GB (80% of ${TOTAL_MEM_GB}GB total)"
echo ""

# ============================================================
# 3. START ARC AND CAPTURE TOKEN FROM LOGS
# ============================================================
echo "Starting Arc service..."

# Check if we already have a valid token from a previous run
if [ -f "arc_token.txt" ]; then
    EXISTING_TOKEN=$(cat arc_token.txt)
    echo "Found existing token file, will verify after Arc starts..."
fi

sudo systemctl start arc

# Wait for Arc to be ready
echo "Waiting for Arc to be ready..."
for i in {1..30}; do
    if curl -sf http://localhost:8000/health > /dev/null 2>&1; then
        echo "[OK] Arc is ready!"
        break
    fi
    if [ $i -eq 30 ]; then
        echo "Error: Arc failed to start within 30 seconds"
        sudo journalctl -u arc --no-pager | tail -50
        exit 1
    fi
    sleep 1
done

# Try to get token - either from existing file or from logs (first run)
ARC_TOKEN=""

# First, check if existing token works
if [ -n "$EXISTING_TOKEN" ]; then
    if curl -sf http://localhost:8000/health -H "x-api-key: $EXISTING_TOKEN" > /dev/null 2>&1; then
        ARC_TOKEN="$EXISTING_TOKEN"
        echo "[OK] Using existing token from arc_token.txt"
    else
        echo "Existing token invalid, looking for new token in logs..."
    fi
fi

# If no valid token yet, try to extract from logs (first run scenario)
if [ -z "$ARC_TOKEN" ]; then
    ARC_TOKEN=$(sudo journalctl -u arc --no-pager | grep -oP '(?:Initial admin API token|Admin API token): \K[^\s]+' | head -1)
    if [ -n "$ARC_TOKEN" ]; then
        echo "[OK] Captured new token from logs"
        echo "$ARC_TOKEN" > arc_token.txt
    else
        echo "Error: Could not find or validate API token"
        echo "If this is not the first run, Arc's database may need to be reset:"
        echo " sudo rm -rf /var/lib/arc/data/arc.db"
        exit 1
    fi
fi

echo "Token: ${ARC_TOKEN:0:20}..."

# ============================================================
# 4. DOWNLOAD DATASET
# ============================================================
DATASET_FILE="hits.parquet"
DATASET_URL="https://datasets.clickhouse.com/hits_compatible/hits.parquet"
EXPECTED_SIZE=14779976446

if [ -f "$DATASET_FILE" ]; then
    CURRENT_SIZE=$(stat -c%s "$DATASET_FILE" 2>/dev/null || stat -f%z "$DATASET_FILE" 2>/dev/null)
    if [ "$CURRENT_SIZE" -eq "$EXPECTED_SIZE" ]; then
        echo "[OK] Dataset already downloaded (14GB)"
    else
        echo "Re-downloading dataset (size mismatch)..."
        rm -f "$DATASET_FILE"
        wget --continue --progress=dot:giga "$DATASET_URL"
    fi
else
    echo "Downloading ClickBench dataset (14GB)..."
    wget --continue --progress=dot:giga "$DATASET_URL"
fi

# ============================================================
# 5. LOAD DATA INTO ARC
# ============================================================
echo "Loading data into Arc..."

# Determine Arc's data directory (default: /var/lib/arc/data)
ARC_DATA_DIR="/var/lib/arc/data"
TARGET_DIR="$ARC_DATA_DIR/clickbench/hits"
TARGET_FILE="$TARGET_DIR/hits.parquet"

sudo mkdir -p "$TARGET_DIR"

if [ -f "$TARGET_FILE" ]; then
    SOURCE_SIZE=$(stat -c%s "$DATASET_FILE" 2>/dev/null || stat -f%z "$DATASET_FILE" 2>/dev/null)
    TARGET_SIZE=$(stat -c%s "$TARGET_FILE" 2>/dev/null || stat -f%z "$TARGET_FILE" 2>/dev/null)
    if [ "$SOURCE_SIZE" -eq "$TARGET_SIZE" ]; then
        echo "[OK] Data already loaded"
    else
        echo "Reloading data (size mismatch)..."
        sudo cp "$DATASET_FILE" "$TARGET_FILE"
    fi
else
    sudo cp "$DATASET_FILE" "$TARGET_FILE"
    echo "[OK] Data loaded to $TARGET_FILE"
fi

# ============================================================
# 6. SET ENVIRONMENT AND RUN BENCHMARK
# ============================================================
export ARC_URL="http://localhost:8000"
export ARC_API_KEY="$ARC_TOKEN"
export DATABASE="clickbench"
export TABLE="hits"

echo ""
echo "Running ClickBench queries (true cold runs)..."
echo "================================================"
./run.sh 2>&1 | tee log.txt

# ============================================================
# 7. STOP ARC AND FORMAT RESULTS
# ============================================================
echo "Stopping Arc..."
sudo systemctl stop arc

# Format results as proper JSON array
cat log.txt | grep -oE '^[0-9]+\.[0-9]+|^null' | \
    awk '{
        if (NR % 3 == 1) printf "[";
        printf "%s", $1;
        if (NR % 3 == 0) print "],";
        else printf ", ";
    }' > results.txt

echo ""
echo "[OK] Benchmark complete!"
echo "================================================"
echo "Load time: 0"
echo "Data size: $EXPECTED_SIZE"
cat results.txt
echo "================================================"

# ============================================================
# 8. CLEANUP
# ============================================================
echo "Cleaning up..."

# Uninstall Arc package
sudo dpkg -r arc || true

# Remove Arc data directory
sudo rm -rf /var/lib/arc

echo "[OK] Cleanup complete"
# Thin shim — actual flow is in lib/benchmark-common.sh.
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-single"
export BENCH_DURABLE=yes
exec ../lib/benchmark-common.sh
11 changes: 11 additions & 0 deletions arc/check
@@ -0,0 +1,11 @@
#!/bin/bash
set -e

ARC_URL="${ARC_URL:-http://localhost:8000}"
TOKEN=$(cat arc_token.txt 2>/dev/null || true)

if [ -n "$TOKEN" ]; then
    curl -sf "$ARC_URL/health" -H "x-api-key: $TOKEN" >/dev/null
else
    curl -sf "$ARC_URL/health" >/dev/null
fi
10 changes: 10 additions & 0 deletions arc/data-size
@@ -0,0 +1,10 @@
#!/bin/bash
set -e

# Source parquet file size (loaded into Arc's data directory).
F="/var/lib/arc/data/clickbench/hits/hits.parquet"
if [ -f "$F" ]; then
    sudo stat -c%s "$F"
else
    echo 14779976446
fi
28 changes: 28 additions & 0 deletions arc/install
@@ -0,0 +1,28 @@
#!/bin/bash
set -e

# Install Arc from a .deb release. Idempotent.
if dpkg -l arc 2>/dev/null | grep -q '^ii '; then
    exit 0
fi

ARC_VERSION=$(curl -s https://api.github.com/repos/Basekick-Labs/arc/releases/latest \
    | grep -oP '"tag_name": "v\K[^"]+')
if [ -z "$ARC_VERSION" ]; then
    echo "Error: Could not fetch latest Arc version from GitHub" >&2
    exit 1
fi

ARCH=$(uname -m)
if [ "$ARCH" = "aarch64" ] || [ "$ARCH" = "arm64" ]; then
    DEB_FILE="arc_${ARC_VERSION}_arm64.deb"
else
    DEB_FILE="arc_${ARC_VERSION}_amd64.deb"
fi
DEB_URL="https://github.com/Basekick-Labs/arc/releases/download/v${ARC_VERSION}/${DEB_FILE}"

if [ ! -f "$DEB_FILE" ]; then
    wget -q "$DEB_URL" -O "$DEB_FILE"
fi

sudo dpkg -i "$DEB_FILE" || sudo apt-get install -f -y
20 changes: 20 additions & 0 deletions arc/load
@@ -0,0 +1,20 @@
#!/bin/bash
set -e

# Arc loads the parquet file into its data directory and indexes it on startup.
ARC_DATA_DIR="/var/lib/arc/data"
TARGET_DIR="$ARC_DATA_DIR/clickbench/hits"
TARGET_FILE="$TARGET_DIR/hits.parquet"

sudo mkdir -p "$TARGET_DIR"

if [ -f "$TARGET_FILE" ] && \
   [ "$(stat -c%s hits.parquet)" -eq "$(stat -c%s "$TARGET_FILE")" ]; then
    : # already loaded
else
    sudo cp hits.parquet "$TARGET_FILE"
fi

# Free up local space.
rm -f hits.parquet
sync
49 changes: 49 additions & 0 deletions arc/query
@@ -0,0 +1,49 @@
#!/bin/bash
# Reads a SQL query from stdin, POSTs it to Arc's HTTP API.
# Stdout: query response body (JSON).
# Stderr: query runtime in fractional seconds on the last line (extracted
# from Arc's journal log line `execution_time_ms=N`).
# Exit non-zero on error.
set -e

ARC_URL="${ARC_URL:-http://localhost:8000}"
ARC_API_KEY="${ARC_API_KEY:-$(cat arc_token.txt 2>/dev/null)}"

query=$(cat)

# Build JSON payload with proper escaping.
JSON_PAYLOAD=$(jq -Rs '{sql: .}' <<<"$query")

# Mark journal position so we can locate the matching execution_time_ms entry.
LOG_MARKER=$(date -u +"%Y-%m-%dT%H:%M:%S")

RESPONSE=$(curl -s -w "\n%{http_code}" \
    -X POST "$ARC_URL/api/v1/query" \
    -H "x-api-key: $ARC_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$JSON_PAYLOAD" \
    --max-time 300)

HTTP_CODE=$(printf '%s\n' "$RESPONSE" | tail -1)
BODY=$(printf '%s\n' "$RESPONSE" | head -n -1)

if [ "$HTTP_CODE" != "200" ]; then
    printf 'arc query failed: HTTP %s\n%s\n' "$HTTP_CODE" "$BODY" >&2
    exit 1
fi

# Result body to stdout.
printf '%s\n' "$BODY"

# Extract execution_time_ms from Arc's journal — give it a moment to flush.
sleep 0.1
EXEC_MS=$(sudo journalctl -u arc --since="$LOG_MARKER" --no-pager 2>/dev/null \
    | grep -oP 'execution_time_ms=\K[0-9]+' | tail -1)

if [ -z "$EXEC_MS" ]; then
    echo "Could not extract execution_time_ms from arc journal" >&2
    exit 1
fi

# Convert ms -> seconds and emit on stderr.
awk -v ms="$EXEC_MS" 'BEGIN { printf "%.4f\n", ms / 1000 }' >&2
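
A usage sketch for this script (assuming Arc is running and `arc_token.txt` exists in the current directory; the timing value is illustrative):

```bash
# Result rows go to stdout, the server-side runtime in seconds to stderr.
echo "SELECT COUNT(*) FROM hits" | ./query >result.json 2>time.txt
cat time.txt    # e.g. 0.1234
```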