Performance and benchmarks
ContractForge records operational timing evidence for every run. Use that evidence for dashboards, and use the benchmark harness when you need repeatable write-mode comparisons in a specific runtime.
Performance model
ContractForge standardizes ingestion behavior, but runtime performance still depends on source size, write mode, table layout, Spark runtime, storage access and governance operations.
Write mode cost
Append and overwrite are usually cheaper than merge-based modes. The historical and snapshot soft delete modes perform additional target reads and SQL MERGE work.
Stage evidence
stage_durations records timings for the read, prepare, schema, watermark, quality, write, governance, state and lineage stages. A stage has a timing only when it actually runs.
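As a rough illustration of how this evidence can be used, the sketch below ranks stages by their share of total run time. It assumes `stage_durations` is available as a mapping from stage name to seconds; the exact shape in the control tables may differ, and `rank_stages` is a hypothetical helper, not part of ContractForge.

```python
def rank_stages(stage_durations):
    """Return (stage, seconds, share) tuples, slowest stage first.

    Assumes stage_durations maps stage name -> seconds (an assumption,
    not a documented ContractForge shape). Stages with no timing (None)
    are skipped.
    """
    timed = {k: float(v) for k, v in stage_durations.items() if v is not None}
    total = sum(timed.values())
    ranked = sorted(timed.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, secs, (secs / total if total else 0.0)) for name, secs in ranked]


# Illustrative values only, not measured results.
example = {"read": 4.0, "quality": 1.0, "write": 15.0, "lineage": None}
for name, secs, share in rank_stages(example):
    print(f"{name:<10} {secs:6.1f}s {share:6.1%}")
```

A stage that dominates the share is usually the first place to look before changing write mode or layout.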
Runtime separation
Do not compare local Spark, classic clusters and serverless jobs as if they were one platform. Keep benchmark records by runtime.
Layout matters
Use Liquid Clustering or partitioning only when access patterns justify it. Layout choices affect merge scans, optimization and query latency.
Recommended practices
| Area | Guidance |
|---|---|
| Cache | Set use_cache=true only when expensive stages reuse the same DataFrame. When memory pressure appears, disable the cache first. |
| JDBC | Use partitioning, fetch size and source-side predicates for large tables. Avoid unbounded queries against production databases. |
| REST APIs | Keep page, byte, timeout and retry limits explicit. Land very large extracts before ingestion. |
| Delta layout | Prefer cluster_columns for new Databricks tables when Liquid Clustering matches query patterns. Avoid high-cardinality partitions. |
| Quality gates | Use quarantine for row-isolating checks and abort for set-level checks. Watch quality stage timing in control tables. |
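For the JDBC row, a bounded, partitioned read in plain PySpark looks roughly like the sketch below. The option names (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are standard Spark JDBC data source options. How ContractForge contracts expose them is not shown here, and the helper name, URL, table and column names are hypothetical.

```python
def jdbc_read_options(url, table, column, lower, upper, partitions,
                      fetch_size=10_000, predicate=None):
    """Build options for a partitioned Spark JDBC read.

    Note: lowerBound/upperBound only set the partition stride; they do
    not filter rows. Use `predicate` to bound the query at the source.
    """
    if partitions < 1 or lower >= upper:
        raise ValueError("need partitions >= 1 and lower < upper")
    # Wrapping the table in a subquery pushes the filter to the database.
    source = f"(SELECT * FROM {table} WHERE {predicate}) src" if predicate else table
    return {
        "url": url,
        "dbtable": source,
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(partitions),
        "fetchsize": str(fetch_size),
    }


options = jdbc_read_options(
    url="jdbc:postgresql://db.example.internal:5432/sales",  # hypothetical
    table="public.orders",                                   # hypothetical
    column="order_id",                                       # hypothetical
    lower=1,
    upper=10_000_000,
    partitions=16,
    predicate="updated_at >= DATE '2024-01-01'",
)
# With an active SparkSession (not created here):
# df = spark.read.format("jdbc").options(**options).load()
```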
Write-mode benchmark harness
The repository includes scripts/benchmark_write_modes.py for opt-in runtime benchmarking. It creates deterministic synthetic data, runs selected official write modes and writes JSON Lines records with duration, throughput, row counters, stage durations, runtime type, Spark version and ContractForge version.
Not a normal CI test
Benchmark timings depend on cluster size, runtime version, table layout, storage and catalog configuration. Treat results as environment evidence, not universal product numbers.
```bash
# Dry run: check the benchmark configuration first.
python scripts/benchmark_write_modes.py --dry-run --rows 100000 --repeats 2

# Full run: write one JSON Lines record per benchmark run.
python scripts/benchmark_write_modes.py \
  --catalog main \
  --target-schema contractforge_bench \
  --ctrl-schema contractforge_bench_ops \
  --rows 1000000 \
  --partitions 32 \
  --repeats 3 \
  --reset \
  --output-jsonl benchmark_write_modes.jsonl
```
Benchmark output
| Field | Meaning |
|---|---|
| mode | Write mode under test. |
| requested_rows | Synthetic source row count requested. |
| rows_read / rows_written | Normalized ContractForge run counters. |
| rows_per_second | rows_written / duration_seconds when duration is available. |
| stage_durations | Per-stage timing evidence from the ingestion result. |
| runtime_type, spark_version, framework_version | Runtime evidence needed for comparison. |
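One way to compare records without mixing runtimes is to group them by runtime and mode before aggregating. The following is a minimal sketch, assuming each JSON Lines record carries `runtime_type`, `mode` and a `duration_seconds` field (the last is inferred from the `rows_per_second` definition above). `summarize` is a hypothetical helper and the sample values are invented.

```python
import json
import statistics
from collections import defaultdict


def summarize(lines):
    """Group benchmark records by (runtime_type, mode).

    Returns run count and median duration per group. Records without a
    duration are skipped rather than counted as zero.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        duration = record.get("duration_seconds")
        if duration is None:
            continue
        groups[(record["runtime_type"], record["mode"])].append(float(duration))
    return {
        key: {"runs": len(values), "median_seconds": statistics.median(values)}
        for key, values in sorted(groups.items())
    }


# Invented sample records; runtime_type values are placeholders.
sample = [
    '{"runtime_type": "classic", "mode": "append", "duration_seconds": 2.0}',
    '{"runtime_type": "classic", "mode": "append", "duration_seconds": 4.0}',
    '{"runtime_type": "serverless", "mode": "overwrite", "duration_seconds": 5.0}',
]
summary = summarize(sample)
for (runtime, mode), stats in summary.items():
    print(runtime, mode, stats)
```

For a real file, pass the open file handle instead of `sample`, for example `summarize(open("benchmark_write_modes.jsonl"))`.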
Dashboard query
```sql
SELECT
  target_table,
  mode,
  count(*) AS runs,
  round(avg(duration_seconds), 2) AS avg_duration_seconds,
  round(percentile_approx(duration_seconds, 0.95), 2) AS p95_duration_seconds,
  round(avg(rows_written / NULLIF(duration_seconds, 0)), 2) AS avg_rows_per_second
FROM main.ops.ctrl_ingestion_runs
WHERE status = 'SUCCESS'
GROUP BY target_table, mode
ORDER BY p95_duration_seconds DESC;
```
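Note that `avg_rows_per_second` here is the mean of per-run rates. Runs with a zero duration drop out because `NULLIF` turns them into `NULL` and `avg` ignores `NULL`. The sketch below mirrors that arithmetic in Python on invented numbers and contrasts it with total rows over total time; both helper names are hypothetical.

```python
def avg_rate_per_run(runs):
    """Mean of per-run rows/second, skipping zero or missing durations."""
    rates = [rows / secs for rows, secs in runs if secs]
    return round(sum(rates) / len(rates), 2) if rates else None


def pooled_rate(runs):
    """Total rows over total seconds across the same runs."""
    total_rows = sum(rows for rows, secs in runs if secs)
    total_secs = sum(secs for rows, secs in runs if secs)
    return round(total_rows / total_secs, 2) if total_secs else None


# (rows_written, duration_seconds) pairs; invented values.
runs = [(1000, 10.0), (1000, 1.0), (500, 0.0)]
print(avg_rate_per_run(runs))  # 550.0
print(pooled_rate(runs))       # 181.82
```

The two figures can differ widely when run durations vary, so keep the same definition when comparing dashboards over time.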