AWS adapter
contractforge-aws is the AWS adapter for ContractForge. Its first target is
AWS Glue Spark writing Apache Iceberg tables in S3, cataloged in AWS Glue Data
Catalog and optionally governed through AWS Lake Formation.
Status
The adapter has a stable supported surface for the documented
aws_glue_iceberg scope. Real AWS runtime validation covers Supabase JDBC,
USGS REST, S3 file medallion, dedicated incremental files, controlled failure
paths and available-now streaming through AWS-native MSK. It supports planning,
rendering, S3 artifact publication, Glue job registration, stable runner
execution, Glue run/wait and Athena evidence audit for the supported surface.
| Area | Current support |
|---|---|
| Runtime | AWS Glue Spark |
| Table format | Apache Iceberg |
| Catalog | AWS Glue Data Catalog |
| Evidence | Iceberg control tables using the core evidence model |
| Artifact publication | S3 via environment.artifacts.uri |
| Deployment | contractforge-aws deploy creates or updates Glue jobs |
| Governance | Lake Formation review/apply helpers for supported operations |
Runtime evidence captured on 2026-06-02:
| Project | Result | Evidence |
|---|---|---|
| Supabase JDBC medallion | PASS | Contract-only AWS project completed through adapter CLI and the stable runtime/contractforge_aws_runner.py with five successful Glue jobs. Latest run wrote cost evidence for all five targets, retained bronze quarantine behavior, and audit shows no error rows. |
| USGS REST medallion | PASS | REST/GeoJSON medallion completed through deploy-project --run --wait --record-cost-evidence --audit-evidence using the stable runner. Latest audit shows all historical runs successful, all quality status PASSED, no quarantine rows, no error rows, and joined DPU-second cost rows for all four targets. |
| S3 file medallion | PASS | Three Glue jobs ran through the stable runner, target rows remained bronze=7, silver=7, gold=3, and audit records success, quality, quarantine and cost evidence. |
| Incremental files | PASS | Incremental-file project ran through the stable runner and recorded a successful existing-bookmark run with cost/audit evidence; historical runs preserve wave 1, wave 2 and no-new-input SKIPPED evidence. |
| Failure paths | PASS | ensure-evidence-tables created Athena Iceberg evidence/state tables, then two expected Glue failures ran through the stable runner with EXPECTED_FAILURE, failed-run rows, error evidence, abort-quality evidence and DPU-second cost evidence for both targets. |
| Available-now streaming | PASS | Azure Event Hubs and AWS MSK Kafka available-now jobs ran through the stable runner with checkpointed Glue streaming, success status, cost evidence and Athena audit over run, quality, quarantine, error and cost tables. MSK is the AWS-native Kafka maturity provider. |
Install
pip install contractforge-core contractforge-aws
Planning and rendering stay SDK-free on import. Optional runtime helpers import AWS SDKs lazily, or use caller-provided clients.
Environment
AWS needs an environment contract for artifact storage, Glue settings and the Iceberg warehouse:
adapter: aws
artifacts:
  uri: s3://contractforge-artifacts/prod/orders/
  include_contract_bundle: true
  include_normalized_contract: true
parameters:
  aws:
    iceberg:
      warehouse: s3://contractforge-warehouse/prod/
    glue_job:
      role_arn: arn:aws:iam::123456789012:role/ContractForgeGlueRole
artifacts.uri is where rendered scripts, manifests, original split contracts
and normalized contract snapshots are published.
Glue version 4.0, worker type G.1X, two workers, 60 minute timeout, zero
retries, library-runner mode and bookmark behavior are adapter defaults. Declare
them only when the project needs to override the defaults.
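The override rule can be pictured as a simple overlay of declared values on the defaults. The sketch below is illustrative only: the key names (glue_version, worker_type and so on) and the resolve_glue_settings helper are assumptions, not the adapter's documented schema. Only the default values come from the paragraph above.

```python
# Hypothetical key names; only the default values are taken from the text above.
ADAPTER_DEFAULTS = {
    "glue_version": "4.0",
    "worker_type": "G.1X",
    "number_of_workers": 2,
    "timeout_minutes": 60,
    "max_retries": 0,
}


def resolve_glue_settings(overrides=None):
    """Overlay project-declared values on a copy of the adapter defaults."""
    settings = dict(ADAPTER_DEFAULTS)
    settings.update(overrides or {})
    return settings


# A project that only needs more workers declares just that one value.
resolved = resolve_glue_settings({"number_of_workers": 10})
print(resolved)
```

Everything the project leaves undeclared keeps its default, which is why a minimal environment contract only needs the role ARN and the storage locations.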
Runtime model
AWS deployments now use the same operational shape as Databricks: the platform
job runs the ContractForge adapter library and receives the contract as runtime
configuration. The default Glue script is the stable
runtime/contractforge_aws_runner.py artifact. deploy publishes the contract
and environment as runtime JSON artifacts, registers the Glue job with their S3
URIs as arguments, and Glue loads the contract inside the adapter runtime.
The generated <target>.glue_job.py is still emitted for review, syntax
validation and the explicit generated_script fallback mode. The renderer is
still the base implementation today, but per-contract Glue code is no longer
the default deployed job script.
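As a rough picture of the library-runner model, the sketch below builds a dictionary shaped like a Glue CreateJob request that points at the stable runner and passes the runtime artifacts as job arguments. The top-level field names follow the Glue CreateJob request shape. The argument names (--contract_uri, --environment_uri), the artifact file names and the assumption that the runner is published under the artifacts prefix are hypothetical. No AWS call is made.

```python
def build_glue_job_definition(job_name, role_arn, artifacts_uri):
    """Return a dict shaped like a Glue CreateJob request (nothing is sent to AWS)."""
    base = artifacts_uri.rstrip("/")
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            # Assumed location of the stable runner under the artifacts prefix.
            "ScriptLocation": f"{base}/runtime/contractforge_aws_runner.py",
            "PythonVersion": "3",
        },
        "DefaultArguments": {
            # Hypothetical argument and file names for the runtime JSON artifacts.
            "--contract_uri": f"{base}/{job_name}.contract.json",
            "--environment_uri": f"{base}/{job_name}.environment.json",
        },
        # Adapter defaults described in the Environment section.
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
        "Timeout": 60,
        "MaxRetries": 0,
    }


definition = build_glue_job_definition(
    "contractforge_lake_bronze_orders",
    "arn:aws:iam::123456789012:role/ContractForgeGlueRole",
    "s3://contractforge-artifacts/prod/orders/",
)
print(definition["Command"]["ScriptLocation"])
```

The point of the shape is that the script location is the same for every job, while the contract travels as an argument.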
Review artifacts
Every render includes a deployment manifest:
<target>.deployment_manifest.json
The manifest lists generated artifacts, per-artifact bytes and lines, an
artifact_summary with total and runtime script size, and an
artifact_size_budget that marks generated Glue scripts WARN above 256 KiB.
Treat sudden growth as a review signal before publishing Glue scripts to S3.
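The size budget check amounts to a threshold comparison. A minimal sketch follows, assuming a hypothetical artifact_size_status helper and an "OK" label for scripts within budget. Only the 256 KiB threshold and the WARN label come from the text above.

```python
WARN_THRESHOLD_BYTES = 256 * 1024  # 256 KiB, from the text above


def artifact_size_status(size_bytes):
    """Return "WARN" when a generated script exceeds the budget, else "OK"."""
    return "WARN" if size_bytes > WARN_THRESHOLD_BYTES else "OK"


def summarize(artifacts):
    """artifacts: mapping of artifact name to size in bytes."""
    return {
        "total_bytes": sum(artifacts.values()),
        "status": {name: artifact_size_status(size) for name, size in artifacts.items()},
    }


summary = summarize({"orders.glue_job.py": 300_000, "orders.iam_policy.json": 2_048})
print(summary)
```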
Runtime-sensitive mappings, currently hash_diff_upsert, also render:
<target>.performance_profile.json
<target>.performance.sql
The profile is a benchmark plan, and the SQL is an Athena-compatible evidence
report over ctrl_ingestion_runs and ctrl_ingestion_cost. Neither artifact is a
benchmark result by itself. AWS hash-diff merge support stays at
SUPPORTED_WITH_WARNINGS until real Glue/Iceberg measurements are captured.
For hash-diff contracts, merge_keys and hash strategy have separate
meanings. merge_keys are the durable row identity used in the Iceberg
MERGE ON clause. Use hash_keys for explicit content hashing, or
hash_strategy: all_columns_except plus hash_exclude_columns for wide
tables. The AWS renderer automatically excludes ContractForge/framework
generated columns and prefilters unchanged rows before Iceberg MERGE. The AWS
planner returns AWS_HASH_DIFF_MERGE_KEYS_REQUIRED when hash_diff_upsert omits
merge_keys. Iceberg snapshot summaries expose physical file rewrite counters;
ContractForge augments operation_metrics_json with
hash_diff_candidate_rows and hash_input_columns so dashboards can separate
business change volume from Iceberg write amplification.
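The split between row identity and content hashing can be illustrated in plain Python. This is a conceptual sketch, not the renderer's Spark code: row_hash and hash_diff_candidates are hypothetical helpers, SHA-256 is an arbitrary choice of hash, and rows are simple dictionaries.

```python
import hashlib


def row_hash(row, hash_columns):
    """Hash the selected column values in a stable column order."""
    payload = "\x1f".join(f"{col}={row.get(col)!r}" for col in sorted(hash_columns))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hash_diff_candidates(incoming, existing, merge_keys, hash_columns):
    """Keep only incoming rows that are new or whose content hash changed."""
    existing_hashes = {
        tuple(row[key] for key in merge_keys): row_hash(row, hash_columns)
        for row in existing
    }
    candidates = []
    for row in incoming:
        identity = tuple(row[key] for key in merge_keys)
        if existing_hashes.get(identity) != row_hash(row, hash_columns):
            candidates.append(row)
    return candidates


columns = ["id", "name", "loaded_at"]
# all_columns_except: hash every column except the excluded ones.
hash_columns = [col for col in columns if col not in {"loaded_at"}]

existing = [
    {"id": 1, "name": "Ana", "loaded_at": "d1"},
    {"id": 2, "name": "Bo", "loaded_at": "d1"},
]
incoming = [
    {"id": 1, "name": "Ana", "loaded_at": "d2"},  # unchanged content, filtered out
    {"id": 2, "name": "Bob", "loaded_at": "d2"},  # changed content
    {"id": 3, "name": "Cy", "loaded_at": "d2"},   # new identity
]
candidates = hash_diff_candidates(incoming, existing, ["id"], hash_columns)
print([row["id"] for row in candidates])
```

Only the candidate rows would reach the MERGE, which is what lets a candidate-row count be compared against Iceberg's physical rewrite counters.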
The generated IAM artifact is also review-only:
<target>.iam_policy.json
It derives Glue Catalog, CloudWatch Logs, source S3, Iceberg warehouse, artifact S3, script and dependency-file permissions from the contract and environment. Dependency files and explicit script paths are exact S3 object ARNs; source, artifact and warehouse locations are prefix-scoped when the runtime must read or write multiple objects.
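The difference between exact object ARNs and prefix-scoped ARNs can be sketched as follows. The helper names and the dependency file path are hypothetical; the ARN layout is the standard S3 resource format.

```python
from urllib.parse import urlparse


def s3_object_arn(uri):
    """Exact ARN for a single object, as used for dependency files and script paths."""
    parsed = urlparse(uri)
    return f"arn:aws:s3:::{parsed.netloc}/{parsed.path.lstrip('/')}"


def s3_prefix_arn(uri):
    """Prefix-scoped ARN for locations where many objects are read or written."""
    parsed = urlparse(uri)
    prefix = parsed.path.strip("/")
    if not prefix:
        return f"arn:aws:s3:::{parsed.netloc}/*"
    return f"arn:aws:s3:::{parsed.netloc}/{prefix}/*"


# Hypothetical dependency file under the artifacts prefix.
print(s3_object_arn("s3://contractforge-artifacts/prod/orders/deps/driver.jar"))
print(s3_prefix_arn("s3://contractforge-warehouse/prod/"))
```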
Deploy flow
contractforge-aws deploy contracts/orders.ingestion.yaml --environment environments/prod.aws.yaml
This performs the adapter-owned AWS pipeline:
load bundle
-> plan and render
-> publish artifacts to S3
-> materialize Glue job definition
-> create or update Glue job
The registered Glue job then runs natively in AWS through the stable
ContractForge AWS runner. The core never imports boto3; runtime contract
loading is owned by contractforge-aws inside Glue.
For complete ingestion projects, use the project-level command:
contractforge-aws deploy-project project.yaml --dry-run --summary-only
contractforge-aws deploy-project project.yaml --run --wait \
--summary-only \
--audit-evidence \
--athena-output-location s3://bucket/athena-results/
deploy-project reads project.yaml, resolves the AWS environment, deploys
contracts in execution_order, and can start/wait Glue jobs. --dry-run
performs the same local project loading, planning, rendering and generated
Glue Python syntax compilation without AWS API calls.
Negative observability tests can mark a step with expected_result: failed and
run with --accept-expected-failures.
--summary-only also works for real runs: it keeps deployment/run/wait/cost
status plus artifact counts and bytes, but omits the verbose per-artifact S3
list.
When AWS Glue temporarily rejects a project step with
ConcurrentRunsExceededException, deploy-project retries start within the
same --max-wait-seconds budget using --poll-interval-seconds. This keeps the
operator path deterministic under account-level or job-level concurrency
limits without requiring manual Glue Studio intervention.
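The retry behavior can be sketched as a loop bounded by a deadline. This is an assumption about the general shape, not the adapter's code: ConcurrentRunsExceeded is a local stand-in for the Glue exception, and start_job is any callable that starts a run and returns its id.

```python
import time


class ConcurrentRunsExceeded(Exception):
    """Local stand-in for Glue's ConcurrentRunsExceededException."""


def start_with_retry(start_job, max_wait_seconds, poll_interval_seconds,
                     sleep=time.sleep, clock=time.monotonic):
    """Retry start_job until it succeeds or the wait budget would be exceeded."""
    deadline = clock() + max_wait_seconds
    while True:
        try:
            return start_job()
        except ConcurrentRunsExceeded:
            # Give up when another poll interval would overrun the budget.
            if clock() + poll_interval_seconds > deadline:
                raise
            sleep(poll_interval_seconds)
```

Injecting sleep and clock keeps the loop testable without real waiting; the defaults use the standard library timers.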
--audit-evidence is allowed only with --wait, so evidence queries run after
terminal Glue states.
Native project orchestration
AWS project orchestration is Step Functions based. ContractForge still plans
and renders contracts before deployment, then publishes the stable runner plus
runtime contract snapshots. The adapter maps project.yaml.execution_order
into a Step Functions state machine that starts Glue jobs through the Glue
.sync integration. Independent steps in the same dependency wave are emitted
as a Parallel state.
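The wave-to-state mapping can be sketched as below. The state names (Wave1, Wave2) and the build_state_machine helper are hypothetical, and the adapter's real output may carry more fields such as retries or job arguments. The resource ARN is the Step Functions Glue .sync integration.

```python
import json


def glue_task(job_name):
    """A Task state that starts a Glue job and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
    }


def build_state_machine(waves):
    """waves: list of lists of Glue job names, one inner list per dependency wave."""
    if not waves or any(not wave for wave in waves):
        raise ValueError("every wave needs at least one job")
    names = [f"Wave{index + 1}" for index in range(len(waves))]
    states = {}
    for index, wave in enumerate(waves):
        if len(wave) == 1:
            state = glue_task(wave[0])
        else:
            # Independent jobs in the same wave become branches of a Parallel state.
            state = {
                "Type": "Parallel",
                "Branches": [
                    {"StartAt": job, "States": {job: {**glue_task(job), "End": True}}}
                    for job in wave
                ],
            }
        if index + 1 < len(waves):
            state["Next"] = names[index + 1]
        else:
            state["End"] = True
        states[names[index]] = state
    return {"StartAt": names[0], "States": states}


machine = build_state_machine([["bronze_orders", "bronze_customers"], ["silver_orders"]])
print(json.dumps(machine, indent=2))
```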
Render without AWS orchestration API calls:
contractforge-aws deploy-project project.yaml --dry-run --render-orchestration --summary-only
Create or update the Step Functions state machine after Glue jobs are registered:
contractforge-aws deploy-project project.yaml --deploy-orchestration --summary-only
Create/update the state machine, start it and wait for terminal status:
contractforge-aws deploy-project project.yaml --deploy-orchestration --wait-orchestration --summary-only
Record Glue DPU-second cost evidence after an orchestrated run:
contractforge-aws deploy-project project.yaml --deploy-orchestration --wait-orchestration \
--record-cost-evidence \
--athena-output-location s3://bucket/athena-results/
When project.yaml.schedule is declared, the rendered orchestration output also
contains an EventBridge Scheduler target for the state machine. Deployment
requires parameters.aws.step_functions.role_arn; Scheduler deployment also
requires parameters.aws.scheduler.role_arn.
Use standard cron syntax in the project file and an IANA timezone name:
schedule:
  cron: "0 6 * * *"
  timezone: America/Sao_Paulo
  enabled: false
  adapters:
    aws:
      state: DISABLED
The AWS adapter renders this to the native EventBridge Scheduler expression
cron(0 6 * * ? *). AWS-native values such as state,
flexible_time_window and native expression belong under
schedule.adapters.aws.
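The translation from five-field cron to the six-field EventBridge form can be sketched as follows. This is a simplified illustration with a hypothetical helper name. It handles only the day-of-month/day-of-week placeholder and the added year field, and it does not translate numeric day-of-week values, which EventBridge numbers differently from standard cron.

```python
def to_eventbridge_cron(expression):
    """Convert a five-field cron expression to an EventBridge cron(...) expression."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields")
    minute, hour, day_of_month, month, day_of_week = fields
    # EventBridge requires "?" in exactly one of day-of-month and day-of-week.
    if day_of_week == "*":
        day_of_week = "?"
    elif day_of_month == "*":
        day_of_month = "?"
    else:
        raise ValueError("cannot constrain both day-of-month and day-of-week")
    # The sixth field is the year.
    return f"cron({minute} {hour} {day_of_month} {month} {day_of_week} *)"


print(to_eventbridge_cron("0 6 * * *"))
```

For the schedule shown above, this yields the same cron(0 6 * * ? *) expression the adapter renders.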
Direct Glue execution flags (--run / --wait) are mutually exclusive with
Step Functions execution flags (--run-orchestration / --wait-orchestration)
to prevent duplicate ingestion runs.
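The flag rules can be expressed as a small validation step. The sketch below is a hypothetical check, not the CLI's own parser. It encodes the two rules exactly as stated in this section: direct and orchestrated execution flags do not mix, and --audit-evidence needs --wait.

```python
DIRECT_FLAGS = {"--run", "--wait"}
ORCHESTRATION_FLAGS = {"--run-orchestration", "--wait-orchestration"}


def validate_flags(flags):
    """Return a list of error messages for a set of deploy-project flags."""
    flags = set(flags)
    errors = []
    if flags & DIRECT_FLAGS and flags & ORCHESTRATION_FLAGS:
        errors.append("direct Glue flags and Step Functions flags are mutually exclusive")
    if "--audit-evidence" in flags and "--wait" not in flags:
        errors.append("--audit-evidence requires --wait")
    return errors


print(validate_flags({"--run", "--wait", "--audit-evidence"}))
print(validate_flags({"--run", "--wait-orchestration"}))
```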
After runtime execution, audit the canonical evidence tables through Athena:
contractforge-aws audit-evidence \
--database contractforge_cf_supabase_e2e_ops \
--athena-output-location s3://contractforge-aws-smoke-449112696824-us-east-1/athena-results/
The audit returns runs by status, runs by quality status, quality rows,
quarantine rows, error rows and reconciled Glue DPU-second cost rows using the
canonical ContractForge evidence tables. Cost rollups join ctrl_ingestion_cost
to ctrl_ingestion_runs on run_id and target_table, so orphan platform cost
records are not counted.
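The effect of joining cost to runs can be shown with an in-memory SQLite database. This is only an illustration of the join semantics: the real audit runs in Athena, and every column other than run_id and target_table (status, dpu_seconds) is an assumed name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ctrl_ingestion_runs (run_id TEXT, target_table TEXT, status TEXT);
CREATE TABLE ctrl_ingestion_cost (run_id TEXT, target_table TEXT, dpu_seconds REAL);
INSERT INTO ctrl_ingestion_runs VALUES ('job_a:jr_1', 'bronze.orders', 'SUCCESS');
INSERT INTO ctrl_ingestion_cost VALUES ('job_a:jr_1', 'bronze.orders', 120.0);
-- Orphan cost row: no matching run, so the inner join drops it.
INSERT INTO ctrl_ingestion_cost VALUES ('job_x:jr_9', 'bronze.orders', 999.0);
""")

rows = conn.execute("""
SELECT r.target_table, SUM(c.dpu_seconds)
FROM ctrl_ingestion_cost AS c
JOIN ctrl_ingestion_runs AS r
  ON r.run_id = c.run_id AND r.target_table = c.target_table
GROUP BY r.target_table
""").fetchall()
print(rows)  # [('bronze.orders', 120.0)]
```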
Glue JobRun cost signals are reconciled after a terminal run because
DPUSeconds is available from the AWS API, not inside the generated Glue job.
Use --record-cost-evidence with waited project runs to append
ctrl_ingestion_cost rows without duplicating ctrl_ingestion_runs. The
adapter records cost under the canonical ContractForge run id
(job_name:glue_run_id) and keeps the raw Glue run id in the payload:
contractforge-aws deploy-project project.yaml --run --wait \
--record-cost-evidence \
--athena-output-location s3://bucket/athena-results/
For a Glue run that already completed, record cost evidence without rerunning the ingestion job:
contractforge-aws record-glue-cost contracts/bronze/orders.ingestion.yaml \
--environment environments/aws.environment.yaml \
--job-name contractforge_lake_bronze_orders \
--run-id jr_123 \
--athena-output-location s3://bucket/athena-results/
The command uses the same idempotent cost writer as deploy-project --record-cost-evidence. If the contract is not adjacent to its
.environment.yaml, pass --environment explicitly so the command writes to
the intended evidence database.
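The canonical run id and an idempotent cost write can be sketched together. Both helpers are hypothetical, and the choice of (run_id, target_table) as the idempotency key is an assumption; the source only states that the writer is idempotent and that the raw Glue run id stays in the payload.

```python
def canonical_run_id(job_name, glue_run_id):
    """Canonical ContractForge run id: job_name:glue_run_id."""
    return f"{job_name}:{glue_run_id}"


def record_cost(cost_rows, job_name, glue_run_id, target_table, dpu_seconds):
    """Append a cost row unless one already exists for this run and target."""
    run_id = canonical_run_id(job_name, glue_run_id)
    for row in cost_rows:
        if (row["run_id"], row["target_table"]) == (run_id, target_table):
            return False  # already recorded, nothing appended
    cost_rows.append({
        "run_id": run_id,
        "target_table": target_table,
        # The raw Glue run id is kept in the payload.
        "payload": {"glue_run_id": glue_run_id, "dpu_seconds": dpu_seconds},
    })
    return True


cost_rows = []
first = record_cost(cost_rows, "contractforge_lake_bronze_orders", "jr_123", "bronze.orders", 120.0)
second = record_cost(cost_rows, "contractforge_lake_bronze_orders", "jr_123", "bronze.orders", 120.0)
print(first, second, len(cost_rows))
```

Calling the writer twice for the same run leaves a single cost row, which is the property that makes record-glue-cost safe to rerun.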
For runtime-sensitive contracts such as hash_diff_upsert, render or execute the
benchmark report through the adapter CLI:
contractforge-aws benchmark-report contracts/customers.ingestion.yaml \
--environment environments/aws.environment.yaml
contractforge-aws benchmark-report contracts/customers.ingestion.yaml \
--environment environments/aws.environment.yaml \
--run \
--athena-output-location s3://bucket/athena-results/
To make the smoke-test lifecycle explicit, render a non-destructive cleanup plan:
contractforge-aws cleanup-project examples/real-world/aws-eventhubs-kafka-available-now/project.yaml
cleanup-project does not delete anything. It reads the same project and
environment contracts, renders the expected Glue job names, artifact S3 prefix,
warehouse S3 prefix, evidence database and any declared external cleanup
resources. The Event Hubs streaming example declares the Azure resource group
used by the Kafka-compatible Event Hubs namespace, so the cleanup output includes
the reviewed az group delete command without executing it.
To report release readiness, use:
contractforge-aws stabilization-report
The report separates two claims. supported_surface_ready: true means the
documented v0.1.0 AWS Glue/Iceberg surface has passed the real-project gates.
stable_final stays false while broader production-certification boundaries are
accepted but still open, such as hash-diff concurrency, unvalidated Kafka
providers, Lake Formation governance equivalence and historical.
Use --strict-final in CI only when those boundaries must block the release.
Supported semantics
| ContractForge feature | AWS mapping |
|---|---|
| append | Iceberg append |
| overwrite | Iceberg replace/overwrite |
| upsert | Iceberg MERGE INTO |
| hash_diff_upsert | Hash-diff staging plus Iceberg merge, with merge_keys for row identity and hash_keys or hash_strategy: all_columns_except for content comparison; performance warnings remain |
| Quality abort/warn/quarantine | Glue Data Quality where mappable, plus ContractForge evidence tables |
| incremental_files | Glue DynamicFrame + job bookmarks for eligible S3 formats; no-new-input bookmark runs write SKIPPED evidence without executing column-dependent preparation/write logic |
| JDBC/PostgreSQL | Spark/Glue JDBC with user-supplied drivers and secrets |
| Logical table refs | {{ table_ref:layer.table }} -> Glue Catalog/Iceberg names |
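The logical table ref mapping in the last row can be sketched as a template substitution. The helper, the layer-to-database mapping and the glue_catalog catalog name are all assumptions for illustration; only the {{ table_ref:layer.table }} syntax comes from the table above.

```python
import re

TABLE_REF = re.compile(r"\{\{\s*table_ref:([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\s*\}\}")


def resolve_table_refs(sql, layer_databases, catalog="glue_catalog"):
    """Replace each logical table ref with a catalog.database.table name."""
    def replace(match):
        layer, table = match.groups()
        # A missing layer raises KeyError rather than rendering a wrong name.
        return f"{catalog}.{layer_databases[layer]}.{table}"

    return TABLE_REF.sub(replace, sql)


resolved_sql = resolve_table_refs(
    "SELECT * FROM {{ table_ref:silver.orders }}",
    {"silver": "lake_silver"},  # hypothetical layer-to-database mapping
)
print(resolved_sql)
```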
The dedicated incremental-files fixture is
examples/real-world/aws-incremental-files. It validates portable
incremental_files contracts, Glue bookmark configuration, Iceberg rendering
and generated Python compilation before the upload-wave runtime test is run in
AWS. The real AWS validation covers wave 1, wave 2 and a no-new-input rerun; the
last run preserves the target row count and writes SKIPPED evidence with
skip_reason=no_new_input.
Available-now streaming jobs write per-micro-batch evidence to
ctrl_ingestion_streams and copy aggregate totals into final run evidence as
stream_batches, stream_rows_read, stream_rows_written and
stream_rows_quarantined. Final run, state and lineage rows-written evidence
prefer those ContractForge stream totals because the latest Iceberg snapshot may
only represent the last micro-batch. The Azure Event Hubs Kafka path has been
validated in AWS Glue with checkpoint progression, no-input reruns, quarantine,
target writes and cost evidence. Other Kafka/Event Hubs providers still return a
provider-review warning until their connector/runtime semantics are tested.
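The roll-up from per-micro-batch evidence to final run totals can be sketched as a plain aggregation. The output field names come from the paragraph above. The per-batch field names (rows_read, rows_written, rows_quarantined) and the helper itself are assumptions.

```python
def aggregate_stream_totals(batches):
    """Sum per-micro-batch evidence into the totals stored on the final run."""
    return {
        "stream_batches": len(batches),
        "stream_rows_read": sum(batch["rows_read"] for batch in batches),
        "stream_rows_written": sum(batch["rows_written"] for batch in batches),
        "stream_rows_quarantined": sum(batch["rows_quarantined"] for batch in batches),
    }


batches = [
    {"batch_id": 0, "rows_read": 100, "rows_written": 97, "rows_quarantined": 3},
    {"batch_id": 1, "rows_read": 40, "rows_written": 40, "rows_quarantined": 0},
]
totals = aggregate_stream_totals(batches)
print(totals)
```

In this example the last micro-batch alone wrote 40 rows, while the run wrote 137, which is the gap the stream totals are meant to close.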
historical, snapshot soft delete, some governance equivalence and proprietary native passthrough connectors remain review-required until validated for concrete AWS consumer engines.
Real validation
The Supabase JDBC medallion example validates the same project shape on Databricks and AWS: shared connection YAML, split contracts, logical table refs, JDBC partitioned reads, quality quarantine, annotations, operations and control tables. Snowflake parity is covered by the Snowflake stable-surface evidence and the cross-adapter contracts where Snowflake source bindings are supported.
See the repository examples:
examples/real-world/supabase-jdbc-medallion/
examples/real-world/usgs-earthquake-rest-medallion/
examples/real-world/s3-file-medallion/
examples/real-world/aws-failure-paths/