AWS adapter

contractforge-aws is the AWS adapter for ContractForge. Its first target is AWS Glue Spark writing Apache Iceberg tables in S3, cataloged in AWS Glue Data Catalog and optionally governed through AWS Lake Formation.

Status

The adapter has a stable supported surface for the documented aws_glue_iceberg scope. Real AWS runtime validation covers Supabase JDBC, USGS REST, S3 file medallion, dedicated incremental files, controlled failure paths and available-now streaming through AWS-native MSK. It supports planning, rendering, S3 artifact publication, Glue job registration, stable runner execution, Glue run/wait and Athena evidence audit for the supported surface.

Area	Current support
Runtime	AWS Glue Spark
Table format	Apache Iceberg
Catalog	AWS Glue Data Catalog
Evidence	Iceberg control tables using the core evidence model
Artifact publication	S3 via `environment.artifacts.uri`
Deployment	`contractforge-aws deploy` creates or updates Glue jobs
Governance	Lake Formation review/apply helpers for supported operations

Runtime evidence captured on 2026-06-02:

Project	Result	Evidence
Supabase JDBC medallion	`PASS`	Contract-only AWS project completed through adapter CLI and the stable `runtime/contractforge_aws_runner.py` with five successful Glue jobs. Latest run wrote cost evidence for all five targets, retained bronze quarantine behavior, and audit shows no error rows.
USGS REST medallion	`PASS`	REST/GeoJSON medallion completed through `deploy-project --run --wait --record-cost-evidence --audit-evidence` using the stable runner. Latest audit shows all historical runs successful, all quality status `PASSED`, no quarantine rows, no error rows, and joined DPU-second cost rows for all four targets.
S3 file medallion	`PASS`	Three Glue jobs ran through the stable runner, target rows remained bronze=7, silver=7, gold=3, and audit records success, quality, quarantine and cost evidence.
Incremental files	`PASS`	Incremental-file project ran through the stable runner and recorded a successful existing-bookmark run with cost/audit evidence; historical runs preserve wave 1, wave 2 and no-new-input `SKIPPED` evidence.
Failure paths	`PASS`	`ensure-evidence-tables` created Athena Iceberg evidence/state tables, then two expected Glue failures ran through the stable runner with `EXPECTED_FAILURE`, failed-run rows, error evidence, abort-quality evidence and DPU-second cost evidence for both targets.
Available-now streaming	`PASS`	Azure Event Hubs and AWS MSK Kafka available-now jobs ran through the stable runner with checkpointed Glue streaming, success status, cost evidence and Athena audit over run, quality, quarantine, error and cost tables. MSK is the AWS-native Kafka maturity provider.

Install

pip install contractforge-core contractforge-aws

Planning and rendering stay SDK-free on import. Optional runtime helpers import AWS SDKs lazily, or use caller-provided clients.

Environment

AWS needs an environment contract for artifact storage, Glue settings and the Iceberg warehouse:

adapter: aws
artifacts:
  uri: s3://contractforge-artifacts/prod/orders/
  include_contract_bundle: true
  include_normalized_contract: true
parameters:
  aws:
    iceberg:
      warehouse: s3://contractforge-warehouse/prod/
    glue_job:
      role_arn: arn:aws:iam::123456789012:role/ContractForgeGlueRole

artifacts.uri is where rendered scripts, manifests, original split contracts and normalized contract snapshots are published. Glue version 4.0, worker type G.1X, two workers, 60 minute timeout, zero retries, library-runner mode and bookmark behavior are adapter defaults. Declare them only when the project needs to override the defaults.
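The default-plus-override behavior described above can be pictured as a simple overlay. This is a minimal sketch, not the adapter's implementation: the default values come from the paragraph above, while the key names (`glue_version`, `number_of_workers`, and so on) and the helper `resolve_glue_job` are assumptions made for illustration.

```python
from copy import deepcopy

# Values taken from the text; the key names are hypothetical.
ADAPTER_GLUE_DEFAULTS = {
    "glue_version": "4.0",
    "worker_type": "G.1X",
    "number_of_workers": 2,
    "timeout_minutes": 60,
    "max_retries": 0,
}


def resolve_glue_job(environment: dict) -> dict:
    """Overlay parameters.aws.glue_job from the environment on the defaults."""
    declared = (
        environment.get("parameters", {}).get("aws", {}).get("glue_job", {})
    )
    resolved = deepcopy(ADAPTER_GLUE_DEFAULTS)
    resolved.update(declared)  # only declared keys replace defaults
    return resolved


environment = {
    "adapter": "aws",
    "parameters": {
        "aws": {
            "glue_job": {
                "role_arn": "arn:aws:iam::123456789012:role/ContractForgeGlueRole",
                "number_of_workers": 4,  # hypothetical override
            }
        }
    },
}
settings = resolve_glue_job(environment)
```

Only `number_of_workers` differs from the defaults here; every other setting is inherited, which is why the environment file can stay short.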

Runtime model

AWS deployments now use the same operational shape as Databricks: the platform job runs the ContractForge adapter library and receives the contract as runtime configuration. The default Glue script is the stable runtime/contractforge_aws_runner.py artifact. deploy publishes the contract and environment as runtime JSON artifacts, registers the Glue job with their S3 URIs as arguments, and Glue loads the contract inside the adapter runtime.

The generated <target>.glue_job.py is still emitted for review, syntax validation and the explicit generated_script fallback mode. The renderer is still the base implementation today, but per-contract Glue code is no longer the default deployed job script.

Review artifacts

Every render includes a deployment manifest:

<target>.deployment_manifest.json

The manifest lists generated artifacts, per-artifact bytes and lines, an artifact_summary with total and runtime script size, and an artifact_size_budget that marks generated Glue scripts WARN above 256 KiB. Treat sudden growth as a review signal before publishing Glue scripts to S3.
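As a rough illustration of the size budget, the sketch below computes bytes and lines for one artifact and flags generated Glue scripts above 256 KiB. The field names (`bytes`, `lines`, `budget_status`) and the `OK` status are assumptions; only the `WARN` marker and the 256 KiB threshold come from the text above.

```python
WARN_THRESHOLD_BYTES = 256 * 1024  # 256 KiB, the threshold named in the text


def summarize_artifact(name: str, content: str) -> dict:
    """Return size facts for one rendered artifact (hypothetical field names)."""
    data = content.encode("utf-8")
    is_glue_script = name.endswith(".glue_job.py")
    over_budget = is_glue_script and len(data) > WARN_THRESHOLD_BYTES
    return {
        "name": name,
        "bytes": len(data),
        "lines": len(content.splitlines()),
        "budget_status": "WARN" if over_budget else "OK",
    }


small = summarize_artifact("orders.glue_job.py", "print('hello')\n")
large = summarize_artifact("orders.glue_job.py", "x = 1\n" * 50_000)
```

A reviewer could compare `bytes` between two renders of the same target to spot the sudden growth the manifest is meant to surface.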

Runtime-sensitive mappings, currently hash_diff_upsert, also render:

<target>.performance_profile.json
<target>.performance.sql

The profile is a benchmark plan, and the SQL is an Athena-compatible evidence report over ctrl_ingestion_runs and ctrl_ingestion_cost. Neither is a benchmark result by itself, so AWS hash-diff merge support stays at SUPPORTED_WITH_WARNINGS until real Glue/Iceberg measurements are captured.

For hash-diff contracts, merge_keys and hash strategy have separate meanings. merge_keys are the durable row identity used in the Iceberg MERGE ON clause. Use hash_keys for explicit content hashing, or hash_strategy: all_columns_except plus hash_exclude_columns for wide tables.

The AWS renderer automatically excludes ContractForge/framework generated columns and prefilters unchanged rows before Iceberg MERGE. The AWS planner returns AWS_HASH_DIFF_MERGE_KEYS_REQUIRED when hash_diff_upsert omits merge_keys.

Iceberg snapshot summaries expose physical file rewrite counters; ContractForge augments operation_metrics_json with hash_diff_candidate_rows and hash_input_columns so dashboards can separate business change volume from Iceberg write amplification.
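The split between row identity and content comparison can be sketched in plain Python. This is an illustration of the idea only, not the Spark code the renderer emits: the SHA-256 choice, the field separator, the `_cf_loaded_at` column name and both helper functions are assumptions.

```python
import hashlib


def content_hash(row: dict, exclude: set) -> str:
    """Hash every column except the excluded ones (all_columns_except style)."""
    columns = sorted(c for c in row if c not in exclude)
    payload = "\x1f".join(f"{c}={row[c]!r}" for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_rows(incoming, existing_hashes, merge_keys, exclude):
    """Keep only rows whose identity is new or whose content hash differs."""
    candidates = []
    for row in incoming:
        identity = tuple(row[k] for k in merge_keys)  # durable row identity
        if existing_hashes.get(identity) != content_hash(row, exclude):
            candidates.append(row)
    return candidates


exclude = {"_cf_loaded_at"}  # hypothetical framework-generated column
stored = {"id": 1, "name": "a", "_cf_loaded_at": "t0"}
existing_hashes = {(1,): content_hash(stored, exclude)}

incoming = [
    {"id": 1, "name": "a", "_cf_loaded_at": "t1"},  # unchanged content
    {"id": 2, "name": "b", "_cf_loaded_at": "t1"},  # new identity
]
candidates = changed_rows(incoming, existing_hashes, ["id"], exclude)
```

Only the candidate rows would reach the MERGE, which is the reason a candidate-row count can differ from the number of files Iceberg rewrites.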

The generated IAM artifact is also review-only:

<target>.iam_policy.json

It derives Glue Catalog, CloudWatch Logs, source S3, Iceberg warehouse, artifact S3, script and dependency-file permissions from the contract and environment. Dependency files and explicit script paths are exact S3 object ARNs; source, artifact and warehouse locations are prefix-scoped when the runtime must read or write multiple objects.

Deploy flow

contractforge-aws deploy contracts/orders.ingestion.yaml --environment environments/prod.aws.yaml

This performs the adapter-owned AWS pipeline:

load bundle
  -> plan and render
  -> publish artifacts to S3
  -> materialize Glue job definition
  -> create or update Glue job

The registered Glue job then runs natively in AWS through the stable ContractForge AWS runner. The core never imports boto3; runtime contract loading is owned by contractforge-aws inside Glue.
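The final "create or update Glue job" step can be sketched against a caller-provided client. The boto3 Glue calls used here (`get_job`, `create_job`, `update_job`) are real; everything else is an assumption made for illustration, including the helper name, the script location under the artifact prefix and the `--contract_uri` / `--environment_uri` argument names.

```python
def create_or_update_glue_job(glue, name: str, definition: dict) -> str:
    """Return "created" or "updated".

    `glue` is a boto3 Glue client or any stand-in with the same methods.
    """
    try:
        glue.get_job(JobName=name)
    except glue.exceptions.EntityNotFoundException:
        glue.create_job(Name=name, **definition)
        return "created"
    # update_job replaces the stored definition with JobUpdate.
    glue.update_job(JobName=name, JobUpdate=definition)
    return "updated"


# Hypothetical job definition; paths and argument names are illustrative.
definition = {
    "Role": "arn:aws:iam::123456789012:role/ContractForgeGlueRole",
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": (
            "s3://contractforge-artifacts/prod/orders/"
            "runtime/contractforge_aws_runner.py"
        ),
        "PythonVersion": "3",
    },
    "DefaultArguments": {
        "--contract_uri": "s3://contractforge-artifacts/prod/orders/contract.json",
        "--environment_uri": "s3://contractforge-artifacts/prod/orders/environment.json",
    },
    "GlueVersion": "4.0",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 2,
    "Timeout": 60,
    "MaxRetries": 0,
}

# With real credentials this could be called as:
#   import boto3
#   create_or_update_glue_job(boto3.client("glue"), "orders_job", definition)
```

Passing the client in keeps the function importable without boto3, in line with the SDK-free import behavior described under Install.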

For complete ingestion projects, use the project-level command:

contractforge-aws deploy-project project.yaml --dry-run --summary-only
contractforge-aws deploy-project project.yaml --run --wait \
  --summary-only \
  --audit-evidence \
  --athena-output-location s3://bucket/athena-results/

deploy-project reads project.yaml, resolves the AWS environment, deploys contracts in execution_order, and can start Glue jobs and wait for them. --dry-run performs the same local project loading, planning, rendering and generated Glue Python syntax compilation without AWS API calls. Negative observability tests can mark a step with expected_result: failed and run with --accept-expected-failures.

--summary-only also works for real runs: it keeps deployment/run/wait/cost status plus artifact counts and bytes, but omits the verbose per-artifact S3 list.

When AWS Glue temporarily rejects a project step with ConcurrentRunsExceededException, deploy-project retries the start within the same --max-wait-seconds budget using --poll-interval-seconds. This keeps the operator path deterministic under account-level or job-level concurrency limits without requiring manual Glue Studio intervention.

--audit-evidence is allowed only with --wait, so evidence queries run after terminal Glue states.
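The retry on ConcurrentRunsExceededException can be read as a bounded polling loop. The sketch below is an assumption about the shape of that loop, not the adapter's code: the exception class is a local stand-in, and `start_with_retry` with its injectable `clock` and `sleep` is hypothetical.

```python
import time


class ConcurrentRunsExceeded(Exception):
    """Local stand-in for Glue's ConcurrentRunsExceededException."""


def start_with_retry(start_run, max_wait_seconds, poll_interval_seconds,
                     clock=time.monotonic, sleep=time.sleep):
    """Call start_run until it succeeds or the wait budget is exhausted."""
    deadline = clock() + max_wait_seconds
    while True:
        try:
            return start_run()
        except ConcurrentRunsExceeded:
            # Give up when another poll would overrun the shared budget.
            if clock() + poll_interval_seconds > deadline:
                raise
            sleep(poll_interval_seconds)
```

Because the deadline is computed once, time spent retrying the start is taken out of the same budget used for the overall wait, which matches the single --max-wait-seconds budget described above.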

Native project orchestration

AWS project orchestration is Step Functions based. ContractForge still plans and renders contracts before deployment, then publishes the stable runner plus runtime contract snapshots. The adapter maps project.yaml.execution_order into a Step Functions state machine that starts Glue jobs through the Glue .sync integration. Independent steps in the same dependency wave are emitted as a Parallel state.
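To make the wave-to-state mapping concrete, the sketch below builds an Amazon States Language definition from a list of dependency waves. The `glue:startJobRun.sync` resource ARN is the real Step Functions integration; the wave input format, the state names (`Wave1`, `Wave2`, ...) and the job names are assumptions, and how the adapter derives waves from execution_order is not shown.

```python
import json

GLUE_SYNC_RESOURCE = "arn:aws:states:::glue:startJobRun.sync"


def glue_task(job_name: str) -> dict:
    """A Task state that starts a Glue job and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": GLUE_SYNC_RESOURCE,
        "Parameters": {"JobName": job_name},
    }


def build_state_machine(waves: list) -> dict:
    """One state per wave; waves with several jobs become Parallel states."""
    if not waves:
        raise ValueError("at least one wave is required")
    states = {}
    for index, wave in enumerate(waves):
        if len(wave) == 1:
            state = glue_task(wave[0])
        else:
            state = {
                "Type": "Parallel",
                "Branches": [
                    {"StartAt": job, "States": {job: {**glue_task(job), "End": True}}}
                    for job in wave
                ],
            }
        if index + 1 < len(waves):
            state["Next"] = f"Wave{index + 2}"
        else:
            state["End"] = True
        states[f"Wave{index + 1}"] = state
    return {"StartAt": "Wave1", "States": states}


# Hypothetical job names grouped into dependency waves.
waves = [["bronze_orders"], ["silver_orders", "silver_customers"], ["gold_sales"]]
definition = build_state_machine(waves)
rendered = json.dumps(definition, indent=2)
```

The second wave becomes a Parallel state with two branches, so both silver jobs run concurrently and the gold job starts only after both finish.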

Render without AWS orchestration API calls:

contractforge-aws deploy-project project.yaml --dry-run --render-orchestration --summary-only

Create or update the Step Functions state machine after Glue jobs are registered:

contractforge-aws deploy-project project.yaml --deploy-orchestration --summary-only

Create/update the state machine, start it and wait for terminal status:

contractforge-aws deploy-project project.yaml --deploy-orchestration --wait-orchestration --summary-only

Record Glue DPU-second cost evidence after an orchestrated run:

contractforge-aws deploy-project project.yaml --deploy-orchestration --wait-orchestration \
  --record-cost-evidence \
  --athena-output-location s3://bucket/athena-results/

When project.yaml.schedule is declared, the output also contains an EventBridge Scheduler target for the state machine. Deployment requires parameters.aws.step_functions.role_arn; Scheduler deployment also requires parameters.aws.scheduler.role_arn.

Use standard cron syntax in the project file and an IANA timezone name:

schedule:
  cron: "0 6 * * *"
  timezone: America/Sao_Paulo
  enabled: false
  adapters:
    aws:
      state: DISABLED

The AWS adapter renders this to the native EventBridge Scheduler expression cron(0 6 * * ? *). AWS-native values such as state, flexible_time_window and native expression belong under schedule.adapters.aws.
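The translation from five-field cron to the six-field EventBridge form can be sketched as follows. This is a simplified illustration, not the adapter's converter: it only covers the rule that EventBridge needs a `?` in either day-of-month or day-of-week and appends a wildcard year, and it deliberately rejects numeric day-of-week values because EventBridge numbers weekdays differently (1 is Sunday).

```python
def to_eventbridge_cron(expression: str) -> str:
    """Convert a five-field cron expression to an EventBridge cron(...) string."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected a five-field cron expression")
    minute, hour, day_of_month, month, day_of_week = fields

    if day_of_week == "*":
        # No weekday constraint: EventBridge expects "?" in that position.
        day_of_week = "?"
    elif day_of_month == "*":
        if any(ch.isdigit() for ch in day_of_week):
            raise ValueError("numeric day-of-week is not handled in this sketch")
        day_of_month = "?"
    else:
        raise ValueError("day-of-month and day-of-week cannot both be set")

    return f"cron({minute} {hour} {day_of_month} {month} {day_of_week} *)"


daily = to_eventbridge_cron("0 6 * * *")
weekdays = to_eventbridge_cron("30 8 * * MON-FRI")
```

The first call reproduces the expression quoted above, `cron(0 6 * * ? *)`.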

Direct Glue execution flags (--run / --wait) are mutually exclusive with Step Functions execution flags (--run-orchestration / --wait-orchestration) to prevent duplicate ingestion runs.

After runtime execution, audit the canonical evidence tables through Athena:

contractforge-aws audit-evidence \
  --database contractforge_cf_supabase_e2e_ops \
  --athena-output-location s3://contractforge-aws-smoke-449112696824-us-east-1/athena-results/

The audit returns runs by status, runs by quality status, quality rows, quarantine rows, error rows and reconciled Glue DPU-second cost rows using the canonical ContractForge evidence tables. Cost rollups join ctrl_ingestion_cost to ctrl_ingestion_runs on run_id and target_table, so orphan platform cost records are not counted.
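The effect of joining cost to runs on both run_id and target_table can be shown with in-memory rows. This sketch mirrors the join semantics only; the real audit runs as Athena SQL, and the `dpu_seconds` column name and the sample values are assumptions.

```python
def reconciled_cost(runs: list, costs: list) -> dict:
    """Sum DPU-seconds per target, ignoring cost rows with no matching run."""
    known = {(r["run_id"], r["target_table"]) for r in runs}
    totals = {}
    for cost in costs:
        key = (cost["run_id"], cost["target_table"])
        if key not in known:
            continue  # orphan platform cost record, not counted
        table = cost["target_table"]
        totals[table] = totals.get(table, 0.0) + cost["dpu_seconds"]
    return totals


runs = [
    {"run_id": "job_a:jr_1", "target_table": "bronze.orders"},
    {"run_id": "job_a:jr_2", "target_table": "bronze.orders"},
]
costs = [
    {"run_id": "job_a:jr_1", "target_table": "bronze.orders", "dpu_seconds": 120.0},
    {"run_id": "job_a:jr_2", "target_table": "bronze.orders", "dpu_seconds": 80.0},
    {"run_id": "job_x:jr_9", "target_table": "bronze.orders", "dpu_seconds": 999.0},
]
totals = reconciled_cost(runs, costs)
```

The third cost row has no matching run, so it is excluded from the rollup even though it names the same target table.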

Glue JobRun cost signals are reconciled after a terminal run because DPUSeconds is available from the AWS API, not inside the generated Glue job. Use --record-cost-evidence with waited project runs to append ctrl_ingestion_cost rows without duplicating ctrl_ingestion_runs. The adapter records cost under the canonical ContractForge run id (job_name:glue_run_id) and keeps the raw Glue run id in the payload:

contractforge-aws deploy-project project.yaml --run --wait \
  --record-cost-evidence \
  --athena-output-location s3://bucket/athena-results/

For a Glue run that already completed, record cost evidence without rerunning the ingestion job:

contractforge-aws record-glue-cost contracts/bronze/orders.ingestion.yaml \
  --environment environments/aws.environment.yaml \
  --job-name contractforge_lake_bronze_orders \
  --run-id jr_123 \
  --athena-output-location s3://bucket/athena-results/

The command uses the same idempotent cost writer as deploy-project --record-cost-evidence. If the contract is not adjacent to its .environment.yaml, pass --environment explicitly so the command writes to the intended evidence database.
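The canonical run id and the idempotent append can be sketched together. The `Id`, `JobRunState` and `DPUSeconds` fields are part of the Glue JobRun structure returned by `get_job_run`; the row layout, the helper names and the duplicate check keyed on run id and target table are assumptions about how such a writer could behave.

```python
def canonical_run_id(job_name: str, glue_run_id: str) -> str:
    """ContractForge run id in the job_name:glue_run_id form."""
    return f"{job_name}:{glue_run_id}"


def cost_row(job_name: str, target_table: str, job_run: dict) -> dict:
    """Build one cost evidence row from a Glue JobRun dict."""
    return {
        "run_id": canonical_run_id(job_name, job_run["Id"]),
        "target_table": target_table,
        "dpu_seconds": job_run.get("DPUSeconds"),
        # The raw Glue run id is kept in the payload.
        "payload": {"glue_run_id": job_run["Id"], "state": job_run["JobRunState"]},
    }


def append_once(existing_rows: list, row: dict) -> bool:
    """Append the row unless the same run and target is already recorded."""
    key = (row["run_id"], row["target_table"])
    if any((r["run_id"], r["target_table"]) == key for r in existing_rows):
        return False
    existing_rows.append(row)
    return True


job_run = {"Id": "jr_123", "JobRunState": "SUCCEEDED", "DPUSeconds": 240.0}
row = cost_row("contractforge_lake_bronze_orders", "bronze.orders", job_run)
table = []
first = append_once(table, row)
second = append_once(table, row)  # recording again is a no-op
```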

For runtime-sensitive contracts such as hash_diff_upsert, render or execute the benchmark report through the adapter CLI:

contractforge-aws benchmark-report contracts/customers.ingestion.yaml \
  --environment environments/aws.environment.yaml

contractforge-aws benchmark-report contracts/customers.ingestion.yaml \
  --environment environments/aws.environment.yaml \
  --run \
  --athena-output-location s3://bucket/athena-results/

To make smoke-test lifecycle explicit, render a non-destructive cleanup plan:

contractforge-aws cleanup-project examples/real-world/aws-eventhubs-kafka-available-now/project.yaml

cleanup-project does not delete anything. It reads the same project and environment contracts, renders the expected Glue job names, artifact S3 prefix, warehouse S3 prefix, evidence database and any declared external cleanup resources. The Event Hubs streaming example declares the Azure resource group used by the Kafka-compatible Event Hubs namespace, so the cleanup output includes the reviewed az group delete command without executing it.

To report release readiness, use:

contractforge-aws stabilization-report

The report separates two claims. supported_surface_ready: true means the documented v0.1.0 AWS Glue/Iceberg surface has passed the real-project gates. stable_final stays false while broader production-certification boundaries are accepted but still open, such as hash-diff concurrency, unvalidated Kafka providers, Lake Formation governance equivalence and historical. Use --strict-final in CI only when those boundaries must block the release.
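One way to read the two claims is as separate gates on a CI exit code. The sketch below is hypothetical: the report is assumed to be a dict with the two fields named above, and the exit codes are illustrative rather than the CLI's documented behavior.

```python
def exit_code(report: dict, strict_final: bool = False) -> int:
    """Return 0 when the release may proceed, 1 otherwise (assumed codes)."""
    if not report["supported_surface_ready"]:
        return 1  # the documented surface itself has not passed its gates
    if strict_final and not report["stable_final"]:
        return 1  # open certification boundaries block only in strict mode
    return 0


report = {"supported_surface_ready": True, "stable_final": False}
default_result = exit_code(report)
strict_result = exit_code(report, strict_final=True)
```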

Supported semantics

ContractForge feature	AWS mapping
`append`	Iceberg append
`overwrite`	Iceberg replace/overwrite
`upsert`	Iceberg `MERGE INTO`
`hash_diff_upsert`	Hash-diff staging plus Iceberg merge, with `merge_keys` for row identity and `hash_keys` or `hash_strategy: all_columns_except` for content comparison; performance warnings remain
Quality abort/warn/quarantine	Glue Data Quality where mappable, plus ContractForge evidence tables
`incremental_files`	Glue DynamicFrame + job bookmarks for eligible S3 formats; no-new-input bookmark runs write `SKIPPED` evidence without executing column-dependent preparation/write logic
JDBC/PostgreSQL	Spark/Glue JDBC with user-supplied drivers and secrets
Logical table refs	`{{ table_ref:layer.table }}` -> Glue Catalog/Iceberg names
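The logical table reference mapping in the last row can be illustrated with a small substitution. The `{{ table_ref:layer.table }}` syntax comes from the table above; the layer-to-database mapping, the `glue_catalog` catalog name and the helper are assumptions, since the actual naming rules are defined by the contract and environment.

```python
import re

TABLE_REF = re.compile(
    r"\{\{\s*table_ref:([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\s*\}\}"
)


def resolve_table_refs(sql: str, layer_databases: dict,
                       catalog: str = "glue_catalog") -> str:
    """Replace each logical reference with catalog.database.table."""
    def replace(match):
        layer, table = match.group(1), match.group(2)
        # An unknown layer raises KeyError instead of rendering a bad name.
        return f"{catalog}.{layer_databases[layer]}.{table}"

    return TABLE_REF.sub(replace, sql)


# Hypothetical mapping from logical layers to Glue databases.
layer_databases = {"bronze": "lake_bronze", "silver": "lake_silver"}
sql = "SELECT * FROM {{ table_ref:silver.orders }}"
resolved = resolve_table_refs(sql, layer_databases)
```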

The dedicated incremental-files fixture is examples/real-world/aws-incremental-files. It validates portable incremental_files contracts, Glue bookmark configuration, Iceberg rendering and generated Python compilation before the upload-wave runtime test is run in AWS. The real AWS validation covers wave 1, wave 2 and a no-new-input rerun; the last run preserves the target row count and writes SKIPPED evidence with skip_reason=no_new_input.

Available-now streaming jobs write per-micro-batch evidence to ctrl_ingestion_streams and copy aggregate totals into final run evidence as stream_batches, stream_rows_read, stream_rows_written and stream_rows_quarantined. Final run, state and lineage rows-written evidence prefer those ContractForge stream totals because the latest Iceberg snapshot may only represent the last micro-batch. The Azure Event Hubs Kafka path has been validated in AWS Glue with checkpoint progression, no-input reruns, quarantine, target writes and cost evidence. Other Kafka/Event Hubs providers still return a provider-review warning until their connector/runtime semantics are tested.
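The roll-up from per-micro-batch evidence to run totals can be sketched as a plain aggregation. The four output names come from the paragraph above; the per-batch field names (`rows_read`, `rows_written`, `rows_quarantined`) and the sample values are assumptions.

```python
def stream_totals(batches: list) -> dict:
    """Aggregate micro-batch evidence rows into final run evidence fields."""
    return {
        "stream_batches": len(batches),
        "stream_rows_read": sum(b["rows_read"] for b in batches),
        "stream_rows_written": sum(b["rows_written"] for b in batches),
        "stream_rows_quarantined": sum(b["rows_quarantined"] for b in batches),
    }


batches = [
    {"batch_id": 0, "rows_read": 100, "rows_written": 98, "rows_quarantined": 2},
    {"batch_id": 1, "rows_read": 40, "rows_written": 40, "rows_quarantined": 0},
]
totals = stream_totals(batches)
```

In this example the last micro-batch wrote 40 rows, while the run total is 138, which is the gap that reading only the latest Iceberg snapshot would miss.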

historical, snapshot soft delete, some governance equivalence and proprietary native passthrough connectors remain review-required until validated for concrete AWS consumer engines.

Real validation

The Supabase JDBC medallion example validates the same project shape on Databricks and AWS: shared connection YAML, split contracts, logical table refs, JDBC partitioned reads, quality quarantine, annotations, operations and control tables. Snowflake parity is covered by the Snowflake stable-surface evidence and the cross-adapter contracts where Snowflake source bindings are supported.

See the repository examples:

examples/real-world/supabase-jdbc-medallion/
examples/real-world/usgs-earthquake-rest-medallion/
examples/real-world/s3-file-medallion/
examples/real-world/aws-failure-paths/
