When to use it

| Runtime | Recommended access | Why |
| --- | --- | --- |
| Databricks serverless on AWS | Unity Catalog External Location or Volume | Governed storage access is managed by the platform. |
| Databricks serverless from another cloud | External Location when available, otherwise network policy/egress setup | Direct S3A configuration may be blocked by Spark Connect/serverless controls. |
| Classic/job cluster | S3A credentials, instance profile or cluster IAM role | Cluster can usually accept Hadoop filesystem configuration. |
| Local Spark | S3A credentials or default AWS chain | Useful for development and compatibility checks. |

Runtime requirements

| Runtime | Access model | Use when |
| --- | --- | --- |
| Databricks serverless | Unity Catalog External Location or Volume | The platform blocks direct Hadoop/S3A credential configuration. |
| Classic cluster | S3A credentials, instance profile or assumed role | You control cluster libraries and Spark/Hadoop configuration. |
| Local Spark | Local AWS profile, environment variables or explicit credentials | Development and compatibility tests outside Databricks. |

Basic example

When the workspace already exposes the S3 location through Unity Catalog, the contract only needs to reference the governed path.

source:
  type: connector
  connector: s3
  path: s3://contractforge-s3/blob_teste/json/
  format: json
  read:
    schema: "id STRING, event_ts TIMESTAMP, payload STRING"
    recursiveFileLookup: true

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_s3_json

layer: bronze
mode: scd0_append

Serverless rule

If the path is not reachable through workspace governance, fix External Location, network policy or workspace storage access. Do not expect the connector to bypass platform restrictions.

Classic cluster direct credentials

Use direct credentials only on runtimes where setting S3A filesystem configuration is allowed. Store secrets outside the contract.

source:
  type: connector
  connector: s3
  path: s3a://contractforge-s3/blob_teste/parquet/
  format: parquet
  auth:
    access_key_id: "{{ secret:contractforge-aws/aws_access_key_id }}"
    secret_access_key: "{{ secret:contractforge-aws/aws_secret_access_key }}"
    session_token: "{{ secret:contractforge-aws/aws_session_token }}"
  read:
    schema: "id STRING, event_ts TIMESTAMP, amount DOUBLE"
    recursiveFileLookup: true

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_s3_parquet

layer: bronze
mode: scd0_append
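
The rule about keeping secrets outside the contract can be checked mechanically before a contract is committed. The following is a minimal sketch, not part of ContractForge: the helper name `literal_auth_fields` is hypothetical, and the placeholder shape `{{ secret:<scope>/<key> }}` is inferred only from the examples on this page.

```python
import re

# Placeholder shape assumed from the examples above: {{ secret:<scope>/<key> }}
SECRET_REF = re.compile(r"^\{\{\s*secret:[^/\s}]+/[^/\s}]+\s*\}\}$")


def literal_auth_fields(auth: dict) -> list:
    """Return the auth keys whose values are NOT secret placeholders."""
    return sorted(k for k, v in auth.items() if not SECRET_REF.match(str(v)))


# Hypothetical auth block, as it would look after parsing the contract YAML.
auth = {
    "access_key_id": "{{ secret:contractforge-aws/aws_access_key_id }}",
    "secret_access_key": "hardcoded-value",  # deliberately wrong for the demo
}
print(literal_auth_fields(auth))
```

An empty list means every auth value is a secret reference. Any key that is returned holds a literal value and should be moved to the secret store.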

Temporary credentials

If you use a session token, ensure it is valid for the full job duration. Folder reads and regex filters require list permission.
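
A pre-flight check can make the token lifetime requirement explicit. This is a sketch under stated assumptions: `token_covers_job` is a hypothetical helper, the expected runtime is your own estimate, and the expiry timestamp is assumed to come from wherever the temporary credentials were issued (for STS, the `Expiration` value returned with the credentials).

```python
from datetime import datetime, timedelta, timezone


def token_covers_job(expires_at, expected_runtime,
                     margin=timedelta(minutes=10), now=None):
    """True if the session token outlives the expected runtime plus a margin."""
    now = now or datetime.now(timezone.utc)
    return expires_at >= now + expected_runtime + margin


# Fixed timestamps so the example is deterministic.
start = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
expires = start + timedelta(hours=1)  # for example, a one-hour session

print(token_covers_job(expires, timedelta(minutes=30), now=start))  # True
print(token_covers_job(expires, timedelta(minutes=55), now=start))  # False
```

The margin is there because listing large prefixes can add time before the first object is read. Adjust it to the job's real behavior.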

Minimum IAM permissions

For read-only ingestion, the principal generally needs object read and bucket listing on the relevant prefix.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::contractforge-s3",
      "Condition": {"StringLike": {"s3:prefix": ["blob_teste/*"]}}
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::contractforge-s3/blob_teste/*"
    }
  ]
}
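
When several prefixes need the same read-only shape, the policy above can be generated rather than copied by hand. A minimal sketch follows; `read_only_policy` is a hypothetical helper that reproduces the statement layout shown above, and it does not call any AWS API.

```python
import json


def read_only_policy(bucket: str, prefix: str) -> dict:
    """Build a list + get policy scoped to one prefix of one bucket."""
    prefix = prefix.strip("/")  # accept "blob_teste", "blob_teste/" or "/blob_teste/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket and narrowed by prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads are granted on the objects under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_policy("contractforge-s3", "blob_teste/")
print(json.dumps(policy, indent=2))
```

Note that `s3:ListBucket` applies to the bucket ARN while `s3:GetObject` applies to object ARNs, which is why the two statements use different `Resource` values.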

Multi-format folder examples

Use the same connector with different Spark formats. Keep each contract explicit about format, schema and folder behavior instead of relying on path naming conventions alone.

CSV

source:
  type: connector
  connector: s3
  path: s3://contractforge-s3/blob_teste/csv/
  format: csv
  options:
    header: true
    delimiter: ","
  read:
    schema: "id STRING, event_ts TIMESTAMP, amount DOUBLE"
    source_complete: true

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_s3_csv

layer: bronze
mode: scd0_overwrite

JSON

source:
  type: connector
  connector: s3
  path: s3://contractforge-s3/blob_teste/json/
  format: json
  read:
    schema: "id STRING, event_ts TIMESTAMP, payload STRUCT<kind:STRING,value:DOUBLE>"
    recursiveFileLookup: true

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_s3_nested_json

layer: bronze
mode: scd0_append

Parquet

source:
  type: connector
  connector: s3
  path: s3://contractforge-s3/blob_teste/parquet/
  format: parquet
  read:
    recursiveFileLookup: true

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_s3_parquet_folder

layer: bronze
mode: scd0_append

Avro

source:
  type: connector
  connector: s3
  path: s3://contractforge-s3/blob_teste/avro/
  format: avro
  read:
    recursiveFileLookup: true

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_s3_avro

layer: bronze
mode: scd0_append

Format dependencies

CSV, JSON and Parquet are broadly available in Spark runtimes. Avro requires Spark Avro support. XML requires a Spark XML datasource. If the runtime lacks the reader, ContractForge surfaces the Spark error instead of silently changing behavior.

Many files and regex selection

Use Spark glob patterns (pathGlobFilter) for simple extension filtering, and use file_regex when selection depends on partition-style folders or naming conventions that glob patterns cannot express clearly.

source:
  type: connector
  connector: s3
  path: s3a://contractforge-s3/blob_teste/small_files/
  format: json
  auth:
    access_key_id: "{{ secret:contractforge-aws/aws_access_key_id }}"
    secret_access_key: "{{ secret:contractforge-aws/aws_secret_access_key }}"
    session_token: "{{ secret:contractforge-aws/aws_session_token }}"
  read:
    schema: "id STRING, event_ts TIMESTAMP, amount DOUBLE"
    recursiveFileLookup: true
    pathGlobFilter: "*.json"
    file_regex: "^partition_date=2026-05-[0-9]{2}/part-[0-9]+\\.json$"
    file_regex_scope: relative_path
    file_regex_max_listed: 100000

target:
  catalog: contractforge
  schema: bronze_examples
  table: b_s3_small_files

layer: bronze
mode: scd0_append
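
Before running a job against a large prefix, it can help to try the file_regex value against a few sample relative paths. The sketch below uses Python's `re` module, which is an assumption: the regex engine that ContractForge applies is not stated here, although this particular pattern uses only syntax common to most engines. The sample paths are hypothetical.

```python
import re

# Same pattern as file_regex above, after YAML unescaping ("\\." becomes "\.").
FILE_REGEX = re.compile(r"^partition_date=2026-05-[0-9]{2}/part-[0-9]+\.json$")

# Hypothetical paths relative to the configured source path.
candidates = [
    "partition_date=2026-05-03/part-0001.json",      # matches
    "partition_date=2026-04-30/part-0001.json",      # wrong month
    "partition_date=2026-05-03/_SUCCESS",            # marker file
    "partition_date=2026-05-03/part-0001.json.tmp",  # trailing suffix
]

selected = [path for path in candidates if FILE_REGEX.search(path)]
print(selected)
```

Only the first path is selected. The anchors `^` and `$` matter here: without them, the `.tmp` file would also match.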

Operational metadata

SELECT
  run_id,
  source_connector,
  source_path,
  source_auth_redacted_json,
  source_read_redacted_json,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
WHERE source_connector = 's3'
ORDER BY started_at_utc DESC;

Common issues

| Symptom | Likely cause | Action |
| --- | --- | --- |
| Access denied on folder | Missing s3:ListBucket. | Add list permission for the prefix. |
| Access denied on file | Missing s3:GetObject. | Add object read permission for the path. |
| Works on classic, fails on serverless | Direct S3A configuration is blocked. | Use an External Location/Volume or a workspace network policy. |
| Temporary token expires | STS session is shorter than the job. | Issue a longer-lived token or use a platform-managed identity. |
| Slow file discovery | Large prefix listing. | Use narrower paths, glob filters or Auto Loader. |