Connectors
Connectors should retrieve data, expose source metadata and fail clearly when runtime capabilities are missing. Business shaping stays in transform.
contractforge connectors list
contractforge connectors show rest_api jdbc s3 azure_blob
contractforge connectors doctor rest_api jdbc s3 azure_blob
The connectors doctor command does not open network connections. It documents the runtime capabilities each connector expects, so reviewers can catch missing drivers, cloud access or platform constraints early.
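As a rough illustration of a check that stays offline, the sketch below reports whether the Python modules a connector might need are importable, without opening any connection. The `EXPECTED_MODULES` map and the `doctor` function are hypothetical names for this example, not ContractForge's real registry or implementation.

```python
from importlib.util import find_spec

# Hypothetical capability map: connector id -> modules expected at runtime.
# The entries are illustrative assumptions, not ContractForge's actual registry.
EXPECTED_MODULES = {
    "rest_api": ["urllib.request", "json"],
    "s3": ["boto3"],
    "azure_blob": ["azure.storage.blob"],
}


def doctor(connector: str) -> dict:
    """Report which expected modules are importable, with no network I/O."""
    report = {}
    for module in EXPECTED_MODULES.get(connector, []):
        try:
            report[module] = find_spec(module) is not None
        except (ImportError, ValueError):
            # find_spec raises when a parent package (for example "azure") is absent.
            report[module] = False
    return report


if __name__ == "__main__":
    for name in ("rest_api", "s3", "azure_blob"):
        print(name, doctor(name))
```

A report like this only says whether a dependency is present; it cannot confirm that credentials or network routes work, which is why those still need validation on the target runtime.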
Use this matrix to choose the first connector to test. The detailed page for each connector contains contract examples, runtime constraints and operational notes.
| Source type | Connectors | Best fit | Common follow-up |
|---|---|---|---|
| Catalog data | table, delta_table, view, sql | Data already registered in Spark or Unity Catalog. | Use write modes and quality gates directly. |
| Files | csv, json, jsonl, ndjson, parquet, delta, orc, avro, xml, text | Static folders, backfills and finite file sets reachable by Spark. | Declare schema, use folder filters and apply transform.shape when nested. |
| HTTP files | http_file, http_csv, http_json, http_text | Bounded public/authenticated files when Spark cannot read HTTPS directly. | Protect with byte/time limits; move high-volume feeds to storage. |
| Object storage | object_storage, blob, s3, azure_blob | S3, ADLS, Blob or governed storage paths. | Choose External Location/Volume for serverless or direct credentials on classic clusters. |
| Relational databases | jdbc, postgres, mysql, sqlserver, oracle | Database extraction with partitioning, predicates and watermarks. | Install drivers, validate network, deduplicate before MERGE. |
| APIs | rest_api | Bounded API extraction, pagination and raw payload capture. | Use transform.shape.parse_json for nested documents. |
| File streams | autoloader | Databricks cloudFiles available_now with checkpoints. | Inspect ctrl_ingestion_streams and child run metrics. |
| External systems | snowflake, bigquery | Sources supported by installed Spark connector packages. | Keep provider credentials outside contracts and validate installed libraries. |
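The rest_api row above refers to bounded extraction with pagination. A minimal sketch of that idea follows, assuming a cursor-style API whose pages carry `items` and `next` keys; the function name, the page shape and the limit parameters are assumptions for illustration, not the connector's actual contract fields.

```python
from typing import Callable, Iterator, Optional


def fetch_bounded(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 10,
    max_records: int = 1000,
) -> Iterator[dict]:
    """Yield records page by page, stopping at whichever limit is hit first."""
    cursor = None
    records_seen = 0
    for _ in range(max_pages):
        page = fetch_page(cursor)
        for record in page.get("items", []):
            if records_seen >= max_records:
                return
            records_seen += 1
            yield record
        cursor = page.get("next")
        if cursor is None:
            return  # the API reported no further pages


# Stand-in for an HTTP call: three pages of two records each.
PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next": "p2"},
    "p2": {"items": [{"id": 3}, {"id": 4}], "next": "p3"},
    "p3": {"items": [{"id": 5}, {"id": 6}], "next": None},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return PAGES[cursor]
```

Keeping both limits explicit means a misbehaving API that never stops returning a cursor still ends the extraction after `max_pages` requests.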
A connector is responsible for retrieving bytes or records and exposing safe source metadata. It should not contain business transformations. Use transform.shape, column_mapping, quality rules and write modes for the contract-specific behavior.
| Concern | Covers |
|---|---|
| Read the source (connector) | URL, path, table, SQL, API request, pagination, credentials, driver options and runtime-specific access setup. |
| Shape (transform.shape, column_mapping) | Parse JSON strings, flatten structs, explode arrays, zip parallel arrays, project fields, cast columns and deduplicate. |
| Quality rules | Required keys, accepted values, expressions, uniqueness and row-level quarantine decisions. |
| Write modes | Append, overwrite, SCD1, hash diff, SCD2 or snapshot soft delete with metrics and control-table evidence. |
Serverless is not a weaker runtime, but it is more opinionated. Prefer governed access paths and platform-managed connectivity before adding connector-level credentials.
| Connector | Serverless preference | Classic/local preference | What to validate first |
|---|---|---|---|
| Files on workspace/Volumes | Use Volumes, Workspace files or External Locations with explicit schema. | Use any Spark-readable path and installed format libraries. | Path grants, schema, recursive lookup and format dependencies. |
| S3 | Use Unity Catalog External Locations or Volumes. Direct S3A credentials can be blocked by Spark Connect controls. | Use S3A credentials, instance profile, assumed role or the AWS default chain. | List/read grants, path prefix, session token expiry and Hadoop AWS libraries. |
| Azure Blob / ADLS | Use External Locations or Volumes. Direct SAS/Hadoop configuration is not the durable serverless path. | Use SAS, service principal, managed identity or ABFS configuration when allowed. | Storage credential validation, list permission and endpoint reachability. |
| HTTP file | Use for bounded files when Spark cannot read HTTPS directly. | Same pattern; classic networking is usually easier to customize. | Driver egress, file size, timeout, retries and response format. |
| REST API | Driver-side HTTP. Keep page, byte, record and timeout limits explicit. | Same contract model; cluster networking may allow more custom routing. | DNS, proxy, allowlists, API rate limits and payload shape. |
| JDBC / RDS IAM | Requires driver availability, database route and, for IAM, AWS credentials visible to the Python driver. | Install driver and configure network, security groups, peering or PrivateLink. | TCP route, driver class, SSL options, IAM policy and source query bounds. |
| Snowflake | Requires connector/JDBC availability and Snowflake network policy that allows the Databricks egress path. | Use connector/JDBC packages and validate credentials from the cluster. | Service user, role, warehouse, PAT/JWT support and network policy. |
| BigQuery | Prefer Unity Catalog Lakehouse Federation when available; use the direct Spark connector only when dependencies and credentials are supported. | Use the Spark BigQuery connector with service-account/runtime credentials. | Federated connection, materialization dataset, service-account permissions and cost controls. |
| Auto Loader | Use supported cloudFiles paths through Volumes or External Locations. | Use cluster/cloud IAM and checkpoint/schema locations. | Checkpoint path, schema location, source access and child-run stream metrics. |
For object storage, start with Unity Catalog External Locations or Volumes. They make permissions auditable and avoid runtime-specific Hadoop credential mutation.
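To make the distinction concrete, here is a small sketch of a review-time helper that labels a path by access style. It assumes Unity Catalog Volumes are addressed under `/Volumes/` and treats a handful of URI schemes as cloud storage; the function name and the returned labels are hypothetical.

```python
from urllib.parse import urlsplit

# URI schemes treated as cloud object storage in this sketch (an assumption).
CLOUD_SCHEMES = {"s3", "s3a", "abfs", "abfss", "wasbs"}


def access_style(path: str) -> str:
    """Classify a source path as 'volume', 'cloud_uri' or 'other'."""
    if path.startswith("/Volumes/"):
        return "volume"
    if urlsplit(path).scheme in CLOUD_SCHEMES:
        # Needs an External Location grant or, on classic clusters, direct credentials.
        return "cloud_uri"
    return "other"
```

A cloud URI is not necessarily ungoverned, since an External Location can cover it; the label only flags paths where access setup deserves a closer look.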
For BigQuery and other catalog-integrated sources, a federated table can be consumed through table or sql, which avoids connector packaging and credential-file handling in jobs.
Direct JDBC, Snowflake and BigQuery connectors remain useful when dependencies, credentials and network routes are explicitly supported by the runtime.
If a platform blocks network, driver or filesystem configuration, fix the platform capability. Do not mask the gap with runtime-specific workarounds inside the connector.
Most connectors use the same top-level shape. Individual connector pages document connector-specific options, but these fields are the common vocabulary.
| Field | Use |
|---|---|
| type | Use connector for registry-based connectors. |
| connector | Connector id, such as csv, s3, azure_blob, jdbc, rest_api or autoloader. |
| name | Logical source name for observability. |
| provider | Cloud or source provider, when useful for operations. |
| format | Data format for file-like connectors. |
| path | File, folder, object-storage or URL path. |
| account_url, container | Object-storage location parts for Azure Blob/ADLS-style sources. |
| table, query | Table or SQL query for catalog/JDBC sources. |
| options | Spark/DataSource connector options and runtime-specific configuration. |
| read | Read semantics such as schema, multiline, recursive lookup, partitioning, bounds or fetch size. |
| request | HTTP method, URL, headers and request parameters. |
| auth | Authentication metadata. Secrets should use {{ secret:scope/key }} references. |
| pagination | REST pagination strategy. |
| response | REST response extraction and record path handling. |
| incremental | Connector-level incremental settings. |
| limits | Timeout, retry and payload safety limits supported by the connector. |
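The common vocabulary can be pictured as a plain mapping. The sketch below writes a hypothetical rest_api source block as a Python dict and checks it for unknown top-level keys. The top-level names come from the table above; the nested keys (`method`, `url`, `token`, `strategy`, `timeout_seconds`), the example URL and the `unknown_fields` helper are assumptions for illustration only.

```python
# Top-level field names taken from the table above.
COMMON_FIELDS = {
    "type", "connector", "name", "provider", "format", "path",
    "account_url", "container", "table", "query", "options", "read",
    "request", "auth", "pagination", "response", "incremental", "limits",
}

# Hypothetical source block; nested keys and values are illustrative assumptions.
source = {
    "type": "connector",
    "connector": "rest_api",
    "name": "orders_api",
    "request": {"method": "GET", "url": "https://api.example.com/orders"},
    "auth": {"token": "{{ secret:scope/key }}"},
    "pagination": {"strategy": "cursor"},
    "limits": {"timeout_seconds": 30},
}


def unknown_fields(block: dict) -> set:
    """Return top-level keys that are not part of the common vocabulary."""
    return set(block) - COMMON_FIELDS
```

A check like this catches typos such as `pth` before a run starts; connector-specific options would still need their own validation.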
Every connector contributes source metadata to the run payload and ctrl_ingestion_runs. Secrets are redacted before persistence.
SELECT
  source_connector,
  source_format,
  source_path,
  source_options_json,
  source_read_json,
  source_auth_json,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
ORDER BY started_at_utc DESC
LIMIT 20;
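The redaction step that precedes persistence might look roughly like the sketch below, which masks values under sensitive-looking keys before serializing to JSON. The key list, the `***` mask and the `redact` function are assumptions; the source does not describe how ContractForge performs redaction.

```python
import json

# Key names treated as sensitive in this sketch (an assumption, not the real list).
SENSITIVE_KEYS = {"password", "token", "secret", "sas_token", "client_secret"}


def redact(value):
    """Recursively replace values stored under sensitive keys with a mask."""
    if isinstance(value, dict):
        return {
            key: "***" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Hypothetical auth metadata as it might exist in memory before persistence.
auth = {"type": "bearer", "token": "abc123", "extra": [{"password": "x"}]}
source_auth_json = json.dumps(redact(auth), sort_keys=True)
```

Matching on key names is a coarse heuristic; secret references such as {{ secret:scope/key }} avoid the problem by keeping the raw value out of the contract in the first place.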