HTTP file
Use http_file for bounded files served through HTTP(S), especially when the Spark runtime cannot read arbitrary HTTPS paths directly. The connector downloads the file on the driver, materializes it for Spark and preserves source metadata in control tables.
When to use it
- Direct file download: Use this for bounded CSV, JSON, JSONL, NDJSON or text files exposed over HTTP(S).
- Spark cannot read HTTPS: The driver downloads the file and then hands a materialized local or staged file to Spark.
- Controlled payloads: Use explicit timeout, retry and size limits. Move large recurring feeds to object storage.
- No business parsing: The connector does not normalize nested payloads. Use transform.shape or downstream contracts for that.
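The driver-side download described above can be pictured as a chunked copy that stops as soon as a byte limit is exceeded. The following is a minimal sketch of that idea, not the connector's actual implementation; the names copy_bounded, materialize and PayloadTooLarge are hypothetical and exist only for illustration.

```python
import io
import os
import tempfile


class PayloadTooLarge(Exception):
    """Raised when a stream exceeds the configured byte limit (hypothetical name)."""


def copy_bounded(src, dst, max_bytes, chunk_size=64 * 1024):
    """Copy src to dst in chunks and abort once more than max_bytes were read."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)
        if total > max_bytes:
            raise PayloadTooLarge(f"payload exceeds {max_bytes} bytes")
        dst.write(chunk)


def materialize(src, max_bytes, suffix=".csv"):
    """Write a bounded stream to a temporary file and return the file path."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        with tmp:
            copy_bounded(src, tmp, max_bytes)
    except Exception:
        # Do not leave a partial file behind when the limit is hit.
        os.unlink(tmp.name)
        raise
    return tmp.name


# An in-memory stream stands in for the HTTP response body.
local_path = materialize(io.BytesIO(b"order_id,amount\n1,9.5\n"), max_bytes=1024)
```

Against a real endpoint, src could be the object returned by urllib.request.urlopen(url, timeout=60), and the returned path is what a Spark reader would then be pointed at.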
Runtime requirements
| Requirement | Details |
|---|---|
| Driver egress | The Databricks driver or local Spark process must reach the HTTP(S) endpoint. |
| Payload limits | Set explicit timeout, retry and byte limits (for example timeout_seconds, retry_attempts and max_bytes) and avoid unbounded downloads. |
| Format support | Spark must support the declared file format after the file is materialized. |
| Authentication | Use secret-backed headers or URLs for authenticated files. |
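This page does not show how the connector accepts headers, so the sketch below only illustrates the general idea of a secret-backed header in plain Python: the token is looked up at run time instead of being written into the contract. The function name build_request and the variable name HTTP_FILE_TOKEN are assumptions made for this example.

```python
import os
import urllib.request


def build_request(url, token_env="HTTP_FILE_TOKEN", environ=os.environ):
    """Build a GET request whose bearer token is read from the environment."""
    token = environ.get(token_env)
    if not token:
        # Fail early instead of sending an unauthenticated request.
        raise KeyError(f"missing secret in environment variable {token_env}")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# A plain dict stands in for the environment here.
request = build_request(
    "https://example.org/data/orders.csv",
    environ={"HTTP_FILE_TOKEN": "example-token"},
)
```

On Databricks the token would more typically come from a secret scope, for example through dbutils.secrets.get, rather than from an environment variable.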
Basic example
YAML:

```yaml
source:
  type: connector
  connector: http_file
  path: https://example.org/data/orders.csv
  format: csv
  options:
    header: true
    inferSchema: false
  read:
    schema: "order_id STRING, order_ts TIMESTAMP, amount DOUBLE"
    source_complete: true
  limits:
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2
    max_bytes: 52428800
    max_records: 100000
target:
  catalog: contractforge
  schema: bronze_examples
  table: b_http_orders
layer: bronze
mode: overwrite
```

Python:

```python
from contractforge import ingest

result = ingest(
    source={
        "type": "connector",
        "connector": "http_file",
        "path": "https://example.org/data/orders.csv",
        "format": "csv",
        "options": {"header": True, "inferSchema": False},
        "read": {
            "schema": "order_id STRING, order_ts TIMESTAMP, amount DOUBLE",
            "source_complete": True,
        },
        "limits": {
            "timeout_seconds": 60,
            "retry_attempts": 3,
            "retry_backoff_seconds": 2,
            "max_bytes": 52_428_800,
            "max_records": 100_000,
        },
    },
    catalog="contractforge",
    target_schema="bronze_examples",
    target_table="b_http_orders",
    layer="bronze",
    mode="overwrite",
)
```
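To make the retry settings in the example concrete, here is a rough sketch of what retry_attempts and retry_backoff_seconds could mean. It assumes that retry_attempts counts total attempts and that the pause between attempts is fixed; the connector may count or back off differently. The helper with_retries is hypothetical.

```python
import time

# Same value as max_bytes: 52428800 in the example, which is 50 MiB.
MAX_BYTES = 50 * 1024 * 1024


def with_retries(fn, retry_attempts=3, retry_backoff_seconds=2, sleep=time.sleep):
    """Call fn, retrying on OSError with a fixed pause between attempts."""
    for attempt in range(1, retry_attempts + 1):
        try:
            return fn()
        except OSError:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            if attempt == retry_attempts:
                raise
            sleep(retry_backoff_seconds)


# Simulated download that fails twice before succeeding.
attempts = []
pauses = []


def flaky_download():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("temporary network failure")
    return "downloaded"


outcome = with_retries(flaky_download, sleep=pauses.append)
```

Passing sleep as a parameter keeps the sketch testable without real waiting. With these assumptions, the worst case before the run fails is roughly three timeouts plus two pauses.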
Supported formats
Supported formats are csv, json, jsonl, ndjson and text. Aliases http_csv, http_json and http_text are available when you want the connector name to imply the file format.
Operational guidance
- Use this connector for small and medium files, not unbounded API extraction.
- Set explicit byte, timeout and retry limits because the driver performs the download.
- For recurring high-volume feeds, land files in object storage and use Auto Loader or a file connector.
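One way to apply this guidance before scheduling a feed is to compare the size the server declares with the byte limit you intend to set. The sketch below is an illustration only; declared_size_ok is a hypothetical helper and is not part of the connector.

```python
def declared_size_ok(content_length, max_bytes):
    """Return True or False when the size is declared, None when it is unknown."""
    if content_length is None:
        return None
    try:
        size = int(content_length)
    except ValueError:
        return None
    if size < 0:
        return None
    return size <= max_bytes


# Header values arrive as strings; a missing header is None.
fits = declared_size_ok("1048576", max_bytes=52_428_800)
too_big = declared_size_ok("104857600", max_bytes=52_428_800)
unknown = declared_size_ok(None, max_bytes=52_428_800)
```

The header value could be read with a HEAD request, for example urllib.request.urlopen(urllib.request.Request(url, method="HEAD"), timeout=10).headers.get("Content-Length"). Servers may omit Content-Length, so an unknown result still calls for a byte limit during the download itself.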
Public dataset pattern
This pattern is useful for complete public extracts that are easier to download as one bounded file than to access through a Spark filesystem reader.
YAML:

```yaml
source:
  type: connector
  connector: http_file
  path: https://example.org/public/covid.csv
  format: csv
  options:
    header: true
    inferSchema: false
  read:
    schema: "submission_date DATE, state STRING, tot_cases BIGINT"
    source_complete: true
  limits:
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2
    max_bytes: 52428800
    max_records: 100000
target:
  catalog: contractforge
  schema: bronze_health
  table: b_public_covid
layer: bronze
mode: overwrite
```

Python:

```python
from contractforge import ingest

result = ingest(
    source={
        "type": "connector",
        "connector": "http_file",
        "path": "https://example.org/public/covid.csv",
        "format": "csv",
        "options": {"header": True, "inferSchema": False},
        "read": {
            "schema": "submission_date DATE, state STRING, tot_cases BIGINT",
            "source_complete": True,
        },
        # Same limits as the YAML version of this example.
        "limits": {
            "timeout_seconds": 60,
            "retry_attempts": 3,
            "retry_backoff_seconds": 2,
            "max_bytes": 52_428_800,
            "max_records": 100_000,
        },
    },
    catalog="contractforge",
    target_schema="bronze_health",
    target_table="b_public_covid",
    layer="bronze",
    mode="overwrite",
)
```
Use bounded files only: for high-volume recurring feeds, land files in object storage and use Auto Loader or a file connector.
Operational validation
```sql
SELECT
  run_id,
  status,
  source_connector,
  source_format,
  source_path,
  rows_read,
  rows_written,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
WHERE source_connector IN ('http_file', 'http_csv', 'http_json', 'http_text')
ORDER BY started_at_utc DESC;
```
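Once the query result is available as a list of dictionaries, a quick check can flag runs whose read and written counts differ. This is a sketch under the assumption that the rows carry the column names selected above; runs_needing_review is a hypothetical helper, and a difference is not necessarily an error, since validation rules may legitimately drop rows.

```python
def runs_needing_review(rows):
    """Return run_ids whose rows_read and rows_written are missing or differ."""
    flagged = []
    for row in rows:
        read = row.get("rows_read")
        written = row.get("rows_written")
        if read is None or written is None or read != written:
            flagged.append(row["run_id"])
    return flagged


# Made-up rows for illustration only.
sample_rows = [
    {"run_id": "r1", "rows_read": 100, "rows_written": 100},
    {"run_id": "r2", "rows_read": 100, "rows_written": 97},
    {"run_id": "r3", "rows_read": None, "rows_written": 0},
]
flagged_runs = runs_needing_review(sample_rows)
```

In a notebook, the rows could be obtained with [r.asDict() for r in spark.sql(query).collect()], where query holds the statement above.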
Common issues
| Symptom | Likely cause | Action |
|---|---|---|
| DNS or connection failure | The driver cannot reach the host. | Validate egress, DNS, proxy and endpoint allowlists from the runtime. |
| Timeout | File is too large or endpoint is slow. | Increase limits cautiously or move the feed to object storage. |
| Unexpected schema | File format changed or schema inference was used. | Declare explicit schema and reader options. |
| Memory pressure | Payload is too large for driver-side download. | Use object storage ingestion or Auto Loader for large feeds. |
How this connector fits the contract
Keep extraction concerns in source, structural normalization in transform, validation in quality_rules and target semantics in mode. This separation keeps examples portable and prevents connector-specific runtime fixes from becoming hidden business logic.