# ETL Pipeline Builder
An agent that designs and executes extract-transform-load pipelines between data sources — using the Legible semantic layer to understand schemas, relationships, and business logic.
## The Problem

Building ETL pipelines requires understanding source and target schemas, data types, join relationships, business transformation rules, and error handling. Data engineers spend days writing boilerplate extraction scripts, type conversions, and validation logic that could instead be inferred from a well-defined semantic layer.
## How It Works
- The agent reads your MDL (Modeling Definition Language) models from the Legible semantic layer via MCP
- It understands table relationships, calculated fields, and metrics — not just raw schemas
- It generates extraction queries optimized for each source connector (PostgreSQL, BigQuery, Snowflake, etc.)
- It applies transformations based on the semantic layer's business logic (calculated fields, metrics)
- It loads data into the target, handling type coercion, upserts, and incremental loads
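The incremental extraction and type-coercion steps above can be sketched in plain Python. All names here (`build_extract_query`, `TYPE_MAP`, `coerce_type`) are illustrative, not part of Legible's actual API — the agent generates connector-specific SQL rather than calling helpers like these:

```python
# Illustrative sketch: incremental extraction + source→target type mapping.
# TYPE_MAP covers a few PostgreSQL → BigQuery conversions as examples.
TYPE_MAP = {
    "jsonb": "JSON",          # PostgreSQL JSONB → BigQuery JSON
    "timestamptz": "TIMESTAMP",
    "numeric": "NUMERIC",
}

def build_extract_query(model, columns, watermark=None):
    """Generate an extraction query for a semantic-layer model.

    When a watermark is given, only rows changed since the last
    successful run are selected (incremental load)."""
    query = f"SELECT {', '.join(columns)} FROM {model}"
    if watermark is not None:
        query += f" WHERE updated_at > '{watermark}'"
    return query

def coerce_type(source_type):
    """Map a source column type to a target type, defaulting to STRING."""
    return TYPE_MAP.get(source_type.lower(), "STRING")
```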
## Blueprint

```yaml
agent:
  type: claude
  description: ETL pipeline design and execution

components:
  sandbox:
    image: ghcr.io/nvidia/openshell/sandbox-base:latest
    resources:
      cpus: "4.0"
      memory: "16g"

inference:
  profiles:
    nvidia:
      model: meta/llama-3.3-70b-instruct
      provider_type: nvidia

policies:
  network: policy.yaml
```
Network policy (`policy.yaml`):

```yaml
version: "1.0"
rules:
  - name: legible-mcp
    protocol: tcp
    port: 443
    destination: "your-legible-instance.example.com"
  - name: source-db
    protocol: tcp
    port: 5432
    destination: "source-db.example.com"
  - name: target-warehouse
    protocol: tcp
    port: 443
    destination: "*.bigquery.googleapis.com"
```
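The wildcard in the last rule implies glob-style destination matching. A minimal sketch of how such a rule set could be evaluated — this is illustrative only, not Legible's actual enforcement logic:

```python
from fnmatch import fnmatch

# Mirror of the policy above, as Python data (illustrative).
RULES = [
    {"name": "legible-mcp", "port": 443, "destination": "your-legible-instance.example.com"},
    {"name": "source-db", "port": 5432, "destination": "source-db.example.com"},
    {"name": "target-warehouse", "port": 443, "destination": "*.bigquery.googleapis.com"},
]

def is_allowed(host, port):
    """Return True if any rule permits a TCP connection to host:port.

    fnmatch gives glob semantics, so '*.bigquery.googleapis.com'
    matches any subdomain of bigquery.googleapis.com."""
    return any(r["port"] == port and fnmatch(host, r["destination"]) for r in RULES)
```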
## Usage

```sh
legible agent create etl-builder --blueprint etl-pipeline --profile nvidia
legible agent connect etl-builder

# Inside the sandbox:
# > Build an ETL pipeline to sync the orders model from PostgreSQL to BigQuery
# > Include all calculated fields and apply incremental loading by updated_at
# > Generate a DuckDB intermediate step for data validation
# > Run the pipeline in dry-run mode first and show me the row counts
# > Schedule this pipeline to run every 6 hours
```
## Key Capabilities
| Capability | Description |
|---|---|
| Semantic-aware extraction | Uses MDL models to generate optimized queries, not just SELECT * |
| Cross-source joins | Handles joins across different databases using the relationship definitions in your semantic layer |
| Incremental loading | Supports updated_at watermarks, CDC-style change tracking, and merge/upsert strategies |
| Type coercion | Automatically maps source types to target types (e.g., PostgreSQL JSONB → BigQuery JSON) |
| Validation | Row count comparison, schema drift detection, null/duplicate checks |
| Pipeline as code | Outputs reusable Python scripts, SQL files, or Airflow DAGs |
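The validation checks listed above (row count comparison, null/duplicate detection) reduce to a few simple comparisons. A minimal pure-Python sketch, assuming extracted and loaded rows are available as lists of dicts (the real agent runs these checks in SQL, e.g. against a DuckDB intermediate step):

```python
from collections import Counter

def validate_load(source_rows, target_rows, key="order_id"):
    """Compare extracted vs. loaded rows; return a list of issue strings."""
    issues = []
    # Row count comparison between source extract and target load
    if len(source_rows) != len(target_rows):
        issues.append(f"row count mismatch: {len(source_rows)} vs {len(target_rows)}")
    keys = [r.get(key) for r in target_rows]
    # Null check on the business key
    if any(k is None for k in keys):
        issues.append(f"null {key} in target")
    # Duplicate check on the business key
    dupes = [k for k, n in Counter(keys).items() if k is not None and n > 1]
    if dupes:
        issues.append(f"duplicate {key}s: {dupes}")
    return issues
```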
## Example Pipeline

```python
# Generated by ETL Pipeline Builder agent
# Source: PostgreSQL (orders model) → Target: BigQuery
from legible import MCP

mcp = MCP()

# Extract using semantic layer query (includes calculated fields)
orders = mcp.query("""
    SELECT order_id, customer_id, total_amount, profit_margin,
           order_date, status
    FROM orders
    WHERE updated_at > '{{ last_watermark }}'
""")

# Load to BigQuery, appending new rows and clustering for query performance
orders.to_bigquery(
    table="analytics.orders",
    write_disposition="WRITE_APPEND",
    clustering_fields=["order_date", "status"],
)
```
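`WRITE_APPEND` above is append-only; for the merge/upsert strategy mentioned under Key Capabilities, the agent would instead load into a staging table and emit a `MERGE` statement. A hedged sketch of generating one (standard BigQuery `MERGE` syntax; the `build_merge_sql` helper and staging-table naming are illustrative):

```python
def build_merge_sql(target, staging, key, columns):
    """Generate a BigQuery MERGE (upsert) from a staging table into the target."""
    updates = ", ".join(f"T.{c} = S.{c}" for c in columns if c != key)
    cols = ", ".join(columns)
    vals = ", ".join(f"S.{c}" for c in columns)
    return (
        f"MERGE `{target}` T USING `{staging}` S ON T.{key} = S.{key} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "          # existing rows: update
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"  # new rows: insert
    )
```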
## Supported Source/Target Combinations
Any pair of Legible's 22+ supported connectors can be used. Common patterns:
| Source | Target | Use Case |
|---|---|---|
| PostgreSQL | BigQuery | Operational → Analytics warehouse |
| MySQL | Snowflake | Application data → Cloud warehouse |
| SQL Server | PostgreSQL | Legacy migration |
| BigQuery | DuckDB | Cloud → Local development |
| Multiple sources | DuckDB | Federation for ML training data |