# ETL Pipeline Builder
An agent that designs and executes extract-transform-load pipelines between data sources — using the Legible semantic layer to understand schemas, relationships, and business logic.
## The Problem

Building ETL pipelines requires understanding source and target schemas, data types, join relationships, business transformation rules, and error handling. Data engineers spend days writing boilerplate extraction scripts, type conversions, and validation logic that could instead be inferred from a well-defined semantic layer.
## How It Works
- The agent reads your MDL (Modeling Definition Language) models from the Legible semantic layer via MCP
- It understands table relationships, calculated fields, and metrics — not just raw schemas
- It generates extraction queries optimized for each source connector (PostgreSQL, BigQuery, Snowflake, etc.)
- It applies transformations based on the semantic layer's business logic (calculated fields, metrics)
- It loads data into the target, handling type coercion, upserts, and incremental loads
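The incremental extraction and type-coercion steps above can be sketched in plain Python. All names here (`build_extract_query`, `TYPE_MAP`, `coerce_type`) are illustrative, not part of Legible's actual API — the agent generates connector-specific SQL rather than calling helpers like these:

```python
# Illustrative sketch: incremental extraction + source→target type mapping.
# TYPE_MAP covers a few PostgreSQL → BigQuery conversions as examples.
TYPE_MAP = {
    "jsonb": "JSON",          # PostgreSQL JSONB → BigQuery JSON
    "timestamptz": "TIMESTAMP",
    "numeric": "NUMERIC",
}

def build_extract_query(model, columns, watermark=None):
    """Generate an extraction query for a semantic-layer model.

    When a watermark is given, only rows changed since the last
    successful run are selected (incremental load)."""
    query = f"SELECT {', '.join(columns)} FROM {model}"
    if watermark is not None:
        query += f" WHERE updated_at > '{watermark}'"
    return query

def coerce_type(source_type):
    """Map a source column type to a target type, defaulting to STRING."""
    return TYPE_MAP.get(source_type.lower(), "STRING")
```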
## Blueprint

```yaml
agent:
  type: claude
  description: ETL pipeline design and execution

components:
  sandbox:
    image: ghcr.io/nvidia/openshell/sandbox-base:latest
    resources:
      cpus: "4.0"
      memory: "16g"

inference:
  profiles:
    nvidia:
      model: meta/llama-3.3-70b-instruct
      provider_type: nvidia

policies:
  network: policy.yaml
```
Network policy (`policy.yaml`):

```yaml
version: "1.0"
rules:
  - name: legible-mcp
    protocol: tcp
    port: 443
    destination: "your-legible-instance.example.com"
  - name: source-db
    protocol: tcp
    port: 5432
    destination: "source-db.example.com"
  - name: target-warehouse
    protocol: tcp
    port: 443
    destination: "*.bigquery.googleapis.com"
```
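The wildcard in the last rule implies glob-style destination matching. A minimal sketch of how such a rule set could be evaluated — this is illustrative only, not Legible's actual enforcement logic:

```python
from fnmatch import fnmatch

# Mirror of the policy above, as Python data (illustrative).
RULES = [
    {"name": "legible-mcp", "port": 443, "destination": "your-legible-instance.example.com"},
    {"name": "source-db", "port": 5432, "destination": "source-db.example.com"},
    {"name": "target-warehouse", "port": 443, "destination": "*.bigquery.googleapis.com"},
]

def is_allowed(host, port):
    """Return True if any rule permits a TCP connection to host:port.

    fnmatch gives glob semantics, so '*.bigquery.googleapis.com'
    matches any subdomain of bigquery.googleapis.com."""
    return any(r["port"] == port and fnmatch(host, r["destination"]) for r in RULES)
```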
## Usage

```sh
legible agent create etl-builder --blueprint etl-pipeline --profile nvidia
legible agent connect etl-builder

# Inside the sandbox:
# > Build an ETL pipeline to sync the orders model from PostgreSQL to BigQuery
# > Include all calculated fields and apply incremental loading by updated_at
# > Generate a DuckDB intermediate step for data validation
# > Run the pipeline in dry-run mode first and show me the row counts
# > Schedule this pipeline to run every 6 hours
```
## Key Capabilities
| Capability | Description |
|---|---|
| Semantic-aware extraction | Uses MDL models to generate optimized queries, not just SELECT * |
| Cross-source joins | Handles joins across different databases using the relationship definitions in your semantic layer |
| Incremental loading | Supports updated_at watermarks, CDC-style change tracking, and merge/upsert strategies |
| Type coercion | Automatically maps source types to target types (e.g., PostgreSQL JSONB → BigQuery JSON) |
| Validation | Row count comparison, schema drift detection, null/duplicate checks |
| Pipeline as code | Outputs reusable Python scripts, SQL files, or Airflow DAGs |
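The validation checks listed above (row count comparison, null/duplicate detection) reduce to a few simple comparisons. A minimal pure-Python sketch, assuming extracted and loaded rows are available as lists of dicts (the real agent runs these checks in SQL, e.g. against a DuckDB intermediate step):

```python
from collections import Counter

def validate_load(source_rows, target_rows, key="order_id"):
    """Compare extracted vs. loaded rows; return a list of issue strings."""
    issues = []
    # Row count comparison between source extract and target load
    if len(source_rows) != len(target_rows):
        issues.append(f"row count mismatch: {len(source_rows)} vs {len(target_rows)}")
    keys = [r.get(key) for r in target_rows]
    # Null check on the business key
    if any(k is None for k in keys):
        issues.append(f"null {key} in target")
    # Duplicate check on the business key
    dupes = [k for k, n in Counter(keys).items() if k is not None and n > 1]
    if dupes:
        issues.append(f"duplicate {key}s: {dupes}")
    return issues
```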
## Example Pipeline

```python
# Generated by ETL Pipeline Builder agent
# Source: PostgreSQL (orders model) → Target: BigQuery
from legible import MCP

mcp = MCP()

# Extract using semantic layer query (includes calculated fields)
orders = mcp.query("""
    SELECT order_id, customer_id, total_amount, profit_margin,
           order_date, status
    FROM orders
    WHERE updated_at > '{{ last_watermark }}'
""")

# Load to BigQuery, appending new rows and clustering for query performance
orders.to_bigquery(
    table="analytics.orders",
    write_disposition="WRITE_APPEND",
    clustering_fields=["order_date", "status"],
)
```
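`WRITE_APPEND` above is append-only; for the merge/upsert strategy mentioned under Key Capabilities, the agent would instead load into a staging table and emit a `MERGE` statement. A hedged sketch of generating one (standard BigQuery `MERGE` syntax; the `build_merge_sql` helper and staging-table naming are illustrative):

```python
def build_merge_sql(target, staging, key, columns):
    """Generate a BigQuery MERGE (upsert) from a staging table into the target."""
    updates = ", ".join(f"T.{c} = S.{c}" for c in columns if c != key)
    cols = ", ".join(columns)
    vals = ", ".join(f"S.{c}" for c in columns)
    return (
        f"MERGE `{target}` T USING `{staging}` S ON T.{key} = S.{key} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "          # existing rows: update
        f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals})"  # new rows: insert
    )
```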
## Supported Source/Target Combinations
Any pair of Legible's 22+ supported connectors can be used. Common patterns:
| Source | Target | Use Case |
|---|---|---|
| PostgreSQL | BigQuery | Operational → Analytics warehouse |
| MySQL | Snowflake | Application data → Cloud warehouse |
| SQL Server | PostgreSQL | Legacy migration |
| BigQuery | DuckDB | Cloud → Local development |
| Multiple sources | DuckDB | Federation for ML training data |