ETL Pipeline Builder

An agent that designs and executes extract-transform-load pipelines between data sources — using the Legible semantic layer to understand schemas, relationships, and business logic.

The Problem

Building ETL pipelines requires understanding source and target schemas, data types, join relationships, business transformation rules, and error handling. Data engineers spend days writing boilerplate extraction scripts, type conversions, and validation logic that could be inferred from a well-defined semantic layer.

How It Works

  1. The agent reads your MDL (Modeling Definition Language) models from the Legible semantic layer via MCP
  2. It understands table relationships, calculated fields, and metrics — not just raw schemas
  3. It generates extraction queries optimized for each source connector (PostgreSQL, BigQuery, Snowflake, etc.)
  4. It applies transformations based on the semantic layer's business logic (calculated fields, metrics)
  5. It loads data into the target, handling type coercion, upserts, and incremental loads
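The extract → transform → load flow above can be sketched in miniature. This is an in-memory illustration, not the agent's actual output: the function names and field names (`total`, `cost`, `profit_margin`) are hypothetical, and a real pipeline would run connector-specific SQL instead of Python list comprehensions.

```python
# Minimal in-memory sketch of steps 3-5: incremental extract, a semantic-layer
# calculated field, and a load step. All names are illustrative.

def extract(rows, last_watermark):
    # Incremental extraction: keep only rows changed since the last watermark.
    return [r for r in rows if r["updated_at"] > last_watermark]

def transform(rows):
    # Apply a calculated field defined in the semantic layer (hypothetical formula).
    return [{**r, "profit_margin": (r["total"] - r["cost"]) / r["total"]} for r in rows]

def load(rows, target):
    # Stand-in for the warehouse writer; a real load would upsert into the target.
    target.extend(rows)
    return len(rows)
```

In the real blueprint, `extract` corresponds to the generated source query, `transform` to the semantic layer's business logic, and `load` to the target connector with type coercion and upserts.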

Blueprint

```yaml
agent:
  type: claude
  description: ETL pipeline design and execution

components:
  sandbox:
    image: ghcr.io/nvidia/openshell/sandbox-base:latest
    resources:
      cpus: "4.0"
      memory: "16g"
  inference:
    profiles:
      nvidia:
        model: meta/llama-3.3-70b-instruct
        provider_type: nvidia

policies:
  network: policy.yaml
```

Network policy:

```yaml
version: "1.0"
rules:
  - name: legible-mcp
    protocol: tcp
    port: 443
    destination: "your-legible-instance.example.com"
  - name: source-db
    protocol: tcp
    port: 5432
    destination: "source-db.example.com"
  - name: target-warehouse
    protocol: tcp
    port: 443
    destination: "*.bigquery.googleapis.com"
```

Usage

```shell
legible agent create etl-builder --blueprint etl-pipeline --profile nvidia
legible agent connect etl-builder

# Inside the sandbox:
# > Build an ETL pipeline to sync the orders model from PostgreSQL to BigQuery
# > Include all calculated fields and apply incremental loading by updated_at
# > Generate a DuckDB intermediate step for data validation
# > Run the pipeline in dry-run mode first and show me the row counts
# > Schedule this pipeline to run every 6 hours
```

Key Capabilities

| Capability | Description |
| --- | --- |
| Semantic-aware extraction | Uses MDL models to generate optimized queries, not just `SELECT *` |
| Cross-source joins | Handles joins across different databases using the relationship definitions in your semantic layer |
| Incremental loading | Supports `updated_at` watermarks, CDC-style change tracking, and merge/upsert strategies |
| Type coercion | Automatically maps source types to target types (e.g., PostgreSQL JSONB → BigQuery JSON) |
| Validation | Row count comparison, schema drift detection, null/duplicate checks |
| Pipeline as code | Outputs reusable Python scripts, SQL files, or Airflow DAGs |

Example Pipeline

```python
# Generated by ETL Pipeline Builder agent
# Source: PostgreSQL (orders model) → Target: BigQuery

from legible import MCP

mcp = MCP()

# Extract using semantic layer query (includes calculated fields)
orders = mcp.query("""
    SELECT order_id, customer_id, total_amount, profit_margin,
           order_date, status
    FROM orders
    WHERE updated_at > '{{ last_watermark }}'
""")

# Load to BigQuery with an append-only incremental strategy
orders.to_bigquery(
    table="analytics.orders",
    write_disposition="WRITE_APPEND",
    clustering_fields=["order_date", "status"],
)
```
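When the target needs merge/upsert semantics instead of append-only loading, the agent would emit a `MERGE` statement against a staging table. A sketch of generating standard BigQuery MERGE SQL (the helper name and table names are illustrative, not part of Legible's API):

```python
def build_merge_sql(target, staging, key, columns):
    # Standard SQL MERGE: update rows that match on the key, insert the rest.
    set_clause = ", ".join(f"T.{c} = S.{c}" for c in columns)
    all_cols = ", ".join([key] + columns)
    all_vals = ", ".join(f"S.{c}" for c in [key] + columns)
    return (
        f"MERGE `{target}` T\n"
        f"USING `{staging}` S ON T.{key} = S.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({all_cols}) VALUES ({all_vals})"
    )
```

The incremental extract writes into the staging table, and this statement reconciles it with the target in one atomic operation.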

Supported Source/Target Combinations

Any pair of Legible's 22+ supported connectors can be used. Common patterns:

| Source | Target | Use Case |
| --- | --- | --- |
| PostgreSQL | BigQuery | Operational → Analytics warehouse |
| MySQL | Snowflake | Application data → Data lake |
| SQL Server | PostgreSQL | Legacy migration |
| BigQuery | DuckDB | Cloud → Local development |
| Multiple sources | DuckDB | Federation for ML training data |