Exploratory Analysis Agent
An agent that profiles your datasets, detects anomalies, surfaces insights, and answers ad-hoc analytical questions in natural language. It pairs an LLM with direct access to your semantic layer, so queries are grounded in your defined models and metrics.
The Problem
Exploratory data analysis is iterative: run a query, look at results, form a hypothesis, run another query. Analysts spend hours writing SQL, formatting results, and context-switching between tools. The semantic layer already encodes business logic, relationships, and metrics — but leveraging that knowledge still requires manual query writing.
How It Works
- The agent connects to your Legible semantic layer via MCP (Model Context Protocol), gaining access to all models, metrics, calculated fields, and relationships
- You ask questions in natural language; the agent generates SQL using the semantic layer's context (not just raw schema)
- It profiles data distributions, detects outliers, computes correlations, and identifies trends
- It produces formatted tables, statistical summaries, and narrative explanations
- It remembers context within the session, so follow-up questions build on previous results
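
The profiling step above can be illustrated with a minimal sketch. It assumes the query results have already been fetched into a plain Python list, with `None` marking a missing value. The function name `profile_column` and the nearest-rank definition of p95 are illustrative choices, not Legible's actual implementation.

```python
import math
import statistics


def profile_column(values):
    """Summarize one numeric column; None marks a missing value."""
    total = len(values)
    present = sorted(v for v in values if v is not None)
    if not present:
        return {"rows": total, "null_rate": 1.0 if total else 0.0}
    # Nearest-rank p95: the smallest value with at least 95% of the
    # non-null data at or below it (an assumed definition).
    p95_index = max(math.ceil(0.95 * len(present)) - 1, 0)
    return {
        "rows": total,
        "null_rate": (total - len(present)) / total,
        "cardinality": len(set(present)),
        "min": present[0],
        "max": present[-1],
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "p95": present[p95_index],
    }


# Hypothetical order totals, including two missing values and one extreme order.
order_totals = [120.0, 80.0, None, 95.0, 80.0, 4000.0, 110.0, None, 60.0, 75.0]
summary = profile_column(order_totals)
for key, value in summary.items():
    print(f"{key}: {value}")
```

In this sample the mean (577.5) sits far above the median (87.5). A gap like that between the two is the kind of skew a profile is meant to surface before any outlier test is run.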
Blueprint
agent:
  type: claude
  description: Interactive exploratory data analysis
components:
  sandbox:
    image: ghcr.io/nvidia/openshell/sandbox-base:latest
    resources:
      cpus: "2.0"
      memory: "8g"
  inference:
    profiles:
      anthropic:
        model: claude-sonnet-4-20250514
        provider_type: anthropic
      nvidia:
        model: meta/llama-3.3-70b-instruct
        provider_type: nvidia
  policies:
    network: policy.yaml
Network policy:
version: "1.0"
rules:
  - name: legible-mcp
    protocol: tcp
    port: 443
    destination: "your-legible-instance.example.com"
Usage
legible agent create analyst --blueprint exploratory-analysis --profile anthropic
legible agent connect analyst
# Inside the sandbox — conversational analysis:
# > What are my top 10 customers by lifetime revenue?
# > Break that down by quarter — is there a seasonal trend?
# > Which product categories have declining revenue this year vs last?
# > Show me the correlation between order frequency and average order value
# > Profile the customers table — distributions, null rates, outliers
# > Find anomalies in the last 7 days of transaction data
Key Capabilities
| Capability | Description |
|---|---|
| Semantic-aware queries | Uses MDL models, so "revenue" means the defined metric, not a raw column guess |
| Data profiling | Column distributions, cardinality, null rates, min/max/mean/median/p95 |
| Anomaly detection | Z-score outliers, sudden volume changes, unexpected null spikes |
| Trend analysis | Time-series decomposition, year-over-year comparisons, growth rates |
| Correlation | Pearson/Spearman correlations between numeric fields |
| Cohort analysis | Group customers by signup month, analyze retention and LTV by cohort |
| Conversational context | Follow-up questions reference previous results ("break that down by region") |
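
Two of the capabilities in the table, z-score outliers and Pearson/Spearman correlation, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the agent's actual method. The threshold of 3.0, the use of population standard deviation, and all function names and sample data are hypothetical.

```python
import math
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # constant series: nothing can be an outlier
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical daily order counts with one sudden volume spike on the last day.
daily_orders = [510, 495, 505, 490, 500, 515, 485, 505, 495, 500, 510, 490, 500, 980]
outliers = zscore_outliers(daily_orders)

# Hypothetical order frequency vs. average order value (monotonic, not linear).
frequency = [1, 2, 3, 4, 5]
avg_order_value = [10, 20, 40, 80, 160]
r_pearson = pearson(frequency, avg_order_value)
r_spearman = spearman(frequency, avg_order_value)
print(outliers, round(r_pearson, 3), round(r_spearman, 3))
```

Reporting both coefficients is useful because they answer different questions. Pearson measures how linear the relationship is, while Spearman measures only whether it is monotonic, so a perfectly ordered but curved relationship scores higher on Spearman.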
Example Session
You: What does our revenue look like this quarter?
Agent: Q1 2026 revenue is $4.2M, up 12% from Q4 2025 ($3.75M).
Month breakdown:
January: $1.28M (+8% MoM)
February: $1.35M (+5% MoM)
March: $1.57M (+16% MoM)
The March spike is driven by enterprise deals: the top 5
customers contributed 40% of March revenue vs. 22% average.
You: Which enterprise customers drove that spike?
Agent: Top 5 customers by March 2026 revenue:
1. Acme Corp — $182K (3 large orders on Mar 15)
2. GlobalTech Inc — $147K (annual renewal + expansion)
3. DataFlow Systems — $112K (new customer, first order Mar 3)
4. Meridian Group — $98K (consistent monthly buyer)
5. CloudFirst — $87K (2x their typical monthly spend)
DataFlow Systems is a new account — worth flagging for the
sales team as a high-value early signal.
You: Profile the orders table for anomalies in the last 7 days.
Agent: ORDERS TABLE PROFILE — Last 7 days (Mar 26 – Apr 1)
═══════════════════════════════════════════════════
Rows: 3,847 (vs. 3,200 avg weekly — +20%)
Nulls: shipping_address: 4.2% (normal: <1%) ⚠
Outliers: 12 orders with total > $10K (normal: 2-3 per week) ⚠
Status: "failed" at 4.8% (normal: 2%) ⚠
Recommendations:
1. Investigate the shipping_address null spike — possible
checkout flow bug introduced this week
2. Review the 12 high-value orders for fraud signals
3. Check payment gateway logs for the elevated failure rate
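
The warning flags in the profile above amount to comparing each observed metric with its normal baseline. The sketch below reproduces that comparison with the numbers from the session. The 1.5x ratio threshold, the `MetricCheck` class, and the choice of the upper end of each "normal" range as the baseline are assumptions made for illustration, since the source does not state how the agent decides what to flag.

```python
from dataclasses import dataclass


@dataclass
class MetricCheck:
    name: str
    observed: float
    baseline: float  # upper end of the "normal" range quoted in the session

    @property
    def ratio(self) -> float:
        return self.observed / self.baseline


def flag_deviations(checks, max_ratio=1.5):
    """Return the names of metrics that exceed baseline by more than max_ratio."""
    return [c.name for c in checks if c.ratio > max_ratio]


checks = [
    MetricCheck("weekly_rows", 3847, 3200),               # +20%, within tolerance
    MetricCheck("shipping_address_null_pct", 4.2, 1.0),   # normal: <1%
    MetricCheck("orders_over_10k", 12, 3),                # normal: 2-3 per week
    MetricCheck("failed_status_pct", 4.8, 2.0),           # normal: 2%
]
flagged = flag_deviations(checks)
print(flagged)
```

With this threshold the row count (about 1.2x its baseline) passes, while the null rate, the high-value order count, and the failure rate are flagged, matching the three warnings in the session.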
Supported Databases
Works with all 22+ Legible connectors. The agent generates dialect-appropriate SQL through the semantic layer and adapts analysis to the available data.
Why Not Just Use the Chat UI?
The Legible web UI already supports natural language queries. The sandbox agent adds:
- Persistent session context — multi-turn conversations that build on previous results
- Programmatic access — the agent can write scripts, save results to files, chain complex analyses
- Custom tooling — install Python packages (pandas, scipy, matplotlib) inside the sandbox for statistical analysis
- Automation — schedule recurring analyses or trigger them from CI/CD pipelines