Project · Architecture
Every analytics AI today stops at the data layer. The real architecture is six layers and a flywheel — and the agent that owns it isn't a tool upgrade, it's a role transformation.
Today's analytics AI operates at the data layer only: it can query your tables and return numbers. But numbers without meaning are noise. The agent doesn't know why you're asking, what the business cares about, or what to do with the answer. Hex is the clearest example — it connects to your warehouse, generates SQL, and returns results, but with zero understanding of your business context. You end up doing the interpretive work yourself.
The architecture — what the agent is. Six structural layers, each solving a problem the layer above can't.
Event telemetry that builds and evolves the model
The agent doesn't just consume data — it spots what's missing and proposes new telemetry to make itself smarter. Every unanswerable question becomes a spec.
Agent tracks queries it can't answer and maps them to missing events, properties, or granularity.
Auto-drafts event specs with names, properties, types, and validation rules. Eng reviews and ships.
As the product changes, the agent proposes taxonomy updates — deprecate stale events, add new flows, maintain consistency.
Example: Gap Detected
Trigger
PM asked "Why do APAC users drop off at step 3?"
Missing
No time_spent or error_state property on onboarding_step_completed
Impact
Cannot distinguish rage-quits from confusion from technical failures
Frequency
12 similar queries in 30 days
Proposed Spec: onboarding_step_completed
| Property | Type | Description |
|---|---|---|
| time_spent_seconds | integer | Wall-clock time on step |
| error_count | integer | Validation errors before completion |
| exit_type | enum | One of: completed, abandoned, back_navigated, error |
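The gap-to-spec flow is easy to picture as data. A minimal TypeScript sketch with hypothetical shapes (`EventGap`, `EventSpec` — none of these are Mixpanel APIs), mirroring the APAC example above:

```typescript
// Hypothetical shapes — not Mixpanel APIs. The gap is the one detected
// in the APAC example; the spec is what the agent drafts for eng review.

interface EventGap {
  question: string;             // the query the agent couldn't answer
  event: string;                // existing event lacking granularity
  missingProperties: string[];  // what would make it answerable
  frequency: number;            // similar failed queries, trailing 30d
}

interface PropertySpec {
  name: string;
  type: "integer" | "boolean" | "string" | "enum";
  allowedValues?: string[];     // enum values
  description: string;
}

interface EventSpec {
  event: string;
  status: "proposed";           // eng reviews and ships
  properties: PropertySpec[];
}

const gap: EventGap = {
  question: "Why do APAC users drop off at step 3?",
  event: "onboarding_step_completed",
  missingProperties: ["time_spent", "error_state"],
  frequency: 12,
};

const draft: EventSpec = {
  event: gap.event,
  status: "proposed",
  properties: [
    { name: "time_spent_seconds", type: "integer", description: "Wall-clock time on step" },
    { name: "error_count", type: "integer", description: "Validation errors before completion" },
    {
      name: "exit_type",
      type: "enum",
      allowedValues: ["completed", "abandoned", "back_navigated", "error"],
      description: "How the user left the step",
    },
  ],
};
```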
Taxonomy Monitors (Continuous Hygiene)
Stale Events
Events with zero volume for 30+ days → flag for deprecation
Naming Drift
Detect inconsistent naming (button_click vs btn_clicked) → propose consolidation
Coverage Score
% of user flows with full instrumentation → surface blind spots
New Feature Detection
When eng ships a feature, check if events exist. If not → auto-propose spec.
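Two of these monitors are simple enough to sketch. Assuming a hypothetical `TaxonomyEvent` record carrying trailing 30-day volume (not a real Mixpanel type):

```typescript
interface TaxonomyEvent { name: string; volumeLast30d: number }

// Stale events: zero volume for 30+ days → flag for deprecation.
function staleEvents(events: TaxonomyEvent[]): string[] {
  return events.filter(e => e.volumeLast30d === 0).map(e => e.name);
}

// Naming drift: normalize names to a stem so button_click and
// btn_clicked collapse together, then surface any stem with >1 spelling.
function namingDrift(events: TaxonomyEvent[]): string[][] {
  const byStem = new Map<string, string[]>();
  for (const e of events) {
    const stem = e.name.replace(/^btn_/, "button_").replace(/ed$/, "");
    byStem.set(stem, [...(byStem.get(stem) ?? []), e.name]);
  }
  return [...byStem.values()].filter(group => group.length > 1);
}
```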
The Flywheel Effect
Week 1: PM asks why APAC drops off at step 3. Spark can't answer — no granularity on the step.
Week 2: Agent proposes time_spent, error_count, and exit_type properties. Eng reviews and ships the instrumentation.
Week 4: Data flows in. Agent answers: APAC spends 3x longer on step 3 with 2.4 avg errors. 68% exit via back-navigation. Validation messages aren't localized.
Week 5: Agent monitors the new telemetry, detects the pattern, surfaces the insight at HIGH confidence, and proposes an experiment for localized validation copy.
The question that couldn't be answered in Week 1 gets answered proactively in Week 5 — because the agent built the instrumentation to make itself smarter.
Every question the agent can't answer makes the next answer better. Also a land-and-expand engine — the agent becomes the reason teams add more events, which drives Mixpanel usage.
Raw events, schemas, metric formulas
What Mixpanel already has — the structured event model. The advantage over warehouse-connected tools is that the data is pre-organized around product behavior.
Events
signup_completed
User completes registration flow
Properties: signup_method (email | google_sso | apple_sso), platform (web | ios | android)
Volume: ~15K/day
onboarding_step_completed
User completes a step in onboarding wizard
v2 launched Apr 14. Steps 2 and 3 reordered.
spark_query
User asks Spark AI a question
Properties: resolved (boolean), resolution_method (direct_answer | report_generated | failed)
Metrics
| Metric | Formula | Current |
|---|---|---|
| activation_rate | (onboarding completed + first report within 7d of signup) / signups | 0.34 |
| weekly_active_users | unique users with 3+ sessions in trailing 7d | — |
Excludes test accounts; 3-session threshold adopted Q1 2026.
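As a sanity check on the formula, a minimal sketch of `activation_rate` over a hypothetical `UserRecord` shape (field names are illustrative):

```typescript
interface UserRecord {
  signedUpAt: Date;
  onboardedAt?: Date;     // when onboarding completed, if ever
  firstReportAt?: Date;   // when the first report was created, if ever
  isTestAccount: boolean;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// activation_rate: users who complete onboarding AND create a first
// report within 7 days of signup, over all signups (test accounts excluded).
function activationRate(users: UserRecord[]): number {
  const signups = users.filter(u => !u.isTestAccount);
  const activated = signups.filter(u =>
    u.onboardedAt && u.firstReportAt &&
    u.onboardedAt.getTime() - u.signedUpAt.getTime() <= 7 * DAY_MS &&
    u.firstReportAt.getTime() - u.signedUpAt.getTime() <= 7 * DAY_MS
  );
  return signups.length === 0 ? 0 : activated.length / signups.length;
}
```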
This is where every tool stops — including Hex. Necessary but not sufficient.
What the data means — goals, feature maps, operating logic
This is what Hex completely lacks. The agent needs to understand what the business cares about, how features map to outcomes, and what's happened recently that affects interpretation.
OKRs, north star, targets. "Activation from 0.34 to 0.40 by Q3."
Events map to features map to user flows. Semantic meaning.
Edge cases, segment defs, recent changes that affect data.
Growth Team
Focus
Top-of-funnel acquisition and activation
OKR
activation_rate: 0.34 → 0.40 by Q3
Active Experiment
Onboarding v2 (launched 2026-04-14)
Hypothesis: Reordering steps increases activation
User Segments
APAC Cohort
geo in [JP, KR, AU, SG, IN] AND NOT internal
20% lower activation than NA. Not localized.
Recent Changes
2026-04-14
Onboarding v2 — steps 2 and 3 reordered
Affects: activation_rate, time_to_first_insight
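This layer is just structured data the agent can reason over. A sketch of the Growth Team context above as a typed object (the shape is illustrative):

```typescript
interface TeamContext {
  team: string;
  okr: { metric: string; from: number; to: number; by: string };
  activeExperiments: { name: string; launched: string; hypothesis: string }[];
  segments: Record<string, { filter: string; notes: string }>;
  recentChanges: { date: string; change: string; affects: string[] }[];
}

const growthContext: TeamContext = {
  team: "Growth",
  okr: { metric: "activation_rate", from: 0.34, to: 0.40, by: "Q3" },
  activeExperiments: [{
    name: "Onboarding v2",
    launched: "2026-04-14",
    hypothesis: "Reordering steps increases activation",
  }],
  segments: {
    apac: {
      filter: "geo in [JP, KR, AU, SG, IN] AND NOT internal",
      notes: "20% lower activation than NA. Not localized.",
    },
  },
  recentChanges: [{
    date: "2026-04-14",
    change: "Onboarding v2 — steps 2 and 3 reordered",
    affects: ["activation_rate", "time_to_first_insight"],
  }],
};
```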
Persistent Memory
Beyond static context, the agent accumulates knowledge across sessions. It is file-based and customer-scoped — and it's what makes the agent measurably better the longer it runs in your org.
"APAC step-3 drop ↔ unlocalized validation copy" — once an analyst confirms it, the agent recalls the pattern next time.
"Activated user" means different things at different orgs. The agent learns each customer's working definitions.
Stable segments built and confirmed by humans become reusable in future investigations.
Successful query templates and their refinements are remembered — fewer round-trips on similar questions.
The memory is the switching cost. Leaving means starting the agent over from zero.
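A sketch of what one file-based memory entry might look like — the source doesn't specify a storage format, so every field here is an assumption:

```typescript
// Illustrative memory record; the actual persistence format is unspecified.
interface MemoryEntry {
  kind: "confirmed_pattern" | "definition" | "segment" | "query_template";
  content: string;
  confirmedBy?: string;   // the human who validated it
  scope: string;          // customer/org the memory belongs to
  createdAt: string;
}

const example: MemoryEntry = {
  kind: "confirmed_pattern",
  content: "APAC step-3 drop ↔ unlocalized validation copy",
  confirmedBy: "analyst",
  scope: "acme-corp",           // hypothetical customer scope
  createdAt: "2026-01-15",
};
```

Recall is scoped: only this customer's memories are searched, which is exactly why the accumulated files become the switching cost.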
When activation drops, the agent knows onboarding changed two weeks ago, APAC behaves differently, the team's OKR target is 0.40, and the analyst confirmed last quarter that step-3 issues for APAC are usually localization. It reasons about why, not just what.
What's happening, why, and whether it matters
The orchestrator synthesizes data + context into causal narratives. It doesn't just answer questions — it generates the insight the PM would arrive at after an hour of digging.
The Reasoning Chain
What changed?
Identify metric, magnitude, and timeframe
Who's affected?
Break down by segment, platform, cohort
What happened?
Cross-reference recent_changes for root causes
So what?
Compare to OKR targets — is the trajectory at risk?
Now what?
Recommend action based on severity and ownership
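The chain is a pipeline in which each step consumes the previous step's output. A sketch with stubbed step functions (every name here is hypothetical):

```typescript
interface Finding { metric: string; delta: number; window: string }

// Stubs: in a real system these are model- or query-backed steps.
declare function whatChanged(q: string): Promise<Finding>;                  // metric, magnitude, timeframe
declare function whoIsAffected(f: Finding): Promise<string[]>;              // segment/platform/cohort
declare function crossRefChanges(f: Finding): Promise<string | undefined>;  // recent_changes lookup
declare function compareToOKR(f: Finding): Promise<boolean>;                // trajectory vs target
declare function recommend(f: Finding, cause?: string): Promise<string>;    // severity- and owner-aware

async function investigate(question: string) {
  const finding = await whatChanged(question);
  const segments = await whoIsAffected(finding);
  const likelyCause = await crossRefChanges(finding);
  const okrAtRisk = await compareToOKR(finding);
  const recommendation = await recommend(finding, likelyCause);
  return { finding, segments, likelyCause, okrAtRisk, recommendation };
}
```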
Subagent Orchestration
Complex investigations don't run sequentially. The orchestrator decomposes the question into independent threads, spawns specialized subagents in parallel, and synthesizes their reports.
Inspired by Claude Code's Task tool. Three subagents running in parallel cut investigation time from minutes to seconds — and the synthesizer can spot patterns no single thread would have seen.
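The fan-out/fan-in shape is the whole idea. A sketch, with hypothetical subagent threads and a stubbed synthesizer:

```typescript
interface SubagentReport { thread: string; findings: string[] }

// Stubs: these would be model calls in a real system.
declare function runSubagent(thread: string, prompt: string): Promise<SubagentReport>;
declare function synthesize(reports: SubagentReport[]): string;

async function orchestrate(question: string): Promise<string> {
  // Decompose into independent threads and run them in parallel.
  const reports = await Promise.all([
    runSubagent("segmentation", `Break down by segment: ${question}`),
    runSubagent("change-history", `Cross-reference recent changes: ${question}`),
    runSubagent("experiment-scan", `Check active experiments: ${question}`),
  ]);
  // The synthesizer sees all threads at once — the cross-thread patterns
  // no single subagent would have seen.
  return synthesize(reports);
}
```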
Skills (Loadable Capabilities)
The agent doesn't have one fixed reasoning loop — it composes specialized skills based on the task. Skills are versioned, extensible, and can be authored by domain experts inside the customer org.
Sig tests, power calcs, MDE, multi-arm experiment math. Loaded when an experiment crosses sample-size thresholds.
Behavioral cohorting, retention curves, segment construction. Loaded for any "who" question.
Decay-curve fitting, churn driver analysis. Loaded when retention metrics move or are queried.
Briefing format, narrative structure, audience-specific tone. Loaded for any reporting output.
Hypothesis framing, segment selection, guardrail picking. Loaded when drafting new experiments.
Distinguishes signal from noise — seasonality, holiday effects, infra blips, real movements.
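One way to picture skill loading: a registry keyed by task signals, composing only the skills a question needs. Skill names mirror the list above; the trigger heuristics are illustrative:

```typescript
// Illustrative trigger heuristics — real routing would be model-driven.
const registry: Record<string, (task: string) => boolean> = {
  "experiment-stats":  t => /experiment|significance|sample size/i.test(t),
  "cohort-analysis":   t => /\bwho\b|segment|cohort/i.test(t),
  "retention":         t => /retention|churn/i.test(t),
  "exec-briefing":     t => /report|briefing|summary/i.test(t),
  "experiment-design": t => /hypothesis|new experiment/i.test(t),
  "anomaly-triage":    t => /spike|drop|anomaly/i.test(t),
};

// Compose only the skills the task needs into the agent's context.
function loadSkills(task: string): string[] {
  return Object.entries(registry)
    .filter(([, matches]) => matches(task))
    .map(([name]) => name);
}
```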
This is the output a PM actually acts on. Not "activation is 31%" but the full causal narrative with a recommended next step — produced by an orchestrator running parallel subagents that loaded the right skills for the question.
Deterministic verification of probabilistic outputs
AI outputs are probabilistic. Business decisions need certainty. The harness doesn't make the agent deterministic — it makes the agent auditable. Every insight is verifiable, drillable, and routable to a human validator before it becomes an action.
Every claim links to the underlying data. Click any number in an insight to see the raw events, filters, and query that produced it.
One-click send any insight to an analyst, DS, or eng for human verification. Comes packaged with full context and methodology.
Every insight carries a confidence grade. High = act on it. Medium = review the methodology. Low = agent flags its own uncertainty.
Drilldown Behavior
Each metric, segment, and claim in an insight links to the underlying Mixpanel report. User clicks "31% activation" → sees the actual funnel with filters, date range, and cohort definition applied.
Principle: Show your work. Always.
Validation Routing
Send any insight for human verification with packaged context:
Routing Targets:
Analyst
Channels: Slack, Email | SLA: 4 hours for high-priority
Data Scientist
Channels: Slack, Jira | For: Statistical claims or causal inference
Confidence Model
HIGH (0.85+)
Criteria: Direct metric read, >1K sample, no joins
Green badge. Actionable without review.
MEDIUM (0.60–0.84)
Criteria: Derived metric, cross-segment comparison, or small N
Yellow badge. "Review methodology" link prominent.
LOW (<0.60)
Criteria: Causal claim, sparse data, or novel query pattern
Agent explicitly says "I'm not confident — here's why."
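The grades reduce to a rubric. A sketch using the criteria above (the bands come from this section; the feature shape and cutoffs inside each band are illustrative):

```typescript
interface ClaimFeatures {
  directRead: boolean;   // direct metric read vs derived
  sampleSize: number;
  joins: number;
  causal: boolean;       // causal claims grade lower
  novelQuery: boolean;   // novel query patterns grade lower
}

type Grade = "HIGH" | "MEDIUM" | "LOW";

function gradeClaim(f: ClaimFeatures): Grade {
  if (f.directRead && f.sampleSize > 1000 && f.joins === 0 && !f.causal) {
    return "HIGH";   // 0.85+ — actionable without review
  }
  if (!f.causal && !f.novelQuery && f.sampleSize > 100) {
    return "MEDIUM"; // 0.60–0.84 — review methodology
  }
  return "LOW";      // <0.60 — agent flags its own uncertainty
}
```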
Audit Trail (Full Provenance)
Example: The APAC Activation Drop
Agent: "APAC activation dropped 6pts." Confidence: HIGH (direct read, large sample).
Agent: "Likely caused by Onboarding v2 step reordering." Confidence: MEDIUM (correlation, not proven causal).
PM clicks "28% → 22%" — gets the activation funnel filtered to APAC on v2, step drop-off visible. Hits "Send to validate" — an analyst receives the claim with full query context, confirms or corrects it, and the correction feeds back into the agent.
Every probabilistic output has a deterministic verification path. That's how business teams actually use it.
Close the loop — propose actions, human confirms
The agent proposes actions, but a human confirms. Trust is built by being right repeatedly, not by acting autonomously.
Plan Mode
Multi-step actions never go straight to execution. The agent surfaces a structured plan first — steps, dependencies, reversibility — so the human can preview the full sequence, edit it, or kill it before anything ships.
Example plan: APAC fix
1. run_segmented_funnel(cohort=APAC_v2) ↺ reversible
2. draft_experiment_hypothesis(seg=APAC, treatment=localized_copy) ↺ reversible
3. configure_experiment(rollout=5%, guardrails=[activation, error_rate]) ⊘ requires approval
4. slack_post_briefing(channel=#growth, summary=plan) ↺ reversible
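A plan is just typed steps with reversibility flags. A sketch of the APAC plan above (shapes are illustrative):

```typescript
interface PlanStep {
  tool: string;
  args: Record<string, unknown>;
  reversible: boolean;   // ↺ reversible vs ⊘ requires approval
}

const apacFixPlan: PlanStep[] = [
  { tool: "run_segmented_funnel",        args: { cohort: "APAC_v2" },                                       reversible: true },
  { tool: "draft_experiment_hypothesis", args: { seg: "APAC", treatment: "localized_copy" },                reversible: true },
  { tool: "configure_experiment",        args: { rollout: 0.05, guardrails: ["activation", "error_rate"] }, reversible: false },
  { tool: "slack_post_briefing",         args: { channel: "#growth", summary: "plan" },                     reversible: true },
];

// Nothing executes until a human previews, edits, or kills the sequence;
// irreversible steps additionally require explicit approval.
```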
Typed Tool Catalog
Every action the agent can take is a typed tool with a defined schema. No prose-shaped actions, no shell escapes — the agent's surface is exactly what's catalogued, and the catalog is auditable.
query_runner(spec: QuerySpec) → QueryResult
segment_builder(definition: SegmentSpec) → SegmentID
experiment_creator(hypothesis, segments, metrics, guardrails) → ExperimentID
experiment_analyzer(exp_id: ExperimentID) → AnalysisReport
slack_message(channel, content, thread_id?) → MessageID
alert_route(insight, target, urgency) → AlertID
report_generator(template, data_refs, period) → ReportURL
instrumentation_proposer(gap: EventGap) → EventSpec
memory_recall(query, scope) → MemoryEntry[]
escalate(insight, owner, severity) → TicketID
Customer-extensible via MCP — orgs add their own tools (Linear, Notion, Jira, internal admin APIs) without touching agent code.
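One catalog entry, sketched as a typed tool definition. The schema shapes are illustrative — not a real Mixpanel or MCP schema — but the structural point holds: if a tool isn't in the catalog, the agent can't call it.

```typescript
// Illustrative shapes; not real Mixpanel or MCP types.
interface QuerySpec { event: string; filters: Record<string, string>; dateRange: [string, string] }
interface QueryResult { rows: Record<string, unknown>[]; query: string }

interface ToolDef<In, Out> {
  name: string;
  description: string;
  run: (input: In) => Promise<Out>;
}

const queryRunner: ToolDef<QuerySpec, QueryResult> = {
  name: "query_runner",
  description: "Run a typed query against the event store",
  run: async spec => ({ rows: [], query: JSON.stringify(spec) }), // stubbed
};
```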
Route to team owner via Slack, email, or in-app. Critical = immediate. Warning = daily digest.
Generate hypothesis, segments, metrics, sample size. PM reviews before launch.
Auto-build supporting Mixpanel report with relevant funnels and retention curves.
When a metric threatens an OKR and no experiment addresses it — structured escalation with context.
Agent proposes. Human approves. Human-in-the-loop by design.
Over time, as trust increases, the approval threshold relaxes — auto-execute reversible steps, auto-send daily digests, but still require approval for irreversible actions like experiment launches.
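The relaxation rule itself stays small. A sketch with an illustrative trust threshold:

```typescript
// Illustrative threshold; irreversible steps are always gated.
function requiresApproval(step: { reversible: boolean }, trustScore: number): boolean {
  if (!step.reversible) return true;  // e.g., experiment launches
  return trustScore < 0.8;            // reversible steps relax as trust grows
}
```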
A cycle that runs continuously, pulling from each layer in turn.
Every other approach lives at one layer. The compounding moat is owning all six.
Warehouse-connected AI (e.g., Hex): Operates at Layer 1 only. Can query data but can't tell you what it means, why it changed, or what to do. You do the interpretive work yourself.
Dashboards and BI: Static views of Layer 1 data. They don't reason, don't detect, don't propose. The PM must notice anomalies, investigate causes, and decide actions manually.
General-purpose chat AI: Can access data but has no persistent context layer. Every conversation starts from zero — a smart stranger every time.
This agent: Self-improving instrumentation + opinionated data model + persistent context + causal reasoning + validation harness + human-confirmed actions. A flywheel that compounds.
Most horizontal AI startups die in the narrow strip between model labs above and platform incumbents below. The survivors share one structural property: they own a vertical deep enough that neither jaw wants to chew it. Mixpanel can be that for product analytics.
Essay: "Eaten for Breakfast" — the pincer in detail.
The pattern, elsewhere
Harvey (legal): Years of legal-domain context, document workflows, and compliance integration that horizontal LLMs can't replicate without the same multi-year build.
Sierra (customer experience): Owns the customer-experience workflow end-to-end — outcome pricing, deep CRM integration, vertical-specific evals. Not "AI that does support" — the support agent.
Glean (enterprise search): Connectors to every internal system, learned org-specific permissions and document graphs. The accumulated context is the moat, not the model.
Mixpanel's version
The pre-organized event model is the foothold. Four things compound on top of it into a moat the pincer can't close on:
The model labs don't want to become a product analytics company. The hyperscalers don't want to become a metrics-review company. The space sits in the middle — inedible to one or both jaws, exactly where Sierra, Harvey, and Glean sit.
When the flywheel matures, the agent doesn't assist the PM — it runs the operating rhythm and brings humans in at the decisions that matter.
Today's Operating Paradigm
Monday: PM scans 4 dashboards, misses the anomaly in the 5th they skipped.
Metric drops: Noticed 3 days late. A day to pull, a day to slice, a day to hypothesize. Week lost.
Experiment analysis: DS runs sig tests, writes a doc, presents in a meeting. 2-week cycle from data to decision.
New feature ships: Eng forgets instrumentation on 2 of 5 flows. Nobody notices for a month.
Quarterly planning: PM manually assembles trends and writes the narrative. As good as the data they remembered to pull.
Autonomous Agent Paradigm
Monday: Slack briefing before laptop opens — top 3 movements, causal narratives, recommended actions. 5-minute review.
Metric drops: Detected within hours. Auto-segmented, cross-referenced against recent changes, confidence-scored, routed with a causal narrative.
Experiment analysis: Continuous monitoring. Flags significance the moment it's reached, drafts the analysis with a decision. Same-day cycle.
New feature ships: Agent detects code paths without telemetry, drafts event specs, tracks adoption from day one. Zero gaps.
Quarterly planning: Agent maintains a living narrative with citations. PM edits, not assembles.
The Four Autonomous Cycles
The agent runs these four cycles continuously in parallel, feeding insights back into the organization's decision loop.
What the Agent Runs Autonomously
These cycles run continuously in the background. The PM doesn't initiate them — the agent does. Humans enter the loop at decision points, not discovery points.
Continuous Health Monitoring
Agent watches every OKR metric, guardrail, and experiment at 15-minute intervals. Compares against trailing baselines, seasonal patterns, and known change events.
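One monitoring tick is a baseline comparison. A sketch using a simple trailing-window z-score (the threshold and shapes are illustrative; the real check would also weigh seasonality and known change events):

```typescript
interface MetricReading { metric: string; value: number; at: Date }

// Flag the latest reading if it sits more than `sigmas` standard
// deviations from the trailing baseline.
function isAnomalous(latest: MetricReading, trailing: number[], sigmas = 3): boolean {
  const mean = trailing.reduce((a, b) => a + b, 0) / trailing.length;
  const sd = Math.sqrt(trailing.reduce((a, b) => a + (b - mean) ** 2, 0) / trailing.length);
  return Math.abs(latest.value - mean) > sigmas * sd;
}

// Runs every 15 minutes per OKR metric, guardrail, and experiment;
// anomalies are cross-referenced against recent_changes before alerting.
```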
Adaptive Instrumentation
Agent monitors failed queries, low-confidence insights, and new feature deployments. Proposes telemetry changes weekly. Tracks instrumentation coverage score.
Experiment Lifecycle Management
Agent manages experiments from hypothesis through decision. Monitors power, checks for interaction effects, calls significance, drafts the analysis, and recommends ship/kill/extend.
Narrative & Reporting Automation
Agent maintains a living narrative of what's happening, why, and what the team should do. Generates reports automatically — PM edits and approves, never assembles from scratch.
Daily
Slack briefing — top 3 movements
Weekly
Metrics review with trend analysis
Monthly
OKR progress report with forecasts
Quarterly
Full narrative with citations
PM Time Allocation Shift
| | Today | With Agentic Platform |
|---|---|---|
| Synthesizing signal & insights | 50% | 10% |
| Decision-making, strategy, execution | 50% | 70% |
| Creative work | — | 20% |
| Metrics operated deeply | 3-4 | 20-30, at the same depth |
What the Agent Absorbs
Monitor metrics continuously
Detect & segment anomalies
Generate causal narratives
Package context for decisions
An agentic system that crawls your product, instruments itself, runs experiments, and optimizes continuously — paid for in tokens instead of headcount.
Agentic Crawler
A persistent agent that navigates your product like a user — every screen, every flow, every edge case. Maintains a living map of the product surface and continuously diffs against the instrumentation layer.
Kills "we didn't track that" permanently.
Autonomous Experimentation
The agent doesn't just detect problems — it designs hypotheses, configures feature flags, sets targeting and metrics, monitors for significance, and calls the result. The build-measure-learn loop runs continuously.
PMs approve strategy. The agent executes.
Token Economics
Every agent action has a token cost. The system optimizes for ROI per token — not just "did we find an insight" but "did that insight generate enough value to justify the compute?" The new unit economics layer for AI-native product development.
The agent learns to spend tokens where they compound.
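The accounting is simple even if value attribution isn't. A sketch (all shapes illustrative):

```typescript
interface AgentAction {
  tokensSpent: number;
  estimatedValue: number; // e.g., analyst-hours saved, priced in dollars
}

// ROI per token across a window of agent actions.
function roiPerToken(actions: AgentAction[]): number {
  const tokens = actions.reduce((sum, a) => sum + a.tokensSpent, 0);
  const value = actions.reduce((sum, a) => sum + a.estimatedValue, 0);
  return tokens === 0 ? 0 : value / tokens;
}

// The optimizer shifts budget toward action types whose ROI compounds —
// e.g., instrumentation proposals that improve every future answer.
```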
"Today's analytics AI returns numbers. You still have to pull them, interpret them, and decide what to do. The platform we've sketched does all three — it knows your product, watches it continuously, and turns every question it can't answer into a reason to instrument better. The numbers arrive with meaning attached. The PM stops assembling and starts deciding. That's the shift."