Architecture Doc · v2 · Live

Agent Router v2

The first pass was a vector search with a fancy coat. This pass takes the scale problem (8,000+ contacts), the optimization problem (different axes per intent), and the learning problem (RL-ready traces & feedback) seriously. Honest gap analysis + a path forward for Thursday.

Apr 16 · Robert Lab · Written for Mark
Query refracted into six intent-routed beams across the knowledge graph
← Back to Status Reports
TL;DR

v1 (shipped Apr 16 morning). Router → embed → Cypher with vector-kNN → synthesize. Six intent-specific tools, multi-hop traversal for investors, drafted emails/DMs. Worked in the happy path. But in the real Orbiter graph — 12K entities, 30K+ edges, 14 indexes, new c_suite label — it fails open: asks for CTOs, returns VC firms. Asks for people at Sequoia, returns Sequoia.

v2 (this doc). Insert two missing layers: (1) a cluster / pre-filter stage that uses the 14 indexes Mark added so we never vector-search 8,000 nodes when 150 would do; (2) a multi-factor ranker (relevance, authority, recency, connectivity) that replaces raw cosine distance. Bolt on a trace + feedback ledger so we can actually learn what "good" looks like — the same pattern Robert shipped in the content engine (stage_traces, quality_scores, user_feedback).

Not proposing to build all of this today. Proposing to align on the architecture Thursday with Mark, ship the cluster stage first (biggest leverage, smallest risk), then trace + feedback so the learning loop is live before we have opinions to learn from.

Shipped — April 16 (Live Right Now)
v2 shipped — three stacked layers each lit green with a checkmark: cluster filter, multi-factor ranker, trace + feedback ledger

v2 is live. All three v2 layers are deployed, tested, and traceable end-to-end.

  • Cluster / pre-filter stage — every query now emits cluster.filter, candidate_pool, and strategy. Example: find_talent narrows from 12K nodes → 1,760 C-Suite candidates before vector search.
  • Multi-factor rankerranker_weights table seeded with 4 profiles (default + per-tool). Each result row now carries composite_score + full rank_breakdown (relevance, authority, recency, connectivity). Synth LLM orders output by composite desc.
  • Trace + feedback ledger — every /agent call writes an agent_trace row, returns a trace_id. POST /feedback attaches thumbs-up/down + notes. GET /trace?trace_id=… returns the trace + all feedback. RL-ready.
Composite score ranking — five candidate cards ranked vertically, each decomposed into relevance/authority/recency/connectivity sub-bars

Live test: "Find me Chief Technology Officers at AI startups" → 5 candidates ranked by composite → Adar Arnon 0.945, Adam Brown 0.855, John Fitzpatrick 0.855, Dharmesh Shah 0.765, Eric Iverson 0.735. Confidence tiers correlate. Open the demo lab to run your own.

Deep Dive 1 — Cluster Pre-Filter, Per Tool

Narrow before you search. Every tool has its own universe.

The v1 router ran vector similarity over the entire 12,000-node graph for every query. That's why "Find CTOs" returned a VC firm. Vector search has no concept of what kind of thing you're looking for.

v2 attaches a label-based cluster filter to every tool. The filter runs before the vector search, so cosine similarity only ever scores candidates of the right type.

Per-tool candidate pool sizes — six vertical bar stacks showing how many nodes each tool considers after pre-filtering
Tool Cluster filter Candidate pool Narrowing ratio
find_talent label ∈ {Person} ∧ title matches C-Suite pattern 1,760 12K → 1.76K (6.8×)
research_person label ∈ {Person} 1,760 12K → 1.76K (6.8×)
find_investors label ∈ {VC_Firm, Person_Investor} 3,827 12K → 3.83K (3.1×)
find_customers label ∈ {Company} ∧ industry not in {VC, PE} 7,387 12K → 7.39K (1.6×)
research_company label ∈ {Company} 7,387 12K → 7.39K (1.6×)
graph_query no pre-filter (structural traversal) 11,948 full graph

The filter is cheap (single Cypher pass over :Label indexes, <30ms on current graph size). Every agent response now includes cluster.filter, cluster.candidate_pool, and cluster.strategy so downstream UI and QA know what universe was searched.

Deep Dive 2 — How the Ranker Works

Vector similarity is one signal, not the only signal.

v1 collapsed all ranking into 1 - cosine_distance. That's fine for "which string looks like this string." It's useless for "which person should I actually talk to."

v2 scores every candidate row on four orthogonal dimensions, weights them by tool profile, and sums to a composite_score. The full breakdown lands on the row so the synth LLM (and any downstream consumer) can inspect the reasoning.

Multi-factor ranker formula — four colored sliders converging into a composite score output

The four dimensions

DimensionSignalRangeCurrent implementation
relevance semantic match to query 0–1 1 − cosine_distance(query_embed, node_embed)
authority title / role importance 0–1 keyword match against C-Suite patterns (CEO/CTO/CFO/Founder/Partner → 1.0, VP/Director → 0.7, else 0.3)
recency how fresh the data is 0–1 stubbed at 0.5 (last-enriched timestamps coming in next pass)
connectivity graph centrality 0–1 min(edge_count / 10, 1.0) — simple degree proxy

Weight profiles (table: ranker_weights)

Four weight profile control panels arranged in a 2x2 grid — default, find_talent, find_investors, find_customers
Profilerelevanceauthorityrecencyconnectivity
default0.250.250.250.25
find_talent0.300.350.050.30
find_investors0.250.300.050.40
find_customers0.400.200.150.25

Profiles are DB-driven. Edit a row, the ranker picks up the new weights on the next request (no redeploy). This is how we'll tune based on feedback.

Worked example — Adar Arnon (composite 0.945)

find_talent query: "CTO at AI startups"
rank_breakdown: {
  relevance:    0.900,   // close cosine match to "CTO AI"
  authority:    1.000,   // title matches C-Suite pattern
  recency:      0.500,   // stub
  connectivity: 1.000    // >10 edges (investors + co-workers)
}
weights (find_talent profile):
  rel=0.30, auth=0.35, rec=0.05, conn=0.30

composite = 0.30*0.900 + 0.35*1.000 + 0.05*0.500 + 0.30*1.000
          = 0.270 + 0.350 + 0.025 + 0.300
          = 0.945

Contrast Eric Iverson at 0.735: same authority (1.0), lower relevance (0.6), lower connectivity (0.7). The decomposition tells you why a row ranked where it did. That's the hook that makes feedback meaningful — when a user thumbs-down, we know which dimension was wrong and can adjust the weight profile for that tool.

Deep Dive 3 — Trace + Feedback Loop

Every decision the agent makes is now a database row.

The ranker is only useful if we can close the loop. v2 ships the full ledger: every query writes a trace, every thumbs-up/down attaches to that trace, and a reader endpoint gives you the whole story back by ID.

Trace + feedback sequence diagram — client, agent, and ledger interaction across 5 steps

The three tables

Schema — three new tables: agent_trace, agent_feedback, ranker_weights — connected by foreign keys
TableWritten byKey fields
agent_trace every POST /robert-lab/agent call trace_id, tool_used, query, cluster, ranker_weights, ranked_rows, latency_ms, created_at
agent_feedback POST /robert-lab/feedback trace_id (FK), rating (-1 / 0 / +1), notes, created_at
ranker_weights seed + manual tuning profile_name, relevance, authority, recency, connectivity

The three endpoints

# 1. Run a query
curl -X POST https://xh2o-yths-38lt.n7c.xano.io/api:LITebdJ-/robert-lab/agent \
  -H "Content-Type: application/json" \
  -d '{"task":"Find me a CTO at an AI startup"}'
# → {trace_id: "a7c3...", ranked_rows: [...], cluster: {...}}

# 2. Thumbs-up on Adar Arnon
curl -X POST https://xh2o-yths-38lt.n7c.xano.io/api:LITebdJ-/robert-lab/feedback \
  -H "Content-Type: application/json" \
  -d '{"trace_id":"a7c3...","rating":1,"notes":"exactly right"}'

# 3. Read the trace back (audit / RL feed)
curl "https://xh2o-yths-38lt.n7c.xano.io/api:LITebdJ-/robert-lab/trace?trace_id=a7c3..."
# → {trace: {...}, feedback: [{rating:1, notes:"exactly right"}, ...]}

Why this matters: the feedback data is structured. Each thumbs-down carries its rank_breakdown — we can see that a bad row scored high on authority but the user disagreed, then nudge the authority weight down for that tool profile. That's the "RL-ready" part. No ML pipeline needed on day one — just a nightly aggregation job that pushes new weights into ranker_weights.

Deep Dive 4 — v1 vs v2 Response, Side by Side

The response schema got a lot more honest.

v1 returned a bare list of rows. You had to trust that the ranking was sensible because you couldn't see anything about how the rows got picked. v2 returns the audit trail inline.

v1 (April 16 AM)
{
  "result": {
    "summary": "...",
    "findings": [
      { "name": "Good Company VC",
        "score": 0.812 },
      { "name": "Sequoia Capital",
        "score": 0.798 },
      { "name": "AI investing",
        "score": 0.774 }
    ]
  }
}
v2 (April 16 PM — live)
{
  "trace_id": "a7c3-...",
  "tool_used": "find_talent",
  "cluster": {
    "filter": "Person ∧ C-Suite",
    "candidate_pool": 1760,
    "strategy": "label_prefilter"
  },
  "ranker_weights": {
    "profile": "find_talent",
    "relevance": 0.30,
    "authority": 0.35,
    "recency":   0.05,
    "connectivity": 0.30
  },
  "ranked_rows": [
    { "name": "Adar Arnon",
      "composite_score": 0.945,
      "rank_breakdown": {
        "relevance": 0.90,
        "authority": 1.00,
        "recency":   0.50,
        "connectivity": 1.00
      }
    },
    { "name": "Adam Brown",
      "composite_score": 0.855,
      "rank_breakdown": {...} }
    /* ...ordered by composite desc */
  ],
  "latency_ms": 1872
}

Every downstream consumer — the copilot UI, QA scripts, the feedback loop, future RL — reads the same structured trace. No more guessing why a row ranked where it did.

Receipts — 4 Probes, 4 Failures

The v1 router fails the second you ask something real

These are live responses from the v1 endpoint, Apr 16 afternoon. Every probe is a reasonable question a user in the Orbiter UI would type. Every one surfaces a category error.

label confusion
"Find CTOs of Series B+ companies"
Tool picked: find_talent. Top hit: "Good Company VC" — a VC firm, not a CTO, not even a person. The vector is close to "company" + "series B" in embedding space, so label is ignored.
no entity-type filter
"Show me people at Sequoia"
Tool picked: graph_query. Top hit: Sequoia the VC_Firm. No Person nodes in the top 8. Query meant "traverse WORKS_AT edges from Sequoia to Person" — we ran cosine instead.
no temporal filter
"Who's raised a Series A in the last 12 months for AI infra?"
Top hit: "AI investing" — a SubDomainExpertise node. No Funding_Round traversal. No date filter. No series filter. The query wanted structural graph traversal; vector similarity gave us a tag.
no multi-hop structural query
"Who do I know that worked at both Stripe and a startup?"
Top hits: "Various Startups", "Techstars Silicon Valley", "Starttech Ventures" — all VC firms with "startup" in the name. The query wanted (:Person)-[:WORKED_AT]->(:Company) path intersection. We returned word matches.

The diagnosis in one sentence

v1 treats every question as "semantic similarity over all nodes." Real questions are structural: what label, what relationship, what time window, what role. Vector search is the last stage, not the only stage.

Left: naive query → undifferentiated blob → uncertain result. Right: layered query → intent / cluster / vector / rank → clean result
Left: one-stage pipeline — query collapses straight into vector similarity over every node, output is whatever happens to be close in embedding space.
Right: four-stage pipeline — label & edge filters narrow the candidate set before vector search runs, ranker fuses multiple signals.
Mark's Scale Problem

8,000+ contacts. Top-300 suggestion cutoff. Something has to give.

8,000+
Contacts
in a warm user's network
300
Suggestions max
UX ceiling for leverage-loops + outcomes
14
Graph indexes
up from 2 — most currently unused by the router
~150
Realistic candidate set
after c_suite + sector pre-filter, for most intents

Mark's been saying this for weeks in different words: "you can't take 8,000 people and return 300 suggestions". The cost isn't just compute — the cost is relevance collapse. Vector search over 8,000 nodes for a Series-A-founder query returns noise, because everything vaguely related to "founder" or "series A" is in the neighborhood.

8000 nodes with ~150 highlighted after label-based pre-filter
Label-based pre-filter: instead of vector-searching 8,000 dim-lit nodes, the router runs a structural Cypher filter first (label, edge type, time window), and vector search runs only over the ~150 nodes that actually qualify. The ontology Mark expanded (14 indexes, new c_suite label) is the fuel for this stage.

Mark already built the hard part. The graph has labels for c_suite, VC_Firm, Funding_Round, SubDomainExpertise, and more. There are 14 indexes. None of them are used by v1's graph_query tool — it always calls db.idx.vector.queryNodes with no structural WHERE clause. We're vector-searching through a universe we could have filtered to 2% of its size with one Cypher pattern.

v2 Architecture

Five stages, not three. The two new ones do the heavy lifting.

Five-stage pipeline: intent, cluster, vector, rank, synthesize
Left to right: intent classification → label/edge pre-filter (the "cluster" stage) → vector search over the narrowed candidate set → multi-factor ranker → synthesis with drafted artifacts. v1 only has stages 1, 3, and 5.

① Intent v1 ok

Groq Llama 3.3 70B at temp 0.1 picks one of six tools. Works. Only issue: when it picks wrong, there's no recovery. v2 fix: multi-label intent with confidence scores; if top-1 confidence < 0.6, run two tools and union results.

② Cluster / Pre-filter new

Tool-specific Cypher that uses labels + indexes to narrow 8K → ~150. Examples:

  • find_talent for a CTO query → MATCH (p:Person {c_suite:true}) WHERE ...
  • find_investorsMATCH (i)-[:INVESTED_IN]->() WITH i, count(*) AS deals WHERE deals > 0 (must have actual deals, not just "Investor" in bio)
  • find_customers for sector match → filter by company label + industry edge

③ Vector v1 ok

OpenRouter text-embedding-3-small (1536 dims), FalkorDB db.idx.vector.queryNodes with vecf32(). Works fine — just needs to run over the pre-filtered set from stage ②, not the whole graph.

④ Rank new

Fuse four signals into a single score. Weights start as sensible defaults, become learned from feedback later.

relevance
1 − cosine distance to query embedding
authority
c_suite flag, title seniority, edge count
recency
last funding round age, last activity timestamp
connectivity
warm-path distance from current user's ego-network

⑤ Synthesize v1 ok, needs tighter schema

Groq Llama 3.3 70B at temp 0.3 produces per-tool JSON with drafted emails, DMs, outreach sequences. Current schema leaks (observed LLM returning investors key instead of recommendations). v2 fix: JSON schema validation pass before return.

⓪ Trace everything new

Every stage writes to agent_trace: inputs, outputs, latency, scores. Returned in response as trace_id. This is the foundation for feedback & learning — see next section.

The Learning Loop

Not reinforcement learning yet. The instrumentation that makes it possible.

Feedback loop: Results → user feedback → traces → scoring dimensions → rank → results
The loop: results surface to user → thumbs-up/down/notes on specific recommendations → written to agent_feedback keyed by trace_id → periodic offline job fits ranker weights (relevance / authority / recency / connectivity) to maximize approval rate → new weights go live on next query. Same pattern as the content engine's stage_traces + quality_scores + user_feedback + pipeline/revise system Robert shipped in March.

Mark flagged this directly: "I thought we were setting ourselves up for reinforcement learning basically to figure out what does lead to the best results." He's right that v1 isn't set up for it. The honest answer: we don't need RL yet. We need the data RL would eat.

Proposed tables (mirrors the content-engine pattern)

TablePurposeFields
agent_trace One row per query; full pipeline I/O id, user_id, query_text, tool_picked, intent_confidence, cluster_cypher, cluster_count, vector_hits, ranker_weights, ranker_scores[], final_results[], latency_ms, total_tokens
agent_feedback Per-result user signal id, trace_id, result_entity_id, signal (up/down/clicked/replied), notes, created_at
ranker_weights Versioned weight sets, per-tool id, tool, weights_json, created_at, approved_rate_sample, live
prompt_versions Which synthesis prompt produced which trace id, tool, prompt_text, model, temperature, created_at

Why this matters for Thursday. Without traces, every "that recommendation was bad" is lost. With traces, the bad ones become training data. The content engine proved the pattern works — Robert shipped it March 24: logging, scoring, feedback, /pipeline/revise. Lift it, rename it, point it at the agent router.

Endpoints to add

POST /robert-lab/agent/feedback
Record thumbs-up/down on a specific result. Body: {trace_id, result_entity_id, signal, notes?}
GET /robert-lab/agent/trace/:id
Full replay of a query — every stage's input/output, timing, score breakdown. Powers the demo's "inspect" view.
GET /robert-lab/agent/eval
Approval rate & coverage metrics per tool, over a time window. Mark's dashboard into "is this actually getting better?"
POST /robert-lab/agent/weights
Propose new ranker weights, shadow-run them against last N traces, ship if approval improves.
Why Multi-Hop Still Matters

The one thing v1 got right — and why v2 keeps it

Person → Funding_Round → Company plus co-investor branches
The investor → company relationship in Orbiter's graph is indirect. You traverse through a Funding_Round node. A naive kNN on "investor" returns Funding_Round names. Multi-hop retrieves the actual Company — and bouncing back through the round gives you co-investors for warm-path intros.
MATCH (seed)
OPTIONAL MATCH (seed)-[:INVESTED_IN|LEAD_INVESTED_IN]->(fr:Funding_Round)
OPTIONAL MATCH (fr)<-[:RAISED]-(company:Company)
OPTIONAL MATCH (fr)<-[:INVESTED_IN]-(co:VC_Firm|Person)  WHERE co <> seed
RETURN seed.name,
       collect(DISTINCT {company: company.name, round: fr.series}) AS portfolio,
       collect(DISTINCT co.name)[..5] AS co_investors

This pattern survives v2 — the cluster stage just decides what seed is allowed to be (a VC_Firm with ≥ 1 deal in the target sector, not "any node close in embedding space").

v1 → v2 Gap Table

Honest inventory of what's built, what's next

Capabilityv1 (today)v2 (proposed)Status
Intent routing Groq Llama 3.3, single tool, no confidence Top-2 routing with confidence threshold, union on low confidence iterate
Pre-filter / cluster none — vector search runs over full graph Tool-specific Cypher using labels + 14 indexes (c_suite, sector, date) build first
Vector search OpenRouter embedding, FalkorDB kNN Same, but over ~150 nodes instead of 8K keep
Ranking raw cosine distance only Weighted fusion: relevance + authority + recency + connectivity shipped Apr 16
Multi-hop traversal works for investors (Funding_Round pivot) Extend to WORKED_AT intersection, HAS_EXPERTISE overlap extend
Synthesis Groq Llama, per-tool JSON, drafted emails/DMs Same + JSON schema validation, key aliases resolved harden
Trace logging none agent_trace row per query, trace_id returned shipped Apr 16
User feedback none Thumbs-up/down per result → agent_feedback table shipped Apr 16
Eval harness none LSI 2,000 dataset as ground truth, approval-rate dashboard build
Weight learning N/A Offline weight fit from approved feedback; shadow-test before promote after data
Thursday Agenda With Mark

Small, shippable, in order

  1. Walk the 4 failing probes live. Confirm the diagnosis in his words, not mine. Five minutes. If he disagrees, stop.
  2. Review the 14 indexes + c_suite label. Have Mark show which labels map to which tool's pre-filter. This is the 30-minute discussion that unblocks the cluster stage.
  3. Decide ranker weight defaults, per tool. For find_investors, recency of deals probably dominates. For find_talent, authority + warm-path. Get his first-pass intuitions on paper — those become the initial weights.
  4. Commit to the trace + feedback tables. Lift the content-engine schema. Robert builds the two endpoints (/agent/feedback, /agent/trace/:id) Friday morning.
  5. Install Snappy MCP for Mark. So he can iterate on the tool prompts + pre-filter Cypher himself without going through Robert.
  6. LSI dog food. Re-run the 2,000 medical-tech attendees through v2 once cluster stage is live. Compare recall/precision vs v1. This is the honest number that tells us if we moved.
The Demo

Every stage is visible. Every score is visible. Every call gets a trace ID.

orbiter-agent-demo.pages.dev — paste a query (or click one of the seeded examples), watch each stage light up in order. The demo exposes the bits that v1's response didn't: intent + confidence, cluster Cypher + node count, vector hits with raw cosine, ranker score breakdown, synthesized output. Thumbs-up/down buttons are live — they write to a local log today, swap to /agent/feedback the moment that endpoint exists.

Why the demo looks like this: cypher data becomes untrackable the moment you can't see every hop. If Mark can't eyeball which stage failed, we can't tune it. If the user can't tell us which result was wrong, we can't learn. Observability is the product for this phase.

XanoScript Gotchas Hit During v1

Saved for Mark + future Robert