Skip to main content

LLM Cascade

HYRE enriches every endpoint through a tiered multi-model cascade. Each endpoint picks a tier based on what its output is — a cheap data listing, a heavier analysis, or a genuine analytical decision. Within a tier, models are tried in priority order: if one fails (timeout, rate limit, content policy), the next is tried automatically. If all models fail, raw data is returned with insight: null and HTTP 206 status.

Tiers

TierFirst modelTypical latencyUsed for
fastGemini 2.5 Flash-Lite~1–1.5sHigh-volume listings & formatting (new tokens, pools, prices, TVL)
qualityGemini 2.5 Flash~1.5–2.5sHeavier analysis (Smart Money / Nansen)
reasoningOpenServ SERV~3.5–5sAnalytical decisions — verdicts, recommendations, cross-chain math
Every tier shares the same fallback chain, so a failure in the primary model degrades gracefully to the next provider rather than failing the request.

Reasoning Tier — OpenServ SERV

Decision endpoints use OpenServ SERV Reasoning as their primary model. SERV is an OpenAI-compatible gateway that runs a bounded reasoning pass (BRAID) over an underlying model before answering, which improves multi-factor judgments and arithmetic-heavy comparisons.
SettingValue
Endpointhttps://inference-api.openserv.ai/v1/chat/completions
Modelgemini-flash-latest
reasoning_effortlow
Timeout12s
max_tokens2500
Reasoning tokens share the completion budget, so max_tokens is raised to 2500 on this tier (vs 800 on fast/quality) to avoid truncating the answer. The 12s timeout gives the reasoning pass headroom above the 8s used elsewhere.
Endpoints on the reasoning tier:
EndpointDecision
POST /askNatural-language DeFi reasoning
GET /trenches/token/{mint}/verdictsnipe / watch / avoid
GET /debridge/quoteexecute / wait / avoid
GET /debridge/yield-migratemigrate / stay / wait (break-even math)
GET /lp/meteora/pools/recommendPool recommendation
GET /lp/meteora/pools/strategyLP strategy
GET /lp/positions/{id}/rebalancerebalance / hold
If SERV_API_KEY is not configured, these endpoints fall back to the Gemini cascade automatically — no request fails.

Cascade Order

The fallback chain, in order, after the tier’s primary model:
PriorityProvider / ModelTimeoutNotes
1SERV (reasoning tier only)12sBounded reasoning over Gemini Flash.
2Gemini 2.5 Flash-Lite8sPrimary for fast. Cheap + fast JSON.
3Gemini 2.5 Flash8sPrimary for quality. Stronger analysis.
4OpenRouter cascade8–10sCross-provider resilience (DeepSeek → GLM → Claude Haiku).
5Venice AI8–10sLast-resort fallback (DeepSeek V3.2, GLM Flash).
fast starts at Flash-Lite then escalates to Flash; quality and reasoning fall back through Flash first. All tiers end with OpenRouter then Venice.

Chat Agent (Playground)

The Playground chat agent uses a separate model:
ModelProviderUse Case
Gemini 2.5 Flash-LiteGoogle AIConversation flow, tool selection, response summarization

Failure Modes

FailureBehavior
Timeout (>8–12s)Abort and try next model
HTTP 429 (rate limit)Skip and try next model
HTTP 5xx (server error)Skip and try next model
Content policy blockSkip and try next model
Empty responseSkip and try next model
All models failReturn raw data, insight: null, HTTP 206
HTTP 206 (Partial Content) indicates the data was fetched successfully but the LLM enrichment failed. The data field contains the full upstream data. The signal field falls back to neutral with confidence: 0.

LLM Call Configuration

Every LLM call uses these parameters:
{
  "temperature": 0.3,
  "max_tokens": 800,
  "response_format": { "type": "json_object" }
}
  • Low temperature (0.3) — Prioritizes consistent, factual output over creative variation.
  • JSON mode — Forces the model to return valid JSON, parsed into the response envelope.
  • 800 token limit — Keeps insights concise (1–2 sentences) and response times fast.
On the reasoning tier, max_tokens is raised to 2500 and reasoning_effort: "low" is added, since reasoning tokens share the completion budget.

System Prompts

Each endpoint segment has a dedicated system prompt that instructs the LLM:
SegmentSignal VocabularyPrompt Focus
Trenchessnipe, watch, avoidToken risk assessment, sniper detection, dev behavior
Tradersfollow, ignoreWallet profitability, copy-trade worthiness
LPsadd_liquidity, rebalance, hold, removePool APR sustainability, IL risk, range optimization
DeFihigh_yield, medium_yield, low_yield, riskyTVL trends, yield opportunity assessment
deBridgeexecute, wait, avoid (quote) / migrate, stay, wait (yield)Bridge cost efficiency, cross-chain yield comparison
Nansenfollow, ignore, accumulate, distributeSmart money flow interpretation, wallet classification

Response Format

The LLM returns JSON matching this structure:
{
  "insight": "Solana TVL surged 12% this week, driven by...",
  "signal": "high_yield",
  "confidence": 0.87
}
The enrich() function merges this with the raw data. model_used reflects the model that actually answered — e.g. serv/google/gemini-3.5-flash on the reasoning tier, or gemini-2.5-flash-lite on the fast tier:
{
  "data": { ... },
  "insight": "Solana TVL surged 12% this week, driven by...",
  "signal": "high_yield",
  "confidence": 0.87,
  "sources": ["defillama"],
  "model_used": "gemini-2.5-flash-lite",
  "latency_ms": 342,
  "timestamp": "2026-06-10T10:30:00.000Z"
}