LLM Cascade
HYRE enriches every endpoint through a tiered multi-model cascade. Each endpoint
picks a tier based on what its output is — a cheap data listing, a heavier
analysis, or a genuine analytical decision. Within a tier, models are tried in
priority order: if one fails (timeout, rate limit, content policy), the next is
tried automatically. If all models fail, raw data is returned with
insight: null and HTTP 206 status.
Tiers
| Tier | First model | Typical latency | Used for |
|---|
fast | Gemini 2.5 Flash-Lite | ~1–1.5s | High-volume listings & formatting (new tokens, pools, prices, TVL) |
quality | Gemini 2.5 Flash | ~1.5–2.5s | Heavier analysis (Smart Money / Nansen) |
reasoning | OpenServ SERV | ~3.5–5s | Analytical decisions — verdicts, recommendations, cross-chain math |
Every tier shares the same fallback chain, so a failure in the primary model
degrades gracefully to the next provider rather than failing the request.
Reasoning Tier — OpenServ SERV
Decision endpoints use OpenServ SERV Reasoning
as their primary model. SERV is an OpenAI-compatible gateway that runs a bounded
reasoning pass (BRAID) over an underlying model before answering, which improves
multi-factor judgments and arithmetic-heavy comparisons.
| Setting | Value |
|---|
| Endpoint | https://inference-api.openserv.ai/v1/chat/completions |
| Model | gemini-flash-latest |
reasoning_effort | low |
| Timeout | 12s |
max_tokens | 2500 |
Reasoning tokens share the completion budget, so max_tokens is raised to 2500
on this tier (vs 800 on fast/quality) to avoid truncating the answer. The
12s timeout gives the reasoning pass headroom above the 8s used elsewhere.
Endpoints on the reasoning tier:
| Endpoint | Decision |
|---|
POST /ask | Natural-language DeFi reasoning |
GET /trenches/token/{mint}/verdict | snipe / watch / avoid |
GET /debridge/quote | execute / wait / avoid |
GET /debridge/yield-migrate | migrate / stay / wait (break-even math) |
GET /lp/meteora/pools/recommend | Pool recommendation |
GET /lp/meteora/pools/strategy | LP strategy |
GET /lp/positions/{id}/rebalance | rebalance / hold |
If SERV_API_KEY is not configured, these endpoints fall back to the Gemini
cascade automatically — no request fails.
Cascade Order
The fallback chain, in order, after the tier’s primary model:
| Priority | Provider / Model | Timeout | Notes |
|---|
| 1 | SERV (reasoning tier only) | 12s | Bounded reasoning over Gemini Flash. |
| 2 | Gemini 2.5 Flash-Lite | 8s | Primary for fast. Cheap + fast JSON. |
| 3 | Gemini 2.5 Flash | 8s | Primary for quality. Stronger analysis. |
| 4 | OpenRouter cascade | 8–10s | Cross-provider resilience (DeepSeek → GLM → Claude Haiku). |
| 5 | Venice AI | 8–10s | Last-resort fallback (DeepSeek V3.2, GLM Flash). |
fast starts at Flash-Lite then escalates to Flash; quality and reasoning
fall back through Flash first. All tiers end with OpenRouter then Venice.
Chat Agent (Playground)
The Playground chat agent uses a separate model:
| Model | Provider | Use Case |
|---|
| Gemini 2.5 Flash-Lite | Google AI | Conversation flow, tool selection, response summarization |
Failure Modes
| Failure | Behavior |
|---|
| Timeout (>8–12s) | Abort and try next model |
| HTTP 429 (rate limit) | Skip and try next model |
| HTTP 5xx (server error) | Skip and try next model |
| Content policy block | Skip and try next model |
| Empty response | Skip and try next model |
| All models fail | Return raw data, insight: null, HTTP 206 |
HTTP 206 (Partial Content) indicates the data was fetched successfully but the
LLM enrichment failed. The data field contains the full upstream data. The
signal field falls back to neutral with confidence: 0.
LLM Call Configuration
Every LLM call uses these parameters:
{
"temperature": 0.3,
"max_tokens": 800,
"response_format": { "type": "json_object" }
}
- Low temperature (0.3) — Prioritizes consistent, factual output over creative variation.
- JSON mode — Forces the model to return valid JSON, parsed into the response envelope.
- 800 token limit — Keeps insights concise (1–2 sentences) and response times fast.
On the reasoning tier, max_tokens is raised to 2500 and
reasoning_effort: "low" is added, since reasoning tokens share the completion
budget.
System Prompts
Each endpoint segment has a dedicated system prompt that instructs the LLM:
| Segment | Signal Vocabulary | Prompt Focus |
|---|
| Trenches | snipe, watch, avoid | Token risk assessment, sniper detection, dev behavior |
| Traders | follow, ignore | Wallet profitability, copy-trade worthiness |
| LPs | add_liquidity, rebalance, hold, remove | Pool APR sustainability, IL risk, range optimization |
| DeFi | high_yield, medium_yield, low_yield, risky | TVL trends, yield opportunity assessment |
| deBridge | execute, wait, avoid (quote) / migrate, stay, wait (yield) | Bridge cost efficiency, cross-chain yield comparison |
| Nansen | follow, ignore, accumulate, distribute | Smart money flow interpretation, wallet classification |
The LLM returns JSON matching this structure:
{
"insight": "Solana TVL surged 12% this week, driven by...",
"signal": "high_yield",
"confidence": 0.87
}
The enrich() function merges this with the raw data. model_used reflects the
model that actually answered — e.g. serv/google/gemini-3.5-flash on the
reasoning tier, or gemini-2.5-flash-lite on the fast tier:
{
"data": { ... },
"insight": "Solana TVL surged 12% this week, driven by...",
"signal": "high_yield",
"confidence": 0.87,
"sources": ["defillama"],
"model_used": "gemini-2.5-flash-lite",
"latency_ms": 342,
"timestamp": "2026-06-10T10:30:00.000Z"
}