AI Models & Consensus
Model Selection
We selected 6 AI models based on three criteria:
Complementary strengths — each model excels at different things
Availability via API — all accessible through OpenRouter
Cost efficiency — balancing quality with price
The Six Models
| Model | Strength | Best For |
| --- | --- | --- |
| Claude 3.5 Sonnet | Nuanced reasoning | Complex analysis, edge cases |
| GPT-4o | Broad knowledge | Current events, mainstream topics |
| Gemini Pro | Fact-checking | Numbers, statistics, verification |
| Llama 3.1 70B | Unbiased | Controversial topics |
| DeepSeek Chat | Math/Logic | Probability calculations |
| Mistral Large | Balanced | Tie-breaker, European perspective |
Why These Six?
Claude tends to be the most thoughtful, catching nuances others miss. When a market question has hidden assumptions or edge cases, Claude usually finds them.
GPT-4o has the broadest training data. For markets about mainstream topics (elections, sports, celebrities), it usually has the most relevant context.
Gemini is Google's model with access to current search data. Good at fact-checking specific claims and numbers.
Llama is open-source and trained differently. Provides a contrasting perspective, especially useful for politically sensitive topics where commercial models might be cautious.
DeepSeek excels at mathematical reasoning. For markets involving probabilities, statistics, or logical deduction, DeepSeek often outperforms.
Mistral, trained in France, offers a slightly different cultural perspective. It works well as a tie-breaker when the other models are split.
The Prompt
All models receive the same standardized prompt. This ensures comparable outputs.
Prompt Structure
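The exact wording evolves over time, but the prompt follows roughly this shape. The placeholders and the BUY_NO action below are illustrative assumptions; the ACTION/CONFIDENCE/FACTORS/RISKS fields mirror the parsed output described in the next section.

```text
You are analyzing a prediction market.

Market question: {market_question}
Current YES price: {current_price}

Respond in exactly this format:
ACTION: BUY_YES | BUY_NO | SKIP
CONFIDENCE: <number from 0 to 100>
FACTORS:
- <key factors supporting the action>
RISKS:
- <what could make this recommendation wrong>
```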
Why This Format?
Structured output makes parsing reliable. Free-form responses would require complex NLP to extract the action and confidence.
Requiring explicit factors and risks forces the model to justify its recommendation, which improves quality and enables dissent analysis.
Numerical confidence allows mathematical aggregation. "Very confident" can't be averaged; 85 can.
Response Parsing
AI models don't always follow instructions perfectly. The parser handles variations.
Common Variations
Models might respond with:
- `ACTION: BUY YES` (missing underscore)
- `Action: **BUY_YES**` (markdown formatting)
- `Recommendation: BUY_YES` (wrong keyword)
The parser normalizes all of these:
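Here is a minimal sketch of that normalization; the function name, accepted keywords, and the BUY_NO action are assumptions rather than the exact production parser.

```python
import re

ACTION_KEYWORDS = ("ACTION", "RECOMMENDATION")   # accepted line prefixes
KNOWN_ACTIONS = ("BUY_YES", "BUY_NO", "SKIP")    # BUY_NO is assumed here

def parse_action(response: str) -> str | None:
    """Pull a normalized action out of a messy model response."""
    for line in response.splitlines():
        # Strip markdown emphasis and backticks, then split on the first colon.
        clean = re.sub(r"[*_`]", "", line).strip()
        key, _, value = clean.partition(":")
        if key.strip().upper() not in ACTION_KEYWORDS:
            continue
        # Normalize "BUY YES" -> "BUY_YES" and lowercase -> uppercase.
        action = value.strip().upper().replace(" ", "_")
        if action in KNOWN_ACTIONS:
            return action
    return None  # caller can treat an unparseable vote as an abstention
```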
Weighted Voting
Not all votes count equally. We weight by model strength and confidence.
The Formula
Each model's vote is scored as confidence × model weight; the action with the highest total score wins, and agreement is the winning total divided by the sum of all scores. Example with 3 models:
| Model | Vote | Confidence | Weight | Weighted Score |
| --- | --- | --- | --- | --- |
| Claude | BUY_YES | 75 | 1.5 | 112.5 |
| GPT-4 | BUY_YES | 80 | 1.2 | 96.0 |
| Llama | SKIP | 60 | 1.0 | 60.0 |
BUY_YES total: 112.5 + 96.0 = 208.5
SKIP total: 60.0
Winner: BUY_YES with 78% agreement (208.5 / 268.5)
Code Implementation
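A sketch of the aggregation described above; the vote structure and weight table are illustrative (the weights match the worked example).

```python
from collections import defaultdict

# Base weights per model; these values match the worked example above.
MODEL_WEIGHTS = {"claude": 1.5, "gpt-4": 1.2, "llama": 1.0}

def weighted_vote(votes: list[dict]) -> tuple[str, float]:
    """votes: [{"model": ..., "action": ..., "confidence": 0-100}, ...]
    Returns (winning_action, agreement_ratio)."""
    totals: dict[str, float] = defaultdict(float)
    for vote in votes:
        weight = MODEL_WEIGHTS.get(vote["model"], 1.0)
        totals[vote["action"]] += vote["confidence"] * weight
    winner = max(totals, key=totals.get)
    agreement = totals[winner] / sum(totals.values())
    return winner, agreement
```

With the three votes from the table above, this returns ("BUY_YES", 0.776…), i.e. 78% agreement.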
Dissent Detection
When models disagree, that's valuable information. It often indicates:
Ambiguous situation
Missing information
Edge case the majority missed
Flagging Dissent
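One way to flag dissenters, assuming each parsed vote keeps the model's stated risks (the field names are illustrative):

```python
def flag_dissent(votes: list[dict], winning_action: str) -> list[dict]:
    """Return the minority votes so their reasoning can be surfaced to the user."""
    return [
        {
            "model": v["model"],
            "action": v["action"],
            "confidence": v["confidence"],
            "reason": v.get("risks", []),  # taken straight from the parsed response
        }
        for v in votes
        if v["action"] != winning_action
    ]
```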
How We Use Dissent
Show to user: "5/6 models recommend BUY_YES, but Claude says SKIP because..."
Reduce confidence: High dissent = lower calibrated confidence
Log for analysis: Track which models tend to be contrarian (and whether they're right)
Confidence Calibration
The raw confidence reported by models is often inflated. We calibrate it based on agreement.
The Problem
AI models tend to report 70-90% confidence even when they shouldn't. A single model saying "85% confident" doesn't mean much.
The Solution
Scale the raw confidence by a multiplier derived from how strongly the models agree.
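A minimal sketch of that calibration; only the 4-of-6 → 0.8 point comes from the example below, and the rest of the mapping is an assumption.

```python
# Illustrative mapping from "N of 6 models agree" to a confidence multiplier.
# Only the 4 -> 0.8 entry is taken from the worked example; the others are placeholders.
AGREEMENT_MULTIPLIER = {6: 1.0, 5: 0.9, 4: 0.8, 3: 0.6, 2: 0.4}

def calibrate_confidence(raw_confidence: float, models_agreeing: int) -> float:
    """Scale a model-reported confidence (0-100) by the level of agreement."""
    multiplier = AGREEMENT_MULTIPLIER.get(models_agreeing, 0.3)
    return raw_confidence * multiplier

# calibrate_confidence(80, 4) -> 64.0, matching the example below.
```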
Example
Raw confidence: 80%
Agreement: 4/6 models agree (67%)
Multiplier: 0.8
Calibrated confidence: 64%
This tells the user: "Most models agree, but there's meaningful dissent."
Cost Analysis
Running 6 AI models per request costs money. Here's the breakdown:
Per-Request Costs (approximate)
| Model | Input Cost | Output Cost | Total (approx.) |
| --- | --- | --- | --- |
| Claude 3.5 | $0.0015 | $0.0045 | $0.006 |
| GPT-4o | $0.00125 | $0.003 | $0.004 |
| Gemini Pro | $0.000625 | $0.0015 | $0.002 |
| Llama 70B | $0.0004 | $0.00024 | $0.0006 |
| DeepSeek | $0.00007 | $0.000084 | $0.0002 |
| Mistral Large | $0.001 | $0.0018 | $0.003 |
Total for 6 models: ~$0.016
Pricing Margins
| Tier | Price | AI Cost | Margin |
| --- | --- | --- | --- |
| Quick (1 AI) | $0.01 | $0.003 | 70% |
| Standard (3 AI) | $0.05 | $0.012 | 76% |
| Deep (6 AI) | $0.10 | $0.016 | 84% |
Higher tiers have better margins because fixed costs (API overhead, etc.) are spread across more value.
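The margins above are simply (price - AI cost) / price; a quick check with the table values (the helper name is illustrative):

```python
def margin(price: float, ai_cost: float) -> float:
    """Gross margin as a fraction of the price charged."""
    return (price - ai_cost) / price

# Values from the pricing table above:
assert round(margin(0.01, 0.003), 2) == 0.70   # Quick
assert round(margin(0.05, 0.012), 2) == 0.76   # Standard
assert round(margin(0.10, 0.016), 2) == 0.84   # Deep
```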
Performance Tracking
Over time, we track which models are most accurate to adjust weights.
What We Track
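The stored record isn't shown here; a plausible minimal shape (every field name is an assumption):

```python
from dataclasses import dataclass

@dataclass
class PredictionRecord:
    """One model's vote on one market, kept until the market resolves."""
    market_id: str
    model: str                        # e.g. "claude-3.5-sonnet"
    action: str                       # BUY_YES / SKIP / ...
    confidence: float                 # raw confidence, 0-100
    resolved: bool = False
    was_correct: bool | None = None   # filled in once the market resolves
```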
When markets resolve, we update the record and calculate accuracy per model.
Dynamic Weights
If a model consistently outperforms, its weight increases:
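A sketch of one possible update rule; the scaling and bounds are illustrative, not the exact production values.

```python
def updated_weight(base_weight: float, accuracy: float, baseline: float = 0.5) -> float:
    """Nudge a model's weight up when its resolved-market accuracy beats the baseline,
    down when it falls short, clamped so no single model dominates the vote."""
    adjustment = 1.0 + (accuracy - baseline)   # e.g. 60% accuracy -> 1.1x weight
    return max(0.5, min(2.0, base_weight * adjustment))
```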