AI Models & Consensus

Model Selection

We selected 6 AI models based on three criteria:

  1. Complementary strengths — each model excels at different things

  2. Availability via API — all accessible through OpenRouter

  3. Cost efficiency — balancing quality with price

The Six Models

| Model | Strength | Best For |
| --- | --- | --- |
| Claude 3.5 Sonnet | Nuanced reasoning | Complex analysis, edge cases |
| GPT-4o | Broad knowledge | Current events, mainstream topics |
| Gemini Pro | Fact-checking | Numbers, statistics, verification |
| Llama 3.1 70B | Unbiased | Controversial topics |
| DeepSeek Chat | Math/Logic | Probability calculations |
| Mistral Large | Balanced | Tie-breaker, European perspective |

Why These Six?

Claude tends to be the most thoughtful, catching nuances others miss. When a market question has hidden assumptions or edge cases, Claude usually finds them.

GPT-4 has the broadest training data. For markets about mainstream topics (elections, sports, celebrities), GPT-4 usually has the most relevant context.

Gemini is Google's model with access to current search data, which makes it good at fact-checking specific claims and numbers.

Llama is open-source and trained differently. It provides a contrasting perspective, which is especially useful for politically sensitive topics where commercial models might be cautious.

DeepSeek excels at mathematical reasoning. For markets involving probabilities, statistics, or logical deduction, DeepSeek often outperforms.

Mistral is trained in France and offers a slightly different cultural perspective. It serves as a tie-breaker when the other models are split.


The Prompt

All models receive the same standardized prompt. This ensures comparable outputs.

Prompt Structure

Why This Format?

Structured output makes parsing reliable. Free-form responses would require complex NLP to extract the action and confidence.

Requiring explicit factors and risks forces the model to justify its recommendation, which improves quality and enables dissent analysis.

Numerical confidence allows mathematical aggregation. "Very confident" can't be averaged; 85 can.


Response Parsing

AI models don't always follow instructions perfectly. The parser handles variations.

Common Variations

Models might respond with:

  • ACTION: BUY YES (missing underscore)

  • Action: **BUY_YES** (markdown formatting)

  • Recommendation: BUY_YES (wrong keyword)

The parser normalizes all of these:


Weighted Voting

Not all votes count equally. We weight by model strength and confidence.

The Formula

Example with 3 models:

| Model | Action | Confidence | Weight | Score |
| --- | --- | --- | --- | --- |
| Claude | BUY_YES | 75 | 1.5 | 112.5 |
| GPT-4 | BUY_YES | 80 | 1.2 | 96.0 |
| Llama | SKIP | 60 | 1.0 | 60.0 |

BUY_YES total: 112.5 + 96.0 = 208.5
SKIP total: 60.0

Winner: BUY_YES with 78% agreement (208.5 / 268.5)

Code Implementation


Dissent Detection

When models disagree, that's valuable information. It often indicates:

  • Ambiguous situation

  • Missing information

  • Edge case the majority missed

Flagging Dissent

How We Use Dissent

  1. Show to user: "5/6 models recommend BUY_YES, but Claude says SKIP because..."

  2. Reduce confidence: High dissent = lower calibrated confidence

  3. Log for analysis: Track which models tend to be contrarian (and whether they're right)


Confidence Calibration

The raw confidence the models report is often inflated, so we calibrate it based on cross-model agreement.

The Problem

AI models tend to report 70-90% confidence even when they shouldn't. A single model saying "85% confident" doesn't mean much.

The Solution

Multiply the raw confidence by a multiplier derived from the agreement ratio:

Example

  • Raw confidence: 80%

  • Agreement: 4/6 models agree (67%)

  • Multiplier: 0.8

  • Calibrated confidence: 64%

This tells the user: "Most models agree, but there's meaningful dissent."


Cost Analysis

Running 6 AI models per request costs money. Here's the breakdown:

Per-Request Costs (approximate)

| Model | Input (500 tokens) | Output (300 tokens) | Total |
| --- | --- | --- | --- |
| Claude 3.5 | $0.0015 | $0.0045 | $0.006 |
| GPT-4o | $0.00125 | $0.003 | $0.004 |
| Gemini Pro | $0.000625 | $0.0015 | $0.002 |
| Llama 70B | $0.0004 | $0.00024 | $0.0006 |
| DeepSeek | $0.00007 | $0.000084 | $0.0002 |
| Mistral Large | $0.001 | $0.0018 | $0.003 |

Total for 6 models: ~$0.016

Pricing Margins

| Tier | Price | Cost | Margin |
| --- | --- | --- | --- |
| Quick (1 AI) | $0.01 | $0.003 | 70% |
| Standard (3 AI) | $0.05 | $0.012 | 76% |
| Deep (6 AI) | $0.10 | $0.016 | 84% |

Higher tiers have better margins because fixed costs (API overhead, etc.) are spread across more value.


Performance Tracking

Over time, we track which models are most accurate to adjust weights.

What We Track

When markets resolve, we update the record and calculate accuracy per model.

Dynamic Weights

If a model consistently outperforms, its weight increases:
