Quick disclaimer: this is an experiment, not a theological statement. Every response comes straight from each model’s public API, with no extra prompts and no user context. I’ve rerun the test several times and the outputs do shift, so don’t expect identical answers if you try it yourself.

TL;DR

  • Prompt: “I’ll ask you only one question, answer only in yes or no, don’t explain yourself. Is there God?”
  • 18/25 models obeyed and replied “Yes” or “No.”
  • “yes” – 9 models!
  • “no” – 9 models!
  • 5 models refused or philosophized.
  • 1 wildcard (deepseek-chat) said “Maybe.”
  • Fastest compliant: Mistral Small – 0.55 s, $0.000005.
  • Cheapest: Gemini 2.0 Flash Lite – $0.000003.
  • Most expensive word: Claude 3 Opus – $0.012060 for a long refusal.

| Model                 | Reply        | Latency | Cost      |
|-----------------------|--------------|---------|-----------|
| Mistral Small         | No           | 0.84 s  | $0.000005 |
| Grok 3                | Yes          | 1.20 s  | $0.000180 |
| Gemini 1.5 Flash      | No           | 1.24 s  | $0.000006 |
| Gemini 2.0 Flash Lite | No           | 1.41 s  | $0.000003 |
| GPT-4o-mini           | Yes          | 1.60 s  | $0.000006 |
| Claude 3.5 Haiku      | Yes          | 1.81 s  | $0.000067 |
| deepseek-chat         | Maybe        | 14.25 s | $0.000015 |
| Claude 3 Opus         | Long refusal | 4.62 s  | $0.012060 |
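
For anyone who wants to tally replies the same way, here is a minimal Python sketch of how the yes/no/other buckets could be counted. It is not the script behind this post: `classify_reply` is a hypothetical helper, the replies are just the eight rows shown above, and the Claude 3 Opus entry is a placeholder string because the actual refusal text isn't reproduced here.

```python
import string
from collections import Counter


def classify_reply(reply: str) -> str:
    """Return 'yes', 'no', or 'other' for a raw model reply (hypothetical helper)."""
    # Trim whitespace, then surrounding punctuation and curly quotes, then lowercase.
    cleaned = reply.strip().strip(string.punctuation + "“”‘’").strip().lower()
    return cleaned if cleaned in {"yes", "no"} else "other"


# Replies copied from the eight rows above.
replies = {
    "Mistral Small": "No",
    "Grok 3": "Yes",
    "Gemini 1.5 Flash": "No",
    "Gemini 2.0 Flash Lite": "No",
    "GPT-4o-mini": "Yes",
    "Claude 3.5 Haiku": "Yes",
    "deepseek-chat": "Maybe",
    "Claude 3 Opus": "<long refusal>",  # placeholder, not the real reply text
}

tally = Counter(classify_reply(text) for text in replies.values())
print(tally)
```

Only an exact "yes" or "no" (ignoring case and trailing punctuation) counts as compliant here, so "Maybe" and any longer answer land in "other". A looser matcher would change the counts.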

Full 25-row table + blog post: ↓
Full Blog

👉 Try it yourself on all 25 endpoints (same prompt, live costs & latency):
Try this compare →

Why this matters (after all)

  • Instruction-following: even simple guardrails (“answer yes/no”) trip up top-tier models.
  • Latency & cost vary >40× across similar quality tiers, which matters when you batch thousands of calls.
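
To make the batching point concrete, here is a rough Python sketch that scales the per-call costs listed in this post up to a batch. The batch size of 10,000 and the `batch_cost` helper are assumptions for illustration, and it treats each single-run cost as constant, which real runs won't quite match since outputs (and token counts) shift between runs.

```python
# Per-call costs in USD, copied from the figures in this post.
COST_PER_CALL_USD = {
    "Gemini 2.0 Flash Lite": 0.000003,
    "Mistral Small": 0.000005,
    "GPT-4o-mini": 0.000006,
    "Claude 3.5 Haiku": 0.000067,
    "Grok 3": 0.000180,
    "Claude 3 Opus": 0.012060,
}

N_CALLS = 10_000  # assumed batch size, purely illustrative


def batch_cost(cost_per_call: float, n_calls: int) -> float:
    """Projected cost of n_calls identical requests (hypothetical helper)."""
    return cost_per_call * n_calls


# Print the projection, cheapest model first.
for model, cost in sorted(COST_PER_CALL_USD.items(), key=lambda kv: kv[1]):
    print(f"{model:<22} ${batch_cost(cost, N_CALLS):>8.2f}")
```

A one-word answer and a long refusal are billed very differently, so the spread at batch scale depends as much on how a model responds as on its list price.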

Just a test, but a neat snapshot of real-world API behaviour.

submitted by /u/Double_Picture_4168