IB-bench
Can Large Language Models Replace Investment Banking Analysts?
33 public tasks just launched!
5 models evaluated, more coming soon!
Leaderboard
5 models evaluated · 33 total tasks
| # | Model | Provider | Easy | Medium | Hard | Overall |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4-5-20251101 | Anthropic | 45.0 | 40.0 | 16.7 | 30.5 |
| 2 | gpt-5.2-chat | OpenAI | 32.5 | 10.0 | 0.0 | 10.0 |
| 3 | gpt-5.2-2025-12-11 | OpenAI | 37.5 | 5.0 | 0.0 | 9.2 |
| 4 | Mistral-Large-3 | Mistral AI | 17.5 | 10.0 | 0.0 | 7.0 |
| 5 | gpt-4o | OpenAI | 12.5 | 10.0 | 0.0 | 6.0 |
#1 Anthropic: claude-opus-4-5-20251101, overall 30.5
Easy 45.0 (20/20 tasks) · Medium 40.0 (10/10 tasks) · Hard 16.7 (3/3 tasks)
#2 OpenAI: gpt-5.2-chat, overall 10.0
Easy 32.5 (20/20 tasks) · Medium 10.0 (10/10 tasks) · Hard 0.0 (3/3 tasks)
#3 OpenAI: gpt-5.2-2025-12-11, overall 9.2
Easy 37.5 (20/20 tasks) · Medium 5.0 (10/10 tasks) · Hard 0.0 (3/3 tasks)
#4 Mistral AI: Mistral-Large-3, overall 7.0
Easy 17.5 (20/20 tasks) · Medium 10.0 (10/10 tasks) · Hard 0.0 (3/3 tasks)
#5 OpenAI: gpt-4o, overall 6.0
Easy 12.5 (20/20 tasks) · Medium 10.0 (10/10 tasks) · Hard 0.0 (3/3 tasks)
Results are preliminary: IB-bench is in active development and eval results may change.
Scoring: Overall score is weighted 20% Easy, 35% Medium, 45% Hard.
Difficulty levels: Easy (<1 hour), Medium (a few hours), Hard (>1 day), based on the time a human analyst would need.
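The scoring rule above can be sketched as a weighted average of the three per-difficulty scores. This is a minimal illustration, not IB-bench's actual scoring code: the helper name `overall_score` and the `leaderboard` dictionary are hypothetical, and it assumes the weights apply directly to the Easy, Medium, and Hard scores shown in the leaderboard.

```python
# Weights as stated on the page: 20% Easy, 35% Medium, 45% Hard.
WEIGHTS = {"easy": 0.20, "medium": 0.35, "hard": 0.45}


def overall_score(easy: float, medium: float, hard: float) -> float:
    """Weighted average of per-difficulty scores (hypothetical helper)."""
    return (
        WEIGHTS["easy"] * easy
        + WEIGHTS["medium"] * medium
        + WEIGHTS["hard"] * hard
    )


# Per-difficulty scores (Easy, Medium, Hard) copied from the leaderboard.
leaderboard = {
    "claude-opus-4-5-20251101": (45.0, 40.0, 16.7),
    "gpt-5.2-chat": (32.5, 10.0, 0.0),
    "gpt-5.2-2025-12-11": (37.5, 5.0, 0.0),
    "Mistral-Large-3": (17.5, 10.0, 0.0),
    "gpt-4o": (12.5, 10.0, 0.0),
}

for model, (easy, medium, hard) in leaderboard.items():
    # Print unrounded to two decimals; the page shows one decimal place.
    print(f"{model}: {overall_score(easy, medium, hard):.2f}")
```

For the top row, 0.20 × 45.0 + 0.35 × 40.0 + 0.45 × 16.7 = 30.515, which agrees with the listed 30.5 after rounding. Because Hard carries the largest weight, a model scoring 0.0 on Hard tasks can reach at most 55.0 overall.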