IB-bench
Can Large Language Models Replace Investment Banking Analysts?
Check out how Claude Opus 4.5 and GPT-5.2 performed on 18 out of 36 public tasks below!
Leaderboard
6 models evaluated · 36 total tasks
| # | Model | Provider | Easy | Medium | Hard | Overall |
|---|---|---|---|---|---|---|
| 1 | claude-opus-4-5-20251101 | Anthropic | 85.0 (17/20) | 70.0 (7/10) | 66.7 (4/6) | 72.5 |
| 2 | gpt-5.2-2025-12-11 | OpenAI | 80.0 (16/20) | 65.0 (6/10) | 58.3 (3/6) | 68.3 |
| 3 | gemini-2.5-pro | Google | 75.0 (15/20) | 55.0 (5/10) | 50.0 (3/6) | 61.2 |
| 4 | claude-sonnet-4-20251101 | Anthropic | 70.0 (14/20) | 50.0 (5/10) | 41.7 (2/6) | 55.8 |
| 5 | gpt-4o-2024-11-20 | OpenAI | 65.0 (13/20) | 40.0 (4/10) | 33.3 (2/6) | 48.5 |
| 6 | gemini-2.0-flash | Google | 60.0 (12/20) | 35.0 (3/10) | 25.0 (1/6) | 42.1 |

Scoring: Overall score is weighted 20% Easy, 35% Medium, 45% Hard.
Difficulty levels: Easy (<1 hour), Medium (a few hours), Hard (>1 day), based on the time a human analyst would need.
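
The weighting described under Scoring can be sketched as a plain weighted average. This is a minimal illustration, not the benchmark's own scoring code: the function name `overall_score`, the tier keys, and the example inputs are hypothetical, and the leaderboard may apply partial credit or rounding that is not described here, so the sketch is not guaranteed to reproduce the published Overall values.

```python
# Tier weights as stated in the scoring note (20% Easy, 35% Medium, 45% Hard).
# Everything else in this sketch is an illustrative assumption.
WEIGHTS = {"easy": 0.20, "medium": 0.35, "hard": 0.45}


def overall_score(tier_scores: dict[str, float]) -> float:
    """Return the weighted average of per-tier scores, each on a 0-100 scale."""
    missing = WEIGHTS.keys() - tier_scores.keys()
    if missing:
        raise ValueError(f"missing tiers: {sorted(missing)}")
    return sum(WEIGHTS[tier] * tier_scores[tier] for tier in WEIGHTS)


# Hypothetical per-tier scores, not taken from the leaderboard.
example = {"easy": 90.0, "medium": 60.0, "hard": 40.0}
print(round(overall_score(example), 1))  # 0.20*90 + 0.35*60 + 0.45*40 = 57.0
```

Because Hard carries the largest weight, a model that does well only on Easy tasks is pulled down sharply: in the example above, a 90 on Easy still yields an overall score of 57.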