IB-bench

Can Large Language Models Replace Investment Banking Analysts?

33 public tasks just launched!
5 models evaluated, more coming soon!

Leaderboard

5 models evaluated · 33 total tasks

#	Model	Provider	Easy	Medium	Hard	Overall
1	claude-opus-4-5-20251101	Anthropic	45.0	40.0	16.7	30.5
2	gpt-5.2-chat	OpenAI	32.5	10.0	0.0	10.0
3	gpt-5.2-2025-12-11	OpenAI	37.5	5.0	0.0	9.2
4	Mistral-Large-3	Mistral AI	17.5	10.0	0.0	7.0
5	gpt-4o	OpenAI	12.5	10.0	0.0	6.0

#1: claude-opus-4-5-20251101 (Anthropic), overall score 30.5

Results are preliminary: IB-bench is in active development and eval results may change.

Scoring: Overall score is weighted 20% Easy, 35% Medium, 45% Hard.
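The weighting above can be checked against the leaderboard with a few lines of Python. This is a minimal sketch, not the benchmark's own scoring code: it assumes the three per-tier numbers in each leaderboard row are the Easy, Medium, and Hard scores in that order, and the names `WEIGHTS` and `overall_score` are hypothetical. How the published table rounds its values is not stated, so results may differ from it in the last digit.

```python
# Tier weights as stated in the scoring note: 20% Easy, 35% Medium, 45% Hard.
WEIGHTS = {"easy": 0.20, "medium": 0.35, "hard": 0.45}


def overall_score(easy: float, medium: float, hard: float) -> float:
    """Return the weighted overall score from the three per-tier scores."""
    return (
        WEIGHTS["easy"] * easy
        + WEIGHTS["medium"] * medium
        + WEIGHTS["hard"] * hard
    )


if __name__ == "__main__":
    # Per-tier scores copied from the leaderboard rows
    # (column order Easy, Medium, Hard is an assumption).
    rows = {
        "claude-opus-4-5-20251101": (45.0, 40.0, 16.7),
        "gpt-5.2-chat": (32.5, 10.0, 0.0),
        "gpt-5.2-2025-12-11": (37.5, 5.0, 0.0),
        "Mistral-Large-3": (17.5, 10.0, 0.0),
        "gpt-4o": (12.5, 10.0, 0.0),
    }
    for model, (easy, medium, hard) in rows.items():
        print(f"{model}: {overall_score(easy, medium, hard):.1f}")
```

Because Hard tasks carry 45% of the weight, a model that scores 0.0 on Hard cannot exceed an overall score of 55 no matter how well it does on the other two tiers.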

Difficulty levels: Easy (<1 hour), Medium (a few hours), Hard (>1 day), based on the time a human analyst would need to complete the task.