Introducing IB-Bench: A Real-World Benchmark for AI in Investment Banking
· Danial A.
Today we’re publicly launching IB-Bench, a benchmark designed to answer a simple question: can large language models actually do investment banking work?
The Problem with Existing Benchmarks
Current AI benchmarks fall into two categories, neither of which helps finance practitioners:
Generic reasoning benchmarks like MMLU or ARC test broad knowledge. A model that scores well might know what EBITDA stands for, but that doesn’t mean it can calculate it from a 10-K.
Finance-specific benchmarks like FinQA or ConvFinQA focus on reading comprehension and basic math. They test whether a model can extract a number from a table—not whether it can build the table.
Neither approach tells you if an LLM can replace or augment an actual analyst.
What Makes IB-Bench Different
We built IB-Bench from the ground up with one principle: test real work.
Every task in IB-Bench comes from actual analyst workflows. We worked with industry professionals to identify the tasks that consume analyst time:
- Parsing and extracting data from SEC filings
- Building and debugging financial models
- Creating data rooms and due diligence checklists
- Synthesizing information across multiple documents
Then we created benchmark tasks that mirror these workflows exactly.
Our Initial Task Set
The current benchmark includes 36 tasks across three difficulty levels:
- 12 Easy tasks: Basic calculations, simple extractions, formatting
- 12 Medium tasks: Multi-step analysis, model modifications, document synthesis
- 12 Hard tasks: Full model builds, complex parsing, multi-document reasoning
Each task includes real (or realistic synthetic) materials: actual 10-Ks, professional financial models, and industry-standard templates.
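To make the structure concrete, here is a hypothetical sketch of how a single benchmark task might be specified. The field names and file paths below are illustrative assumptions, not the actual IB-Bench schema; see the GitHub repository for the real task definitions.

```python
# Hypothetical task specification; field names and paths are assumptions
# for illustration, not the actual IB-Bench format.
task = {
    "id": "medium-07",
    "difficulty": "medium",            # one of: easy | medium | hard
    "prompt": (
        "Update the revenue build in the attached model to reflect "
        "the segment disclosures in the company's latest 10-K."
    ),
    "materials": [                     # real or realistic synthetic inputs
        "inputs/10k_fy2023.pdf",
        "inputs/operating_model.xlsx",
    ],
    "rubric": "rubrics/medium-07.md",  # scoring criteria used by graders
}
```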
How Scoring Works
Each task is graded against its scoring rubric, and that score maps to a three-tier credit system:
- Full credit (1 point): rubric score of 80% or higher
- Partial credit (0.5 points): rubric score between 50% and 79%
- No credit (0 points): rubric score below 50%
Overall scores are weighted by difficulty: easy tasks count for 20%, medium for 35%, and hard for 45%. This ensures that models are rewarded for tackling complex work, not just nailing the basics.
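The sketch below shows the scoring scheme as described above: per-task rubric scores map to credit points, and the difficulty buckets are combined with the 20/35/45 weights. Function and variable names are illustrative, not taken from the IB-Bench codebase.

```python
# Minimal sketch of the tiered-credit and difficulty-weighted scoring
# described above. Names are illustrative, not the IB-Bench implementation.

def credit(rubric_score_pct: float) -> float:
    """Map a task's rubric score (0-100%) to credit points."""
    if rubric_score_pct >= 80:
        return 1.0   # full credit
    if rubric_score_pct >= 50:
        return 0.5   # partial credit
    return 0.0       # no credit

# Difficulty weights: 20% easy, 35% medium, 45% hard.
WEIGHTS = {"easy": 0.20, "medium": 0.35, "hard": 0.45}

def overall_score(results: dict) -> float:
    """results maps difficulty -> list of per-task rubric scores (in %)."""
    total = 0.0
    for difficulty, scores in results.items():
        avg_credit = sum(credit(s) for s in scores) / len(scores)
        total += WEIGHTS[difficulty] * avg_credit
    return total  # ranges from 0.0 (no credit) to 1.0 (full credit everywhere)

# Example: a model that aces easy tasks but struggles on hard ones.
print(overall_score({
    "easy":   [95, 88, 91, 84],
    "medium": [76, 55, 82, 40],
    "hard":   [45, 30, 62, 20],
}))  # 0.43125
```

One consequence of this weighting is that a model cannot reach a strong overall score on routine work alone: even perfect performance on every easy and medium task caps the score at 0.55.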
What We’ve Learned So Far
Our initial benchmark runs reveal that frontier models are better at investment banking tasks than most people expect—and worse than the hype suggests.
The good news: modern LLMs can reliably handle many routine tasks. Data extraction, basic calculations, and simple model modifications are largely solved.
The challenge: complex, multi-step tasks remain difficult. Building a complete LBO model, synthesizing a 100-page CIM, or debugging a circular reference in a merger model—these tasks still require human oversight.
Open Source and Transparent
IB-Bench is fully open source. You can view all tasks, prompts, and scoring rubrics on GitHub. We believe transparency is essential for benchmark credibility.
We’re also actively seeking contributions. If you have tasks that would make good additions, or if you want to see your model benchmarked, reach out.
Get Started
Visit the leaderboard to see current model rankings. Check out the methodology page to understand how we evaluate. And follow us on X for updates as we add new tasks and benchmark new models.
The future of AI in finance is being written now. IB-Bench aims to measure it honestly.