Methodology
IB-Bench evaluates LLMs on real-world investment banking tasks.
Overview
Most finance benchmarks test financial concepts and CFA-style trivia, which is useful for measuring knowledge but not reflective of real investment banking work. IB-Bench tests the tasks a junior analyst would actually encounter, such as parsing 10-Ks, fixing and building LBO models, and synthesizing equity research.
Task materials are derived from real work created by industry professionals or from synthetic data generated under the supervision of IB practitioners. The benchmark comprises 36 tasks across three difficulty levels, weighted to emphasize harder, more complex work.
Tasks by Difficulty
Explore the dataset by difficulty tier. Drill down into individual tasks for performance analysis, or inspect the source code on GitHub.
- Easy: tasks an analyst would require less than 1 hour to complete.
- Medium: tasks an analyst would require a few hours, but less than a day, to complete.
- Hard: tasks an analyst would require more than 1 day to complete.
Scoring
Each task is scored 0-100 against verified ground truth, by an LLM judge, or by a hybrid of the two. Credit is awarded based on the score:
Difficulty Score
Easy = (credits earned / total credits possible) × 100
The same formula applies to Medium and Hard.
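As a concrete illustration, here is a minimal Python sketch of the per-tier calculation; the function name and example numbers are hypothetical and not taken from the IB-Bench codebase:

```python
def difficulty_score(credits_earned: float, total_credits: float) -> float:
    """Per-tier score: share of credits earned, scaled to 0-100."""
    if total_credits == 0:
        return 0.0
    return (credits_earned / total_credits) * 100

# Hypothetical example: a model earns 42 of 60 possible credits on the Easy tier.
easy = difficulty_score(42, 60)  # 70.0
```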
Overall Score
Overall = (0.2 × Easy) + (0.35 × Medium) + (0.45 × Hard)
Hard tasks are weighted more heavily because they better represent the complex, high-value work that defines investment banking.
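The weighted blend can be sketched the same way; again, the function name and sample tier scores below are illustrative assumptions rather than the benchmark's actual implementation:

```python
def overall_score(easy: float, medium: float, hard: float) -> float:
    """Overall IB-Bench score: weighted average that favors harder tiers."""
    return 0.2 * easy + 0.35 * medium + 0.45 * hard

# Hypothetical tier scores: Easy 70, Medium 55, Hard 40
# 0.2*70 + 0.35*55 + 0.45*40 = 14 + 19.25 + 18 = 51.25
print(overall_score(70, 55, 40))  # 51.25
```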
Limitations
- Results reflect isolated task performance, not end-to-end workflows
- Some tasks were blocked by API-provider over-refusal, which is scored as an immediate failure
- Performance may vary with different prompting strategies and configurations
Get in Touch
Interested in private evaluations or training data? Want your model benchmarked on IB-Bench? Reach out on X or GitHub.
Check out the repository for full details on tasks, prompts, and scoring rubrics. If you find IB-Bench useful, consider giving us a star!