Methodology

IB-Bench evaluates LLMs on real-world investment banking tasks.

Overview

Most finance benchmarks test financial concepts and CFA-style trivia. That is useful for measuring knowledge, but it does not reflect real investment banking work. IB-Bench tests the tasks a junior analyst would actually encounter, such as parsing 10-Ks, building and fixing LBO models, and synthesizing equity research.

Task materials are derived either from real work created by industry professionals or from synthetic data generated under the supervision of IB practitioners. The benchmark contains 36 tasks across three difficulty levels, weighted to emphasize harder, more complex work.

Tasks by Difficulty

Explore the dataset by difficulty tier. Drill down into individual tasks for performance analysis, or inspect the source code on GitHub.

Scoring

Each task is scored 0-100 against verified ground truth, by an LLM judge, or by a hybrid of the two. Credit is awarded based on the score:

≥80: Full credit (1 point)
50-79: Half credit (0.5 points)
<50: No credit (0 points)
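
As a minimal sketch of this mapping (the function name below is illustrative, not part of the IB-Bench codebase):

    def score_to_credit(score: float) -> float:
        """Map a 0-100 task score to credit using the thresholds above."""
        if score >= 80:
            return 1.0   # full credit
        if score >= 50:
            return 0.5   # half credit
        return 0.0       # no credit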

Difficulty Score

Easy = (credits earned / total credits possible) × 100

The same formula applies to Medium and Hard.
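
Expressed as a small helper (illustrative names, not the benchmark's actual code):

    def tier_score(credits_earned: float, total_credits: float) -> float:
        """Fraction of possible credits in a tier, scaled to 0-100."""
        return credits_earned / total_credits * 100

    tier_score(7.5, 12)  # 62.5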

Overall Score

Overall = (0.2 × Easy) + (0.35 × Medium) + (0.45 × Hard)

Hard tasks are weighted more heavily because they better represent the complex, high-value work that defines investment banking.
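
A sketch of the weighting, assuming the three tier scores have already been computed as above (names are illustrative):

    def overall_score(easy: float, medium: float, hard: float) -> float:
        """Weighted combination of the three tier scores (each 0-100)."""
        return 0.2 * easy + 0.35 * medium + 0.45 * hard

    overall_score(easy=90.0, medium=70.0, hard=40.0)  # 60.5

Because the weights sum to 1, the overall score stays on the same 0-100 scale as the tier scores.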

Limitations

  • Results reflect isolated task performance, not end-to-end workflows
  • Some tasks triggered over-refusal by API providers, which blocked the response and was scored as an immediate failure
  • Performance may vary with different prompting strategies and configurations

Get in Touch

Interested in private evaluations or training data? Want your model benchmarked on IB-Bench? Reach out on X or GitHub.

Check out the repository for full details on tasks, prompts, and scoring rubrics. If you find IB-Bench useful, consider giving us a star!

© 2026 IB-bench. All rights reserved.