Tasks

IB-bench evaluates LLMs on real-world investment banking tasks.

Overview

Most finance benchmarks test financial concepts and CFA-style trivia. That is useful for measuring knowledge, but it does not reflect real investment banking work. IB-bench tests the tasks a junior analyst would actually encounter in their day-to-day work.

Task materials are derived from real work produced by industry professionals or from synthetic data generated under the supervision of IB practitioners. The benchmark comprises 33 tasks across three difficulty levels, weighted to emphasize harder, more complex work.

Public Set

Explore the dataset by difficulty tier. Drill down into individual tasks for performance analysis, or inspect the source code on GitHub.

Scoring

Each task is scored from 0 to 100 using verified ground truth, an LLM judge, a human judge, or a hybrid of these. Credit is awarded based on the score, as illustrated below:

  • ≥90: Full credit (1 point)
  • 50-89: Half credit (0.5 points)
  • <50: No credit (0 points)
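
The credit rule maps directly to a small function. This is a minimal sketch in Python; the function name and signature are ours for illustration and are not part of the benchmark code.

    def credit_for_score(score: float) -> float:
        """Map a 0-100 task score to benchmark credit."""
        if score >= 90:
            return 1.0  # full credit
        if score >= 50:
            return 0.5  # half credit
        return 0.0      # no credit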

Difficulty Score

Easy = (credits earned / total credits possible) × 100

The same formula applies to the Medium and Hard tiers.
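
For example, a tier score can be computed from per-task credits (a sketch with an illustrative name and made-up credits; each task is assumed to contribute at most 1 point):

    def tier_score(credits: list[float]) -> float:
        """Tier score: credits earned over credits possible, scaled to 0-100."""
        return sum(credits) / len(credits) * 100

    easy = tier_score([1.0, 0.5, 1.0, 0.0])  # 2.5 of 4 possible credits -> 62.5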

Overall Score

Overall = (0.2 × Easy) + (0.35 × Medium) + (0.45 × Hard)
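
With hypothetical tier scores, the weighting works out as follows (a sketch; the numbers are made up):

    def overall_score(easy: float, medium: float, hard: float) -> float:
        """Weighted overall score; harder tiers carry more weight."""
        return 0.2 * easy + 0.35 * medium + 0.45 * hard

    # 0.2*80 + 0.35*60 + 0.45*40 = 16 + 21 + 18 = 55
    print(overall_score(80, 60, 40))  # 55.0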

Limitations

  • Results reflect isolated task performance, not end-to-end workflows
  • Some tasks were blocked by API providers due to over-refusal, which counts as an immediate failure
  • v1 of IB-bench does not include slide-building workflows

Get in Touch

Interested in private evaluations or training data? Want your model benchmarked on IB-bench? Reach out on X or GitHub.

Check out the repository for full details on tasks, prompts, and scoring rubrics. If you find IB-bench useful, consider giving us a star!

© 2026 IB-bench. All rights reserved.