Introducing IB-bench: The First Open-Source Eval for Investment Banking

Danial A.

TL;DR

  • IB-bench evaluates LLMs on 33 real investment banking tasks - not Q&A, but actual deliverables: building and debugging Excel models, extracting data from SEC filings, synthesising equity research, generating M&A target lists
  • Best result: Claude Opus 4.5 scores 30.5%, GPT-5.2 scores 9.2% (11 tasks blocked by provider filters)
  • Bounded tasks with provided inputs succeed; long-horizon tasks requiring independent sourcing or multi-step execution fail
  • Fully open source: tasks, evaluation harness, scoring logic, and results at github.com/daaa1m/ib-bench

Introduction

IB-bench is an open-source benchmark for evaluating large language models on investment banking tasks. Given a real-world task, an LLM must produce a work output consistent with the standards expected from a junior investment banking analyst. These tasks include building, updating, and fixing financial models in Excel, as well as summarising economic terms in merger agreements.

Motivation

Execution-based benchmarks have proven effective at measuring progress. In software engineering, SWE-bench (2023) demonstrated this by testing whether LLMs can resolve real GitHub issues rather than just write isolated functions. This gave the field a clear, realistic target, and top LLMs now resolve over 70% of verified tasks.

The principle generalises to other domains with concrete outputs and verifiable correctness. Investment banking is a good candidate: analysts produce spreadsheets, outputs can be checked against expected values, and the work is economically valuable but structured enough to grade.

But no such benchmark exists for finance. Existing evaluations like FinBen and FinanceBench test knowledge through Q&A - an LLM can score well on these while being unable to construct a realistic financial model. On the other hand, SpreadsheetBench (NeurIPS 2024) tests generic Excel operations rather than domain-specific financial work. IB-bench addresses this gap.

Task Design

Each task represents a realistic deliverable - work that would be assigned to a junior analyst:

| Category | Description | Example |
| --- | --- | --- |
| Fix-error | Debug a broken financial model | Identify and correct circular reference errors in an LBO model |
| Build-model | Construct a valuation model from inputs | Build a DCF given historical financials and assumptions |
| Extract-data | Parse documents into structured format | Extract key metrics from a 10-K into a summary table |
| Summarise | Synthesise information from documents | Compare key terms across two equity research reports |

Input types include Excel files (.xlsx), PDFs, and web data to reflect realistic data sources.

Task Sourcing

Tasks are sourced from investment banking practitioners. This ensures:

  1. Realistic complexity - Tasks reflect actual workflows, not toy problems
  2. Appropriate difficulty calibration - Time estimates come from domain experience
  3. Valid grading criteria - Expected outputs match professional standards

Where confidential data cannot be used, practitioners generate synthetic inputs (financial models, company scenarios) designed to mirror real-world complexity.

SWE-bench demonstrated that underspecified problems degrade benchmark quality. Each task is therefore designed to be solvable from the provided information alone. Some outputs are nonetheless subjective - an unavoidable feature of investment banking work - and for those an LLM judge is used for grading.

Scoring

Tasks are categorised by estimated completion time for a human analyst. Scoring is weighted toward harder tasks, to better differentiate LLM capabilities.

| Difficulty | Time Estimate | Weight |
| --- | --- | --- |
| Easy | < 1 hour | 20% |
| Medium | Few hours | 35% |
| Hard | > 1 day | 45% |

Overall Score = 0.20 × Easy + 0.35 × Medium + 0.45 × Hard
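As a sketch, the weighting works out as follows (tier names and the function name are illustrative; weights are from the table above):

```python
# Tier weights from the difficulty table above.
WEIGHTS = {"easy": 0.20, "medium": 0.35, "hard": 0.45}

def overall_score(tier_scores: dict[str, float]) -> float:
    """Weighted average of per-tier scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[tier] * score for tier, score in tier_scores.items())

# e.g. Claude Opus 4.5's published tier scores:
# overall_score({"easy": 45.0, "medium": 40.0, "hard": 16.7})  # ≈ 30.5
```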

Example Tasks by Tier

Easy: Fix a broken LBO model. Given a leveraged buyout model that does not balance, diagnose the issue and repair the Excel file so it ties. This tests basic debugging capability - identifying circular reference errors, broken formulas, or incorrect cell references.

Medium: Generate an M&A target list. Given an acquirer profile and target criteria (e.g., European food ingredients companies, €50-500M revenue, PE-backed or family-owned), identify plausible acquisition targets with company details, estimated revenue, ownership structure, and strategic rationale. This requires research, filtering, and structured output.

Hard: Build a 3-statement model from an activist presentation. Given an activist investor’s presentation (e.g., Elliott’s case for Phillips 66), source relevant SEC filings (10-K, quarterly 10-Qs), build a complete three-statement financial model, incorporate the activist’s value-creation drivers, and deliver an invest/pass recommendation with scenario analysis. This represents multiple days of analyst work: research, modelling, and synthesis from a cold start.

Scoring Methodology

Unlike code benchmarks that use pass@k (probability of success across k attempts), IB-bench uses single-attempt scoring with partial credit.

Rationale: Partial correctness has value - an LLM scoring 70% is more useful than one scoring 0%. Financial models admit degrees of correctness that binary pass/fail does not capture.

Each task is scored 0-100 based on numerical accuracy, formula integrity, and completeness. Credit is awarded as follows:

| Score | Credit | Interpretation |
| --- | --- | --- |
| ≥ 90 | 1.0 | Correct, minor issues at most |
| 50-89 | 0.5 | Partially correct, but with clear errors |
| < 50 | 0.0 | Unusable work product: incorrect or incomplete |

Per-tier score = (total credits earned / total tasks in tier) × 100
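In code, the credit thresholds and per-tier aggregation might look like this (a sketch; the actual scoring implementation lives in the repo):

```python
def credit(score: float) -> float:
    """Map a 0-100 task score to credit per the thresholds above."""
    if score >= 90:
        return 1.0
    if score >= 50:
        return 0.5
    return 0.0

def tier_score(task_scores: list[float]) -> float:
    """Per-tier score: total credit earned over total tasks, as a percentage.
    Blocked tasks enter as 0 - they are reported as failures."""
    if not task_scores:
        return 0.0
    return sum(credit(s) for s in task_scores) / len(task_scores) * 100
```

For example, Claude's three hard-task scores (9, 66, 47) earn 0 + 0.5 + 0 credit over 3 tasks, giving a hard-tier score of 16.7.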

Anatomy of a Task

To illustrate how IB-bench works in practice, here is task e-001: find and fix an error in an LBO model.

Prompt (abbreviated):

You are an investment banking analyst tasked with auditing a simple, integrated LBO model for ‘Dave & Buster’s’ that is currently broken. The Balance Sheet does not tie (Total Assets ≠ Total Liabilities + Equity). Your goal is to identify the specific row causing the imbalance and provide a structural fix. There is exactly one source of error.

The prompt also provides diagnostic guidance - similar to how a VP would coach a junior analyst:

To diagnose the error, apply the following systematic accounting checks:

1. Variance Analysis - Identify the first year where the check row is non-zero. Determine if the variance is constant (suggesting a hard-code error) or growing (suggesting a formula linkage issue).

2. The “Half-Number” Sign Check - If the variance is a specific dollar amount, scan for half that amount. A sign-convention flip results in a variance double the original value.

3. Linkage Integrity - Audit summation ranges, confirm every balance sheet delta flows through the cash flow statement exactly once, and verify the ending cash link.

The LLM receives the Excel file and must return a JSON object with the error location, corrected formula, and explanation. The guidance teaches methodology but does not reveal the answer.
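The exact response schema is not reproduced here; as an illustration, a submission for a fix-error task might be validated along these lines (the field names and cell references are assumptions, not the benchmark's actual schema):

```python
import json

# Hypothetical field names - the benchmark's actual schema may differ.
REQUIRED_FIELDS = {"error_location", "corrected_formula", "explanation"}

def parse_submission(raw: str) -> dict:
    """Parse and minimally validate an LLM's JSON answer for a fix-error task."""
    answer = json.loads(raw)
    missing = REQUIRED_FIELDS - answer.keys()
    if missing:
        raise ValueError(f"submission missing fields: {sorted(missing)}")
    return answer

# Illustrative submission; the range B136:B138 is made up for the example.
example = json.dumps({
    "error_location": "Row 140",
    "corrected_formula": "=SUM(B136:B138)",
    "explanation": "Cash from Investing omitted Row 138 (Maintenance Capex).",
})
parse_submission(example)  # validates without raising
```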

Rubric:

| Criterion | Points | Requirement |
| --- | --- | --- |
| Error location | 55 | Identify Row 140 (Cash from Investing subtotal) |
| Corrected formula | 45 | Include Row 138 (Maintenance Capex) in the sum |

Results: Claude scored 100, GPT scored 100. Both LLMs correctly identified that Row 140’s formula omitted Row 138, and provided valid corrections.

This task is representative of easy-tier work: a well-defined problem with a verifiable answer. Medium and hard tasks introduce ambiguity, multi-step reasoning, and less structured data sources.

Early Results

We evaluated Claude Opus 4.5 (Anthropic) and GPT-5.2 (OpenAI) across all 33 public tasks.

Benchmark Results by Model
| Model | Easy | Medium | Hard | Overall | Blocked |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.5 | 45.0 | 40.0 | 16.7 | 30.5 | 0/33 |
| GPT-5.2 | 37.5 | 5.0 | 0.0 | 9.2 | 11/33 |

Claude Opus 4.5 achieves an overall score of 30.5, driven by stronger performance on medium and hard tasks. GPT-5.2 scores 9.2, with 11 tasks blocked by provider content filters before execution began - a significant limitation.

Performance Analysis

Breaking down results by tier reveals where LLMs succeed and fail:

Task-Level Performance
Scores are out of 100: Pass (≥ 90), Partial (50-89), Fail (< 50); Blocked means the provider's content filter refused the task.

| Task | Claude Opus 4.5 | GPT-5.2 |
| --- | --- | --- |
| e-001 | 100 | 100 |
| e-002 | 83 | 87 |
| e-003 | 0 | 0 |
| e-004 | 0 | 0 |
| e-005 | 50 | 94 |
| e-006 | 91 | Blocked |
| e-007 | 97 | 93 |
| e-008 | 0 | 0 |
| e-009 | 100 | Blocked |
| e-010 | 0 | Blocked |
| e-011 | 0 | Blocked |
| e-012 | 92 | 85 |
| e-013 | 85 | 76 |
| e-014 | 90 | 94 |
| e-015 | 48 | 81 |
| e-016 | 22 | 86 |
| e-017 | 0 | 86 |
| e-018 | 95 | Blocked |
| e-019 | 87 | 85 |
| e-020 | 0 | 0 |
| m-001 | 64 | Blocked |
| m-002 | 10 | Blocked |
| m-003 | 26 | 17 |
| m-004 | 71 | 16 |
| m-005 | 18 | Blocked |
| m-006 | 91 | Blocked |
| m-007 | 91 | Blocked |
| m-008 | 54 | 52 |
| m-009 | 0 | 20 |
| m-010 | 84 | 0 |
| h-001 | 9 | Blocked |
| h-002 | 66 | Blocked |
| h-003 | 47 | 20 |

Easy (20 tasks, 20% weight): On tasks estimated to take a human analyst under an hour, both LLMs perform reasonably well - Claude Opus 4.5 at 45.0, GPT-5.2 at 37.5. Formula debugging and document synthesis worked; web-based data extraction did not - LLMs hallucinated values rather than navigating to source filings. Light analysis tasks (target lists, diligence questions) produced mixed results.

Medium (10 tasks, 35% weight): Here the gap widens considerably. Claude Opus 4.5 scores 40.0; GPT-5.2 scores 5.0, though provider refusals explain much of the difference - GPT-5.2 was blocked on 5 of 10 medium tasks, all involving Excel file manipulation. On tasks where both LLMs executed, results were mixed: document synthesis was adequate, but model-building tasks defeated both.

Hard (3 tasks, 45% weight): No LLM passed a hard task outright. Claude Opus 4.5 achieved a best score of 66 on an LBO model (partial credit); GPT-5.2’s best was 20. Tasks at this tier require end-to-end workflows - sourcing filings, constructing models, synthesising recommendations - that remain beyond current capabilities.

Performance by Input Type
Average scores (out of 100) by input type:

| Input Type | Tasks | Claude Opus 4.5 | GPT-5.2 |
| --- | --- | --- | --- |
| Excel | 15 | 56 | 20 |
| PDF | 5 | 57 | 69 |
| Web | 6 | 43 | 38 |
| Multi | 7 | 41 | 32 |

The averages obscure important nuance. GPT-5.2’s low Excel score (19.8) reflects provider blocks, not capability - on the five Excel tasks where it actually executed, GPT was competitive with Claude. Conversely, GPT’s PDF lead is fragile: it’s driven largely by one merger agreement task (86 vs 22), and both LLMs failed completely on the longest document (a 200-page 10-K). Multi-input tasks show high variance - combining sources creates unpredictable results even within the same difficulty tier.

Key Takeaways

Provider content filters block Excel tasks. GPT-5.2 was refused on 11 tasks, nearly all involving Excel file manipulation. OpenAI’s filters appear to treat spreadsheet operations as higher-risk regardless of content. This is a key limitation in an otherwise competent LLM and needs to be addressed.

Provided inputs work; independent sourcing does not. When source materials are provided directly, LLMs perform well - synthesising two equity research reports scored 97 and 93. When tasks require navigating to SEC EDGAR or company IR sites, both LLMs failed consistently, hallucinating values rather than locating specific footnote data. This suggests model companies need to pair LLMs with reliable data feeds (Bloomberg, FactSet) rather than expect them to source independently.
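For context on what "navigating to SEC EDGAR" involves: EDGAR exposes filing metadata through a public JSON endpoint keyed by a zero-padded, 10-digit CIK. A minimal sketch (URL construction only; a real pipeline would fetch and parse the response, and the SEC asks clients to send a descriptive User-Agent header):

```python
def edgar_submissions_url(cik: int) -> str:
    """Build the SEC EDGAR submissions endpoint for a company.
    EDGAR keys the JSON file on a zero-padded, 10-digit CIK."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"

# Apple Inc. has CIK 320193:
# edgar_submissions_url(320193)
# -> "https://data.sec.gov/submissions/CIK0000320193.json"
```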

Context limits constrain real-world use. Investment banking tasks often involve multiple PDFs alongside Excel models - enough to hit context limits on current frontier LLMs. Much of this content is legal boilerplate or other low-value text that nonetheless consumes tokens. Selective ingestion or retrieval-augmented approaches will be necessary for production workflows.
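One naive form of selective ingestion: rank document sections by keyword density and keep only the most relevant ones within a token budget. A toy sketch (the keyword list, budget, and word-count token approximation are all illustrative; a production pipeline would more likely use an embedding-based retriever):

```python
# Illustrative keywords an analyst might use to rank 10-K sections.
KEYWORDS = {"revenue", "ebitda", "margin", "capex", "guidance", "segment"}

def select_sections(sections: list[str], token_budget: int) -> list[str]:
    """Greedily keep the sections with the highest keyword density,
    approximating token counts as whitespace-separated words."""
    def density(text: str) -> float:
        words = text.lower().split()
        hits = sum(w.strip(".,;:") in KEYWORDS for w in words)
        return hits / max(len(words), 1)

    kept, used = [], 0
    for section in sorted(sections, key=density, reverse=True):
        cost = len(section.split())
        if used + cost <= token_budget:
            kept.append(section)
            used += cost
    return kept
```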

Single errors are tractable; multiple are harder. Both LLMs scored 100 on finding a single formula error in an LBO model - in superhuman time. Finding multiple errors proved inconsistent: Claude scored 95 on one three-error task but 0 on another, while GPT was blocked on both.

Excel handling needs targeted training. Current LLMs default to recalculating Excel formulas manually in Python rather than using proper spreadsheet engines. RL training on Excel debugging tasks - using tools like LibreOffice for recalculation - would likely improve performance on financial modelling work. See How Do LLMs Work with Excel Files? for more detail.

Long-horizon tasks remain unsolved. Some easy tasks failed while some medium tasks passed - the differentiator was task horizon, not difficulty. Bounded tasks succeeded: clear inputs, single objective, template provided. Long-horizon tasks failed: sourcing data across periods, building models from scratch, iterating through multiple errors. LLMs handle atomic operations well but struggle to chain them reliably.

Computer use may unlock progress. Many failure modes - web navigation, Excel manipulation, context overload - stem from LLMs working through code rather than interfaces. Computer use agents that browse SEC EDGAR directly, operate Excel through the GUI, and scroll through documents selectively could sidestep these limitations. We plan to evaluate computer use agents as they mature.

Limitations

Data contamination risk. Some of the materials used exist in public datasets - financial models, SEC filings, equity research. However, IB-bench tests multi-step reasoning and execution, not knowledge recall. Pre-training data rarely includes end-to-end task completions of this nature; the greater risk would be RL fine-tuning on similar workflows, which is less common. We are still developing a private test set as an additional safeguard.

Grading subjectivity. Tasks with open-ended outputs (target lists, investment recommendations) use LLM-based judging against reference solutions. This introduces variance. We mitigate this with detailed rubrics and multiple evaluation criteria, but some subjectivity remains.

Limited LLM coverage. This release includes two LLMs. We plan to expand coverage as more frontier LLMs become available and as we refine the evaluation harness. If you would like your LLM to be benchmarked, please reach out.

Task coverage gaps. IB-bench v1 focuses on Excel-based modelling and document analysis. Slide-building (a core analyst deliverable) is not yet included. We also do not test live data terminal access (Bloomberg, FactSet) or real-time market tasks.

Provider blocking affects comparability. GPT-5.2 was refused on 11 tasks, primarily Excel manipulation. This makes direct LLM comparison difficult for affected task categories. We report blocked tasks as failures but acknowledge this limitation.

Results are preliminary. The benchmark is in active development, and we expect scores to improve as LLMs and scaffolds evolve.

Availability

The benchmark is fully open source:

  • Task specifications and expected outputs
  • Evaluation harness
  • Scoring implementation
  • Model outputs and grades

We are developing a private test set to address contamination concerns.

Conclusion

IB-bench provides the first execution-based benchmark for investment banking tasks. At 30.5%, the best LLM is far from production-ready - and that score is carried by bounded tasks with provided inputs. Long-horizon work, independent sourcing, and multi-step model building remain unsolved.

For LLMs to meaningfully augment analyst workflows, we’d need to see scores above 70% with consistent performance on medium-tier tasks. That likely requires reliable integrations with data providers, larger context windows, targeted RL training on Excel operations, computer use capabilities, and better handling of multi-step execution.

Results are available on the leaderboard. We welcome task contributions from practitioners - the benchmark improves with broader coverage of real-world scenarios. Code and tasks are available at github.com/daaa1m/ib-bench.

© 2026 IB-bench. All rights reserved.