Why We Weight Hard Tasks 45%: The Case for Difficulty-Based Scoring
· Danial A.
When we designed IB-Bench, one of our most debated decisions was how to weight tasks by difficulty. We settled on 20% for easy tasks, 35% for medium, and 45% for hard. Here’s why.
The Problem with Equal Weighting
Most benchmarks treat all tasks equally. Pass 8 out of 10 tasks? You score 80%. This approach has a fundamental flaw: it doesn’t reflect real-world value creation.
In investment banking, an analyst who can:
- Format a pitch deck perfectly (easy task)
- Calculate basic valuation multiples (easy task)
- Build a merger model from scratch (hard task)
…is far more valuable than one who can only do the first two. Yet with equal weighting, the model that aces easy tasks but fails hard ones looks comparable to one with the opposite profile.
How We Categorize Difficulty
Our difficulty tiers are based on time-to-completion for a competent human analyst:
| Difficulty | Time Required | Weight |
|---|---|---|
| Easy | Less than 1 hour | 20% |
| Medium | A few hours | 35% |
| Hard | More than 1 day | 45% |
This mapping directly reflects the economic value of automation: automating a task that takes eight hours saves eight times as much analyst time as automating a one-hour task.
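The tier-to-weight mapping can be expressed as a small scoring helper. This is a minimal sketch: the weights and time bands come from the table above, while the dict layout and function name are our own illustration.

```python
# Difficulty weights from the table above; the structure and names are illustrative.
DIFFICULTY_WEIGHTS = {
    "easy": 0.20,    # less than 1 hour for a competent human analyst
    "medium": 0.35,  # a few hours
    "hard": 0.45,    # more than 1 day
}

def weighted_score(pass_rates: dict[str, float]) -> float:
    """Combine per-tier pass rates (0.0-1.0) into one difficulty-weighted score."""
    return sum(DIFFICULTY_WEIGHTS[tier] * rate for tier, rate in pass_rates.items())
```

A model that passes every easy and medium task but no hard tasks tops out at `weighted_score({"easy": 1.0, "medium": 1.0, "hard": 0.0})` = 0.55, which is the ceiling the hard-task weight is designed to enforce.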
The Skill Gap Argument
Hard tasks also have a higher skill barrier. Many analysts can format a model correctly; fewer can build one from a blank spreadsheet. By weighting hard tasks more heavily, we’re measuring a model’s ceiling, not just its floor.
Consider two hypothetical models:
- Model A: 100% on easy, 100% on medium, 0% on hard
- Model B: 60% on easy, 70% on medium, 80% on hard
With equal weighting, Model A scores 67%. Model B scores 70%. But Model B is clearly more capable—it can attempt and partially succeed at tasks Model A can’t touch.
With our weighting, Model A scores 55% and Model B scores 72.5%. The widened gap better reflects the true capability difference.
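The arithmetic behind both schemes is easy to check. The pass rates below are the hypothetical ones from the comparison; the code itself is just illustrative bookkeeping.

```python
# Score the two hypothetical models under equal and difficulty-based weighting.
weights = {"easy": 0.20, "medium": 0.35, "hard": 0.45}
models = {
    "Model A": {"easy": 1.00, "medium": 1.00, "hard": 0.00},
    "Model B": {"easy": 0.60, "medium": 0.70, "hard": 0.80},
}

for name, rates in models.items():
    equal = sum(rates.values()) / len(rates)               # simple average
    weighted = sum(weights[t] * r for t, r in rates.items())  # weighted sum
    print(f"{name}: equal = {equal:.1%}, weighted = {weighted:.1%}")
# → Model A: equal = 66.7%, weighted = 55.0%
# → Model B: equal = 70.0%, weighted = 72.5%
```

Under equal weighting the two models look nearly interchangeable; under difficulty weighting, Model B's ability to make progress on hard tasks pulls it well clear of Model A.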
Counterarguments We Considered
Some argue that reliability on easy tasks matters more than capability on hard ones. Fair point. But our target user isn’t someone who needs a calculator—they need an analyst.
Others suggest that partial credit on hard tasks inflates scores. That’s why we use strict thresholds: below 50% gets zero credit, 50-79% gets half credit, and only 80%+ gets full credit.
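The strict thresholds amount to a step function over the raw completion fraction. A minimal sketch, where the thresholds are from the article and the function name is our own:

```python
# Strict partial-credit rule for hard tasks, as described above.
# Thresholds (50% and 80%) are from the article; the name is hypothetical.
def partial_credit(completion: float) -> float:
    """Map a raw completion fraction (0.0-1.0) to the credit awarded."""
    if completion >= 0.80:
        return 1.0  # 80%+ earns full credit
    if completion >= 0.50:
        return 0.5  # 50-79% earns half credit
    return 0.0      # below 50% earns nothing
```

The step function means a model cannot accumulate score by scattering small fragments of progress across many hard tasks; only substantially complete attempts count.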
Looking Forward
We’re continuously refining our methodology. As models improve, we expect to add even harder tasks and potentially adjust weights. The goal remains constant: measure what matters for real investment banking work.
Have thoughts on our scoring approach? Reach out on X or GitHub.