AI Can't Read an Investor Deck
Source: Mercor Blog | Author: Saumya Chauhan, Ayushi Sinha, Chirag Mahapatra, Abhi Kottamasu | Published: 2026-04-08 | Category: research | Credibility: medium
Executive Summary
- Frontier AI models (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) achieve only 56–80% accuracy on 25 tasks drawn from authentic financial documents — well below the headline benchmark numbers used in marketing materials.
- The text-to-image performance gap is 16–20 percentage points; all three models degrade systematically when forced to extract values from charts, tables, and slides rather than parsed text.
- Parametric (memorized) knowledge is nearly useless for real financial analysis: models answered only 1 of 25 questions correctly from memory alone, confirming that retrieval and extraction — not latent knowledge — are the bottleneck.
Critical Analysis
Claim: “Visual extraction from real financial documents is a bottleneck for every frontier model, not a quirk of any single one”
- Evidence quality: case-study (vendor-conducted, small sample)
- Assessment: The 16–20 point gap between text-only and image-only performance is directionally credible. It aligns with independent research showing VLMs hallucinate when visual cues are ambiguous and struggle with nested layouts and embedded charts. However, 25 tasks is a very small sample, and the specific financial documents are not publicly released for replication.
- Counter-argument: The tested models (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) are frontier-class but were evaluated in a single configuration: zero-shot, image-as-input, no preprocessing pipeline. Real financial analyst workflows would combine OCR preprocessing, structured data extraction, and LLM reasoning in a multi-step pipeline rather than a single-shot image prompt (a minimal sketch of such a pipeline follows this claim). The gap may reflect prompt engineering choices, not an inherent ceiling.
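To make that counter-argument concrete, here is a minimal Python sketch of the kind of multi-step pipeline it describes: OCR preprocessing, structured extraction, then LLM reasoning over parsed content, as opposed to feeding raw slide images to a model in one shot. All function names (`ocr_preprocess`, `extract_structure`, `call_llm`) and the sample values are hypothetical stand-ins, not Mercor's setup or any vendor's API; a real pipeline would plug in an OCR engine, a table-extraction step, and a model client at the marked points.

```python
"""Sketch of a multi-step document pipeline vs. a single-shot image prompt.

All functions are illustrative stubs: a real system would plug in an OCR
engine, a table-extraction library, and an LLM API client.
"""
from dataclasses import dataclass


@dataclass
class ExtractedDocument:
    text: str                     # OCR'd running text from the slides
    tables: list[dict[str, str]]  # structured rows pulled from tables/charts


def ocr_preprocess(image_path: str) -> str:
    # Stage 1: OCR. Stand-in for a real OCR engine or document-parsing service.
    return f"<text recovered from {image_path}>"


def extract_structure(raw_text: str) -> ExtractedDocument:
    # Stage 2: structured extraction (tables, labeled chart values, footnotes).
    return ExtractedDocument(text=raw_text, tables=[{"metric": "ARR", "value": "$12.4M"}])


def answer_question(doc: ExtractedDocument, question: str) -> str:
    # Stage 3: LLM reasoning over *parsed* text and tables, not raw pixels.
    prompt = (
        "Answer using only the document below.\n"
        f"TEXT:\n{doc.text}\n\nTABLES:\n{doc.tables}\n\nQUESTION: {question}"
    )
    return call_llm(prompt)


def call_llm(prompt: str) -> str:
    # Placeholder; a production pipeline would call a model API here.
    return f"<model answer for prompt of {len(prompt)} chars>"


if __name__ == "__main__":
    doc = extract_structure(ocr_preprocess("deck_page_07.png"))
    print(answer_question(doc, "What was ARR in Q3?"))
```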
Claim: “Models answered only 1 of 25 questions correctly using memorized parametric knowledge (Claude and GPT at 4%, Gemini at 0%)”
- Evidence quality: case-study (vendor-conducted)
- Assessment: This finding is credible in a narrow sense: specific numerical values from private investor decks and earnings reports would not appear in training data. It correctly identifies that frontier LLMs are not encyclopedias of current financial data. However, it’s not a novel finding — it’s well-established that LLMs require retrieval (RAG or tool use) for current factual lookups.
- Counter-argument: This framing is slightly misleading. No production financial AI system relies on parametric recall for live document values. The result shows that models need document context to answer document-specific questions, which practitioners already assume (a short contrast of the two prompt setups follows this claim). Framing this as a “failure mode” treats the predictable failure of a setup no practitioner would use (parametric-only recall) as if it were a model limitation.
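A short illustration of the point above: the parametric-only condition asks the model to recall a private figure it has never seen, while any production setup supplies the document in context, turning the same question into an extraction task. `call_llm`, the deck excerpt, and the question are invented placeholders for illustration, not the article's materials.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for any model API call.
    return "<model answer>"

question = "What gross margin does the deck report for FY2025?"

# Parametric-only: the model must recall a figure from a private deck it has
# never seen; near-guaranteed to miss or hallucinate.
parametric_prompt = question

# Document-grounded: the relevant excerpt is supplied in context (via RAG,
# retrieval tools, or simply attaching the parsed document).
deck_excerpt = "Slide 14: FY2025 gross margin of 61%, up from 54% in FY2024."
grounded_prompt = (
    "Using only the excerpt below, answer the question.\n\n"
    f"{deck_excerpt}\n\n{question}"
)

print(call_llm(parametric_prompt))
print(call_llm(grounded_prompt))
```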
Claim: “Standard benchmarks don’t reflect real investor workflows”
- Evidence quality: opinion (expert reasoning, no controlled comparison to benchmark scores)
- Assessment: This claim is directionally correct and aligns with the broader benchmark saturation pattern documented in the catalog. GPT-5.4 scores ~84% on MMMU (multimodal benchmark) yet achieves 64% image-only on these financial tasks. The gap suggests benchmark conditions differ materially from messy real-world documents. However, the article does not formally compare its task set to any specific existing benchmark score.
- Counter-argument: “Real investor workflows” themselves vary enormously. A quant analyst’s workflow (structured data extraction from SEC XBRL filings) differs from an associate’s workflow (reading slide decks for qualitative signals). The article’s 25 tasks represent a particular slice — visual extraction + arithmetic — and may not generalize across all financial AI use cases.
Claim: “Two failure modes: visual extraction errors and reasoning failures”
- Evidence quality: case-study (qualitative categorization from 25 tasks)
- Assessment: The taxonomy is useful but coarse. “Visual extraction errors” (anchoring to wrong chart elements) and “reasoning failures” (applying incorrect financial operations) are real and documented. They map to known VLM failure modes: hallucination under visual ambiguity and domain calibration gaps in financial arithmetic. The evidence is qualitative categorization, not a systematic error analysis with confidence intervals; with only 25 tasks, the uncertainty around any per-category rate is wide (see the interval sketch after this claim).
- Counter-argument: Surge AI’s independent evaluation of frontier models on 200+ finance tasks identified six distinct failure modes — a richer taxonomy than the binary split presented here. Mercor’s two-category framing may compress real variation for narrative clarity. Operational failure categories include multi-step workflow breakdown, missing professional conventions, and framework misalignment — none of which map cleanly to either of Mercor’s two modes.
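To make the statistical under-powering concrete, a quick Wilson score interval on an illustrative result of 16 correct out of 25 (the 64% image-only figure cited above) shows how wide the uncertainty band is at this sample size: roughly 45% to 80%, wider than the entire 56–80% spread reported across the three models. This is a minimal sketch; the observed count is assumed for illustration, not taken from the article's per-task data.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Illustrative: 16/25 correct, i.e. the 64% image-only accuracy quoted above.
low, high = wilson_interval(16, 25)
print(f"95% CI for 16/25: {low:.0%} to {high:.0%}")  # ~45% to ~80%
```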
Credibility Assessment
- Author background: Saumya Chauhan, Ayushi Sinha, Chirag Mahapatra, and Abhi Kottamasu are listed as Mercor researchers. No independent academic or industry profiles are publicly available for these authors; they appear to be internal research/engineering staff.
- Publication bias: Mercor is an AI hiring and RLHF data platform with direct commercial interest in demonstrating AI’s limitations in professional knowledge work — the more AI appears limited, the more human expert labor (which Mercor supplies) is needed. This is a significant source of bias that the article does not disclose.
- Methodology transparency: 25 tasks, 3 models, no holdout set, no replication data, no inter-rater reliability for error classification. Results are directionally credible but statistically under-powered. The financial documents themselves are not released.
- Verdict: medium — The directional findings (image vs. text gap, parametric knowledge failure) are credible and corroborated by independent research. However, the small sample size, undisclosed vendor bias, and non-reproducible methodology prevent a high-credibility classification. Treat this as a useful signal rather than conclusive evidence.