Model Evaluation & Benchmarks

28 entries

Benchmarks, evaluation frameworks, and safety testing tools for measuring and comparing LLM capabilities.

AI Safety Evaluation (Pre-Deployment)

assess
pattern

Pre-deployment testing pattern where frontier AI models are assessed by independent third parties for dangerous autonomous capabilities.

Anthropic

adopt
vendor

AI safety company behind the Claude model family — including Claude Opus, Sonnet, Haiku, and the restricted Claude Mytho...

Apollo Research

assess
vendor

AI safety research organization focused on detecting and evaluating deceptive capabilities in frontier AI models.

Augment Code

trial
vendor

AI coding agent platform for professional software teams, built around a proprietary Context Engine that semantically indexes large codebases.

Benchmark Saturation

assess
pattern

Recurring dynamic where AI models approach maximum scores on benchmarks, rendering them unable to distinguish between state-of-the-art systems.

DeepEval

trial
open-source

Open-source Apache-2.0 LLM evaluation framework by Confident AI with 50+ metrics spanning RAG, agents, and multi-turn conversations.
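
A minimal sketch of what a DeepEval check can look like, assuming the LLMTestCase / AnswerRelevancyMetric API of recent releases and an OPENAI_API_KEY in the environment for the default judge model; the test-case text is invented:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical test case; input/actual_output would come from your app.
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the email link.",
)

# Scores relevancy with an LLM judge; the case fails below the threshold.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```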

Devin

assess
vendor

Cognition's commercial autonomous AI software engineer with full shell and browser access, SaaS and VPC deployment options.

Epoch AI

assess
vendor

AI research institute that tracks compute trends and model capabilities and publishes data-driven analyses of AI progress.

Google Agents CLI

assess
open-source

Google's open-source CLI wrapping the Agent Development Kit (ADK) to automate the full AI agent development lifecycle.

HCAST (Human-Calibrated Autonomy Software Tasks)

assess
open-source

METR's primary benchmark for measuring how well frontier AI systems complete software tasks autonomously, calibrated against 140 human experts.

Humanity's Last Exam (HLE)

assess
open-source

A 2,500-question expert-level benchmark curated by ~1,000 specialists to measure AI capabilities where frontier models still fall well short of human experts.

Inspect AI

trial
open-source

An open-source LLM evaluation framework by the UK AI Safety Institute with 100+ pre-built evals for safety, coding, and reasoning.
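
A minimal Inspect task sketch, assuming the current inspect_ai names (Task, Sample, generate, exact); the sample is invented. A task like this is run from the CLI, e.g. inspect eval hello_world.py --model openai/gpt-4o:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_world():
    # One-sample smoke test: the model must echo the target string exactly.
    return Task(
        dataset=[Sample(input="Reply with exactly: Hello World", target="Hello World")],
        solver=[generate()],
        scorer=exact(),
    )
```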

Langfuse

trial
open-source

Open-source LLM engineering platform (MIT-licensed, 21k+ GitHub stars) covering observability traces, evaluation, and prompt management.
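
A tracing sketch, assuming the 3.x top-level observe import (2.x exposed it under langfuse.decorators) and credentials in the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST environment variables:

```python
from langfuse import observe  # 2.x: from langfuse.decorators import observe

@observe()
def answer(question: str) -> str:
    # Call your LLM here; the decorator records inputs, outputs, timing,
    # and the nesting of any decorated sub-calls as a trace in Langfuse.
    return "It was completed in 1889."

answer("When was the Eiffel Tower finished?")
```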

LangSmith

assess
vendor

Observability and evaluation platform for LLM applications, providing tracing, prompt testing, and experiment comparison.
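
A minimal tracing sketch, assuming the langsmith Python SDK's traceable decorator and the current LANGSMITH_* environment variables (older LangChain integrations read LANGCHAIN_TRACING_V2 instead):

```python
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "..."  # your API key

from langsmith import traceable

@traceable
def pipeline(question: str) -> str:
    # Each call is logged as a run in LangSmith, with inputs and outputs.
    return "stub answer"

pipeline("What does tracing capture?")
```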

LiveCodeBench

trial
open-source

Contamination-resistant LLM coding benchmark that continuously collects new competitive programming problems from LeetCode, AtCoder, and Codeforces.

LlamaIndex

trial
open-source

Open-source MIT-licensed data framework for building RAG and document agent applications on top of LLMs, with 38k+ GitHub stars.
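
The canonical starter, assuming the default OpenAI models (OPENAI_API_KEY set) and a local ./data folder of documents:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every document under ./data, embed it, and query over it.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about evaluation?"))
```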

METR (Model Evaluation & Threat Research)

assess
vendor

Nonprofit research org that evaluates frontier AI models for dangerous autonomous capabilities before deployment.

METR Task Standard

assess
open-source

A portable specification for defining AI agent evaluation tasks with standardized environment setup, instructions, and scoring.
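
A sketch of the shape a task takes under the standard: a TaskFamily class whose static methods enumerate variants, render instructions, and score submissions. The version string and task data below are illustrative, not canonical:

```python
# my_task/my_task.py
class TaskFamily:
    standard_version = "0.3.0"  # Task Standard version this family targets

    @staticmethod
    def get_tasks() -> dict:
        # Each key names one task variant; the value is arbitrary task data.
        return {"easy": {"target": 100}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Write the number {t['target']} to /home/agent/answer.txt."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # 1.0 for a correct submission, 0.0 otherwise; None defers to manual scoring.
        return 1.0 if submission.strip() == str(t["target"]) else 0.0
```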

MMLU (Massive Multitask Language Understanding)

hold
open-source

A benchmark of 15,908 multiple-choice questions across 57 academic subjects for evaluating LLM knowledge, now effectively saturated.
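
The dataset is straightforward to inspect from the Hugging Face hub, assuming the cais/mmlu card with its per-subject configs:

```python
from datasets import load_dataset

# Each row is a question, four answer choices, and the index of the correct one.
mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")
row = mmlu[0]
print(row["question"], row["choices"], row["answer"])
```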

OpenHands

trial
open-source

An open-source platform for autonomous AI coding agents with Docker-sandboxed execution, multi-model support, and a Python SDK.

RAGAS

trial
open-source

Open-source Apache-2.0 evaluation framework for RAG pipelines and LLM applications by ExplodingGradients (YC W24), providing metrics such as faithfulness, answer relevancy, and context precision.
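
A sketch against the classic 0.1-era API (newer releases moved to an EvaluationDataset abstraction), assuming an OpenAI judge configured via OPENAI_API_KEY; the sample row is invented:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One made-up RAG interaction; real rows would come from your pipeline.
data = Dataset.from_dict({
    "question": ["When was the Eiffel Tower built?"],
    "answer": ["It was completed in 1889."],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
})

# Faithfulness checks the answer against the contexts; relevancy checks it
# against the question. Both use an LLM judge under the hood.
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))
```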

RE-Bench

assess
vendor

AI benchmark suite from METR for evaluating autonomous AI agent capabilities on real-world research engineering tasks.

Redwood Research

assess
vendor

Nonprofit AI safety research organization focused on AI control, alignment, and pre-deployment evaluation, home to the t...

Runloop

assess
vendor

Persistent sandboxed dev environments for AI agents with git-style state management and built-in SWE-bench integration.

SWE-bench

assess
open-source

A benchmark evaluating whether AI agents can resolve real-world GitHub issues by generating code patches that pass the repository's test suite.
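
The instances are published on the Hugging Face hub; a quick look at the Lite subset, assuming the princeton-nlp/SWE-bench_Lite dataset card:

```python
from datasets import load_dataset

# Each instance pairs a real GitHub issue with the repo state it was filed
# against; an agent must produce a patch that makes the repo's tests pass.
swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
inst = swe[0]
print(inst["repo"], inst["instance_id"])
print(inst["problem_statement"][:300])
```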

TruLens

assess
open-source

Open-source MIT-licensed LLM evaluation and tracing framework by TruEra, now maintained by Snowflake, combining OpenTelemetry-based tracing with feedback functions for evaluation.

VectorDBBench

assess
open-source

Open-source benchmarking tool for vector databases, covering 30+ databases with a CLI and a visual interface; maintained by Zilliz.

Vivaria

hold
open-source

METR's open-source platform for running AI agent evaluations and elicitation research, now deprecated in favor of Inspect AI.
