Model Evaluation & Benchmarks

28 entries

Benchmarks, evaluation frameworks, and safety testing tools for measuring and comparing LLM capabilities.

AI Safety Evaluation (Pre-Deployment)

assess
pattern

Pre-deployment testing pattern where frontier AI models are assessed by independent third parties for dangerous autonomous capabilities.

Anthropic

adopt
vendor

AI safety company behind the Claude model family — including Claude Opus, Sonnet, Haiku, and the restricted Claude Mytho...

Apollo Research

assess
vendor

AI safety research organization focused on detecting and evaluating deceptive capabilities in frontier AI models.

Augment Code

trial
vendor

AI coding agent platform for professional software teams, built around a proprietary Context Engine that semantically indexes large codebases.

Benchmark Saturation

assess
pattern

Recurring dynamic where AI models approach maximum scores on benchmarks, rendering them unable to distinguish between state-of-the-art systems.

DeepEval

trial
open-source

Open-source Apache-2.0 LLM evaluation framework by Confident AI with 50+ metrics spanning RAG, agents, and multi-turn conversations.
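
A minimal sketch of what a DeepEval check can look like, assuming the LLMTestCase / AnswerRelevancyMetric API of recent releases and an OPENAI_API_KEY in the environment for the default judge model; the test-case text is invented:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical test case; input/actual_output would come from your app.
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page and follow the email link.",
)

# Scores relevancy with an LLM judge; the case fails below the threshold.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```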

Devin

assess
vendor

Cognition's commercial autonomous AI software engineer with full shell and browser access, SaaS and VPC deployment options.

Epoch AI

assess
vendor

AI research institute that tracks compute trends and model capabilities and publishes data-driven analyses of AI progress.

Google Agents CLI

assess
open-source

Google's open-source CLI wrapping the Agent Development Kit (ADK) to automate the full AI agent development lifecycle.

HCAST (Human-Calibrated Autonomy Software Tasks)

assess
open-source

METR's primary benchmark for measuring how well frontier AI systems complete software tasks autonomously, calibrated against 140 human experts.

Humanity's Last Exam (HLE)

assess
open-source

A 2,500-question expert-level benchmark curated by ~1,000 specialists to measure AI capabilities where frontier models still fall well short of human experts.

Inspect AI

trial
open-source

An open-source LLM evaluation framework by the UK AI Safety Institute with 100+ pre-built evals for safety, coding, and reasoning.
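
A minimal Inspect task sketch, assuming the current inspect_ai names (Task, Sample, generate, exact); the sample is invented. A task like this is run from the CLI, e.g. inspect eval hello_world.py --model openai/gpt-4o:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_world():
    # One-sample smoke test: the model must echo the target string exactly.
    return Task(
        dataset=[Sample(input="Reply with exactly: Hello World", target="Hello World")],
        solver=[generate()],
        scorer=exact(),
    )
```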

Langfuse

trial
open-source

Open-source LLM engineering platform (MIT-licensed, 21k+ GitHub stars) covering observability traces, evaluation, and prompt management.
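
A tracing sketch, assuming the 3.x top-level observe import (2.x exposed it under langfuse.decorators) and credentials in the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST environment variables:

```python
from langfuse import observe  # 2.x: from langfuse.decorators import observe

@observe()
def answer(question: str) -> str:
    # Call your LLM here; the decorator records inputs, outputs, timing,
    # and the nesting of any decorated sub-calls as a trace in Langfuse.
    return "It was completed in 1889."

answer("When was the Eiffel Tower finished?")
```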

LangSmith

assess
vendor

Observability and evaluation platform for LLM applications, providing tracing, prompt testing, and experiment comparison.
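
A minimal tracing sketch, assuming the langsmith Python SDK's traceable decorator and the current LANGSMITH_* environment variables (older LangChain integrations read LANGCHAIN_TRACING_V2 instead):

```python
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "..."  # your API key

from langsmith import traceable

@traceable
def pipeline(question: str) -> str:
    # Each call is logged as a run in LangSmith, with inputs and outputs.
    return "stub answer"

pipeline("What does tracing capture?")
```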

LiveCodeBench

trial
open-source

Contamination-resistant LLM coding benchmark that continuously collects new competitive programming problems from LeetCode, AtCoder, and Codeforces.

LlamaIndex

trial
open-source

Open-source MIT-licensed data framework for building RAG and document agent applications on top of LLMs, with 38k+ GitHub stars.
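
The canonical starter, assuming the default OpenAI models (OPENAI_API_KEY set) and a local ./data folder of documents:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every document under ./data, embed it, and query over it.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about evaluation?"))
```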

METR (Model Evaluation & Threat Research)

assess
vendor

Nonprofit research org that evaluates frontier AI models for dangerous autonomous capabilities before deployment.

METR Task Standard

assess
open-source

A portable specification for defining AI agent evaluation tasks with standardized environment setup, instructions, and scoring.
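
A sketch of the shape a task takes under the standard: a TaskFamily class whose static methods enumerate variants, render instructions, and score submissions. The version string and task data below are illustrative, not canonical:

```python
# my_task/my_task.py
class TaskFamily:
    standard_version = "0.3.0"  # Task Standard version this family targets

    @staticmethod
    def get_tasks() -> dict:
        # Each key names one task variant; the value is arbitrary task data.
        return {"easy": {"target": 100}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Write the number {t['target']} to /home/agent/answer.txt."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # 1.0 for a correct submission, 0.0 otherwise; None defers to manual scoring.
        return 1.0 if submission.strip() == str(t["target"]) else 0.0
```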

MMLU (Massive Multitask Language Understanding)

hold
open-source

A benchmark of 15,908 multiple-choice questions across 57 academic subjects for evaluating LLM knowledge, now effectively saturated.
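
The dataset is straightforward to inspect from the Hugging Face hub, assuming the cais/mmlu card with its per-subject configs:

```python
from datasets import load_dataset

# Each row is a question, four answer choices, and the index of the correct one.
mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")
row = mmlu[0]
print(row["question"], row["choices"], row["answer"])
```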

OpenHands

trial
open-source

An open-source platform for autonomous AI coding agents with Docker-sandboxed execution, multi-model support, and a Python SDK.

RAGAS

trial
open-source

Open-source Apache-2.0 evaluation framework for RAG pipelines and LLM applications by ExplodingGradients (YC W24), providing metrics such as faithfulness, answer relevancy, and context precision.
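
A sketch against the classic 0.1-era API (newer releases moved to an EvaluationDataset abstraction), assuming an OpenAI judge configured via OPENAI_API_KEY; the sample row is invented:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One made-up RAG interaction; real rows would come from your pipeline.
data = Dataset.from_dict({
    "question": ["When was the Eiffel Tower built?"],
    "answer": ["It was completed in 1889."],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
})

# Faithfulness checks the answer against the contexts; relevancy checks it
# against the question. Both use an LLM judge under the hood.
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))
```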

RE-Bench

assess
vendor

AI benchmark suite from METR for evaluating autonomous AI agent capabilities on real-world research engineering tasks.

Redwood Research

assess
vendor

Nonprofit AI safety research organization focused on AI control, alignment, and pre-deployment evaluation, home to the t...

Runloop

assess
vendor

Persistent sandboxed dev environments for AI agents with git-style state management and built-in SWE-bench integration.

SWE-bench

assess
open-source

A benchmark evaluating whether AI agents can resolve real-world GitHub issues by generating code patches that pass the repository's test suite.
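
The instances are published on the Hugging Face hub; a quick look at the Lite subset, assuming the princeton-nlp/SWE-bench_Lite dataset card:

```python
from datasets import load_dataset

# Each instance pairs a real GitHub issue with the repo state it was filed
# against; an agent must produce a patch that makes the repo's tests pass.
swe = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
inst = swe[0]
print(inst["repo"], inst["instance_id"])
print(inst["problem_statement"][:300])
```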

TruLens

assess
open-source

Open-source MIT-licensed LLM evaluation and tracing framework by TruEra, now maintained by Snowflake, combining OpenTelemetry-based tracing with feedback functions for evaluation.

VectorDBBench

assess
open-source

Open-source benchmarking tool for vector databases, covering 30+ databases with a CLI and a visual interface; maintained by Zilliz.

Vivaria

hold
open-source

METR's open-source platform for running AI agent evaluations and elicitation research, now deprecated in favor of Inspect AI.
