
mlx-lm

New · Trial · AI / ML · Open-source · MIT

At a Glance

Apple Silicon LLM inference, fine-tuning, and quantization package built on MLX, supporting thousands of Hugging Face Hub models with LoRA/QLoRA, 4-bit quantization, and an OpenAI-compatible server for local Mac deployment.

Type: open-source
Pricing: free (open-source)
License: MIT
Adoption fit: small, medium
Top alternatives: Ollama, LLM.swift, vLLM

What It Does

mlx-lm is the official LLM-focused Python package built on top of Apple MLX. It provides the tooling needed to run, fine-tune, and quantize large language models locally on Apple Silicon Macs. It integrates directly with Hugging Face Hub, meaning models in the standard Transformers format can be downloaded and run with a single command — or converted and re-uploaded as MLX-quantized variants.
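For programmatic use, the same workflow is available through the package's load and generate helpers. A minimal sketch, assuming the documented Python API; the model repo and prompt are illustrative, and exact keyword arguments can vary slightly between releases:

```python
from mlx_lm import load, generate

# Downloads (or reuses a cached copy of) the model from Hugging Face Hub.
# The repo name is illustrative; any supported architecture works.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# For instruct models, format the prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain unified memory in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(text)
```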

The package covers the full local LLM workflow: inference (text generation with streaming), parameter-efficient fine-tuning (LoRA and QLoRA), model quantization (4-bit and 8-bit), and a server that exposes an OpenAI-compatible REST API for local agent integrations. It is the primary runtime for the mlx-community Hugging Face organization, which publishes pre-converted MLX-quantized versions of popular models.
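As a sketch of the server workflow: start the server in one terminal (for example `mlx_lm.server --model <hf-repo>`), then point any OpenAI-compatible client at it. The port below is assumed to be the server's default and should be verified against your setup; the model name and prompt are illustrative.

```python
# Assumes `mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit`
# is already running locally; adjust the port if yours differs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Mistral-7B-Instruct-v0.3-4bit",
    messages=[{"role": "user", "content": "Summarize what mlx-lm does in one sentence."}],
)
print(resp.choices[0].message.content)
```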

Key Features

  • CLI inference: mlx_lm.generate --model <hf-repo> --prompt "..." downloads and runs any supported Hub model in one command.
  • LoRA and QLoRA fine-tuning: Parameter-efficient fine-tuning with configurable rank, learning rate, and target layers. Supports gradient checkpointing for memory efficiency on models up to ~13B parameters on 16–32GB unified memory.
  • 4-bit quantization: mlx_lm.convert quantizes safetensors models to MXFP4 or Q4 and can upload back to Hugging Face Hub. Pre-quantized community models are available on mlx-community.
  • OpenAI-compatible server: mlx_lm.server exposes a local REST endpoint compatible with the OpenAI Chat Completions API, enabling drop-in use with LangChain, LlamaIndex, and other tools that speak the OpenAI protocol.
  • Streaming generation: Token-level streaming in both CLI and server modes (see the streaming sketch after this list).
  • Model architecture support: Supports the most common open-weight transformer architectures (Llama, Mistral, Gemma, Qwen, Phi, OLMo, Falcon, etc.). New architectures require explicit porting from Transformers.
  • Per-layer numerical verification: The transformers-to-mlx Skill builds on mlx-lm’s architecture to add detailed per-layer comparison against Transformers baselines, exposing RoPE and dtype issues during porting.
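The streaming path is also exposed programmatically via stream_generate. A minimal sketch; the shape of the yielded object has changed across releases (recent versions yield a response object with a .text field), so treat that attribute access as an assumption to check against your installed version:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")  # illustrative repo

# Print tokens as they are produced instead of waiting for the full completion.
for response in stream_generate(
    model, tokenizer, prompt="Write a haiku about unified memory.", max_tokens=64
):
    print(response.text, end="", flush=True)
print()
```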

Use Cases

  • Local LLM inference for Mac users: Running open-weight models without GPU cloud costs, with hardware acceleration via Apple Silicon’s GPU (Metal) and unified memory.
  • Privacy-sensitive inference: On-device processing of documents or code where sending data to cloud APIs is unacceptable.
  • Local agent backends: Backing local coding agents (Claude Code with local model proxy, custom tooling) via the OpenAI-compatible server.
  • Fine-tuning on proprietary data: LoRA fine-tuning of 7B–13B models directly on a MacBook Pro or Mac Studio without provisioning cloud GPU instances.
  • mlx-community model contribution: Converting and uploading MLX-quantized variants of new open-weight models to the Hugging Face Hub mlx-community organization for community use (see the conversion sketch after this list).
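The conversion workflow is also callable from Python via the convert helper. A hedged sketch: the keyword names below follow the documented converter, but exact parameters (and the optional upload step) should be checked against your installed release; the repo names are illustrative.

```python
from mlx_lm import convert

# Quantize a Hub model to 4-bit and write an MLX-format copy locally.
convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",   # source model (illustrative)
    mlx_path="./llama-3.2-3b-instruct-4bit",      # local output directory
    quantize=True,
    q_bits=4,
    # upload_repo="mlx-community/Llama-3.2-3B-Instruct-4bit",  # optional: push to the Hub
)
```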

Adoption Level Analysis

Small teams (<20 engineers): Strong fit for individual researchers, ML engineers with Macs, and small AI product teams building Mac-native features. Near-zero setup friction. Free and fully local.

Medium orgs (20–200 engineers): Selective fit. Useful for teams where developers primarily use Macs and want local inference without cloud costs for development/testing. Not a fit for production serving at scale (use vLLM or SGLang on NVIDIA hardware for that).

Enterprise (200+ engineers): Limited fit. Enterprise LLM serving infrastructure almost universally runs on NVIDIA GPU clusters. mlx-lm would serve niche use cases: on-device Mac features, privacy-first edge deployments, or developer tooling that should run fully locally.

Alternatives

| Alternative | Key Difference | Prefer when… |
| --- | --- | --- |
| Ollama | Wraps llama.cpp; supports Linux/Windows/Mac; broader hardware support; simpler model management | You need cross-platform local inference, or Linux/Windows support |
| LLM.swift | Swift-native llama.cpp wrapper for iOS/macOS app integration | You’re building a shipping Swift app and need structured output or @Generatable macro support |
| vLLM | Multi-user, high-throughput NVIDIA GPU serving with PagedAttention | You’re serving LLMs to multiple concurrent users in a cloud environment |

Notes & Caveats

  • Apple Silicon only. Same hardware constraint as the base MLX framework: no Linux, no Windows, and no production NVIDIA GPU support in mlx-lm.
  • Architecture support requires explicit porting. Not every model on Hugging Face Hub has an mlx-lm implementation. New architectures must be ported from Transformers, which requires understanding both frameworks and is error-prone (RoPE bugs, dtype contamination). The transformers-to-mlx Skill was created to systematize this porting process.
  • RoPE bugs are a documented recurring problem. The article that prompted this catalog entry was written specifically because RoPE implementation bugs in ported models produce plausible outputs that silently degrade at long sequences. The community test harness (mlx-lm-tests) exists to catch these.
  • Float32 contamination kills speed. If any layer in the model retains float32 weights while the rest is bfloat16 or quantized, inference throughput drops dramatically with no obvious error. This is a documented class of conversion bugs (see the dtype audit sketch after this list).
  • Quantization pipeline issues at scale. Community issues report malloc errors with 71B+ models during MXFP4 conversion when the allocation exceeds the 30GB buffer limit. Large model quantization is not robust for all hardware configurations.
  • Not a fit for multi-user serving. mlx-lm has no equivalent of PagedAttention or continuous batching. It is single-user/single-request inference. For multi-user applications, even on Apple hardware, you would need a different architecture.
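One cheap sanity check for the float32 contamination caveat above is to walk the loaded model's parameters and flag anything still stored in float32. A minimal sketch using MLX's tree utilities; the repo name is illustrative.

```python
import mlx.core as mx
from mlx.utils import tree_flatten
from mlx_lm import load

model, _ = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")  # illustrative repo

# Flag any parameter left in float32, a common source of silent slowdowns
# after a conversion where one layer escaped the dtype cast.
offenders = [
    (name, tuple(p.shape))
    for name, p in tree_flatten(model.parameters())
    if p.dtype == mx.float32
]
for name, shape in offenders:
    print(f"float32 parameter: {name} {shape}")
print("clean" if not offenders else f"{len(offenders)} float32 parameters found")
```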
