Cohere Command A+ Goes Apache 2.0: A 218B MoE Built for Sovereign Enterprise AI

7 days ago
3 min read

Cohere released Command A+ on May 20, 2026 under a full Apache 2.0 license. It is the first Command model to ship as truly open source, and it lands as a Mixture-of-Experts (MoE) architecture with 218 billion total parameters and 25 billion active per generation step. The pitch is sovereign, enterprise-grade AI you can actually self-host: text and vision inputs, 48 languages, native source citations, and a near-lossless 4-bit quantization that fits the model on a single NVIDIA Blackwell B200.

Below is a builder's tour of what is actually shipping, with the numbers Cohere reports and the model identifiers you will use day one.

Architecture and Hardware

Command A+ is a sparse MoE with vision and tool-use built in. Inputs accepted: text, image, tool calls. Outputs: text, reasoning, tool use. Context length is 128K input tokens and up to 64K output tokens. Cohere ships three weight formats on Hugging Face on day one: BF16, FP8, and a W4A4 (4-bit weights, 4-bit activations) variant that Cohere describes as lossless.

Minimum hardware to run W4A4: one NVIDIA Blackwell B200, or two NVIDIA H100s. That single-GPU B200 footprint is the headline efficiency number, and it is what makes Command A+ realistic to deploy inside a regulated environment.

Performance Numbers Cohere Reports

Versus the previous Command A Reasoning model, Command A+ delivers a 63% increase in output tokens per second and a 17% reduction in time-to-first-token. The W4A4 quantization adds an additional 47% speed boost. The headline measurement on W4A4: 375 tokens per second sustained, with a time-to-first-token of 113 milliseconds.

Cohere also highlights tokenization efficiency wins for non-English languages: 20% better for Arabic, 16% better for Korean, and 18% better for Japanese.

Native Citations Without Prompt Tricks

This is the feature most likely to matter in production. When Command A+ uses external tools or retrieved context, it emits explicit grounding spans using special tags inside the generated output. Every factual claim is anchored to the specific source document or database row it came from, so a downstream renderer can attach a citation marker without any string-matching heuristics. For RAG and agentic workflows where auditability is non-negotiable, native citations are a meaningful upgrade over post-hoc attribution.

Quickstart: Hugging Face Transformers

The exact model identifier for the lossless 4-bit variant on Hugging Face is CohereLabs/command-a-plus-05-2026-w4a4. The model registers as an image-text-to-text task. Install transformers and load the model:

pip install transformers
# For vLLM W4A4 support (optional)
uv pip install vllm>=0.21.0
uv pip install cohere_melody>=0.9.0

Then run a basic chat completion using the official chat template:

from transformers import AutoTokenizer, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-plus-05-2026-w4a4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

messages = [{"role": "user", "content": "What has keys but can't open locks?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)

gen_tokens = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)

print(tokenizer.decode(gen_tokens[0]))

Pipeline form

If you would rather skip the tokenizer plumbing, the standard Transformers pipeline works:

from transformers import pipeline, AutoTokenizer
import torch

model_id = "CohereLabs/command-a-plus-05-2026-w4a4"
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model_id,
    dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain the Transformer architecture"}]
outputs = pipe(messages, max_new_tokens=300)
print(outputs[0]["generated_text"][-1])

Where It Fits

Command A+ is positioned squarely at enterprises that need to deploy a frontier-class model behind their own firewall. The combination of Apache 2.0 license, vision plus tool use in one checkpoint, sub-second time-to-first-token at production throughput on a single B200, and citation tags emitted directly by the model is unusual at this performance tier. For teams already evaluating open-weight MoE options, this is the first Cohere release where the open license and the deployment story line up.