Code-Along With ZAYA1-8B: Zyphra's AMD-Trained Reasoning MoE in Transformers and vLLM

Zyphra released ZAYA1-8B on May 6, 2026, and the technical story is more interesting than the parameter count. It is an Apache 2.0 reasoning Mixture-of-Experts with 8.4 billion total parameters and only 760 million active per token, and it was trained end to end on AMD Instinct MI300X GPUs with AMD Pensando Pollara networking. No NVIDIA in the training loop.
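As a quick sanity check on those figures, the sketch below computes the active-parameter fraction and a rough weight-only memory estimate. The 2 bytes per parameter is an assumption (bfloat16 weights). KV cache, activations, and runtime overhead are ignored, so read the memory number as a lower bound rather than a sizing guide.

```python
TOTAL_PARAMS = 8.4e9       # total parameters, as stated in the announcement
ACTIVE_PARAMS = 760e6      # active parameters per token, as stated
BYTES_PER_PARAM_BF16 = 2   # assumption: weights are stored in bfloat16


def active_fraction(active: float, total: float) -> float:
    """Share of the parameters that participate in each token's forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Weight-only memory in decimal gigabytes (no KV cache, no activations)."""
    return params * bytes_per_param / 1e9


print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
print(f"bf16 weights:    {weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16):.1f} GB")
```

Only about 9% of the weights do work on any given token, but all experts normally still have to be resident in memory, so the memory estimate uses the total parameter count rather than the active one.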

This is a hands-on walkthrough of what is in the model card, which numbers Zyphra published, and how to run inference today in either Transformers or vLLM. The snippets below follow the official Zyphra blog post and Hugging Face model card. Sources are listed at the end.

What Is Actually New

ZAYA1-8B introduces three architectural pieces that the Zyphra team highlights:

1. Compressed Convolutional Attention (CCA), a more parameter-efficient attention variant than standard multi-head attention.

2. An MLP-based expert router that improves routing stability versus standard linear routers used in most MoE models.

3. Learned residual scaling that manages residual-norm growth with minimal overhead during training.
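To make the second item concrete, here is a generic illustration of the difference between a linear router and an MLP router feeding a top-k gate. This is not Zyphra's implementation: the layer sizes, the ReLU activation, the top-k value, and every function name below are assumptions chosen only to show the general idea.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def linear_router(x, W):
    """Standard router: a single linear projection to per-expert logits."""
    return matvec(W, x)


def mlp_router(x, W1, W2):
    """MLP router: a hidden layer with a nonlinearity before the expert logits."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]  # ReLU, chosen for illustration
    return matvec(W2, hidden)


def top_k_gate(logits, k):
    """Select the k highest-scoring experts and renormalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


rng = random.Random(0)
d_model, d_hidden, n_experts, k = 8, 16, 4, 2  # toy sizes, not ZAYA1's


def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


x = [rng.gauss(0, 1) for _ in range(d_model)]
W = rand_matrix(n_experts, d_model)
W1 = rand_matrix(d_hidden, d_model)
W2 = rand_matrix(n_experts, d_hidden)

print("linear:", top_k_gate(linear_router(x, W), k))
print("mlp:   ", top_k_gate(mlp_router(x, W1, W2), k))
```

Both routers produce one logit per expert and share the same gate. The MLP variant only changes how those logits are computed from the token representation.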

Training was done on a custom cluster of 1,024 AMD Instinct MI300X GPUs on IBM Cloud infrastructure, with AMD Pensando Pollara interconnect. The fact that this works end to end (kernels, communications, training recipe) is itself the news for builders who care about supply diversity.

Benchmark Numbers Zyphra Reports

Zyphra reports that, with under one billion active parameters, ZAYA1-8B matches or beats first-generation frontier reasoning models on hard math and stays competitive on other benchmarks:

HMMT 2025: 89.6 (vs Claude Sonnet 4.5 at 88.3)

Competitive on AIME, LiveCodeBench (LCB) coding, GPQA-Diamond, IFEval, and IFBench

With Markovian RSA test-time compute, it approaches DeepSeek-V3.2 and Qwen3-A22B performance tiers

Step 1: Install

ZAYA1-8B currently requires Zyphra's transformers fork on the zaya1 branch. Install the patched library:

pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"

Step 2: Load and Generate With Transformers

The Hugging Face model identifier is Zyphra/ZAYA1-8B. Load the model and tokenizer, apply the chat template, and generate:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Zyphra/ZAYA1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello. How is it going?"},
]

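# add_generation_prompt=True appends the assistant-turn header so the model
# answers the user instead of continuing the user's message.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
# max_length counts prompt tokens plus generated tokens.
outputs = model.generate(input_ids, max_length=200)
# outputs[0] holds the prompt followed by the completion.
response = tokenizer.decode(outputs[0])
print(response)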
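The vLLM command in Step 3 passes --reasoning-parser qwen3, which suggests the model wraps its chain of thought in Qwen3-style <think>...</think> tags. Assuming that also holds for raw Transformers output (the tag names are an assumption, and split_reasoning is a hypothetical helper, not part of any library), the decoded text can be split like this:

```python
def split_reasoning(text: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split decoded output into (reasoning, answer).

    If the closing tag is absent, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Keep only what follows the first opening tag, if there is one.
    reasoning = head.split(open_tag, 1)[-1].strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print("reasoning:", reasoning)
print("answer:", answer)
```

Chat-template special tokens can still appear in the answer. Decoding with skip_special_tokens=True is one way to drop them.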

Sampling parameters

Zyphra recommends different sampling profiles depending on workload:

General use: temperature=1.0, top_p=0.95, top_k=-1

Agent and code: temperature=0.6, top_p=0.95, top_k=-1
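The two profiles above can be kept in one place and translated per backend. One detail to watch: top_k=-1 is vLLM's convention for "no top-k cut-off", whereas Transformers' generate() expects top_k=0 to disable the filter. The profile keys and helper names below are hypothetical.

```python
PROFILES = {
    "general": {"temperature": 1.0, "top_p": 0.95, "top_k": -1},
    "agent_code": {"temperature": 0.6, "top_p": 0.95, "top_k": -1},
}


def vllm_sampling(profile: str) -> dict:
    """Sampling parameters in vLLM's convention (top_k=-1 disables top-k)."""
    return dict(PROFILES[profile])


def hf_generate_kwargs(profile: str) -> dict:
    """Keyword arguments for Transformers' model.generate()."""
    params = dict(PROFILES[profile])
    if params["top_k"] <= 0:
        params["top_k"] = 0      # Transformers disables top-k with 0
    params["do_sample"] = True   # temperature and top_p only apply when sampling
    return params


print(hf_generate_kwargs("agent_code"))
```

With the Step 2 snippet, this would be used as model.generate(input_ids, max_length=200, **hf_generate_kwargs("general")).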

Step 3: Serve With vLLM

For production-style inference, vLLM is the recommended path and also requires Zyphra's fork:

pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1-pr"

Start the server. Note the reasoning-parser and tool-call-parser flags, plus the mamba-cache dtype flag, which ZAYA1's hybrid attention path requires:

vllm serve Zyphra/ZAYA1-8B --port 8010 \
   --mamba-cache-dtype float32 --dtype bfloat16 \
   --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser zaya_xml

Then call the OpenAI-compatible chat completions endpoint:

curl http://localhost:8010/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Zyphra/ZAYA1-8B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello. How is it going?"}
        ]
    }'
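The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the command above is listening on localhost:8010 and returns the usual OpenAI-style response shape. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8010/v1"  # matches --port 8010 in the serve command


def build_chat_request(messages, model="Zyphra/ZAYA1-8B", **sampling):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": messages, **sampling}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, timeout=120, **sampling):
    """Send the request and return the assistant's final message text."""
    request = build_chat_request(messages, **sampling)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat([{"role": "user", "content": "Hello. How is it going?"}], temperature=1.0, top_p=0.95) returns the reply as a string. When a reasoning parser is enabled, vLLM returns the reasoning text in a separate field of the message, which this sketch does not read.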

Step 4: Skip Local Hosting

If you do not want to set up GPUs at all, Zyphra offers a free serverless endpoint for the model at cloud.zyphra.com. That is the fastest way to run the model against your existing eval harness.

Why It Matters

Two things make ZAYA1-8B worth tracking even if you do not deploy it. First, it is one of the cleanest public proof points that a modern reasoning MoE can be trained from scratch entirely on AMD hardware at scale, which gives builders a credible alternative in the accelerator supply chain. Second, the active-parameter-to-quality ratio (760M active parameters scoring above Claude Sonnet 4.5 on HMMT 2025) is the kind of result that pulls cost curves down for everyone building reasoning agents on commodity infrastructure.