Code-Along With HiDream-O1-Image: An 8B Pixel-Level Unified Transformer Under MIT

May 27
3 min read

HiDream AI open-sourced HiDream-O1-Image on May 8, 2026 under the MIT License, with the technical report posted to arXiv on May 10, 2026 as arXiv:2605.11061. It is one of the more interesting open text-to-image releases this year for two reasons: it is a single 8B Pixel-level Unified Transformer (UiT) that drops the external VAE and disjoint text encoder of latent diffusion stacks, and it currently posts top-tier benchmark numbers among open 8B image models.

This is a hands-on walk through the official quickstart paths, with code snippets pulled directly from the Hugging Face model card and inference script. Sources are listed at the end.

What Is the Pixel-Level Unified Transformer

Most modern open text-to-image stacks (SDXL, FLUX, SD3) factor the problem into a VAE that compresses pixels into latents, a text encoder, and a diffusion transformer. HiDream-O1-Image collapses that into a single transformer that natively encodes raw pixels, text, and task-specific conditions in a shared token space. There is no external VAE and no separate text encoder.

Practically, this changes two things you care about as a builder. First, training and serving are simpler because there is one model artifact, one tokenizer, and one forward pass. Second, conditioning is uniform across modalities, which is why the same checkpoint handles text-to-image, instruction-based editing, multi-reference subject personalization, layout-controlled personalization, and long-text rendering with layout control.

Benchmark Numbers Worth Knowing

GenEval overall: 0.90 (best among 8B models; FLUX.1 Dev is 0.66, SD3.5 Large is 0.71, Janus-Pro-7B is 0.80)

DPG-Bench overall: 89.83 (DALL-E 3 is 83.50, FLUX.1 Dev is 83.84, Z-Image-Turbo is 84.86)

HPSv3 all categories: 10.37 (GPT Image 2 is 10.21, Nano Banana 2.0 is 10.01)

CVTG-2K average NED: 0.9128, CLIP Score: 0.8076

LongText-Bench: English 0.979, Chinese 0.978 (top of evaluated set)

The Dev variant (HiDream-O1-Image-Dev-2604) debuted at #8 in the Artificial Analysis Text to Image Arena. Native generation supports resolutions up to 2048 by 2048.

Path A: Transformers Quickstart

Because UiT is a unified image-text-to-text-style transformer, the Transformers entry points are the standard AutoProcessor and AutoModelForImageTextToText. Below is the minimal load step from the official model card.

from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("HiDream-ai/HiDream-O1-Image")
model = AutoModelForImageTextToText.from_pretrained("HiDream-ai/HiDream-O1-Image")

Path B: The Provided inference.py Script

The repository ships an inference.py script that handles task selection, scheduler choice, and resolution. The command-line invocation for a basic text-to-image generation at 2048 resolution looks like this:

python inference.py \
    --model_path /path/to/HiDream-O1-Image \
    --prompt "medium shot, eye-level, front view. A woman is seated..." \
    --output_image results/t2i.png \
    --height 2048 \
    --width 2048

Pick the Right Variant for the Job

Two variants ship with the release, with different schedule and guidance settings documented on the model card.

Full: 50 steps, guidance scale 5.0, default scheduler. Use this when you want the model's best output quality for one-off generations.

Dev: 28 steps, guidance scale 0.0, Flash scheduler (use flow_match for editing). This is the cost-efficient path for batch image generation or interactive applications where you want sub-second iteration.

Prompt Refinement via the Built-In Reasoning Agent

The release ships a Reasoning-Driven Prompt Agent that expands a short user prompt into a long, structured generation prompt before it hits the model. It runs either on a local Gemma-4-31B-it or an OpenAI-compatible API. The Flask-based app.py demo wires it in by default so you can see the refined prompt before the image is generated. If you are productionizing this, the easier integration is to route prompt refinement through an OpenAI-compatible endpoint of your own and pass the refined string straight to inference.py.

Tasks the Same Checkpoint Handles

Text-to-Image Generation

Instruction-Based Image Editing

Multi-Reference Subject-Driven Personalization (up to 10 reference images)

Subject-Driven Personalization with Skeleton or Layout Control

Long-Text Rendering and Layout Control (the model is a top performer on LongText-Bench in both English and Chinese)

When to Reach for It

HiDream-O1-Image is a strong default when you need one open-weight model that does both text-to-image and editing well, when you want commercial freedom under MIT, when you need accurate in-image text rendering in English or Chinese, or when you want to consolidate a multi-model image pipeline behind one checkpoint. The Dev variant in particular is the cheapest path to high-quality 2K images you can self-host today.