Code-Along With PaddleOCR-VL 1.5: A 0.9B Vision-Language Model That Tops OmniDocBench

If you are still pipelining a separate OCR engine, layout detector, formula recognizer, and table extractor, the PaddlePaddle team has a more direct answer. PaddleOCR-VL 1.5 is a single 0.9-billion-parameter vision-language model that covers all four jobs. It currently reports state-of-the-art results on OmniDocBench v1.5 at 94.5% overall accuracy, with SOTA on text, formula, table, and reading-order recognition.

This is a hands-on walkthrough of the official quickstart paths and what each task mode does, with snippets pulled directly from the Hugging Face model card. Sources are listed at the end.

What Makes the 0.9B VLM Interesting

The architecture pairs a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model. The model exposes six task modes, each selected by a prompt prefix, and the same checkpoint handles all of them:

ocr: free-form text recognition

table: structured table parsing

formula: math formula recognition

chart: chart understanding

spotting: text-line localization plus recognition (polygonal detection)

seal: seal recognition

On Real5-OmniDocBench, PaddleOCR-VL 1.5 reports SOTA on each of the five real-world distortion categories: scanning artifacts, page skew, page warping, screen photography, and uneven illumination. It also handles cross-page table merging and paragraph heading recognition automatically. Language coverage spans English, Simplified and Traditional Chinese, Bengali, and Tibetan script, along with broader multilingual support that includes rare characters.

Headline Benchmark Numbers

OmniDocBench v1.5 overall: 94.5% (SOTA)

ParseBench mean: 65.95 (reported sub-scores include Text Content 82.72, Layout 77.78, Table 67.38, Chart 47.62)

MDPBench overall: 78.3 (Digital subset: 87.4)

Path A: Hugging Face Transformers Quickstart

The Hugging Face model id is PaddlePaddle/PaddleOCR-VL-1.5, published under Apache 2.0. The model is registered under the image-text-to-text task. Below is the official quickstart from the model card, lightly compacted. It handles the spotting-mode upscale rule (a 2x upscale when both image dimensions are below 1500 pixels) and the per-task max pixel budget.

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
image_path = "test.png"
task = "ocr"  # 'ocr' | 'table' | 'chart' | 'formula' | 'spotting' | 'seal'

image = Image.open(image_path).convert("RGB")
orig_w, orig_h = image.size
spotting_upscale_threshold = 1500

if task == "spotting" and orig_w < spotting_upscale_threshold and orig_h < spotting_upscale_threshold:
    process_w, process_h = orig_w * 2, orig_h * 2
    try:
        resample_filter = Image.Resampling.LANCZOS
    except AttributeError:
        resample_filter = Image.LANCZOS
    image = image.resize((process_w, process_h), resample_filter)

max_pixels = 2048 * 28 * 28 if task == "spotting" else 1280 * 28 * 28

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
    "spotting": "Spotting:",
    "seal": "Seal Recognition:",
}

model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text",  "text":  PROMPTS[task]},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    images_kwargs={
        "size": {
            "shortest_edge": processor.image_processor.min_pixels,
            "longest_edge":  max_pixels,
        }
    },
).to(model.device)

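outputs = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens and the final token (normally the end-of-sequence marker) before decoding.
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)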

Notes on the snippet

Two task-aware decisions are worth keeping when you adapt this:

Spotting-mode upscale: small images (both sides under 1500 pixels) are scaled 2x with a LANCZOS filter before being passed in, which helps text-line localization on dense scans.

Per-task pixel budget: spotting gets a higher max-pixel budget (2048 * 28 * 28 = 1,605,632 pixels) than the other modes (1280 * 28 * 28 = 1,003,520 pixels). The processor receives this through the "size" entry of images_kwargs.
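To make the two rules easier to check in isolation, here is a minimal sketch that restates them as a pure function on image dimensions, with no model or image library involved. The names plan_preprocessing, PreprocessPlan, and PATCH_AREA are hypothetical helpers introduced for illustration; only the thresholds and multipliers come from the quickstart above.

```python
from typing import NamedTuple

# Hypothetical helper names; the numeric values mirror the quickstart snippet.
PATCH_AREA = 28 * 28               # pixel unit used in the quickstart's budget arithmetic
SPOTTING_UPSCALE_THRESHOLD = 1500  # pixels, applied to each side


class PreprocessPlan(NamedTuple):
    width: int
    height: int
    max_pixels: int


def plan_preprocessing(width: int, height: int, task: str) -> PreprocessPlan:
    """Return the resize target and pixel budget the quickstart would use."""
    is_spotting = task == "spotting"
    # Upscale 2x only in spotting mode, and only when BOTH sides are below the threshold.
    if is_spotting and width < SPOTTING_UPSCALE_THRESHOLD and height < SPOTTING_UPSCALE_THRESHOLD:
        width, height = width * 2, height * 2
    max_pixels = (2048 if is_spotting else 1280) * PATCH_AREA
    return PreprocessPlan(width, height, max_pixels)


print(plan_preprocessing(1000, 1400, "spotting"))  # both sides small: upscaled
print(plan_preprocessing(1000, 1600, "spotting"))  # one side too large: unchanged
print(plan_preprocessing(1000, 1400, "ocr"))       # non-spotting: never upscaled
```

Keeping the decision separate from the actual resize call makes it straightforward to unit-test the thresholds before wiring in PIL and the processor.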

Path B: The Official PaddleOCRVL Pipeline

If you do not want to manage tasks and prompts yourself, the official PaddleOCR pipeline handles all six modes, layout, and serialization. It outputs structured JSON and Markdown in one call:

from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL(pipeline_version="v1.5")
output = pipeline.predict("path/to/document_image.png")
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")

This is the path to favor if you are building a document ingestion service and you want to dump straight to Markdown for LLM-side post-processing.
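For an ingestion service, the single-image call above would usually sit inside a loop over a folder. The following is a rough sketch under stated assumptions, not part of the official pipeline: find_document_images, ingest, and IMAGE_SUFFIXES are hypothetical names, and the predict argument stands in for pipeline.predict so the loop can be exercised without a GPU. The only pipeline behavior it relies on is what the snippet above shows: predict yields results, and each result has save_to_markdown(save_path=...).

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Assumed set of extensions for illustration; extend as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_document_images(root: Path) -> List[Path]:
    """Collect image files under root (recursively), sorted for a stable order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def ingest(images: Iterable[Path], predict: Callable[[str], Iterable], out_dir: Path) -> int:
    """Run predict on each image, save every result as Markdown, return the result count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for image in images:
        for res in predict(str(image)):
            # save_to_markdown(save_path=...) is the call shown in the pipeline snippet.
            res.save_to_markdown(save_path=str(out_dir))
            count += 1
    return count
```

In real use, predict would be the predict method of a PaddleOCRVL(pipeline_version="v1.5") instance; passing it in as a parameter keeps the file handling testable with a stub.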

When to Reach for It

PaddleOCR-VL 1.5 is the right tool when you need accurate, multilingual document parsing on a budget and you do not want to stitch together four or five single-purpose models. A 0.9B-parameter VLM that runs in BF16 on a single mid-tier GPU and beats much larger systems on OmniDocBench v1.5 is rare. The Apache 2.0 license seals the deal for commercial use.

Where it is less of a fit: tasks that go beyond document parsing, such as general scene understanding or long-form image captioning. It was trained for the document domain, and the prompt vocabulary reflects that.

Sources

Hugging Face model card: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5

Paper page on Hugging Face: https://huggingface.co/papers/2601.21957

PaddleOCR repository: https://github.com/PaddlePaddle/PaddleOCR

arXiv preprint: https://arxiv.org/abs/2601.21957