Code-Along With PaddleOCR-VL 1.5: A 0.9B Vision-Language Model That Tops OmniDocBench
If you are still pipelining a separate OCR engine, layout detector, formula recognizer, and table extractor, the PaddlePaddle team has a more direct answer. PaddleOCR-VL 1.5 is a single 0.9 billion parameter vision-language model that handles all of them, and it currently reports state of the art on OmniDocBench v1.5 at 94.5% overall accuracy, with SOTA on text, formulas, tables, and reading order recognition.
This is a hands-on walkthrough of the two official quickstart paths and of what each task mode does, with snippets taken from the Hugging Face model card. Sources are listed at the end.
What Makes the 0.9B VLM Interesting
The architecture pairs a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model. The model exposes six task modes through a prompt prefix and the same checkpoint handles all of them:
ocr: free-form text recognition
table: structured table parsing
formula: math formula recognition
chart: chart understanding
spotting: text-line localization plus recognition (polygonal detection)
seal: seal recognition
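The prompt-prefix dispatch can be sketched as a small helper. The prompt strings are the ones used in the model-card quickstart under Path A; the helper name `build_messages` and its error handling are illustrative assumptions, not part of any official API.

```python
# Prompt prefix for each task mode, copied from the model-card quickstart.
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
    "spotting": "Spotting:",
    "seal": "Seal Recognition:",
}


def build_messages(image, task):
    """Hypothetical helper: build a single-turn chat message for one task mode."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PROMPTS)}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": PROMPTS[task]},
            ],
        }
    ]
```

Failing early on an unknown task name is a design choice of this sketch: a typo in the task string would otherwise surface only as a `KeyError` deep inside the inference script.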
On Real5-OmniDocBench, PaddleOCR-VL 1.5 reports SOTA on each of the five real-world distortion categories: scanning artifacts, page skew, page warping, screen photography, and uneven illumination. It also handles cross-page table merging and paragraph heading recognition automatically. Supported languages include English, Simplified and Traditional Chinese, Bengali, and Tibetan script, along with broader multilingual coverage that extends to rare characters.
Headline Benchmark Numbers
OmniDocBench v1.5 overall: 94.5% (SOTA)
ParseBench mean: 65.95 (sub-scores include Text Content 82.72, Layout 77.78, Table 67.38, Chart 47.62)
MDPBench overall: 78.3 (Digital subset: 87.4)
Path A: Hugging Face Transformers Quickstart
The Hugging Face model id is PaddlePaddle/PaddleOCR-VL-1.5, published under Apache 2.0. The model registers as an image-text-to-text task. Below is the official quickstart from the model card, lightly compacted. It handles the spotting-mode upscale rule (a 2x upscale when both image dimensions are below 1500 pixels) and the per-task max pixel budget.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
image_path = "test.png"
task = "ocr"  # 'ocr' | 'table' | 'chart' | 'formula' | 'spotting' | 'seal'

image = Image.open(image_path).convert("RGB")
orig_w, orig_h = image.size

# Spotting mode: upscale small images 2x before inference.
spotting_upscale_threshold = 1500
if task == "spotting" and orig_w < spotting_upscale_threshold and orig_h < spotting_upscale_threshold:
    process_w, process_h = orig_w * 2, orig_h * 2
    try:
        resample_filter = Image.Resampling.LANCZOS
    except AttributeError:
        # Older Pillow releases expose the filter directly on Image.
        resample_filter = Image.LANCZOS
    image = image.resize((process_w, process_h), resample_filter)

# Spotting gets a larger pixel budget than the other task modes.
max_pixels = 2048 * 28 * 28 if task == "spotting" else 1280 * 28 * 28

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "formula": "Formula Recognition:",
    "chart": "Chart Recognition:",
    "spotting": "Spotting:",
    "seal": "Seal Recognition:",
}

model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": PROMPTS[task]},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    images_kwargs={
        "size": {
            "shortest_edge": processor.image_processor.min_pixels,
            "longest_edge": max_pixels,
        }
    },
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens: skip the prompt and drop the final token.
result = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:-1])
print(result)

Notes on the snippet
Two task-aware decisions worth keeping when you adapt this:
Spotting-mode upscale: small documents (both sides under 1500 pixels) are scaled 2x with LANCZOS before being passed in. This materially helps text-line localization on dense scans.
Per-task pixel budget: spotting gets a higher max-pixel budget (2048 × 28 × 28 = 1,605,632 pixels) than the other modes (1280 × 28 × 28 = 1,003,520 pixels). The quickstart passes this budget to the processor as the longest_edge entry of images_kwargs["size"].
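Both rules are plain arithmetic, so they can be checked without loading the model. The following is a minimal sketch that mirrors the quickstart's constants; the helper name `plan_image` and the constant names are hypothetical and exist only for illustration.

```python
SPOTTING_UPSCALE_THRESHOLD = 1500  # pixels, same value as in the quickstart
PIXELS_PER_BLOCK = 28 * 28         # 784 pixels in one 28x28 block


def plan_image(task, width, height):
    """Hypothetical helper: return ((width, height), max_pixels) for a task."""
    # The 2x upscale applies only in spotting mode, and only when BOTH sides are small.
    if (
        task == "spotting"
        and width < SPOTTING_UPSCALE_THRESHOLD
        and height < SPOTTING_UPSCALE_THRESHOLD
    ):
        width, height = width * 2, height * 2
    blocks = 2048 if task == "spotting" else 1280
    return (width, height), blocks * PIXELS_PER_BLOCK


print(plan_image("spotting", 1000, 1400))  # ((2000, 2800), 1605632)
print(plan_image("spotting", 1000, 1600))  # ((1000, 1600), 1605632)
print(plan_image("ocr", 1000, 1400))       # ((1000, 1400), 1003520)
```

The second call shows that a single side at or above 1500 pixels is enough to skip the upscale, while the spotting pixel budget still applies.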
Path B: The Official PaddleOCRVL Pipeline
If you do not want to manage tasks and prompts yourself, the official PaddleOCR pipeline handles all six modes, layout, and serialization. It outputs structured JSON and Markdown in one call:
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL(pipeline_version="v1.5")
output = pipeline.predict("path/to/document_image.png")
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")

This is the path to favor if you are building a document ingestion service and you want to dump straight to Markdown for LLM-side post-processing.
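For an ingestion service, the same calls can be looped over a folder. The sketch below uses only the `predict`, `save_to_json`, and `save_to_markdown` calls shown above; the helper name `parse_folder`, its parameters, and the list of image suffixes are hypothetical. It also assumes that each result writes a file named after its input when `save_path` is a directory, which is worth verifying against the PaddleOCR documentation.

```python
from pathlib import Path


def parse_folder(pipeline, input_dir, output_dir="output",
                 suffixes=(".png", ".jpg", ".jpeg")):
    """Hypothetical helper: run a PaddleOCRVL-style pipeline over a folder of images."""
    processed = []
    for path in sorted(Path(input_dir).iterdir()):
        # Skip anything that is not an image we want to parse.
        if path.suffix.lower() not in suffixes:
            continue
        for res in pipeline.predict(str(path)):
            res.save_to_json(save_path=output_dir)
            res.save_to_markdown(save_path=output_dir)
        processed.append(path.name)
    return processed
```

With the real pipeline this would be called as `parse_folder(PaddleOCRVL(pipeline_version="v1.5"), "scans")`, where `scans` is a placeholder directory name. Passing the pipeline in as an argument keeps the helper testable with a stub object.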
When to Reach for It
PaddleOCR-VL 1.5 is the right tool when you need accurate, multilingual document parsing on a limited compute budget and you do not want to stitch together four or five single-purpose models. A 0.9B-parameter VLM that runs in BF16 on a single mid-tier GPU and beats much larger systems on OmniDocBench v1.5 is rare. The Apache 2.0 license also clears the way for commercial use.
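As a rough sanity check on the single-GPU claim, the weight footprint follows directly from the parameter count, since BF16 stores two bytes per parameter. The sketch below is a back-of-the-envelope estimate only: it uses the rounded 0.9B figure and ignores activations, the KV cache, and framework overhead, so actual usage will be higher. The function name is hypothetical.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed for the model weights alone, in GiB (BF16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 2**30


print(f"{weight_memory_gib(0.9e9):.2f} GiB")  # 1.68 GiB
```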
Where it is less of a fit: tasks that go beyond document parsing (general scene understanding, long-form image captioning). It was trained for the document domain and the prompt vocabulary reflects that.
Sources
Hugging Face model card: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
Paper page on Hugging Face: https://huggingface.co/papers/2601.21957
PaddleOCR repository: https://github.com/PaddlePaddle/PaddleOCR
arXiv preprint: https://arxiv.org/abs/2601.21957



