MathGen is a benchmark for evaluating mathematical correctness in text-to-image generation. It contains 900 problems across 7 core mathematical domains, supports both clean-scene and open-scene evaluation, and introduces a Script-as-a-Judge protocol that replaces subjective judging with deterministic executable verification.

[Figure: MathGen domain overview]

MathGen covers seven mathematical domains spanning counting, fractions, angles, functions, plane geometry, solid geometry, and sets. The benchmark is designed to test exact correctness rather than approximate visual plausibility.

Abstract

While text-to-image (T2I) models have achieved remarkable photorealism, their ability to reason about strict mathematical constraints remains largely unproven. Existing benchmarks primarily focus on semantic alignment or aesthetic quality, often relying on VLM-based judges that struggle with precise logical verification.

To address this gap, MathGen introduces a rigorous benchmark evaluating mathematical generation capabilities across 900 problems spanning seven core domains. Unlike prior work, MathGen employs a Script-as-a-Judge protocol, where each prompt is paired with a dedicated executable verifier to ensure deterministic and objective pass/fail evaluation.

Experiments across representative open-source and proprietary T2I models reveal that mathematical fidelity is a major bottleneck: open-source models achieve only about 1-11% overall accuracy, with several structured domains often near zero, while the best closed-source model reaches 42.0% overall but remains far from reliable across domains. Together, MathGen and its tool-based evaluation provide a principled testbed for measuring and improving mathematical correctness in T2I generation.

Benchmark Design

The MathGen Benchmark

MathGen is built to measure whether text-to-image models generate mathematically correct images rather than merely plausible ones. The benchmark separates clean-scene tasks from open-scene tasks, so researchers can distinguish failures in core mathematical rendering from failures caused by richer visual composition.

[Figure: MathGen pipeline]

Overview of the MathGen benchmark and evaluation pipeline. Each prompt is paired with a dedicated verifier so correctness is decided by explicit executable constraints.

Counting

Tests exact numerical correctness when generating target objects.

Requires precise instance-level control.
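As a rough illustration of what an exact-count check involves, the sketch below counts 4-connected foreground blobs in a binary mask and compares the result to the target. The mask format and the names `count_blobs` and `verify_count` are assumptions made for this example; the segmentation step that would produce such a mask, and the actual MathGen verifier code, are not shown on this page.

```python
from collections import deque

def count_blobs(mask):
    """Count 4-connected regions of truthy cells in a 2D list (hypothetical helper)."""
    rows, cols = len(mask), len(mask[0]) if mask else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # New unvisited foreground cell: flood-fill its whole region.
            blobs += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs

def verify_count(mask, target):
    """Pass only on an exact match; being off by one is a failure."""
    return count_blobs(mask) == target
```

The comparison is strict equality on purpose: under exact-correctness scoring, four apples when five were requested counts the same as zero.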

Fractions and Ratios

Measures whether models can render proportional structure accurately.

Targets exact region and area relationships.
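A proportional check of this kind might look like the sketch below. It assumes the image has already been segmented into a label grid (0 = background, 1 = unshaded part of the shape, 2 = shaded part); that encoding, the function names, and the tolerance value are all hypothetical choices for illustration.

```python
from fractions import Fraction

def shaded_fraction(labels):
    """Share of the shape's pixels that are shaded, as an exact Fraction."""
    shaded = sum(cell == 2 for row in labels for cell in row)
    shape = sum(cell in (1, 2) for row in labels for cell in row)
    if shape == 0:
        raise ValueError("no shape pixels found")
    return Fraction(shaded, shape)

def verify_fraction(labels, target, tol=Fraction(1, 50)):
    """Pass if the measured fraction is within tol of the target (tol is an assumed value)."""
    return abs(shaded_fraction(labels) - target) <= tol
```

A small tolerance is needed in practice because pixel boundaries are never perfectly sharp; how tight it should be is a design decision for each task.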

Angles

Evaluates angular construction, ray layout, and geometric annotation.

Requires consistent directional geometry.
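To make the angular check concrete, here is a minimal sketch that measures the angle at a vertex from three detected points and compares it to the requested value. It assumes the vertex and one point on each ray have already been located in the image; `angle_between`, `verify_angle`, and the 2-degree tolerance are illustrative assumptions rather than the benchmark's settings.

```python
import math

def angle_between(vertex, p1, p2):
    """Unsigned angle in degrees (0 to 180) at `vertex` between the rays to p1 and p2."""
    a1 = math.atan2(p1[1] - vertex[1], p1[0] - vertex[0])
    a2 = math.atan2(p2[1] - vertex[1], p2[0] - vertex[0])
    diff = abs(math.degrees(a1 - a2)) % 360.0
    return min(diff, 360.0 - diff)

def verify_angle(vertex, p1, p2, target_deg, tol_deg=2.0):
    """Pass if the measured angle is within tol_deg of the target (tolerance is assumed)."""
    return abs(angle_between(vertex, p1, p2) - target_deg) <= tol_deg
```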

Functions

Checks coordinate axes, curves, and function-specific structural correctness.

Especially difficult for current models: 16 of the 29 models on the clean-scene leaderboard score 0.0 on this domain.
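One plausible shape for a function-plot check is sketched below: map detected curve pixels into axis coordinates, then require every sampled point to lie close to the target function. The origin and scale are assumed to come from an earlier axis-detection step, and `pixel_to_axis`, `verify_curve`, and their default thresholds are hypothetical names and values.

```python
def pixel_to_axis(px, py, origin, pixels_per_unit):
    """Convert image pixels to axis coordinates; image y grows downward, so it is flipped."""
    ox, oy = origin
    return (px - ox) / pixels_per_unit, (oy - py) / pixels_per_unit

def verify_curve(points, f, tol=0.1, min_points=5):
    """Pass if enough curve points were found and all lie within tol of f (assumed thresholds)."""
    if len(points) < min_points:
        return False
    return all(abs(y - f(x)) <= tol for x, y in points)
```

Requiring a minimum number of points guards against an image with almost no detectable curve passing vacuously.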

Plane and Solid Geometry

Measures 2D and 3D geometric structure under strict verification.

Includes intersections, projections, and composition.

Sets and Open-Scene Math

Extends evaluation to logical set structure and realistic scene conditions.

Tests robustness beyond clean diagrams.

Evaluation Protocol

The benchmark uses a Script-as-a-Judge protocol. Each prompt is paired with a dedicated deterministic verifier implemented with classical vision and recognition tools. A generation is marked correct only when all task-specific constraints are satisfied.

This makes MathGen especially useful for studying whether a model can preserve exact structure under image synthesis. Instead of asking a multimodal judge whether an output looks right, the evaluation checks concrete geometric, numerical, and logical properties directly from the rendered image.

| Component | Role | Purpose |
| --- | --- | --- |
| Prompt | Defines a mathematical requirement | Specifies the exact structure the model must render |
| Generator | Produces one image | Evaluated under default or standardized inference settings |
| Verifier Script | Checks constraints deterministically | Uses contour detection, OCR, geometry checks, or detection tools |
| Final Decision | Pass or fail | Marked correct only if all constraints are simultaneously satisfied |
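The all-constraints rule in the Final Decision row can be sketched as follows. This is a minimal illustration, not the benchmark's code: `Verdict`, `judge`, and the example constraint names are hypothetical, and the measurements are assumed to have been extracted from the image by earlier vision steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Verdict:
    passed: bool
    failed: List[str]  # names of the constraints that were violated

def judge(measurements: Dict[str, float],
          constraints: Dict[str, Callable[[Dict[str, float]], bool]]) -> Verdict:
    """Evaluate every named constraint; pass only if none fails."""
    failed = [name for name, check in constraints.items() if not check(measurements)]
    return Verdict(passed=not failed, failed=failed)

# Example: a prompt asking for three triangles and two circles (made-up values).
measurements = {"triangle_count": 3, "circle_count": 1}
constraints = {
    "three_triangles": lambda m: m["triangle_count"] == 3,
    "two_circles": lambda m: m["circle_count"] == 2,
}
verdict = judge(measurements, constraints)
```

Recording which constraints failed, rather than only the boolean, keeps the pass/fail outcome deterministic while still supporting per-constraint error analysis.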

Task Design Philosophy

Exactness over aesthetics

Tasks are written so that a visually attractive image can still fail if it violates the mathematical requirement.

Fine-grained diagnostics

Different domains reveal different weaknesses: counting control, symbolic mapping, geometric consistency, and topology.

Realistic stress testing

Open-scene prompts preserve the same mathematical condition while adding natural visual complexity.

Experimental Findings

Finding 1

Closed-source models are substantially stronger, but still not reliably mathematically correct.

Finding 2

Counting is easier than structural domains such as functions, geometry, and sets.

Finding 3

Open-scene evaluation often introduces further failures beyond what clean-scene tasks reveal.

Headline Results

MathGen shows a large gap between image realism and mathematical fidelity. Open-source systems often remain in the single-digit range overall, while even strong closed-source systems remain far from perfect across domains.

This gap is not uniform across tasks. Models may perform better on counting-like patterns while still failing function visualization, set logic, or geometry construction, which require stronger global structure preservation.

Nano Banana Pro

42.0%

Best clean-scene overall accuracy reported in the paper.

GPT-Image-1.5

35.7%

Strong closed-source performance but still far from reliable.

Typical Open Source

1-11%

Many open-source models struggle across most mathematical domains.

Model Leaderboard

The leaderboard reports clean-scene accuracy (%) for each of the seven domains and overall, covering 29 models grouped by family: diffusion, autoregressive, unified, and closed-source. On the project page, each row also links to the corresponding model website, repository, or API documentation.

| Rank | Model | Family | Count | Angle | Frac | Func | Plane | Set | Solid | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| #9 | FLUX-2-Pro | Diffusion | 22.9 | 14.3 | 11.4 | 14.3 | 31.4 | 20.0 | 16.7 | 19.1 |
| #10 | Qwen-Image | Diffusion | 31.4 | 5.7 | 2.9 | 2.9 | 0.0 | 2.9 | 30.0 | 10.8 |
| #11 | FLUX-Kontext-Pro | Diffusion | 14.3 | 14.3 | 2.9 | 5.7 | 2.9 | 2.9 | 13.3 | 7.2 |
| #12 | FLUX-2 | Diffusion | 11.4 | 5.7 | 11.4 | 2.9 | 5.7 | 5.7 | 3.3 | 7.1 |
| #14 | SD-3.5-Large | Diffusion | 14.3 | 8.6 | 2.9 | 0.0 | 0.0 | 2.9 | 6.7 | 5.0 |
| #17 | HiDream-I1 | Diffusion | 8.6 | 11.4 | 2.9 | 2.9 | 0.0 | 0.0 | 0.0 | 3.7 |
| #20 | SD-3-Medium | Diffusion | 0.0 | 11.4 | 5.7 | 0.0 | 0.0 | 0.0 | 3.3 | 2.9 |
| #21 | SD-3.5-Medium | Diffusion | 5.7 | 5.7 | 2.9 | 0.0 | 2.9 | 0.0 | 3.3 | 2.9 |
| #24 | PixArt-Sigma | Diffusion | 8.6 | 5.7 | 2.9 | 0.0 | 0.0 | 0.0 | 0.0 | 2.5 |
| #28 | PixArt-XL-2 | Diffusion | 0.0 | 2.9 | 2.9 | 0.0 | 0.0 | 0.0 | 3.3 | 1.3 |
| #15 | Infinity-8B | Autoregressive | 8.6 | 2.9 | 8.6 | 0.0 | 2.9 | 0.0 | 0.0 | 3.8 |
| #22 | GoT-R1-7B | Autoregressive | 5.7 | 5.7 | 2.9 | 0.0 | 2.9 | 0.0 | 10.0 | 2.9 |
| #13 | BAGEL | Unified | 11.4 | 17.1 | 5.7 | 0.0 | 0.0 | 0.0 | 3.3 | 5.7 |
| #16 | BLIP3o-4B | Unified | 2.9 | 11.4 | 5.7 | 0.0 | 0.0 | 2.9 | 3.3 | 3.8 |
| #18 | show-o2-1.5B | Unified | 0.0 | 14.3 | 0.0 | 5.7 | 2.9 | 0.0 | 3.3 | 3.7 |
| #19 | Janus-Pro-7B | Unified | 0.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 |
| #23 | BLIP3o-8B | Unified | 5.7 | 5.7 | 5.7 | 0.0 | 0.0 | 0.0 | 0.0 | 2.9 |
| #26 | show-o2-7B | Unified | 0.0 | 5.7 | 0.0 | 5.7 | 0.0 | 0.0 | 3.3 | 2.1 |
| #27 | OmniGen2-7B | Unified | 5.7 | 2.9 | 2.9 | 0.0 | 0.0 | 0.0 | 0.0 | 1.6 |
| #29 | Janus-Pro-1B | Unified | 0.0 | 5.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| #1 | Nano Banana Pro | Closed Source | 42.9 | 20.0 | 25.7 | 51.4 | 74.3 | 40.0 | 40.0 | 42.0 |
| #2 | GPT-Image-1.5 | Closed Source | 60.0 | 17.1 | 37.1 | 17.1 | 45.7 | 37.1 | 50.0 | 35.7 |
| #3 | GPT-Image-1 | Closed Source | 34.3 | 20.0 | 28.6 | 17.1 | 48.6 | 17.1 | 33.3 | 28.4 |
| #4 | Nano Banana | Closed Source | 22.9 | 5.7 | 11.4 | 14.3 | 25.7 | 14.3 | 10.0 | 14.9 |
| #5 | Imagen 4 Ultra | Closed Source | 25.7 | 2.9 | 8.6 | 5.7 | 31.4 | 8.6 | 16.7 | 14.2 |
| #6 | Seedream 4.0 | Closed Source | 22.9 | 8.6 | 5.7 | 8.6 | 2.9 | 5.7 | 33.3 | 12.5 |
| #7 | Z-Image-Turbo | Closed Source | 11.4 | 17.1 | 11.4 | 0.0 | 0.0 | 2.9 | 26.7 | 9.9 |
| #8 | Seedream 3.0 | Closed Source | 8.6 | 11.4 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 7.1 |
| #25 | Ideogram v3 Turbo | Closed Source | 8.6 | 5.7 | 0.0 | 0.0 | 2.9 | 0.0 | 0.0 | 2.5 |
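For reference, percentages like those in the leaderboard could be produced from per-prompt pass/fail verdicts along the lines of the sketch below. The record format and the function name `accuracy_by_domain` are hypothetical, and the overall score here is a simple micro-average over all prompts, which is an assumption: the exact aggregation rule behind the Overall column is not stated on this page.

```python
from collections import defaultdict

def accuracy_by_domain(records):
    """records: iterable of (domain, passed). Returns percent accuracy per domain plus 'Overall'."""
    totals, passes = defaultdict(int), defaultdict(int)
    for domain, passed in records:
        totals[domain] += 1
        passes[domain] += bool(passed)
    table = {d: round(100.0 * passes[d] / totals[d], 1) for d in totals}
    n = sum(totals.values())
    # Assumed rule: total passes over total prompts (micro-average).
    table["Overall"] = round(100.0 * sum(passes.values()) / n, 1) if n else 0.0
    return table
```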

Clean-Scene vs. Open-Scene

[Figure: MathGen clean scene versus open scene]

Open-scene evaluation shows how performance changes once the same mathematical requirement is embedded in a richer and more realistic environment.
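One simple way to quantify this shift is sketched below. It assumes each mathematical requirement has a clean-scene and an open-scene verdict that can be paired; that pairing, the function name `scene_gap`, and the output keys are assumptions made for illustration.

```python
def scene_gap(results):
    """results: list of (clean_passed, open_passed) booleans, one pair per requirement."""
    n = len(results)
    if n == 0:
        raise ValueError("no results")
    clean_acc = sum(c for c, _ in results) / n
    open_acc = sum(o for _, o in results) / n
    # Share of requirements met in the clean scene but lost in the open scene.
    clean_only = sum(c and not o for c, o in results) / n
    return {"clean": clean_acc, "open": open_acc, "clean_only": clean_only}
```

The `clean_only` share isolates failures attributable to scene complexity, as opposed to requirements the model could not render correctly even in a clean diagram.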

Qualitative Comparison

[Figure: MathGen case study]

Representative qualitative examples illustrate where models succeed, where they fail, and how mathematically incorrect outputs can still appear visually convincing at a glance.

Qualitative and Error Analysis

[Figure: MathGen error analysis]

Error analysis indicates that models often fail numerical consistency or symbolic structure even when the image appears superficially plausible.

Mathematical Fidelity in T2I

A Benchmark for Constraint-Faithful Generation

Many real-world uses of image generation need more than style or realism. Educational diagrams, charts, geometry figures, UI mockups, and structured visual explanations all depend on exact relationships. MathGen provides a foundation for measuring this capability in a way that is objective, reproducible, and sensitive to mathematical correctness.

In that sense, MathGen is not only a benchmark but also a diagnostic tool. It helps distinguish whether a model is weak because it cannot count, cannot preserve exact geometry, cannot render symbolic structure, or simply breaks down once realistic scene composition is introduced.

Objective

Evaluation is based on executable rules, not subjective preference or weak proxy judges.

Reproducible

Because every verifier is a deterministic script, rerunning the evaluation pipeline on the same images yields the same verdicts.

Diagnostic

The benchmark highlights where models fail: counts, ratios, geometry, symbolic structure, or the shift from clean diagrams to realistic scenes.

BibTeX

@article{liu2026mathgen,
  title={MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation},
  author={Liu, Ruiyao and Shen, Hui and Zhang, Ping and Hsieh, Yunta and Zhang, Yifan and Xu, Jing and Chen, Sicheng and Li, Junchen and Lu, Jiawei and Ma, Jianing and others},
  journal={arXiv preprint arXiv:2603.27959},
  year={2026}
}