MathGen

MathGen is a benchmark for evaluating mathematical correctness in text-to-image generation. It contains 900 problems across 7 core mathematical domains, supports both clean-scene and open-scene evaluation, and introduces a Script-as-a-Judge protocol that replaces subjective judging with deterministic executable verification.

Abstract

While text-to-image (T2I) models have achieved remarkable photorealism, their ability to reason about strict mathematical constraints remains largely unproven. Existing benchmarks primarily focus on semantic alignment or aesthetic quality, often relying on VLM-based judges that struggle with precise logical verification.

To address this gap, MathGen introduces a rigorous benchmark evaluating mathematical generation capabilities across 900 problems spanning seven core domains. Unlike prior work, MathGen employs a Script-as-a-Judge protocol, where each prompt is paired with a dedicated executable verifier to ensure deterministic and objective pass/fail evaluation.

Experiments across representative open-source and proprietary T2I models reveal that mathematical fidelity is a major bottleneck: open-source models achieve only about 1-11% overall accuracy, with several structured domains often near zero, while the best closed-source models reach 42.0% overall but remain far from reliable across domains. Together, MathGen and its tool-based evaluation provide a principled testbed for measuring and improving mathematical correctness in T2I generation.

The MathGen Benchmark

MathGen is built to measure whether text-to-image models generate mathematically correct images instead of merely plausible ones. The benchmark separates clean-scene tasks from open-scene tasks, enabling researchers to distinguish failures in core mathematical rendering from failures caused by richer visual composition.

Overview of the MathGen benchmark and evaluation pipeline. Each prompt is paired with a dedicated verifier so correctness is decided by explicit executable constraints.

Counting

Tests exact numerical correctness when generating target objects.

Requires precise instance-level control.

Fractions and Ratios

Measures whether models can render proportional structure accurately.

Targets exact region and area relationships.

Angles

Evaluates angular construction, ray layout, and geometric annotation.

Requires consistent directional geometry.

Functions

Checks coordinate axes, curves, and function-specific structural correctness.

Especially difficult for current models.

Plane and Solid Geometry

Measures 2D and 3D geometric structure under strict verification.

Includes intersections, projections, and composition.

Sets and Open-Scene Math

Extends evaluation to logical set structure and realistic scene conditions.

Tests robustness beyond clean diagrams.

Evaluation Protocol

The benchmark uses a Script-as-a-Judge protocol. Each prompt is paired with a dedicated deterministic verifier implemented with classical vision and recognition tools. A generation is marked correct only when all task-specific constraints are satisfied.

This makes MathGen especially useful for studying whether a model can preserve exact structure under image synthesis. Instead of asking a multimodal judge whether an output looks right, the evaluation checks concrete geometric, numerical, and logical properties directly from the rendered image.

Component	Role	Purpose
Prompt	Defines a mathematical requirement	Specifies the exact structure the model must render
Generator	Produces one image	Evaluated under default or standardized inference settings
Verifier Script	Checks constraints deterministically	Uses contour detection, OCR, geometry checks, or detection tools
Final Decision	Pass or fail	Marked correct only if all constraints are simultaneously satisfied

Task Design Philosophy

Exactness over aesthetics

Tasks are written so that a visually attractive image can still fail if it violates the mathematical requirement.

Fine-grained diagnostics

Different domains reveal different weaknesses: counting control, symbolic mapping, geometric consistency, and topology.

Realistic stress testing

Open-scene prompts preserve the same mathematical condition while adding natural visual complexity.

Finding 1

Closed-source models are substantially stronger, but still not reliably mathematically correct.

Finding 2

Counting is easier than structural domains such as functions, geometry, and sets.

Finding 3

Open-scene evaluation often introduces further failures beyond what clean-scene tasks reveal.

Headline Results

MathGen shows a large gap between image realism and mathematical fidelity. Open-source systems often remain in the single-digit range overall, while even strong closed-source systems remain far from perfect across domains.

This gap is not uniform across tasks. Models may perform better on counting-like patterns while still failing function visualization, set logic, or geometry construction, which require stronger global structure preservation.

Nano Banana Pro

42.0%

Best clean-scene overall accuracy reported in the paper.

GPT-Image-1.5

35.7%

Strong closed-source performance but still far from reliable.

Typical Open Source

1-11%

Many open-source models struggle across most mathematical domains.

Model Leaderboard

This leaderboard keeps the paper-style ranking feel, but each row now links directly to the official model page. Readers can compare the clean-scene results and immediately jump to the corresponding model website, repository, or API documentation.

Diffusion Autoregressive Unified Closed-Source

Rank

Model

Family

Count

Angle

Frac

Func

Plane

Set

Solid

Overall

Visit

Diffusion Models

MathGen

MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation

Abstract

Benchmark Design

The MathGen Benchmark

Counting

Fractions and Ratios

Angles

Functions

Plane and Solid Geometry

Sets and Open-Scene Math

Evaluation Protocol

Task Design Philosophy

Exactness over aesthetics

Fine-grained diagnostics

Realistic stress testing

Experimental Findings

Finding 1

Finding 2

Finding 3

Headline Results

Nano Banana Pro

GPT-Image-1.5

Typical Open Source

Model Leaderboard

Clean-Scene vs. Open-Scene

Qualitative Comparison

Qualitative and Error Analysis

Mathematical Fidelity in T2I

A Benchmark for Constraint-Faithful Generation

Objective

Reproducible

Diagnostic

BibTeX