Tasks that are easy in a raster editor are still out of reach for frontier image-editing models.
The highest-performing model we evaluated (Nano Banana 2) scores only 17.1%.
Scores are mean intersection-over-union (mIoU), in percent.
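As a rough illustration of the metric, here is a minimal sketch of mIoU over a set of tasks. It assumes each task yields one predicted and one reference binary mask of the same size, and that the final score is the unweighted mean across tasks. The mask format and the function names `iou` and `miou_percent` are hypothetical; PaintBench's actual evaluation may differ (for example, by averaging per class or per category).

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical mask format: a 2D grid of 0/1 values.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def miou_percent(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Unweighted mean IoU over (prediction, reference) pairs, in percent."""
    scores = [iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("need at least one mask pair")
    return 100.0 * sum(scores) / len(scores)


# Usage: one half-overlapping pair and one exact match.
example_pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # IoU = 1 / 2
    ([[0, 1], [1, 0]], [[0, 1], [1, 0]]),  # IoU = 1
]
example_score = miou_percent(example_pairs)
```

Under this reading, a score of 17.1% means that, averaged over tasks, the edited region overlaps the reference region by about one sixth of their combined area.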
Can PaintBench performance predict data-visualization editing skills? To find out, we created TinyGrafixBench: a procedurally generated, deterministically evaluated benchmark of 20 data-visualization editing tasks spanning 5 chart types.