visp evals

Model Performance Stats

A more honest view of visp: strong on structured Three.js benchmarks, but much more brittle under casual prompt variation.


Structured Deep Eval

Benchmark-shaped prompts across shader creation, troubleshooting, and Three.js performance.

+77.24 percentage points

visp score

83.19%

base score

5.95%

strict passes

7/12 vs 0/12

visp TS pass rate

75%


General Robustness Eval

Natural-language prompts that test how well visp handles casual phrasing and prompt variation.

+43.52 percentage points

visp score

43.73%

base score

0.21%

strict passes

1/12 vs 0/12

visp TS pass rate

50%


What These Numbers Mean

The structured eval shows visp learned the project's expected Three.js patterns. The general robustness eval is more important for product expectations: small prompt wording changes caused the adapter to miss details, produce TypeScript errors, or fall back to generic code. visp still scores far above the base model (43.73% vs. 0.21%), but it is not yet a reliable general Three.js assistant.

Structured Deep Eval Categories

Category	Base	visp	Gain	Relative Lift	visp Pass
Shader Creation	8.75%	97.50%	+88.75 pts	1,014%	3/4
Troubleshooting	4.72%	61.46%	+56.74 pts	1,202%	1/4
Three.js Performance	4.38%	90.62%	+86.24 pts	1,969%	3/4
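The Gain and Relative Lift columns can be reproduced from the Base and visp columns. The sketch below is one way to do that arithmetic. The names `Lift` and `computeLift` are hypothetical and the rounding is an assumption, not taken from the visp eval code. A base score of zero yields `null`, since relative lift is undefined there.

```typescript
// Hypothetical helper: names and rounding are assumptions, not visp's eval code.
interface Lift {
  gainPts: number;            // absolute gain in percentage points
  relativePct: number | null; // gain relative to base, or null when base is 0
}

function round2(x: number): number {
  return Math.round(x * 100) / 100;
}

function computeLift(basePct: number, vispPct: number): Lift {
  const gainPts = vispPct - basePct;
  // Relative lift divides by the base score, so it is undefined for a zero base.
  const relativePct = basePct === 0 ? null : Math.round((gainPts / basePct) * 100);
  return { gainPts: round2(gainPts), relativePct };
}

// Shader Creation row: base 8.75%, visp 97.50%
const shaderCreation = computeLift(8.75, 97.5);
// shaderCreation.gainPts is 88.75 and shaderCreation.relativePct is 1014
```

Returning `null` rather than `Infinity` keeps the undefined case explicit for whatever renders the table.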

General Robustness Eval Categories

Category	Base	visp	Gain	Relative Lift	visp Pass
General Scene	0.00%	70.83%	+70.83 pts	n/a	1/3
General Interaction	0.00%	50.00%	+50.00 pts	n/a	0/3
General Game	0.00%	9.72%	+9.72 pts	n/a	0/2
General Shader	2.50%	0.00%	-2.50 pts	-100%	0/1
General Performance	0.00%	28.57%	+28.57 pts	n/a	0/2
General Troubleshooting	0.00%	85.71%	+85.71 pts	n/a	0/1
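The overall robustness score is consistent with a prompt-weighted mean of the category scores, where each category's weight is its prompt count (the denominator of the visp Pass column). That weighting is an assumption inferred from the numbers, not a documented formula. The names `CategoryResult` and `weightedMean` are hypothetical.

```typescript
// Hypothetical sketch: assumes the overall score is a prompt-weighted mean.
interface CategoryResult {
  name: string;
  scorePct: number; // visp score for the category
  prompts: number;  // number of prompts, taken from the "visp Pass" denominator
}

function weightedMean(rows: CategoryResult[]): number {
  const totalPrompts = rows.reduce((n, r) => n + r.prompts, 0);
  if (totalPrompts === 0) return NaN; // no prompts, no meaningful score
  const weightedSum = rows.reduce((s, r) => s + r.scorePct * r.prompts, 0);
  return Math.round((weightedSum / totalPrompts) * 100) / 100;
}

const generalRows: CategoryResult[] = [
  { name: "General Scene", scorePct: 70.83, prompts: 3 },
  { name: "General Interaction", scorePct: 50.0, prompts: 3 },
  { name: "General Game", scorePct: 9.72, prompts: 2 },
  { name: "General Shader", scorePct: 0.0, prompts: 1 },
  { name: "General Performance", scorePct: 28.57, prompts: 2 },
  { name: "General Troubleshooting", scorePct: 85.71, prompts: 1 },
];

const overall = weightedMean(generalRows); // 43.73, matching the reported visp score
```

Because several categories contain only one or two prompts, a single prompt can move a category score by a large amount, so the per-category figures are noisier than the overall one.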

Best Strength

Structured shader creation stayed very strong at 97.50%, showing visp can follow explicit Three.js shader recipes.

Biggest Weakness

Natural-language game prompts scored only 9.72%, the lowest among categories with more than one prompt, which matches the manual observation that prompt variation exposes brittleness.

Next Test

Browser-run evals should verify WebGL shader compilation, screenshots, console errors, interactions, and frame time.
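As a rough illustration of how such a browser-run check could be scored, the sketch below turns recorded run metrics into a pass/fail verdict. Everything here is hypothetical: the `BrowserRunMetrics` shape, the `judgeRun` function, and the 16.7 ms frame budget (roughly 60 fps) are assumptions. Collecting the metrics from a real browser, along with screenshot comparison and interaction checks, is left out.

```typescript
// Hypothetical shape of what a browser harness might record for one generated sample.
interface BrowserRunMetrics {
  shaderCompiled: boolean; // did every WebGL shader compile?
  consoleErrors: string[]; // console errors captured during the run
  frameTimesMs: number[];  // per-frame render times in milliseconds
}

interface RunVerdict {
  passed: boolean;
  reasons: string[]; // why the run failed; empty when it passed
}

// Nearest-rank 95th percentile of a non-empty list.
function p95(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.max(0, Math.ceil(sorted.length * 0.95) - 1);
  return sorted[index];
}

// frameBudgetMs defaults to an assumed 60 fps budget; tune per project.
function judgeRun(m: BrowserRunMetrics, frameBudgetMs = 16.7): RunVerdict {
  const reasons: string[] = [];
  if (!m.shaderCompiled) reasons.push("shader failed to compile");
  if (m.consoleErrors.length > 0) {
    reasons.push(`${m.consoleErrors.length} console error(s)`);
  }
  if (m.frameTimesMs.length === 0) {
    reasons.push("no frames rendered");
  } else if (p95(m.frameTimesMs) > frameBudgetMs) {
    reasons.push("p95 frame time over budget");
  }
  return { passed: reasons.length === 0, reasons };
}
```

Recording reasons instead of a bare boolean makes it easier to see which failure mode dominates for each prompt category.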