visp evals
Model Performance Stats
A more honest view of visp: strong on structured Three.js benchmarks, but much more brittle under casual prompt variation.
Structured Deep Eval
Benchmark-shaped prompts across shader creation, troubleshooting, and Three.js performance.
visp score
83.19%
base score
5.95%
strict passes
7/12 (visp) vs 0/12 (base)
visp TypeScript pass rate
75%
General Robustness Eval
Natural-language prompts that test how well visp handles casual phrasing and prompt variation.
visp score
43.73%
base score
0.21%
strict passes
1/12 (visp) vs 0/12 (base)
visp TypeScript pass rate
50%
What These Numbers Mean
The structured eval shows that visp learned the project's expected Three.js patterns. The general robustness eval matters more for product expectations: small changes in prompt wording caused the adapter to miss details, produce TypeScript errors, or fall back to generic code. Overall, visp still beats the base model (43.73% vs 0.21%), but it is not yet a reliable general Three.js assistant.
Structured Deep Eval Categories
| Category | Base | visp | Gain | Relative Lift | visp Pass |
|---|---|---|---|---|---|
| Shader Creation | 8.75% | 97.50% | +88.75 pts | 1,014% | 3/4 |
| Troubleshooting | 4.72% | 61.46% | +56.74 pts | 1,202% | 1/4 |
| Three.js Performance | 4.38% | 90.62% | +86.24 pts | 1,969% | 3/4 |
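The Gain and Relative Lift columns appear to follow simple arithmetic: gain is the visp score minus the base score in percentage points, and relative lift is that gain divided by the base score. The sketch below reproduces the table rows under that assumption. The names `computeLift` and `formatGain` are hypothetical helpers for illustration, not part of the eval harness.

```typescript
// Hypothetical helpers; the formulas are inferred from the table columns.
interface Lift {
  gainPts: number;                // absolute gain in percentage points
  relativeLiftPct: number | null; // null when the base score is 0 (reported as n/a)
}

function computeLift(basePct: number, vispPct: number): Lift {
  const gainPts = vispPct - basePct;
  // Relative lift is undefined for a zero base, so report null instead of Infinity.
  const relativeLiftPct = basePct === 0 ? null : (gainPts / basePct) * 100;
  return { gainPts, relativeLiftPct };
}

// Format a gain with exactly one sign character.
function formatGain(gainPts: number): string {
  const sign = gainPts >= 0 ? "+" : "-";
  return `${sign}${Math.abs(gainPts).toFixed(2)} pts`;
}

// Shader Creation row: base 8.75%, visp 97.50%.
const shaderCreation = computeLift(8.75, 97.5);
console.log(formatGain(shaderCreation.gainPts)); // "+88.75 pts"
```

Returning `null` for a zero base keeps the "n/a" cells distinct from a genuine 0% lift.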
General Robustness Eval Categories
| Category | Base | visp | Gain | Relative Lift | visp Pass |
|---|---|---|---|---|---|
| General Scene | 0.00% | 70.83% | +70.83 pts | n/a | 1/3 |
| General Interaction | 0.00% | 50.00% | +50.00 pts | n/a | 0/3 |
| General Game | 0.00% | 9.72% | +9.72 pts | n/a | 0/2 |
| General Shader | 2.50% | 0.00% | -2.50 pts | -100% | 0/1 |
| General Performance | 0.00% | 28.57% | +28.57 pts | n/a | 0/2 |
| General Troubleshooting | 0.00% | 85.71% | +85.71 pts | n/a | 0/1 |
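The page does not say how the headline scores are aggregated. One reading that matches the numbers is a prompt-weighted mean of the category scores, with the prompt count per category taken from the denominators in the visp Pass column. The sketch below checks that reading against the general robustness table. The `CategoryScore` type and `weightedMean` function are hypothetical names, and the aggregation rule itself is an assumption.

```typescript
// Assumption: the headline score is a prompt-weighted mean of category scores.
interface CategoryScore {
  name: string;
  scorePct: number; // category score in percent
  prompts: number;  // prompts in the category (denominator of the visp Pass column)
}

function weightedMean(categories: CategoryScore[]): number {
  const totalPrompts = categories.reduce((n, c) => n + c.prompts, 0);
  const weightedSum = categories.reduce((s, c) => s + c.scorePct * c.prompts, 0);
  return weightedSum / totalPrompts;
}

// visp scores from the general robustness table above.
const generalVisp: CategoryScore[] = [
  { name: "General Scene", scorePct: 70.83, prompts: 3 },
  { name: "General Interaction", scorePct: 50.0, prompts: 3 },
  { name: "General Game", scorePct: 9.72, prompts: 2 },
  { name: "General Shader", scorePct: 0.0, prompts: 1 },
  { name: "General Performance", scorePct: 28.57, prompts: 2 },
  { name: "General Troubleshooting", scorePct: 85.71, prompts: 1 },
];

console.log(weightedMean(generalVisp).toFixed(2)); // "43.73"
```

The same function applied to the structured categories (four prompts each) gives 83.19%, so both headline scores are consistent with this aggregation.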
Best Strength
Structured shader creation stayed very strong at 97.50%, showing visp can follow explicit Three.js shader recipes.
Biggest Weakness
Natural game prompts scored only 9.72%, which matches the manual observation that prompt variation exposes brittleness. General Shader scored lower still at 0.00%, but on a single prompt.
Next Test
Browser-run evals should verify WebGL shader compilation, screenshots, console errors, interactions, and frame time.
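As one illustration of the shader-compilation part of such a check, the sketch below compiles a GLSL source through the standard WebGL calls (`createShader`, `shaderSource`, `compileShader`, `getShaderParameter`, `getShaderInfoLog`) and reports the result. The `GlLike` interface and `checkShaderCompiles` helper are hypothetical names, not part of visp or Three.js. The interface covers only the calls used, on the assumption that a browser `WebGLRenderingContext` satisfies it structurally. Sources written for Three.js `ShaderMaterial` rely on declarations that Three.js prepends, so this direct check probably fits `RawShaderMaterial`-style sources best.

```typescript
// Hypothetical minimal slice of the WebGL API, so the check can also run
// against a fake context outside the browser.
interface GlLike<S> {
  readonly COMPILE_STATUS: number;
  createShader(type: number): S | null;
  shaderSource(shader: S, source: string): void;
  compileShader(shader: S): void;
  getShaderParameter(shader: S, pname: number): unknown;
  getShaderInfoLog(shader: S): string | null;
  deleteShader(shader: S | null): void;
}

interface ShaderCheck {
  ok: boolean; // true when the driver reports a successful compile
  log: string; // compiler info log, empty when there is nothing to report
}

// Compile one shader source and return the status plus the info log.
function checkShaderCompiles<S>(
  gl: GlLike<S>,
  type: number, // gl.VERTEX_SHADER or gl.FRAGMENT_SHADER
  source: string,
): ShaderCheck {
  const shader = gl.createShader(type);
  if (shader === null) {
    return { ok: false, log: "createShader returned null" };
  }
  gl.shaderSource(shader, source);
  gl.compileShader(shader);
  const ok = gl.getShaderParameter(shader, gl.COMPILE_STATUS) === true;
  const log = gl.getShaderInfoLog(shader) ?? "";
  gl.deleteShader(shader); // free the shader object once it has been inspected
  return { ok, log };
}
```

In a browser harness this would be called with a context from `canvas.getContext("webgl")`. Screenshots, console errors, interactions, and frame time would need separate checks.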