I Taught Claude to Judge My UI. It Missed the Footer.
Published: December 7, 2025 • 6 min read
I did it. I finally completed my fourth case study, the one I promised when I introduced Claude God Tip #12.
If you haven't read that blog post or taken a look at the case study, this post will make absolutely no sense to you. Go check those out first. I'll wait.
Back? Good. Let me tell you what actually happened when I built this thing.
A Personal Note on Time Estimation (I'm Bad at It)
First things first: I always underestimate how long case studies take. Always.
I thought I'd be done with this in maybe 5 hours maximum. It probably took closer to double that time, and that's not even counting the hours it took to build Campaign Wave, the marketing analytics dashboard I used to demonstrate the concept.
10+ hours for something I budgeted 5 hours for. Classic.
When Opus 4.5 Said "Overloaded" (Twice)
Here's something the case study doesn't mention: I originally wanted to use Claude Opus 4.5, the most intelligent model available according to Claude's documentation.
It didn't work out.
I tried running the analysis scripts twice, and both times I kept getting "Overloaded" errors even after increasing the delay period between analyzing each image. Here's the terminal output from the first run:
$ npm run visual-qa:analyze
> campaign-wave@0.1.0 visual-qa:analyze
> node scripts/analyze-screenshots.js
[1/8] Analyzing: overview-desktop-dark.png... Score: 62/100, Issues: 7
[2/8] Analyzing: campaigns-desktop-dark.png... Score: 72/100, Issues: 8
[3/8] Analyzing: compare-desktop-dark.png... 529 {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}
[4/8] Analyzing: help-desktop-dark.png... 529 {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}
[5/8] Analyzing: overview-desktop-light.png... Score: 62/100, Issues: 9
[6/8] Analyzing: campaigns-desktop-light.png... 529 {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}
[7/8] Analyzing: compare-desktop-light.png... Score: 32/100, Issues: 9
[8/8] Analyzing: help-desktop-light.png... 529 {"type":"error","error":{"type":"overloaded_error","message":"Overloaded"}}
Duration: 4m 20s
The second run wasn't any better. Different screenshots failed, but the pattern was the same: Opus 4.5 couldn't keep up, most likely because a lot of other users were hammering it at the same time. Apparently everyone else thinks it's the smartest model too.
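For anyone who hits the same wall: a 529 is a transient server-side error, so the usual workaround is to retry each call with an increasing backoff rather than just stretching a fixed delay between images. Here's a minimal sketch of that idea using the official Anthropic Node SDK. The function, prompt, and file handling are illustrative placeholders, not the actual script from the case study.

```js
// Minimal sketch: retry a single screenshot analysis when the API is overloaded.
// Assumes the official @anthropic-ai/sdk package; analyzeScreenshot and the prompt
// are illustrative placeholders, not the case study's real script.
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function analyzeScreenshot(path, { maxRetries = 4 } = {}) {
  const imageData = (await readFile(path)).toString("base64");

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await client.messages.create({
        model: "claude-sonnet-4-5", // model alias; check the current model list
        max_tokens: 1024,
        messages: [{
          role: "user",
          content: [
            { type: "image", source: { type: "base64", media_type: "image/png", data: imageData } },
            { type: "text", text: "Score this UI out of 100 and list concrete visual issues." },
          ],
        }],
      });
    } catch (err) {
      // HTTP 529 = overloaded_error; back off exponentially instead of failing the whole run.
      if (err?.status === 529 && attempt < maxRetries) {
        const waitMs = 2000 * 2 ** attempt;
        console.warn(`Overloaded, retrying in ${waitMs / 1000}s...`);
        await new Promise((resolve) => setTimeout(resolve, waitMs));
        continue;
      }
      throw err;
    }
  }
}
```

(The SDK also exposes a `maxRetries` constructor option that does something similar under the hood; the manual loop just makes the backoff explicit.)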
So the case study uses Claude Sonnet 4.5 instead. Honestly, I still think it's a good choice. According to Claude's documentation, Sonnet offers "the best balance of intelligence, speed, and cost for most use cases, with exceptional performance in coding and agentic tasks." That's exactly what I needed for a Visual QA pipeline.
The Viewport Problem: Why Only 16 of 20 Bugs Were Detectable
Here's the big limitation I discovered.
I started with a severely broken version of the application where I intentionally introduced 20 bugs across the UI. But here's the thing: 4 of those 20 bugs could only be noticed on mobile view.
The scripts I wrote only took viewport screenshots at a desktop resolution (1440 x 900). This means only 16 of the 20 errors were detectable from the start.
Let me explain what "viewport screenshot" means in case it's not clear. If you're reading this blog post on a standard desktop or laptop, your screen is probably around 1440 pixels wide and 900 pixels tall. Don't overthink the word "pixels." Just see it as a unit of measurement like centimeters or meters. The viewport is everything visible on your screen without scrolling. A viewport screenshot captures exactly that: what you see without scrolling.
If you're reading this on a mobile phone (maybe an iPhone 12/13/14), the viewport dimensions are around 375 x 812. When testing on mobile, screenshots would have those dimensions. But I wasn't testing on mobile. I was testing desktop only.
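To make the viewport-only part concrete, here's a small sketch of the kind of capture I'm describing, using Playwright purely for illustration. The real capture script lives in the case study, and the URL here is a placeholder.

```js
// Sketch: capture a viewport-only screenshot at the desktop resolution used in the case study.
// Playwright is used for illustration; the URL and output file name are placeholders.
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage({ viewport: { width: 1440, height: 900 } });

await page.goto("http://localhost:3000/overview");

// fullPage: false (the default) captures only what fits in the 1440x900 viewport,
// so anything below the fold -- like the footer -- never makes it into the image.
await page.screenshot({ path: "overview-desktop-dark.png", fullPage: false });

await browser.close();
```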
The Footer Revelation
Here's where it gets interesting.
The Campaign Wave application has parts that cannot be seen unless you scroll down. My scripts visited each page and captured viewport screenshots from the top: 4 pages in 2 themes (light and dark mode), for 8 screenshots total.
That means there were parts of the application the model never saw to evaluate.
I was impressed with how well the model improved the application on the first run. You can see the result here. But if you look at that version, you'll notice something: the major UI issues are in the footer.
This makes perfect sense. With viewport-only screenshots, the model never sees the footer, so it can't evaluate the footer's problems.
The footer was completely invisible to Claude because it existed below the fold. Claude can't fix what Claude can't see.
Score Inconsistencies: Comparing Apples to Oranges
I realized I needed to do a second run that captured the bottom parts of the screen as well. And that's when things got confusing.
I noticed that the model rated the application after Phase 1 (viewport-only analysis and repair) with an overall score of 59, but it had rated the freshly broken version with a score of 63.
Wait. The fixed version scored worse than the broken version?
Then, after the Phase 2 repairs, Claude's fixes only nudged the score up to 61, still below the broken version's 63 and well below the 77 from the first pass.
This seemed backwards until I understood what was happening: the first analysis was viewport-only, while the second analysis was viewport+scrolled. Getting a lower score on the second run makes sense because Claude was measuring more of the application. It was seeing the footer for the first time.
Here's the score progression that actually tells the story:
| Phase | Capture Method | Screenshots | Score | Why |
|---|---|---|---|---|
| Broken app | Viewport only | 8 | 63 | Baseline |
| After viewport fixes | Viewport only | 8 | 77 | +14 improvement |
| Re-analyze with scroll | Viewport + scrolled | 16 | 59 | Footer issues now visible |
| After scroll fixes | Viewport + scrolled | 16 | 61 | Partial improvement |
The key insight: scores from viewport-only captures cannot be compared to scores from scrolled captures. They're evaluating different portions of the application. The score dropped not because the app got worse, but because we were now analyzing 16 screenshots instead of 8, and the footer had significant issues that were previously invisible.
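One cheap guardrail against fooling yourself this way is to record the capture method alongside each score and refuse to compare runs that used different methods. A tiny sketch, with a made-up report shape:

```js
// Sketch: tag every analysis run with its capture method so scores are only
// compared like-for-like. The report structure here is illustrative.
function compareRuns(before, after) {
  if (before.captureMethod !== after.captureMethod) {
    throw new Error(
      `Cannot compare a ${before.captureMethod} run against a ${after.captureMethod} run`
    );
  }
  return after.score - before.score;
}

const brokenApp = { captureMethod: "viewport", score: 63 };
const afterViewportFixes = { captureMethod: "viewport", score: 77 };
const scrolledReanalysis = { captureMethod: "viewport+scrolled", score: 59 };

console.log(compareRuns(brokenApp, afterViewportFixes)); // +14, a fair comparison
// compareRuns(afterViewportFixes, scrolledReanalysis);   // throws: different capture methods
```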
What I Would Do Differently
In hindsight, it would have been better if I had written a prompt that told the model what viewport the app was being analyzed at, so it had that context up front.
I also could have captured full-page screenshots (including the parts only visible after scrolling) from the start. That probably would have been more effective.
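For what it's worth, both tweaks are small. Here's a sketch of what they could look like together, again using Playwright and an illustrative prompt rather than the case study's actual script:

```js
// Sketch combining both tweaks: full-page capture plus viewport context in the prompt.
// Playwright, the URL, and the prompt wording are illustrative placeholders.
import { chromium } from "playwright";

const viewport = { width: 1440, height: 900 };

const browser = await chromium.launch();
const page = await browser.newPage({ viewport });
await page.goto("http://localhost:3000/overview");

// fullPage: true scrolls and stitches the whole page, footer included.
await page.screenshot({ path: "overview-desktop-dark-full.png", fullPage: true });
await browser.close();

// Pass the capture context to the model so it knows what it's looking at.
const analysisPrompt = `You are reviewing a full-page screenshot of a marketing
analytics dashboard rendered at a ${viewport.width}x${viewport.height} desktop
viewport. Score the visual quality out of 100 and list concrete issues.`;
```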
But oh well. I'll leave it as it is for now and maybe re-run the experiment with either the same app or a different one in the future.
The Cost: Surprisingly Affordable
Here's something I didn't expect: the entire experiment cost less than $2.
| Model | Input Cost | Output Cost | Total |
|---|---|---|---|
| Claude Opus 4.5 (failed runs) | $0.12 | $0.35 | $0.47 |
| Claude Sonnet 4 | $0.01 | $0.01 | $0.02 |
| Claude Sonnet 4.5 | $0.34 | $0.99 | $1.33 |
| Total | | | $1.82 |
The entire Visual QA pipeline, including the two failed Opus 4.5 attempts, cost me $1.82.
If we exclude those failed Opus runs (since they didn't contribute to the final case study), the actual working experiment cost only $1.35.
For less than the price of a coffee, I was able to:
- Analyze 8 screenshots across 4 pages and 2 themes
- Get detailed issue reports with severity scores
- Have Claude generate fixes for identified problems
- Re-run the analysis with scrolled screenshots
This makes the Visual QA approach very accessible for anyone who wants to try it on their own projects. You're not looking at hundreds of dollars in API costs. You're looking at pocket change.
Final Thoughts
I'm glad I did this experiment. It confirmed that Claude God Tip #12 is possible to implement. Claude can absolutely judge its own visual work and fix issues.
But I also acknowledge there are limitations to the approach I followed:
- Opus 4.5 may be too resource-intensive for batch image analysis (at least right now)
- Viewport-only screenshots miss below-the-fold content entirely
- Mobile bugs require mobile-specific test runs
- Score comparisons are only valid when using the same capture method
I will revisit this workflow in the future and design better, more sophisticated scripts with more effective prompts. The concept is solid. The execution just needs refinement.
Check out the full Visual QA Testing case study if you want to see the complete methodology, the bugs I introduced, and the analysis results.
As always, thanks for reading!