Everyone wants bigger images. Photographers, archivists, AI artists, designers, historians – everyone.
But real-world image upscaling is far from solved. Especially when the stakes are high – when the image isn’t just pixels, but a portrait, a memory, a reference asset. For us, that image was a Spitz.
These dogs have ultra-fine fur, soft but dense texture, layered gradients, and delicate lighting interplay. When scanned or compressed, the pictures suffer from:
- Loss of fur strands
- Smudged contour edges
- Unnatural over-sharpening or GAN hallucinations
- Compression artifacts in shadow gradients
We wanted to upscale them, but not blindly. We wanted to restore detail, not invent fiction.
This goal led us to a months-long research process – exploring every class of upscalers across GANs, transformers, latent diffusion models, and hybrid inference pipelines.
Evaluation process
We explored three families of upscalers:
- GAN-based upscalers: Real-ESRGAN, BSRGAN, LapSRN, BasicSR, Waifu2x
- Latent diffusion upscalers: SR3, Stable Diffusion x4 Upscaler
- Transformer-based hybrid models: SwinIR
For test images, we picked portraits of Spitzes – with real lighting, real fur, real degradation.
Our scientific goals were the following:
- Preserve texture (no over-smoothing)
- Avoid fake detail hallucination
- Handle shadows, fur edges, and lighting subtleties
- Avoid seams in tiling
- Upscale with real fidelity, not just aesthetic guesses
In some cases, we did not upscale the entire dog – instead, we created custom binary masks that isolated specific regions like the fur halo, backlight edge, or damaged shadow zones.
These masked regions were routed through the Stable Diffusion ×4 Upscaler pipeline, while the rest of the image remained untouched.
Importantly, we were fully aware that some test photos contained visible artifacts – especially around the legs and areas near the head. This was intentional.
These test cases were designed to stress-test fur reconstruction only, not full-scene fidelity. By isolating the most challenging textures, we could measure the upscaler’s ability to preserve microstructure without introducing global distortions.
This allowed us to surgically enhance only what needed recovery, avoiding unnecessary hallucination and preserving the authenticity of the original capture.
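The masked routing above reduces to a compositing step: upscale only the masked region, then blend it back. Below is a minimal, framework-free sketch of that compositing logic – the function name, the box-blur feathering, and the parameters are our own illustrative choices, not the exact production code:

```python
import numpy as np

def composite_masked_region(original: np.ndarray,
                            restored: np.ndarray,
                            mask: np.ndarray,
                            feather: int = 8) -> np.ndarray:
    """Blend a restored (upscaled) region back into the original image.

    original, restored: float arrays of shape (H, W, 3) in [0, 1].
    mask: binary array of shape (H, W), 1 where the region was upscaled.
    feather: number of blur passes used to soften the mask edge.
    """
    soft = mask.astype(np.float32)
    # Cheap feathering: repeated cross-shaped box blur of the binary mask,
    # so the transition between restored and untouched pixels is gradual.
    for _ in range(feather):
        padded = np.pad(soft, 1, mode="edge")
        soft = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                padded[1:-1, :-2] + padded[1:-1, 2:] +
                padded[1:-1, 1:-1]) / 5.0
    soft = soft[..., None]  # broadcast the mask over the RGB channels
    return soft * restored + (1.0 - soft) * original
```

In the real pipeline the `restored` input comes from the Stable Diffusion ×4 Upscaler; everything outside the soft mask stays byte-identical to the source capture.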
Models that didn’t pass
While many models showed promise in certain aspects, they ultimately failed to meet our comprehensive standards for authentic visual preservation. Here are the models that fell short in critical areas.
1. BSRGAN
🧪 Blind SR model trained for noisy, low-res inputs
❌ Only works well with extremely small (64×64) images
❌ Hallucinates plastic textures
❌ Completely collapses on modern photo data
Verdict: Legacy research tool. Doesn’t generalize to large real-world images.
2. SR3 (Google Research)
🧬 Diffusion-based upscaling model designed for portraits
✅ Conceptually interesting: progressive denoising in pixel space
❌ Extremely slow
❌ Poor support for large image inputs (needs cropped faces)
❌ Overwrites natural edges and lighting with “diffusion blur”
Verdict: Groundbreaking paper, but fails practical, high-res Spitz upscaling.
3. LapSRN / BasicSR
📉 Pyramid-based super-resolution networks
✅ Fast inference
❌ Low sharpness ceiling
❌ Cannot preserve delicate fur detail
❌ Tends to soften textures – fails naturalism
Verdict: Historically important, but obsolete in modern pipelines.
4. Waifu2x
🎨 Designed for anime and line art
✅ Excellent at smoothing flat color regions
❌ Fails natural photos
❌ Blurs fine fur detail, smooths shadows unnaturally
Verdict: Good for stylized art. Useless for Spitz photos.


5. Upscayl
🖥️ Desktop GUI upscaler built on Real-ESRGAN models
✅ Easy to use, with Vulkan acceleration (no PyTorch needed)
✅ Delivers decent results on general photos
❌ No control over hallucination strength, tile blending, or architectural internals
❌ Not scriptable, not precision-grade for scientific workflows
Verdict: A strong consumer tool, but lacks the control and fidelity we require for research-grade upscaling.


Models that passed scientific scrutiny
These models stood apart through their consistent ability to preserve authentic visual information while delivering meaningful enhancements.
1. SwinIR
SwinIR is a transformer-based super-resolution network using a windowed attention architecture. Instead of hallucinating texture like GANs, SwinIR learns spatial context over large receptive fields.
Strengths:
- Preserves real texture – doesn’t guess
- Handles blurry or compressed input gracefully
- Works well with synthetic degradations
Challenges:
- Slow on large images (heavy architecture)
- High VRAM usage — not deployable on consumer GPUs for 4K+ tasks
- Still prone to striping artifacts if not tiled carefully
When we used it:
- For benchmark comparison
- As a ground truth validator – if a hallucinated image matched SwinIR in structure, we trusted it more
Verdict:
✅ Scientifically solid
⚠️ Operationally limited
💡 Not our final tool – but an essential research reference.
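The "tiled carefully" caveat above comes down to covering the image with overlapping windows so the blend can hide window boundaries. A minimal sketch of that tile-layout logic (the function name and defaults are assumptions for illustration):

```python
def tile_coords(height: int, width: int, tile: int = 256, overlap: int = 32):
    """Return (top, left, bottom, right) boxes that cover an image with
    overlapping tiles, so seams can be blended away after inference."""
    step = tile - overlap
    tops = list(range(0, max(height - tile, 0) + 1, step))
    lefts = list(range(0, max(width - tile, 0) + 1, step))
    # Make sure the final row/column of tiles reaches the image border.
    if tops[-1] + tile < height:
        tops.append(height - tile)
    if lefts[-1] + tile < width:
        lefts.append(width - tile)
    return [(t, l, min(t + tile, height), min(l + tile, width))
            for t in tops for l in lefts]
```

Each box is run through the model independently; the overlap regions are what the blending stage later crossfades.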


2. Real-ESRGAN
This is a GAN-based model built on the RRDBNet architecture. It is fast, powerful, and widely adopted. However, it is also known for aggressive hallucination and unnatural sharpness.
Problems in the vanilla model:
- Tends to sharpen outlines unrealistically
- Fur becomes “wiry” or too harsh
- Lighting gradients get corrupted
So we went deeper and added our own modifications (described here without the raw training code).
1. Architectural tweaks to RRDBNet
- Reduced layer depth in deep RRDB blocks
- Added channel attention gating to suppress false sharpness
- Modified upsampling layers to be less aggressive (downscaled weights)
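To make the gating idea concrete, here is a minimal, framework-free sketch of squeeze-and-excitation style channel attention – the kind of gate we mean. In the real model this sits inside the RRDB blocks as trained PyTorch layers; the function and weight shapes below are illustrative assumptions:

```python
import numpy as np

def channel_attention_gate(feat: np.ndarray,
                           w1: np.ndarray,
                           w2: np.ndarray) -> np.ndarray:
    """Gate a (C, H, W) feature map per channel.

    Channels the gate scores low get scaled down, damping the spurious
    high-frequency responses behind GAN over-sharpening.
    w1: (C//r, C) and w2: (C, C//r) are learned bottleneck weights.
    """
    squeeze = feat.mean(axis=(1, 2))              # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)        # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid -> per-channel scale
    return feat * gate[:, None, None]
```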
2. Controlling GAN hallucination
- We retrained on a soft-edged dataset
- Injected loss balancing between perceptual + L1 + texture-aware loss
- Result: less fake fur, more real softness.
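The loss balancing can be sketched as a weighted sum. The weights below and the gradient-matching “texture-aware” term are illustrative assumptions; the real training used a proper perceptual loss (e.g. VGG features), which is left as a pluggable callback here:

```python
import numpy as np

def balanced_loss(pred, target, w_l1=1.0, w_tex=0.1, w_perc=0.05, perc_fn=None):
    """Weighted sum of L1, a gradient-based texture term, and an optional
    perceptual term. pred/target: (H, W, C) float arrays."""
    l1 = np.abs(pred - target).mean()
    # Texture-aware term: match horizontal/vertical gradients, rewarding
    # real fine structure without rewarding invented sharp edges.
    gx = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    gy = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    perc = perc_fn(pred, target) if perc_fn is not None else 0.0
    return w_l1 * l1 + w_tex * (gx + gy) + w_perc * perc
```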
3. Tiling & half-precision
- Used the `--tile_pad` option to reduce seams.
- Enabled float16 precision for large-scale inference (roughly 4× faster).
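In practice, the tiling and precision settings map onto Real-ESRGAN’s inference script roughly like this – paths and tile sizes are placeholders, and flag names should be checked against your Real-ESRGAN version:

```shell
# Placeholder paths; flag names follow Real-ESRGAN's inference_realesrgan.py
python inference_realesrgan.py \
    -i spitz_input.png -o results/ \
    -n RealESRGAN_x4plus \
    --tile 256 --tile_pad 32    # padded overlapping tiles reduce seams
# Half precision (fp16) is the default on GPU; add --fp32 to disable it.
```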
Verdict:
✅ Powerful. Fast. Customizable.
⚠️ Needs internal modification to avoid artifacts.
👏 Great for controlled upscaling. Used in many real outputs.
3. Stable Diffusion ×4 Upscaler
This is a latent diffusion model trained to upscale via text-prompted denoising. It runs in a latent space, allowing stylized and highly detailed generation.
At first glance, this might sound dangerous – a model that hallucinates? But we didn’t use it blindly. We re-engineered the pipeline for precision control.
1. Full pipeline control
We broke the pipeline into three stages:
- VAE Encoder: converts the image into a latent tensor.
- SD Upscaler: injects latents midway into the UNet, then adds chunk-wise overlap (32 px) with crossfade smoothing.
- Custom Decoder: we replaced the default VAE decoder with vae-ft-mse-840000-ema-pruned (sharper). It preserves mid-frequency features and avoids smudging.
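A sketch of how stages one and three could be wired with Hugging Face diffusers. The model IDs are the public checkpoints (stabilityai/sd-vae-ft-mse is the diffusers release of the ft-MSE-840000 VAE); the mid-UNet latent injection and chunk overlap of stage two are omitted, and drop-in compatibility of the swapped decoder should be verified against your diffusers version. Imports are deferred so the function can be defined without the library installed:

```python
def build_upscale_pipeline(device: str = "cuda"):
    """Assemble the stock SD x4 upscaler with the default VAE decoder
    swapped for the sharper ft-MSE checkpoint (stages 1 and 3)."""
    import torch
    from diffusers import StableDiffusionUpscalePipeline, AutoencoderKL

    pipe = StableDiffusionUpscalePipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler",
        torch_dtype=torch.float16,
    )
    # Custom decoder: diffusers release of vae-ft-mse-840000-ema-pruned.
    pipe.vae = AutoencoderKL.from_pretrained(
        "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
    )
    return pipe.to(device)
```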
2. Region-aware prompt injection
Instead of one global prompt, we used tile-specific prompts:
- Top: “blue sky, soft clouds”
- Middle: “fluffy orange dog, Spitz fur detail”
- Bottom: “dark grass, high contrast shadows”
This let us inject different generation styles across tiles without losing coherence.
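The routing itself is simple: pick a prompt from the tile’s vertical position. A sketch, where the band boundaries are assumptions to adapt per image and the prompt strings mirror the example above:

```python
def prompt_for_tile(top: int, image_height: int, bands=None) -> str:
    """Return a tile-specific prompt based on the tile's top edge.

    bands: list of (upper_relative_bound, prompt) pairs, scanned in order.
    """
    if bands is None:
        bands = [
            (0.33, "blue sky, soft clouds"),
            (0.66, "fluffy orange dog, Spitz fur detail"),
            (1.01, "dark grass, high contrast shadows"),
        ]
    rel = top / image_height
    for limit, prompt in bands:
        if rel < limit:
            return prompt
    return bands[-1][1]
```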
3. Attention smoothing + Kornia blending
We used attention smoothing together with Kornia-based blending to fix seam artifacts:
- Added 32px tile overlap
- Used Kornia for gradient-based tile blending
- Result: no seams, no sharp transitions
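The core of the crossfade is a linear alpha ramp across the overlap. Below is a simplified stand-in for the gradient-based Kornia blending, shown for two vertically adjacent tiles (the function name and shapes are our own for this example):

```python
import numpy as np

def crossfade_rows(top_tile: np.ndarray, bottom_tile: np.ndarray,
                   overlap: int = 32) -> np.ndarray:
    """Blend two vertically adjacent tiles whose last/first `overlap`
    rows cover the same pixels. Tiles are (H, W, C) float arrays."""
    alpha = np.linspace(0.0, 1.0, overlap)[:, None, None]
    # Weight shifts linearly from the top tile to the bottom tile.
    blended = (1.0 - alpha) * top_tile[-overlap:] + alpha * bottom_tile[:overlap]
    return np.concatenate(
        [top_tile[:-overlap], blended, bottom_tile[overlap:]], axis=0)
```

The same ramp applied horizontally handles left/right neighbors, giving seam-free mosaics.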
Why it won:
- Hallucinations? Controlled.
- Detail? Balanced with softness.
- Resolution? ×4 native upscale, artifact-free.
- Custom decoder? Sharp without destruction.
Verdict:
✅ Most advanced.
✅ Fully controllable.
✅ Final production tool.


Summary comparison table
Below is a side-by-side comparison of all evaluated upscaling technologies. This table presents our test results across key scientific metrics, making it easy to see how each model performed.
| Model | Detail quality | Hallucination control | Realism | Speed | Modifiable? | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| BSRGAN | ❌ Weak | ❌ | ❌ | ✅ | ⚠️ | Rejected |
| SR3 | ⚠️ Decent | ⚠️ | ❌ | ❌ | ❌ | Rejected |
| Waifu2x | ⚠️ Stylized | ✅ | ❌ | ✅ | ❌ | Rejected |
| LapSRN | ❌ | ✅ | ❌ | ✅ | ❌ | Rejected |
| SwinIR | ✅ | ✅ | ✅ | ❌ | ⚠️ | Research use only |
| Real-ESRGAN (modded) | ✅ | ✅ | ✅ | ✅ | ✅ | Approved |
| SD x4 Upscaler (modded) | ✅✅ | ✅✅ | ✅✅ | ⚠️ | ✅✅ | Finalist |
Conclusion
This wasn’t a style experiment. It was a controlled scientific process built to solve one hard problem: upscale real-world photos – specifically Spitz portraits – without destroying the soul of the image.
We tried everything: GANs, transformers, diffusion. We rewired models, rebuilt pipelines, replaced decoders, smoothed tiles, and engineered prompts. The result is a tool set that’s surgical, modular, and ready for production.