Everyone wants bigger images. Photographers, archivists, AI artists, designers, historians – everyone.
But real-world image upscaling is far from solved. Especially when the stakes are high – when the image isn’t just pixels, but a portrait, a memory, a reference asset. For us, that image was a Spitz.
These dogs have ultra-fine fur, soft but dense texture, layered gradients, and delicate lighting interplay. When scanned or compressed, the pictures suffer from:
- Loss of fur strands
- Smudged contour edges
- Unnatural over-sharpening or GAN hallucinations
- Compression artifacts in shadow gradients
We wanted to upscale them, but not blindly. We wanted to restore detail, not invent fiction.
This goal led us to a months-long research process – exploring every class of upscalers across GANs, transformers, latent diffusion models, and hybrid inference pipelines.
Evaluation process
We explored three families of upscalers:
- GAN-based upscalers: Real-ESRGAN, BSRGAN, LapSRN, BasicSR, Waifu2x
- Latent diffusion upscalers: SR3, Stable Diffusion x4 Upscaler
- Transformer-based hybrid models: SwinIR
For test images, we picked portraits of Spitzes – with real lighting, real fur, real degradation.
Our scientific goals were the following:
- Preserve texture (no over-smoothing)
- Avoid fake detail hallucination
- Handle shadows, fur edges, and lighting subtleties
- Avoid seams in tiling
- Upscale with real fidelity, not just aesthetic guesses
In some cases, we did not upscale the entire dog – instead, we created custom binary masks that isolated specific regions like the fur halo, backlight edge, or damaged shadow zones.
These masked regions were routed through the Stable Diffusion ×4 Upscaler pipeline, while the rest of the image remained untouched.
Importantly, we were fully aware that some test photos contained visible artifacts – especially around the legs and areas near the head. This was intentional.
These test cases were designed to stress-test fur reconstruction only, not full-scene fidelity. By isolating the most challenging textures, we could measure the upscaler’s ability to preserve microstructure without introducing global distortions.
This allowed us to surgically enhance only what needed recovery, avoiding unnecessary hallucination and preserving the authenticity of the original capture.
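The masked routing above reduces to a compositing step: upscale only the masked region, then blend it back. Below is a minimal, framework-free sketch of that compositing logic – the function name, the box-blur feathering, and the parameters are our own illustrative choices, not the exact production code:

```python
import numpy as np

def composite_masked_region(original: np.ndarray,
                            restored: np.ndarray,
                            mask: np.ndarray,
                            feather: int = 8) -> np.ndarray:
    """Blend a restored (upscaled) region back into the original image.

    original, restored: float arrays of shape (H, W, 3) in [0, 1].
    mask: binary array of shape (H, W), 1 where the region was upscaled.
    feather: number of blur passes used to soften the mask edge.
    """
    soft = mask.astype(np.float32)
    # Cheap feathering: repeated cross-shaped box blur of the binary mask,
    # so the transition between restored and untouched pixels is gradual.
    for _ in range(feather):
        padded = np.pad(soft, 1, mode="edge")
        soft = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                padded[1:-1, :-2] + padded[1:-1, 2:] +
                padded[1:-1, 1:-1]) / 5.0
    soft = soft[..., None]  # broadcast the mask over the RGB channels
    return soft * restored + (1.0 - soft) * original
```

In the real pipeline the `restored` input comes from the Stable Diffusion ×4 Upscaler; everything outside the soft mask stays byte-identical to the source capture.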
Models that didn’t pass
While many models showed promise in certain aspects, they ultimately failed to meet our comprehensive standards for authentic visual preservation. Here are the models that fell short in critical areas.
1. BSRGAN
🧪 Blind SR model trained for noisy, low-res inputs
❌ Only works well with extremely small (64×64) images
❌ Hallucinates plastic textures
❌ Completely collapses on modern photo data
Verdict: Legacy research tool. Doesn’t generalize to large real-world images.
2. SR3 (Google Research)
🧬 Diffusion-based upscaling model designed for portraits
✅ Conceptually interesting: progressive denoising in pixel space
❌ Extremely slow
❌ Poor support for large image inputs (needs cropped faces)
❌ Overwrites natural edges and lighting with “diffusion blur”
Verdict: Groundbreaking paper, but fails practical, high-res Spitz upscaling.
3. LapSRN / BasicSR
📉 Pyramid-based super-resolution networks
✅ Fast inference
❌ Low sharpness ceiling
❌ Cannot preserve delicate fur detail
❌ Tends to soften textures – fails naturalism
Verdict: Historically important, but obsolete in modern pipelines.
4. Waifu2x
🎨 Designed for anime and line art
✅ Excellent at smoothing flat color regions
❌ Fails natural photos
❌ Blurs fine fur detail, smooths shadows unnaturally
Verdict: Good for stylized art. Useless for Spitz photos.


5. Upscayl
🖥️ Desktop GUI upscaler built on Real-ESRGAN models
✅ Easy to use, with Vulkan acceleration (no PyTorch needed)
✅ Delivers decent results on general photos
❌ No control over hallucination strength, tile blending, or architectural internals
❌ Not scriptable, not precision-grade for scientific workflows
Verdict: A strong consumer tool, but lacks the control and fidelity we require for research-grade upscaling.


Models that passed scientific scrutiny
These models stood apart through their consistent ability to preserve authentic visual information while delivering meaningful enhancements.
1. SwinIR
SwinIR is a transformer-based super-resolution network using a windowed attention architecture. Instead of hallucinating texture like GANs, SwinIR learns spatial context over large receptive fields.
Strengths:
- Preserves real texture – doesn’t guess
- Handles blurry or compressed input gracefully
- Works well with synthetic degradations
Challenges:
- Slow on large images (heavy architecture)
- High VRAM usage — not deployable on consumer GPUs for 4K+ tasks
- Still prone to striping artifacts if not tiled carefully
When we used it:
- For benchmark comparison
- As a ground truth validator – if a hallucinated image matched SwinIR in structure, we trusted it more
Verdict:
✅ Scientifically solid
⚠️ Operationally limited
💡 Not our final tool – but an essential research reference.
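The "tiled carefully" caveat above comes down to covering the image with overlapping windows so the blend can hide window boundaries. A minimal sketch of that tile-layout logic (the function name and defaults are assumptions for illustration):

```python
def tile_coords(height: int, width: int, tile: int = 256, overlap: int = 32):
    """Return (top, left, bottom, right) boxes that cover an image with
    overlapping tiles, so seams can be blended away after inference."""
    step = tile - overlap
    tops = list(range(0, max(height - tile, 0) + 1, step))
    lefts = list(range(0, max(width - tile, 0) + 1, step))
    # Make sure the final row/column of tiles reaches the image border.
    if tops[-1] + tile < height:
        tops.append(height - tile)
    if lefts[-1] + tile < width:
        lefts.append(width - tile)
    return [(t, l, min(t + tile, height), min(l + tile, width))
            for t in tops for l in lefts]
```

Each box is run through the model independently; the overlap regions are what the blending stage later crossfades.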


2. Real-ESRGAN
This is a GAN-based model built on the RRDBNet architecture. It is fast, powerful, and widely adopted. However, it is also known for aggressive hallucination and unnatural sharpness.
Problems in the vanilla model:
- Tends to sharpen outlines unrealistically
- Fur becomes “wiry” or too harsh
- Lighting gradients get corrupted
So we went deeper and added our own modifications (described here without the raw training code).
1. Architectural tweaks to RRDBNet
- Reduced layer depth in deep RRDB blocks
- Added channel attention gating to suppress false sharpness
- Modified upsampling layers to be less aggressive (downscaled weights)
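To make the gating idea concrete, here is a minimal, framework-free sketch of squeeze-and-excitation style channel attention – the kind of gate we mean. In the real model this sits inside the RRDB blocks as trained PyTorch layers; the function and weight shapes below are illustrative assumptions:

```python
import numpy as np

def channel_attention_gate(feat: np.ndarray,
                           w1: np.ndarray,
                           w2: np.ndarray) -> np.ndarray:
    """Gate a (C, H, W) feature map per channel.

    Channels the gate scores low get scaled down, damping the spurious
    high-frequency responses behind GAN over-sharpening.
    w1: (C//r, C) and w2: (C, C//r) are learned bottleneck weights.
    """
    squeeze = feat.mean(axis=(1, 2))              # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)        # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid -> per-channel scale
    return feat * gate[:, None, None]
```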
2. Controlling GAN hallucination
- We retrained on a soft-edged dataset
- Injected loss balancing between perceptual + L1 + texture-aware loss
- Result: less fake fur, more real softness.
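The loss balancing can be sketched as a weighted sum. The weights below and the gradient-matching “texture-aware” term are illustrative assumptions; the real training used a proper perceptual loss (e.g. VGG features), which is left as a pluggable callback here:

```python
import numpy as np

def balanced_loss(pred, target, w_l1=1.0, w_tex=0.1, w_perc=0.05, perc_fn=None):
    """Weighted sum of L1, a gradient-based texture term, and an optional
    perceptual term. pred/target: (H, W, C) float arrays."""
    l1 = np.abs(pred - target).mean()
    # Texture-aware term: match horizontal/vertical gradients, rewarding
    # real fine structure without rewarding invented sharp edges.
    gx = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    gy = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    perc = perc_fn(pred, target) if perc_fn is not None else 0.0
    return w_l1 * l1 + w_tex * (gx + gy) + w_perc * perc
```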
3. Tiling & half-precision
- Used the `--tile_pad` option to reduce seams.
- Enabled float16 precision for large-scale inference (roughly 4× faster).
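In practice, the tiling and precision settings map onto Real-ESRGAN’s inference script roughly like this – paths and tile sizes are placeholders, and flag names should be checked against your Real-ESRGAN version:

```shell
# Placeholder paths; flag names follow Real-ESRGAN's inference_realesrgan.py
python inference_realesrgan.py \
    -i spitz_input.png -o results/ \
    -n RealESRGAN_x4plus \
    --tile 256 --tile_pad 32    # padded overlapping tiles reduce seams
# Half precision (fp16) is the default on GPU; add --fp32 to disable it.
```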
Verdict:
✅ Powerful. Fast. Customizable.
⚠️ Needs internal modification to avoid artifacts.
👏 Great for controlled upscaling. Used in many real outputs.
3. Stable Diffusion ×4 Upscaler
This is a latent diffusion model trained to upscale via text-prompted denoising. It runs in a latent space, allowing stylized and highly detailed generation.
At first glance, this might sound dangerous – a model that hallucinates? But we didn’t use it blindly. We re-engineered the pipeline for precision control.
1. Full pipeline control
We broke the pipeline into three stages:
- VAE Encoder: converts the image into a latent tensor.
- SD Upscaler: injects latents midway into the UNet, then adds chunk-wise overlap (32 px) with crossfade smoothing.
- Custom Decoder: we replaced the default VAE decoder with vae-ft-mse-840000-ema-pruned (sharper). It preserves mid-frequency features and avoids smudging.
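A sketch of how stages one and three could be wired with Hugging Face diffusers. The model IDs are the public checkpoints (stabilityai/sd-vae-ft-mse is the diffusers release of the ft-MSE-840000 VAE); the mid-UNet latent injection and chunk overlap of stage two are omitted, and drop-in compatibility of the swapped decoder should be verified against your diffusers version. Imports are deferred so the function can be defined without the library installed:

```python
def build_upscale_pipeline(device: str = "cuda"):
    """Assemble the stock SD x4 upscaler with the default VAE decoder
    swapped for the sharper ft-MSE checkpoint (stages 1 and 3)."""
    import torch
    from diffusers import StableDiffusionUpscalePipeline, AutoencoderKL

    pipe = StableDiffusionUpscalePipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler",
        torch_dtype=torch.float16,
    )
    # Custom decoder: diffusers release of vae-ft-mse-840000-ema-pruned.
    pipe.vae = AutoencoderKL.from_pretrained(
        "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
    )
    return pipe.to(device)
```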
2. Region-aware prompt injection
Instead of one global prompt, we used tile-specific prompts:
- Top: “blue sky, soft clouds”
- Middle: “fluffy orange dog, Spitz fur detail”
- Bottom: “dark grass, high contrast shadows”
This let us inject different generation styles across tiles without losing coherence.
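The routing itself is simple: pick a prompt from the tile’s vertical position. A sketch, where the band boundaries are assumptions to adapt per image and the prompt strings mirror the example above:

```python
def prompt_for_tile(top: int, image_height: int, bands=None) -> str:
    """Return a tile-specific prompt based on the tile's top edge.

    bands: list of (upper_relative_bound, prompt) pairs, scanned in order.
    """
    if bands is None:
        bands = [
            (0.33, "blue sky, soft clouds"),
            (0.66, "fluffy orange dog, Spitz fur detail"),
            (1.01, "dark grass, high contrast shadows"),
        ]
    rel = top / image_height
    for limit, prompt in bands:
        if rel < limit:
            return prompt
    return bands[-1][1]
```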
3. Attention smoothing + Kornia blending
We used attention smoothing together with Kornia-based blending to fix seam artifacts:
- Added 32px tile overlap
- Used Kornia for gradient-based tile blending
- Result: no seams, no sharp transitions
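The core of the crossfade is a linear alpha ramp across the overlap. Below is a simplified stand-in for the gradient-based Kornia blending, shown for two vertically adjacent tiles (the function name and shapes are our own for this example):

```python
import numpy as np

def crossfade_rows(top_tile: np.ndarray, bottom_tile: np.ndarray,
                   overlap: int = 32) -> np.ndarray:
    """Blend two vertically adjacent tiles whose last/first `overlap`
    rows cover the same pixels. Tiles are (H, W, C) float arrays."""
    alpha = np.linspace(0.0, 1.0, overlap)[:, None, None]
    # Weight shifts linearly from the top tile to the bottom tile.
    blended = (1.0 - alpha) * top_tile[-overlap:] + alpha * bottom_tile[:overlap]
    return np.concatenate(
        [top_tile[:-overlap], blended, bottom_tile[overlap:]], axis=0)
```

The same ramp applied horizontally handles left/right neighbors, giving seam-free mosaics.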
Why it won:
- Hallucinations? Controlled.
- Detail? Balanced with softness.
- Resolution? ×4 native upscale, artifact-free.
- Custom decoder? Sharp without destruction.
Verdict:
✅ Most advanced.
✅ Fully controllable.
✅ Final production tool.


Summary comparison table
Below is a side-by-side comparison of all evaluated upscaling technologies. This table presents our test results across key scientific metrics, making it easy to see how each model performed.
| Model | Detail quality | Hallucination control | Realism | Speed | Modifiable? | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| BSRGAN | ❌ Weak | ❌ | ❌ | ✅ | ⚠️ | Rejected |
| SR3 | ⚠️ Decent | ⚠️ | ❌ | ❌ | ❌ | Rejected |
| Waifu2x | ⚠️ Stylized | ✅ | ❌ | ✅ | ❌ | Rejected |
| LapSRN | ❌ | ✅ | ❌ | ✅ | ❌ | Rejected |
| SwinIR | ✅ | ✅ | ✅ | ❌ | ⚠️ | Research use only |
| Real-ESRGAN (modded) | ✅ | ✅ | ✅ | ✅ | ✅ | Approved |
| SD x4 Upscaler (modded) | ✅✅ | ✅✅ | ✅✅ | ⚠️ | ✅✅ | Finalist |
Conclusion
This wasn’t a style experiment. It was a controlled scientific process built to solve one hard problem: upscale real-world photos – specifically Spitz portraits – without destroying the soul of the image.
We tried everything: GANs, transformers, diffusion. We rewired models, rebuilt pipelines, replaced decoders, smoothed tiles, and engineered prompts. The result is a tool set that’s surgical, modular, and ready for production.