In advanced AI model fine-tuning, the SDXL model excels at producing high-resolution images. However, with great power comes great demand for resources, particularly GPU memory. In this second part on mastering SDXL LoRA fine-tuning, we face the challenge of managing SDXL’s large memory needs.
GPU memory: The elephant in the room
SDXL typically trains at higher resolutions, such as 768×768 or 1024×1024. Even with a modest batch size of 1 or 2, you can easily be pushing 16–24 GB of VRAM.
Here are some key strategies to consider:
- Gradient accumulation: Keep batch size at 1, accumulate gradients for multiple steps before updating.
- Mixed precision (fp16, bf16): roughly halves memory for activations and many weights, but fp16 gradients can overflow or underflow (bf16 is more numerically stable).
- Offloading to CPU or disk: Slows down training but can keep you from crashing if your GPU is borderline.
- Focusing on specific modules: Instead of BFS hooking every Linear, maybe only target attention blocks to reduce parameter overhead.
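To make the first two strategies concrete, here is a minimal sketch of batch-size-1 gradient accumulation with fp16 autocasting in plain PyTorch. The tiny `nn.Linear` model and squared-output loss are toy stand-ins for the SDXL UNet and the diffusion loss; they are not part of the real pipeline.

```python
import torch
from torch import nn

# Toy stand-ins: in the real pipeline `model` is the SDXL UNet (with only
# LoRA params trainable) and the loss is the diffusion denoising loss.
model = nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4                                  # batch size 1, effective batch 4
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
init_weight = model.weight.detach().clone()      # to confirm updates happen

data = [torch.randn(1, 4) for _ in range(8)]
for step, x in enumerate(data):
    with torch.autocast("cuda", dtype=torch.float16, enabled=use_cuda):
        loss = model(x).pow(2).mean() / accum_steps  # average over accumulation
    scaler.scale(loss).backward()                # gradients accumulate across steps
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                   # fp16-safe optimizer step
        scaler.update()
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps the accumulated gradient equal to the average over the virtual batch, so the learning rate behaves as if you had trained with the larger batch directly.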
If your machine has a GPU with less than 16 GB VRAM, you’ll face constant out-of-memory errors at 768×768 or 1024×1024.
You can scale down to 512×512, but you lose the hallmark advantage of SDXL’s larger training images.
Combined workflow
Below is an expanded diagram that merges the best of both worlds: it shows how we imagine bridging DreamBooth-style LoRA with BFS hooking while accounting for SDXL’s two text encoders.
It’s not a single script yet, but a conceptual pipeline to unify them:
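In outline form, the conceptual pipeline looks like this (a sketch, not runnable code):

```
1. Load SDXL base + refiner, including both text encoders.
2. BFS over the UNet (and optionally the encoders), wrapping target
   nn.Linear layers with lora_down/lora_up; a visited set guards
   against shared or cyclic module references.
3. Build a DreamBooth-style dataset: instance images, partial masks,
   and an instance prompt.
4. Train with batch size 1, gradient accumulation, and mixed precision.
5. Save the LoRA state after each step for crash recovery.
6. At inference, run the base pass and the refiner pass with their
   respective encoders, both with LoRA active.
```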

Resolving specific roadblocks
Recursion error with BFS
During BFS hooking, a common stumbling block is modules that show up multiple times or cyclical references. Our fix:
- Maintain a global set (visited) of modules we’ve already processed.
- If we see the same module again, skip it.
- Carefully handle “named_children” vs. “children,” ensuring we don’t create new references inadvertently.
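A minimal sketch of that BFS with a visited set (class and function names are our own, and the `nn.Sequential` at the end is a toy model; the real script targets SDXL's UNet):

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with trainable lora_down/lora_up."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)   # start as an exact no-op

    def forward(self, x):
        return self.base(x) + self.lora_up(self.lora_down(x))

def inject_lora(root: nn.Module, rank: int = 4) -> int:
    """BFS over the module tree, replacing each nn.Linear exactly once."""
    visited = set()
    queue = [root]
    replaced = 0
    while queue:
        module = queue.pop(0)
        if id(module) in visited:             # shared/cyclic reference: skip
            continue
        visited.add(id(module))
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                setattr(module, name, LoRALinear(child, rank))
                replaced += 1
            else:
                queue.append(child)
    return replaced

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
x = torch.randn(1, 8)
y_before = model(x)
replaced = inject_lora(model, rank=4)
y_after = model(x)   # identical at init, because lora_up starts at zero
```

Using `named_children` lets us reassign the child via `setattr` on its parent, which is exactly the step that plain `children` iteration cannot do.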
The two text encoders
To fully leverage SDXL, you’d want:
- Encoder A: attach hooks or LoRA and enable them for the base generation pass.
- Encoder B: also attach LoRA, or at least handle it explicitly for the refiner stage; otherwise the second pass may ignore the specialized fur details.
This implies a bifurcated approach – either replicate BFS hooking for both encoders or replicate the DreamBooth-lora method for each.
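A sketch of the replication step, assuming a diffusers-style pipeline where the two encoders live at `pipe.text_encoder` and `pipe.text_encoder_2`. The `DummyPipe` and the Linear-counting stand-in for real LoRA injection are our own illustrations:

```python
from torch import nn

class DummyPipe:
    """Stand-in for StableDiffusionXLPipeline: just the two encoders."""
    def __init__(self):
        self.text_encoder = nn.Sequential(nn.Linear(16, 16))    # CLIP ViT-L
        self.text_encoder_2 = nn.Sequential(nn.Linear(32, 32))  # OpenCLIP ViT-bigG

def count_linears(encoder: nn.Module) -> int:
    """Toy stand-in for LoRA injection: count the Linears we would wrap."""
    return sum(isinstance(m, nn.Linear) for m in encoder.modules())

def adapt_both_encoders(pipe, inject_fn):
    """Apply the same injection to both encoders so the base pass and the
    refiner pass both see the adapted text embeddings."""
    counts = {}
    for name in ("text_encoder", "text_encoder_2"):
        encoder = getattr(pipe, name, None)
        if encoder is not None:          # some pipelines ship only one encoder
            counts[name] = inject_fn(encoder)
    return counts

print(adapt_both_encoders(DummyPipe(), count_linears))
# {'text_encoder': 1, 'text_encoder_2': 1}
```

The same loop works for the BFS-hooking variant: pass `inject_lora` instead of the counting stub.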
Dealing with lost gradients
If you see that your model does a forward pass but never actually updates your LoRA weights, confirm:
- You’re calling accelerator.backward(loss) (or loss.backward()) in the correct place.
- The LoRA parameters are not excluded by an accidental .requires_grad_(False).
- The forward hooks are returning a modified output, not reassigning it in an out-of-scope variable.
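A small diagnostic for the second point, listing any LoRA parameters that have been accidentally frozen (the helper name is ours):

```python
from torch import nn

def check_lora_trainable(model: nn.Module) -> list[str]:
    """Return names of LoRA parameters that will NOT receive gradients."""
    frozen = []
    for name, p in model.named_parameters():
        if "lora" in name and not p.requires_grad:
            frozen.append(name)
    return frozen

# Quick smoke test on a toy module with one accidentally frozen weight.
m = nn.Module()
m.lora_down = nn.Linear(4, 2, bias=False)
m.lora_up = nn.Linear(2, 4, bias=False)
m.lora_up.weight.requires_grad_(False)   # simulate the accidental freeze
print(check_lora_trainable(m))           # ['lora_up.weight']
```

Running this right before building the optimizer catches the problem earlier than staring at flat loss curves does.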
10-Step code adaptations
Despite hitting severe memory limits, we refused to settle for quick excuses. Here are the 10 major things we actually did to lighten the load and push SDXL-based inpainting as far as possible:
1. BFS hooking to inject LoRA
We systematically traversed the module tree and attached lora_down and lora_up submodules to every nn.Linear layer, without missing important blocks.
A visited set avoided recursion loops, ensuring a stable BFS traversal.
2. Rank reduction
We lowered LoRA rank (e.g., 8, 4, etc.) to reduce trainable parameters. This was meant to cut VRAM usage.
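The saving is easy to quantify: one LoRA pair adds rank × (in + out) trainable weights per wrapped Linear. A quick sketch (the 1280-wide projection is just an illustrative size):

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Trainable weights for one lora_down/lora_up pair (no biases)."""
    return rank * (in_features + out_features)

# An illustrative 1280-wide projection at two ranks:
print(lora_param_count(1280, 1280, 16))  # 40960
print(lora_param_count(1280, 1280, 4))   # 10240 -- rank 4 cuts it 4x
```

Note that rank mostly shrinks the trainable parameters and optimizer state; activation memory, which dominates at 1024×1024, is largely unaffected, which is why rank reduction alone wasn't enough.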
3. Mixed precision (FP16)
Halved floating-point storage where possible, hoping to fit all the computations into ~16–20 GB of VRAM.
4. Batch size=1 + gradient accumulation
We used the smallest viable batch size – 1 – and accumulated gradients over multiple steps to minimize memory spikes.
5. One-step training tests
We proved we can run exactly 1 step with a single image without crashing. This might seem trivial, but it reveals the borderline memory constraints at 1024×1024 with SDXL.
6. Partial freezing
We specifically froze everything except LoRA layers. We did not attempt full model training, which would be even larger. This still wasn’t enough for multistep runs on typical paid Colab.
7. Testing different resolutions
We tried smaller resolutions (768×768, etc.). It helps, but still tends to blow up memory quickly in multistep or multi-image scenarios.
8. Advanced masking techniques
Our custom MaskedDogDataset merges multiple partial masks for face, tail, chest, etc. This approach is more complex but crucial for domain-specific fur inpainting.
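We can't reproduce the full MaskedDogDataset here, but its core merge step might look like this sketch (the function name and tensor shapes are our own; 1 marks the region to repaint):

```python
import torch

def merge_masks(masks: list[torch.Tensor]) -> torch.Tensor:
    """Union of several partial binary masks (face, tail, chest, ...)
    into one inpainting mask; 1 = region to repaint."""
    merged = torch.zeros_like(masks[0])
    for m in masks:
        merged = torch.maximum(merged, m)
    return merged.clamp(0, 1)

# Two toy partial masks on an 8x8 grid:
face = torch.zeros(1, 8, 8); face[:, :3, :3] = 1
tail = torch.zeros(1, 8, 8); tail[:, 5:, 5:] = 1
mask = merge_masks([face, tail])
```

Taking the element-wise maximum (rather than summing) keeps overlapping regions binary, so downstream mask-weighted losses stay well-behaved.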
9. Checkpoint saving & recovery
We wrote code to save partial LoRA state after each step. That way, if the environment crashed, we wouldn’t lose everything. This is a big step in practical training management.
10. Various prompt & negative prompt tweaks
We tested changes to guidance_scale, negative prompts for stylization/artifact removal, and explicit “fur detail” prompts. While important for final quality, it couldn’t bypass the VRAM overhead.
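For context on what guidance_scale actually controls: it is the weight in classifier-free guidance, which extrapolates the prediction away from the negative-prompt (unconditional) branch. A minimal sketch of the arithmetic:

```python
import torch

def apply_cfg(noise_uncond: torch.Tensor, noise_cond: torch.Tensor,
              guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance: uncond + s * (cond - uncond)."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

u, c = torch.zeros(2), torch.ones(2)
print(apply_cfg(u, c, 7.5))   # tensor([7.5000, 7.5000])
```

This is why a stronger negative prompt can suppress artifacts: the guided prediction is pushed directly away from whatever the negative branch produces.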
Staying with SDXL
Despite the resource pain, we’ve chosen not to revert to Stable Diffusion 1.5 for inpainting. Why?
- Quality & resolution: SDXL’s improved capabilities for large, high-fidelity images are central to realistic fur texture.
- Refiner stage: SDXL includes an optional second pass for detail. Dropping to SD 1.5 would lose that advantage, so we’re holding out for better solutions or bigger hardware.
We decided to keep working with LoRA, but to change the way we wrap the encoders.
By the time we ran into Colab’s RAM limits, we already had an idea of how to launch our model.
After implementing our custom hooking to handle SDXL’s two-encoder system, we managed to train a specialized LoRA for black-and-white Spitz fur without fully rewriting the entire pipeline.
While we didn’t modify every encoder stage, our partial approach was enough to deliver notable improvements over baseline SDXL in several areas:
1. Enhanced fur detail
By selectively updating the UNet we injected new knowledge about black-and-white fur patterns into the model. The screenshots show crisp hair edges and more realistic texture, even in tricky lighting conditions.
2. Higher-resolution inpainting
Compared to using plain SDXL with no extra LoRA layers, our approach better preserves small details like whiskers or subtle fur transitions. Inpainting over partial masks, especially around the face or paws, showed smoother blends.
3. Moderate VRAM footprint
Although SDXL is large, focusing on only a subset of layers (rather than hooking every Linear module) helped us stay within Colab’s memory limits – especially at 768×768 resolution.
4. Fewer artifacts
With domain-specific LoRA training, the model’s final output had fewer random color smudges or texture distortions. This was particularly noticeable along the transitions between white and black fur.

Image without LoRA

Image with LoRA

Inpainting without LoRA

Inpainting with LoRA
Overall, our modified LoRA approach shows promising results for domain-specific tasks like realistic fur inpainting without fully fine-tuning the entire SDXL model. By carefully balancing partial hooking, smaller ranks, and memory-friendly settings, we found a sweet spot that yields visually striking images while staying within hardware limits.
Continuing research
The strategies we’ve explored here not only solve immediate problems but also pave the way for more accessible and efficient high-quality image generation and manipulation. The journey of optimizing and fine-tuning large models like SDXL is far from over.
Here’s what we’d like to explore next:
- Textual inversion for fur: another way to increase fur quality is textual inversion; we want to integrate and adapt it for SDXL.
- Multi-pass inpainting: repeated smaller inpainting passes (or tiling) might be combined with advanced post-upsampling for better fur edges.
