In our journey to perfecting black-and-white Spitz fur inpainting, we’ve taken on baseline inpainting, dimension fixes, ControlNet guidance, LoRA-based domain adaptation, and other partial solutions for the giant that is SDXL.
Textual Inversion is another method that can be used alongside LoRA or instead of it. While LoRA injects new trainable weights into specific parts of the model to capture details like a unique fur pattern, Textual Inversion works by changing token embeddings – the vectors that give words their meaning to the model. This lets us teach the AI new concepts or refine existing ones.
Why Textual Inversion as an alternative?
Lightweight embeddings
Textual Inversion focuses on training or injecting small embedding vectors that correspond to custom tokens. Instead of modifying entire model layers as LoRA does, you primarily alter how the token (e.g., <dog_fur_style>) is interpreted by the text encoders. This can require fewer parameters than a LoRA approach, making it easier to train on modest hardware.
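The core mechanic can be sketched in plain PyTorch (a minimal, self-contained sketch with toy dimensions – a real run trains against the diffusion loss, not the dummy objective used here): only the new token's embedding row receives gradient updates, while every other row stays frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a text encoder's embedding table: 49408 original tokens
# plus one new row for <dog_fur_style> (row 49408), embedding dimension 768.
vocab_size, dim, new_token_id = 49409, 768, 49408
embedding = nn.Embedding(vocab_size, dim)

# Mask gradients so ONLY the new token's row is trainable.
grad_mask = torch.zeros(vocab_size, 1)
grad_mask[new_token_id] = 1.0
embedding.weight.register_hook(lambda grad: grad * grad_mask)

# weight_decay=0.0 matters here: decoupled decay would otherwise still
# shift the frozen rows even though their gradients are zero.
optimizer = torch.optim.AdamW(embedding.parameters(), lr=5e-3, weight_decay=0.0)

frozen_before = embedding.weight.data[:new_token_id].clone()
new_before = embedding.weight.data[new_token_id].clone()
target = torch.randn(dim)  # dummy objective standing in for the diffusion loss

for _ in range(10):
    optimizer.zero_grad()
    loss = F.mse_loss(embedding(torch.tensor([new_token_id]))[0], target)
    loss.backward()
    optimizer.step()

# All original rows are untouched; only <dog_fur_style>'s vector moved.
assert torch.equal(embedding.weight.data[:new_token_id], frozen_before)
```

The handful of trainable values (here 768 floats, ~2,000 for SDXL's two encoders combined) is why this fits on modest hardware.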
Specific niche knowledge
For tasks like black-and-white fur inpainting, details matter: the directional flow of fur, subtle grayscale transitions, and the texture patterns of a Spitz coat. By customizing the embedding for <dog_fur_style>, we can teach SDXL’s text encoder how to interpret this specialized concept.
Seamless integration with existing pipelines
Textual Inversion typically integrates well with the standard Stable Diffusion pipeline: you add a token, load the embedding, and let the pipeline decode its meaning during inference. This means you can switch from baseline to specialized style by simply adding <dog_fur_style> to your prompt.
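That load-and-prompt flow can be sketched with diffusers' `load_textual_inversion` API (a hedged sketch: the helper name and the embedding file layout are assumptions – SDXL Textual Inversion files conventionally carry one vector per encoder under "clip_l" and "clip_g" keys):

```python
def load_fur_embedding(pipe, state, token="<dog_fur_style>"):
    """Register a trained SDXL Textual Inversion embedding with BOTH text
    encoders. `state` is the loaded embedding file: SDXL embeddings
    conventionally store one vector per encoder under the keys
    "clip_l" (CLIP ViT-L) and "clip_g" (OpenCLIP ViT-bigG)."""
    pipe.load_textual_inversion(state["clip_l"], token=token,
                                text_encoder=pipe.text_encoder,
                                tokenizer=pipe.tokenizer)
    pipe.load_textual_inversion(state["clip_g"], token=token,
                                text_encoder=pipe.text_encoder_2,
                                tokenizer=pipe.tokenizer_2)

# Typical usage (the file name is illustrative):
#   from safetensors.torch import load_file
#   load_fur_embedding(pipe, load_file("dog_fur_style.safetensors"))
#   image = pipe(prompt="a Spitz with <dog_fur_style> fur", ...).images[0]
```

After this, switching from baseline to specialized style really is just a prompt edit.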
The SDXL two-encoder twist
Double text encoders
As discussed in our previous article, SDXL uses two text encoders – CLIP ViT-L and the larger OpenCLIP ViT-bigG – whose outputs are combined to condition the diffusion process (the refiner stage reuses the larger one). Both affect how tokens are interpreted. For effective Textual Inversion in SDXL, you need to ensure that both text encoders understand the new token.
Embedding mismatch issues
When you add a new token like <dog_fur_style> to one encoder, you must also add it to the other to avoid shape mismatches. Our code example (see snippet below) does exactly that – expanding each tokenizer’s vocabulary and resizing the relevant embeddings.

Consistent token handling
If one encoder knows <dog_fur_style> but the other does not, you’ll get partial learning or confusing results. Our approach modifies both embedding tables, ensuring consistent recognition of the Spitz fur token across the entire pipeline.
BFS Hooking, recap
In the LoRA domain, “BFS hooking” means systematically traversing the entire model’s architecture – especially the large SDXL UNet – to attach LoRA layers or other modifications to each nn.Linear module. This ensures you don’t miss relevant modules for cross-attention, feed-forward, or skip connections that might matter for specialized tasks like fur detailing.
However, BFS Hooking can be:
- Powerful: Covers every corner of the model.
- Complex: heavy on memory, and prone to recursion or duplication errors if not handled carefully.
Textual Inversion’s alternative
With Textual Inversion, the concept injection primarily lives in the embedding layers, rather than hooking every linear transform. This can simplify memory requirements and reduce the risk of unintentional model bloat.
Still, for extremely nuanced tasks like very high-resolution inpainting or advanced structural transformations, LoRA hooking can do more heavy lifting than just adjusting the embeddings.
Below is a high-level walkthrough (similar to our previous snippet, but focused on embedding modifications). Notice how we:
- Load the pipeline: We pick AutoPipelineForInpainting for an inpainting scenario.
- Insert our custom tokens into both text encoders.
- Resize embeddings for each text encoder, ensuring shapes match the newly inserted token.
- Assign the new embedding vectors to the correct token index.
Note: The snippet below complements our BFS Hooking approach for LoRA. If you prefer a purely Textual Inversion method, you can skip hooking the UNet layers entirely and rely on the custom token embedding alone.
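The token-insertion steps can be sketched like this (a minimal sketch assuming the diffusers/transformers APIs named in the comments; the initializer token "fur" and the helper name are illustrative choices):

```python
def add_fur_token(pipe, token="<dog_fur_style>", init_token="fur"):
    """Insert a custom token into BOTH SDXL text encoders and resize their
    embedding tables to match, initializing the new vector from an existing
    concept so training starts from a sensible point."""
    for tokenizer, encoder in [
        (pipe.tokenizer, pipe.text_encoder),      # CLIP ViT-L
        (pipe.tokenizer_2, pipe.text_encoder_2),  # OpenCLIP ViT-bigG
    ]:
        # 1. Expand the tokenizer's vocabulary with the custom token.
        tokenizer.add_tokens(token)
        # 2. Resize the encoder's embedding matrix to the new vocab size.
        encoder.resize_token_embeddings(len(tokenizer))
        # 3. Assign the new row, copying an existing embedding as its init.
        token_id = tokenizer.convert_tokens_to_ids(token)
        init_id = tokenizer.convert_tokens_to_ids(init_token)
        emb = encoder.get_input_embeddings()
        emb.weight.data[token_id] = emb.weight.data[init_id].clone()
    return pipe

# Usage (assumes the diffusers SDXL inpainting checkpoint):
#   import torch
#   from diffusers import AutoPipelineForInpainting
#   pipe = AutoPipelineForInpainting.from_pretrained(
#       "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
#       torch_dtype=torch.float16,
#   ).to("cuda")
#   add_fur_token(pipe)
```

Because both tokenizer/encoder pairs pass through the same loop, the token gets the same textual identity everywhere, even though the two encoders use different embedding widths (768 vs. 1280).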

This snippet ties directly into our bigger SDXL inpainting pipeline, giving you domain-specific style or fur knowledge without hooking every layer.
Memory and VRAM considerations
- Mixed precision: Combining Textual Inversion with half-precision (fp16) is typically smoother than hooking every linear layer in LoRA.
- Batch size: If you’re training new token embeddings from scratch, you can often get away with a larger batch size than a BFS Hooking scenario, because there are fewer trainable parameters.
- Resolution: SDXL is designed for higher resolutions (768×768, 1024×1024). Textual Inversion can handle these sizes relatively well, but watch your GPU usage if you also add BFS Hooking or full LoRA on top.
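A back-of-envelope comparison makes the batch-size point concrete (illustrative figures, using SDXL's typical embedding widths of 768 and 1280 and a rank-8 LoRA on a single attention projection):

```python
# Why Textual Inversion is light on VRAM: count the trainable parameters.
ti_params = 768 + 1280  # one new embedding row per SDXL text encoder

# A single rank-8 LoRA on one 1280x1280 projection already trains more:
rank, d = 8, 1280
lora_params_one_layer = rank * d + d * rank  # down-projection + up-projection

print(ti_params)              # 2048
print(lora_params_one_layer)  # 20480 – and the UNet has many such layers
```

Fewer trainable parameters means smaller optimizer state and gradients, which is exactly the headroom you can spend on batch size.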
Results and observations
When we tested Textual Inversion for black-and-white Spitz fur, we observed:
- Sharper fur striations: Fine, whisker-like details around the muzzle and ears looked crisp.
- Natural color transitions: Because the new embeddings understood the contrast between black and white fur, the boundary region had fewer artifacts than baseline.
- Efficiency gains: Compared to BFS Hooking, training was lighter on VRAM and simpler to debug – fewer hooking complexities or recursion issues.

Without Textual Inversion

Final result with Textual Inversion
However, LoRA still holds advantages if you need deep structural changes or if you want to modify cross-attention layers for more dramatic stylization. Textual Inversion is typically more subtle and prompt-driven.
Prompt magic with LoRA and Inversion
We’re not done yet! The real power emerges when you:
- Combine LoRA with Textual Inversion for full-stack domain adaptation: LoRA modifies the model’s internal transformations, while Textual Inversion modifies token embeddings.
- Craft complex prompts that leverage both special tokens and tuned cross-attention.
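A hedged sketch of that full stack, assuming diffusers' `load_lora_weights` / `load_textual_inversion` APIs (the helper name, file paths, and prompt are hypothetical):

```python
def apply_full_stack(pipe, lora_path, ti_state, token="<dog_fur_style>"):
    """Layer both adaptations: LoRA rewires internal transformations, while
    Textual Inversion registers the custom token with both SDXL encoders."""
    pipe.load_lora_weights(lora_path)  # e.g. "spitz_fur_lora.safetensors"
    pipe.load_textual_inversion(ti_state["clip_l"], token=token,
                                text_encoder=pipe.text_encoder,
                                tokenizer=pipe.tokenizer)
    pipe.load_textual_inversion(ti_state["clip_g"], token=token,
                                text_encoder=pipe.text_encoder_2,
                                tokenizer=pipe.tokenizer_2)
    return pipe

# The prompt can then lean on both adaptations at once:
prompt = ("inpaint the masked region with <dog_fur_style> black-and-white "
          "Spitz fur, crisp guard hairs, smooth grayscale transitions")
```

Loading the LoRA first, then the embeddings, keeps the token registration valid regardless of how the LoRA touches the UNet, since the two live in disjoint parts of the model.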

Inpainting with Textual Inversion for the dog’s body
Next stop: We’ll reveal how prompt engineering can bring together these two methods, giving you the ultimate flexibility to produce high-fidelity, domain-tailored images.
We’ll explore specialized prompting for black-and-white Spitz fur, layering <dog_fur_style> tokens over a LoRA-charged model, and highlight best practices for balancing memory usage and image fidelity.
Stay tuned – our ultimate goal is to produce the perfect, high-resolution, domain-specific inpainting for anything from puppy ears to advanced full-body Spitz transformations.
