Enhancing Fine-Detail Preservation in a SegGPT + SAM Hybrid Segmentation Pipeline

Deep segmentation models have become very flexible thanks to new foundation models. SAM (Segment Anything Model) can be prompted to segment any object and works well without specific training (zero-shot).

These are the results we got using only SegGPT (above) and its combination with SAM (below):

SegGPT is a general segmentation model that learns from context to handle many segmentation tasks (objects, parts, contours, etc.) using a single transformer-based architecture. By using these models together, you can take advantage of SAM’s strong mask generation and SegGPT’s ability to understand context.

However, a major problem in segmentation is the preservation of fine details. Things like animal fur strands or thin object structures often get lost in large-scale models.

SAM’s output masks lack fine-grained details and precise boundaries because its vision encoder tends to smooth out detailed features. Even improved versions of SAM (like HQ-SAM and Pi-SAM) that use prompt tuning or high-resolution decoders only partly fix this problem and still struggle with very precise outlining.

Instead of changing model designs or retraining with special data, we focus on an inference-time strategy to improve detailed segmentation. We want to fix the “over-refinement” problem – where masks are too tight or smooth and miss thin extensions or small parts – and recover granular structures from the image. This means rethinking how inference is run, not how the models are built.

We draw inspiration from several advanced techniques: combining multiple masks, processing at different scales, test-time augmentation (TTA), and traditional image processing improvements.

In this article, we present a SegGPT + SAM hybrid pipeline that combines these techniques into one system for preserving fine details. Let’s get into it.

Methodology

Our hybrid pipeline works as a series of processing steps that gradually improve segmentation results, as shown in Figure 1. We start with initial masks from SegGPT and SAM, then use several enhancements to recover fine details. The main idea is to merge the strengths of both models and augment the inference process so that it captures information at different scales and preserves edge detail.

On the left are the input images; the middle column shows predictions at 0.5× scale (lower resolution), and the right column shows predictions at 2.0× scale (higher resolution). Thin structures, such as the posts and streetlamps, are missed or broken in the low-scale prediction, but are captured much better when the image is processed at 2× scale. This demonstrates how scaling affects fine-detail segmentation and motivates our multiscale inference strategy.

Figure 1: Multi-scale segmentation results on a street scene (from NVIDIA’s semantic segmentation study).

1. Mask generation and initial merging

SegGPT initial segmentation

First, we get a basic segmentation of the scene using SegGPT’s in-context inference. SegGPT creates a mask (or set of masks) that covers the objects or regions we’re interested in.

This first output gives a coarse but context-aware division of the image – making sure that even small or camouflaged parts are at least roughly identified thanks to the model’s broad understanding. However, the boundaries from SegGPT might not be very precise or detailed because of its generalist design.

SAM refinement with multi-prompting

Next, we refine these initial masks using SAM, which is good at making accurate boundaries when given proper prompts. Each region found by SegGPT is sent to SAM as a prompt – for example, by giving SAM the bounding box or outline of the SegGPT mask as a starting point.

To improve this refinement, we use a multi-prompt strategy. From each coarse mask, we create different prompts:

(a) points along the object boundary (both inside and outside) to guide SAM toward the true edge,
(b) a tight bounding box around the region, and
(c) the coarse mask itself as a soft prompt (SAM can take mask input to refine an existing mask).

These prompts complement one another and compensate for the weaknesses of any single prompt, so SAM is less likely to miss thin extensions or holes.
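As a rough illustration of how prompts (a) and (b) could be derived from a coarse mask, here is a minimal NumPy sketch. The function name `prompts_from_mask` and its parameters (`n_points`, `pad`, `seed`) are our own illustrative choices, not part of SAM or SegGPT. For simplicity it samples positive points anywhere inside the mask and negative points from the background inside the padded box, rather than strictly along the boundary. It does not call SAM; it only prepares arrays in the (x, y) point and XYXY box layout that, as far as we know, SAM's reference predictor accepts.

```python
import numpy as np

def prompts_from_mask(mask, n_points=4, pad=2, seed=0):
    """Derive a box prompt and positive/negative point prompts from a coarse mask.

    Hypothetical helper: returns (box, points, labels) where box is
    (x0, y0, x1, y1), points are (x, y) pairs, and labels are 1 for
    foreground points and 0 for background points.
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")

    # Tight bounding box, padded slightly and clipped to the image.
    x0, y0 = max(xs.min() - pad, 0), max(ys.min() - pad, 0)
    x1, y1 = min(xs.max() + pad, w - 1), min(ys.max() + pad, h - 1)
    box = np.array([x0, y0, x1, y1])

    rng = np.random.default_rng(seed)

    # Positive points: sampled from pixels inside the coarse mask.
    idx = rng.choice(ys.size, size=min(n_points, ys.size), replace=False)
    pos = np.stack([xs[idx], ys[idx]], axis=1)

    # Negative points: sampled from background pixels inside the padded box.
    bys, bxs = np.nonzero(~mask[y0:y1 + 1, x0:x1 + 1])
    k = min(n_points, bys.size)
    idx = rng.choice(bys.size, size=k, replace=False) if k else np.array([], dtype=int)
    neg = np.stack([bxs[idx] + x0, bys[idx] + y0], axis=1)

    points = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos), dtype=int),
                             np.zeros(len(neg), dtype=int)])
    return box, points, labels
```

In a split-then-merge setup, a helper like this would be called once per connected component, so that each object gets its own box and point set.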

For scenes with multiple objects, we use a ‘split-then-merge’ approach: each object or connected component is processed separately through SAM (to avoid confusion between multiple objects), and then the results are combined. This prevents SAM from wrongly merging separate objects or missing a smaller object next to a larger one.

Mask merging

After SAM processes each region, we combine the refined masks with the original SegGPT mask through logical merging operations. Basically, we take the union of the two masks to keep all details: any pixel labeled as object by either SegGPT or SAM stays in the merged mask.

The reason is that if SAM’s refinement accidentally cut off a fine detail (like a tuft of fur or a thin limb), that detail might still exist in SegGPT’s broader mask; a union will keep it. Similarly, if SegGPT missed a small region that SAM found (maybe because of SegGPT’s limited resolution in that area), the union includes it too.

Benčević uses a similar union-based merging in their two-stage “segment-then-segment” framework, where coarse and fine segmentations combine to make the final mask. By merging masks this way, we err on the side of recall for fine structures – making sure no small component detected by either model is left out. Later steps will handle smoothing of any noisy parts introduced by this union.
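The union merge described above comes down to a pixel-wise logical OR. The sketch below assumes both masks are already binary and aligned to the same image grid; the function name `merge_union` is ours.

```python
import numpy as np

def merge_union(seggpt_mask, sam_mask):
    """Pixel-wise union: keep any pixel that either model labels as object."""
    a = np.asarray(seggpt_mask, dtype=bool)
    b = np.asarray(sam_mask, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("masks must share the same shape")
    return a | b

# Toy example: SAM trims a thin one-pixel "limb" that SegGPT's mask still has.
seggpt = np.zeros((5, 7), dtype=bool)
seggpt[1:4, 1:4] = True      # body
seggpt[2, 4:7] = True        # thin limb
sam = np.zeros((5, 7), dtype=bool)
sam[1:4, 1:4] = True         # body only, limb cut off
merged = merge_union(seggpt, sam)
```

Because the union can only add pixels, any false positive from either model also survives, which is why the later clean-up steps matter.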

2. High-resolution refinement passes

One of the best ways to capture fine details is to analyze the image at higher resolution. Many segmentation models (including SAM) work on a resized version of the image (e.g., SAM resizes its input so that the longest side is 1024 pixels).

As a result, very thin or tiny structures may only cover a few pixels and get lost. To fix this, our pipeline performs targeted high-resolution refinement.

Tiled or focused upsampling

For each object mask (after the initial merging), we crop a region from the original image that tightly surrounds that object and upsample it to a larger size before feeding it to SAM. By increasing the pixel density on the object, features like fur strands or narrow appendages become more visible to the model.

This is similar to cropping strategies used in ultra high-resolution segmentation tasks – a coarse segmentation finds regions of interest, which are then processed in full detail. We include a small padding around the crop (to preserve context) so SAM still sees some background for correct boundary detection. SAM is then prompted (with the same multipoint and box prompts as before) on this high-res crop.

Because the model now sees a zoomed-in view, it often recovers fine structures that were missed before. For example, individual fur strands that were narrower than a pixel once the full image had been resized for the model may now be segmented as thin wisps in the high-res crop.

We replace the corresponding region in the global mask with this refined high-res mask (converting coordinates back accordingly). This local high-res refinement is applied to each object or unclear region one by one, creating an overall mask that’s much more detailed, especially around important small structures.
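The crop, upsample, segment, and paste-back loop can be sketched as follows. Several parts are assumptions made for illustration: `segment_fn` is a hypothetical stand-in for a prompted SAM call on the crop, upsampling is nearest-neighbour by an integer factor (a real pipeline would more likely use bicubic interpolation), and the refined mask is brought back to the original resolution by marking a pixel as foreground if any of its sub-pixels is foreground, in line with our recall-first preference.

```python
import numpy as np

def refine_region(image, mask, segment_fn, scale=2, pad=8):
    """Re-segment the padded bounding box of `mask` at `scale`x resolution.

    `segment_fn` is a hypothetical callable that takes the upsampled crop
    and returns a binary mask with the same height and width as the crop.
    """
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return mask.copy()
    h, w = mask.shape

    # Padded crop window (half-open ranges), clipped to the image.
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad + 1, h)
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad + 1, w)
    crop = image[y0:y1, x0:x1]

    # Integer-factor nearest-neighbour upsampling.
    big = crop.repeat(scale, axis=0).repeat(scale, axis=1)
    big_mask = np.asarray(segment_fn(big), dtype=bool)
    ch, cw = y1 - y0, x1 - x0
    if big_mask.shape != (ch * scale, cw * scale):
        raise ValueError("segment_fn must return a mask matching the crop size")

    # Back to original resolution: foreground if any sub-pixel is foreground.
    small = big_mask.reshape(ch, scale, cw, scale).any(axis=(1, 3))

    # Replace the corresponding window in the global mask.
    out = mask.copy()
    out[y0:y1, x0:x1] = small
    return out
```

Running this once per object, with a padding large enough to give the model some background context, mirrors the one-by-one refinement described above.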

This approach doesn’t change any model weights – it just gives SAM a better look at each area. It’s like how human annotators zoom into an image to label fine details.

Our experiments and past research confirm the benefit: processing an image at 2× scale gives much better segmentation of thin objects (compare the 2.0× vs 0.5× predictions in Figure 1, where posts and thin poles are much more complete). By adding this to the pipeline, we capture details that single-pass segmentation would miss.

3. Multiscale ensemble inference

We use a multiscale inference strategy to get the benefits of different image scales. In segmentation, no one scale is perfect for all structures: small features are clearer at larger scales, while very large or spread-out regions may be segmented more consistently at lower scales.

So we run the SegGPT+SAM pipeline at multiple scales of the input and then combine the results:

Downscaled context pass

First, we run the pipeline on a smaller version of the image (e.g., 50% of original size). This tends to smooth out noise and focus on the most obvious structures. Large objects (like an animal’s body) are often well-captured at this scale, and some false tiny regions might disappear. The result is a mask that gets the overall shape right, but likely misses fine details.

Original scale pass

Next, we run the pipeline at the original image resolution for a balanced result. This is our main result before high-res refinement.

Upscaled fine pass

Finally, we run an upsampled version of the entire image (e.g., 1.5× or 2× larger than original) through the pipeline. This takes more computing power, so it might be done only on important images or in a limited way (for example, focusing only on areas of interest). The high-scale pass is great at capturing ultra-fine segments – like individual blades of grass or fur edges that lower scales might treat as texture.

Each of these scale-specific segmentations creates a binary mask (or set of masks). After resizing each mask back to the original resolution, we ensemble them to make the final mask. Two simple options are a pixel-wise majority vote, where a pixel is kept if at least two of the three scales label it as object, and a union, where a pixel is kept if any scale labels it as object.

In practice, we prefer inclusion (similar to earlier merging) – basically a union across scales – to keep as many details as possible. The multiscale ensemble gives a mask that combines the strengths of each scale: the continuity and completeness of large structures from the coarse scale, plus the intricate details from the fine scale.
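The two ensembling rules differ only in the vote threshold, as the following sketch shows. It assumes the per-scale masks have already been resized to a common resolution; the function name `ensemble_masks` and the `mode` parameter are illustrative.

```python
import numpy as np

def ensemble_masks(masks, mode="union"):
    """Combine same-sized binary masks by union or strict majority vote."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    votes = stack.sum(axis=0)          # number of masks marking each pixel
    if mode == "union":
        return votes >= 1              # any scale is enough
    if mode == "majority":
        return votes * 2 > len(stack)  # more than half of the scales
    raise ValueError(f"unknown mode: {mode!r}")

# Toy masks from the 0.5x, 1x and 2x passes (already resized to one grid).
low  = np.array([[1, 1, 0], [0, 0, 0]], dtype=bool)
orig = np.array([[1, 0, 0], [0, 1, 0]], dtype=bool)
high = np.array([[1, 1, 0], [0, 0, 1]], dtype=bool)

union = ensemble_masks([low, orig, high], mode="union")
majority = ensemble_masks([low, orig, high], mode="majority")
```

In the toy example, the two pixels seen by only one scale survive the union but not the majority vote, which is exactly the recall versus precision choice discussed above.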

4. Test-time augmentation (TTA) for robust detail

We use test-time augmentation to further reduce any missing fine structures. TTA involves applying transformations to the input image, segmenting each transformed version, and then combining the results.

Geometric augmentations

We create augmented copies of the image by flipping horizontally, flipping vertically, and rotating (e.g., 90°). Each augmented image goes through the full SegGPT+SAM pipeline (at the original scale, for efficiency).

Flips and rotations can reveal details that might be missed due to model bias or artifacts in one orientation. For example, a nearly horizontal thin branch might be better segmented when it appears vertical in a rotated image (aligning differently with the model’s learned filters).

Random shifts

We also apply small random translations or “circular shifts” to the image before segmentation. This technique, recently used in medical image segmentation with SAM, addresses alignment sensitivities.

By shifting the image a few pixels in various directions, we simulate slight changes in object positioning. SAM (and SegGPT) may respond by sometimes capturing a thin detail in one shift that it missed in another. We generate N such random shifts (e.g., N=4 or 6) and segment each.

After getting masks from these augmented inputs, we combine the masks in the original image coordinate space by reversing the transformations (flipping back, rotating back, etc.). The combination can be done by averaging the mask probabilities (if available) or by majority voting on each pixel across all augmentations.

Basically, a pixel is labeled as an object if the majority of augmented runs agree on it, or if any high-confidence prediction includes it. This ensemble of augmentations usually creates a more robust segmentation that is less likely to miss fine pieces: as noted by Nazzal, “the method generates several input variations during inference that are combined…, improving robustness”.

Test-time augmentation helps fill in holes or add thin segments that a single run might wrongly drop, complementing our multiscale approach. The downside is extra computation, but since we don’t retrain models, this is an acceptable trade-off for better quality.
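A compact way to picture the TTA loop is a list of (forward, inverse) transform pairs: each forward transform is applied to the image, the prediction is mapped back with the inverse, and the aligned predictions are averaged. The sketch below is a simplified illustration. `segment_fn` is a hypothetical stand-in for the full pipeline, the shift offsets are arbitrary, and circular shifts are implemented with `np.roll`, which wraps pixels around the image border.

```python
import numpy as np

def _shift(dy, dx):
    """Circular shift by (dy, dx) pixels and its inverse."""
    return (lambda a: np.roll(a, (dy, dx), axis=(0, 1)),
            lambda m: np.roll(m, (-dy, -dx), axis=(0, 1)))

def tta_segment(image, segment_fn, shifts=((0, 3), (3, 0))):
    """Average predictions over flips, a 90-degree rotation and circular shifts.

    Returns (mask, prob): `prob` is the fraction of augmented runs that
    labelled each pixel as object, `mask` is the strict-majority vote.
    """
    transforms = [
        (lambda a: a, lambda m: m),                             # identity
        (np.fliplr, np.fliplr),                                 # horizontal flip
        (np.flipud, np.flipud),                                 # vertical flip
        (lambda a: np.rot90(a, 1), lambda m: np.rot90(m, -1)),  # 90-degree turn
    ] + [_shift(dy, dx) for dy, dx in shifts]

    votes = np.zeros(image.shape[:2], dtype=float)
    for forward, inverse in transforms:
        pred = np.asarray(segment_fn(forward(image)), dtype=float)
        votes += inverse(pred)         # map back to original coordinates
    prob = votes / len(transforms)
    return prob > 0.5, prob
```

If the underlying model returns probabilities rather than binary masks, the same loop averages those instead, and a lower threshold on `prob` moves the result closer to a union.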

5. Edge-aware post-processing

By this stage, we have a detailed mask that has combined information from multiple models, scales, and augmentations. The final step is to fine-tune the mask boundaries and remove any random noise, using classical post-processing techniques that respect image edges.

Guided edge refinement

We apply a guided filter or a bilateral filter on the mask, using the original image as the guide for edge information. This effectively sharpens the mask along true image edges while smoothing it in uniform regions.

For example, around an object boundary with fine hairs or irregular edges visible in the image, the guided filter will adjust the mask to better match those pixel gradients. On the other hand, if the mask has some jitter that doesn’t match any real edge (maybe from noise in the multiscale ensemble), the filter will smooth that out.

The result is a cleaner yet detail-preserving outline. As one example puts it, “refining the mask using a guided filter… ensures that edges of the mask conform to edges in the image” – exactly what we want for high-quality segmentation.
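To make the edge-aware step concrete, here is a minimal grayscale guided filter written in plain NumPy, following the standard local-linear formulation (the output is modelled as a * guide + b inside each window). The window radius `r` and regularizer `eps` are illustrative values, not tuned settings from our pipeline, and the guide is assumed to be a single-channel image scaled to [0, 1]. OpenCV's contrib module also ships a guided filter, which would be the more practical choice in production.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, replicating edges at the border."""
    k = 2 * r + 1
    padded = np.pad(x, r, mode="edge")
    # Integral image with a leading row/column of zeros.
    integral = np.pad(padded.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    total = (integral[k:, k:] - integral[:-k, k:]
             - integral[k:, :-k] + integral[:-k, :-k])
    return total / (k * k)

def guided_filter(guide, mask, r=4, eps=1e-3):
    """Smooth `mask` while aligning its transitions with edges in `guide`."""
    I = np.asarray(guide, dtype=np.float64)
    p = np.asarray(mask, dtype=np.float64)

    mean_I, mean_p = box_mean(I, r), box_mean(p, r)
    var_I = box_mean(I * I, r) - mean_I ** 2
    cov_Ip = box_mean(I * p, r) - mean_I * mean_p

    # Per-window linear model p ~ a * I + b; eps damps a in flat regions.
    a = cov_Ip / (var_I + eps)
    b = mean_p - a * mean_I

    # Average the coefficients over all windows covering each pixel.
    q = box_mean(a, r) * I + box_mean(b, r)
    return np.clip(q, 0.0, 1.0)
```

The output is a soft mask in [0, 1]; thresholding it (for example at 0.5) gives the refined binary mask. A larger `eps` smooths more aggressively, and a smaller `eps` follows the guide's edges more closely.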

Morphological smoothing and cleaning

We also use morphological operations to polish the mask. A small morphological closing (dilation followed by erosion) can fill tiny gaps or holes inside the segmented region, which might happen if, for instance, a patch of fur was segmented as isolated dots.

Closing will connect those dots into a continuous region if they are very close. Similarly, a slight opening (erosion then dilation) can remove isolated spots of false-positive segmentation not attached to the main object. Importantly, we choose the structuring element (kernel) for these operations to be very small (about 1–3 pixels) so we don’t accidentally remove legitimate thin structures.

The goal is gentle smoothing: remove obvious artifacts while preserving the intricate shapes that our earlier steps recovered. Basically, this acts as a noise filter for the mask. In summary, the edge-aware filtering and morphological smoothing together improve the mask’s visual coherence – edges become more natural and consistent with the image content, and any tiny out-of-place pixels are cleaned up.
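The closing and opening steps can be sketched with a 3×3 structuring element in plain NumPy. The helper names below are ours; `scipy.ndimage.binary_closing` and `scipy.ndimage.binary_opening` provide equivalent operations. Pixels outside the image are treated as background.

```python
import numpy as np

def _neighbourhood(mask, fill):
    """Stack the nine 3x3-shifted views of `mask`, padding with `fill`."""
    p = np.pad(mask, 1, mode="constant", constant_values=fill)
    h, w = mask.shape
    return np.stack([p[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def dilate3(mask):
    """3x3 dilation: set a pixel if it or any 8-neighbour is set."""
    return _neighbourhood(mask, False).any(axis=0)

def erode3(mask):
    """3x3 erosion: keep a pixel only if it and all 8-neighbours are set."""
    return _neighbourhood(mask, False).all(axis=0)

def close3(mask):
    """Closing (dilate, then erode): fills pinholes and one-pixel gaps."""
    m = np.pad(np.asarray(mask, dtype=bool), 1)  # room so borders do not shrink
    return erode3(dilate3(m))[1:-1, 1:-1]

def open3(mask):
    """Opening (erode, then dilate): removes specks smaller than 3x3."""
    return dilate3(erode3(np.asarray(mask, dtype=bool)))

def clean_mask(mask):
    """Closing followed by opening, both with a 3x3 structuring element."""
    return open3(close3(mask))
```

One caveat worth spelling out: even a 3×3 opening erases any structure narrower than three pixels, including genuine one-pixel strands. When hairline details matter, it may be safer to apply only the closing, or to restrict the opening to components that are not connected to the main object.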

Optional CRF refinement

As an optional step, especially for scientific or medical images with complex textures, we can apply a Dense Conditional Random Field (CRF) on the mask probabilities. The CRF considers low-level color and proximity to adjust the mask, making sure that pixels with similar color/intensity are likely to have the same label.

This can tighten the mask around object boundaries in a very fine-grained way, sometimes capturing whisker-thin details that align with color edges. However, we note that dense CRF is a purely low-level method and “lacks high-level semantic context, often struggling in complex scenarios.”

In practice, we found that our guided filter + morphology approach (which is simpler and faster) is enough for most cases, but CRF can be helpful when the object and background have clear color differences, and we want to use that for maximum boundary accuracy.

After post-processing, the resulting segmentation mask is our final output. It maintains the full extent of the object including fine details, but with smoothed, realistic boundaries that avoid the jagged or noisy artifacts that a raw union of many masks might have.

Discussion

Our approach presents several important trade-offs and considerations worth analyzing:

Recall vs. precision balance

While we prioritize recall over precision for fine details, this creates a risk of oversegmentation – including background fragments as part of the object. Our edge-aware post-processing mitigates this by removing regions not aligned with image edges. This balancing act requires careful parameter tuning depending on the application domain.

Performance vs. quality trade-off

The computational cost of our pipeline is significant – potentially several times more expensive than single-pass segmentation. For time-sensitive applications, selective application of these techniques may be necessary. Options include:

Only upscaling regions with known fine details
Reducing the number of TTA variants
Using model confidence to skip refinement on high-confidence regions
Parallelizing the pipeline steps where possible

Limitations

Our approach still struggles with extremely subtle details (structures narrower than a few pixels, or visually indistinguishable from the background). Combining multiple augmentations can also leave slightly ragged edges before smoothing, because pixel-level votes disagree near boundaries, so careful post-processing is required.

Domain applicability

This pipeline is particularly valuable for domains where fine details matter significantly:

Medical imaging (blood vessels, small lesions)
Wildlife photography (fur, whiskers, antennae)
Technical documentation (thin components, wires)
Satellite imagery (roads, rivers)

Different domains may require adjusting the balance between our components – some may benefit more from high-resolution refinement, while others from edge-aware filtering.

Conclusion

The SegGPT+SAM hybrid pipeline demonstrates that segmentation quality can be significantly improved through inference-time strategies alone, without modifying underlying models. This approach can be extended to other segmentation systems where detail preservation is critical, offering a practical solution that works with existing model checkpoints.