Advanced Metrics for Evaluating Spitz Visibility in Black-and-White Photos

Assessing the visibility of Japanese Spitzes in black-and-white images is trickier than it looks. Traditional sharpness measures such as the Laplacian and Tenengrad metrics fall short because they score overall image sharpness and ignore the specific textures and details of the Spitz’s fur. They also evaluate the entire image instead of just the Spitz.
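To make the whole-image problem concrete, here is a minimal sketch of both measures in NumPy. The 3x3 kernels are the standard Laplacian and Sobel kernels; the Tenengrad variant shown (mean squared gradient magnitude, no threshold) is one common choice, and `make_frame` is a hypothetical test image we made up for illustration, not one of our photos.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(gray, kernel):
    """Apply a 3x3 kernel to the image interior (no padding)."""
    gray = gray.astype(np.float64)
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return out


def laplacian_variance(gray):
    """Variance of the Laplacian response over the whole image."""
    return float(filter3x3(gray, LAPLACIAN).var())


def tenengrad(gray):
    """Mean squared Sobel gradient magnitude over the whole image."""
    gx = filter3x3(gray, SOBEL_X)
    gy = filter3x3(gray, SOBEL_Y)
    return float(np.mean(gx ** 2 + gy ** 2))


def make_frame(size):
    """Hypothetical frame: flat grey background with one 20x20 textured patch."""
    rng = np.random.default_rng(0)  # fixed seed, so the patch is identical every time
    frame = np.full((size, size), 128.0)
    frame[20:40, 20:40] += rng.normal(0.0, 25.0, size=(20, 20))
    return frame


# Same "fur" patch in two frame sizes: the wider frame scores lower on both
# measures because the flat background is averaged in.
for size in (64, 128):
    frame = make_frame(size)
    print(size, laplacian_variance(frame), tenengrad(frame))
```

The textured patch is identical in both frames, yet both scores drop as the background grows, which is the behavior that makes these measures a poor fit for judging one object.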

To overcome these limitations, we used machine learning-based metrics. These advanced tools analyze images more like humans do, considering textures, structures, and visual quality in detail.

Let’s explore the best metrics for this task, weigh their advantages and disadvantages, and identify the top choices.

Why machine learning metrics?

Traditional metrics like PSNR, SSIM, and VIF measure basic image quality aspects and are great for tasks like reducing noise or removing artifacts. However, they have notable drawbacks:

Missing localized textures: They don’t focus on specific features like fur patterns.
Quality misrepresentation: High PSNR values don’t always mean better-looking images.
Ignoring human perception: These methods overlook how humans perceive textures and structures differently from raw pixel data.
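The second and third points can be illustrated with a small sketch. PSNR depends only on the mean squared error, so two very different distortions with the same MSE get the same score. The images below are synthetic arrays we made up for the example.

```python
import numpy as np


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB; identical images give infinity."""
    diff = reference.astype(np.float64) - distorted.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


reference = np.full((64, 64), 100.0)

# Distortion A: a uniform brightness shift of 5 grey levels everywhere.
shifted = reference + 5.0

# Distortion B: the same total squared error packed into one 16x16 block
# (1/16 of the pixels, each off by 20 levels, so the MSE is still 25).
patched = reference.copy()
patched[24:40, 24:40] += 20.0

print(psnr(reference, shifted), psnr(reference, patched))
```

Both calls print the same value, although a viewer would barely notice the uniform shift and would immediately spot the bright block.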

Machine learning-based metrics, on the other hand, are trained on real-world data and consider human-relevant factors like texture, structure, and visual quality. They can also focus on specific areas, such as the Spitz, rather than the entire image.

Metric selection and evaluation

We started with a wide range of metrics, including both traditional and machine learning-based methods, to find the best ones for evaluating Spitzes’ visibility.

First, we defined their categories: full-reference, no-reference, and conventional. Full-reference metrics compare “before” and “after” images to see changes. No-reference metrics work without needing a reference image, useful when the original isn’t available.

Conventional metrics are good for basic checks but often miss specific textures or objects, making them less suitable for our task.

Next, we analyzed each metric in detail, and here’s what we got:

✅ 1. LPIPS (Learned perceptual image patch similarity)

What it does: measures how similar two images look to humans by comparing deep features of small image patches. It uses pre-trained deep neural networks such as AlexNet or VGG.

Pros:
Excels at texture evaluation (e.g., fur).
Detects structural changes (e.g., body shape of a Spitz).

Cons:
Requires a reference image.
Provides global, not localized, evaluations; results may be diluted if the image contains multiple objects.

The image on the left has a score of 0.42 out of 1; the image on the right scores 0.69. LPIPS shows an improvement of 0.27.

✅ 2. DISTS (Deep image structure and texture similarity)

What it does: looks at both the structure and texture of images to determine how similar they are.

Pros:
Tracks both local and global texture changes effectively.

Cons:
Requires image pairs.
Tailored for assessing structure and texture simultaneously.

DISTS rated the left image as 4/17 and the right one as 11/17.

✅ 3. PieAPP (Perceptual image-error assessment through pairwise preference)

What it does: predicts which of two images is closer to a reference image based on human judgments.

Pros:
Incorporates human perception for nuanced evaluations.
Handles micro-level changes effectively.

Cons:
Complex to set up and interpret.
Requires image pairs.

The image on the left has a score of 0.22 out of 1; the image on the right scores 0.67. PieAPP shows an improvement of 0.45.

⚠️ 4. FSIM/FSIMc (Feature similarity index)

What it does: measures image quality by comparing important features like edges and textures. FSIMc adds color information to make it more accurate for color images.

Pros:
Captures edges and low-level structure through phase congruency and gradient magnitude.
Hand-crafted, so it needs no training data.

Cons:
Does not model human perception accurately.

✅ 5. CLIP (Contrastive language-image pretraining)

What it does: uses a combination of image and text data to understand and evaluate image quality. For example, it can be fine-tuned to respond to queries like “bright Spitz fur.”

Pros:
Highly versatile and customizable.
Evaluates images based on descriptive text.

Cons:
Requires fine-tuning with a well-annotated dataset.

CLIP scores for these images, left to right: 7/17 (below average), 11/17 (good), 13/17 (good).

❌ 6. DBCNN (Deep bilinear convolutional neural network)

What it does: evaluates image quality by analyzing patterns and features within the image itself.

Pros:
Does not need a reference image.

Cons:
Struggles with fine textures (e.g., fur details).

❌ 7. HIQA (Hallucinated-IQA)

What it does: predicts image quality by generating a high-quality version of the image and comparing it to the original.

Pros:
Effective for evaluating complex textures.

Cons:
Requires additional refinement and careful parameter tuning.

❌ 8. Mask R-CNN (Mask region-based convolutional neural network)

What it does: isolates specific regions in an image, so that visibility can be evaluated on the target object alone. It is a segmentation model rather than a quality metric.

Pros:
Segments the object of interest precisely (e.g., separates Spitz from the background).
Ensures that evaluations are localized to the target object, avoiding irrelevant areas.

Cons:
Requires substantial computational resources for training and fine-tuning.
Difficult to implement.
Performance depends heavily on the quality of the dataset and annotations.
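The localization idea is simple once a mask exists. The sketch below assumes a boolean mask of the Spitz is already available (from Mask R-CNN or any other segmentation model, which is not shown here) and works on a 2D grayscale array; `crop_to_mask` is a hypothetical helper name, not part of any library.

```python
import numpy as np


def crop_to_mask(image, mask, background=0):
    """Blank out pixels outside the mask, then crop to the mask's bounding box.

    image: 2D grayscale array; mask: boolean array of the same shape.
    """
    if image.shape != mask.shape:
        raise ValueError("image and mask must have the same shape")
    if not mask.any():
        raise ValueError("mask is empty: there is no object to evaluate")

    rows = np.flatnonzero(mask.any(axis=1))  # row indices that contain the object
    cols = np.flatnonzero(mask.any(axis=0))  # column indices that contain the object
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1

    isolated = np.where(mask, image, background)
    return isolated[top:bottom, left:right]


# Tiny made-up example: a 6x6 image with an object covering rows 2-3, columns 1-4.
image = np.arange(36).reshape(6, 6)
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 1:5] = True
mask[3, 2] = False  # a background pixel inside the bounding box

print(crop_to_mask(image, mask))
```

Any of the metrics in this article could then be run on the cropped output instead of the full frame, so the background no longer influences the score.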

Final selection

After evaluating each metric, we chose the following as the best for our task:

LPIPS: Great for evaluating texture and structure.
DISTS: Complements LPIPS by focusing on texture and structural changes.
PieAPP: Incorporates human preference into the analysis.
CLIP (fine-tuned): Allows for task-specific evaluations based on descriptive queries.

We excluded metrics like DeepQA, DBCNN, MEON, and most conventional methods (PSNR, SSIM) because they were either irrelevant or outdated for this task.

Next, let’s implement and test these metrics in real-world scenarios to see how they perform.

We deliberately “ruined” the photo on the left by cropping, resizing, and applying a filter. For reference, we kept the one on the right as it is.

And now, let’s see what the metrics will tell us about the images’ similarities and differences.

LPIPS

LPIPS is a popular metric for evaluating how similar images look to humans, focusing on structure and texture. Unlike traditional metrics like MSE or PSNR, LPIPS uses deep neural networks to measure perceptual differences.

It uses pre-trained models (e.g., AlexNet, VGG) to extract features from images and then calculates the perceptual distance between these features. A lower LPIPS score means the images are more similar in appearance.

For accurate results, both images are resized to the same dimensions with the LANCZOS filter, which maintains their original structure and smooth gradients and minimizes artifacts.
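A minimal sketch of this resize-then-compare flow is shown below. It assumes the `lpips` PyPI package, PyTorch, and Pillow 9.1 or newer; the 256x256 target size and the helper names are our own illustrative choices, not settings confirmed by the experiment above. The heavy libraries are imported inside the functions so the preprocessing helper can be used on its own.

```python
import numpy as np


def to_lpips_range(pixels):
    """Map uint8 HxWx3 pixels in [0, 255] to float32 3xHxW values in [-1, 1]."""
    scaled = pixels.astype(np.float32) / 127.5 - 1.0
    return np.ascontiguousarray(np.transpose(scaled, (2, 0, 1)))


def load_resized(path, size=(256, 256)):
    """Open an image, convert it to RGB, and resize it with the LANCZOS filter."""
    from PIL import Image  # Pillow >= 9.1 for Image.Resampling

    with Image.open(path) as img:
        resized = img.convert("RGB").resize(size, Image.Resampling.LANCZOS)
    return np.asarray(resized)


def lpips_distance(path_a, path_b, net="alex"):
    """Perceptual distance between two image files; lower means more similar."""
    import torch
    import lpips

    model = lpips.LPIPS(net=net)  # downloads pre-trained weights on first use
    a = torch.from_numpy(to_lpips_range(load_resized(path_a))).unsqueeze(0)
    b = torch.from_numpy(to_lpips_range(load_resized(path_b))).unsqueeze(0)
    with torch.no_grad():
        return float(model(a, b).item())
```

A call such as `lpips_distance("ruined.png", "reference.png")` (hypothetical file names) returns a single number that can be compared with the score reported below.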

👉 LPIPS Score (Resize): 0.1053

This low score shows a high level of perceptual similarity between the images. Visual heatmaps can highlight areas of notable differences, emphasizing the structural and textural details identified by LPIPS.

DISTS

DISTS is a metric that measures how similar images look by balancing overall structure and local texture. It’s great for tasks where texture is important, working well alongside LPIPS.

It calculates structural similarity through correlations and texture similarity through small patches, providing a comprehensive image quality assessment.

Cropping focuses the comparison on relevant parts of the image, reducing noise and emphasizing texture details.
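The cropping step could be sketched as follows, assuming the same box is applied to both images and that the `DISTS_pytorch` package is used for scoring (it expects RGB batches in the 0 to 1 range). The box coordinates and helper names are hypothetical; we do not reproduce the exact crop used for the score below.

```python
import numpy as np


def crop_box(pixels, box):
    """Crop an HxWxC array to box = (left, top, right, bottom) in pixels."""
    left, top, right, bottom = box
    h, w = pixels.shape[:2]
    if not (0 <= left < right <= w and 0 <= top < bottom <= h):
        raise ValueError(f"box {box} does not fit inside a {w}x{h} image")
    return pixels[top:bottom, left:right]


def to_unit_batch(pixels):
    """uint8 HxWx3 pixels in [0, 255] -> float32 batch of shape 1x3xHxW in [0, 1]."""
    scaled = pixels.astype(np.float32) / 255.0
    return np.ascontiguousarray(np.transpose(scaled, (2, 0, 1)))[np.newaxis]


def dists_on_crop(pixels_a, pixels_b, box):
    """DISTS distance on the same crop of two images; lower means more similar."""
    import torch
    from DISTS_pytorch import DISTS

    model = DISTS()
    a = torch.from_numpy(to_unit_batch(crop_box(pixels_a, box)))
    b = torch.from_numpy(to_unit_batch(crop_box(pixels_b, box)))
    with torch.no_grad():
        return float(model(a, b).item())
```

Validating the box up front avoids silently comparing an empty or truncated region, which would otherwise produce a score that looks valid but means nothing.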

👉 DISTS Score (Crop): 0.3931

This score indicates strong similarity in both texture and structure between the images. Cropping helps refine the analysis by highlighting important details, especially in texture-rich areas.

PieAPP

PieAPP is a perceptual metric that aligns closely with human visual preferences. It uses subjective pairwise comparisons during training to offer a human-centric evaluation of image quality.

PieAPP learns to predict perceptual error scores by minimizing the difference between its rankings and human-labeled comparisons, ensuring it reflects human judgments of quality.
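The training idea can be sketched in a few lines. We assume a Bradley-Terry style logistic link between two error scores and the probability that one image is preferred, plus a squared-error loss against the human preference rate; both are illustrative assumptions rather than a reproduction of PieAPP’s training code, and the numbers are made up.

```python
import math


def preference_probability(error_a, error_b):
    """Probability that image A is preferred over image B.

    Each argument is a predicted perceptual error against the same reference,
    so the image with the lower error is the more likely winner.
    """
    return 1.0 / (1.0 + math.exp(error_a - error_b))


def pairwise_loss(error_a, error_b, human_preference_for_a):
    """Squared gap between predicted and observed preference for A."""
    predicted = preference_probability(error_a, error_b)
    return (predicted - human_preference_for_a) ** 2


# Hypothetical example: A has the lower predicted error, and 80% of human
# raters preferred A over B.
print(preference_probability(1.0, 2.0))
print(pairwise_loss(1.0, 2.0, 0.8))
```

Minimizing a loss of this kind over many labeled pairs pushes the predicted errors toward an ordering that agrees with human rankings.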

Resize: Evaluates the entire image, giving a general sense of perceptual similarity.
Crop: Focuses on specific regions, highlighting finer details and localized errors.

👉 PieAPP Score (Resize): 1.1631

👉 PieAPP Score (Crop): 2.1342

PieAPP predicts a perceptual error, so lower is better. The higher score for the cropped version highlights localized errors and inconsistencies that are less noticeable in the resized version, showing PieAPP’s sensitivity to human preferences.

CLIP

CLIP connects visual and textual data, allowing for semantic image evaluation. When fine-tuned, it can assess images based on descriptive queries.

CLIP encodes images and text into a shared space, making comparisons possible. Fine-tuning improves its ability to evaluate images against specific descriptions (e.g., “fur is soft and realistic”).

CLIP’s image analysis results

CLIP is very versatile in terms of evaluation criteria: the text prompts in the code can be edited so that it focuses on the objects or qualities we need.
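As a rough sketch of prompt-based scoring, the code below ranks an image against a list of text prompts. It assumes the Hugging Face `transformers` library with the off-the-shelf `openai/clip-vit-base-patch32` checkpoint, not our fine-tuned model, and the prompts and file name in the usage note are hypothetical. The first function shows the shared-space comparison on plain arrays; the scale of 100 is an assumed value close to CLIP’s learned temperature.

```python
import numpy as np


def prompt_probabilities(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    image = image_embedding / np.linalg.norm(image_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = logit_scale * (texts @ image)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def score_image_against_prompts(image_path, prompts,
                                model_name="openai/clip-vit-base-patch32"):
    """Return {prompt: probability} for one image using a pre-trained CLIP model."""
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    with Image.open(image_path) as img:
        inputs = processor(text=prompts, images=img.convert("RGB"),
                           return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return dict(zip(prompts, probs.tolist()))
```

Changing the evaluation criterion is then a matter of editing the prompt list, for example `score_image_against_prompts("spitz.png", ["bright Spitz fur", "dark blurry fur"])`.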

Analysis summary and future steps

The metrics LPIPS, DISTS, PieAPP, and CLIP comprehensively assess image quality by covering structural, textural, and semantic alignment. This research identified the strengths of each metric for future system development.

LPIPS: Demonstrates high perceptual similarity, validating its effectiveness for overall quality evaluation in synthesis tasks.
DISTS: Useful for tasks emphasizing texture detail and image structure, balancing local texture and overall composition.
PieAPP: Sensitive to localized errors and human visual preferences, valuable for analyzing fine details.
CLIP: Strong for semantic verification, ensuring images meet specific textual requirements and maintaining stylistic elements or key attributes. It offers unique potential for evaluating alignment between visual and textual data, with further exploration planned for its standalone applications.

Why is this important?

Three of these metrics – LPIPS, DISTS, and PieAPP – are reference-dependent and complement each other well, providing a comprehensive evaluation of image quality. Although they may not be part of our final system pipeline because of their reliance on reference images, this research is crucial because:

Systematic quality assessment: Understanding these evaluation approaches helps avoid errors in system development.
Integration framework: Even if not included in the final architecture, these metrics serve as valuable intermediate verification tools during development.
Quality standards formation: Results from these evaluations allow us to define clear benchmarks and set realistic expectations for our system.

Next steps

Deepen research: Further investigate the applicability of these metrics for targeted use cases to refine their roles.
Develop a custom approach: Create a tailored image evaluation strategy by leveraging insights from LPIPS, DISTS, and PieAPP while addressing their limitations.
Focus on CLIP: Explore CLIP’s capabilities as a standalone tool for semantic evaluation. Its alignment between textual and visual data makes it particularly promising for independent use in thematic and stylistic assessments.

This research establishes a well-rounded foundation for advancing our system while ensuring its outputs align with structural, textural, and semantic quality benchmarks.

At Furnets, we love to upgrade existing technologies and find their new applications. Keep an eye on our publications to see what we’re doing next.