Duplicate / Similar Image Eliminator with Low Resource Usage (Database Management)

Welcome to the "Duplicate Image Eliminator"!

This program is designed to help you detect and remove duplicate images in a specific folder. If you have a collection of images and suspect that there are duplicates taking up unnecessary space, this program will make the cleaning process easier for you.

The "Duplicate Image Eliminator" uses an algorithm based on the Structural Similarity Index Measure (SSIM) score. The algorithm compares each pair of images in the folder and calculates their similarity score. If the score exceeds a set threshold, the image is identified as a duplicate and automatically removed.

You can adjust the similarity threshold according to your preferences. A higher threshold removes only near-identical images, while a lower threshold also removes images that are merely similar. The program reports the total number of duplicate images removed, helping you maintain an organized and duplicate-free image collection.
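
To make the comparison step concrete, here is a minimal sketch of a single SSIM check, assuming two hypothetical files photo_a.jpg and photo_b.jpg of the same dimensions in the current directory:

~~~~~
from skimage import io
from skimage.metrics import structural_similarity as ssim

# Hypothetical file names used only for illustration; both images must have the same dimensions
image_a = io.imread("photo_a.jpg")
image_b = io.imread("photo_b.jpg")

# channel_axis=-1 tells SSIM that the last axis holds the color channels (RGB)
score = ssim(image_a, image_b, channel_axis=-1)

similarity_threshold = 0.9
if score > similarity_threshold:
    print(f"Score {score:.3f} exceeds the threshold: photo_b.jpg would be treated as a duplicate")
else:
    print(f"Score {score:.3f} is below the threshold: both images would be kept")
~~~~~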

To use the program, simply specify the path of the folder containing the images and run the code. The "Duplicate Image Eliminator" will do the rest for you.

Enjoy a cleaner and more organized image folder with the "Duplicate Image Eliminator"!

The main idea behind the "Duplicate Image Eliminator" is to provide you with a simple and automated way to discard images that add little value to your dataset. In many cases, when working with large image datasets, it's common to encounter duplicates that occupy unnecessary space and hinder subsequent analysis and processing.

This program allows you to save time and effort by automating the process of identifying and removing duplicate images. By setting a similarity threshold, you can adjust the algorithm's sensitivity and decide what level of similarity to consider as a duplicate. This gives you the flexibility to tailor the program to your specific needs.

Imagine you're working on a computer vision project and you have a large dataset of images. Some images may be repeated due to multiple downloads or backups. By using the "Duplicate Image Eliminator," you can easily eliminate those duplicates and reduce the size of your dataset, improving processing efficiency and avoiding biased results.

Furthermore, the program can also be applied to large datasets. You can run it on folders with hundreds or even thousands of images to clean and organize your data collection, keeping in mind that every pair of images is compared, so runtime grows quickly with the number of images (see the performance note further below).

~~~~~
import os
from skimage import io
from skimage.metrics import structural_similarity as ssim

def calculate_similarity_score(image1, image2):
    # Compute the SSIM similarity score between the two images.
    # channel_axis replaces the deprecated multichannel argument in recent scikit-image releases.
    channel_axis = -1 if image1.ndim == 3 else None
    score = ssim(image1, image2, channel_axis=channel_axis)
    return score

def remove_duplicate_images(folder_path, similarity_threshold):
    image_files = os.listdir(folder_path)
    num_images = len(image_files)
    num_duplicates = 0  # Counter of duplicate images removed
    duplicate_files = []  # File names of images identified as duplicates

    for i in range(num_images):
        if image_files[i] in duplicate_files:
            continue  # Skip images already marked as duplicates

        image1 = io.imread(os.path.join(folder_path, image_files[i]))

        for j in range(i + 1, num_images):
            if image_files[j] in duplicate_files:
                continue

            image2 = io.imread(os.path.join(folder_path, image_files[j]))

            # SSIM requires both images to have identical dimensions; skip mismatched pairs
            if image1.shape != image2.shape:
                continue

            similarity_score = calculate_similarity_score(image1, image2)

            if similarity_score > similarity_threshold:
                # Record the file name of the duplicate image
                duplicate_files.append(image_files[j])

    # Delete the duplicate images
    for file_name in duplicate_files:
        file_path = os.path.join(folder_path, file_name)
        if os.path.exists(file_path):
            os.remove(file_path)
            num_duplicates += 1

    print(f"Total number of duplicate images removed: {num_duplicates}")

# Path of the folder containing the images
EeveelutionsCollection = "Copia" #@param {type:"string"}
folder_path = f"/content/drive/MyDrive/Dataset/{EeveelutionsCollection}"
similarity_threshold = 0.9  #@param {type:"slider", min:0.0, max:1.0, step:0.01}
# Similarity threshold above which an image is considered a duplicate

remove_duplicate_images(folder_path, similarity_threshold)

~~~~~

This code is a program that searches for duplicate images in a folder and removes them using the Structural Similarity Index Measure (SSIM) score. Here is an explanation of the key parts of the code:

import os: This line imports the os module, which provides functions to interact with the operating system, such as accessing files and directories.

from skimage import io: This imports the io module from the scikit-image library, which is used to read images.

from skimage.metrics import structural_similarity as ssim: This imports the structural_similarity function from the skimage.metrics module of scikit-image and assigns it the alias ssim.

calculate_similarity_score(image1, image2): This is a function that takes two images as input and calculates the Structural Similarity (SSIM) score between them using the previously imported ssim function. The SSIM score measures how similar two images are; identical images score 1.0.
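
As a quick sanity check, and assuming the code above has already been run so that calculate_similarity_score is defined, comparing an image with itself returns a score of 1.0 (the file name example.jpg is only illustrative):

~~~~~
from skimage import io

# Hypothetical file name; any image from the folder would do
image = io.imread("example.jpg")

# An image is perfectly similar to itself, so the SSIM score is 1.0
print(calculate_similarity_score(image, image))
~~~~~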

remove_duplicate_images(folder_path, similarity_threshold): This is the main function of the program. It takes the path of the folder containing the images and a similarity threshold as input. The function searches for duplicate images in the folder using a pairwise comparison approach: it compares each pair of images of the same dimensions (SSIM cannot compare images of different sizes, so mismatched pairs are skipped), and if the SSIM score between them is greater than the set threshold, the second image of the pair is considered a duplicate and its file name is added to a list of duplicates. Images already marked as duplicates are skipped in later comparisons.

Then, the program iterates over the list of duplicate image file names and checks if each file still exists in the specified path. If it exists, the file is removed using the os.remove() function, and the count of removed duplicate images is updated.

Finally, the total number of removed duplicate images is displayed on the screen.

The last lines of the code set the folder path containing the images and the desired similarity threshold. Then, they call the remove_duplicate_images() function with these parameters to initiate the process of removing duplicate images.

When using this code, there are some considerations and possible errors to keep in mind:

Compatible image formats: The code is designed to work with images in formats supported by the scikit-image library, such as JPEG and PNG. If you try to use images in other unsupported formats, errors may occur.
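
If the folder may contain files that are not images (text files, subdirectories, and so on), one option is to filter the file list by extension before comparing anything. This is a small sketch that is not part of the code above; the helper name list_image_files and the extension list are only illustrative:

~~~~~
import os

VALID_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")

def list_image_files(folder_path):
    # Keep only regular files whose extension looks like a supported image format
    return [
        name for name in os.listdir(folder_path)
        if name.lower().endswith(VALID_EXTENSIONS)
        and os.path.isfile(os.path.join(folder_path, name))
    ]
~~~~~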

Folder path and file names: Make sure to provide the correct folder path containing the images in the folder_path variable. Also, verify that the image file names are correct and match the actual names in the folder. If there are discrepancies, errors related to missing files or directories are likely to occur.

Write permissions: The program attempts to remove duplicate images using the os.remove() function. Make sure you have the proper permissions to write to the folder and delete files. If you don't have the necessary permissions, a permission denied error may occur when attempting to delete files.
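
Both of these points can be verified up front, before any file is touched. A minimal pre-flight check might look like this, using the same folder_path variable as the code above:

~~~~~
import os

# Verify that the folder exists and that we are allowed to modify it before deleting anything
if not os.path.isdir(folder_path):
    raise FileNotFoundError(f"Folder not found: {folder_path}")
if not os.access(folder_path, os.W_OK):
    raise PermissionError(f"No write permission for: {folder_path}")
~~~~~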

Similarity threshold: The similarity_threshold determines when two images are considered duplicates. Adjust this value according to your needs and the level of similarity you want to detect. If the threshold is too high, some duplicate images may not be detected. If the threshold is too low, images that are not true duplicates may be removed.

Performance: Depending on the number of images in the folder and the complexity of the similarity calculations, the process of removing duplicate images can take time. If you have a large number of images, the runtime can be considerable. Consider this aspect and be patient during program execution.
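
One way to keep the runtime manageable on larger folders is to compare downscaled copies of the images instead of the full-resolution files. This is an optional approximation, not something the code above does; the helper name and the 256x256 thumbnail size are arbitrary choices:

~~~~~
from skimage import io
from skimage.metrics import structural_similarity as ssim
from skimage.transform import resize

def similarity_on_thumbnails(path1, path2, size=(256, 256)):
    # Downscaling makes each comparison much cheaper at the cost of some precision
    img1 = io.imread(path1)
    img2 = io.imread(path2)
    thumb1 = resize(img1, size + img1.shape[2:], anti_aliasing=True)
    thumb2 = resize(img2, size + img2.shape[2:], anti_aliasing=True)

    # resize() returns float images in [0, 1], so the data range must be given explicitly
    channel_axis = -1 if thumb1.ndim == 3 else None
    return ssim(thumb1, thumb2, channel_axis=channel_axis, data_range=1.0)
~~~~~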

Image backup: Before running the program and removing duplicate images, make sure you have a backup of the original images. This will allow you to revert any undesired deletions or errors that may occur during program execution.
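
For example, a one-time copy of the whole folder can be made before running the cleanup; the "_backup" suffix below is only illustrative:

~~~~~
import shutil

# Copy the entire image folder before deleting anything.
# copytree() raises an error if the destination already exists, so an older backup is never overwritten.
shutil.copytree(folder_path, folder_path + "_backup")
~~~~~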

By taking these considerations into account and verifying the mentioned aspects, you should be able to effectively use the code to detect and remove duplicate images in a folder.


Financial assistance: Hello everyone!

This is Tomas Agilar speaking, and I'm thrilled to have the opportunity to share my work and passion with all of you. If you enjoy what I do and would like to support me, there are a few ways you can do so:
