Last Updated: Friday 3rd May 2013

Sometimes we need to find duplicate files in our file system, or inside a specific folder. In this tutorial we are going to write a Python script to do this. The script works in Python 3.x.

The program receives a folder or a list of folders to scan; it then traverses the given directories and finds the duplicated files inside them.

The program computes a hash for every file, which lets us find duplicated files even when their names differ. Every file we find is stored in a dictionary, with the hash as the key and the list of paths to the matching files as the value: { hash: [list of paths] }.

To start, import the os, sys and hashlib libraries:
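
    import os
    import sys
    import hashlib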

Next, we need a function to calculate the MD5 hash of a given file. The function receives the path to the file and returns the hex digest of that file:
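
A straightforward way to do this is to read the file in chunks, so large files do not have to fit in memory (the chunk size below is an arbitrary choice):

    def hashfile(path, blocksize=65536):
        """Return the MD5 hex digest of the file at the given path."""
        hasher = hashlib.md5()
        with open(path, 'rb') as afile:
            buf = afile.read(blocksize)
            while len(buf) > 0:
                hasher.update(buf)
                buf = afile.read(blocksize)
        return hasher.hexdigest()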

Now we need a function to scan a directory for duplicated files:
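
A minimal version, built on os.walk and the hashfile function above, could look like this (the progress message is optional):

    def findDup(parentDir):
        """Dups in format {hash: [list of paths]}."""
        dups = {}
        for dirName, subDirs, fileList in os.walk(parentDir):
            print('Scanning %s...' % dirName)
            for fileName in fileList:
                # Get the full path to the file
                path = os.path.join(dirName, fileName)
                # Calculate the hash
                file_hash = hashfile(path)
                # Add or append the path to the dictionary
                if file_hash in dups:
                    dups[file_hash].append(path)
                else:
                    dups[file_hash] = [path]
        return dups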

The findDup function uses os.walk to traverse the given directory. If you need a more comprehensive guide to it, take a look at the How to Traverse a Directory Tree in Python article. os.walk only yields the bare file names, so we use os.path.join with the directory name to build the full path to each file. Then we compute the file's hash and store the path in the dups dictionary.

When findDup finishes traversing the directory, it returns the dictionary; every hash that maps to more than one path marks a group of duplicated files. If we are going to traverse several directories, we need a method to merge two dictionaries:
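
One way to write it (the parameter names are just placeholders):

    def joinDicts(dict1, dict2):
        """Merge dict2 into dict1, combining the path lists of hashes present in both."""
        for key in dict2.keys():
            if key in dict1:
                dict1[key] = dict1[key] + dict2[key]
            else:
                dict1[key] = dict2[key]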

joinDicts takes two dictionaries and iterates over the second one, checking whether each key already exists in the first. If it does, the function appends the paths from the second dictionary to the ones already stored in the first; if it does not, it copies the key and its paths into the first dictionary. When the function returns, the first dictionary contains all of the information.

To be able to run this script from the command line, we need to receive the folders as parameters, and then call findDup for every folder:
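
A sketch of the entry point: it checks each argument, merges the results, and hands them to a printing helper (called printResults here) that we will write next:

    if __name__ == '__main__':
        if len(sys.argv) > 1:
            dups = {}
            folders = sys.argv[1:]
            for i in folders:
                # Iterate over the folders given on the command line
                if os.path.exists(i):
                    # Find the duplicates and merge them into dups
                    joinDicts(dups, findDup(i))
                else:
                    print('%s is not a valid path, please verify' % i)
                    sys.exit()
            printResults(dups)
        else:
            print('Usage: python dupFinder.py folder or python dupFinder.py folder1 folder2 folder3')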

The os.path.exists function verifies that the given folder exists in the filesystem. To run the script, use python dupFinder.py /folder1 ./folder2. Finally, we need a method to print the results:
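
One way to write it is to keep only the hashes that map to more than one path and print each group (the exact messages are up to you):

    def printResults(dict1):
        results = list(filter(lambda x: len(x) > 1, dict1.values()))
        if len(results) > 0:
            print('Duplicates found:')
            print('The files below have the same content, even if their names differ.')
            print('___________________')
            for result in results:
                for subresult in result:
                    print('\t\t%s' % subresult)
                print('___________________')
        else:
            print('No duplicate files found.')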

Putting everything together:
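
Here is the whole script in one file, with the entry point at the end so every function is defined before it runs (again, printResults is simply the name chosen here for the printing helper):

    import os
    import sys
    import hashlib


    def hashfile(path, blocksize=65536):
        """Return the MD5 hex digest of the file at the given path."""
        hasher = hashlib.md5()
        with open(path, 'rb') as afile:
            buf = afile.read(blocksize)
            while len(buf) > 0:
                hasher.update(buf)
                buf = afile.read(blocksize)
        return hasher.hexdigest()


    def findDup(parentDir):
        """Dups in format {hash: [list of paths]}."""
        dups = {}
        for dirName, subDirs, fileList in os.walk(parentDir):
            print('Scanning %s...' % dirName)
            for fileName in fileList:
                # Get the full path to the file
                path = os.path.join(dirName, fileName)
                # Calculate the hash
                file_hash = hashfile(path)
                # Add or append the path to the dictionary
                if file_hash in dups:
                    dups[file_hash].append(path)
                else:
                    dups[file_hash] = [path]
        return dups


    def joinDicts(dict1, dict2):
        """Merge dict2 into dict1, combining the path lists of hashes present in both."""
        for key in dict2.keys():
            if key in dict1:
                dict1[key] = dict1[key] + dict2[key]
            else:
                dict1[key] = dict2[key]


    def printResults(dict1):
        """Print every group of files that share a hash."""
        results = list(filter(lambda x: len(x) > 1, dict1.values()))
        if len(results) > 0:
            print('Duplicates found:')
            print('The files below have the same content, even if their names differ.')
            print('___________________')
            for result in results:
                for subresult in result:
                    print('\t\t%s' % subresult)
                print('___________________')
        else:
            print('No duplicate files found.')


    if __name__ == '__main__':
        if len(sys.argv) > 1:
            dups = {}
            folders = sys.argv[1:]
            for i in folders:
                # Iterate over the folders given on the command line
                if os.path.exists(i):
                    # Find the duplicates and merge them into dups
                    joinDicts(dups, findDup(i))
                else:
                    print('%s is not a valid path, please verify' % i)
                    sys.exit()
            printResults(dups)
        else:
            print('Usage: python dupFinder.py folder or python dupFinder.py folder1 folder2 folder3')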

  • Todor Lubenov

    Thank you for the useful tool.

    The line 67 should be "joinDicts(dups, findDup(i))", not "joinDicts(dups, findDup(i)) dictionary" as in the code.

    • Jackson Cooper

      Hi Todor! Thanks for pointing that out, it’s been fixed.

  • David

    Excellent tutorial. I have been looking for a good duplicate finder, and python is my language of choice. I have modified it slightly both to exclude “dot” files and folders (typically seen on Mac and Linux systems) and to exclude zero-byte files (zero-byte files all have the same md5hash, and so are considered identical by this script).

    Here is my modified function def:

    def findDup(parentDir):
        """Dups in format {hash: [paths]}"""
        dups = {}
        for dirName, subDirs, fileList in os.walk(parentDir):
            # Modify the lists in place to exclude dot folders and files
            subDirs[:] = [d for d in subDirs if d[0] != "."]
            fileList = [f for f in fileList if f[0] != "."]
            print("Scanning %s" % dirName)
            for fileName in fileList:
                # Get path of file
                path = os.path.join(dirName, fileName)
                # Exclude 0 byte files
                if os.path.getsize(path) > 0:
                    # Calculate the hash
                    file_hash = hashfile(path)
                    # Add or append the path to the dict
                    if file_hash in dups:
                        dups[file_hash].append(path)
                    else:
                        dups[file_hash] = [path]
        return dups

    • http://jacksonc.com Jackson Cooper

      Hi David. Thanks for the contribution!

      • john

        subDirs[:] = [d for d in subDirs if d[0] != "."]
        results = list(filter(lambda x: len(x) > 1, dict1.values()))
        Seems like every language needs a bit of ‘Chinese’.

        • http://jacksonc.com Jackson Cooper

          If only I were talking about China ;-)

  • Ian Lee

    Awesome! I had a bash version of this I had found somewhere, but I’d been wanting to make a fast Python version of it. I used your script here as a base and made a change that substantially improves the performance.

    First, get the file size of every file in the directories; this takes O(1) time, versus O(file size) to compute a hash. Then compute the md5sum only for the files that share a size (there is a rough sketch of the idea after this thread).

    Benchmarks (All times are wall clock times):

    Video Directory (88 files, total of 6.5 GB)
    - Original = 59.591 seconds
    - New = 0.786 seconds

    Music Directory (4427 files, total of 22 GB)
    - Original = 199.27 seconds
    - New = 0.438 seconds

    A side benefit is that I have it set up to print out files that have the same size; these might potentially be the same file even if the md5sums don’t match (e.g. files with timestamps embedded that are really the same file).

    I’ve made the code available on GitHub here: https://github.com/IanLee1521/utilities/blob/master/bin/find_duplicates.py

    • http://jacksonc.com Jackson Cooper

      Hey Ian, that’s awesome! Can you re-base the original code to see what was changed?
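
Ian's size-first approach boils down to something like the rough sketch below; his actual, tested implementation is in the script linked above, and the function name here is purely illustrative:

    def findDupsBySize(parentDirs):
        """Sketch of the size-first pass described in the comment above:
        hash only the files that share a size with at least one other file."""
        sizes = {}
        for parentDir in parentDirs:
            for dirName, subDirs, fileList in os.walk(parentDir):
                for fileName in fileList:
                    path = os.path.join(dirName, fileName)
                    # Getting the size does not read the file, so it is cheap
                    sizes.setdefault(os.path.getsize(path), []).append(path)

        dups = {}
        for size, paths in sizes.items():
            if len(paths) < 2:
                continue  # a unique size cannot be a duplicate
            for path in paths:
                # Only same-sized files are hashed, which is where the speed-up comes from
                file_hash = hashfile(path)
                dups.setdefault(file_hash, []).append(path)
        return dups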