Last Updated: Wednesday 14th August 2013

A hash function takes a variable-length sequence of bytes as input and converts it to a fixed-length sequence. It is a one-way function: if f is the hash function, calculating f(x) is fast and simple, but recovering x from f(x) is computationally infeasible. The value returned by a hash function is often called a hash, message digest, hash value, or checksum. In practice, different inputs almost always produce different outputs. However, because the output has a fixed length while the input can be arbitrarily long, collisions (two different inputs with the same hash) must exist, and for some algorithms they can actually be found.
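To make these properties concrete, here is a small sketch using Python's standard hashlib module, which is covered later in this article. SHA-256 is an arbitrary choice for the illustration, and the helper name sha256_hex is hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

short = sha256_hex(b"a")
long_ = sha256_hex(b"a" * 10000)

# Inputs of very different lengths give digests of the same length.
print(len(short), len(long_))

# Hashing is deterministic: the same input always gives the same digest.
print(sha256_hex(b"Hello World") == sha256_hex(b"Hello World"))

# A one-character change in the input gives a different digest.
print(sha256_hex(b"Hello World") == sha256_hex(b"Hello Word"))
```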

Now suppose you want to hash the string "Hello World" with the SHA1 function. The result is 0a4d55a8d778e5022fab701977c5d840bbc486d0.

Hash functions are used inside some cryptographic algorithms, in digital signatures, message authentication codes, manipulation detection, fingerprints, checksums (message integrity check), hash tables, password storage and much more. As a Python programmer you may need these functions to check for duplicate data or files, to check data integrity when you transmit information over a network, to securely store passwords in databases, or maybe some work related to cryptography.

I want to make clear that hash functions are not a cryptographic protocol: they do not encrypt or decrypt information. They are, however, a fundamental part of many cryptographic protocols and tools.

Some of the most used hash functions are:

  • MD5: Message digest algorithm producing a 128-bit hash value. It is still widely used to check data integrity against accidental corruption, but its known collision vulnerabilities make it unsuitable for security-sensitive uses.
  • SHA: Group of algorithms designed by the U.S. NSA and published as part of the U.S. Federal Information Processing Standards. These algorithms are widely used in cryptographic applications. The digest length ranges from 160 bits (SHA1) to 512 bits (SHA512).
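The digest lengths mentioned in this list can be checked directly from Python. The following is a small sketch that assumes the standard hashlib module (introduced in the next paragraph); the helper name digest_bits is hypothetical.

```python
import hashlib

def digest_bits(name: str) -> int:
    """Return the digest length, in bits, of the named algorithm."""
    # digest_size is reported in bytes, so multiply by 8.
    return hashlib.new(name).digest_size * 8

for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
    print(name, digest_bits(name), "bits")
```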

The hashlib module, included in the Python standard library, provides an interface to the most popular hashing algorithms. hashlib implements some of the algorithms itself; if you have OpenSSL installed, hashlib can also use the algorithms that OpenSSL provides.

This code is written for Python 3.2 and above. If you want to run these examples in Python 2.x, just remove the references to algorithms_available and algorithms_guaranteed.

First, import the hashlib module:

Now we use algorithms_available or algorithms_guaranteed to list the algorithms available.

The algorithms_available attribute lists all the algorithms available on the system, including the ones available through OpenSSL, so you may see the same algorithm under several names. algorithms_guaranteed lists only the algorithms that the module itself always provides: md5, sha1, sha224, sha256, sha384 and sha512 are always present.
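A minimal sketch of these two steps might look like the following. Both attributes are sets of names, so they are sorted here only to make the output easier to read; the exact contents of algorithms_available depend on your Python and OpenSSL versions.

```python
import hashlib

# Both attributes are sets of algorithm names (strings), not functions.
guaranteed = sorted(hashlib.algorithms_guaranteed)
available = sorted(hashlib.algorithms_available)

print(guaranteed)
print(available)
```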

MD5
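A minimal sketch of the MD5 case, assuming the byte string b"Hello World" as input and hash_object as an illustrative variable name:

```python
import hashlib

# The constructor takes bytes, hence the b prefix on the literal.
hash_object = hashlib.md5(b"Hello World")

# hexdigest() returns the digest as a string of hexadecimal characters.
print(hash_object.hexdigest())
```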

The code above takes the "Hello World" string and prints the hex digest of that string. hexdigest returns a hexadecimal string representing the hash; if you need the raw sequence of bytes, use digest instead.

Note the "b" preceding the string literal: it makes the literal a bytes object, because the hash functions only accept a sequence of bytes as a parameter. In previous versions of the library (Python 2.x), they accepted a plain string literal. So, if you read some input from the console and want to hash it, do not forget to encode the string into a sequence of bytes:
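As a sketch of that encoding step: the helper name md5_of_text is hypothetical, UTF-8 is an assumed encoding, and a fixed string stands in for a value read with input(), which returns a str in Python 3.

```python
import hashlib

def md5_of_text(text: str) -> str:
    """Encode a str to bytes (UTF-8 assumed) and return its MD5 hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Stand-in for something like: text = input("Enter the string to hash: ")
text = "Hello World"
print(md5_of_text(text))
```

Passing the str directly to hashlib.md5 would raise a TypeError in Python 3, which is why the explicit encode call is needed.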

SHA1
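A minimal sketch of the SHA1 case, again assuming b"Hello World" as input. It should print 0a4d55a8d778e5022fab701977c5d840bbc486d0, the digest quoted near the start of this article.

```python
import hashlib

hash_object = hashlib.sha1(b"Hello World")
hex_dig = hash_object.hexdigest()
print(hex_dig)
```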

SHA224

SHA256

SHA384

SHA512
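The four remaining SHA variants listed above are used in exactly the same way, so a single sketch can cover them. The input b"Hello World" and the dictionary layout are assumptions made for illustration.

```python
import hashlib

message = b"Hello World"

# Each constructor has the same interface; only the digest length differs.
digests = {
    "sha224": hashlib.sha224(message).hexdigest(),
    "sha256": hashlib.sha256(message).hexdigest(),
    "sha384": hashlib.sha384(message).hexdigest(),
    "sha512": hashlib.sha512(message).hexdigest(),
}

for name, hex_dig in digests.items():
    print(name, hex_dig)
```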

Using OpenSSL Algorithms

Now suppose you need an algorithm provided by OpenSSL. Using algorithms_available, you can find the name of the algorithm you want to use. In this case, "DSA" is available on my computer. You can then use the new and update methods:
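Which names OpenSSL exposes varies from machine to machine, so the sketch below uses "sha256" as a stand-in that should be present everywhere; substitute a name taken from your own algorithms_available. The helper name hash_by_name is hypothetical.

```python
import hashlib

def hash_by_name(name: str, data: bytes) -> str:
    """Hash data with an algorithm selected by its name."""
    if name not in hashlib.algorithms_available:
        raise ValueError("algorithm not available: " + name)
    hash_object = hashlib.new(name)  # look the algorithm up by name
    hash_object.update(data)         # feed it the bytes to hash
    return hash_object.hexdigest()

print(hash_by_name("sha256", b"Hello World"))
```

update can be called several times on the same object; the result is the same as hashing the concatenation of all the pieces.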

Practical example: hashing passwords

In the following example we hash a password in order to store it in a database, using a salt. A salt is a random sequence added to the password string before applying the hash function. The salt is used to prevent dictionary attacks and rainbow table attacks. However, if you are building real-world applications and working with users' passwords, make sure to keep up to date with the latest vulnerabilities in this field. If you want to find out more about secure passwords, please refer to this article.
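The following is a sketch of salted password hashing along these lines, not a definitive implementation. It assumes uuid.uuid4() as the source of the random salt (as mentioned in the comments below) and SHA-256 as the hash; the function names hash_password and check_password and the "digest:salt" storage format are hypothetical choices for illustration.

```python
import hashlib
import hmac
import uuid

def hash_password(password: str) -> str:
    """Return 'digest:salt' for storage (illustrative format)."""
    salt = uuid.uuid4().hex  # a fresh random salt for every password
    digest = hashlib.sha256(salt.encode() + password.encode()).hexdigest()
    return digest + ":" + salt

def check_password(stored: str, candidate: str) -> bool:
    """Re-hash the candidate with the stored salt and compare digests."""
    digest, salt = stored.split(":")
    candidate_digest = hashlib.sha256(salt.encode() + candidate.encode()).hexdigest()
    # compare_digest (Python 3.3+) compares in constant time.
    return hmac.compare_digest(digest, candidate_digest)

stored = hash_password("correct horse")
print(check_password(stored, "correct horse"))
print(check_password(stored, "wrong guess"))
```

A single SHA-256 pass is used here only to keep the illustration short. For real systems a deliberately slow key-derivation function, such as hashlib.pbkdf2_hmac (available from Python 3.4), is generally a better fit.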

  • Dre Peters

    Andres, thanks for your wonderful tutorial, I learnt some things in it and I’ve had to edit to taste.

  • Kristoffer Legind

    Very nice overview, kudos.
    There is also pyBlake and pyStein that provide nice alternatives.

  • Python Lover

    hehhe, u wanna say username or password not matched :P we dont want to give out any hints now !!

  • alain

    can you salt using a static value? and if so how would i do that ?

    • http://jacksonc.com Jackson Cooper

      You could, but using the same salt for everything is the same as not using a salt. Salts should be randomly generated every time. This article uses uuid.uuid4() to do that. There’s other ways too – random, time, /dev/random, etc.

  • Akshay

    How to delete the image from the directory if its hash value is same

    • http://jacksonc.com Jackson Cooper

      os.remove()

      • Akshay

        Its not working for me ….Here is my code i want to delete same hash value data from the disk

        import hashlib
        import os

        k = open('C:HomepandpKoala.jpg')
        l = hashlib.md5(k.read()).hexdigest()
        j = open('C:HomepandpKoala1.jpg')
        m = hashlib.md5(j.read()).hexdigest()
        #dic={l:k.name,m:j.name}
        if (l == m):
            os.remove(k)

        or tell me how to do it with the help of dictionary

        • http://jacksonc.com Jackson Cooper

          if l == m: os.remove(dic[m])

          • Akshay

            But before that I have to close it, otherwise it will show an error.

        • http://jacksonc.com Jackson Cooper

          If you store hashes as keys in a dictionary, duplicate keys will be missing. e.g. dic = {'hash': '123', 'hash': '234'} evaluates to {'hash': '234'}.

          Although you could use a dictionary for each file to check, containing the filename and hash. Here’s an example (untested):


          import os
          from hashlib import md5

          def file_hash(filepath):
              # Read in binary mode: md5() needs bytes, and image files are binary data
              with open(filepath, 'rb') as f:
                  return md5(f.read()).hexdigest()

          os.chdir(r'C:Homepandp')

          file1 = {'hash': file_hash('Koala.jpg'), 'filename': 'Koala.jpg'}
          file2 = {'hash': file_hash('Koala1.jpg'), 'filename': 'Koala1.jpg'}

          # Matching digests: treat the second file as a duplicate and delete it
          if file1['hash'] == file2['hash']:
              os.remove(file2['filename'])

          Note the os.remove() call should be indented (Disqus is playing up).

          • Akshay

            Got it ……..thanks a lot jackson !!!

          • Akshay

            Jackson i need to ask one more thing that if i want to scan whole directory and remove duplicate .jpg file from that directory …. for each image hash key and its value ie path Then how to tackle with that one.. because dictionary one contain unique values

          • http://jacksonc.com Jackson Cooper

            Depends how you want to handle deleting the duplicates. The simplest way would be to keep a record of the hashes you have seen, adding to it as you find and hash each file. If you come across a file whose hash is already recorded, simply delete that file as a duplicate. e.g. (a rough sketch that scans the current directory):

            import hashlib
            import os

            hashes = set()
            for root, dirs, files in os.walk('.'):
                for name in files:
                    path = os.path.join(root, name)
                    with open(path, 'rb') as f:
                        file_hash = hashlib.md5(f.read()).hexdigest()
                    # Duplicate!
                    if file_hash in hashes:
                        os.remove(path)
                    else:
                        hashes.add(file_hash)
            
          • Akshay

            ok …. thanks for the pseudocode jackson !!!

      • Akshay

        Its not working for me ….Here is my code i want to delete same hash value data from the disk

        import hashlib
        import os

        k = open('C:HomepandpKoala.jpg')
        l = hashlib.md5(k.read()).hexdigest()
        j = open('C:HomepandpKoala1.jpg')
        m = hashlib.md5(j.read()).hexdigest()
        #dic={l:k.name,m:j.name}
        if (l == m):
            os.remove(k.name)

        or tell me how to do it with the help of dictionary

      • Akshay

        Thanks Jackson for your reply …. after closing that file it worked for me !!!!