One of the features that came out of our little hackathon, and will be rolling out in the next couple of weeks, is the ability to detect duplicate icons upon submission. For example, if a user downloads an existing icon and tries to re-upload it in order to sell it and make a profit (yes, we have had such cases), we will be able to detect it as a duplicate of an already existing icon and mark the account as fraudulent.
A frequently used way to check whether a given file already exists in a large collection of files is to calculate a hash for each individual file in the dataset and store the hashes in a database. When we want to locate a particular file, we calculate the hash for that file and look up whether its hash value exists in the database.
Cryptographic hashing algorithms are a commonly used choice for this. Implementations of the most widely used ones, such as MD5, SHA-1 and SHA-256, are available in the standard library of almost any programming language; they are really easy to use and work well for the simplest use cases.
For instance, in Python you would simply import the hashlib module and call a function to generate the hash value for a particular string or file.
>>> import hashlib

# Calculating the hash value of a string.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'

# Loading an image file into memory and calculating its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'
This would work fine and dandy if we were sure the uploaded files weren't tampered with. But due to the nature of cryptographic hashing algorithms, even the slightest change to the input results in an avalanche effect, so the hash of the new file will be totally different from that of the original file.
Let’s take an example where we add a period at the end of the sentence.
# Original text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'

# Slight modification of the text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog.').hexdigest()
'e4d909c290d0fb1ca068ffaddf22cbd0'
As you can see in the example above, the values 9e107d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 are completely different.
In the case of images, if the background color is changed, the image is cropped or rotated, or even if just one pixel is modified, we won't be able to match the hash of the image to an already existing one. This is not really practical, since we want to detect similar images even if they have been modified a little.
For example when we modify the color of the nose on the cat the computed hash value will be completely different.
# Load the original image into memory and calculate its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'

# Load the modified image into memory and calculate its hash value.
>>> image_file_modified = open('data/cat_grumpy_modif.png', 'rb').read()
>>> hashlib.md5(image_file_modified).hexdigest()
'12d1b9409c3e8e0361c24beaee9c0ab1'
This is where perceptual hashing comes in: instead of hashing the raw bytes of the file, we hash the visual features of the image, so similar images produce similar hashes. There are a number of perceptual hashing algorithms, but the one I'm going to present is dHash (difference hashing), which computes the difference in brightness between adjacent pixels, identifying the relative gradient direction.
Before we dive into the algorithm, let's start with some basics. A color image is composed of red, green and blue (RGB) pixels, and we can think of it as a list of sets of red, green and blue values. For example, let's take an image, load it using the Python Imaging Library (PIL) and print the pixel values.
>>> from PIL import Image
>>> test_image = Image.open('data/test_image.jpg')

# The image is an RGB image with a size of 4x4 pixels.
>>> print('Image Mode: %s' % test_image.mode)
Image Mode: RGB
>>> print('Width: %s px, Height: %s px' % (test_image.size[0], test_image.size[1]))
Width: 4 px, Height: 4 px

# Get the pixel values from the image and print them in rows based on
# the image's width.
>>> width, height = test_image.size
>>> pixels = list(test_image.getdata())
>>> for row in range(height):
...     print(pixels[row * width:(row + 1) * width])
...
[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
[(0, 0, 0), (212, 45, 45), (51, 92, 154), (130, 183, 47)]
[(206, 210, 198), (131, 78, 8), (131, 156, 180), (117, 155, 201)]
[(104, 133, 170), (215, 130, 20), (153, 155, 155), (104, 142, 191)]
Now let's get back on track. The dHash algorithm has four steps; I'm going to go through each one and illustrate its effect on the original and modified image.
By grayscaling the image we reduce each pixel to a single luminous intensity value. For example, a white pixel (255, 255, 255) is reduced to an intensity value of 255 and a black pixel (0, 0, 0) is reduced to an intensity value of 0.
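For a rough idea of what that conversion does, here is a minimal sketch of the ITU-R 601 luma transform, which is the weighting PIL applies when converting an image to 'L' mode (the function name is just for illustration):

```python
def to_grayscale(rgb_pixels):
    # ITU-R 601 luma: L = R * 299/1000 + G * 587/1000 + B * 114/1000,
    # approximating what PIL's convert('L') does per pixel.
    return [(r * 299 + g * 587 + b * 114) // 1000 for r, g, b in rgb_pixels]

# White stays 255, black stays 0, and pure red lands at 76.
print(to_grayscale([(255, 255, 255), (0, 0, 0), (255, 0, 0)]))
```

The green channel gets the largest weight because the human eye is most sensitive to green light.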
By shrinking the image to a common base size, for example 9×8 pixels, where the width is 1px larger than the height (you'll understand why the odd size in step 3), we remove all the high frequencies and details of the image and are left with a sample of 72 intensity values. This also means that resizing or stretching an image won't affect its hash value, since all images are normalized to a common size.
After the previous two steps we are left with a list of intensity values. We then compare each intensity value to its adjacent value within each row, resulting in a binary array.
>>> from PIL import Image
>>> img = Image.open('data/cat_grumpy_orig_after_step_2.png')
>>> width, height = img.size
>>> pixels = list(img.getdata())
>>> for row in range(height):
...     print(pixels[row * width:(row + 1) * width])
...
[254, 254, 255, 253, 248, 254, 255, 254, 255]
[254, 255, 253, 248, 254, 255, 254, 255, 255]
[255, 253, 248, 254, 255, 254, 255, 255, 255]
[253, 248, 254, 255, 254, 255, 255, 255, 222]
[248, 254, 255, 254, 255, 255, 255, 222, 184]
[254, 255, 254, 255, 255, 255, 222, 184, 177]
[255, 254, 255, 255, 255, 222, 184, 177, 184]
[254, 255, 255, 255, 222, 184, 177, 184, 225]
The first value (254) will be compared to the second value (254), the second value to the third and so on, giving us 8 boolean values per row.
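Taking the first row from the matrix above, the comparison can be sketched in a single list comprehension:

```python
# First row of the shrunken 9x8 grayscale image.
row = [254, 254, 255, 253, 248, 254, 255, 254, 255]

# Each value is compared to the one on its right; 9 values yield 8 booleans.
bits = [row[i] > row[i + 1] for i in range(len(row) - 1)]
print(bits)  # → [False, False, True, True, False, False, True, False]
```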
>>> difference = []
>>> for row in range(height):
...     for col in range(width - 1):
...         pixel_left = pixels[row * width + col]
...         pixel_right = pixels[row * width + col + 1]
...         difference.append(pixel_left > pixel_right)
...
>>> for row in range(height):
...     print(difference[row * (width - 1):(row + 1) * (width - 1)])
...
[False, False, True, True, False, False, True, False]
[False, True, True, False, False, True, False, False]
[True, True, False, False, True, False, False, False]
[True, False, False, True, False, False, False, True]
[False, False, True, False, False, False, True, True]
[False, True, False, False, False, True, True, True]
[True, False, False, False, True, True, True, False]
[False, False, False, True, True, True, False, False]
To make the hash easy to store and use, we convert the binary array into a hexadecimal string: each True becomes a 1 bit, each False a 0 bit, and every group of 8 bits is turned into two hexadecimal characters.
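Sketching the conversion for a single group of 8 booleans, using the first row of the difference matrix above and treating index 0 as the least significant bit (the same ordering the full implementation uses):

```python
bits = [False, False, True, True, False, False, True, False]

# Bits set at indices 2, 3 and 6: 2**2 + 2**3 + 2**6 = 76.
decimal_value = sum(2 ** i for i, bit in enumerate(bits) if bit)
hex_byte = hex(decimal_value)[2:].rjust(2, '0')
print(hex_byte)  # → '4c'
```

Repeating this for each row and joining the results produces the final 16-character hash.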
A full implementation of the algorithm in Python:
def dhash(image, hash_size=8):
    # Grayscale and shrink the image in one step.
    image = image.convert('L').resize(
        (hash_size + 1, hash_size),
        Image.LANCZOS,
    )

    # Compare adjacent pixels.
    difference = []
    for row in range(hash_size):
        for col in range(hash_size):
            pixel_left = image.getpixel((col, row))
            pixel_right = image.getpixel((col + 1, row))
            difference.append(pixel_left > pixel_right)

    # Convert the binary array to a hexadecimal string.
    decimal_value = 0
    hex_string = []
    for index, value in enumerate(difference):
        if value:
            decimal_value += 2 ** (index % 8)
        if (index % 8) == 7:
            hex_string.append(hex(decimal_value)[2:].rjust(2, '0'))
            decimal_value = 0

    return ''.join(hex_string)
In the simplest case, where the images differ only slightly, the hashes will most likely be identical, so we can compare them directly.
>>> from PIL import Image
>>> from utility import dhash, hamming_distance

>>> orig = Image.open('data/cat_grumpy_orig.png')
>>> modif = Image.open('data/cat_grumpy_modif.png')

>>> dhash(orig)
'4c8e3366c275650f'
>>> dhash(modif)
'4c8e3366c275650f'

>>> dhash(orig) == dhash(modif)
True
If we kept an SQL database with our hashes and we wanted to see if the '4c8e3366c275650f' hash existed in our database, we could simply do something like this:
SELECT pk, hash, file_path FROM image_hashes
WHERE hash = '4c8e3366c275650f';
Now, in the case that the images differ a bit more and the hashes are not exactly the same, we need to compare them by calculating the minimum number of substitutions required to change one string into the other. This is called the Hamming distance.
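For hex hashes like ours, the Hamming distance is easiest to compute on the bit level. Here is one possible implementation of the hamming_distance helper imported in the earlier snippet (the source doesn't show its body, so this is a sketch):

```python
def hamming_distance(hash1, hash2):
    # XOR the two hashes as integers; positions where they differ
    # become 1 bits, so we just count the 1s in the result.
    return bin(int(hash1, 16) ^ int(hash2, 16)).count('1')

# Hashes differing in a single bit have a distance of 1.
print(hamming_distance('4c8e3366c275650f', '4c8e3366c275650e'))  # → 1
```

A small threshold on this distance (e.g. below 4) is what decides whether two images count as near-duplicates.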
On the Wikipedia page there is some sample Python code that computes the Hamming distance between two strings, but we can calculate the Hamming distance and query based on it directly in MySQL:
SELECT pk, hash,
    BIT_COUNT(
        CONV(hash, 16, 10) ^ CONV('4c8e3366c275650f', 16, 10)
    ) AS hamming_distance
FROM image_hashes
HAVING hamming_distance < 4
ORDER BY hamming_distance ASC;
It does a binary XOR of the two values (the first one from the database, the second the one we query by) and counts the bits that differ between them. Because BIT_COUNT works only on integers and we stored the hash as a hex string (base 16), we have to convert it to a decimal value (base 10); the same goes for the value we want to query by.
Even though I've presented the algorithm using Python snippets, the algorithm itself is pretty general and can be implemented in any other programming language.
As I said in the introduction, we will be using the algorithm on Iconfinder to warn us if a submitted icon already exists in our collection, but it can have other practical uses as well. For example, because similar images produce hashes with a small Hamming distance between them, the algorithm can also serve as the basis of a simple recommendation system, where the recommended images are those within a certain Hamming distance of the current image.