Hashing Intro

by

in

Hashing is a technique by which data are compressed into a fixed-size value that can be compared quickly.  Invented in the 1950s to access data stored on early computers, it allowed programmers to avoid having to search for a name sequentially, one at a time, a slow and inefficient searching strategy.  

For example, if programmers wanted to look up the word “Anna” in a database, they would compress that name using something like this algorithm:  Assign place in the alphabet of each letter in the name and sum them.  For “Anna”, that’s 1 + 14 + 14 = 1 or 30.  In slot 30 would be placed information about where to find “Anna”, such as its address in memory.  Knowing the address, the computer then retrieves what’s stored under “Anna”.  

So hashing is any algorithm that reduces a set of text (or numbers or any series of data) into a single number.  

In the 1990s, programmers found a way to use hashing to judge whether two sets of text were similar to each other, or “near” each other in some sense.  The algorithm is as follows:  Take each pair of letters (called a “shingle”) that appears in each text and count how many are in common between the two texts.  Divide that number by the total number of pairs and that number will saying something about the similarity of the two texts.

As an example, take “Hello, world” and “World, hello”.  Those texts are simply the same words reversed with appropriate punctuation.  In common English, they mean exactly the same thing.  Here’s how you hash them using shingles.  First, turn all upper case to lower case and eliminate all punctuation, leaving “hello world” and “world hello”.  Then draw up a table of each phrase’s shingles:

“hello world” shingles are 

he
el
ll
lo
o-
-w
wo
or
rl
ld

“world hello” shingles are

wo
or
rl
dl
d-
-h
he
el
ll
lo

The eight shingles in common are

he
el
ll
lo
wo
or
rl
ld

There are four shingles that appear in just one of the other texts:

o-
-w
d-
-h

In total, there are 12 different shingles (the 8 in common and the 4 that are in just one text).  Dividing the ones in common by the total number of shingles is 8 divided by 12 or 0.67.  This would be considered a pretty high value indicating that the two texts are close to each other, as 67 percent of the shingles are held in common.

Other ways to has exist, but most are inferior to using shingles.  For instance, simply adding up the alphabet position numbers for each phrase gives the same total: 124.  This fails as a hash because it suggests the two phrases are identical when they are not.  

Hashing is important in the Skiptune project because, for a variety of reasons, sections of melodies are not necessarily stored next to one another in the database.  We may want to gather them together later, and hashing allows us to identify what titles are similar enough to each other to suggest they are part of the same piece of music.  We’ll cover that in next week’s post.