3 min read
Published in

3 min read

What are Bloom filters?

A tale of code, dinner, and a favour with unexpected consequences.

Genesis

“So how do you know whether someone’s read a post already?” asks Sarah, Medium’s legal counsel. We’re at dinner for my birthday, yet somehow have gotten onto the topic of work yet again. Sarah is referring to Medium’s personalised reading list — a recommendation system that I helped build — which suggested new and interesting posts to users when they visited Medium’s homepage.

Hashing

To understand Bloom filters, you first have to understand hashing. To understand hashing, you don’t have to understand maths, which is good, because I’m a middling mathematician at best. I sit at the median, maybe.

b3f9b3a3504ccb29c4183730a42c8d56
uint32_t SuperFastHash (const char * data, int len) {
uint32_t hash = len, tmp;
int rem;

if (len <= 0 || data == NULL) return 0;
rem = len & 3;
len >>= 2;

/* Main loop */
for (;len > 0; len--) {
hash += get16bits (data);
tmp = (get16bits (data + 2) << 11) ^ hash;
hash = (hash << 16) ^ tmp;
data += 2 * sizeof (uint16_t);
hash += hash >> 11;
}

/* Handle end cases */
switch (rem) {
case 3: hash += get16bits (data);
hash ^= hash << 16;
hash ^= ((signed char)data[sizeof (uint16_t)]) << 18;
hash += hash >> 11;
break;
case 2: hash += get16bits (data);
hash ^= hash << 11;
hash += hash >> 17;
break;
case 1: hash += (signed char)*data;
hash ^= hash << 10;
hash += hash >> 1;
}
/* Force "avalanching" of final 127 bits */
hash ^= hash << 3;
hash += hash >> 5;
hash ^= hash << 4;
hash += hash >> 17;
hash ^= hash << 25;
hash += hash >> 6;

return hash;
}

A quick favour

This is all Marcin’s fault, really. Marcin, if you didn’t know already, is obsessed with typography. This admirable trait makes him an easy mark if you need any tasks doing that involve Medium’s fonts, or icons, which are also a font. Sometime last year, I was working on a feature which required a new icon. I was keen to get it deployed to the world before the weekend deadline, so I sent him a quick message at 6pm:

This

The word this is interesting for JavaScript programmers, a source of confusion for newcomers to the language, and frustrating even to journeypeople who periodically forget about the language’s idiosyncratic functional scope. The gist of it is, whenever you see this in JavaScript code, you never quite know what it’s referring to. For programmers who are used to everything being in its proper place, this can be disconcerting. JavaScript is full of these objectionable, endearing nuances, and is possessed of a flexibility bordering on pathological. Picture a politician conversing with the lobbyist whose large donation made their election possible. JavaScript is more flexible than that.

Returning the favour

Unfortunately for me, the this that Marcin is talking about is a bug that I can’t help him with. He wants to me to change the ordering of how people appear in a particular list, so that your friends appear first. It’s not a complex change, but the way we store our data means the list will become slow to load, and infinite scrolling will break. We are sadly stuck with what we have until we restructure our data storage, which, given our other priorities, isn’t likely to happen soon. Trade-offs happen in product development too. I report back sheepishly to him that he will have to choose something else.

[mwichary][9:08 AM] BTW I was supposed to send you possible issues to work on: https://github.com/Medium/medium/issues/xxxx
[mwichary][1:24 PM] this is a periodic reminder that you owe me this https://github.com/Medium/medium/issues/xxxxx or a Paul Ford-esque article.

Time and space

The world hardly needs another definition of the word algorithm, but here I go anyway: an algorithm is a set of steps to solve a problem. Simples. Algorithms don’t really have anything to do with computers, although computers turn out to be very handy algorithm processing machines. The hash functions described earlier are all implementations of algorithms, but not all algorithms are as complicated as those. In fact, many of the most important ones are actually quite straightforward.

Check, please

The bistro in which we are sitting is starting to empty out. The check arrives. We resist the urge to power an infinite improbability drive, split it evenly, and do the waiter a favour by paying with a single card. Credit card processing — hashes are used there too.

In Bloom

Burton Howard Bloom, the eponymous hero of our tale, and, it has to be said, possessed of a name with a very pleasing cadence — Bur-ton-Ho-ward-Bloom — studied computer science at MIT from 1963 to 1965. He invented the data structure that came to be known as the Bloom filter in 1970 while working at Computer Usage Company in Newton Upper Falls, MA.

Digestif

We leave the restaurant and head down the road for a nightcap.