RESTful MVC

Creating Maintainable Web Apps Without Frameworks

Reddit Hotness Algorithm in SQL

An important feature of content aggregators is ordering so the best articles are at the top of the page. Algorithms to accomplish this don't require math any more complex than what you learned in high-school algebra. To demonstrate, I will walk through the basics of "hotness" ranking and demonstrate one implementation using SQL.

First, imagine you have the following information about your website's content in the article table in a database.

Title                 Likes Submitted
The Mangiest Cat          5 2012-12-01 19:48:45
My Political Rant         0 2012-12-04 06:06:12
Brave New World < 1984  146 2012-08-09 23:16:23

Your site will probably have a lot more content than this but we'll stick with these three to keep it simple. If you aren't familiar with databases, you can think of a table exactly like a spreadsheet page with named columns of data. Databases use a simple language called SQL to retrieve and manipulate their data.

In an SQL SELECT statement, we can use the ORDER BY clause like so…

SELECT Title, Likes, Submitted
FROM article
ORDER BY Submitted DESC

…to retrieve the data ordered by descending submission time (most recent on top to oldest on bottom.)

Title                 Likes Submitted
My Political Rant         0 2012-12-04 06:06:12
The Mangiest Cat          5 2012-12-01 19:48:45
Brave New World < 1984  146 2012-08-09 23:16:23

This is a useful way to order content so your visitors will be shown the newest articles first. Unfortunately the most recent one is the political rant. The article's Likes count of zero implies, not surprisingly, that people are more interested in looking at a picture of a mangy cat.

We could instead retrieve our data in order of how many Likes each article has earned.

SELECT Title, Likes, Submitted
FROM article
ORDER BY Likes DESC

Which will return with the most Likes first to the least Likes last:

Title                 Likes Submitted
Brave New World < 1984  146 2012-08-09 23:16:23
The Mangiest Cat          5 2012-12-01 19:48:45
My Political Rant         0 2012-12-04 06:06:12

The reasons that A Brave New World is mediocre sounds great, but at nearly 4 months old, it is starting to grow mold. To order by what is "hot", we need to order our articles in a way that takes both the number of Likes and time Submitted into account. This is going to require some math. Do you remember when you were a snotty little kid and asked your math teacher when you would need to know math for real life? You owe her an apology.

First, let's convert the Submitted date to an actual number so we can perform some simple mathematical operations on it (you know, like addition.) Unix time is the number of seconds since January 1st, 1970. This will give us a value of the submission date as a simple integer. MySQL (a popular database I am using for these examples) has a built-in conversion function.

SELECT Title, Likes, UNIX_TIMESTAMP(Submitted)
FROM article

Will give us this:

Title                 Likes UNIX_TIMESTAMP(Submitted)
The Mangiest Cat          5 1354409325
My Political Rant         0 1354619172
Brave New World < 1984  146 1344568583

Now that we have the data we need as basic numbers, the easiest way to combine them is to add the number of Likes to the Submitted Unix time. Let's see how well that works.

SELECT Title, Likes + UNIX_TIMESTAMP(Submitted) AS Hotness
FROM article
ORDER BY Hotness DESC

This will return our data in this order:

Title                   Hotness
My Political Rant       1354619172
The Mangiest Cat        1354409330
Brave New World < 1984  1344568729

This returns our articles in the exact same order as by date only. It should not be surprising because adding a measly 5 votes to a number that is over a billion does not make much of a difference. We will need to magnify the votes value to allow it to compete with the magnitude of the time stamp.

Let's start by deciding that we want a single vote to affect our hotness ranking as much as if the article were Submitted an entire day more recently. A day is 86,400 seconds long (24 hours * 60 minutes * 60 seconds) so let's multiply Likes by that.

SELECT Title, Likes * 86400 + UNIX_TIMESTAMP(Submitted) AS Hotness
FROM article
ORDER BY Hotness DESC

Gives us this new data order:

Title                   Hotness
Brave New World < 1984  1357182983
The Mangiest Cat        1354841325
My Political Rant       1354619172

This moved our cat picture above the political rant, but the article from back in August is still on top. We have a new problem: how do we give a small number of Likes a boost without allowing articles with a very high number to dominate the top of the list long past the time they they are relavant? Fortunately there are some mathematical operators to do just this.

Logarithms are one such operator. Log base 10 is roughly the same as counting the number of digits in a decimal number, minus 1. For example:

Number      log10
1           0
10          1
100         2
1,000       3
1,000,000   6
0           No! You can't do this.
146         2.16435285578444

I don't know what 2.16435285578444 digits looks like, but hey, it's math. Looking at the above chart, you can see that the log10 of 10 votes results in the number 1. Doubling the result to 2 requires 100 (10 times as many) votes. That's just the effect we were looking for. Lets throw this into our SQL query:

SELECT Title, LOG10(Likes) * 86400 + UNIX_TIMESTAMP(Submitted) AS Hotness
FROM article
ORDER BY Hotness DESC

This blows up because you are not allowed to take a logarithm of 0 and nobody liked the political rant. Let's add 1 to every Like count to protect from this and run it again.

SELECT Title, LOG10(Likes + 1) * 86400 + UNIX_TIMESTAMP(Submitted) AS Hotness
FROM article
ORDER BY Hotness DESC

Title                   Hotness
My Political Rant       1354619172
The Mangiest Cat        1354476557.268033
Brave New World < 1984  1344755839.2177222

That didn't work at all! Wait, because we are taking the log10 of the Likes count, one vote does not equal 1 day anymore. What does it equal now?

LOG10(1 + 1) = .301029995663981

Since any (non-zero) number divided by itself always equals one, let's make 1 vote equal 1 again by dividing it by .301029995663981.

SELECT Title, LOG10(Likes + 1) * 86400 / .301029995663981 + UNIX_TIMESTAMP(Submitted) AS Hotness
FROM article
ORDER BY Hotness DESC

Title                   Hotness
The Mangiest Cat        1354632666.0811288
My Political Rant       1354619172
Brave New World < 1984  1345190635.5848327

Oh hell yeah. That is exactly the result I was looking for—recent and popular content is on top. We can combine and round the two figures that modify the vote number to make the SQL statement a little simpler.

SELECT Title, LOG10(Likes + 1) * 287015 + UNIX_TIMESTAMP(Submitted) AS Hotness
FROM article
ORDER BY Hotness DESC

One vote being as important as one day is a pretty strong effect (Reddit has 1 vote equalling only 12.5 hours) so you may want to adjust the vote multiplier to order your content in a way that works best for your content and community.