Reddit Hotness Algorithm in SQL
An important feature of content aggregators is ordering so the best articles are at the top of the page. Algorithms to accomplish this don't require math any more complex than what you learned in high-school algebra. To demonstrate, I will walk through the basics of "hotness" ranking and demonstrate one implementation using SQL.
First, imagine you have the following information about your website's content in the article table in a database.
Title Likes Submitted The Mangiest Cat 5 2012-12-01 19:48:45 My Political Rant 0 2012-12-04 06:06:12 Brave New World < 1984 146 2012-08-09 23:16:23
Your site will probably have a lot more content than this but we'll stick with these three to keep it simple. If you aren't familiar with databases, you can think of a table exactly like a spreadsheet page with named columns of data. Databases use a simple language called SQL to retrieve and manipulate their data.
In an SQL SELECT statement, we can use the ORDER BY clause like so…
SELECT Title, Likes, Submitted FROM article ORDER BY Submitted DESC
…to retrieve the data ordered by descending submission time (most recent on top to oldest on bottom.)
Title Likes Submitted My Political Rant 0 2012-12-04 06:06:12 The Mangiest Cat 5 2012-12-01 19:48:45 Brave New World < 1984 146 2012-08-09 23:16:23
This is a useful way to order content so your visitors will be shown the newest articles first. Unfortunately the most recent one is the political rant. The article's Likes count of zero implies, not surprisingly, that people are more interested in looking at a picture of a mangy cat.
We could instead retrieve our data in order of how many Likes each article has earned.
SELECT Title, Likes, Submitted FROM article ORDER BY Likes DESC
Which will return with the most Likes first to the least Likes last:
Title Likes Submitted Brave New World < 1984 146 2012-08-09 23:16:23 The Mangiest Cat 5 2012-12-01 19:48:45 My Political Rant 0 2012-12-04 06:06:12
The reasons that A Brave New World is mediocre sounds great, but at nearly 4 months old, it is starting to grow mold. To order by what is "hot", we need to order our articles in a way that takes both the number of Likes and time Submitted into account. This is going to require some math. Do you remember when you were a snotty little kid and asked your math teacher when you would need to know math for real life? You owe her an apology.
First, let's convert the Submitted date to an actual number so we can perform some simple mathematical operations on it (you know, like addition.) Unix time is the number of seconds since January 1st, 1970. This will give us a value of the submission date as a simple integer. MySQL (a popular database I am using for these examples) has a built-in conversion function.
SELECT Title, Likes, UNIX_TIMESTAMP(Submitted) FROM article
Will give us this:
Title Likes UNIX_TIMESTAMP(Submitted) The Mangiest Cat 5My Political Rant 0 Brave New World < 1984 146
Now that we have the data we need as basic numbers, the easiest way to combine them is to add the number of Likes to the Submitted Unix time. Let's see how well that works.
SELECT Title, Likes + UNIX_TIMESTAMP(Submitted) AS Hotness FROM article ORDER BY Hotness DESC
This will return our data in this order:
Title Hotness My Political Rant 1354619172 The Mangiest Cat 1354409330 Brave New World < 1984 1344568729
This returns our articles in the exact same order as by date only. It should not be surprising because adding a measly 5 votes to a number that is over a billion does not make much of a difference. We will need to magnify the votes value to allow it to compete with the magnitude of the time stamp.
Let's start by deciding that we want a single vote to affect our hotness ranking as much as if the article were Submitted an entire day more recently. A day is 86,400 seconds long (24 hours * 60 minutes * 60 seconds) so let's multiply Likes by that.
SELECT Title, Likes * 86400 + UNIX_TIMESTAMP(Submitted) AS Hotness FROM article ORDER BY Hotness DESC
Gives us this new data order:
Title Hotness Brave New World < 1984 1357182983 The Mangiest Cat 1354841325 My Political Rant 1354619172
This moved our cat picture above the political rant, but the article from back in August is still on top. We have a new problem: how do we give a small number of Likes a boost without allowing articles with a very high number to dominate the top of the list long past the time they they are relavant? Fortunately there are some mathematical operators to do just this.
Logarithms are one such operator. Log base 10 is roughly the same as counting the number of digits in a decimal number, minus 1. For example:
Number log10 1 0 10 1 100 2 1,000 3 1,000,000 6 0 No! You can't do this. 146 2.16435285578444
I don't know what 2.16435285578444 digits looks like, but hey, it's math. Looking at the above chart, you can see that the log10 of 10 votes results in the number 1. Doubling the result to 2 requires 100 (10 times as many) votes. That's just the effect we were looking for. Lets throw this into our SQL query:
SELECT Title, LOG10(Likes) * 86400 + UNIX_TIMESTAMP(Submitted) AS Hotness FROM article ORDER BY Hotness DESC
This blows up because you are not allowed to take a logarithm of 0 and nobody liked the political rant. Let's add 1 to every Like count to protect from this and run it again.
SELECT Title, LOG10(Likes + 1) * 86400 + UNIX_TIMESTAMP(Submitted) AS Hotness FROM article ORDER BY Hotness DESC Title Hotness My Political Rant 1354619172 The Mangiest Cat 1354476557.268033 Brave New World < 1984 1344755839.2177222
That didn't work at all! Wait, because we are taking the log10 of the Likes count, one vote does not equal 1 day anymore. What does it equal now?
LOG10(1 + 1) = .301029995663981
Since any (non-zero) number divided by itself always equals one, let's make 1 vote equal 1 again by dividing it by .301029995663981.
SELECT Title, LOG10(Likes + 1) * 86400 / .301029995663981 + UNIX_TIMESTAMP(Submitted) AS Hotness FROM article ORDER BY Hotness DESC Title Hotness The Mangiest Cat 1354632666.0811288 My Political Rant 1354619172 Brave New World < 1984 1345190635.5848327
Oh hell yeah. That is exactly the result I was looking for—recent and popular content is on top. We can combine and round the two figures that modify the vote number to make the SQL statement a little simpler.
SELECT Title, LOG10(Likes + 1) * 287015 + UNIX_TIMESTAMP(Submitted) AS Hotness FROM article ORDER BY Hotness DESC
One vote being as important as one day is a pretty strong effect (Reddit has 1 vote equalling only 12.5 hours) so you may want to adjust the vote multiplier to order your content in a way that works best for your content and community.