I’m continuing to work with my CakePHP project and have run across some interesting math problems I thought I’d share that surround ratings and popularity ranking for the site.
The hypothetical new service provides a social network sensibility to local civic participation, allowing users to vote on the importance of issues and comment on them. The ranking system is a simple up-or-down voting system, held in the database as either a 1 or 0, depending on the vote.
So, being a social networking site, it is important to provide some rankings in order for people to know what’s hot and what’s not on the site. These rankings are: newest, highest rated, most popular and most active. Highest rated and most popular differ in that the highest rated issue is purely a function of the ratings system, whereas most popular needs to take into account how many people have commented. Most popular and most active differ in that most active is merely an indication of how many votes and comments a given issue has. Newest is obviously a function of time and therefore a straight-ahead dB query.
So, how to arrive at the other ratings? This seemed more obvious at first, but it got more complex as I went. I determined that the best thing to do was to get out the old spreadsheet and start laying out some numbers. Initially, I thought the highest rated function aught to be purely a count of the “yes” votes on each issue. But such a system does not take into account the power of the “no” votes. The solution was to divide the number of positive ratings by the total number of ratings. This gives you a percentage of the positive ratings, so one positive rating out of five makes an overall negative rating (20% positive), whereas one positive vote out of two is much more strongly weighted (50%).
This is not an entirely satisfactory, since a single positive vote can launch an issue to the top of the ratings board. There is also the issue of two or more pairs of ratings and positives equaling the same average, such as “4 ratings, 2 positives” and “2 ratings, 1 positive.” But since the ratings are not the only criteria, it’s acceptable to over-rate low numbers. The issue of matching averages will have to be dealt with in a sorting correction.
The next step was to determine the most popular issues. In this case, I opted to multiply the rating by the number of comments. This is a more satisfactory result overall, except that no matter how many comments an issue gets, if the rating is 0, the popularity is also 0. The solution to that is to add back in the number of comments, which has the effect of pushing up the lowest numbers without unduly affecting the higher popularity numbers.
I think I’ve gotten a decent handle on how to jiggle the numbers and get out of them what I think is most important. I’d be interested in hearing from any statistics experts or other folks with experience in this type of thing how they would change my metrics system to be more accurate.