Get it All
Together

Small reference to one of my favourite authors, Tad Williams, in the title of this post. But the quote, taken from the Memory, Sorrow and Thorn series, really paints the picture I wanted for this post in terms of the lesson I found myself suddenly learning yesterday. I don’t know how useful an article this will be, but I guess I just wanted to write out my thoughts because they seem important to me. Guess that’s why they call it blogging, huh?

The system I’m building is a kind of a specialized content management system for my company’s website. As such, each “page” can define its own “template,” or layout system that defines the specific view. My boss was very impressed by this system, which is good, but later he called me with another idea: why not create two folders for templates, one for templates that are reusable, and one for templates that pretty much only get used for a specific page and never again.

On the surface, this seems to make some sense: we do have one-off pages that would require a template that never gets used again. But for reasons I couldn’t quite articulate at the time, I thought it was a terrible idea and still do. After a night to think it over, I realized that the challenge in programming is always the edge case and that the goal is to eliminate them, not create them. Here’s what I mean:

Lets say for example that we want to create two “buckets” of information, Category A and Category B. A specific set of conditions defines what qualifies for each bucket. We setup logic to compare a data set against this set of conditions and place it in the right bucket. Immediately, we have created a decision point. If the decision is simple and mathematical – size of a file, lets say – there are no edge cases. But if the decision is more complex or based on human decision making, there may be thousands of specific data sets that either fit into both or neither bucket. Then what?

Well, then we have to define a more stringent set of rules, don’t we? Or else more rules. Herein lies the problem, because you can drive yourself nuts trying to come up with a solution for every case. Worse than that, you can riddle your code with flaws and – as the title of this post suggests – create a set of rules so stringent that entirely separate parts of your code are immobilized for fear of violating those rules.

Such is the case with the current website, frankly. Its very well programmed, but unfortunately, so many rules and so many eventualities have been programmed for that its almost as if they’ve painted themselves into a corner. What was built to be flexible is now anything but.

Object Oriented Programming will not help this kind of thing. If the rules of one portion of the program obligate another part of the program overmuch, even cloud-based, OOP web services would grind to a halt.

So, what I’m getting at here is that any new decision point needs to be carefully considered in context. Does the decision represent a really necessary component of the system? Forget the perceived benefits of that decision point, what are the flaws or disadvantages that this decision point would eliminate? If there aren’t any, are the benefits really that valuable? Finally – and perhaps most critically – how many edge cases can you think of and is it really worth the brain power to figure out solutions for each one?

I think another bad habit of us programmers is that we are so enamored of our ability to solve problems programmatically that we feel the need to do so at every step. Here again is an opportunity to introduce new and mind-numbingly confusing edge cases where they’re not needed. Every web page does not need to be assembled by a foreach() loop, if/then decisions and switches: sometimes, plain old HTML with a few choice includes will do just fine.

Its an irony that, after really nearly thirty years of off-and-on programming in everything from BASIC2.0 to CakePHP, I’m only just now appreciating in definable terms: the logical conditions that make programming possible and programs flexible can also be over used to the point of farce.

The below-linked article is an excellent example of how to use multiple databases in CakePHP, with an eye towards having a production, development and potentially, even more environments with which to work.

Easy peasy database config (Articles) | The Bakery, Everything CakePHP:

Like a lot of developers out there, I use Subversion to keep control of my code and projects, and I also use a different database for development and production. But when using Cake this can be a problem when checking out my code from development to production. Unless I edit my database.php with my production config, the production code would have problems, as it would be trying to access data from the development database.

The only thing I would add is a minor clarification to the one thing that tripped me up, abbreviated in this code snippet.
[code lang=”php” highlight=”4,8″]
var $development = array(…);
var $production = array(…);
var $test = array(…);
var $default = array();

function __construct()
{
$this->default = ($_SERVER[‘SERVER_ADDR’] == ‘127.0.0.1’) ?
$this->development : $this->production;
}[/code]
The important thing here is that CakePHP *always* requires the “default” database entry. The purpose of having the other database arrays is to set the correct one at runtime, which as you can see, the author does by checking the IP address. He then replaces the empty “default” array with one of the desired database arrays.

In my case, I’m running both environments off the same VPS, so I’m comparing domain names. But the idea is the same.

Only a very small handful of people from Rochester and the surrounding areas – those who like local music – will get the reference in the title of this post. Maybe a few people outside of the Rochester area. That’s OK. It’s a small homage to a favourite local band…

By now, most of us who are committed to social web development take it as a given that URLs should be as clean as possible for as much content on your site as possible[2. For a good discussion of clean URLs, their meaning and their use, see this article.]. Clean, clear URLs provide your audience with an intuitive way to understand the structure of your website, provide an easy method to return to favourite topics or sections of your site and not least-importantly, provide search engines with easily keyword-associated URLs to log and serve to their customers.

With CakePHP, clean URLs are part of the normal process of building pages, rather than an imposition. A URL to a given resource on your CakePHP website will have a format of domain.com/:controller/:action/:param/:param…[1. If you’re not familiar with this format, have a look at this page, which is our topic of discussion, anyway.] Without knowing how your site will be organized, this is a fairly intuitive way of organizing URLs. Your controller name should be fairly descriptive of what it works with, “posts,” or “teeshirts”; your action will probably be fairly descriptive of what its meant to do, “edit,” “view,” “index,” and so forth.

But not in all cases does the standard formatting work. In the case of the site that I’m in the process of developing, PotholePatrol.org, I needed to make some changes to the way CakePHP structured it’s URLs to make things slightly more intuitive to the layout of the website. For example, since the organization of data revolves around Metro areas containing one or more Towns having many Potholes, using /metros, /towns/ and /potholes didn’t make sense and wouldn’t have accurately reflected that organization.
Continue reading

I’m continuing to work with my CakePHP project and have run across some interesting math problems I thought I’d share that surround ratings and popularity ranking for the site.

The hypothetical new service provides a social network sensibility to local civic participation, allowing users to vote on the importance of issues and comment on them. The ranking system is a simple up-or-down voting system, held in the database as either a 1 or 0, depending on the vote.

So, being a social networking site, it is important to provide some rankings in order for people to know what’s hot and what’s not on the site. These rankings are: newest, highest rated, most popular and most active. Highest rated and most popular differ in that the highest rated issue is purely a function of the ratings system, whereas most popular needs to take into account how many people have commented. Most popular and most active differ in that most active is merely an indication of how many votes and comments a given issue has. Newest is obviously a function of time and therefore a straight-ahead dB query.

So, how to arrive at the other ratings? This seemed more obvious at first, but it got more complex as I went. I determined that the best thing to do was to get out the old spreadsheet and start laying out some numbers. Initially, I thought the highest rated function aught to be purely a count of the “yes” votes on each issue. But such a system does not take into account the power of the “no” votes. The solution was to divide the number of positive ratings by the total number of ratings. This gives you a percentage of the positive ratings, so one positive rating out of five makes an overall negative rating (20% positive), whereas one positive vote out of two is much more strongly weighted (50%).

This is not an entirely satisfactory, since a single positive vote can launch an issue to the top of the ratings board. There is also the issue of two or more pairs of ratings and positives equaling the same average, such as “4 ratings, 2 positives” and “2 ratings, 1 positive.” But since the ratings are not the only criteria, it’s acceptable to over-rate low numbers. The issue of matching averages will have to be dealt with in a sorting correction.

The next step was to determine the most popular issues. In this case, I opted to multiply the rating by the number of comments. This is a more satisfactory result overall, except that no matter how many comments an issue gets, if the rating is 0, the popularity is also 0. The solution to that is to add back in the number of comments, which has the effect of pushing up the lowest numbers without unduly affecting the higher popularity numbers.

I think I’ve gotten a decent handle on how to jiggle the numbers and get out of them what I think is most important. I’d be interested in hearing from any statistics experts or other folks with experience in this type of thing how they would change my metrics system to be more accurate.

OK, I’m playing around with CakePHP and have been banging my head against a wall trying to figure out the error message which is the title of this post. For the benefit of those who come after me, let me explain what I’ve been doing wrong:

The problem was the models I was using. In my haste to get some drudge work out of the way quickly, I setup one of my models and – assuming I had done it correctly – copied and pasted for all my other models, changing the specifics as needed but basically assuming the same pattern. Well, the pattern was messed up.

If you’re specifying more options for a given relationship than just the name, you need to wrap the entire thing inside of an array(). This is what it should look like when correctly setup:

var $hasMany = array(
'Town' => array(
'className' => 'Town',
'foreignKey' => 'metro_ID',
'dependent' => 'false'
)
);

And this is the wrong way to set it up:

var $hasMany = 'Town' => array(
'className' => 'Town',
'foreignKey' => 'metro_ID',
'dependent' => 'false'
);