Get it All Together

Small reference to one of my favourite authors, Tad Williams, in the title of this post. But the quote, taken from the Memory, Sorrow and Thorn series, really paints the picture I wanted for this post in terms of the lesson I found myself suddenly learning yesterday. I don’t know how useful an article this will be, but I guess I just wanted to write out my thoughts because they seem important to me. Guess that’s why they call it blogging, huh?

The system I’m building is a kind of specialized content management system for my company’s website. As such, each “page” can define its own “template,” or layout system that defines the specific view. My boss was very impressed by this system, which is good, but later he called me with another idea: why not create two folders for templates, one for templates that are reusable, and one for templates that pretty much only get used for a specific page and never again?

On the surface, this seems to make some sense: we do have one-off pages that would require a template that never gets used again. But for reasons I couldn’t quite articulate at the time, I thought it was a terrible idea and still do. After a night to think it over, I realized that the challenge in programming is always the edge case and that the goal is to eliminate them, not create them. Here’s what I mean:

Let’s say for example that we want to create two “buckets” of information, Category A and Category B. A specific set of conditions defines what qualifies for each bucket. We set up logic to compare a data set against this set of conditions and place it in the right bucket. Immediately, we have created a decision point. If the decision is simple and mathematical – size of a file, let’s say – there are no edge cases. But if the decision is more complex or based on human decision making, there may be thousands of specific data sets that fit into either both buckets or neither. Then what?

Well, then we have to define a more stringent set of rules, don’t we? Or else more rules. Herein lies the problem, because you can drive yourself nuts trying to come up with a solution for every case. Worse than that, you can riddle your code with flaws and – as the title of this post suggests – create a set of rules so stringent that entirely separate parts of your code are immobilized for fear of violating those rules.

Such is the case with the current website, frankly. It’s very well programmed, but unfortunately, so many rules and so many eventualities have been programmed for that it’s almost as if they’ve painted themselves into a corner. What was built to be flexible is now anything but.

Object Oriented Programming will not help this kind of thing. If the rules of one portion of the program obligate another part of the program overmuch, even cloud-based, OOP web services would grind to a halt.

So, what I’m getting at here is that any new decision point needs to be carefully considered in context. Does the decision represent a really necessary component of the system? Forget the perceived benefits of that decision point: what are the flaws or disadvantages that this decision point would eliminate? If there aren’t any, are the benefits really that valuable? Finally – and perhaps most critically – how many edge cases can you think of, and is it really worth the brain power to figure out solutions for each one?

I think another bad habit of us programmers is that we are so enamored of our ability to solve problems programmatically that we feel the need to do so at every step. Here again is an opportunity to introduce new and mind-numbingly confusing edge cases where they’re not needed. Every web page does not need to be assembled by a foreach() loop, if/then decisions and switches: sometimes, plain old HTML with a few choice includes will do just fine.

It’s an irony that, after nearly thirty years of off-and-on programming in everything from BASIC 2.0 to CakePHP, I’m only just now appreciating in definable terms: the logical conditions that make programming possible and programs flexible can also be overused to the point of farce.

So, in developing the new website application for my company, I’ve hit a peculiar snafu. Namely that, because the same basic system will be used to spawn a number of different websites, things that go into making a website visible, such as images, views, CSS files and others, will need to be different from instance to instance. I found an excellent tutorial on ignoring files, but that’s when I hit the snag: since I had already put everything under revision, I discovered that I could no longer ignore the files and folders I wanted to.

What is the solution? Well, here goes.

The first step is to back up the files you want to keep out of versioning. Just copy and paste them into a safe folder, outside the working directory somewhere. This way, it will be easy to put stuff back where it belongs later.

Next, we explicitly delete the files from the repository:
[code]myhost: working$ svn delete ./path/to/directory/
myhost: working$ svn commit -m "We need to make sure that the repository no longer records these files as being under revision."[/code]

OK! Now that the files and folders are no longer in the repository, it’s time to click and drag / copy and paste our folders back into their correct locations. You will now see something like this:
[code]
myhost: working$ svn status
? path/to/directory
[/code]

That’s OK: in fact, that’s exactly what we want. Now, Subversion does not recognize our directories as versioned items. They’re brand new, as far as Subversion knows or cares. Now we can set svn:ignore the same as we would any other time:
[code]myhost: working$ cd path/to
myhost: to$ svn propset svn:ignore directory .
property 'svn:ignore' set on '.'[/code]

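That’s it! Run svn status and you’ll see your directories are not listed anymore. Remember to commit the property change on the parent directory so that other working copies pick it up.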

It’s important to remember that, if you’re going to ignore more than one directory or file within a given directory, you need to list them one per line, either in a file that you feed into svn propset with the -F option or in the editor that svn propedit opens. This blog post does an excellent job of explaining how to ignore files and directories. And of course, you should also refer back to the Red-Bean handbook on the subject.

It will doubtless seem over-obvious to those veterans out there, but for those who are diving into CakePHP for the first time, let a fellow first-timer let you in on the first, most important secret of proper CakePHP building: nothing matters until you have your database.

“Sure,” you say. “Of course I know I need a database to build a project from.” No. I’m not saying you need one. I’m saying that until you’ve got your database designed, built, tested, benchmarked and completely sound, there’s really no point in building the CakePHP aspect of it. This may not be of monumental importance to those doing simple things, such as building a rudimentary blog or content management system (why aren’t you using WordPress or Drupal?). But since the real purpose of a framework like CakePHP is to develop applications other than those already served by an Open Source solution, the majority of developers do need to consider this all-important fact before they start hacking away at some high-value project or another.

Really, it’s at the moment when you begin to think about developing with Cake that you take your first important step towards being some kind of database administrator. CakePHP will only provide the presentation layer for your application, along with some basic calculations and such. The real meat of most applications will happen inside the database, or at least, the efficiency with which you design your database will determine the ability to gather the data you’ve previously stored and therefore determine the efficiency of your site to its users.

In my opinion, this is especially true because CakePHP’s methods for dealing with data tend to be something on the less-efficient side of the curve. Regardless of your opinion on the efficiency of CakePHP’s models, there is no question that those models are meant to reflect – not influence! – your database design.  As your understanding of your database increases, your understanding of how best to form the models will also increase at the same rate. Once you can easily extract the right data in the right order directly from MySQL (or phpMyAdmin, if you like), you’re well on your way towards creating efficient, practical code in your models and controllers to deal with same.

Thus it pays to spend a lot of time carefully planning out your database and normalizing it to the best of your abilities. When considering table structures, give a bit of thought to the following:

  1. What are the broad-strokes, raw data types you will be using? Orders? SKUs? Ratings? Users? This helps determine what the tables ought to be.
  2. What are the attributes of each data type that you need to capture? Think deeply on this one. Is the attribute in question really associated to the data type, or is there a better data type you need? Is there a new table to add in/modify?
  3. Math matters! Performing mathematical formulas on data with MySQL is more efficient in many cases than drawing the data out and then performing math, but it means that query results cannot be cached. Which method works best for your database and application depends greatly on your circumstances. Plan and test this aspect of your system very carefully.
  4. Math matters, part the second: if performing mathematical queries seems the best option for certain functions of your system, consider isolating those attributes of a given data type in their own table. Basically, you might get a performance boost from dividing a subject into two tables, one containing the non-math attributes and the other containing the math attributes.
  5. Have you considered cache tables? Even those things, like COUNT(*), which don’t seem like math, strictly speaking, take efficiency away from your database. Just plain old queries take time. Perhaps those tables that get hit hard with queries should have summary cache tables to support them? A multi-table, multi-join query might for example be run every ten minutes, with the data and time of update saved to a cache table for every query in between.
  6. Indexes! It’s the ever-present challenge of database design: having enough indexes for enough query types, but not too many. A good tool for this is to just start writing down the anticipated queries you think you’ll be running on the data. What fields make sense to index in this case? Can you use values like timestamps to serve as primary keys instead of single-use IDs? Can you combine fields into a single index and have that be a relevant index for a sufficient number of queries?
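To make the cache-table idea from point 5 a little more concrete, here is a minimal sketch of the "is my summary still fresh?" decision. The function names, the `cached_at` key and the ten-minute default are all my own assumptions for illustration; nothing here is CakePHP API.

```php
<?php
// Hypothetical helper: is a cached summary row still usable?
// $cachedAt and $now are Unix timestamps; $ttl is the refresh interval in seconds.
function summaryIsFresh($cachedAt, $now, $ttl = 600)
{
    return ($now - $cachedAt) < $ttl;
}

// Hypothetical wrapper: $rebuild is any callable that runs the expensive query.
// $cacheRow is the stored row (or null if nothing has been cached yet).
function getSummary($cacheRow, $now, $rebuild, $ttl = 600)
{
    if ($cacheRow !== null && summaryIsFresh($cacheRow['cached_at'], $now, $ttl)) {
        return $cacheRow;                          // serve the stored result
    }
    return array(
        'data'      => call_user_func($rebuild),   // run the heavy query once
        'cached_at' => $now,                       // remember when we did it
    );
}
```

In a real application the rebuilt row would be written back to the cache table; the sketch leaves that out so the timing logic stays easy to test on its own.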

And planning is only one part of the process of getting a nice database on which to build an application. The next step is to fill it full of data (even just garbage data, as long as it conforms to what you’d expect in your application) and start running queries against the tables. How well do they perform with a gigabyte or so of data to sift through? Can you knock together a few new indexes and make it perform better?

Of course, as an application evolves, you may well find that the design you started with is not adequate over time. This happens. But starting with something tuned precisely to your needs will get you a lot farther once you get to the CakePHP side of things, either way.

The below-linked article is an excellent example of how to use multiple databases in CakePHP, with an eye towards having a production, development and potentially, even more environments with which to work.

Easy peasy database config (Articles) | The Bakery, Everything CakePHP:

Like a lot of developers out there, I use Subversion to keep control of my code and projects, and I also use a different database for development and production. But when using Cake this can be a problem when checking out my code from development to production. Unless I edit my database.php with my production config, the production code would have problems, as it would be trying to access data from the development database.

The only thing I would add is a minor clarification to the one thing that tripped me up, abbreviated in this code snippet.
[code lang="php" highlight="4,8"]
var $development = array(…);
var $production = array(…);
var $test = array(…);
var $default = array();

function __construct()
{
    $this->default = ($_SERVER['SERVER_ADDR'] == '127.0.0.1') ?
        $this->development : $this->production;
}[/code]
The important thing here is that CakePHP *always* requires the “default” database entry. The purpose of having the other database arrays is to set the correct one at runtime, which as you can see, the author does by checking the IP address. He then replaces the empty “default” array with one of the desired database arrays.

In my case, I’m running both environments off the same VPS, so I’m comparing domain names. But the idea is the same.
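For what it’s worth, here is roughly what a domain-name comparison could look like. This is only a sketch: the function name and the host names in it are placeholders of my own, not anything from the Bakery article.

```php
<?php
// Hypothetical helper: pick a connection name from the requested host name.
// The host names below are placeholders, not real servers.
function pickConnection($host)
{
    $developmentHosts = array('dev.example.com', 'localhost');
    // Strip an optional port ("localhost:8080") and normalise case.
    $host = strtolower(preg_replace('/:\d+$/', '', $host));
    return in_array($host, $developmentHosts, true) ? 'development' : 'production';
}
```

Inside the constructor, the result could then be used to pick the matching array, along the lines of `$this->default = $this->{pickConnection($_SERVER['SERVER_NAME'])};`. Bear in mind that `$_SERVER['SERVER_NAME']` is not set when PHP runs from the command line, so anything run from a shell would need its own fallback.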

My project’s a long way from being done, but I’ve decided to go ahead and buy my domain name and deploy what I have into it. I’m still not telling you the name of the site, though! I just feel much better knowing that the domain has officially been taken and no wildly bad luck will result in someone else owning the name and making me cry a lot.

The application that I’m in the process of building is designed around the idea of large metropolitan areas being grouped together in their own subdomains. So for example, I would like all people in the Rochester, NY metro area to go to rochester.example.com and all the Buffalo traffic to go to buffalo.example.com. So, I had to figure out how, exactly, I was going to go about doing this and I started looking around. After some trials and tribulations, I’m going to commit my process to the blog for the benefit of those who are searching for similar solutions.

I began my quest by reading this post at the CakePHP Bakery. A very good article and well documented, but it has a few problems for my application. Primarily, this example creates completely separate instances of the application for each domain, with different databases in each instance. That’s not at all what I had in mind for my application, which I expect to be cross-pollinating.

Apache and Your Domain Host

It is important to note at this point that hosted services such as BlueHost or 1and1.com will not be able to accommodate this kind of setup. Or at least, it will require a lot of work that is beyond the scope of this article. But I wanted to quickly cover the needed parameters for Apache and your domain host, assuming you’re using a VPS of some kind. You’ll need to set up any additional requirements with your web host as needed.

You can set up subdomains in one of two ways: either setting each domain up on your host separately, or else using a wildcard to cover all the bases. I used this second option, which with GoDaddy.com, requires the use of a wildcard A record. Note that if you use a wildcard, this will necessitate having some sort of “catch all” clause somewhere in your code to accommodate erroneous subdomains.
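As a rough idea of what such a “catch all” might look like on the PHP side, here is a small sketch. The function name is made up for illustration, and the list of metros is hard-coded only to keep the example self-contained.

```php
<?php
// Hypothetical lookup: return the metro slug for a subdomain, or null when
// the wildcard DNS entry delivered a name the application does not know.
function resolveMetro($subdomain, $knownMetros)
{
    $subdomain = strtolower(trim($subdomain));
    return in_array($subdomain, $knownMetros, true) ? $subdomain : null;
}

// In practice this list would come from the database.
$metros = array('rochester', 'buffalo');
```

A null result is the cue to show an error page or redirect to the home page rather than trying to load a metro that does not exist.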

As for Apache, you’ll need to set up your VirtualHost entries for each anticipated subdomain. I personally use two VirtualHost files: one for my domain and a default file for stuff that comes in that I didn’t intend, which points to a generic “Whoopsies!” kind of message. The VirtualHost entries for your subdomains can just point to your Cake directory like so:
[code lang="xml"]
<VirtualHost *:80>
ServerName domain1.example.com
DocumentRoot /var/www/path/to/cake
</VirtualHost>

<VirtualHost *:80>
ServerName domain2.example.com
DocumentRoot /var/www/path/to/cake
</VirtualHost>
[/code]
. . . and so on. Remember to reload Apache once this is done to get it to work.

CakePHP Part the First: Bootstrap.php

Based on the suggestion from the Bakery article I initially read, I set up my bootstrap.php file for subdomains. The bootstrap runs before everything else and is a great way to define some basic variables. Since many different controllers and methods will be available on a single domain, it’s important to at least verify that the correct operation is happening on the correct subdomain. Thus I defined a CLIENT_NAME per the original Bakery example, though with a few modifications, here:
[code lang="php"]
if ( isset($_SERVER['HTTP_HOST']) ) {
    // first label of the host name, e.g. "rochester" in rochester.example.com
    $subdomain = substr($_SERVER["HTTP_HOST"], 0, strpos($_SERVER["HTTP_HOST"], "."));
    // define values that should NOT be affected by this test:
    $donot = array(
        'productionserver',
        'developmentserver',
        'cake',
        'www');
    if (!in_array($subdomain, $donot)) {
        define('CLIENT_NAME', $subdomain);
    } else {
        define('CLIENT_NAME', 'home');
    }
}
[/code]
As you can see, I’ve used an array to define some known values that I’d like the system to define as “home.” If a person comes to productionserver.com, I obviously want them to be routed as coming to the home page. I also want to avoid misinterpreting my development server’s name as a metro, so I’ve specified a couple of possible dev domain names as “home,” as well. Even though it is my habit to avoid using the ‘www’ on a domain name, for security’s sake, I’ve defined this as “home” too. If none of these values matches, the default branch assumes that the first segment of the host name is a valid metro area and sets it as the current CLIENT_NAME.
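If you want to poke at this logic without a web server in the way, the same check can be written as a plain function. This is a sketch of my own, not the Bakery’s code: the function name is hypothetical, and the handling of ports and dot-less hosts is an addition I’m assuming you’d want.

```php
<?php
// Hypothetical pure-function version of the bootstrap check.
// $reserved plays the role of the $donot array.
function clientNameFromHost($host, $reserved)
{
    $host = strtolower(preg_replace('/:\d+$/', '', $host));  // drop any port
    $dot  = strpos($host, '.');
    // No dot at all (e.g. "localhost"): treat the whole host as the label.
    $label = ($dot === false) ? $host : substr($host, 0, $dot);
    return in_array($label, $reserved, true) ? 'home' : $label;
}

$reserved = array('productionserver', 'developmentserver', 'cake', 'www');
echo clientNameFromHost('rochester.example.com', $reserved), "\n"; // rochester
echo clientNameFromHost('www.example.com', $reserved), "\n";       // home
```

The dot-less case is worth a moment’s thought: when the host contains no dot, strpos() returns false, and the substr() call in the bootstrap version then yields an empty string, which would end up as the client name.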

This provides me the verification that I need whenever a new request is processed for a subdomain. But I have a controller to handle metro requests which I’d like to use in place of the generic pageController I use for the home page. For this, I need to use the Routes configuration:

CakePHP Part the Second: Routes Configuration

routes.php is a file you use to define what controllers get used with what URL strings. This is a hugely flexible and powerful piece of code which I have found you should use sparingly and cautiously. But there’s no question that Route configuration can be your best friend when building a functional application.

In our case, we need to snag the subdomain name once more and if it is indeed a subdomain, call the metroController with the host name as the first parameter. If this code looks familiar, that’s because it should! It’s practically the same code as the bootstrap.php code:
[code lang="php"]
$subdomain = substr($_SERVER["HTTP_HOST"], 0, strpos($_SERVER["HTTP_HOST"], "."));
// define values that should NOT be affected by this test:
$donot = array(
    'potholepatrol',
    'holisticnetworking',
    'cake');
if (!in_array($subdomain, $donot)) {
    Router::connect('/', array('controller' => 'metros', 'action' => 'view', $subdomain));
} else {
    Router::connect('/', array('controller' => 'pages', 'action' => 'home'));
}
[/code]
Why not just see if the ‘CLIENT_NAME’ is defined? Well, because these are two separate checks on the same thing, and we don’t want to allow an error in one to affect the other.

This concludes my setup example for using CakePHP with multiple domains. I hope other developers find it interesting and useful. Please add any comments you have if you think there’s a better way!

I’m continuing to work with my CakePHP project and have run across some interesting math problems I thought I’d share that surround ratings and popularity ranking for the site.

The hypothetical new service provides a social network sensibility to local civic participation, allowing users to vote on the importance of issues and comment on them. The ranking system is a simple up-or-down voting system, held in the database as either a 1 or 0, depending on the vote.

So, being a social networking site, it is important to provide some rankings in order for people to know what’s hot and what’s not on the site. These rankings are: newest, highest rated, most popular and most active. Highest rated and most popular differ in that the highest rated issue is purely a function of the ratings system, whereas most popular needs to take into account how many people have commented. Most popular and most active differ in that most active is merely an indication of how many votes and comments a given issue has. Newest is obviously a function of time and therefore a straight-ahead database query.

So, how to arrive at the other ratings? This seemed more obvious at first, but it got more complex as I went. I determined that the best thing to do was to get out the old spreadsheet and start laying out some numbers. Initially, I thought the highest rated function ought to be purely a count of the “yes” votes on each issue. But such a system does not take into account the power of the “no” votes. The solution was to divide the number of positive ratings by the total number of ratings. This gives you a percentage of the positive ratings, so one positive rating out of five makes an overall negative rating (20% positive), whereas one positive vote out of two is much more strongly weighted (50%).
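In code, the calculation is about as small as it sounds. The function name below is mine, and returning null for an issue with no votes is an assumption on my part, since a zero total would otherwise mean dividing by zero.

```php
<?php
// Share of positive votes: positives divided by total votes.
// Returns null when nobody has voted yet (an assumed convention).
function positiveShare($positives, $total)
{
    if ($total <= 0) {
        return null;              // avoid dividing by zero
    }
    return $positives / $total;
}

var_dump(positiveShare(1, 5));    // float(0.2)
var_dump(positiveShare(1, 2));    // float(0.5)
var_dump(positiveShare(0, 0));    // NULL
```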

This is not entirely satisfactory, since a single positive vote can launch an issue to the top of the ratings board. There is also the issue of two or more pairs of ratings and positives equaling the same average, such as “4 ratings, 2 positives” and “2 ratings, 1 positive.” But since the ratings are not the only criteria, it’s acceptable to over-rate low numbers. The issue of matching averages will have to be dealt with in a sorting correction.

The next step was to determine the most popular issues. In this case, I opted to multiply the rating by the number of comments. This is a more satisfactory result overall, except that no matter how many comments an issue gets, if the rating is 0, the popularity is also 0. The solution to that is to add back in the number of comments, which has the effect of pushing up the lowest numbers without unduly affecting the higher popularity numbers.
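Written out, the popularity score is the rating times the comment count, plus the comment count again. Here is a minimal sketch, with a function name of my own choosing and the rating assumed to be a fraction between 0 and 1:

```php
<?php
// Popularity as described above: rating * comments, plus the comments again
// so that a zero rating does not wipe out an actively discussed issue.
function popularity($rating, $comments)
{
    return $rating * $comments + $comments;
}

var_dump(popularity(0, 10));      // int(10)
var_dump(popularity(0.5, 10));    // float(15)
var_dump(popularity(1, 10));      // int(20)
```

One thing the code makes obvious is that the formula factors to comments × (1 + rating), so with the same number of comments a perfectly rated issue can score at most double what a zero-rated one does.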

I think I’ve gotten a decent handle on how to jiggle the numbers and get out of them what I think is most important. I’d be interested in hearing from any statistics experts or other folks with experience in this type of thing how they would change my metrics system to be more accurate.

The last few weeks have been absorbed in some freelance work for a local marketing company, which is a nice change of pace. But that also means I’ve not been able to get at my new pet project in that time, which has been a bit of a downer.

And since that new project is CakePHP driven and I’m only really learning the platform, time away from the project means knowledge lost or at least deeply buried. That makes getting back into it something of a challenge.

And indeed, I took the better part of the last two days figuring out a problem which turned out to have been rather obvious. Obvious, that is, if you know where to look. But with all that out of the way, I’m starting to make some decent progress on the project and am hoping to get something that at least looks nice by the end of the week.

Because more than one recruiter has said they want something like this to point to as a portfolio item. Everyone seems pleased with the direction I’m going in: they like what the project’s aim is. Sorry I can’t share that on this blog, but until I’ve got my domain name in place, I don’t want to screw myself.

But I really like the layout I’ve chosen for this project: very clean with big, friendly fonts to draw in the less-technical. The project is supposed to be about lowering the bar of participation in government, so it’s important not to cram too much information on any one page.

Hopefully by week’s end: a preview of my new CakePHP-powered web site!

OK, I’m playing around with CakePHP and have been banging my head against a wall trying to figure out the error message which is the title of this post. For the benefit of those who come after me, let me explain what I’ve been doing wrong:

The problem was the models I was using. In my haste to get some drudge work out of the way quickly, I set up one of my models and – assuming I had done it correctly – copied and pasted for all my other models, changing the specifics as needed but basically assuming the same pattern. Well, the pattern was messed up.

If you’re specifying more options for a given relationship than just the name, you need to wrap the entire thing inside of an array(). This is what it should look like when correctly set up:

var $hasMany = array(
    'Town' => array(
        'className' => 'Town',
        'foreignKey' => 'metro_ID',
        'dependent' => false
    )
);

And this is the wrong way to set it up:

var $hasMany = 'Town' => array(
    'className' => 'Town',
    'foreignKey' => 'metro_ID',
    'dependent' => false
);
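One related detail worth checking in association arrays like these: PHP treats any non-empty string other than '0' as true, so a boolean option should be written as a bare false rather than a quoted one. The array names below are just for illustration.

```php
<?php
// A quoted 'false' is a non-empty string, which PHP evaluates as true.
$quoted = array('dependent' => 'false');
$bare   = array('dependent' => false);

var_dump((bool) $quoted['dependent']);  // bool(true)
var_dump((bool) $bare['dependent']);    // bool(false)
```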