Get it All

So, it turns out that you don’t really need a Model file in order to access database tables. I’ve done it with two different tables (quite accidentally) and it worked fine.

Mind you: I would not recommend such a thing. After all, if there is stuff you want to change about the way you handle a given table (from creating custom query functions to changing the name of the Model because of an unfortunately-named table, to using an entirely different database for a given table), you can’t do anything without the Model file. And you may as well make sure that the file works correctly from the jump rather than scratch your head over it later. But this is very interesting….

It will doubtless seem over-obvious to those veterans out there, but for those who are diving into CakePHP for the first time, let a fellow first-timer let you in on the first, most important secret of proper CakePHP building: nothing matters until you have your database.

“Sure,” you say. “Of course I know I need a database to build a project from.” No. I’m not saying you need one. I’m saying that until you’ve got your database designed, built, tested, benchmarked and completely sound, there’s really no point in building the CakePHP aspect of it. This may not be of monumental importance to those doing simple things, such as building a rudimentary blog or content management system (why aren’t you using WordPress or Drupal?). But since the real purpose of a framework like CakePHP is to develop applications other than those already provided an Open Source solution, the majority of developers do need to consider this all-important fact before they start hacking away at some high-value project or another.

Really, its at the moment when you begin to think about developing with Cake that you take your first important step towards being some kind of database administrator. CakePHP will only provide the presentation layer for your application, along with some basic calculations and such. The real meat of most applications will happen inside the database, or at least, the efficiency with which you design your database will determine the ability to gather the data you’ve previously stored and therefore determine the efficiency of your site to its users.

In my opinion, this is especially true because CakePHP’s methods for dealing with data tend to be something on the less-efficient side of the curve. Regardless of your opinion on the efficiency of CakePHP’s models, there is no question that those models are meant to reflect – not influence! – your database design. ¬†As your understanding of your database increases, your understanding of how best to form the models will also increase at the same rate. Once you can easily extract the right data in the right order directly from MySQL (or phpMyAdmin, if you like), you’re well on your way towards creating efficient, practical code in your models and controllers to deal with same.

Thus it pays to spend a lot of time carefully planning out your database and normalizing it to the best of your abilities. When considering table structures, give a bit of thought to the following:

  1. What are the broad-strokes, raw data types you will be using? Orders? SKUs? Ratings? Users? This helps determine what the tables aught to be.
  2. What are the attributes of each data type that you need to capture? Think deeply on this one. Is the attribute in question really associated to the data type, or is there a better data type you need? Is there a new table to add in/modify?
  3. Math matters! Performing mathematical formulas on data with MySQL is more efficient in many cases than drawing the data out and then performing math, but it means that query results cannot be cached. Which method works best for your database and application depends greatly on your circumstances. Plan and test this aspect of your system very carefully.
  4. Math matters, part the second: if performing mathematical queries seems the best option for certain functions of your system, consider isolating those attributes of a given data type in their own table. Basically, you might get a performance boost from dividing a subject into two tables, one containing the non-math attributes and the other containing the math attributes.
  5. Have you considered cache tables? Even those things, like COUNT(*), which don’t seem like math, strictly speaking, take efficiency away from your database. Just plain old queries take time. Perhaps those tables that get hit hard with queries should have summary cache tables to support them? A multi-table, multi-join query might for example be run every ten minutes, with the data and time of update saved to a cache table for every query in between.
  6. Indexes! Its the ever-present challenge of database design: having enough indexes for enough query types, but not too many. A good tool for this is to just start writing down the anticipated queries you think you’ll be running on the data. What fields make sense to index in this case? Can you use values like timestamps to serve as primary keys instead of single-use IDs? Can you combine fields into a single index and have that be a relevant index for a sufficient number of queries?

And planning is only one part of the process of getting a nice database on which to build an application. The next step is to fill it full of data (even just garbage data, as long as it conforms to what you’d expect in your application) and start running queries against the tables. How well do they perform with a gigabyte or so of data to sift through? Can you knock together a few new indexes and make it perform better?

Of course, as an application evolves, you may well find that the design you started with is not adequate over time. This happens. But starting with something tuned precisely to your needs will get you a lot farther once you get to the CakePHP side of things, either way.

The below-linked article is an excellent example of how to use multiple databases in CakePHP, with an eye towards having a production, development and potentially, even more environments with which to work.

Easy peasy database config (Articles) | The Bakery, Everything CakePHP:

Like a lot of developers out there, I use Subversion to keep control of my code and projects, and I also use a different database for development and production. But when using Cake this can be a problem when checking out my code from development to production. Unless I edit my database.php with my production config, the production code would have problems, as it would be trying to access data from the development database.

The only thing I would add is a minor clarification to the one thing that tripped me up, abbreviated in this code snippet.

var $development = array(...);
var $production = array(...);
var $test = array(...);
var $default = array();

    function __construct()
        $this->default = ($_SERVER['SERVER_ADDR'] == '') ?
            $this->development : $this->production;

The important thing here is that CakePHP *always* requires the “default” database entry. The purpose of having the other database arrays is to set the correct one at runtime, which as you can see, the author does by checking the IP address. He then replaces the empty “default” array with one of the desired database arrays.

In my case, I’m running both environments off the same VPS, so I’m comparing domain names. But the idea is the same.

I’m continuing to work with my CakePHP project and have run across some interesting math problems I thought I’d share that surround ratings and popularity ranking for the site.

The hypothetical new service provides a social network sensibility to local civic participation, allowing users to vote on the importance of issues and comment on them. The ranking system is a simple up-or-down voting system, held in the database as either a 1 or 0, depending on the vote.

So, being a social networking site, it is important to provide some rankings in order for people to know what’s hot and what’s not on the site. These rankings are: newest, highest rated, most popular and most active. Highest rated and most popular differ in that the highest rated issue is purely a function of the ratings system, whereas most popular needs to take into account how many people have commented. Most popular and most active differ in that most active is merely an indication of how many votes and comments a given issue has. Newest is obviously a function of time and therefore a straight-ahead dB query.

So, how to arrive at the other ratings? This seemed more obvious at first, but it got more complex as I went. I determined that the best thing to do was to get out the old spreadsheet and start laying out some numbers. Initially, I thought the highest rated function aught to be purely a count of the “yes” votes on each issue. But such a system does not take into account the power of the “no” votes. The solution was to divide the number of positive ratings by the total number of ratings. This gives you a percentage of the positive ratings, so one positive rating out of five makes an overall negative rating (20% positive), whereas one positive vote out of two is much more strongly weighted (50%).

This is not an entirely satisfactory, since a single positive vote can launch an issue to the top of the ratings board. There is also the issue of two or more pairs of ratings and positives equaling the same average, such as “4 ratings, 2 positives” and “2 ratings, 1 positive.” But since the ratings are not the only criteria, it’s acceptable to over-rate low numbers. The issue of matching averages will have to be dealt with in a sorting correction.

The next step was to determine the most popular issues. In this case, I opted to multiply the rating by the number of comments. This is a more satisfactory result overall, except that no matter how many comments an issue gets, if the rating is 0, the popularity is also 0. The solution to that is to add back in the number of comments, which has the effect of pushing up the lowest numbers without unduly affecting the higher popularity numbers.

I think I’ve gotten a decent handle on how to jiggle the numbers and get out of them what I think is most important. I’d be interested in hearing from any statistics experts or other folks with experience in this type of thing how they would change my metrics system to be more accurate.

OK, I’m playing around with CakePHP and have been banging my head against a wall trying to figure out the error message which is the title of this post. For the benefit of those who come after me, let me explain what I’ve been doing wrong:

The problem was the models I was using. In my haste to get some drudge work out of the way quickly, I setup one of my models and – assuming I had done it correctly – copied and pasted for all my other models, changing the specifics as needed but basically assuming the same pattern. Well, the pattern was messed up.

If you’re specifying more options for a given relationship than just the name, you need to wrap the entire thing inside of an array(). This is what it should look like when correctly setup:

var $hasMany = array(
'Town' => array(
'className' => 'Town',
'foreignKey' => 'metro_ID',
'dependent' => 'false'

And this is the wrong way to set it up:

var $hasMany = 'Town' => array(
'className' => 'Town',
'foreignKey' => 'metro_ID',
'dependent' => 'false'