Complexity, code metrics & PHP
One of the axioms of software quality is that bugs are highly gregarious. When I say bugs are gregarious, I don’t mean they like to go out to cocktail parties to schmooze and flirt with the interns. I mean that they tend to hang out in the same parts of the codebase. In many cases 80% of the problems arise in 20% of the code (the Pareto principle applies). This is important to consider, because it is this code that benefits most from refactoring, and it is where the developer should aim for 100% unit test coverage.
I have been campaigning heavily in my company to ensure that a Master Test Plan (MTP) is prepared at the start of every project. A key element of an MTP is to identify the highest risks to quality and introduce risk-based testing contingencies to manage those risks. The problem is: how do I identify the risks? I think the answer to this question is to ask the more telling question: in what parts of the system does the highest complexity lie?
Before I bought my current car, I had a very sporty Mitsubishi Eclipse with all the options: power everything. I had nothing but problems with that car, culminating in the coup de grâce when it was laid up for a month outside my house in Washington DC, only to suffer the ultimate indignity of having a colony of rats move in under the hood and chew all the wiring. Everything was gone: power steering, lights… I couldn’t even wind the bloody windows down! So when I traded it in for a less glamorous Ford Focus I deliberately turned down the options. I remembered why my father hated automatic transmissions: they are too complex to fix. The more there is, the more there is to go wrong. I have learned that my eternal nemesis, entropy, will always win in the end; the best I can do is delay it.
When it comes to software there is a very strong positive correlation between the amount of complexity in a given unit of code and the number of defects that are likely to manifest.
Wouldn’t it be nice if we could analyze our code and somehow measure how complex each component is? That would give us a great deal of predictive power for unit tests, refactoring efforts and test planning. It would also identify application features that may be so expensive to maintain that they are not worth including, leading to a more stable and reliable user experience.
The good news is that such code metrics exist. They have been around for at least 30 years and have been so successful that engineers have calibrated them to measures of risk. Static code analysis tools have been developed to scan source code files and produce nice shiny reports useful for decision support and business reporting purposes. The bad news is there are no such tools for PHP, at least none that I can find. Most static code analysis tools are targeted at C/C++ and Java, with several emerging for C# / .NET. There is even a scattering of projects for Python and Ruby. For PHP, nada. This exposes one of the many limitations of PHP.
Given the high adoption of PHP compared to Python and Ruby, even in large enterprise environments, why is it that PHP does not have such tools? I think the answer lies in the fact that the majority of people who use PHP as a programming tool are not system developers and have minimal exposure to software engineering principles in general. PHP developers are for the most part web developers, due in large part to PHP’s origins as a domain-specific language. Sure, there are PHP bindings to GTK, but no GUI applications of significance have been built in PHP.
Writing a source code scanner is not something your typical PHP web developer would know how to do. If they attempted it, they would probably try using regular expressions in a PHP script, which would ultimately be a fragile and unreliable solution: a regular expression cannot tell a real control structure from the same keyword sitting in a comment or a string. You really have to analyze the parse tree itself, not the raw source code.
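To make the fragility concrete, here is a minimal sketch of the two approaches side by side. It is written in Python rather than PHP, purely because Python's standard `ast` module hands you a parse tree with no extra extension; the sample source and both function names are made up for illustration. The point is only that the word "if" in a comment or a string fools the regular expression but not the tree.

```python
import ast
import re

# Made-up sample: "if" appears in a comment, in a string, and once as a real statement.
SOURCE = '''
def greet(name):
    # if the name is missing we fall back to a default
    message = "say hi if you like"
    if name:
        return "hi " + name
    return message
'''


def count_if_regex(source):
    """Naive approach: count every standalone word 'if' in the raw text."""
    return len(re.findall(r"\bif\b", source))


def count_if_tree(source):
    """Parse-tree approach: count only genuine `if` statement nodes."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.If) for node in ast.walk(tree))


print(count_if_regex(SOURCE))  # 3: the comment and the string are counted too
print(count_if_tree(SOURCE))   # 1: only the real branch
```

The regex version could be patched to skip comments and strings, but at that point you are writing a lexer by hand, which is exactly the work the parser has already done.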
Analyzing the PHP parse tree sounds a little scary. However, it turns out to be fairly trivial to access the parse tree in an XML format using the Parse_Tree extension, which is based on the Yacc extension to XML (YAXX). Using this extension you don’t even need to implement a scanner yourself. All you need is an XML parser, which is something a more advanced PHP web developer has the skills and understanding to attempt.
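As a rough sketch of what "all you need is an XML parser" could look like, the snippet below walks an XML parse tree and counts branching constructs per function. Big caveat: I have not reproduced the real Parse_Tree output format here. The XML document, its element names (`script`, `function`, `if`, `foreach`, `while`) and the `name` attribute are all hypothetical stand-ins, and the code is Python using the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL parse-tree dump: these element and attribute names are invented
# for illustration and are not the actual Parse_Tree / YAXX schema.
PARSE_TREE_XML = """
<script>
  <function name="checkout">
    <if><if/></if>
    <foreach><if/></foreach>
  </function>
  <function name="render">
    <while/>
  </function>
</script>
"""

# Tags this sketch treats as branching constructs (also an assumption).
BRANCH_TAGS = {"if", "elseif", "for", "foreach", "while", "case", "catch"}


def branch_counts(xml_text):
    """Return {function name: number of branching nodes nested inside it}."""
    root = ET.fromstring(xml_text.strip())
    counts = {}
    for function in root.iter("function"):
        # iter() visits the function element itself plus all its descendants.
        counts[function.get("name")] = sum(
            1 for node in function.iter() if node.tag in BRANCH_TAGS
        )
    return counts


print(branch_counts(PARSE_TREE_XML))  # {'checkout': 4, 'render': 1}
```

With the real extension you would swap in whatever node names its grammar actually emits; the traversal itself should not need to change much.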
UPDATE: In the middle of writing this entry I came across Sebastian Bergmann’s PHPUnit and Software Metrics announcement of support for software metrics in the forthcoming PHPUnit 3.2. This is excellent news. Not only does it support almost all the LOC metrics (of relatively little value with regard to quality, more useful for estimating as used within COCOMO), but it also supports one of the most useful metrics I know: cyclomatic complexity. It will provide a list of every class and function with a calculation of its cyclomatic complexity (basically a measure of the number of linearly independent execution paths through the code). However, although I haven’t looked at the code, I suspect that they are not calculating the cyclomatic complexity from the parse tree. Regardless, it’s a great start.
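For anyone who has not met cyclomatic complexity before, here is a minimal sketch of the usual shortcut for computing it from a parse tree: start at 1 and add 1 for every decision point. This is not how PHPUnit does it (as I said, I haven't looked at that code). It is Python over Python source using the standard `ast` module, and the choice of which node types count as decision points, along with every function name below, is my own assumption for the example.

```python
import ast

# Node types treated as decision points in this sketch (a simplifying assumption).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(function_node):
    """Approximate complexity as 1 + the number of decision points."""
    complexity = 1
    for node in ast.walk(function_node):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra short-circuit paths.
            complexity += len(node.values) - 1
    return complexity


def complexity_report(source):
    """Return (function name, complexity) pairs, most complex first."""
    tree = ast.parse(source)
    scores = {
        node.name: cyclomatic_complexity(node)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


SAMPLE = '''
def simple(x):
    return x + 1

def tangled(items, limit):
    total = 0
    for item in items:
        if item > 0 and item < limit:
            total += item
        elif item == 0:
            continue
    return total
'''

for name, score in complexity_report(SAMPLE):
    print(name, score)
# tangled 5
# simple 1
```

Sorting the report puts the gregarious-bug candidates at the top, which is the shortlist I would want when deciding where to spend unit testing and refactoring effort first.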
Two other metrics supported by PHPUnit worth looking at:
It is futile to attempt any quality improvement effort without some form of objective measure to work with. Static code analysis provides key metrics that, combined with other data such as defect rates, provide good baselines for improving code quality and identifying problems early (essential to any quality practice).
Well that’s PHP software quality taken care of. I’m going outside now to check my car for rats.