Archive for the ‘Programming’ Category.

Over Optimizing

The word optimal has become a buzzword that gets tossed around too easily. A professor of mine pointed something out a few weeks ago: by definition, optimal is something that isn’t realistic to achieve in the context of programming or software engineering. If something is optimal, there is no room for improvement, which is never the case. Some people may be irritated by the misuse of this word, but I think what is worse is the practice of optimizing.

There is a fine line between making meaningful improvements to the performance of your program and wasting coding time. If you have some data that needs to be searched through, a good use of your time would be deciding on a data structure to store your data, and an algorithm to search through it. A bad use of your time would be to go through all of your loops and change i++ to ++i. For those of you who are not aware of the difference, ++i can be slightly faster than i++, because the postfix form has to hold on to the old value (and most compilers optimize the difference away for plain integers). The difference, however, is so tiny that you should never spend any amount of time changing one to the other. The same thing goes for print and echo. In PHP echo is slightly faster than print (print returns a value, echo does not), but in a real world application it would be difficult to measure the difference.

I like to call these practices trivial over optimizations. There is a non-trivial kind of over optimization that I think is an even bigger time waster. Earlier I mentioned that a good use of your time would be deciding which data structure to use to store data that you plan to search through. This being the case, you might want to spend some time comparing the performance differences between a binary search tree and a priority queue, correct? Maybe. If this program is being written to control someone’s pacemaker then the answer is certainly yes. If you plan to sort someone’s playlist with this program then the answer is probably no.

There is a third kind of optimization that is not an optimization at all: If you are searching through ten items, a linear search is faster than a binary search. This is because of the overhead in constructing the binary search tree. Optimization is a practice that should belong to programs that are dealing with a large amount of data, or are time-critical. If you are concerned with the performance of your program, the solution is not to optimize the code you have; the solution is to learn to write code in such a way that it doesn’t need to be optimized. This is something that comes with experience and attention to detail.
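To make the overhead point concrete, here is a rough Java sketch. It uses a sorted copy of an array plus Arrays.binarySearch as a stand-in for building a binary search tree, and the class and method names are made up for illustration. It makes no timing claims; it only shows that the binary search needs a setup step that the linear search does not.

```java
import java.util.Arrays;

class SmallSearch {
    // Linear search: works on unsorted data, no setup needed.
    static int linearSearch(int[] items, int target) {
        for (int i = 0; i < items.length; i++) {
            if (items[i] == target) {
                return i;
            }
        }
        return -1;
    }

    // Binary search needs sorted data, so the sort is part of the cost.
    static boolean sortThenBinarySearch(int[] items, int target) {
        int[] sorted = items.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);          // setup work before any searching happens
        return Arrays.binarySearch(sorted, target) >= 0;
    }

    public static void main(String[] args) {
        int[] items = {42, 7, 19, 3, 88, 61, 5, 27, 14, 70};
        System.out.println(linearSearch(items, 61) >= 0);
        System.out.println(sortThenBinarySearch(items, 61));
    }
}
```

For ten items the copy and sort are more work than simply walking the array once, which is the whole point.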

Inheriting Code

If you’ve been a programmer for any amount of time you’ve more than likely had the honor of inheriting someone else’s code. This might be in a corporate scenario, or you might just be modifying an open source program. Either way your experiences, though varied, are probably marked by few comments, poor syntax, and obscure methods. This might be something you can tolerate, but personally, I am a code perfectionist. I find it difficult to code in new functionality or modify existing functionality without reworking the code to suit my tastes. This can easily go from moving a few braces around to doing a full overhaul.

Unfortunately I can’t offer much advice to those who are stuck in a situation where they are digging through terrible code; honestly all you can do is be patient and buy a stress ball. I can, however, offer some tips on good coding practice that might prevent you from ruining someone’s day.

Tabs and Braces and Spaces

Also known as whitespace. There are few things more frustrating than staring at code that isn’t organized. The number one thing that deters me from helping someone with a programming question is looking at their code and seeing things like this:

$x = 0;
while($x < 10)
{
if($x==3){echo 'This is terrible code';}
else
{
echo 'hello';
}
$x++;
}

To begin with, the only time braces should not be on a line of their own is when they are preceded, or followed by a conditional statement:

//OK... this method is my personal preference
if($x == 1)
{
	echo 'hello';
}

//OK.. probably the most common
if($x == 1) {
	echo 'hello';
	/*
	.
	.
	.
	*/
}
else {
	echo 'goodbye';
	/*
	.
	.
	.
	*/
}

//OK.. I'm not a fan but it's acceptable
if($x == 1) {
	echo 'hello';
	/*
	.
	.
	.
	*/
} else {
	echo 'goodbye';
	/*
	.
	.
	.
	*/
}

//NOT OK!
if($x == 1) { echo 'hello'; /* . . . */ }

Moving on, code nested within braces or within the body of a conditional or other control flow statement should be indented:

//OK
for($i = 0; $i < 10; $i++)
{
	echo "hello \n";
	echo $i;
}

/**
* OK,
* Some people don't like this method, but as long
* as it is only a single line it's fine with me. Make
* sure you leave an empty line following the
* final line.
**/
if($x == 1)
	echo '$x = 1';
else
	echo '$x != 1';

//NOT OK
while($x < 10)
{
echo $x;
$x++;
}

Now we come to spaces. Operators and their operands should always be separated by a space. UNIX shell scripters can make an exception to this of course, but otherwise you should space things out:

//All of these are OK
$x = 10;
$y = 5;
$z = $x + $y;
$z *= 2;

//These are not OK
$x=10;
$y=5;
$z=$x+$y++;
$z*=$z+$x--; //I can't even tell you what this is equal to

Duplicate Functionality

The only thing worse than editing your terrible code is doing it twice. If you find yourself using copy and paste, or otherwise coding the same functionality multiple times, for the sake of anyone reading your code please stop. Not only is this going to translate into a bad experience when it comes time to update your code, but it’s also likely that whatever mistakes you made the first time you wrote the code were duplicated.

Yesterday I installed a WordPress plugin. The plugin performed well, and for the most part I was happy with it. Of course, however, there were modifications I needed to make (including, as it turns out, rewriting the code to make it conform to the XHTML standard). Once I opened the file to make the changes I was horrified to discover that there were literally zero functions or objects. If something needed to be done twice, the code was copied. Unfortunately the writer of the plugin forgot to close a div, which is an honest mistake. What isn’t an honest mistake, and should be punishable by death by CRT monitor thrown at you, is taking that code, and copying it three times (error included) to perform the exact same task.
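As a rough illustration of the alternative, here is a small Java sketch. The renderBox helper and its markup are hypothetical, not taken from the plugin. The point is only that the markup lives in one place, so a forgotten closing tag has to be fixed once rather than three times.

```java
class BoxRenderer {
    // Hypothetical helper: every box on the page is built here,
    // so the opening and closing div always stay in sync.
    static String renderBox(String title, String body) {
        return "<div class=\"box\"><h3>" + title + "</h3><p>" + body + "</p></div>";
    }

    public static void main(String[] args) {
        // Three uses, one definition.
        System.out.println(renderBox("News", "First item"));
        System.out.println(renderBox("Links", "Second item"));
        System.out.println(renderBox("About", "Third item"));
    }
}
```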

Manageability

That same file I just mentioned contained over 5000 lines of code. I don’t care how large of a project it is, no single file should ever exceed 1000 lines. Of course, in this case, the programmer decided not to use any functions so there was no logical way to break up the file. If you, on the other hand, are not out of your mind, any significant amount of code you create will contain many functions. Most likely those functions can be categorized and placed in separate files. If you want to get fancy you can even use some objects. The important thing is that the code you write isn’t in blob form.

The same concept applies to individual lines of code. Referring once again to the plugin I had the displeasure of modifying, the longest line in that file was 549 characters long. If I can’t see an entire line of code without scrolling on a widescreen monitor then there is a problem. Really, even using a fullscreen monitor you should never have to use the horizontal scroll. There is nothing wrong with using the return key in the middle of a line of code.

//NOT OK
$animalCount = array('cat' => 5, 'dog' => 2, 'bird' => 7, 'mouse' => 3, 'badger' => 0, /* we don't need no stinking badgers */ 'chicken' => 4);

//OK
$animalCount = array(
			'cat' => 5,
			'dog' => 2,
			'bird' => 7,
			'mouse' => 3,
			'badger' => 0, //we don't need no stinking badgers
			'chicken' => 4	);

Comments

The final portion of my rant regards commenting your code. This is something we’ve all been guilty of at some point. I don’t care how well written your code is, most likely I can’t tell what it does at first glance. If you define a function, write a few words about what that function does. If you do something that seems out of the ordinary, at the very least write down that you did that intentionally.

I wrote some code a few days ago and it contained a switch statement. I intentionally did not use a break in one of the cases because the following case needed to be performed as well. To someone glancing at the code, however, this would seem like a bug. They would add a break statement and introduce a new bug when they thought they were fixing one.
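Here is a minimal Java sketch of that situation. The permission levels and names are invented for the example and are not the code I actually wrote. The single comment is what stops the next person from "fixing" the missing break.

```java
class FallThrough {
    static String describe(int level) {
        StringBuilder sb = new StringBuilder();
        switch (level) {
            case 2:
                sb.append("admin ");
                // Intentional fall-through: admins also get everything users get.
            case 1:
                sb.append("user");
                break;
            default:
                sb.append("guest");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(2));
        System.out.println(describe(1));
        System.out.println(describe(0));
    }
}
```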

Use comments!

Strings Are Arrays

I’ve seen a lot of people asking questions about how to find a certain character or sequence inside a string. It is common for people to turn to the library in order to find a function that does this for them, but it is likely the case that the answer is right in front of you. If you need to search through a string for whatever reason you can index it as an array. This is the case in a lot of languages (PHP and C++ directly, Java through the charAt method).

In a loosely typed language like PHP people often forget the distinction between primitive types and other constructs. Primitives are often things like int, float, double, long, and char. Strings are generally not a primitive type but an array of chars. Most languages tend to hide this fact from you, because strings are so common it is often more convenient for the programmer to deal with them as if they were not an array of characters. For example, Java does not support operator overloading; you can only use arithmetic operators (+, -, *) on primitive types, with the exception of strings, which can be appended using the + operator.

So how can you use this fact to your advantage? There are many cases where indexing strings rather than passing them to some function can serve your purpose more efficiently. Here are some examples in PHP:

Ex.
You have several lines of text and you want to count the instances of a particular word. One way to do this would be to use explode to turn the text into an array of words and then count the instances of that word:

$text = 'cat dog chicken cat bird mouse cat lizard';
$words = explode(' ' , $text);
$count = 0;
foreach($words as $word)
{
     if($word == 'cat')
          $count++;
}
echo $count; //should output 3

One function call and one loop, pretty simple right? Maybe not. It is important to remember that function calls may do a lot more than they appear to. In order to split the string into an array of words, explode must search through the string for a space. If all you need to do is count the number of instances of a word, there is no need to waste time constructing an array of words and then looping through that array. Here is a more efficient way:

$curr = '';
$count = 0;
$searchStr = 'cat';
$text = 'cat dog chicken cat bird mouse cat lizard';
$len = strlen($text);

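for($i = 0; $i <= $len; $i++)
{
     //a space or the end of the text marks the end of a word
     if($i == $len || $text[$i] == ' ')
     {
          if($curr === $searchStr)
               $count++;

          $curr = '';
     }
     else
          $curr .= $text[$i];
}
echo $count; //should output 3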

The above example contains several more lines of code, but it is important to remember that the number of lines in a program should not be used to measure the performance of a program.
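The same idea carries over to Java, where a string is indexed with charAt instead of square brackets. The following is only a sketch under that assumption, with a made-up class and method name. It walks the string once and treats the end of the text as a word boundary so the last word is checked too.

```java
class WordCounter {
    static int countWord(String text, String word) {
        int count = 0;
        StringBuilder current = new StringBuilder();

        // Go one position past the end so the final word is compared as well.
        for (int i = 0; i <= text.length(); i++) {
            if (i == text.length() || text.charAt(i) == ' ') {
                if (current.toString().equals(word)) {
                    count++;
                }
                current.setLength(0); // start collecting the next word
            } else {
                current.append(text.charAt(i));
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWord("cat dog chicken cat bird mouse cat lizard", "cat"));
    }
}
```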

Validating User Input

Whenever you write an application that takes user input, you must assume users fall into two categories: users who are incompetent, meaning that they are likely to provide incorrect input, and users who are attempting to exploit the system, meaning that they are trying to access, destroy, or manipulate information that they should not be able to. Obviously there is a third category: users who are neither malicious nor incompetent and are using the system in good faith. In the context of making a secure and robust application, however, we do not care about this third group of users.

Robust Applications

An application is robust if it is not prone to crashing or misbehaving regardless of the input it is given. If an application is given improper input it should respond by informing the user of their mistake. This means that the programmer must determine the nature of a user’s input before using it. If your program is expecting a number as input, it should not proceed if that input is a string. Furthermore, if only a certain range of numbers (e.g. 1 through 10) is valid, then the program should not proceed if given a number outside of that range.

Regular expressions can further aid in validating data. If you are expecting a user to submit an email address, simply verifying that the input is a string is not sufficient. You want to ensure that the input meets certain criteria: It should consist of at least 1 character followed by the @ symbol followed by a domain name, the . symbol and finally, a TLD. A regular expression to accomplish this would be:

/^([a-zA-Z0-9])+([a-zA-Z0-9\.\\+=_-])*@([a-zA-Z0-9_-])+([a-zA-Z0-9\._-]+)+$/

Certainly there are more comprehensive ways to validate an email address, but it is important not to get carried away when validating data. The main purpose is not only to ensure that incorrect input is dealt with, but also that correct input is passed on without incident.

Secure Applications

In most cases incorrect input is harmless. It might cause the application to behave poorly or even crash, but generally restarting the program or submitting a form over again will fix the problem. Some input, however, is intended to exploit a vulnerability in the system. An SQL injection is a common example of this type of input. Ensuring that input conforms to any applicable constraints is a first wave of defence against this type of attack. In addition to this, however, it is important to either escape or exclude certain characters from user input. Quotes, for example, should be properly escaped.
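To show why quotes matter, here is a deliberately simplified Java sketch. The users table, the name column and both method names are hypothetical, and doubling single quotes is only the generic SQL way of writing a quote inside a string literal. In a real application you would rely on the escaping or parameter binding that your database library provides rather than on a helper like this.

```java
class QuoteEscaping {
    // Simplified illustration: doubles every single quote in the input.
    static String escapeQuotes(String input) {
        return input.replace("'", "''");
    }

    // Builds a query by string concatenation (hypothetical table and column).
    static String buildQuery(String username) {
        return "SELECT * FROM users WHERE name = '" + username + "'";
    }

    public static void main(String[] args) {
        String malicious = "x' OR '1'='1";

        // Unescaped: the input closes the string literal and adds its own condition.
        System.out.println(buildQuery(malicious));

        // Escaped: the whole input stays inside one string literal.
        System.out.println(buildQuery(escapeQuotes(malicious)));
    }
}
```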

Examples

So far I’ve talked about robust and secure applications in general. Now I will give examples of how to secure your application in PHP. Before I give code examples I would like to outline a few practices that will prevent SQL injections, as well as good faith mistakes:

  • Email addresses should be validated with a regular expression.
  • Number values, such as dates, should be validated as integers.
  • User names should be constrained to a subset of characters (e.g. A-Z, a-z, 0-9 and _) and validated with a regular expression.
  • Passwords should be hashed (e.g. with sha1) before submitting to the database.
  • All data should have quotes escaped.

Here are a few PHP functions to validate user input:

<?php

/**
Ensures that the input is a number and, if specified, lies
between the values min and max. For example, if you want to
validate that an input is a valid day of the month call
validateNumeric($input , 1 , 31);
**/
function validateNumeric($value , $min = 'none' , $max = 'none')
{
	if(!is_numeric($value))
		return false;

	if(is_numeric($min) && $min > $value)
		return false;

	if(is_numeric($max) && $max < $value)
		return false;

	return true;
}

/**
Ensures that the given email address is correctly
formatted
**/
function validateEmailAddress($address)
{
	if(!preg_match( "/^([a-zA-Z0-9])+([a-zA-Z0-9\.\\+=_-])*@([a-zA-Z0-9_-])+([a-zA-Z0-9\._-]+)+$/", $address))
	{
		return false;
	}
	return true;
}

/**
Ensures that the given username contains only
letters and numbers and is at least the given
minimum length.
**/
function validateUsername($user , $minLength)
{
	//preg_match replaces eregi, which was removed in PHP 7
	if(preg_match('/[^A-Za-z0-9]/', $user) || strlen($user) < $minLength)
	{
		return false;
	}

	return true;
}

/**
Escapes the given string. It is best to use whatever
real_escape_string method that PHP supplies for your
particular database. I use MySQL here as a default.
If no real_escape_string method exists (the mysql_*
functions were removed in PHP 7), the addslashes
function is used.
**/
function escapeString($value)
{
	if(function_exists('mysql_real_escape_string'))
		return @mysql_real_escape_string($value);
	else
		return addslashes($value);
}

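/**
Hashes the given password using sha1 (twice).
Also supports the use of a salt, which is recommended.
Note that sha1 is a fast hash; where available, PHP's
password_hash function (PHP 5.5+) is a safer choice.
**/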
function encryptPassword($password , $salt = '')
{
	$hash = sha1( $salt . sha1($password) );
	return $hash;
}

?>

Getting started with C++

I am by no means a C++ expert, but I did, for a summer, teach introduction to programming in C++ (I would have preferred teaching in Java, but the job called for C++). This being the case, I’m aware of many of the issues that people who are just picking up the language have. I’ll address a few of the common questions and issues here.

Will Code For Food

A few months ago I quit my day job in favor of dedicating more of my time to school. Seeing as how, after three months, my cash reserves are all but depleted, I’ve decided to offer my services to anyone in need (PHP, Java, SQL etc).

Check out the For Hire page for more info.

Semicolons

You know you’ve spent too much time programming when you start ending your sentences with a semicolon;

Consolidating Error Pages with .htaccess

Before I get into the topic of how to consolidate all of your error pages, let me first explain how to use .htaccess to create custom error pages. If you already know how to do this feel free to skip to the next section.

Creating Custom Error Pages

.htaccess, among other things, allows you to specify custom error pages for your site. Say a user requests a file that does not exist. Typically that person will get an error page that looks somewhat like this:

—————————————————-
Not Found

The requested URL /somepage.html was not found on this server.

Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
—————————————————-

Not only is this message not helpful, but it is unappealing. Most likely you worked tirelessly making your site look presentable, so it would be a shame for a user attempting to access a page on your site to be given an ugly error message.

.htaccess allows you to specify the page you would like to use as an error page for a particular error (404, 301, 500 etc.). To do this, if it does not already exist, create a file called .htaccess in the root ( / ) directory of your site, or whatever directory you want to use custom error pages for. Note that these custom error pages will be used not only for the directory that the .htaccess file is located in, but all of the ones below it, unless you specifically override it.

Once you have created your .htaccess file, for each error you would like a custom page for add the following line:

ErrorDocument [error code] [path to error page]

So for example, if you would like to use a custom error page for a 404 (file not found) do the following:

ErrorDocument 404 /404.html

Where 404.html is your custom error page. Note that not every code is an error code. If you tried to set up an error page for code 200 you would end up creating an infinite loop.

Consolidating Your Error Pages

So you’ve set up your custom error pages. Most likely you haven’t taken the time to create a custom page for every error, and it’s most likely the case you don’t need to. I can honestly say I’ve never gone to a site and had it come up with a 414 Request URI Too Long error. Still, you might have taken the time to create several pages for the more common errors. You may even have a whole directory dedicated to error pages. Instead, you may want to consider using a single error page for all errors. If you’re experienced with .htaccess, you may already know how to accomplish this; if not, I’ll show you.

In your .htaccess file you’ll still need to add a line for each error code you want to use a custom page for. Make sure to use the same scheme when naming all of your error pages; for example, fourohfour.html and 5hundred.php would be a bad choice. For this example I’ll use error[code].php (e.g. error404.php). These pages don’t actually need to (and should not) exist. Now we just need to create a RewriteRule for our error pages:

RewriteEngine On
ErrorDocument 404 /error404.php
ErrorDocument 500 /error500.php
RewriteRule ^error([0-9]+) error.php?code=$1 [NC]

You could eliminate the need for the rewrite rule by redirecting all of your errors to the same page like so:

RewriteEngine On
ErrorDocument 404 /error.php?code=404
ErrorDocument 500 /error.php?code=500

This, however, reveals the underlying system. This might not be a problem for most people, but some people prefer the look of URLs that don’t contain parameters (?code=xxx). Also, if you decide you want to track what errors your users are getting (like what pages they are linking to that don’t exist) you can store this information in a database, in which case you wouldn’t want users to be aware of the underlying system.

Now whenever a user gets a 404 or a 500 error they will be redirected to error.php and the error code will be passed to that script. In error.php you can now set up custom messages for each error code. Here is a simple script as an example:

$code = isset($_GET['code']) ? (string) $_GET['code'] : ''; //the error code, as a string

$errors = array( '300' => 'Multiple Choices',
                 '301' => 'Moved Permanently',
                 '302' => 'Moved Temporarily',
                 '303' => 'See Other',
                 '304' => 'Not Modified',
                 '305' => 'Use Proxy',
                 '400' => 'Bad Request',
                 '401' => 'Authorization Required',
                 '402' => 'Payment Required',
                 '403' => 'Forbidden',
                 '404' => 'Not Found',
                 '405' => 'Method Not Allowed',
                 '406' => 'Not Acceptable',
                 '407' => 'Proxy Authentication Required',
                 '408' => 'Request Timed Out',
                 '409' => 'Conflicting Request',
                 '410' => 'Gone',
                 '411' => 'Content Length Required',
                 '412' => 'Precondition Failed',
                 '413' => 'Request Entity Too Long',
                 '414' => 'Request URI Too Long',
                 '415' => 'Unsupported Media Type',
                 '500' => 'Internal Server Error',
                 '501' => 'Not Implemented',
                 '502' => 'Bad Gateway',
                 '503' => 'Service Unavailable',
                 '504' => 'Gateway Timeout',
                 '505' => 'HTTP Version Not Supported'
                      );

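//only print codes we know about, so arbitrary input is never echoed back
if(isset($errors[$code]))
	echo $code . ' ' . $errors[$code];
else
	echo 'Unknown Error';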

Clearly the error page generated by the script above would be no better than the default ones, but it is just an example. You could easily embed it into your site in order to maintain consistency for your users.

Expanding My Horizons

The more I learn the more I realize there is so much more out there that I have yet to experience. With regard to programming, I’ve gotten to a point where I am no longer limited to any one language. I do feel, however, that I have yet to experience the huge variety of languages out there. When it comes to software engineering, object oriented programming is pervasive. This approach has, so far, dominated my experiences as a programmer.

Yet there are many more approaches. Here is a fairly comprehensive list of the various paradigms out there:

Each one of these paradigms has its own list of languages associated with it. Some of them you’ve heard of, some of them are more obscure. Some languages span multiple paradigms (C++, PHP, Oz) while others are pure forms of their associated paradigm (Java).

My Goal

I’m primarily a Java and PHP programmer. This means that I’ve really only experienced two paradigms: Object-Oriented and Imperative (I’ve dabbled in functional using LISP, but not enough). My goal is to either learn a language, or a new approach with a language that I already know, that falls under every one of the categories above.

To get started, I think I’m going to take a shot at parallel programming, either in Join Java or Oz. This is most likely going to be an ongoing project for quite a while as school is devouring all of my time (honestly I shouldn’t even be writing this right now).

Database Analysis Through Simulation

Making adjustments to a database schema after it has gone into use is a daunting task. Whether it be because of efficiency issues or the incorporation of a new feature, this is a situation you should avoid at all costs. Often times, however, mistakes and inefficiencies are difficult to spot at implementation time. Only when your database has become populated, often by users who are counting on your applications to be reliable, do these things come to light. So what can you do? One solution is to run a simulation. By this I mean systematically project how your database will look in the future when it has come into use.

Creating a simulation is not a difficult task, provided you have experience in virtually any programming language. All you need to do is write a program that simulates the growth of your site over time. The output would be a sequence of SQL commands (mostly inserts, maybe updates).

How it Works

For the purposes of this example let’s assume that your website is some sort of forum. We’ll simplify things by limiting the actions that can be performed. Let’s say that a user will be able to:

  • Register
  • Post a new thread
  • Comment on an existing thread
  • Send messages to another user

Start with a small initial population of users, as would be the case after your site was first launched. Now consider the upper limit of your simulation: how many users will your site have when the simulation completes? The simulation will run until your current population size exceeds your upper limit.

Since we want to take a systematic approach to this simulation, it should occur over intervals. An interval is an arbitrary period of time over which your population increases by a certain amount; we’ll call this the growth rate. The bulk of our simulation will occur in these intervals. During an interval, each user has a chance of performing one of the actions associated with our site. You must determine the chance of each action occurring in a realistic manner, with reference to your growth rate. Say you expect your site to grow by 5% a week. What is the probability that each user will post a new thread in that time period?
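One way to turn such a probability into code is to compare it against a uniform random number. The Java sketch below assumes that approach; the class and method names are made up, and the fixed seed is only there so that repeated runs give the same numbers. With a 5 percent chance, roughly 5 percent of the simulated users should perform the action in a given interval.

```java
import java.util.Random;

class ActionChance {
    // nextDouble() is uniform in [0, 1), so this is true with probability `chance`.
    static boolean occurs(Random r, double chance) {
        return r.nextDouble() < chance;
    }

    // Counts how many of `users` simulated users perform the action in one interval.
    static int countActions(Random r, int users, double chance) {
        int performed = 0;
        for (int i = 0; i < users; i++) {
            if (occurs(r, chance)) {
                performed++;
            }
        }
        return performed;
    }

    public static void main(String[] args) {
        Random r = new Random(42); // fixed seed so runs are repeatable
        System.out.println(countActions(r, 10000, 0.05) + " of 10000 users posted a thread");
    }
}
```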

double population = 10;
double maxPopulation = 1000;
double growthRate = 1.05;

while(population <= maxPopulation)
{
     //simulation interval

    population *= growthRate;
}

In the example above, we start with a population size of 10 and we increase that population by 5 percent until it exceeds the max population of 1000. In each interval we must cycle over each user in the population, and determine if that user performs one of our actions, based on the probabilities we determined. We must also remember to generate new users based on how much our population increased. Now that we have a general frame for our simulation, we must generate our initial population and start generating data in our intervals.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

List<String> commandList = new ArrayList<String>();

List<User> users = new ArrayList<User>();
List<Thread> threads = new ArrayList<Thread>(); //each thread is assumed to contain a list of comments
List<Message> messages = new ArrayList<Message>();

double population = 10;
double maxPopulation = 1000;
double growthRate = 1.05;

double newThreadChance = .05; //for each user, there is a 5 percent chance
                              //that they will post a new thread
double newCommentChance = .30;
double newMessageChance = .05;

//generates random numbers
Random r = new Random();

//generate our initial population
for(int i = 0; i < population; i++)
{
     User u = new User("username", "email", "other info");
     commandList.add(u.toSQL());
     users.add(u);
}

//begin intervals
while(population <= maxPopulation)
{
     //for each user
     for(User u : users)
     {
          //nextDouble() is uniform in [0, 1), so this is true
          //with probability newThreadChance
          if(r.nextDouble() < newThreadChance)
          {
               Thread t = new Thread(u, "title", "content");
               commandList.add(t.toSQL());
               threads.add(t);
          }

          //a comment needs an existing thread to attach to
          if(!threads.isEmpty() && r.nextDouble() < newCommentChance)
          {
               //select a random thread
               int index = r.nextInt(threads.size());
               Thread t = threads.get(index);

               Comment c = new Comment(t, u, "content");
               t.addComment(c);

               commandList.add(c.toSQL());
          }

          if(r.nextDouble() < newMessageChance)
          {
               //select a random user
               int index = r.nextInt(users.size());
               User recipient = users.get(index);

               Message m = new Message(u, recipient, "subject", "content");
               messages.add(m);
               commandList.add(m.toSQL());
          }
     }

     //increase our population: add users until the list
     //matches the grown population
     for(int i = users.size(); i < population * growthRate; i++)
     {
          User u = new User("username", "email", "other info");
          users.add(u);

          commandList.add(u.toSQL());
     }

     population *= growthRate;
}

The above example is a completed simulation. When it is complete the list commandList will contain a complete list of all of the SQL insert commands, in order, for us to populate our database with.

There are some parts of the simulation that I left for the reader to complete on their own. The details of implementing the user, message, thread, and comment objects have been left out. Notice that each of these entities contains a toSQL method. This will simplify the process of converting your objects to SQL. Also, you will have to dump the commandList to a text file so it can be run on your database. This is just one example of how to carry out a simulation. Obviously if you choose to use a non-object oriented approach your implementation will look different.
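As a starting point for those missing pieces, here is one possible sketch of a user object with a toSQL method, plus a helper that dumps the command list to a file. The users table, its columns, the simulation.sql file name and the quote helper are all assumptions made for this example; your own schema will differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class SimulationDump {
    // Minimal stand-in for the User object used in the simulation.
    static class User {
        private final String username;
        private final String email;

        User(String username, String email) {
            this.username = username;
            this.email = email;
        }

        // Doubles single quotes so the value is a valid SQL string literal.
        private static String quote(String value) {
            return "'" + value.replace("'", "''") + "'";
        }

        // Hypothetical table and column names.
        String toSQL() {
            return "INSERT INTO users (username, email) VALUES ("
                    + quote(username) + ", " + quote(email) + ");";
        }
    }

    // Writes one SQL command per line so the file can be run on the database.
    static void dump(List<String> commands, Path file) throws IOException {
        Files.write(file, commands);
    }

    public static void main(String[] args) throws IOException {
        List<String> commandList = new ArrayList<String>();
        commandList.add(new User("alice", "alice@example.com").toSQL());
        commandList.add(new User("o'neil", "oneil@example.com").toSQL());
        dump(commandList, Paths.get("simulation.sql"));
    }
}
```

The thread, comment and message objects can follow the same pattern, each with its own toSQL method.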

Once your database is populated you can then navigate your site as if it is teeming with users. This will not only allow you to rate the performance of your site, but allow you to see what it will look like once it has gone into use.