Archive for the ‘Programming’ Category.

Bubble Sort is Never the Answer

It is not too often in the real world that you have to implement your own sort. Generally, whatever language you are using has a library with this functionality built in. If the occasion does arise, however, it is important to understand which algorithms are applicable in which situations. As with most choices, there is no absolute correct answer; there are many trade offs to consider. When choosing an algorithm there are three things you should consider: performance, overhead, and ease of implementation.

You should give equal consideration to each of these factors, disregarding any one of them can lead to poor choices. It is common, for instance, for people to ignore the ease of implementation and focus on the performance of the algorithm. The problem with this is that not every operation is critical. No one is going to die if they songs on their play list do not get sorted quickly enough. Programmer time is more expensive than run time as a professor of mine often said. In addition to this, some high performance algorithms can slower than simpler algorithms due to overhead. If you are sorting 100 items, you can probably insertion sort them just as fast or faster than you can heap sort them. The same would not be true with one million items; heap sort would be faster.

Once we consider all of the factors, you should find that no one algorithm is ideal in every case. There are some algorithms, however, that are not ideal in any case. Unfortunately one of these algorithms is among the most popular: bubble sort. Bubble sort is a very simple algorithm to implement and it has little overhead. The problem lies in its performance. You might think that this conflicts with my earlier point that even simple, low performance algorithms can be faster than others in the right situation. You also might think that bubble sort, being easy to implement, makes up for its performance short comings. This would be true, if it were not the case that there are algorithms that are equally simple to implement, require just as little overhead, and perform better in practice.

Insertion sort is one such algorithm. Like bubble sort it is an in place sort, and is just as easy to implement. Both algorithms have the same time complexity (O notation), but in practice insertion sort performs better in most cases. This being the case you may wonder why bubble sort is even around. Certainly if it is obsolete pages regarding its implementation should be torn out of books and mentioning it should be punishable by a swift slap with a keyboard. Maybe not. When I learned bubble sort it was as an example of how not to sort. In my non-expert opinion, it is equally important to understand how NOT to do things as it is important to understand how TO do them. My point? Learn bubble sort, but never use it.

Know What to Expect from your Programming Language

I often see people asking how to do things with a given programming language that it was not intended to do. Recently I read a post from someone who wanted to know how to take a java program and compile it to a .exe. For anyone who is not aware, Java programs are not compiled in the same way a C++ program is compiled. The java source code is first compiled to bytecode. That bytecode is then interpreted by the java virtual machine. The writer was intending to get a performance boost by having the code compiled rather than interpreted.

While it is true that a compiled language can be faster than a interpreted language, it is not the case that every compiler can out perform every interpreter. This is especially true if you are compiling code that was intended to be interpreted. There are Java compilers out there, but the Java interpreter is much more mature than any of them. In addition to the non-existent performance increase, by compiling this code you eliminate one of Java’s key features: portability. If you bypass the JVM then you will have to worry about which systems your code will run on and which they will not.

Ultimately, if you need code that is very fast the answer is not to take code in one language and tweak it into something it was never meant to be. This echos another problem: language dependence. Too often I see people who learn everything there is to know about one programming language, and never bother to learn another. Being a programmer does not mean being a C++ programming, or a Java programmer, or a PHP programmer, it means understanding the concepts of programming and having that understanding transcend multiple languages. This will remain to be true until someone comes up with a catch-all language that is ideal in every case. Until then if you need really fast code think about C++, if you want safe portable code think about Java. You should also be aware that these trade-offs are not absolute. Not every Java program is slower than an equivalent C++ program, and a poorly written C++ program is certainly slower than a well written Java program.

The example I mentioned above is only one of many. I’ve seen people that want to write desktop applications in PHP, write .NET apps that work without the .NET framework, and use javascript in an offline application. Though someone might have hacked something together that facilitates this, be aware that in most cases these implementations are not ideal. If you need to do something that the language you are using was not intended to do then that is a sign you need to branch out and become a programmer.

The Programmer’s Cardinal Sin

If you are a programmer you should do yourself, and anyone else working with your code, a favor: stop using copy and paste. If there is a case where you need to use the exact same, or very similar, code in multiple places, that is a sign that you should be using a function, object, or other structure. I say this not for the sake of ‘proper coding practices’, but to save you and anyone else dealing with your code a massive headache.

I admit that there have been cases where I have copied code (I was young! I didn’t know what I was doing). From experience, I can tell you that I have almost always regretted it. For one, it is very annoying to have to make changes in multiple parts of your program. For another, when you copy code, you also copy any errors in that code. You might be inclined to think that you’ve been careful or you are certain that this code doesn’t contain any errors, but in retrospect, I think at least half of the time I’ve done this even when it is a fairly simple program, I’ve had to go back and fix bugs in multiple places.

This headache is even worse for someone who is working with your code. When making changes or updates that person may think that they’ve done what they intended when in fact they’ve only addressed half or less of the problem. The same goes for errors: it can be extremely frustrating to think that you’ve fixed a problem and then have it present itself again because you are unaware that the original coder used copy and paste.

If you have the means, I advize you to attach USB electrodes (you can buy them at target) to yourself and have yourself zapped everytime you press ctrl+c.

Over Optimizing

The word optimal has become a buzz word that gets tossed around too easily. Something a professor of mine pointed out a few weeks ago: By definition, optimal is something that isn’t realistic to achieve in the context of programming or software engineering. If something is optimal, there is no room for improvement, which is never the case. Some people may be irritated by the misuse of this word, but I think what is worse is the practice of optimizing.

There is a fine line between making meaningful improvements to the performance of your program,  and wasting coding time. If you have some data that needs to be searched through, a good use of your time would be deciding on a data structure to store your data, and an algorithm to search through it. A bad use of your time would be to go through all of your loops and change i++ to ++i. For those of you who are not aware of the difference, ++i is slightly faster than i++. The difference, however, is so tiny that you should never spend any amount of time changing one to the other. The same thing goes for print and echo. In PHP echo is slightly faster than print, but in a real world application it would be difficult to measure the difference.

I like to call these practices trivial over optimizations. There is a non-trivial kind of over optimization that I think is an even bigger time waster. Earlier I mentioned that a good use of your time would be deciding which data structure to use to store data that you plan to search through. This being the case you might want to spend some time comparing the performance differences between a binary search tree and a priority queue, correct? Maybe. If this program is being written to control someone’s pacemaker then the answer is certainly yes. If you plan to sort someone’s play list with this program then the answer is probably no.

There is a third kind of optimization that is not an optimization at all: If you are searching through ten items, a linear search is faster than a binary search. This is because of the over-head in constructing the binary search tree. Optimization is a practice that should belong to programs that are dealing with a large amount of data, or are time-critical. If you are concerned with the performance of your program the solution is not to optimize the code you have, the solution is to learn to write code in such a way that it doesn’t need to be optimized. This is something that comes with experience and attention to detail.

Inheriting Code

If you’ve been a programmer for any amount of time you’ve more than likely had the honor of inheriting someone else’s code. This might be in a corporate scenario, or you might just be modifying an open source program. Either way you’re experiences though varied are probably marked with few comments, poor syntax, and obscure methods. This might be something you can tolerate, but personally, I am a code perfectionist. I find it difficult to code in new functionality or modify existing functionality without reworking the code to suit my tastes. This can easily go from moving a few braces around to doing a full overhaul.

Unfortunately I can’t offer much advice to those who are stuck in a situation where they are digging through terrible code; honestly all you can do is be patient and buy a stress ball. I can, however, offer some tips on good coding practice that might prevent you from ruining someone’s day.

Tabs and Braces and Spaces

Also known as whitespace. There are few things more frustrating than staring at code that isn’t organized. The number one thing that deters me from helping someone with a programming question is looking at their code and seeing things like this:

$x = 0;
while($x < 10)
{
if($x==3){echo 'This is terrible code';}
else
{
echo 'hello';
}
x++
}

To begin with, the only time braces should not be on a line of their own is when they are preceded, or followed by a conditional statement:

//OK... this method id my personal preference
if($x == 1)
{
	echo 'hello';
}

//OK.. probably the most common
if($x == 1) {
	echo 'hello';
	/*
	.
	.
	.
	*/
}
else {
	echo 'goodbye';
	/*
	.
	.
	.
	*/
}

//OK.. I'm not a fan but its acceptable
if($x == 1) {
	echo 'hello';
	/*
	.
	.
	.
	*/
} else
	echo 'goodbye';
	/*
	.
	.
	.
	*/
}

//NOT OK!
if($x == 1) { echo 'hello'; /* . . . */ }

Moving on, code nested within braces or within the body of a conditional or other control flow statement should be indented:

//OK
for($i = 0; $i < 10; $i++)
{
	echo "hello \n";
	echo $i;
}

/**
* OK,
* Some people don't like this method, but as long
* as it is only a single line its fine with me. Make
* sure you leave an empty space following the
* final line.
**/
if($x == 1)
	echo '$x = 1';
else
	echo '$x != 1';

//NOT OK
while($x < 10)
{
echo $x;
$x++;
}

Now we come to spaces. Operators and their operands should always be separated by a space. UNIX shell scripters can make an exception to this of course, but otherwise you should space things out:

//All of these are OK
$x = 10;
$y = 5;
$z = $x + $y
$z *= 2;

//These are not OK
$x=10;
$y=5;
$z=$x+$y++;
$z*=($z+$x)--; //I can't even tell you with this is equal to

Duplicate Functionality

The only thing worse than editing your terrible code, is doing it twice. If you find yourself using copy and paste, or otherwise coding the same functionality multiple times, for the sake of anyone reading your code please stop. Not only is this going to translate into a bad experience when it comes time to update your code, but its also likely that whatever mistakes you made the first time you wrote the code were duplicated.

Yesterday I installed a WordPress plugin. The plugin performed well, and for the most part I was happy with it. Of course, however, there were modifications I needed to make (including as it turns out rewriting the code to make it conform to the XHTML standard). Once I opened the file to make the changed I was horrified to discover that there were literally zero functions or objects. If something needed to be done twice, the code was copied. Unfortunately the writer of the plugin forgot to close a div, which is an honest mistake. What isn’t an honest mistake, and should be punishable by death by CRT monitor thrown at you, is taking that code, and copying it three times (error included) to perform the exact same task.

Manageability

That same file I just mentioned contained over 5000 lines of code. I don’t care how large of a project it is, your code should never exceed 1000 lines. Of course, in this case, the programmer decided not to use any functions so there was no logical way to break up the file. If you, on the other hand, are not out of your mind, any significant amount of code you create will contain many functions. Most likely those functions can be categorized and placed in separate files. If you want to get fancy you can even use some objects. The important thing is that the code you write isn’t in blob form.

The same concept applies to individual lines of code. Once again refer to the plugin I had the displeasure of modifying, the longest line in that file was 549 characters long. If I can’t see an entire line of code with scrolling on a widescreen monitor then there is a problem. Really, even using a fullscreen monitor you should never have to use the horizontal scroll. There is nothing wrong with using the return key in the middle of a line of code.

//NOT OK
$animalCount = array('cat' => 5, 'dog' => 2, 'bird' => 7, 'mouse' => 3, 'badger' => 0, /* we don't need no stinking badgers */ 'chicken' => 4);

//OK
$animalCount = array(
			'cat' => 5,
			'dog' => 2,
			'bird' => 7,
			'mouse' => 3,
			'badger' => 0, //we don't need no stinking badgers
			'chicken' => 4	);

Comments

The final portion of my rant regards commenting your code. This is something we’ve all been guilty of at some point. I don’t care how well written your code is, most likely I can’t tell what it does at first glance. If you define a function, write a few words about what that function does. If you do something that seems out of the ordinary, at the very least write down that you did that intentionally.

I wrote some code a few days ago and it contained a switch statement. I intentionally did not use a break in one of the cases because the following case needed to be performed as well. To someone glancing at the code, however, this would seem like a bug, they would add a break statement and introduce a new bug when they thought they were fixing a bug.

Use comments!

Strings Are Arrays

I’ve seen a lot of people asking questions about how to find a certain character or sequence inside a string. It is common for people to turn to the library in order to find a function that does this for them, but it is likely the case that the answer is right in front of you. If you need to search thorough a string for whatever reason you can index it as an array. This is the case in a lot of language (PHP, C++, Java).

In a non-type-safe language like PHP people often forget the distinction between primitive types and other constructs. Primitives are often things like int, float, double, long, and char. Strings are generally not a primitive type but an array of chars. Most languages tend to hide this fact from you, because strings are so common it is often more convenient for the programmer to deal with them as if they were not an array of characters.  For example, Java does not support operator overloading; you can only use arithmetic operators (+, -, *) on primitive types, with the exception of strings, which can be appended using the + operator.

So how can you use this fact to your advantage? There are many cases where indexing strings rather than passing them to some function can serve your purpose more efficiently. Here are some examples in PHP:

Ex.
You have several lines of text and you want to count the instances of a particular word. One way to do this would be to use explode to turn the text into an array of words and then count the instances of that word:

$text = 'cat dog chicken cat bird mouse cat lizard';
$words = explode(' ' , $text);
$count = 0;
foreach($words as $word)
{
     if($word == 'cat')
          $count++;
}
echo $count; //should output 3

One function call and one loop, pretty simple right? Maybe not. It is important to remember that function calls may a lot more than they appear. In order to split the string into an array of words, explode must search through the string for a space. If all you need to do is count the number of instances of a word, there is no need to waste time constructing an array of words and then looping through that array. Here is a more efficient way:

$curr = '';
$count = 0;
$searchStr = 'cat';
$text = 'cat dog chicken cat bird mouse cat lizard';
$len = strlen($text);

for($i = 0; $i < $len; $i++)
{
     if($text[$i] == ' ')
     {
          if($curr == $searchStr)
               $count++;

          $curr = '';
     }
     else
          $curr .= $text[$i];
}

The above example contains several more lines of code, but it is important to remember that the number of lines in a program should not be used to measure the performance of a program.

Validating User Input

Whenever you write an application that takes user input, you must assume users fall into two categories. Users who are incompetent, meaning that they are likely to provide incorrect input, and users that are attempting to exploit the system, meaning that they are trying to access, destroy, or manipulate information that they should not be able to. Obviously there is a third category: Users who neither malicious, nor incompetent and are using the system in good faith. In the context of making a secure and robust application, however, we do not care about this third group of users.

Robust Applications

An application is robust if it is not prone to crashing or misbehaving regardless of the input it is given. If an application is given improper input it should respond by informing the user of their mistake. This means that the programmer must determine the nature of a user’s input, before using it. If your program is expecting a number as input, it should not proceed if that input is a string. Furthermore, if only a certain range of numbers (ie 1 through 10) are valid, then the program should not proceed if given a number outside of that range.

Regular expressions can further aid in validating data. If you are expecting for a user to submit an email address, simply verifying that the input is a string is not sufficient. You want to ensure that the input meets certain criteria: It should consist of at least 1 character followed by the @ symbol followed by a domain name, the . symbol and finally, a TLD. A regular expression to accomplish this would be:

/^([a-zA-Z0-9])+([a-zA-Z0-9\.\\+=_-])*@([a-zA-Z0-9_-])+([a-zA-Z0-9\._-]+)+$/

Certainly there are more comprehensive ways to validate an email address, but it is important not to get carried away when validating data. The main purpose is not only to ensure that incorrect input is dealt with, but also that correct input is passed on without incident.

Secure Applications

In most cases incorrect input is harmless. It might cause the application to behave poorly or even crash, but generally restarting the programming or submitting a form over again will fix the problem. Some input, however, is intended to exploit a vulnerability in the system. An SQL injection is a common example of this type of input. Ensuring that input conforms to any applicable constraints is a first wave of defence against this type of attack. In addition to this, however, it is important to either escape or exclude certain characters from user input. Quotes for example should be be properly escaped.

Examples

So far I’ve talked about robust and secure applications in general. Now I will give examples of how to secure your application in PHP. Before I give code examples I would like to outline a few practices that will prevent SQL injections, as well as good faith mistakes:

  • Email addresses should be validated with a regular expression.
  • Number values, such as dates should be validated as integers.
  • User names should be constrained to a subset of characters (ie A-Z, a-z, 0-9 and _) and validated with a regular expression.
  • Passwords should be encrypted (i.e. sha1) before submitting to the database
  • All data should have quotes escaped

Here are a few PHP functions to validate user input:

<?php

/**
Ensures that the input is number, and if specified, lies
between the values min and max. For example, if you want to
validate that an input is a valid day of the month call
validateNumeric($input , $min = 0 , $max = 31);
**/
function validateNumeric($value , $min = 'none' , $max = 'none')
{
	if(!is_numeric($value))
		return false;

	if(is_numeric($min) && $min > $value)
		return false;

	if(is_numeric($max) && $max < $value)
		return false;

	return true;
}

/**
Ensures that the given email address is correctly
formatted
**/
function validateEmailAddress($address)
{
	if(!preg_match( "/^([a-zA-Z0-9])+([a-zA-Z0-9\.\\+=_-])*@([a-zA-Z0-9_-])+([a-zA-Z0-9\._-]+)+$/", $address))
	{
		return false;
	}
	return true;
}

/**
Ensures that the given username contains only
letters and numbers and is longer than the
give minimum length.
**/
function validateUsername($user , $minLength)
{
	if(eregi('[^A-Za-z0-9]', $u) > 0 || strlen($u) < $minLength)
	{
		return false;
	}

	return true;
}

/**
Escapes the given string. It is best to use whatever
real_escape_string method that PHP supplies for your
particular database. I use MySQL here as a default.
If no real_escape_string method exists, the addslashes
function is used.
**/
function escapeString($value)
{
	if(function_exists('mysql_real_escape_string'))
		return @mysql_real_escape_string($value);
	else
		return addslashes($value);
}

/**
Encrypts the given password using sha1 (twice).
Also supports the use of a salt, which is recommended.
**/
function encryptPassword($password , $salt = '')
{
	$hash = sha1( $salt . sha1($password) );
	return $hash;
}

?>

Getting started with C++

I am by no means a C++ expert, but I did for, for a summer, teach introduction to programming in C++ (I would have preferred teaching in Java, but the job called for C++). This being the case, I’m aware of many of the issues that people who are just picking up the language have. I’ll address a few of the common questions and issues here. Continue reading ‘Getting started with C++’ »

Will Code For Food

A few months ago I quit my day job in favor of dedicating more of my time to school. Seeing as how after three months my cash reserves are all but depleted I’ve decided to offer my services to anyone in need (PHP, Java, SQL etc).

Check out the For Hire page for more info.

Semicolons

You know you’ve spent too much time programming when you start ending your sentences with a semicolon;