NoSQL

Recently I’ve been working with a non-relational, graph DBMS called Neo4j. I’ve really only scratched the surface and it might just be a the temporary euphoria of working with something so new, but it feels liberating to be able to approach problems in a new way. It might also be that for the first time in a long time I’m working in Java. At this point I’ve written far more PHP code than Java, but I still feel like Java is my native language.

I think anyone who works with database driven applications should at least try some form of non-relational database, if only just to see things from another perspective.

Validating User Input

Whenever you write an application that takes user input, you must assume users fall into two categories. Users who are incompetent, meaning that they are likely to provide incorrect input, and users that are attempting to exploit the system, meaning that they are trying to access, destroy, or manipulate information that they should not be able to. Obviously there is a third category: Users who neither malicious, nor incompetent and are using the system in good faith. In the context of making a secure and robust application, however, we do not care about this third group of users.

Robust Applications

An application is robust if it is not prone to crashing or misbehaving regardless of the input it is given. If an application is given improper input it should respond by informing the user of their mistake. This means that the programmer must determine the nature of a user’s input, before using it. If your program is expecting a number as input, it should not proceed if that input is a string. Furthermore, if only a certain range of numbers (ie 1 through 10) are valid, then the program should not proceed if given a number outside of that range.

Regular expressions can further aid in validating data. If you are expecting for a user to submit an email address, simply verifying that the input is a string is not sufficient. You want to ensure that the input meets certain criteria: It should consist of at least 1 character followed by the @ symbol followed by a domain name, the . symbol and finally, a TLD. A regular expression to accomplish this would be:

/^([a-zA-Z0-9])+([a-zA-Z0-9\.\\+=_-])*@([a-zA-Z0-9_-])+([a-zA-Z0-9\._-]+)+$/

Certainly there are more comprehensive ways to validate an email address, but it is important not to get carried away when validating data. The main purpose is not only to ensure that incorrect input is dealt with, but also that correct input is passed on without incident.

Secure Applications

In most cases incorrect input is harmless. It might cause the application to behave poorly or even crash, but generally restarting the programming or submitting a form over again will fix the problem. Some input, however, is intended to exploit a vulnerability in the system. An SQL injection is a common example of this type of input. Ensuring that input conforms to any applicable constraints is a first wave of defence against this type of attack. In addition to this, however, it is important to either escape or exclude certain characters from user input. Quotes for example should be be properly escaped.

Examples

So far I’ve talked about robust and secure applications in general. Now I will give examples of how to secure your application in PHP. Before I give code examples I would like to outline a few practices that will prevent SQL injections, as well as good faith mistakes:

  • Email addresses should be validated with a regular expression.
  • Number values, such as dates should be validated as integers.
  • User names should be constrained to a subset of characters (ie A-Z, a-z, 0-9 and _) and validated with a regular expression.
  • Passwords should be encrypted (i.e. sha1) before submitting to the database
  • All data should have quotes escaped

Here are a few PHP functions to validate user input:

<?php
/**
Ensures that the input is number, and if specified, lies
between the values min and max. For example, if you want to
validate that an input is a valid day of the month call
validateNumeric($input , $min = 0 , $max = 31);
**/
function validateNumeric($value , $min = 'none' , $max = 'none')
{
	if(!is_numeric($value))
		return false;
	if(is_numeric($min) && $min > $value)
		return false;
	if(is_numeric($max) && $max < $value)
		return false;
	return true;
}
/**
Ensures that the given email address is correctly
formatted
**/
function validateEmailAddress($address)
{
	if(!preg_match( "/^([a-zA-Z0-9])+([a-zA-Z0-9\.\\+=_-])*@([a-zA-Z0-9_-])+([a-zA-Z0-9\._-]+)+$/", $address))
	{
		return false;
	}
	return true;
}
/**
Ensures that the given username contains only
letters and numbers and is longer than the
give minimum length.
**/
function validateUsername($user , $minLength)
{
	if(preg_match('/[^A-Za-z0-9]/i', $user) > 0 || strlen($user) < $minLength)
	{
		return false;
	}
	return true;
}
/**
Escapes the given string. It is best to use whatever
real_escape_string method that PHP supplies for your
particular database. I use MySQL here as a default.
If no real_escape_string method exists, the addslashes
function is used.
**/
function escapeString($value)
{
	if(function_exists('mysql_real_escape_string'))
		return @mysql_real_escape_string($value);
	else
		return addslashes($value);
}
/**
Encrypts the given password using sha1 (twice).
Also supports the use of a salt, which is recommended.
**/
function encryptPassword($password , $salt = '')
{
	$hash = sha1( $salt . sha1($password) );
	return $hash;
}
?>

PDO… Use It

The majority of PHP code I see, whether it be posted by a beginner on a forum, or built into a large application, seems to suggest that the current standard when accessing a database is to make use of whatever database specific commands PHP provides for your particular database (ie mysql_query, mssql_query, etc). Often times, calls to these function are coupled with various attempts to prevent sql injections (such as calls to mysql_real_escape_string). Altogether, however, to me it seems clumsy and insecure.

PHP5 provies a built in (partial) database abstraction layer that can not only simplify the process of querying your database not only protecting against sql injections, but optimize your queries and make your application more portable: PDO (PHP Data Objects). A typical database transaction with PHP might look something like this:

mysql_connect('localhost', 'user', 'password');
mysql_select_db('myDB');
$data = mysql_real_escape_string($_POST['data']);
$query = 'SELECT column FROM table WHERE data = \'' . $data . '\'';
$result = mysql_query($query);
while($row = mysql_fetch_array($result, FETCH_NUM))
{
     echo $row[0];
}

Now with PDO

$dsn = 'mysql:dbname=myDB;host=127.0.0.1';
try {
     $db = new PDO($dsn , 'user' , 'password');
}
catch(PDOException $e) {
     echo $e->getMessage();
}
$query = 'SELECT column FROM table WHERE data = ?';
$statement = $db->prepare($query);
$statement->bindParam(1 , $_POST['data']);
$statement->execute();
$rows = $statement->fetchAll(PDO::FETCH_NUM);
foreach($rows as $row)
{
     echo $row[0];
}

Outwardly, the second example might not seem like an improvement on the first. What is important, however, is what is going on underneath. From PHP.net:

Calling PDO::prepare() and PDOStatement::execute() for statements that will be issued multiple times with different parameter values optimizes the performance of your application by allowing the driver to negotiate client and/or server side caching of the query plan and meta information, and helps to prevent SQL injection attacks by eliminating the need to manually quote the parameters.

In addition to this, it simplifies the process of making your application compatible with another database. If you were to use database specific function throughout your application, you would need to change each to allow it to work with another database.

PDO, however, is not a complete database abstraction layer. While it does aid you in making your application portable, it will not emulate features of a particular database. If your queries are incompatible with your database PDO will not help you. Rarely, however, will it be the case that your queries are not compatible with most sql databases.

There are actually a few cases where where PDO will not be able to execute your queries (example: http://dev.mysql.com/doc/refman/5.0/en/c-api-prepared-statement-problems.html) but these cases are few and far between. Really it is rarely the case that you should NOT be using PDO.

Using PDO does not guarantee that user data is safe. You should always validate and sanitize user input, but PDO provides a decent extra line of defense.

http://us.php.net/pdo

Database Analysis Through Simulation

Making adjustments to a database schema after it has gone into use is a daunting task. Whether it be because of efficiency issues or the incorporation of a new feature, this is a situation you should avoid at all costs. Often times, however, mistakes and inefficiencies are difficult to spot at implementation time. Only when your database has become populated, often by users who are counting on your applications to be reliable, do these things come to light. So what can you do? One solution is to run a simulation. By this I mean systematically project how your database will look in the future when it has come into use.

Creating a simulation is not a difficult task, provided you have experience in virtually any programming language. All you need to do is write a program that simulates the growth of your site over time. The output would be a sequence of SQL commands (mostly inserts, maybe updates).

How it Works

For the purposes of this example lets assume that your website is some sort of forum. We’ll simplify things by limiting the actions that can be performed. Lets say that a user will be able to:

  • Register
  • Post a new thread
  • Comment on an existing thread
  • Send messages to another user

Start with a small initial population of users, as would be the case after your site was first launched. Now consider the upper limit of your simulation; how many users will your site have when the simulation completes. The simulation will run until your current population size exceeds your upper limit.

Since we want to take a systematic approach to this simulation, it should occur over intervals. An interval is an arbitrary period of time, over which, your population increases by a certain amount, we’ll call this the growth rate. The bulk of our simulation will occur in these intervals. During an interval, each user has a chance of performing one of the actions associated with our site. You must determine the chance of each action occuring in a realistic manner, which reference to your growth rate. Say you expect your site to grow by 5% a week. What is the probability that each user will post a new thread in that time period?

double population = 10.000;
double maxPopulation = 1000;
double growthRate = 1.05;
while(population <= maxPopulation)
{
     //simulation interval
    population *= growthRate;
}

In the example above, we start with a population size of 10 and we increase that population by 5 percent until it exceeds the max population of 1000. In each interval we must cycle over each user in the population, and determine if that user performs one of our actions, based on the probabilites we determined. We must also remember to generate new users based on how much our population increased. Now that we have a general frame for our simulation, we must generate our initial population and start generating data in our intervals.

List commandList = empty list;
List users = empty list;
List threads = empty list; //each thread is assumed to contain a list of comments
List messages = empty list;
double population = 10.000;
double maxPopulation = 1000;
double growthRate = 1.05;
double newThreadChance = .05; //for each user, there is a 2 percent
                                            //that they will post a new thread
double newCommentChance = .30;
double newMessageChance = .05;
//generates random numbers
Random r = new Random();
//generate our initial population
for(int i = 0; i < population; i++)
{
     User u = new User("username", "email", "other info");
     commandList.add(u.toSQL());
     users.add(u);
}
//begin intervals
while(population <= maxPopulation)
{
    //for each user
    for(User u : users)
    {
          if(newThreadChance <= r.nextDouble())
          {
               Thread t = new Thread(u, "title", "content");
               commandList.add(t.toSQL());
               threads.add(t);
          }
          if(newCommentChance <= r.nextDouble())
          {
               //select a random thread
               int index = r.nextInt(threads.size());
               Thread t = threads.get(index);
               Comment c = new Comment(t, u, "content");
               t.addComment(c);
               threads.set(index, t);
               commandList.add(c.toSQL());
          }
          if(newMessageChance <= r.nextDouble())
          {
               //select a random user
               int index = r.nextInt(users.size());
               User recipient = users.get(index);
               Message m = new Message(u, recipient, "subject", "content");
               messages.add(m);
               commandList.add(m.toSQL());
          }
     }
     //increase our population
     for(int i = population; i < population * growthRate; i++)
     {
           User u = new User("username", "email", "other info");
           users.add(u);
           commandList.add(u.toSQL());
      }
     population *= growthRate;
}

The above example is a completed simulation. When it is complete the list commandList will contain a complete list of all of the SQL insert commands, in order, for us to population our database with.

There are some parts of the simulation that I left for the reader to complete on their own. The details of implementing the user, message, thread, and comment objects have been left out. Notice that each of these entities contains a toSQL method. This will simplify the process of converting your objects to SQL. Also, you will have to dump the commandList to a text file so it can be run on your database. This is just one example of how to carry out a simulation. Obviously if you choose to use a non-object oriented approach your implementation will look different.

Once your database is populated you can then navigate your site as if it is teaming with users. This will not only allow you to rate the performance of your site, but allow you to see what it will look like once it has gone into use.