Archive for March 2009

Database Analysis Through Simulation

Making adjustments to a database schema after it has gone into use is a daunting task. Whether it be because of efficiency issues or the incorporation of a new feature, this is a situation you should avoid at all costs. Often times, however, mistakes and inefficiencies are difficult to spot at implementation time. Only when your database has become populated, often by users who are counting on your applications to be reliable, do these things come to light. So what can you do? One solution is to run a simulation. By this I mean systematically project how your database will look in the future when it has come into use.

Creating a simulation is not a difficult task, provided you have experience in virtually any programming language. All you need to do is write a program that simulates the growth of your site over time. The output would be a sequence of SQL commands (mostly inserts, maybe updates).

How it Works

For the purposes of this example lets assume that your website is some sort of forum. We’ll simplify things by limiting the actions that can be performed. Lets say that a user will be able to:

  • Register
  • Post a new thread
  • Comment on an existing thread
  • Send messages to another user

Start with a small initial population of users, as would be the case after your site was first launched. Now consider the upper limit of your simulation; how many users will your site have when the simulation completes. The simulation will run until your current population size exceeds your upper limit.

Since we want to take a systematic approach to this simulation, it should occur over intervals. An interval is an arbitrary period of time, over which, your population increases by a certain amount, we’ll call this the growth rate. The bulk of our simulation will occur in these intervals. During an interval, each user has a chance of performing one of the actions associated with our site. You must determine the chance of each action occuring in a realistic manner, which reference to your growth rate. Say you expect your site to grow by 5% a week. What is the probability that each user will post a new thread in that time period?

double population = 10.000;
double maxPopulation = 1000;
double growthRate = 1.05;

while(population <= maxPopulation)
{
     //simulation interval

    population *= growthRate;
}

In the example above, we start with a population size of 10 and we increase that population by 5 percent until it exceeds the max population of 1000. In each interval we must cycle over each user in the population, and determine if that user performs one of our actions, based on the probabilites we determined. We must also remember to generate new users based on how much our population increased. Now that we have a general frame for our simulation, we must generate our initial population and start generating data in our intervals.

List commandList = empty list;

List users = empty list;
List threads = empty list; //each thread is assumed to contain a list of comments
List messages = empty list;

double population = 10.000;
double maxPopulation = 1000;
double growthRate = 1.05;

double newThreadChance = .05; //for each user, there is a 2 percent
                                            //that they will post a new thread
double newCommentChance = .30;
double newMessageChance = .05;

//generates random numbers
Random r = new Random();

//generate our initial population
for(int i = 0; i < population; i++)
{
     User u = new User("username", "email", "other info");
     commandList.add(u.toSQL());
     users.add(u);
}

//begin intervals
while(population <= maxPopulation)
{
    //for each user
    for(User u : users)
    {
          if(newThreadChance <= r.nextDouble())
          {
               Thread t = new Thread(u, "title", "content");
               commandList.add(t.toSQL());
               threads.add(t);
          }

          if(newCommentChance <= r.nextDouble())
          {
               //select a random thread
               int index = r.nextInt(threads.size());
               Thread t = threads.get(index);

               Comment c = new Comment(t, u, "content");
               t.addComment(c);
               threads.set(index, t);

               commandList.add(c.toSQL());
          }

          if(newMessageChance <= r.nextDouble())
          {
               //select a random user
               int index = r.nextInt(users.size());
               User recipient = users.get(index);

               Message m = new Message(u, recipient, "subject", "content");
               messages.add(m);
               commandList.add(m.toSQL());
          }
     }

     //increase our population
     for(int i = population; i < population * growthRate; i++)
     {
           User u = new User("username", "email", "other info");
           users.add(u);

           commandList.add(u.toSQL());
      }

     population *= growthRate;
}

The above example is a completed simulation. When it is complete the list commandList will contain a complete list of all of the SQL insert commands, in order, for us to population our database with.

There are some parts of the simulation that I left for the reader to complete on their own. The details of implementing the user, message, thread, and comment objects have been left out. Notice that each of these entities contains a toSQL method. This will simplify the process of converting your objects to SQL. Also, you will have to dump the commandList to a text file so it can be run on your database. This is just one example of how to carry out a simulation. Obviously if you choose to use a non-object oriented approach your implementation will look different.

Once your database is populated you can then navigate your site as if it is teaming with users. This will not only allow you to rate the performance of your site, but allow you to see what it will look like once it has gone into use.

I’m Not A Crazy Person…

Some people dream of buying a house so they can raise a family, buy furniture, decor and electronics, or just so they have a place they can call their own. When I think of buying a house only one thing comes to mind:

Lawn Gnome Army.

I dream of one day having more lawn gnomes than any one person should own and arranging them in my front yard. At night while the neighbors sleep I will rearrange them so that the seem to be moving on their own. If some neighbor complains of my hobby I will arrange my gnomes into ranks, directed at their yard, and threaten them with invasion.

This is my life’s aspiration.

Genetic Algorithm Problems

For anyone who is not familiar with Genetic Algorithms I’ll begin with a short summery of what they are and how they are used.

The Algorithm

A genetic algorithm (GA) can be thought of as a search. Given some initial state, the algorithm is searching for an optimal state. It does this in a way that mimics nature (hence the name). Say you have a population of a certain species. The first generation of these creatures may not be optimally suited for their environment. Over time the individuals who are less suited die off while those that are well suited reproduce and dominate the others. In addition to reproduction between well suited individuals (cross-overs in the context of a GA) the offspring of those individuals experience mutation, meaning that the child of individuals A and B is not purely a cross between the two, but has its own unique traits. Generally mutation occurs at a low probability.

In the context of programming, a GA can be expressed as a function that takes as input a population and a fitness function. The population is a collection of individuals and the fitness function is a means of determining how fit an individual is. At generation zero the population is usually randomly generated. In order to get from generation zero to generation 1, the algorithm uses the fitness function to determine which individuals to include in the cross-over (reproduce), leaving out the rest. The children of those individuals are then passed to a mutate function, that alters them in some way, usually at a very low probability. Here’s some pseudo code that might help to understand how this might be implemented:

GeneticAlgorithm(Population pop, FitnessFunction fn) returns optimal state
{
     while(solution not found)
     {
          Population nextGeneration = empty list;
          for(int i; i < size(pop); i++)
          {
               Individual x = randomSelection(pop, fn);
               Individual y = randomSelection(pop, fn);
               child = reproduce(x, y);

               if(small probability)
                    child = mutate(child);

               nextGeneration.add(child);
          }
     }
     return the most fit individual;
}

How They Are Used

There are a variety of problems that can be solved with genetic algorithms. GAs are adept for optimization problems in particular. K-SAT problems for example can be solved with a genetic algorithm (though other means exist).

For anyone not familiar with K-SAT problems I’ll give a short explanation. SAT (or satisfaction) problems attempt to assign values to a boolean formula in such a way that it evaluates to true. So if my SAT problem consists of two variables: A and B and one clause: A OR B then one solution would be A = true, B = true. Clauses are the components of the boolean formula, in the example I gave the formula consists of only one clause. A larger SAT problem may consist of hundreds of variables and thousands of clauses and cannot be solved on paper in a reasonable amount of time. Here is an example of a larger sat problem in Conjunctive Normal Form (CNF):

(A OR B OR C) AND (A OR !B OR !C) AND (!A OR B OR !C)

This formula consists of three variables (A, B, C) and three clauses. A solution to this problem would be A = true, B = true, C = false. Notice that there are many different assignments of these variables that satisfy the formula. If there were more clauses this might not be the case.

To solve a SAT problem with a genetic algorithm you start of with a population of randomly generated “solutions”, each solution consisting of a random assignment of true of false to each variable. This population is generation zero. In this context the fitness function is defined as the number of satisfied (or unsatisfied) clauses in the boolean formula. Using the fitness function, for each individual in generation zero, a fitness value is determined. It might be the case that one of these individuals satisfies the formula, in which case you’re done. Otherwise, in order to get from generation zero to generation one, we must choose a portion of the population to “reproduce”, for example, those having a fitness above the average.

Once we’ve made our selection we perform the cross over by producing a new individual with a portion of its assignments coming from each parent (the size of the portion may be determined randomly). For example, for individuals X and Y and X(A,B,C) = {True , False, False} and Y(A,B,C) = {False, True, True} a possible child would be Child(A,B,C) = {False, False, False}.

After we’ve generated a new population we then randomly mutate each individual at a very low probability. At probabilities above 5% in many cases a solution will not be found in a reasonable amount of time. A mutation takes an assignment and flips it. So for the individual X(A,B,C) = {True, False, False} if a mutation event occurs on the variable B, it will become X(A,B,C) {True, True, False}. Without this mutation the algorithm does not approach a solution.

At Generation zero for a large problem, there is very little chance of a solution existing. After each passing generation, however, the average fitness increases and it becomes likely that an individual satisfies the formula.

Problems with Genetic Algorithms

After each generation the individuals of a population begin to approach the solution. In the context of a SAT problem this means they satisfy more and more clauses. There is, however, no guarantee that they will ever satisfy all of them. This is because individuals that have a fitness near the maximum, may actually be very different from the solution. For example, say a SAT problem has the solution 000011000 where each character in the bitstring represents a variable and the 0s represent false, and the 1s represent true. The string 111100111 might satisfy 90% of the clauses. If this is the case, the children produced by this individual will look similar to it and the likelihood of it being mutated into the solution is essentially zero. The following graph illustrates this problem:

Local Max Problem

Local Max Problem

From the graph you can see that there are two peaks, one reaching 100, the other 75. The higher one represents the solution to the problem, while the other is called a local maximum. A genetic algorithm may reach the peak of a local maximum and become stuck because all similar solutions have a lower fitness, while the actual solution is unsimilar to the current state.

Possible Solutions

A possible way to fix this problem would be to reset the search. Generated a new set of random solutions as the algorithm did at generation zero and proceed from there. This is called a random-reset. Hopefully after the reset the search will approach the solution rather than a local max. Another similar solution would be to mutate each individual in the current population at a much higher rate, possibly 100%. This would produce a population that very different from the one that existed at the local maximum.

These solutions would fix the problem in a case where there were only a few local maximums, but for some problems it might be the case that there are numerous local maximums. For these problems, genetic algorithms with random-reset might find solutions that have very high fitness, but never the solution. Currently, I’ve found that the 100% mutation solution performs better than random-reset when it comes to avoiding local maximums in the context of a K-SAT problem, but for very large problems I consistently end up with solutions that satisfy 99% of the clauses and go on indefinitely without approaching the solution.

Need I Say More…

I feel as if my life’s aspirations are suddenly within reach.

http://www.swordsofhonor.com/ca-30-61.html
http://www.swordsofhonor.com/9thcehovihed.html