I’ve been doing a deep dive into the Lognormal distribution this week. The big message is that online information on the topic is confusing, inconsistent, and contradictory. The little message is that I came up with a practical scheme to estimate the scale and shape parameters of observed data that is assumed to follow a Lognormal distribution.

The first challenge is understanding exactly what the Lognormal distribution is. If you have frequency data that, when graphed, resembles a bell-shaped curve, then that data may follow a Normal (or Gaussian) distribution with parameters mean (Greek letter mu) and standard deviation (Greek letter sigma). For example, if you looked at the heights of 100 men and graphed these heights, you’d likely see a bell-shaped curve with center about 68 inches (the mean) and spread from about 59 inches to 77 inches (so a standard deviation of about (77 – 59) / 6 = 3 inches, because nearly all Normal data falls within three standard deviations on either side of the mean). If you want to estimate the mean and standard deviation of Normal data, you can do so by calculating the sample mean and sample standard deviation of the observed data.
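As a minimal sketch of that last step, the snippet below estimates the mean and standard deviation of Normal data using only the Python standard library. The 100 heights are synthetic (drawn from a Normal with mean 68 and standard deviation 3, values I picked to match the example), not real measurements.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical sample: 100 heights in inches, drawn from Normal(68, 3).
heights = [rng.gauss(68.0, 3.0) for _ in range(100)]

mean_hat = statistics.fmean(heights)   # sample mean
sd_hat = statistics.stdev(heights)     # sample standard deviation (n - 1 denominator)

print(f"estimated mean = {mean_hat:.2f}, estimated sd = {sd_hat:.2f}")
```

With only 100 values the estimates will be near, but not exactly equal to, 68 and 3.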

If you have frequency data that, when graphed, resembles a bell-shaped curve where the right tail is stretched towards the right, that data may follow a Lognormal distribution. Instead of a mean and standard deviation, a given Lognormal distribution has a scale and a shape parameter. The scale parameter is a measure of the center of the data and the shape parameter is a measure of the spread of the data. The Wikipedia entry on Lognormal has a poorly labeled image which shows what the Lognormal distribution looks like for a fixed scale value = 0.0 (labeled mu there) and several different shape values (labeled sigma there). Well, I had skewed data, so I wanted to estimate the underlying Lognormal scale and shape parameters. Here’s where I ran into all kinds of difficulty trying to understand the math involved. I’m fairly decent with math, but after several hours I decided to take a practical approach.

One problem I ran into was inconsistent naming of three different pairs of quantities: the scale and shape parameters of the Lognormal, the mean and standard deviation of the underlying Normal distribution related to a Lognormal distribution, and the mean and standard deviation of the Lognormal itself.
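To keep these pairs apart, here is a small sketch under one particular convention, which is an assumption on my part: "scale" means mu and "shape" means sigma, the mean and standard deviation of the underlying Normal (that is, of the log of the data). Some sources instead call exp(mu) the scale. Under this convention the mean of the Lognormal is exp(mu + sigma^2 / 2) and its variance is (exp(sigma^2) - 1) * exp(2 * mu + sigma^2). The function name lognormal_mean_sd is made up for illustration.

```python
import math
import random
import statistics

def lognormal_mean_sd(mu, sigma):
    """Mean and sd of a Lognormal whose underlying Normal has mean mu and sd sigma."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return mean, math.sqrt(var)

mu, sigma = 0.10, 0.80                 # scale and shape (assumed convention)
mean, sd = lognormal_mean_sd(mu, sigma)

# Check by simulation: the logs of Lognormal samples should look Normal(mu, sigma).
rng = random.Random(0)
samples = [rng.lognormvariate(mu, sigma) for _ in range(50_000)]
logs = [math.log(x) for x in samples]

print("underlying Normal mean, sd:", statistics.fmean(logs), statistics.stdev(logs))
print("Lognormal mean (formula, simulated):", mean, statistics.fmean(samples))
print("Lognormal sd   (formula, simulated):", sd, statistics.stdev(samples))
```

The point of the check is that mu and sigma describe the logged data, while the Lognormal's own mean and standard deviation are different numbers derived from them.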

My input was a frequency array. I wrote a program which iteratively generated a Lognormal-distributed “proposed” array with more-or-less randomly selected scale and shape parameters. Then I computed how close the proposed array was to the input array using the average of the squared differences between corresponding cells. I kept doing this, keeping track of the best scale and shape parameters found. The programming was moderately challenging but it seems to be working (so far). In the image below I generated a test observed data array with known scale = 0.10 and known shape = 0.80. You can see that the test data is 330, 967, 385, 160, and so on. The program estimated the scale parameter to be 0.1146 and the shape parameter to be 0.8132. To help me visualize what that meant, I generated an array with those parameter estimates and got 329, 952, 386, 159, and so on. Pretty close to the input array values.
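The general idea can be sketched with a plain random search. This is a simplified stand-in, not my actual program: it guesses scale and shape uniformly at random, builds the proposed array from the Lognormal CDF instead of from random samples, and keeps the guess with the lowest average squared difference. The bin edges (20 bins of width 0.5), the search ranges, and all function names are assumptions made for illustration, and the synthetic test data will not reproduce the 330, 967, 385, ... array.

```python
import math
import random

def lognormal_cdf(x, mu, sigma):
    """P(X <= x) for a Lognormal with log-mean mu and log-sd sigma."""
    if x <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def expected_counts(mu, sigma, edges, total):
    """Proposed frequency array: expected count in each bin for `total` items."""
    cdf = [lognormal_cdf(e, mu, sigma) for e in edges]
    return [total * (cdf[i + 1] - cdf[i]) for i in range(len(edges) - 1)]

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def random_search(observed, edges, iterations=20000, seed=0):
    """Try random (mu, sigma) pairs and keep the best-fitting one."""
    rng = random.Random(seed)
    total = sum(observed)
    best = (None, None, float("inf"))
    for _ in range(iterations):
        mu = rng.uniform(-2.0, 2.0)      # assumed search range for scale
        sigma = rng.uniform(0.05, 3.0)   # assumed search range for shape
        err = mean_squared_error(observed, expected_counts(mu, sigma, edges, total))
        if err < best[2]:
            best = (mu, sigma, err)
    return best

# Synthetic observed data with known scale = 0.10 and shape = 0.80.
edges = [0.5 * i for i in range(21)]     # 20 bins covering [0, 10]
rng = random.Random(1)
observed = [0] * 20
for _ in range(2000):
    idx = int(rng.lognormvariate(0.10, 0.80) / 0.5)
    if idx < 20:                         # drop the rare values above 10
        observed[idx] += 1

mu_hat, sigma_hat, err = random_search(observed, edges)
print(f"scale = {mu_hat:.4f}, shape = {sigma_hat:.4f}, error = {err:.2f}")
```

A pure random search like this wastes most of its guesses far from the answer, which is one reason to use something smarter that concentrates new guesses around the best ones found so far.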

The Lognormal parameter estimation program I wrote is sort of a hybrid of a Genetic Algorithm and a Simulated Bee Colony algorithm. I’ll describe the program in a future blog post if I have time.