Suppose you have a sample of data, for example a bunch of people’s heights, or maybe the times between arrivals of Web requests to a server. You might want to estimate the underlying probability density function (PDF) that generated the sample data.
In my two examples above, it’s well known that people’s heights usually follow a Normal, bell-shaped curve distribution. And times between arrivals often follow an Exponential distribution. But for many sets of sample data, the underlying distribution may be unknown.
There are dozens of classical statistics techniques to determine the underlying distribution for a set of sample data. One approach is called the Parzen window technique. It’s also known as kernel density estimation.
Briefly, if you have a set of n sample data values X = (x1, x2, .., xn), you can estimate the density at any value x with:

f(x) = (1 / (n * h)) * sum( K( (x - xi) / h ) )   [sum over i = 1 to n]

h = 0.9 * min(sd, IQR / 1.34) * n^(-1/5)

K(u) = (1 / sqrt(2 * pi)) * e^(-u^2 / 2)
Ugh. What a mess. But the equations aren’t as bad as they appear. The f is the approximating function. It needs a smoothing parameter h and a kernel function K. The h shown is called Silverman’s rule of thumb. The K shown is the Gaussian kernel.
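To make the equations concrete, here’s a minimal from-scratch sketch in Python. All the function names are my own, and the quartile computation is deliberately crude, but the logic follows the f, h, and K definitions above.

```python
import math

def gaussian_kernel(u):
    # K(u) = (1 / sqrt(2*pi)) * e^(-u^2 / 2)
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def silverman_h(xs):
    # Silverman's rule of thumb: h = 0.9 * min(sd, IQR/1.34) * n^(-1/5)
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    s = sorted(xs)
    iqr = s[(3 * n) // 4] - s[n // 4]  # crude quartiles, fine for a sketch
    return 0.9 * min(sd, iqr / 1.34) * n ** (-0.2)

def parzen_pdf(x, xs, h=None):
    # f(x) = (1 / (n*h)) * sum K((x - xi) / h)
    n = len(xs)
    if h is None:
        h = silverman_h(xs)
    return sum(gaussian_kernel((x - xi) / h) for xi in xs) / (n * h)
```

Calling parzen_pdf(x, xs) for a range of x values traces out the estimated PDF curve.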
I coded up a demo. First, I generated a sample of 30 values from a Normal distribution with mean = 0 and standard deviation = 1, which is a bell-shaped curve, centered about 0, with most data between -3 and +3. In a real problem I wouldn’t know the underlying distribution.
Then I estimated the PDF using the sample data. The graph shows the estimate is pretty close to the true distribution. Different choices of h and K can give significantly different results; having to pick h and K is a major weakness of Parzen window estimation.
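If you don’t want to roll your own, SciPy’s gaussian_kde implements the same Gaussian-kernel estimator and accepts bw_method='silverman'. Here’s a sketch of the experiment described above; the seed and grid values are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)            # arbitrary seed
sample = rng.normal(0.0, 1.0, size=30)    # 30 draws from N(0, 1)

# Gaussian-kernel Parzen window estimate with Silverman's h
kde = gaussian_kde(sample, bw_method='silverman')

grid = np.linspace(-3.0, 3.0, 13)
est = kde(grid)              # estimated density at each grid point
true_pdf = norm.pdf(grid)    # true N(0, 1) density, for comparison
```

Plotting est against true_pdf on the grid reproduces the kind of graph described above.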
The moral of the story is that the more techniques you know, the more flexible you become. But some topics you can live without. I think Parzen window PDF estimation is probably too rarely used for you to spend much time on it. But it’s an interesting technique.