Shortest Path Graph Analysis using a CLR Stored Procedure

I wrote an article “Shortest-Path Graph Analysis using a CLR Stored Procedure” that appears in the May 2013 issue of Microsoft’s MSDN Magazine. See http://msdn.microsoft.com/en-us/magazine/dn198246.aspx. There are many important uses of shortest path analyses. Imagine you have a huge amount of social network data. This data can be stored in a graph where the nodes are people and the connections between nodes represent associations like “friend-ness” or “has exchanged e-mail”. Shortest path analysis determines the shortest number of hops between any two people. This information could be used to identify people who are highly influential because many paths run through them.

Most traditional shortest path algorithms assume that the graph under analysis can be stored in memory, typically in a matrix. But large graphs (say greater than 100,000 nodes) that cannot fit into memory can be stored in a SQL database. One way to do shortest path analysis of a graph in a database is to write a stored procedure using the SQL language. But SQL is fairly slow and the code is quite unnatural because SQL is primarily set-based (e.g., the Select statement) rather than procedural (e.g., foreach loops).

In a Microsoft technology stack environment, an excellent alternative to shortest path analysis using the SQL language is to write a stored procedure using the C# language. Such stored procedures are called CLR (Common Language Runtime) procedures. CLR stored procedures are more complex than standard SQL stored procedures, but have much better performance.

NewProjectCLRDatabase

CallingTheStoredProcUsingSQL

Posted in Software Test Automation | Leave a comment

Data Clustering using Category Utility

I wrote an article titled “Data Clustering using Category Utility” which appears in the May 2013 issue of MSDN Magazine. See http://msdn.microsoft.com/en-us/magazine/dn198247.aspx. Data clustering is a process that programmatically groups similar data items together and dissimilar items in different groups.

There are many clustering algorithms; the most well-known is called the k-means clustering algorithm. In the MSDN article I present a clustering algorithm that is based on the idea of something called category utility (CU). For any possible clustering of a data set it is possible to compute the CU, which is just a number like 0.37, of the clustering. The CU of a clustering is a clever metric that is related to how good a clustering is — small values of CU indicate better clustering than larger values of CU. So the idea is to try different clusterings, keep track of the CU of each clustering, and use the clustering that has the best CU.

ClusteringCategoricalDemoRunScreenshot

C# Source Code:
Continue reading

Posted in Software Test Automation | 1 Comment

Discretizing Continuous Data

Many machine learning algorithms work only on either continuous numeric data (such as heights in inches — 67.5 inches, 70.2 inches, etc.) or work only on categorical data (such as car color — red, white, etc.) For example, k-means data clustering works only with continuous/numeric data but CU (category utility) clustering works only with categorical data. And logistic regression classification and prediction works only with continuous data but naive Bayes classification and prediction works only with categorical data. There are many other examples.

A fundamental technique in machine learning is to convert numeric data into categorical data so that an algorithm that works only with categorical data can be applied to the data under investigation. For example, consider clustering mixed numeric and categorical data. If the numeric data columns can be converted into categorical data, then the powerful CU clustering algorithm can then be applied to the entire data set. By the way, it is possible to convert categorical data into numeric data using 1-of-n or 1-of-(n-1) encoding but that’s another story.

Perhaps because discretization of continuous data isn’t a very flashy topic, even though there is a fairly good amount of research on the topic, there hasn’t been nearly as much research as I’d guessed there’d be given the topic’s fundamental importance.

There are three basic approaches to unsupervised (meaning no so-called training data is used) discretizing numeric data. Suppose we have heights of 16 people: 64, 66, 66, 68, 68, 68, 70, 70, 70, 70, 72, 72, 72, 74, 74, 76.

The first approach is to sort the data, and assign equal numbers of data points to each category label. For example, if we have n=4 bins, then the first four heights get category label “0″, the second four heights get category “1″, the third four heights are “2″, and the last four heights are “3″. So heights 64, 66, 66, 68 map to label “0″, and 68, 68, 70, 70 map to label “1″ and . . . oops you can see that we have a problem because some heights are mapping to different category labels. Even if we fix this minor but annoying technical problem, the equal-frequency approach doesn’t take into account natural breaks in the data.

A second approach to unsupervised discretization of numeric data is to create equal intervals. For example, the range of the example data is 76 – 64 = 12 inches. If there are n=4 bins then the intervals are [64-67), [67-70), [70-73), [73-76] where I’ve used square brackets for inclusive and parentheses for exclusive. This approach also ignores natural breaks in the data. A variation on the equal-interval approach is to bin data according to the data’s standard deviation.

The third approach to unsupervised discretization of continuous data is to use clustering. The idea is to cluster the numeric data using Euclidean distance. The result will generally take into account natural breaks in the data. Then the category label for each numeric data point is its assigned zero-based cluster number.

So, this is almost a bit recursive: if the primary problem is to cluster mixed categorical and numeric data, we’d like to use CU clustering which works only on categorical data. We can turn the numeric data into categorical data using clustering which works only on numeric data.

I’ve experimented with the third approach with pretty good results. An unanswered question is how to determine the optimal number of clusters — both for the preliminary clustering of each numeric data column, and the primary CU clustering of the entire data set. I have written up a detailed explanation of this topic, with source code. It is scheduled to appear in Microsoft’s MSDN Magazine in the July 2013 issue.

Posted in Software Test Automation

My Top Ten Favorite Movies That Involve Time Travel

I like movies that involve time travel or being able to see into the past or future. Here are my top 10, meaning that if I was going on a long trip and had to pick ten movies that feature time travel in some way, these would be my choices.

1. Source Code (2011) – Jake Gyllenhaal finds himself part of a government system that can send him back in time to an alternate timeline, but as another person. Gyllenhaal’s mission is to determine who blows up a train with explosives. Great combination of mystery and time travel, and I loved the surprise ending.

2. 12 Monkeys (1995) – Director Terry Gilliam has Bruce Willis as a future criminal who is sent back in time to find the cause of a deadly virus that decimated the earth. Unfortunately the time travel device is a bit crude and Willis lands in unexpected situations. Brad Pitt, relatively unknown in 1995, gives a good performance as an insane asylum resident. The ambiguous ending is a flaw in my opinion.

3. The Time Machine (1960) – A science fiction classic loosely based on the H.G. Wells novel. Rod Taylor builds a time machine goes into 800,000 years into the future. There he encounters the meek and attractive Eloi, and the not-so-meek and not-so-attractive Morlocks. Well-deserved Oscar for special effects that hold up well even 50+ years later.

TimeMachineMoviePoster

4. Paycheck (2003) – Ben Affleck creates a machine that can see into the future for a giant corporation. As part of the deal, Affleck’s memory is erased. So, choice A – the corporation is good and Affleck has no problems. Or, choice B – the corporation sets out to murder Affleck. Great plot especially where Affleck has an envelope of seemingly random objects that allow him to escape his pursuers. I knew the bird would be significant in the end.

5. Minority Report (2002) – In 2054 the government PreCrime unit, using precog mutants, can peer into the future, see murders, and then stop them before they are committed. Tom Cruise, as John Anderton, does a good job as a captain in the PreCrime unit. Anderton is predicted to commit a murder. Will he or is it all part of a plot? I like the movie’s details of what 2054 might be like. Didn’t like the kind of weird blue tint and light glare in almost every scene.

6. Deja Vu (2006) – Denzel Washington is a government agent who is sent back in time to discover who blows up a ferry and kills hundreds of people. This is really more of a mystery movie than science fiction movie. As is often the case with time travel movies there is an unexpected, clever ending.

7. Time Bandits (1981) – This is really a fantasy film with time travel elements. Director Terry Gilliam crafts a rather wacky story that I liked mostly for its wild creativity. A young boy and six dwarfs jump through time and battle Evil, personified by actor David Warner (who has appeared in quite a few scifi movies).

8. A Sound of Thunder (2005) – Even though this movie was a colossal box office disaster, I really enjoyed this tale of present day hunters traveling back in time for a dinosaur hunt. Unfortunately, one hunter accidentally kills an insect in the past. Could that cause problems?

9. The Time Travelers (1964) – Low budget production where a group of scientists travel to 2071 to find post nuclear war survivors living underground battling surface dwelling mutants. Eventually the scientists return to the present . . . or do they? This movie was kind of remade in 1967 as Journey to the Center of Time but I much prefer the 1964 version.

10. The Butterfly Effect (2004) – Ashton Kutcher has the ability to go back in time, and then change the future. What could go wrong? Lots. For me, this movie was somewhat too dark and disturbing, mostly because of traumatic plot events. Not a movie for children. And a not-so-happy ending.

Some other films:

Time Cop (1994) starring Jean-Claude Van Damme was surprisingly not bad and would make my top ten list on some days

Terminator 2 (1991) with Arnold Schwarzenegger as a good Terminator is an excellent film, but I’ve seen it about 50,000 times.

Looper (2012) starring Bruce Willis had the potential to be great but the idea of child-murder as part of the plot totally wrecked the film for me.

Star Trek: First Contact (1996) with the Borg Queen was very good, but the time travel element was a bit too incidental for me to put this movie on my top ten list.

Primer (2004) was very well liked by most of my friends but I didn’t like this ultra-low budget movie. Too confusing for me to enjoy.

Harry Potter and the Prisoner of Azkaban (2004) had Hermione travelling back in time but I felt it was incidental to the main plot.

Frequency (2000) with Dennis Quaid and Jim Caviezel as father and son was too slow for me.

Posted in Uncategorized | 2 Comments

Classification and Prediction using Adaptive Boosting

I wrote an article titled “Classification and Prediction using Adaptive Boosting” that appears in the April 2013 issue of MSDN Magazine. See http://msdn.microsoft.com/en-us/magazine/dn166933.aspx. Adaptive boosting is a technique that combines a bunch of simple rules extracted from data to come up with a consensus prediction. Adaptive boosting is one form of what is usually called ensemble classification.

In the article I set up a demo where the goal is to predict whether a football team will win or lose an upcoming game, based on three simple rules culled from historical data. For example, suppose the goal is to predict if the team in question will win if their opponent is Detroit, they are playing at Home, and the Vegas point spread is Small. Suppose that historically, simple rules indicate that if the opponent is Detroit, the team will win. If the field is Home, the team will win. If the point spread is Small, the team will lose. Note that two of simple rules say the team will win, and one simple rule says the team will lose.

Adaptive boosting assigns a weight to each simple rule. In this case the three weights turn out to be 0.63, 3.15, and 4.49. Combining the weights yields a final prediction that the team will lose. Even though two of the three rules predict a win, the higher weight of the third rule overcomes the first two rules and gives an overall prediction of lose. Determining the weights of the simple rules is the heart of the adaptive boosting algorithm.

To be honest, I’m not entirely convinced about how effective adaptive boosting is. There are other ensemble machine learning algorithms, such as random forests, that I suspect may be superior to adaptive boosting. Regardless, adaptive boosting is a more or less standard machine learning technique.

AdaptiveBoostingDemo

Posted in Software Test Automation

Classification using Perceptrons

I wrote an article titled “Classification using Perceptrons” that appears in the April 2013 issue of Visual Studio Magazine. See http://visualstudiomagazine.com/articles/2013/04/01/classification-using-perceptrons.aspx. A perceptron is code that models a single biological neuron. Perceptrons were the predecessor to neural networks — a neural network is a collection of interconnected perceptrons.

ColorfulPerceptron

Although quite limited, perceptrons can be used to perform certain types of machine learning classification. Perceptron classification uses data to construct a perceptron that can place data into one of several categories, for example, determining whether a hospital patient has (+1) or does not have (-1) a disease based on medical test data.

PerceptronClassificationDemoRun

Posted in Software Test Automation

The Microsoft Management Summit Conference

This week I spoke at the 2013 Microsoft Management Summit conference. The conference is intended primarily for IT engineers and managers. There are about 300 talks, maybe 200 training sessions, and miscellaneous other events. I think there are a total of about 6,000 people at the conference: about 5,000 attendees and about 1,000 speakers and exhibitors. This year the conference was at the Mandalay Bay hotel. I have spoken at MMS many times in the past (at least 8 years that I can recall) and I always learn a lot and have a good time.

My talk was “Neural Networks for IT Professionals”. I am posting images of my PPT here. You can also hear the audio of the presentation at http://channel9.msdn.com/Events/MMS/2013/MMS104.

Slide2

Slide3

Slide4

Slide5

Slide6

Slide7

Slide8

Slide9

Slide10

Slide11

Slide12

Slide13

Slide14

Slide15

Slide16

Slide17

Posted in Software Test Automation