I’ve been working on some really fascinating code lately. The main goal is to create a desktop-type application that calculates the shortest path between two nodes in a very large graph. The nodes represent people in a social network and the edges between nodes represent a communication of some sort. By large graph, I mean a graph that has millions of nodes/people and so the graph won’t fit into machine memory; therefore, in practical terms, the graph is stored in a SQL database.
My first experiment was to write a SQL stored procedure using the T-SQL language (I’m working in a 100% Microsoft technology environment). Because there is a ton of processing involved (I’m using the Dijkstra Shortest Path algorithm) and T-SQL is not really designed to do such processing, the performance was not so good — for my baseline test case with 5,000,000 nodes, computing the shortest path took 25 minutes.
My second experiment was to write a C# method. The C# method did all of the algorithm processing and called a helper function named GetAdjacents that used ADO.NET to retrieve all the nodes in the graph connected to a given node (the key data/graph access part of Dijksta). Performance was quite good; the baseline case took about 90 seconds. But now I was hooked into this problem and wanted to do better.
My next experiment tried several code trick tweaks to the second experiment. I discovered that the recommended approach of using the C# ‘using’ statement, like
using (SqlConnection sc = new SqlConnection(connString))
had a pretty significant bad impact on performance. Not sure why this is so but I ditched using the using statement and explicitly opened and closed connections. I was able to squeeze about 10% improvement in performance, so my baseline test case took about 80 seconds.
I was ready to move on when I recalled that the .NET environment allows you to write stored procedures using C#. These are called CLR Stored Procedures. Maybe you recall that it’s possible to write a SQL ‘extended’ stored procedure using C++ (which is a real pain in the patooty). I wondered if by pushing both the data access and the logic processing into a CLR Stored Procedure, and then calling the stored procedure from a C# helper method, I could improve shortest path performance. Well, the net results were excellent: my baseline test case run time was reduced to about 13.25 seconds.
To summarize, when using Microsoft technologies, a technique to do very fast shortest path analysis for a large graph stored in SQL is to write a CLR Stored Procedure using C# that does both the data access and logic processing and then call the resulting stored procedure using a C# helper method. The advantage is very fast performance. The disadvantage of the approach is that the technical details are very tricky. If I find time, I’ll write up the details of this technique.