The past several weeks I’ve been working on a very challenging problem: given a very large graph (in the neighborhood of 100 million nodes) that represents a kind of a social network (who talks to who), determine the shortest path (number of connections) between any two nodes/people, with near-real-time performance. I could write volumes about the ways I found that do not work. Standard in-memory techniques fail because of the size of the graph (even when highly compressed). Standard in-SQL techniques are much too slow. After many, many hours of experimentation, I made a big breakthrough this week. The essence of the solution has two parts. The first is to maintain a set of dynamic data structures and only add nodes to these structures as they appear during program execution. The second part of the solution is a random access priority queue that has O(lg n) performance. Here is a rough sketch of the algorithm:
1 function ShortestPath(startNode, endNode) 2 Dictionary dist // dist from startNode 3 Dictionary previous // predecessor in SP path 4 Dictionary beenAdded // node been seen yet? 5 PriorityQueue PQ // random access PQ 6 dist[startNode] = 0.0 7 previous[startNode] = undefined 8 PQ.Enqueue[startNode,0.0] 9 beenAdded[startNode] = true 10 while PQ not empty 11 u = PQ.Dequeue 12 if u.GetDistance = infinity then break; 13 if u.GetID = endNode then break; 14 foreach neighbor v to u do 15 if beenAdded[v] = false then 16 dist[v] = infinity 17 previous[v] = undefined 18 PQ.Enqueue(v,infinity) 19 beenAdded[v] = true; 20 end if 21 alt = dist[u.GetID] + 1.0 22 if alt < dist[v] then 23 dist[v] = alt 24 previous[v] = u.GetID 25 PQ.ChangeDistance(v,alt) 26 end if 27 end foreach 28 end while 29 return dist[endNode] 30 end ShortestPath
The main difference between this big-graph version and a traditional version of Dijkstra’s shortest path is in lines 15-20 which occur when a graph node is seen for the first time. Hidden in the algorithm are a lot of technical challenges, in particular an efficient SQL graph representation in line 14, and a very complex random access priority queue in lines 8, 11, 18, and 25. All in all, this was one of the most interesting and challenging algorithms I’ve ever worked on.