Modified Dijkstra’s Shortest Path for Very Large Graphs

The past several weeks I’ve been working on a very challenging problem: given a very large graph (in the neighborhood of 100 million nodes) that represents a kind of a social network (who talks to who), determine the shortest path (number of connections) between any two nodes/people, with near-real-time performance. I could write volumes about the ways I found that do not work. Standard in-memory techniques fail because of the size of the graph (even when highly compressed). Standard in-SQL techniques are much too slow. After many, many hours of experimentation, I made a big breakthrough this week. The essence of the solution has two parts. The first is to maintain a set of dynamic data structures and only add nodes to these structures as they appear during program execution. The second part of the solution is a random access priority queue that has O(lg n) performance. Here is a rough sketch of the algorithm:

 1 function ShortestPath(startNode, endNode)
 2   Dictionary dist     // dist from startNode
 3   Dictionary previous    // predecessor in SP path
 4   Dictionary beenAdded  // node been seen yet?
 5   PriorityQueue PQ    // random access PQ
 6   dist[startNode] = 0.0
 7   previous[startNode] = undefined
 8   PQ.Enqueue[startNode,0.0]
 9   beenAdded[startNode] = true
10   while PQ not empty
11     u = PQ.Dequeue
12     if u.GetDistance = infinity then break;
13     if u.GetID = endNode then break;
14     foreach neighbor v to u do
15       if beenAdded[v] = false then
16         dist[v] = infinity
17         previous[v] = undefined
18         PQ.Enqueue(v,infinity)
19         beenAdded[v] = true;
20       end if
21       alt = dist[u.GetID] + 1.0
22       if alt < dist[v] then
23         dist[v] = alt
24         previous[v] = u.GetID
25         PQ.ChangeDistance(v,alt)
26       end if
27     end foreach
28   end while
29   return dist[endNode]
30 end ShortestPath

The main difference between this big-graph version and a traditional version of Dijkstra’s shortest path is in lines 15-20 which occur when a graph node is seen for the first time. Hidden in the algorithm are a lot of technical challenges, in particular an efficient SQL graph representation in line 14, and a very complex random access priority queue in lines 8, 11, 18, and 25. All in all, this was one of the most interesting and challenging algorithms I’ve ever worked on.

This entry was posted in Software Test Automation. Bookmark the permalink.

2 Responses to Modified Dijkstra’s Shortest Path for Very Large Graphs

  1. Well… Do you have any implementation?

Comments are closed.