## Modified Dijkstra’s Shortest Path for Very Large Graphs

The past several weeks I’ve been working on a very challenging problem: given a very large graph (in the neighborhood of 100 million nodes) that represents a kind of a social network (who talks to who), determine the shortest path (number of connections) between any two nodes/people, with near-real-time performance. I could write volumes about the ways I found that do not work. Standard in-memory techniques fail because of the size of the graph (even when highly compressed). Standard in-SQL techniques are much too slow. After many, many hours of experimentation, I made a big breakthrough this week. The essence of the solution has two parts. The first is to maintain a set of dynamic data structures and only add nodes to these structures as they appear during program execution. The second part of the solution is a random access priority queue that has O(lg n) performance. Here is a rough sketch of the algorithm:

``` 1 function ShortestPath(startNode, endNode)
2   Dictionary dist     // dist from startNode
3   Dictionary previous    // predecessor in SP path
4   Dictionary beenAdded  // node been seen yet?
5   PriorityQueue PQ    // random access PQ
6   dist[startNode] = 0.0
7   previous[startNode] = undefined
8   PQ.Enqueue[startNode,0.0]
10   while PQ not empty
11     u = PQ.Dequeue
12     if u.GetDistance = infinity then break;
13     if u.GetID = endNode then break;
14     foreach neighbor v to u do
15       if beenAdded[v] = false then
16         dist[v] = infinity
17         previous[v] = undefined
18         PQ.Enqueue(v,infinity)
20       end if
21       alt = dist[u.GetID] + 1.0
22       if alt < dist[v] then
23         dist[v] = alt
24         previous[v] = u.GetID
25         PQ.ChangeDistance(v,alt)
26       end if
27     end foreach
28   end while
29   return dist[endNode]
30 end ShortestPath
```

The main difference between this big-graph version and a traditional version of Dijkstra’s shortest path is in lines 15-20 which occur when a graph node is seen for the first time. Hidden in the algorithm are a lot of technical challenges, in particular an efficient SQL graph representation in line 14, and a very complex random access priority queue in lines 8, 11, 18, and 25. All in all, this was one of the most interesting and challenging algorithms I’ve ever worked on.