The past several weeks I’ve been working on a very challenging problem: given a very large graph (in the neighborhood of 100 million nodes) that represents a kind of a social network (who talks to who), determine the shortest path (number of connections) between any two nodes/people, with near-real-time performance. I could write volumes about the ways I found that do not work. Standard in-memory techniques fail because of the size of the graph (even when highly compressed). Standard in-SQL techniques are much too slow. After many, many hours of experimentation, I made a big breakthrough this week. The essence of the solution has two parts. The first is to maintain a set of dynamic data structures and only add nodes to these structures as they appear during program execution. The second part of the solution is a random access priority queue that has O(lg n) performance. Here is a rough sketch of the algorithm:

1 function ShortestPath(startNode, endNode)
2 Dictionary dist // dist from startNode
3 Dictionary previous // predecessor in SP path
4 Dictionary beenAdded // node been seen yet?
5 PriorityQueue PQ // random access PQ
6 dist[startNode] = 0.0
7 previous[startNode] = undefined
8 PQ.Enqueue[startNode,0.0]
9 beenAdded[startNode] = true
10 while PQ not empty
11 u = PQ.Dequeue
12 if u.GetDistance = infinity then break;
13 if u.GetID = endNode then break;
14 foreach neighbor v to u do
15 if beenAdded[v] = false then
16 dist[v] = infinity
17 previous[v] = undefined
18 PQ.Enqueue(v,infinity)
19 beenAdded[v] = true;
20 end if
21 alt = dist[u.GetID] + 1.0
22 if alt < dist[v] then
23 dist[v] = alt
24 previous[v] = u.GetID
25 PQ.ChangeDistance(v,alt)
26 end if
27 end foreach
28 end while
29 return dist[endNode]
30 end ShortestPath

The main difference between this big-graph version and a traditional version of Dijkstra’s shortest path is in lines 15-20 which occur when a graph node is seen for the first time. Hidden in the algorithm are a lot of technical challenges, in particular an efficient SQL graph representation in line 14, and a very complex random access priority queue in lines 8, 11, 18, and 25. All in all, this was one of the most interesting and challenging algorithms I’ve ever worked on.

### Like this:

Like Loading...

*Related*

Well… Do you have any implementation?

I hope to post the implementation at some point. The code is surprisingly tricky, but the hardest part was coming up with the overall scheme.