Topological Sorting

Posted by Jake Good
on Feb 02, 06

When it comes to implementing algorithms, it’s always useful to write it out in psuedo code or at least obtain the algorithm in some form to work with.



At work, we typically use a form of data mapping layer that relies on Martin Fowler’s documented Unit of Work pattern. A problem that arises while using this pattern is dependency ordering for when you commit a unit of work.



Imagine if you will, you have a Parent object and a Child object that has a reference. In order for it to save to certain data store scenarios, you have to commit the Parent before the Child. Think “foreign key relationship” for common databases. The Unit of Work has to rearrange it’s elements internally so that when it goes to commit an object, it’s predecessors and dependency needs are taken into account.



This is where Topological Sort comes in. It essentially takes a graph with edges and outputs the nodes in such a away that all of the preceding nodes get spit out before their dependant nodes. This solves our Unit of Work problem by ensuring that the Parent objects will be committed before their respective Child objects. The only problem it introduces is that it fails when there is a circular reference, but with most data stores, you aren’t allowed or shouldn’t be allowed to have circular references that depend on each other.



The algorithm itself is pretty simple:




while graph is nonempty do
if Q is empty then
you have a cicular reference.
remove a node n from Q
output n
for each node m with an edge e from n to m do
remove edge e from the graph
if m has no other incoming edges then
insert m into Q


If you’re interested, the Big-O of this aglorithm is O(|V|+|E|) where V is the number of nodes and E is the number of edges. Not bad, could be better… could be worse.



The difficulty in this algorithm, applying it to our problem is that it requires a lot of meta data about the “nodes” of the graph. In our case, we need to know that an object is indeed, a real predecessor to another object.



Instead of quickly trying to fit it into our existing solutions, I took a baby step and began with a simple graph implementation.



We have some nodes:




Node nodeOne = new Node();


The each contain a collection of incoming edges (Parents) and outgoing edges (Children).



The C# (.Net 2.0) implementation of the actual sort is pretty straight forward. What you don’t see below is the method GetRootNodes(List<Node> nodes). All this method does is find the nodes that don’t have incoming edges (Parents).




public List Sort(List nodes)
{
List sorted = new List();

// Queue up the nodes that don't have any incoming edges.
Queue rootNodes = this.GetRootNodes(nodes);

while (nodes.Count > 0)
{
// if you have nodes, but no nodes
// that don't have incoming edges,
// then you've got a circular reference.
if (rootNodes.Count == 0)
{
throw new ApplicationException("The graph contains a circular reference.");
}

// grab the first start node.
Node n = rootNodes.Dequeue();

// add it your output.
sorted.Add(n);

// for each of it's dependancies,
// remove the incoming edge,
// queue the node,
// then remove the original node from the graph.
for (int i = (n.Children.Count - 1); i >= 0; i--)
{
Node childNode = n.Children[i];
n.RemoveChildNode(childNode);
if (childNode.Parents.Count == 0)
{
rootNodes.Enqueue(childNode);
}
}

nodes.Remove(n);
}

return sorted;
}


There you have it, plain and simple. I wrote some tests that verify the sort but the interesting thing is… the sort can theoretically produce multiple valid outputs. Luckily my test doesn’t check the actual order, but merely that the ones that need to appear first always appear first; by checking indexes of nodes.



The caveat to this is the meta data that is available. Each child node knows about it’s parent node. The difficulty in applying this method to a Unit of Work is that the relationship data might not be available or you might have to poll for more meta-data to make the decisions about edges. This specific implementation is also dangerous as it is destructive to the objects that come in, the line : n.RemoveChildNode(childNode); . Though this can easily be avoided by creating collections of meta data, rather than manipulating the actual collections of edges, I decided to be destructive for the sake of sharing an example and letting you figure things out for yourself.



Another trick you can pull is change the Queue to be a Stack. The nature of the outputted nodes is affected. Using a queue, it does a breadth-first search through the node trees, going the breadth across the outgoing edges (Children). If you change it to a stack, it does a pure depth-first search through the outgoing edges (Children). How does this affect the Unit of Work? Essentially all of the related objects will be inserted in more tight-knit groups, rather than just after the fact.



In conclusion, this kind of knowledge can be applied to numerous different areas of software development. It just so happens that it solves a lot of our problems in our frameworks we use everyday. It’s two fold. One on end, it shows us the value of using psuedo-code to express our ideas and on the other end, it shows us the importance of a well thought out and tested solution that might not have been thought of. I bet you don’t think about graph theory everyday while you’re writing code…



The code listed above does not intend to be perfect, production level code. Use it at your own risk. It is meant to be an example and should be treated as such.



References:



Wikipedia : List of Algorithms



NIST Dictionary of Algorithms

Comments

Leave a response

  1. blameMikeFeb 02 06 @ 02:20AM
    Sweet.
  2. SteveFeb 02 06 @ 02:20AM
    Wonderful post man!
  3. MalFeb 02 06 @ 02:20AM

    Funny how there is a whole interweb out there full of crap and then you come across something like this - useful, thoughtful and elegant. Nice work.

Comment