In this post I’ll walk through how to use Python to model an IP/MPLS network. Once you get familiar with this kind of modeling you may find it useful for all kinds of things. You’ll need NetworkX installed.

**The graph**

The basis for our model is a graph representing our network. If you don’t have this handy in some static form, a good source is the traffic engineering database (TED) from a router. The TED describes all links, with their weights, bandwidths, admin-groups, and SRLGs. Of course the drawback of using the TED is that you only get the view of your network in the RSVP-TE domain; links and routers not running RSVP-TE will not be included. For other methods of generating a graph, look at parsing the output of the IGP database, fetching LLDP output from all routers, or of course just pairing up static IP addresses from configuration files, then joining that with bandwidth and other configuration information. If you can use it, the TED is the quickest way to get everything you need.

You can gather the TED any number of ways. Here’s one method based on a pattern that I often use when I need to get data for ad hoc investigations. I call this ghetto netconf.

jlogin -c "show ted database extensive | display xml" <router> | sed -n "/<rpc-reply/,/<\/rpc-reply>/p" > ted.xml

This fetches the TED from a Juniper router in XML format, which makes for slightly easier processing. We can then process this XML to extract only the parts we care about and store them in something that is easier to work with for whatever I am doing. (Over time you might build a library of XSLT transformations for different outputs, which will be useful for other automation.)

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:t="http://xml.juniper.net/junos/12.3R5/junos-routing">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//t:ted-link"/>
  </xsl:template>

  <xsl:template match="t:ted-link">
    <xsl:value-of select="../t:ted-database-id"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-to"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-local-address"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-remote-address"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-metric"/><xsl:text>,</xsl:text>
    <!-- Convert static BW to Mbps. There's got to be a more elegant
         method to convert magnitudes than this. -->
    <xsl:choose>
      <xsl:when test="contains(t:ted-link-static-bandwidth, 'Gbps')"><xsl:value-of select="number(substring-before(t:ted-link-static-bandwidth,'Gbps')) * 1000"/></xsl:when>
      <xsl:when test="contains(t:ted-link-static-bandwidth, 'Mbps')"><xsl:value-of select="number(substring-before(t:ted-link-static-bandwidth,'Mbps'))"/></xsl:when>
      <xsl:when test="contains(t:ted-link-static-bandwidth, 'Kbps')"><xsl:value-of select="number(substring-before(t:ted-link-static-bandwidth,'Kbps')) div 1000"/></xsl:when>
      <!-- Unrecognized format. Should really raise an error here instead. -->
      <xsl:otherwise><xsl:value-of select="t:ted-link-static-bandwidth"/></xsl:otherwise>
    </xsl:choose><xsl:text>,</xsl:text>
    <!-- Convert reservable BW to Mbps, same as above. -->
    <xsl:choose>
      <xsl:when test="contains(t:ted-link-reservable-bandwidth, 'Gbps')"><xsl:value-of select="number(substring-before(t:ted-link-reservable-bandwidth,'Gbps')) * 1000"/></xsl:when>
      <xsl:when test="contains(t:ted-link-reservable-bandwidth, 'Mbps')"><xsl:value-of select="number(substring-before(t:ted-link-reservable-bandwidth,'Mbps'))"/></xsl:when>
      <xsl:when test="contains(t:ted-link-reservable-bandwidth, 'Kbps')"><xsl:value-of select="number(substring-before(t:ted-link-reservable-bandwidth,'Kbps')) div 1000"/></xsl:when>
      <!-- Unrecognized format. Should really raise an error here instead. -->
      <xsl:otherwise><xsl:value-of select="t:ted-link-reservable-bandwidth"/></xsl:otherwise>
    </xsl:choose><xsl:text>,</xsl:text>
    <!-- Push the admin-groups into a list separated by |. -->
    <xsl:for-each select="t:admin-groups/t:admin-group-name">
      <xsl:value-of select="."/>
      <xsl:if test="position() != last()"><xsl:text>|</xsl:text></xsl:if>
    </xsl:for-each><xsl:text>,</xsl:text>
    <!-- Push the SRLGs into a list separated by |. -->
    <xsl:for-each select="t:ted-link-srlg/t:srlg-name">
      <xsl:value-of select="."/>
      <xsl:if test="position() != last()"><xsl:text>|</xsl:text></xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>

Throw the above XSLT into a file then transform the XML to CSV with xsltproc. You could of course do this inside Python and work directly from the XML source if you wanted, but in the real world you may get the graph from different sources.

xsltproc dirty_xslt ted.xml > ted.csv
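If you’d rather do the magnitude conversion in Python than in XSLT, a small helper along these lines works (the function name and unit table are my own, and ordering the suffix checks from longest to shortest avoids a shorter suffix matching first):

```python
def bandwidth_to_mbps(text):
    """Convert a Juniper-style bandwidth string (e.g. '10Gbps') to Mbps."""
    factors = {'Gbps': 1000.0, 'Mbps': 1.0, 'Kbps': 0.001}
    # Check longer suffixes first so e.g. 'Gbps' can never be shadowed.
    for suffix, factor in sorted(factors.items(), key=lambda i: -len(i[0])):
        if text.endswith(suffix):
            return float(text[:-len(suffix)]) * factor
    raise ValueError('unrecognized bandwidth format: %r' % text)
```

You could call this per row while loading the CSV, instead of converting inside the stylesheet.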

The node IDs in the TED output are slightly obfuscated. They’re a contraction of the hostname, a number indicating whether this node is a router or a pseudonode, and the router_id. We don’t have to concern ourselves with pseudonodes, and hopefully you don’t either. If you do, you might need to parse the TED slightly differently to account for the added complexity of modeling pseudonodes. Since we don’t have them, I’ll just strip this cruft out of our TED.

cat ted.csv | sed -r 's/\.[0-9]+\(([0-9]+\.){3}[0-9]+\)//g' > ted_clean.csv

Depending on the size of your network you may need to take multiple snapshots of the TED over time and merge these to account for different links being down. Remember to key by hostname and link key, take the maximum of the bandwidths, and you’ll probably want the minimum of the weights. If you don’t, you may get duplicate links, or the wrong bandwidths and weights.
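That merge can be sketched as follows, assuming each snapshot is a list of CSV rows in the column order produced above (the function name is mine):

```python
def merge_ted_snapshots(snapshots):
    """Merge multiple TED snapshots (lists of CSV rows) into one edge list.

    Rows are keyed by the first four columns (node, neighbor, local and
    remote address). Across snapshots we keep the minimum weight and the
    maximum bandwidths, so links that were down or re-metriced in one
    snapshot don't produce duplicates or stale values.
    """
    merged = {}
    for rows in snapshots:
        for row in rows:
            key = tuple(row[0:4])
            if key not in merged:
                merged[key] = list(row)
            else:
                kept = merged[key]
                kept[4] = min(int(kept[4]), int(row[4]))      # weight
                kept[5] = max(int(kept[5]), int(row[5]))      # static bw
                kept[6] = max(float(kept[6]), float(row[6]))  # rsvp bw
    return list(merged.values())
```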

**To Python**

However you get your list of edges, once you have it you can load it into a NetworkX graph and start to model the traffic data against it. Our graph may contain multiple edges between each pair of nodes, and these edges are directed because we may have different attributes (e.g. weight and utilization) in each direction.

import csv
import networkx as nx

def load_ted(tedfile):
    G = nx.MultiDiGraph()
    with open(tedfile, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            G.add_edge(row[0], row[1], key=tuple(row[0:4]),
                       weight=int(row[4]),
                       static_bandwidth=int(row[5]),
                       rsvp_bandwidth=float(row[6]),
                       rsvp_utilization=0.0,
                       total_utilization=0.0)
    return G

The basic process now is to read the traffic data and for each flow, find the correct path in the graph, then apply the load to the edges in that path. We’ll need a couple of functions to help us do this.

**ECMP loads**

One type of load we need to model is ECMP traffic. These are flows that follow the shortest path through the network; when multiple equal-cost paths exist they are load balanced across those paths. This means we need to divide the load by the number of equal-cost paths to the destination, then walk each path, and at each hop divide the load again by the number of equal-cost links to the next hop.

def apply_ecmp_flow(G, source, destination, load):
    try:
        paths = list(nx.all_shortest_paths(G, source, destination, weight='weight'))
        num_ecmp_paths = len(paths)
        for p in paths:
            u = p[0]
            for v in p[1:]:
                min_weight = min(d['weight'] for d in G[u][v].values())
                keys = [k for k, d in G[u][v].items() if d['weight'] == min_weight]
                num_ecmp_links = len(keys)
                for k in keys:
                    G[u][v][k]['total_utilization'] += load / num_ecmp_paths / num_ecmp_links
                u = v
    except (KeyError, nx.NetworkXNoPath):
        print "Error, no path for %s to %s in apply_ecmp_flow()" % (source, destination)
        raise

The call to `nx.all_shortest_paths()` returns a list of paths, where each path is itself a list of the hops. We iterate over the paths, and for each path we iterate over the hops with the current node in `u` and the next hop in `v`. For each next hop we must determine which links to apply the load to; there may be one or multiple links. First we find the minimum weight of all links between `u` and `v`, and then using this we select all links with that minimum weight. We then distribute the load across those links.

The number of equal cost paths supported varies depending on the router, so an enhancement here would be to specify the maximum number of paths either as input to the function, or else as a node attribute in the model to permit different settings for each router. We could then select only the supported number of paths and links.
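A minimal sketch of that enhancement (the function name is mine): cap the candidate paths at a per-router maximum before distributing the load over them.

```python
from random import sample

def limit_ecmp_paths(paths, max_paths):
    """Return at most max_paths of the equal-cost paths, chosen at random
    to mimic a router that can only install a limited number of next hops."""
    if len(paths) <= max_paths:
        return paths
    return sample(paths, max_paths)
```

In `apply_ecmp_flow()` you would call this on the result of `nx.all_shortest_paths()` and divide the load by the reduced path count; the same idea applies to the per-hop link selection.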

**RSVP loads**

Another kind of load we need to model is flows on RSVP signaled LSPs. For example, these might be LSPs configured to run auto-bandwidth, where the bandwidth constraint changes in response to the load. In this case we need to apply the load across a path that meets the bandwidth constraint. A simple method for doing this is to prune the links that do not meet the bandwidth constraint before running the shortest path algorithm.

from random import randrange

def apply_rsvp_flow(G, source, destination, load):
    # Produce new graph with links not meeting the reservation
    # requirement pruned.
    H = nx.MultiDiGraph()
    for u, v, k, d in G.edges_iter(keys=True, data=True):
        if d['rsvp_utilization'] + load <= d['rsvp_bandwidth']:
            H.add_edge(u, v, key=k, attr_dict=d)
    # Select one random path from all shortest paths.
    try:
        paths = list(nx.all_shortest_paths(H, source, destination, weight='weight'))
        p = paths[randrange(len(paths))]
        u = p[0]
        for v in p[1:]:
            min_weight = min(d['weight'] for d in H[u][v].values())
            keys = [k for k, d in H[u][v].items() if d['weight'] == min_weight]
            k = keys[randrange(len(keys))]
            G[u][v][k]['total_utilization'] += load
            G[u][v][k]['rsvp_utilization'] += load
            u = v
    except (KeyError, nx.NetworkXNoPath):
        print "Error, no path for %s to %s with load %.2f" % (source, destination, load)

Notice that we add the load to both `rsvp_utilization` and `total_utilization`. The first attribute is used to track the RSVP reservations for our constrained shortest path. The second tracks this load and the ECMP load combined.

Another difference here is that we are not doing any ECMP load balancing of the traffic. The load is applied to only one path, and to one link on each hop. When more than one path or link exists we select one at random. This mimics the CSPF random tie-breaking behavior often used by default on routers. Other tie-breaking behaviors exist, including least-fill, where the router selects the path with the lowest utilization, and most-fill, where it selects the path with the highest utilization. These could be added as enhancements.
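Those tie-breakers could be sketched like this, where `candidates` holds the (key, attrs) pairs of the equal-cost links on a hop (the function and mode names are my own):

```python
from random import choice

def tie_break(candidates, mode='random'):
    """Pick one link key from a list of (key, attrs) candidates.

    'least-fill' picks the link with the lowest RSVP utilization ratio,
    'most-fill' the highest, and 'random' mimics the default CSPF
    tie-breaking behavior."""
    def fill(item):
        _, d = item
        return d['rsvp_utilization'] / d['rsvp_bandwidth']
    if mode == 'least-fill':
        return min(candidates, key=fill)[0]
    if mode == 'most-fill':
        return max(candidates, key=fill)[0]
    return choice(candidates)[0]
```

Inside `apply_rsvp_flow()` this would replace the `keys[randrange(len(keys))]` selection; a path-level tie-breaker would work the same way over the utilization of each candidate path.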

Admin-groups (aka link colors) are another obvious constraint that would be easy to add. You could then associate your flow data with the admin-groups of the configured LSPs and pass these as constraints when applying the flow to the graph.
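An admin-group check could look like this (the parameter names are mine; in the CSV above the groups arrive as a |-separated list, so you'd split them when loading the graph):

```python
def link_allowed(admin_groups, include_any=None, exclude=None):
    """Apply include-any / exclude admin-group constraints to a link.

    admin_groups is the collection of colors on the link. include_any
    requires at least one of the given groups to be present; exclude
    rejects links carrying any of the given groups."""
    groups = set(admin_groups)
    if include_any and not groups & set(include_any):
        return False
    if exclude and groups & set(exclude):
        return False
    return True
```

This predicate would sit next to the bandwidth check wherever links are pruned or selected.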

**Hey, presto!**

Now load your traffic data and iterate through applying the flows.

from random import shuffle

# ... load traffic data ...

for source, destination, load in ecmp_flows:
    apply_ecmp_flow(G, source, destination, load)

# Simulate random signalling order
shuffle(rsvp_flows)
for source, destination, load in rsvp_flows:
    apply_rsvp_flow(G, source, destination, load)

At the end your graph will have the modeled load on each of its links. Simply iterate through and output these in a report.

for u, v, k, d in G.edges_iter(keys=True, data=True):
    print "%s,%s,%s,%s,%d,%.2f,%.2f,%.2f" % (u, v, k[2], k[3],
        d['static_bandwidth'], d['rsvp_utilization'],
        d['total_utilization'], d['total_utilization'] / d['static_bandwidth'])

A limitation of our `apply_rsvp_flow()` function is that it simply prints an error when it can’t find a path with enough bandwidth for the required load. If you’re just printing the results this is probably fine, but an obvious enhancement is to return these failures and handle them in some other way.

**We need to iterate a bunch**

For RSVP loads you will want to run a number of iterations where the order in which the flows are applied is randomized each time. Then you can keep the worst case (or p50, or p95, etc.) for each link. This mimics the effects of signalling order: LSPs in the network will continuously be coming up in different orders, and in many networks the paths will depend on the order in which LSPs come up, resulting in different bin packing arrangements and different loads on the links.
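A sketch of that bookkeeping, assuming you record each link's utilization after every iteration (the names and the nearest-rank percentile method are my own choices):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of samples, p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]

def summarize(link_samples, p=95):
    """Reduce {link: [utilization per iteration]} to the worst case and
    the p-th percentile per link."""
    return dict((link, {'max': max(s), 'p%d' % p: percentile(s, p)})
                for link, s in link_samples.items())
```

After each iteration you would append every edge's `total_utilization` to `link_samples[(u, v, k)]`, reset the graph's utilization attributes, and reshuffle the flows.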

Of course you don’t just want to simulate different signalling order. You’ll want to understand what happens in failures. This requires simulating every combination of link and node failure in the network before applying the loads and recording the results. Naturally you’ll want to bound this by the cases you are designing for (e.g. if you’re planning to handle only one link or node failure at a time then limit yourself to those combinations).
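Enumerating those failure combinations is straightforward with itertools; here is a sketch with the bound expressed as a maximum number of simultaneous failures (the names are mine):

```python
from itertools import combinations

def failure_cases(elements, max_simultaneous=1):
    """Yield every combination of up to max_simultaneous failed elements
    (links and/or nodes) to remove from the graph before applying loads."""
    for k in range(1, max_simultaneous + 1):
        for combo in combinations(elements, k):
            yield combo
```

For each case you would copy the graph, remove the failed elements, apply the flows, and record the per-link results.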

**Not bad, but… slooow.**

On my laptop with a fairly large graph it takes around three minutes to run one iteration. Not a big deal, but as explained above what we really want to do is run many iterations. Running each in parallel over multiple cores or machines is an obvious approach, and necessary at a certain size, but there is one glaringly obvious optimization that can be made here.

$ python -m cProfile -s time ghetto_cariden.py traffic_p95_in_hour_0.csv | head

         345696512 function calls (345695336 primitive calls) in 248.321 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    19150   63.244    0.003  114.593    0.006 weighted.py:343(dijkstra_predecessor_and_distance)
 26623535   60.456    0.000   74.667    0.000 multidigraph.py:174(add_edge)
    18122   26.354    0.001  238.189    0.013 ghetto_cariden.py:138(apply_rsvp_flow)
 26675584   18.585    0.000   26.259    0.000 multidigraph.py:329(edges_iter)
 51838019   15.157    0.000   17.957    0.000 weighted.py:388(<genexpr>)

Inside `apply_rsvp_flow()` we want to prune links from the graph that don’t meet our constraints. The initial approach is quick and lazy, creating a new graph by iterating over the edges and adding only those that meet the constraints.

for u, v, k, d in G.edges_iter(keys=True, data=True):
    if d['rsvp_utilization'] + load <= d['rsvp_bandwidth']:
        H.add_edge(u, v, key=k, attr_dict=d)

A lot of time is going into those add_edge() and edges_iter() calls. It would be better if we could make the shortest path function ignore links that don’t meet our constraints, in place, as it runs. To do that, let’s copy the established and probably well-debugged functions from the NetworkX library and modify them.

First copy `nx.all_shortest_paths()` and change the call to `nx.dijkstra_predecessor_and_distance()` to point to our own version.

def constrained_all_shortest_paths(G, source, target, weight=None, load=0):
    if weight is not None:
        pred, dist = constrained_dijkstra_predecessor_and_distance(
            G, source, weight=weight, load=load)
    # ...

Then copy `nx.dijkstra_predecessor_and_distance()`. We will want to modify the part shown below where it selects the neighbors and minimum weights.

# ...
if G.is_multigraph():
    edata = []
    for w, keydata in G[v].items():
        minweight = min((dd.get(weight, 1) for k, dd in keydata.items()))
        edata.append((w, {weight: minweight}))
else:
    edata = iter(G[v].items())
# ...

Our modification below simply selects only those links that meet our constraint. I’ll spare you a full explanation of how the algorithm works, since you can read that on Wikipedia.

# ...
if G.is_multigraph():
    edata = []
    for w, keydata in G[v].items():
        minweight = float("inf")
        for k, dd in keydata.items():
            if dd['rsvp_utilization'] + load <= dd['rsvp_bandwidth'] and dd['weight'] < minweight:
                minweight = dd['weight']
        if not isinf(minweight):
            edata.append((w, {'weight': minweight}))
else:
    edata = [(w, dd) for (w, dd) in G[v].items()
             if dd['rsvp_utilization'] + load <= dd['rsvp_bandwidth']]
# ...

And then modify our `apply_rsvp_flow()` function. Notice that we need to check the bandwidth constraint again here when selecting the keys. We didn’t have to do this in the previous version, because those links were pruned from the copy of the graph we were working from. A further enhancement might be a constrained shortest path function that returns a list of edges in the path rather than nodes.

def apply_rsvp_flow_new(G, source, destination, load):
    try:
        paths = list(constrained_all_shortest_paths(G, source, destination,
                                                    weight='weight', load=load))
        p = paths[randrange(len(paths))]
        u = p[0]
        for v in p[1:]:
            min_weight = min(d['weight'] for d in G[u][v].values())
            keys = [k for k, d in G[u][v].items()
                    if d['weight'] == min_weight and
                       d['rsvp_utilization'] + load <= d['rsvp_bandwidth']]
            k = keys[randrange(len(keys))]
            G[u][v][k]['total_utilization'] += load
            G[u][v][k]['rsvp_utilization'] += load
            u = v
    except (KeyError, nx.NetworkXNoPath):
        print "Error, no path for %s to %s with load %.2f" % (source, destination, load)

Let’s profile our code again. Nice, this was a huge optimization.

$ python -m cProfile -s time ghetto_cariden.py traffic_p95_in_hour_0.csv | head -15

         115203207 function calls (115202031 primitive calls) in 101.789 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    18122   70.399    0.004   92.432    0.005 ghetto_cariden.py:169(constrained_dijkstra_predecessor_and_distance)
 29025130    7.317    0.000    7.317    0.000 {method 'items' of 'dict' objects}
  5980233    5.565    0.000    5.565    0.000 {_heapq.heappop}
     1028    3.165    0.003    5.852    0.006 weighted.py:343(dijkstra_predecessor_and_distance)
  5980233    3.080    0.000    3.080    0.000 {_heapq.heappush}