So you can’t afford Cariden

In this post I’ll walk through how to use Python to model an IP/MPLS network. Once you get familiar with this kind of modeling you may find it useful for all kinds of things. You’ll need NetworkX installed.

The graph

The basis for our model is a graph representing our network. If you don’t already have this handy in some static form, a good source is the traffic engineering database (TED) from a router. The TED describes all links, with their weights, bandwidths, admin-groups, and SRLGs. The drawback of using the TED is that you only get the view of your network in the RSVP-TE domain; links and routers not running RSVP-TE will not be included. For other methods of generating a graph, look at parsing the output of the IGP database, fetching LLDP output from all routers, or simply pairing up static IP addresses from configuration files, then joining that with bandwidth and other configuration information. If you can use it, the TED is the quickest way to get everything you need.

You can gather the TED any number of ways. Here’s one method based on a pattern that I often use when I need to get data for ad hoc investigations. I call this ghetto netconf.

jlogin -c "show ted database extensive | display xml" <router> | sed -n "/<rpc-reply/,/<\/rpc-reply>/p" > ted.xml

This fetches the TED from a Juniper router in XML format, which makes for slightly easier processing. We can then process this XML to extract only the parts we care about and store them in something that is easier to work with. (Over time you might build a library of XSLT transformations for different outputs, which will be useful for other automation.)

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:t="">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
    <xsl:apply-templates select="//t:ted-link"/>
</xsl:template>

<xsl:template match="t:ted-link">
    <xsl:value-of select="../t:ted-database-id"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-to"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-local-address"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-remote-address"/><xsl:text>,</xsl:text>
    <xsl:value-of select="t:ted-link-metric"/><xsl:text>,</xsl:text>
    <!-- Convert static BW to Mbps. There's got to be a more elegant method to convert magnitudes than this. -->
    <xsl:choose>
        <xsl:when test="contains(t:ted-link-static-bandwidth, 'Gbps')"><xsl:value-of select="number(substring-before(t:ted-link-static-bandwidth,'Gbps')) * 1000"/></xsl:when>
        <xsl:when test="contains(t:ted-link-static-bandwidth, 'Mbps')"><xsl:value-of select="number(substring-before(t:ted-link-static-bandwidth,'Mbps'))"/></xsl:when>
        <xsl:when test="contains(t:ted-link-static-bandwidth, 'Kbps')"><xsl:value-of select="number(substring-before(t:ted-link-static-bandwidth,'Kbps')) div 1000"/></xsl:when>
        <!-- Unrecognized format. Should really raise an exception or error here instead. -->
        <xsl:otherwise><xsl:value-of select="t:ted-link-static-bandwidth"/></xsl:otherwise>
    </xsl:choose>
    <xsl:text>,</xsl:text>
    <!-- Convert reservable BW to Mbps, the same way. -->
    <xsl:choose>
        <xsl:when test="contains(t:ted-link-reservable-bandwidth, 'Gbps')"><xsl:value-of select="number(substring-before(t:ted-link-reservable-bandwidth,'Gbps')) * 1000"/></xsl:when>
        <xsl:when test="contains(t:ted-link-reservable-bandwidth, 'Mbps')"><xsl:value-of select="number(substring-before(t:ted-link-reservable-bandwidth,'Mbps'))"/></xsl:when>
        <xsl:when test="contains(t:ted-link-reservable-bandwidth, 'Kbps')"><xsl:value-of select="number(substring-before(t:ted-link-reservable-bandwidth,'Kbps')) div 1000"/></xsl:when>
        <!-- Unrecognized format. Should really raise an exception or error here instead. -->
        <xsl:otherwise><xsl:value-of select="t:ted-link-reservable-bandwidth"/></xsl:otherwise>
    </xsl:choose>
    <xsl:text>,</xsl:text>
    <!-- Push the admin-groups into a list separated by |. -->
    <xsl:for-each select="t:admin-groups/t:admin-group-name">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()"><xsl:text>|</xsl:text></xsl:if>
    </xsl:for-each>
    <xsl:text>,</xsl:text>
    <!-- Push the srlgs into a list separated by |. -->
    <xsl:for-each select="t:ted-link-srlg/t:srlg-name">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()"><xsl:text>|</xsl:text></xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:stylesheet>

Throw the above XSLT into a file then transform the XML to CSV with xsltproc. You could of course do this inside Python and work directly from the XML source if you wanted, but in the real world you may get the graph from different sources.

xsltproc dirty_xslt ted.xml > ted.csv

The node IDs in the TED output are slightly obfuscated. They’re a contraction of the hostname, a number indicating whether this node is a router or a pseudonode, and the router_id. We don’t have to concern ourselves with pseudonodes, and hopefully you don’t either; if you do, you might need to parse the TED slightly differently to account for the added complexity of modeling pseudonodes. Since we don’t have them, I’ll just strip this cruft out of our TED.

cat ted.csv | sed -r 's/\.[0-9]+\(([0-9]+\.){3}[0-9]+\)//g' > ted_clean.csv

Depending on the size of your network you may need to take multiple snapshots of the TED over time and merge them to account for different links being down. Remember to key by hostname and link key, take the maximum of the bandwidths, and take the minimum of the weights. If you don’t, you may end up with duplicate links, or with the wrong bandwidths and weights.
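As a sketch of that merge, assuming the CSV column layout produced by the XSLT above (weight in column 5, static and reservable bandwidth in columns 6 and 7) and nothing but the stdlib:

```python
import csv

def merge_teds(filenames):
    # Merge several TED snapshot CSVs, keyed by the first four columns
    # (node, neighbor, local address, remote address).
    merged = {}
    for filename in filenames:
        with open(filename) as f:
            for row in csv.reader(f):
                key = tuple(row[0:4])
                if key in merged:
                    old = merged[key]
                    row[4] = str(min(int(old[4]), int(row[4])))      # minimum weight
                    row[5] = str(max(float(old[5]), float(row[5])))  # maximum static BW
                    row[6] = str(max(float(old[6]), float(row[6])))  # maximum reservable BW
                merged[key] = row
    return list(merged.values())
```

Write the merged rows back out with csv.writer and feed that to the loader below.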

To Python

However you get your list of edges, once you have it you can load it into a NetworkX graph and start to model the traffic data against it. Our graph allows multiple edges between each pair of nodes, and these edges are directed because we may have different attributes (e.g. weight and utilization) in each direction.

import csv
import networkx as nx

def load_ted(tedfile):
    G = nx.MultiDiGraph()
    with open(tedfile, 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            # Key each edge by (node, neighbor, local address, remote
            # address) so parallel links remain distinct.
            G.add_edge(row[0], row[1], key=tuple(row[0:4]),
                       weight=int(row[4]),
                       static_bandwidth=int(row[5]),
                       rsvp_bandwidth=float(row[6]),
                       rsvp_utilization=0.0,
                       total_utilization=0.0)
    return G

The basic process now is to read the traffic data and for each flow, find the correct path in the graph, then apply the load to the edges in that path. We’ll need a couple of functions to help us do this.

ECMP loads

One type of load we need to model is ECMP traffic. These are flows that follow the shortest path on the network and in the case of multiple equal cost paths they will be load balanced across those paths. This means we need to divide the load by the number of equal cost paths to the destination, then walk each path and at each hop we need to divide the load again by the number of equal cost links to the next hop.

def apply_ecmp_flow(G, source, destination, load):
    try:
        paths = list(nx.all_shortest_paths(G, source, destination, weight='weight'))
        num_ecmp_paths = len(paths)
        for p in paths:
            for u, v in zip(p[:-1], p[1:]):
                # Find all equal cost links between u and v.
                min_weight = min(d['weight'] for d in G[u][v].values())
                keys = [k for k, d in G[u][v].items() if d['weight'] == min_weight]
                num_ecmp_links = len(keys)
                for k in keys:
                    G[u][v][k]['total_utilization'] += load / num_ecmp_paths / num_ecmp_links
    except (KeyError, nx.NetworkXNoPath):
        print "Error, no path for %s to %s in apply_ecmp_flow()" % (source, destination)

The call to nx.all_shortest_paths() returns a list of paths, where each path is itself a list of hops. We iterate over the paths, and for each path we iterate over the hops with the current node in u and the next hop in v. For each next hop we must determine which links to apply the load to; there may be one or several. First we find the minimum weight of all links between u and v, then we select all links with that minimum weight and distribute the load across them.

The number of equal cost paths supported varies depending on the router, so an enhancement here would be to specify the maximum number of paths either as input to the function, or else as a node attribute in the model to permit different settings for each router. We could then select only the supported number of paths and links.
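A minimal sketch of that enhancement follows. The max_ecmp node attribute name and the default of 16 are assumptions, and real routers hash over a selected subset of paths rather than simply taking the first ones:

```python
DEFAULT_MAX_ECMP = 16  # assumed platform default

def select_ecmp_paths(paths, node_attrs=None):
    # Cap the equal cost paths at the router's supported limit, read from
    # a hypothetical 'max_ecmp' node attribute when one is present.
    limit = (node_attrs or {}).get('max_ecmp', DEFAULT_MAX_ECMP)
    return paths[:limit]
```

apply_ecmp_flow() would call this on the result of nx.all_shortest_paths() before distributing the load.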

RSVP loads

Another kind of load we need to model is flows on RSVP signaled LSPs. For example, these might be LSPs configured to run auto-bandwidth, where the bandwidth constraint changes in response to the load. In this case we need to apply the load across a path that meets the bandwidth constraint. A simple method for doing this is to prune the links that do not meet the bandwidth constraint before running the shortest path algorithm.

from random import randrange

def apply_rsvp_flow(G, source, destination, load):
    # Produce a new graph with links not meeting the reservation
    # requirement pruned.
    H = nx.MultiDiGraph()
    for u, v, k, d in G.edges_iter(keys=True, data=True):
        if d['rsvp_utilization'] + load <= d['rsvp_bandwidth']:
            H.add_edge(u, v, key=k, attr_dict=d)
    try:
        # Select one random path from all shortest paths.
        paths = list(nx.all_shortest_paths(H, source, destination, weight='weight'))
        p = paths[randrange(len(paths))]
        for u, v in zip(p[:-1], p[1:]):
            # Select one random link among the minimum weight links.
            min_weight = min(d['weight'] for d in H[u][v].values())
            keys = [k for k, d in H[u][v].items() if d['weight'] == min_weight]
            k = keys[randrange(len(keys))]
            G[u][v][k]['total_utilization'] += load
            G[u][v][k]['rsvp_utilization'] += load
    except (KeyError, nx.NetworkXNoPath):
        print "Error, no path for %s to %s with load %.2f" % (source, destination, load)

Notice that we add the load to both rsvp_utilization and total_utilization. The first attribute is used to track the RSVP reservations for our constrained shortest path. The second tracks this load and the ECMP load combined.

Another difference here is that we are not doing any ECMP load balancing of the traffic. The load is applied to only one path, and to one link on each hop. When more than one path or link exists we select one at random. This mimics the CSPF random tie-breaking behavior often used by default on routers. Other tie-breaking behaviors exist, including least-fill, which selects the path with the lowest utilization, and most-fill, which selects the path with the highest utilization. These could be added as enhancements.
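Least-fill tie breaking, for instance, could be sketched like this, choosing among the (key, data) pairs for a hop instead of calling randrange(). The attribute names match the model above, but how you would wire this into apply_rsvp_flow() is left open:

```python
def least_fill_key(candidates):
    # candidates is a list of (key, data) pairs for the equal cost links
    # between a pair of nodes. Pick the link with the lowest proportion
    # of its reservable bandwidth already used.
    def fill(item):
        k, d = item
        return d['rsvp_utilization'] / d['rsvp_bandwidth']
    return min(candidates, key=fill)[0]
```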

Admin-groups (aka link colors) are another obvious constraint that would be easy to add. You could then associate your flow data with the admin-groups of the configured LSPs and pass these as constraints when applying the flow to the graph.
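A sketch of that check, in the include-any style. The admin_groups attribute name (populated from the | separated list in our CSV) is an assumption:

```python
def admin_groups_ok(edge_data, include_any):
    # A link is usable when it carries at least one of the admin-groups
    # the LSP asks for; an empty constraint matches every link.
    if not include_any:
        return True
    return bool(set(edge_data.get('admin_groups', [])) & set(include_any))
```

The pruning loop in apply_rsvp_flow() would then skip any edge for which this returns False.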

Hey, presto!

Now load your traffic data and iterate through applying the flows.

    # ... load traffic data ...

    for source, destination, load in ecmp_flows:
        apply_ecmp_flow(G, source, destination, load)
    # Simulate random signalling order
    shuffle(rsvp_flows)  # from random import shuffle
    for source, destination, load in rsvp_flows:
        apply_rsvp_flow(G, source, destination, load)

At the end your graph will have the modeled load on each of its links. Simply iterate through and output these in a report.

    for u, v, k, d in G.edges_iter(keys=True, data=True):
        print "%s,%s,%s,%s,%d,%.2f,%.2f,%.2f" % (
            u, v, k[2], k[3], d['static_bandwidth'], d['rsvp_utilization'],
            d['total_utilization'], d['total_utilization'] / d['static_bandwidth'])

A limitation of our apply_rsvp_flow() function is that it simply outputs an error when it can’t find a path with enough bandwidth for the required load. If you’re just printing the results this is probably fine, but an obvious enhancement is to return these results and handle them in some other way.
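One sketch of that enhancement: have the apply function signal success and collect the flows that could not be placed. apply_one here stands in for a version of apply_rsvp_flow() modified to return True on success; that return value is an assumption, not something the function above does today:

```python
def apply_flows_collecting_failures(G, flows, apply_one):
    # Return the flows that could not be placed, so the caller can
    # report or retry them instead of just printing an error.
    unplaced = []
    for source, destination, load in flows:
        if not apply_one(G, source, destination, load):
            unplaced.append((source, destination, load))
    return unplaced
```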

We need to iterate a bunch

For RSVP loads you will want to do a number of iterations, randomizing the order in which the flows are applied each time, then keep the worst case (or p50, or p95, etc.) for each link. This mimics the effects of signalling order: LSPs in the network will continuously be coming up in different orders, and in many networks the paths depend on the order in which LSPs come up, resulting in different bin packing arrangements and different loads on the links.
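The bookkeeping for those iterations might look like this sketch, where simulate_once is a hypothetical wrapper that resets the graph, applies all the flows in the given order, and returns a dict mapping link key to utilization:

```python
from random import shuffle

def worst_case_utilization(simulate_once, rsvp_flows, iterations=100):
    # Randomize signalling order each iteration and keep the worst
    # utilization seen for each link across all iterations.
    worst = {}
    for _ in range(iterations):
        shuffle(rsvp_flows)
        for link, util in simulate_once(rsvp_flows).items():
            worst[link] = max(worst.get(link, 0.0), util)
    return worst
```

Swap max() for a percentile over the collected samples if you want p50 or p95 instead of worst case.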

Of course you don’t just want to simulate different signalling order. You’ll want to understand what happens in failures. This requires simulating every combination of link and node failure in the network before applying the loads and recording the results. Naturally you’ll want to bound this by the cases you are designing for (e.g. if you’re planning to handle only one link or node failure at a time then limit yourself to those combinations).
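Enumerating the single-failure cases can be sketched from the edge list alone, stdlib only. Each case yields the surviving edges, from which you would rebuild the graph before applying the loads:

```python
def single_failure_cases(edges):
    # edges is a list of (u, v, key) tuples. Yield every single link
    # failure, then every single node failure (a node failure removes
    # all edges touching that node).
    nodes = set()
    for u, v, k in edges:
        nodes.update((u, v))
    for failed in edges:
        yield ('link', failed, [e for e in edges if e != failed])
    for n in sorted(nodes):
        yield ('node', n, [e for e in edges if n not in (e[0], e[1])])
```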

Not bad, but… slooow.

On my laptop with a fairly large graph it takes around three minutes to run one iteration. Not a big deal, but as explained above what we really want to do is run many iterations. Running each in parallel over multiple cores or machines is an obvious approach, and necessary at a certain size, but there is one glaringly obvious optimization that can be made here.

$ python -m cProfile -s time traffic_p95_in_hour_0.csv | head
         345696512 function calls (345695336 primitive calls) in 248.321 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    19150   63.244    0.003  114.593    0.006
 26623535   60.456    0.000   74.667    0.000
    18122   26.354    0.001  238.189    0.013
 26675584   18.585    0.000   26.259    0.000
 51838019   15.157    0.000   17.957    0.000<genexpr>)

Inside apply_rsvp_flow() we want to prune links from the graph that don’t meet our constraints. The initial approach is quick and lazy, creating a new graph by iterating over the edges and adding only those that do meet the constraint.

        for u, v, k, d in G.edges_iter(keys=True, data=True):
            if d['rsvp_utilization'] + load <= d['rsvp_bandwidth']:
                H.add_edge(u, v, key=k, attr_dict=d)

This is showing up as a lot of add_edge() and edges_iter() calls adding quite a bit of time. It would be better if we could modify the shortest path function to ignore links that don’t meet our constraints, in place, as it runs. To do that let’s just copy the established and probably well debugged functions from the NetworkX library and modify them.

First copy nx.all_shortest_paths() and change the call to nx.dijkstra_predecessor_and_distance() to point to our own version.

def constrained_all_shortest_paths(G, source, target, weight=None, load=0):
    if weight is not None:
        pred, dist = constrained_dijkstra_predecessor_and_distance(
            G, source, weight=weight, load=load)
    # ...

Then copy nx.dijkstra_predecessor_and_distance(). We will want to modify the part shown below where it selects the neighbors and minimum weights.

        # ...
        if G.is_multigraph():
            edata = []
            for w, keydata in G[v].items():
                minweight = min((dd.get(weight, 1)
                                 for k, dd in keydata.items()))
                edata.append((w, {weight: minweight}))
        else:
            edata = iter(G[v].items())
        # ...

Our modification below simply selects only those links that meet our constraint. I’ll save you a whole explanation about how the algorithm works since you can read that on Wikipedia.

        # ... (requires: from math import isinf)
        if G.is_multigraph():
            edata = []
            for w, keydata in G[v].items():
                minweight = float("inf")
                for k, dd in keydata.items():
                    if dd['rsvp_utilization'] + load <= dd['rsvp_bandwidth'] and dd['weight'] < minweight:
                        minweight = dd['weight']
                if not isinf(minweight):
                    edata.append((w, {'weight': minweight}))
        else:
            edata = [(w, dd) for (w, dd) in G[v].items()
                     if dd['rsvp_utilization'] + load <= dd['rsvp_bandwidth']]
        # ...

And then modify our apply_rsvp_flow() function. Notice that we need to check the bandwidth constraint again here when selecting the keys. We didn’t have to do this in the previous version because those links were pruned from the copy of the graph we were working from. A further enhancement might be to create a constrained shortest path function that returns a list of edges in the path rather than nodes.

def apply_rsvp_flow_new(G, source, destination, load):
    try:
        paths = list(constrained_all_shortest_paths(G, source, destination,
                                                    weight='weight', load=load))
        p = paths[randrange(len(paths))]
        for u, v in zip(p[:-1], p[1:]):
            # Consider only links that meet the constraint, then pick at
            # random among those with the minimum weight.
            candidates = [(k, d) for k, d in G[u][v].items()
                          if d['rsvp_utilization'] + load <= d['rsvp_bandwidth']]
            min_weight = min(d['weight'] for k, d in candidates)
            keys = [k for k, d in candidates if d['weight'] == min_weight]
            k = keys[randrange(len(keys))]
            G[u][v][k]['total_utilization'] += load
            G[u][v][k]['rsvp_utilization'] += load
    except (KeyError, nx.NetworkXNoPath):
        print "Error, no path for %s to %s with load %.2f" % (source, destination, load)

Let’s profile our code again. Nice, this was a huge optimization.

$ python -m cProfile -s time traffic_p95_in_hour_0.csv | head -15
         115203207 function calls (115202031 primitive calls) in 101.789 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    18122   70.399    0.004   92.432    0.005
 29025130    7.317    0.000    7.317    0.000 {method 'items' of 'dict' objects}
  5980233    5.565    0.000    5.565    0.000 {_heapq.heappop}
     1028    3.165    0.003    5.852    0.006
  5980233    3.080    0.000    3.080    0.000 {_heapq.heappush}

An Uber support experience

This post was prompted by the automated support feedback email Uber sent me.

I recently opened the Uber app to find it had forgotten my authentication and was prompting me for a password. Unfortunately my Uber credentials were not saved in my password safe, so it was time for me to press “forgot password” and get a reset link via email. This should have me back into Uber in less than 5 minutes… but it turned out to be more like two weeks.

After a few days of trying to reset my password via the website and the app and not receiving any emails, I looked for an Uber phone number I could call for support, but I couldn’t find one. I sent @Uber a tweet asking for a phone number. No response. It turns out Uber doesn’t want to deal with people on the phone and got rid of their support number. I Googled, found others complaining, and found an email address that way.


You may wonder: why not just create a new account if I’d been waiting so long? Uber won’t let you create a new account with the same phone number, for obvious reasons. So you are at the mercy of their support team when their password reset system fails and you want to use Uber. Even if you could create another account, it wouldn’t be prudent to leave unused accounts with credit card and personal details lying around.

As a side note, I do wonder what happens when numbers get recycled by carriers, now that phone numbers have become increasingly tied to our identity online and used for dual factor authentication by many services.

So on Aug 26 at 6:51pm I sent an email that I thought was pretty clear.


I’ve forgotten my password and I’ve tried getting a reset maybe a dozen times in the past few days but still the reset email hasn’t come through. Now I’m signed out for some reason on the app and I can’t log in.

My email address I signed up with is [omitted]

I’ve checked spam folders but nothing there either.


Aug 27 at 7:49am

Uber gets back to me in just over 12 hours, not bad really.

Happy to help, Kris.

You can reset your Uber account password by entering your email address here.

It’s possible that you signed up with a different email address than you remember, so be sure to try entering each of those addresses to receive a reset password link.

The link will expire after a certain amount of time, so if it’s not working, just fill out the form again and a another email will be sent to you with a new link.

If you need anymore help with this, please let me know.

[M] at Uber

Aug 27 at 9:40am

Okay. Slightly annoying response; I thought I was very clear. Let me try again.

I’ve entered my email address several times and it hasn’t sent me the link.

I am pretty sure [omitted] is the email address I signed up with as that is the one shown and saved in the app. But if you don’t think it is it should be easy for you to look and confirm shouldn’t it? If that account exists can you please make sure an email reset link is sent out?

Aug 28 at 7:45am

Apparently no, this is not clear. I’m told to reset the password for a third time.

Hi Kris,

This is [R] stepping in for [M].

I had a look at your account and see that your email address is correctly mentioned in your Uber account. Try to enter the email in the link and you would be able to reset the password.

If you got any error message request you to delete the app and reinstall the app

If you haven’t already deleted the app and reinstalled it directly from, this would be a good first step. This ensures you’re using the latest version available and resolves any known issues.

If you continue to have trouble, please share the details and any troubleshooting steps you’ve already tried so that I can lend a hand and get you riding again.

[R] at Uber

Aug 28 at 9:14am

At this point I’m getting a little annoyed.

As explained multiple times. I have tried the reset password link. It tells me that an email has been sent but no email ever arrives. I do receive test emails from other people to that address fine. So possibly it isn’t even leaving your system. Does this make sense? I don’t want to be told to try the reset link again.

Here is a picture of your site telling me it has sent the email that doesn’t arrive: [Screenshot omitted for brevity.]

Aug 29 at 1:54am

Hi Kris,

Sorry to hear you’re still having trouble, Kris.

Happy to get this resolved together. You won’t be able to receive the password reset link. Kindly go through the below link.

Remember that the forgot password link may expire if you’ve waited too long, so if it’s not working just go to again and fill it out again. Make sure you’re typing your email address correctly and to click on the newest link when you get another email.

Let me know if this doesn’t do the trick and I’ll be happy to keep troubleshooting.

[R] at Uber

Aug 29 at 8:12am

Seriously? Now I’m annoyed. I’ve been told three times to reset the password when I’ve clearly explained I’ve done this and that it’s not being sent.


It seems you either can’t understand what I am saying or are messing with me. Can you please put me in touch with someone who is technical?

To explain again: for the past week I have been trying to get a reset link emailed to me from Uber using the link you have provided multiple times and through the app itself. Despite now having done it more than a dozen times no email has arrived. And no emails are in my junk folder from uber.

It seems your system is not sending the emails. Or Google Apps is silently dropping the emails. I have asked friends to send test emails to the same address ([omitted]) and they have all arrived. So it suggests the problem is with your system not sending them.

It should be easy to verify in the logs that I have made these requests. And see in the logs if the email is actually being generated by the system and leaving your mail server.

Can you put me in touch with someone technical who can help?


Aug 29 at 8:14am

Following up with some more screenshots so they can see (again) that I’m not some idiot who can’t find the forgot password link.

Here are two photos showing again that I indeed requested the password. Just as I showed yesterday. Now please put me in touch with someone who can help. [Screenshots omitted for brevity.]

Aug 31

After more than a week of being locked out, and with no update from Uber in two days, I decided to try guessing my password again. After a few tries I had success using an old password algorithm (something I used before fully randomized passwords). I’m not sure how many times you can try before it blocks you, if at all… but pleased to be back in, I followed up by doing some testing, changing my address to a few other aliases, some of them variations on the original. Each time I would request a password reset.

On changing the address from the original to something new Uber would immediately send an email requesting I verify the new address (not that it seemed to matter that I did verify it). Then on changing the address from something new to something else, Uber would send both a notification that my address had changed to the previous address, and the email requesting verification to the new address. When I would request password resets the emails would come through immediately. But on changing back to the original email address I would not receive the email requesting verification, and I would not get a password reset email. I never got any notification of account changes to the original address.

Searching the Gmail logs I see that nothing is blocked to that address or from Uber; everything is delivered. Not having more raw access, I don’t know if something under the hood in Gmail is potentially dropping this, but it seems suspicious that I receive emails to that address from other sources, just not from Uber.

Aug 31 at 9:14am

I let Uber know the results of this:

I was able to guess my password and I’ve done a bunch of testing this morning. The results show that if I change my email to anything else I can get the verification email for the change and can then get a password reset link emailed. But if I change it back to [omitted] I neither get the verification email nor can get a password reset link email. So it’s something about this address that I think your system is blocking.

Any luck on getting help? I notice there’s a lot of complaints on the Internet from people about this. It’s very frustrating.

Sep 2 at 10:23am

Hi Kris,

Stepping in for [R]. My name is [S] and I’m an Operations Manager at Uber.

The issue is that the email address you are using ([omitted]) is not the email address on your account. We do not have that email address in our system. Can you think of another email address you would have used? If so, please copy that email address on this thread and write in providing permission to make changes to your account. Unfortunately we are not able to disclose any account information to anyone but the account holder.

If you don’t recall which email you used, please send me all of the following and I’ll be able to help you out:

The mobile number on your account
The first and last name on your account
One of the following:
A picture of the last four digits of your credit card or payment source that is on the account (please cover up all but the last four digits!) OR a description of your last two Uber trips (i.e. dates, start/stop locations.)

I understand this may seem like a lot but I promise it’s to ensure the privacy of all Uber account holders, including your own information.


[S] at Uber

Sep 2 at 11:10am


No that’s not the issue. You need to read the thread. It was [omitted]@ before and this was confirmed by your staff. As explained in my last update I changed it as part of the testing that shows when it’s [omitted]@ I don’t receive the message from you. As I asked: Is there someone in technical support who can review this?

Sep 2 at 12:14pm

Hi Kris,

I am your best contact for troubleshooting this issue.

What did you change the email address to?


[S] at Uber

Sep 2 at 12:27pm

I changed it to [omitted-variation]

I’ve just changed it back to [omitted] now

Sep 7 at 8:54am

Five days later…

Hi Kris,

I believe I have now rectified the issue. Can you please try the password reset link again and let me know if you still do not receive the email?


Thank you,

[S] at Uber

Finally, success

When asked what the cause was, the only response Uber would give is that my [omitted] address was blocked. Seeing as other people have complained about this, I wonder what in their mail pipeline is doing it and why it is not a known troubleshooting issue that can be resolved more quickly. If the mail is generated by their system, from Uber, to a customer’s email address, what possible reason would you have to block it, unless you don’t trust your own system not to spam people (or something absurd like that)?

Appendix bug


When editing my email address in the profile page of the app, it comes up with an error that the invite code, prepopulated by Uber, is incorrect. But it successfully saves the email address changes. Given the code was put there by the app, this error shouldn’t exist. And when I delete it and hit save again I still get the same error.


This generated some discussion on Hacker News that may be of interest if you made it this far. Check that out over here.

Stateless SNMP for fast, distributed polling

We’re doing SNMP polling wrong. This old idea resurfaced during a recent conversation about monitoring. Okay, before anyone else calls it out: it’s everything about SNMP in general that’s wrong. We’ve known for a long time that we should be streaming data from devices rather than polling for it, and vendors are now starting to get this and give us code for it (e.g. Juniper’s J-vision).

So if we accept that devices should just stream the data to collectors without us needing to continuously elicit it from them, then when we look at SNMP polling a simple improvement will be obvious to most: make it stateless by decoupling the request from the response. For polling, once the SNMP get is sent there is no need for the SNMP client to wait for the response and time out or report errors. Instead we create an SNMP collector that can receive SNMP responses, decode the ASN.1, and log that to our database. Remember that the SNMP response carries all of the OIDs and data together; no extra context is needed to decode it.

These SNMP collectors can clearly be distributed. The SNMP poller must still exist to elicit the responses, but that too can be managed separately. Optionally the poller can include the intelligence to direct responses to collectors by simply spoofing the source address on the SNMP get. This can be used for geographic distribution or load balancing, or, since it’s UDP, we might as well anycast the SNMP collector.

The discovery mechanism by which the objects you want to monitor are provided to the SNMP poller is again a separate function that stores the device, object, and interval configuration in the database.

You can demonstrate this concept easily enough using Scapy. Below we generate an SNMP get from kjp1 to kjp2, with the spoofed source address of kjp3.

kjp@kjp1:~# sudo scapy
Welcome to Scapy (2.2.0)
>>> r = IP(dst='',src='')/UDP(dport=161,sport=RandShort())/SNMP(version = ASN1_INTEGER(0L),  community = ASN1_STRING('public'), PDU = SNMPget(id=ASN1_INTEGER(123L), error=ASN1_INTEGER(0L), error_index=ASN1_INTEGER(0L), varbindlist = [ SNMPvarbind(oid=ASN1_OID(""),value=ASN1_NULL(0L)) ]))
>>> r
<IP  frag=0 proto=udp src= dst= |<UDP  sport=<RandShort> dport=snmp |<SNMP  version=<ASN1_INTEGER[0L]> community=<ASN1_STRING['public']> PDU=<SNMPget  id=<ASN1_INTEGER[123L]> error=<ASN1_INTEGER[0L]> error_index=<ASN1_INTEGER[0L]> varbindlist=[<SNMPvarbind  oid=<ASN1_OID['.']> value=<ASN1_NULL[0L]> |>] |> |>>>
>>> send(r)
Sent 1 packets.

We see the SNMP get and response in tcpdump on kjp2.

kjp@kjp2:~# sudo tcpdump -nnnvvvi eth0 udp and port 161
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:15:15.733263 IP (tos 0x0, ttl 46, id 1, offset 0, flags [none], proto UDP (17), length 68) > [udp sum ok]  { SNMPv1 { GetRequest(25) R=123  . } }
19:15:15.735145 IP (tos 0x0, ttl 64, id 36270, offset 0, flags [DF], proto UDP (17), length 147) > [bad udp cksum 0x4363 -> 0xe428!]  { SNMPv1 { GetResponse(104) R=123  ."Linux kjp2 3.19.0-15-generic #15-Ubuntu SMP Thu Apr 16 23:32:37 UTC 2015 x86_64" } }

And then the response received on kjp3.

kjp@kjp3:~# sudo scapy
Welcome to Scapy (2.2.0)
>>> sniff(filter="udp and port 161", prn=lambda x: x.show())
###[ Ethernet ]###
  dst= 00:0d:3a:30:a0:45
  src= a4:4c:11:2a:d0:7c
  type= 0x800
###[ IP ]###
     version= 4L
     ihl= 5L
     tos= 0x0
     len= 147
     id= 62631
     flags= DF
     frag= 0L
     ttl= 45
     proto= udp
     chksum= 0xbfd7
###[ UDP ]###
        sport= snmp
        dport= 1950
        len= 127
        chksum= 0x8a82
###[ SNMP ]###
           version= <ASN1_INTEGER[0L]>
           community= <ASN1_STRING['public']>
            |###[ SNMPresponse ]###
            |  id= <ASN1_INTEGER[123L]>
            |  error= <ASN1_INTEGER[0L]>
            |  error_index= <ASN1_INTEGER[0L]>
            |  \varbindlist\
            |   |###[ SNMPvarbind ]###
            |   |  oid= <ASN1_OID['.']>
            |   |  value= <ASN1_STRING['Linux kjp2 3.19.0-15-generic #15-Ubuntu SMP Thu Apr 16 23:32:37 UTC 2015 x86_64']>

To be honest this decoupling of polling and collecting functions was first proposed in a time when fast polling was more of a problem. Today with how we seem to throw machines at problems it’s less of a concern, but I do think the architecture is cleaner and better aligned to where we want to head.

(And yeah, I’m ignoring the root/raw socket issue for the split poller/collector scenario, there are solutions but they do make it uglier. ;-))

Simulating auto-bandwidth settings

I’ve been working with auto-bandwidth a lot in the past year and there have been a number of questions I’ve wanted to answer in that time. Some examples of these are:

  • What is the impact of choosing a different adjust-threshold for LSPs in this range? Is there a single adjust-threshold that would work well enough for all LSPs?
  • Can we reduce the amount of over- or under-subscription of LSPs?
  • What impact do overflow-limit and underflow-limit settings have?
  • Is there an entirely different and better algorithm for auto-bandwidth?

A lot of auto-bandwidth knobs end up being set based on a feeling, with maybe some follow up to see the result. But with a large real-world set of data we can instead take another approach: simulate the permutations of the knobs we are interested in, and then examine the results.
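To give a flavour of what "simulate the permutations" means here, a toy replay of the auto-bandwidth loop (the knob semantics are my own simplification of the vendor behavior, not a faithful reimplementation):

```python
def simulate_autobw(samples, stats_interval=60, adjust_interval=900,
                    adjust_threshold=0.15, overflow_limit=None):
    """Replay per-interval utilization samples (bytes/sec) against one
    candidate knob setting; return the reservation after each sample and
    the total number of adjustments made."""
    reservation = samples[0]
    max_seen = 0.0
    overflow_count = 0
    elapsed = 0
    adjustments = 0
    reservations = []
    for s in samples:
        elapsed += stats_interval
        max_seen = max(max_seen, s)
        # overflow-limit: N consecutive samples above reservation by threshold
        # triggers an immediate adjustment instead of waiting out the interval
        if overflow_limit and s > reservation * (1 + adjust_threshold):
            overflow_count += 1
        else:
            overflow_count = 0
        if elapsed >= adjust_interval or (overflow_limit
                                          and overflow_count >= overflow_limit):
            if abs(max_seen - reservation) > reservation * adjust_threshold:
                reservation = max_seen  # reserve to the largest sample seen
                adjustments += 1
            max_seen = 0.0
            elapsed = 0
            overflow_count = 0
        reservations.append(reservation)
    return reservations, adjustments
```

Sweep the parameters over a real sample set and you have the surfaces plotted below; for instance a step from 100 to 300 gets chased several intervals sooner with `overflow_limit=3` than without it.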

Here I’ll share some preliminary results exploring the impact of the overflow- and underflow-limit knobs. I’ve been told to stay away from these without any real explanation as to why. Apart from potential bugs, I wanted to know what impact these may have. There were two main reasons for my interest in these knobs.

We have situations where rapid increases in utilization can occur; for example it is not uncommon to have someone kick off a process and see the utilization instantaneously jump 200% or more. Simply waiting for the adjust-interval timer means the LSP may remain on a congested path for considerable time, and we want a faster reaction. One option is reducing the adjust-interval, another is applying an overflow-limit. (The third is to do something entirely different.)

An inherent aspect of auto-bandwidth’s design is that LSPs will mostly over-reserve bandwidth. This is a product of auto-bandwidth selecting the largest sample over a period as the bandwidth to reserve. This can be smoothed by using longer statistics intervals of course, but 30 second or 60 second intervals seem to be the most common. Arguably this bias towards over-reserving bandwidth is preferred due to the peaky nature of traffic; hidden in our average samples are a lot of peaks we want to be able to absorb. Still, it would be interesting to know what effect underflow-limit can have.

These results are very much a first run, based on a limited subset of the complete data set to reduce processing time, so don’t email me if you use this for anything and it’s wrong. Verification and scaling up to the full set of data and permutations is for another time, if ever warranted. Frankly, auto-bandwidth has a limited shelf-life and you’re better off spending your efforts on other things: better planning and prediction systems, centralized TE and algorithms for that, and bandwidth brokering and calendaring systems.

The plots below are pretty self-explanatory. They are arranged in a four-pack, giving a plot without any overflow- or underflow-limit setting, with one of each, and with both for comparison. This first one shows the rate of adjustments per five minutes for each combination of adjustment-threshold between 5% and 50% and adjustment-interval between 300s and 3600s. There is essentially no impact to using overflow-limit, but there is some increase in adjustment rate when using underflow-limit. (Looks like there’s a pretty optimal knee around a 900s interval and 15% threshold, doesn’t it?)


These next plots show the sum of over reserved bytes, that is for every statistics interval the difference between the reservation and the utilization is taken and summed. This is probably not the best metric to use but it served to give me a rough magnitude to look at. Note the adjustment-threshold axis is reversed in these plots. The interesting thing here is just how flat the over reservation became when using underflow-limit. It’s suspiciously almost too flat and I need to verify the results aren’t an error. Essentially it appears to roughly give you the same over reservation at all adjustment-intervals as you get at adjustment-interval of 300s, which on the face of it makes some sense.
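For the record, the metric is nothing more than the following (my own formulation of the sums described above):

```python
def reservation_error(samples, reservations, stats_interval=60):
    """Sum of over- and under-reserved bytes across a run.
    samples and reservations are rates (bytes/sec), one per statistics
    interval; multiplying the difference by the interval gives bytes."""
    over = sum((r - s) * stats_interval
               for s, r in zip(samples, reservations) if r > s)
    under = sum((s - r) * stats_interval
                for s, r in zip(samples, reservations) if s > r)
    return over, under
```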


These last plots show the sum of under reserved bytes using the same method as above. Also note again that the adjustment-threshold axis is reversed from the first set of plots. There seems to be some reduction in the amount of under reservation at the higher adjustment-intervals when using overflow-limit, but it’s not major. Using both underflow-limit and overflow-limit together results in an increase in under reservation at most settings.


No need for bandwidth constraints in New Zealand

Some time ago I made some off the cuff comments to some people about the bandwidth scarcity fraud going on in New Zealand. People were skeptical. Telco thinking is well ingrained. Ports must cost hundreds of thousands of dollars. Switches must be built to survive direct nuclear strike. So on. But this simply isn’t true today.

Amazon, Microsoft, Google, Facebook don’t build their infrastructure on the telco model. Extending the concepts used there to building broadband networks is entirely reasonable.

I wanted to pound this out in an hour or two so this post is a hastily written death-by-bullet-point. I’ve necessarily hand-waved over some of the costs and added large undisclosed multipliers to them, as they’re derived from costs I get to see and may not be the costs you’d see. I think the multipliers I’ve added put it more than comfortably in attainable territory for NZ Inc. Hopefully there are no glaring holes in my calculations below.

First let’s layout some assumptions and facts.

  • There is fiber to every corner of New Zealand[1], literally from Cape Reinga to Bluff. This fiber is sufficient to support backhaul demands. We’ll light a multi-terabit DWDM network to support this.
  • There is new fiber being deployed past every home in UFB deployment areas as part of that program. This must include more than enough fiber for any metro area demand (if it doesn’t, someone seriously fucked up).
  • Thus fiber costs are sunk. There is no benefit from not utilizing the fiber to its fullest, except to create artificial scarcity so you can maximize your profit. The physical layer costs of the fiber are long lived and should amortize over its lifespan, priced simply to recover that cost, its ongoing maintenance, plus some fair margin of ROI for the investors. We can assume this is happening today and is already being passed on to people’s bills, so this isn’t a factor in our discussion as it doesn’t change.
  • There’s 1,570,000 occupied dwellings[2] and the UFB target (IIRC) is to pass 75% of those. Thus we are targeting 1,177,500 dwellings.
  • I couldn’t find any quick statistic on the peak usage in New Zealand, Kim Dotcom reported 125 Gbps [3] which would correspond to a pathetic 100 Kbps per dwelling. Peak average utilization on other Gigabit to the home networks overseas has been reported to me as being well under 5 Mbps (seriously). The reality is that after people do the speed test, they almost never push their Gigabit connection again. But let’s pick 10 Mbps per dwelling for this exercise, this should give us some headroom. This means we need to aggregate up to 11.775 Tbps total.
  • Assume it all flows towards Auckland as that is where content is largely hosted and cables land.


  • We can exclude the fine people of Auckland from backhaul calculations, as the backhaul network is about getting traffic to/from the provinces to the content and submarine cables in Auckland. Local traffic in Auckland can be aggregated on the local metro network.
  • We need to get roughly 3 Tbps across the Cook Strait.
  • Aggregate with 3 Tbps from the lower North Island.
  • And merge with 2 Tbps from Hamilton and Tauranga.
  • Let’s double that (for redundancy purposes).
  • A back of the napkin DWDM network to support this works out to about 18.1 million USD in DWDM equipment. Let’s be extremely generous and double this and assume that will cover our installation costs too: 36.2 million USD.

Auckland metro-fabric

According to our 75% target we expect to have 350,000 dwellings on the network at 10 Mbps peak. This means we need to aggregate up to 3.5 Tbps egress in the Auckland metro. To put this in perspective you can buy a 1 RU switch today that gives you 3.2 Tbps for under $10,000 USD. Granted that’s with a small FIB, so if you need greater FIB capacity maybe you’d spend some multiple of that.

I’m going to have to make some wild ass assumptions for the access node part as I’m not up to speed with access nodes. If you’re going down this path of providing ridiculous bandwidth you’re pushing your own developments in this space. It should be entirely within reach to get an access node that delivers 1G to 100 subscribers with a 1:1 uplink ratio, supporting 10×10 Gbps uplinks for under $400 USD per subscriber line. Thus about $140 million for the access nodes.

This makes for supporting 35,000 10G downlinks to these access nodes in our metro fabric. Assuming all LR optics we could produce a supporting fabric for comfortably under $50 million USD.

Total $190 million USD / 350,000 = $542 USD per home. This has a lot of fat in it too. The fabric I actually estimated here provides 1:1 subscription ratio so is seriously over provisioned.


Extending the Auckland back-of-envelope cost to the nation gives us:

  • 1,177,500 * $542 USD per home for metro fabrics.
  • $36.2 million USD / 1,177,500 homes = $30 USD per home for backhaul.
  • So at the worst case we have a round figure of $600 USD per home. Assuming a three year lifecycle that would be $16 USD per month.
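The whole back-of-envelope in one runnable piece, using the same figures as above, so you can argue with the inputs rather than the arithmetic:

```python
homes = 1_177_500            # 75% of 1,570,000 occupied dwellings
auckland_homes = 350_000
metro_cost = 190e6           # USD: Auckland access nodes + metro fabric
backhaul_cost = 36.2e6       # USD: doubled DWDM estimate incl. installation

per_home_metro = metro_cost / auckland_homes   # ~$542 USD per home
per_home_backhaul = backhaul_cost / homes      # ~$30 USD per home
per_home_total = per_home_metro + per_home_backhaul

monthly = 600 / 36           # round figure over a three year lifecycle
```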


clogin techniques

I’ve recently had to dust off my shell and clogin skills for some projects, and because I often encounter engineers who aren’t aware of or familiar with clogin I thought I’d share some of the techniques that I frequently turn to. Learning clogin with some basic egrep, sed, awk, tr, cut, and a dash of Perl can save you hundreds of hours. Throw in some expect and you’re ready for any situation.

My pattern is to combine clogin with xargs to parallelize tasks. This pattern will absolutely freak out managers (and a lot of other people) but I value my time more than their comfort. The following may seem like cowboy techniques, but the fact is in a lot of cases the “right way” to do things isn’t available, and when in a hurry clogin can be your friend. Besides cowboys have cool hats and six-shooters.


  1. Break down the task into small parallelizable steps. When doing configuration changes you’ll want these to be increments that leave you in a safe place if it fails somewhere. You can also work on part of the fleet in phases by adjusting the input.
  2. Concoct various layers of precheck and postcheck steps to perform (to the extent that it makes sense).
  3. Use clogin to execute the steps in parallel across your list of routers.
  4. Take down whole network… no really, you do want to be comfortable that you know how this works and apply a healthy dose of paranoia. (I’ve left out a lot of the extremely paranoid precheck/postcheck behavior I implement.)

Show commands

A common activity is simply to execute a show command on a bunch of routers and process that in some way. As an example let’s just check the light levels on a bunch of interfaces.

$ mkdir switch-check; cd switch-check
$ echo switch-{1..100} | xargs -n1 -P30 \
  sh -c 'clogin -c "show int e1-4 tra det" $0 > $0.show_light'

Here I’m making a new directory that I’ll work from. Then I’m then using the ridiculously powerful bash brace expansion to generate my list of hostnames which I feed to my clogin command using xargs. xargs is set to consume 1 argument per execution (-n1) and to execute up to 30 instances in parallel (-P30).
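If you ever outgrow xargs, the same fan-out pattern is only a few lines of Python (sketched here with echo standing in for clogin so it runs anywhere):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on(host):
    # Real use: subprocess.run(["clogin", "-c", "show int e1-4 tra det", host], ...)
    # 'echo' is a stand-in so this sketch is runnable without clogin installed.
    out = subprocess.run(["echo", host], capture_output=True, text=True)
    return host, out.stdout.strip()

hosts = [f"switch-{n}" for n in range(1, 101)]
with ThreadPoolExecutor(max_workers=30) as pool:  # like xargs -P30
    results = dict(pool.map(run_on, hosts))
```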

You can use brace expansion to get a variety of outputs that fit your naming scheme. Unless your naming scheme is to use pseudo random globally unique identifiers, in which case I’ll give you my imaginary six shooter and you can pretend to shoot yourself in the head after shooting the person who came up with that. (HR people: that’s NOT AN ACTUAL THREAT, my six shooter is as fictional as a cool cowboy hat.) Some other brace expansion examples:

$ echo switch-{1..4}
switch-1 switch-2 switch-3 switch-4
$ echo switch-{01..04}
switch-01 switch-02 switch-03 switch-04
$ echo switch-{a..d}-{01..04}
switch-a-01 switch-a-02 switch-a-03 switch-a-04 switch-b-01 switch-b-02 switch-b-03 switch-b-04 switch-c-01 switch-c-02 switch-c-03 switch-c-04 switch-d-01 switch-d-02 switch-d-03 switch-d-04
$ echo switch-{aa,bb,cc,dd}-{01..04}
switch-aa-01 switch-aa-02 switch-aa-03 switch-aa-04 switch-bb-01 switch-bb-02 switch-bb-03 switch-bb-04 switch-cc-01 switch-cc-02 switch-cc-03 switch-cc-04 switch-dd-01 switch-dd-02 switch-dd-03 switch-dd-04

I’ve wrapped the clogin command in a shell because I often end up in situations where I want to pass multiple different arguments into clogin and xargs doesn’t handle this on its own. By wrapping it in shell I can pass a space separated list of arguments to the shell which will use those as arguments ($0, $1, $2, …) in the command. This will make more sense later when I talk about pushing configuration.

After running this we will find in our directory a bunch of files named switch-*.show_light that we can grep for whatever we were interested in. In this case I want to check the Rx Power of all the uplinks for any that are outside an acceptable limit. The egrep will vary depending on the output of the router. (One day we will develop a nice abstraction layer that normalizes all this.)

Here’s an example for Cisco NX-OS.

$ egrep 'Ethernet|Power[ ]{1,}(-([5-9]|[1-9][0-9])|N/A)' *.show_light | grep -B1 Power | egrep -v '^--$'
switch-19.show_light:  Rx Power       -8.92 dBm       0.00 dBm  -20.00 dBm   -1.00 dBm    -18.23 dBm
switch-69.show_light:  Rx Power      -10.22 dBm --    1.99 dBm  -13.97 dBm   -1.00 dBm     -9.91 dBm
switch-81.show_light:  Rx Power       -6.86 dBm       0.00 dBm  -20.00 dBm   -1.00 dBm    -18.23 dBm

Here’s an example for Arista EOS. Because EOS outputs voltage, Tx, and Rx in blocks, you can extract those blocks using awk. Use of sed and awk will become important as you progress to doing more complex scenarios.

$ for f in *.show_light; do cat $f | awk '/Tx/, /Rx/' > $f.tx; cat $f | awk '/Rx/, /END/' > $f.rx; done
$ egrep '^Et[0-9]{1,}[ ]{1,}(-([5-9]|[1-9][0-9])|N/A)' *.show_light.rx
switch-19.show_light.rx:Et9        -5.10         1.50        1.00        -14.00      -12.00      
switch-69.show_light.rx:Et14        N/A          2.00        1.50        -12.90      -11.90      
switch-81.show_light.rx:Et4        -15.33        1.50        1.00        -14.00      -12.00      

Of course Arista EOS also can output to CSV which makes this kind of thing a much easier task most of the time (bless you and your considerably more modern OS). Try something like the following to get a CSV you can use.

$ echo switch-{1..100} | xargs -n1 -P30 sh -c 'clogin -c "show int e1-21 tra csv" $0 | grep 10GBASE > $0.show_light'

Obviously at this point you can see how it is quite easy to extend this pattern and do greater levels of post processing using more awk and sed one-liners. A common pattern with clogin users is to have an inventory of devices in a file, and have cron execute a bunch of show commands for things you want to monitor, then generate some alerts or reports via email (see RANCID). People also use the output to do things like generate topologies, maps, so on.

Applying some configuration

Applying some configuration is just as easy as the above. clogin will let you feed a bunch of semi-colon separated commands to execute, and nothing stops you from making those a small configuration snippet.

$ echo switch-{1..100} | xargs -n1 -P30 \
  sh -c 'clogin -c "configure; interface po1-4; ipv6 nd suppress-ra" $0 > $0.config'

Or you can make a configuration script that clogin will read line by line. This is useful for longer configurations. First create the configuration script you want to apply.

$ cat>configure_tcam
configure
hardware profile tcam region vacl 256
hardware profile tcam region racl 256
hardware profile tcam region e-racl 256
hardware profile tcam region e-vacl 256
hardware profile tcam region ipv6-racl 256
hardware profile tcam region ipv6-e-racl 256

Now push that to the devices. Here the -c argument to clogin is exchanged with the -x argument and our configuration script. The output is redirected to *.configure_tcam which you can grep for any errors that might interest you.

$ echo switch-{1..100} | xargs -n1 -P30 \
  sh -c 'clogin -x configure_tcam $0 > $0.configure_tcam'

Verify the configuration was applied by fetching the current running configuration and composing an egrep. Here we want to see that our egrep finds the correct number of lines, in this case there are 6 configuration lines in our script that should now show up in the running configuration on all of our switches.

$ echo switch-{1..100} | xargs -n1 -P30 sh -c 'clogin -c "show running" $0 > $0.show_running'
$ egrep -c 'hardware profile tcam region (ipv6-|)(e-|)[rv]acl 256' *.show_running
# ...

Because we have such a large number of devices we need to work on exceptions, thus it’s better to append a grep to exclude lines that are okay and only show devices that appear to have not accepted our configuration completely. Now you can hone in on those and scratch your head as to why that happened. Hint, go look at the switch-69.configure_tcam output file.

$ egrep -c 'hardware profile tcam region (ipv6-|)(e-|)[rv]acl 256' *.show_running | grep -v ':6'

Another technique is to just feed the configuration script to grep as a pattern file. The -1 in the expression is to exclude the “configure” line in the configuration script.

$ egrep -cf configure_tcam *.show_running | 
  grep -v ":$(($(wc -l < configure_tcam) - 1))"

Naturally you’ll need to save the configuration.

$ echo switch-{1..100} | xargs -n1 -P30 \
  sh -c 'clogin -c "copy running startup" $0 > $0.save_config'

Now in this case we need to reload the switches, so we’ll need to use expect. Something like the following should do the job.

$ cat>reload.exp
send "\r"
expect {
    -re "$prompt" { send "reload\r" }
}
expect {
    -re "Do you want to continue.*]" { send "y\r" }
}
Our expect script can be executed with clogin by exchanging the -x argument for the -s argument. You’ll find more elaborate expect scripts in the RANCID package. In time you’ll learn TCL and build up a library of scripts to handle all scenarios.

$ echo switch-{1..100} | xargs -n1 -P30 sh -c 'clogin -s reload.exp $0 > $0.reload'

Verify reloaded and configuration was properly loaded by fetching the running configuration and grepping it as per the verification steps above.

Variables in configuration pushes

Sometimes you have some configuration you need to push that requires variables to be substituted. In this case I’m assuming you have some kind of input file that is easily parsed. Often I’ll have some kind of inventory-esque file that is simply a comma or space delimited list of parameters where I know the particular columns well. You might extract this from a database, or use a little XSLT to extract the data from a horrible XML file, or if it follows a well defined scheme simply generate it fresh with a little bit of shell (or in Excel if I’m feeling the urge). Or in cases where there is no inventory you can use, you’ll actually walk the network on a regular basis, fetching configurations and extracting the data from them (use a link state database from a few seed routers as a shortcut to get a list of all routers).

In this case let’s say we have a space delimited input file; we can use cut to extract the parameters we want and supply them to xargs. Notice we change xargs to consume a whole line of input at a time: -n1 has become -L1. These parameters are now available in our command as $0, $1, $2 and so on. In this contrived example column one is the hostname, column two is the management IP, column five is the IP address on uplink 1, and we have no DNS so we’re connecting to the management IP.

$ cat input | cut -d" " -f1,2,5 | xargs -L1 -P30 sh -c 'clogin -c "configure; int po1; ip address $2" $1 > $0.config_ip'

You can also generate larger configuration scripts and apply them using a simple template like follows. In this case our very simple template substitutes variables using Perl, creating a script for each device, which is then pushed.

First the template.

router bgp ${local_asn}
  neighbor ${ipv6_neighbor1}
    inherit peer LEAF
    remote-as ${remote_asn}
  neighbor ${ipv6_neighbor2}
    inherit peer LEAF
    remote-as ${remote_asn}
  neighbor ${ipv6_neighbor3}
    inherit peer LEAF
    remote-as ${remote_asn}
  neighbor ${ipv6_neighbor4}
    inherit peer LEAF
    remote-as ${remote_asn}

Now transform the template creating a script for each device.

while read hostname local_asn remote_asn ipv6_neighbor1 ipv6_neighbor2 ipv6_neighbor3 ipv6_neighbor4; do
  export local_asn remote_asn ipv6_neighbor1 ipv6_neighbor2 ipv6_neighbor3 ipv6_neighbor4
  perl -pe 's/\$\{([^}]+)\}/defined $ENV{$1} ? $ENV{$1} : $&/eg' < config_template > $hostname.template
done < input

Push the configurations.

$ cat input | cut -d" " -f1 | xargs -L1 -P30 sh -c 'clogin -x $0.template > $0.config_peers'

And then follow with your verification steps.

Playing with AVX

A friend recently got me interested in cryptocurrencies. I had not been paying much attention to them, but they are quite interesting and digging into the details has been a fun project. (I now own 2.0 BTC; hopefully soon I can buy a car with them.) Looking into the code I learned about the new-ish AVX instructions that extend SSE registers to 256 bits. Holy crap that’s a big register. Where have I been for the past 8 years? And there is a 512 bit version coming next apparently. I had to have a play with these, so I fished out an old project that produces MD5 using SSE and modified it to use the new AVX extensions.

Sadly it turned out AVX2 isn’t supported on my laptop (or any other CPU I had handy) and the plain-old AVX is a bit half-baked. For example it doesn’t support integer operations. Or shift operations. This leaves you needing to cast back into SSE in weird ways like:

/* AVX1 is half-baked. It doesn't have any integer support, so add each
 * 128-bit lane with SSE and reassemble the 256-bit register. */
#define AVX_ADD_INT(x, y) \
    _mm256_insertf128_si256( \
        _mm256_castsi128_si256(_mm_add_epi32(_mm256_castsi256_si128(x), \
                                             _mm256_castsi256_si128(y))), \
        _mm_add_epi32(_mm256_extractf128_si256(x, 1), \
                      _mm256_extractf128_si256(y, 1)), 1)

/* Or shift support. */
#define SSE_ROTATE_LEFT(x, n) \
    _mm_or_si128(_mm_slli_epi32((x), (n)), _mm_srli_epi32((x), (32-(n))))
#define AVX_ROTATE_LEFT(x, n) \
    _mm256_insertf128_si256( \
        _mm256_castsi128_si256(SSE_ROTATE_LEFT(_mm256_castsi256_si128(x), n)), \
        SSE_ROTATE_LEFT(_mm256_extractf128_si256(x, 1), n), 1)

Sadly the result of this isn’t good. The AVX version is a whole order of magnitude slower than the regular MD5 (see “width = 8” below), while the SSE version was twice as fast. I suspect this has something to do with having to transition between AVX and SSE in the above code, but I couldn’t be bothered digging into it too deeply. Once AVX2 arrives it’ll be easier to use. Given it’s the resolution time of year I put the source code on GitHub, finally going some way towards fulfilling a very old resolution to put all my code on GitHub. This code is not in any way optimised; at the time I originally wrote it (8-ish years ago) there were other SSE MD5 libraries out there that were giving better results. Today this can only be more so. But if you want to learn a little bit about SIMD it might provide a simple example.

iterations = 100000, width = 1, ticks = 45.00, hashes =
iterations = 100000, width = 4, ticks = 25.00, hashes =
iterations = 100000, width = 8, ticks = 699.00, hashes =

Blog posts on Verilog are popular


So after Chinese spammers (above) put my hosting account over its traffic limit yesterday I took a look at my Google Analytics.

And, wow. One post accounts for 40% of traffic over the last 12 months. Lesson: Verilog is sexy. Or at least sexier than everything else I write. 🙂


Accommodating full DFZ tables on merchant silicon

Had an interesting conversation at lunch the other day in which the subject of full tables on merchant silicon came up. Merchant silicon, such as Trident and Trident 2, have fairly limited FIB sizes today. This is fine in the data center use cases that they predominantly feature in, but what if you want to replace your “carrier-grade” routers with merchant silicon based switches? I was first thinking about this subject back in 2011 with the then topical subject of scale-out routers, and this is how I was thinking it might go.

If we were an innovative shop, who was making use of fixed form factor switches based on merchant silicon, and we wanted to lower the cost of ports at our peering points, we could build a BGP peering router as follows.

  1. Build a Clos out of fixed form factor devices. We’d prepackage this into a rack.
  2. We’d house a couple of servers in the rack that would act as the “control plane” engines.
  3. We’d use OpenFlow to proactively program state on the switches so that BGP packets (and other relevant control plane cruft) would be forwarded to the servers for processing (e.g. pings, arps, …).
  4. The servers would run Quagga and peer as you might expect. In this way it starts to look something like a TX Matrix.
  5. Our controller on the servers would then partition up the state and distribute it proactively amongst the switches such that if an ingress switch does not know the egress switch and port for a packet, it knows which switch has that state via an aggregate for that switch’s state.
  6. Not intended to be a complete description, but the data plane could at its simplest comprise a base layer of tunnels between every switch and every port. Thus the state that the controller has to dynamically partition and distribute is the mapping between a prefix learned from BGP and the appropriate tunnel to an egress port.
  7. The interesting stuff is in how you partition this state nicely. This could be a naive algorithm at its simplest, or to start. But in time it may be prudent to develop a method to better balance load, or to reduce misses on the ingress switch. For example we might pursue a FIB caching strategy: with a number of middle stage switches across which we initially partition the full state with proactive entries, the partitioning would only be adjusted on large timescales based on the balance of traffic to each aggregate. On the tier 1 switches we would use a strategy where we monitor flows and install state for the top-n most used prefixes on each ingress switch.

Of course later we might consider working with an ODM to produce a basic chassis that is really just a harness for these switches, alleviating the amount of intra-fabric cabling, but still leaving us with a fabric of switches that are independent and controlled by our servers. Needless to say that would be a great strategy to pursue if you also intended to build your data center fabrics out of similar building blocks.

Of course this strategy and ones like it can also be adapted to merchant silicon based chassis such as Arista 7500s or the recently announced Cisco 9500s.

A different, but semi-related strategy that interests me is the placement of NPU functionality onto NICs, or the development of ultra-small switches with large memory for peering scenarios (imagine say a 4x 10 Gbps device the size of a NIC, wrapped in a case, with a built-in ARM-based control plane running Linux). In this case the interest is in a flexible device on which you can do the encapsulation/decapsulation functions for private cloud, and to create a smaller footprint for disruption at peering points (e.g. for upgrades).

Quick update on wavelength autodiscovery

In the last post I outlined two potential schemes for discovering wavelengths. Extending auto-negotiation was not an option, so I pulled down a copy of IEEE 802.3-2008_section_4 to see what else could be used. It turns out there is a mechanism called sequence ordered_sets that can be used to convey state between the PHYs at each end of the link. Currently only two sequence ordered_sets are specified: local fault and remote fault.

In the case of the first scheme I described where a handshake is necessary it would now seem feasible to encode this using sequence ordered_sets. Sequence ordered_sets can convey three bytes of information. With only two defined that leaves a lot of room for conveying the local and remote wavelengths directly as new ordered_sets. But alternatively a “next page” mechanism could be used, in which a new ordered_set is defined that effectively says “the wavelength is in the next ordered_set”.
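To make the idea concrete, here's one hypothetical encoding; the opcode value and the channel-number layout are entirely my own invention, not anything defined in 802.3: one data byte as an opcode, the remaining two as a channel number.

```python
# Hypothetical opcode for "local wavelength" conveyed in the three data
# bytes of a sequence ordered_set (assumed value, avoiding the fault codes).
OP_LOCAL_WAVELENGTH = 0x02

def encode_wavelength_os(channel):
    """Pack a 16-bit channel number into the three ordered_set data bytes."""
    if not 0 <= channel <= 0xFFFF:
        raise ValueError("channel out of range")
    return bytes([OP_LOCAL_WAVELENGTH, channel >> 8, channel & 0xFF])

def decode_wavelength_os(data):
    """Return the channel number, or None if this isn't a wavelength opcode."""
    op, hi, lo = data
    if op != OP_LOCAL_WAVELENGTH:
        return None
    return (hi << 8) | lo
```

Three bytes is plenty: a full ITU grid fits in two, leaving the opcode byte free for the "next page" or "iterate wavelengths" indications discussed below.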

In the case of the second scheme I described relating to the WDM PON scenario where the head end knows the wavelength that the remote end should use, it could advertise this explicitly in the ordered_set as described above. This way the remote end immediately knows what wavelength to use. Or in a scheme where the remote end iterates over wavelengths the remote fault indicator ordered_set that is already defined could be overloaded to indicate to the remote end to keep iterating until the indicator is ceased. Or preferably, as it’s cleaner than overloading the use of the remote fault signal, a new ordered_set would be added that would provide this “iterate wavelengths” indication.

Of course the usefulness of this is limited to filter based WDM systems with direct detection receivers. Schemes as described above don’t make sense in the coherent detection case as the receiver needs knowledge of the wavelength on which it is meant to receive.