How not to do mesh networking

This is based on a talk I gave at the inaugural IoT Engineers London Meetup. You can find a (hopefully not too embarrassing) video here.

A little context

Converge’s mission is to help make the world’s largest industries more sustainable and efficient through the use of wireless sensor networks and deep data models. We need to be able to reliably collect physical data from large work sites (e.g. a construction site or a factory). We started off in the construction industry with a very specific problem: concrete maturity monitoring. There are lots of reasons why we chose this as the starting point, which perhaps deserve a separate post. The problem we were solving was that temperature data needed to be read from concrete slabs all over the site in real time, because obtaining the concrete strength (which is derived from the temperature) is often the rate-limiting step in constructing a plethora of major infrastructure projects and other buildings. This data collection was previously done manually, which is costly both in the time of the highly-trained engineers whose work was interrupted by trudging around collecting readings, and in the constant delays it caused, which are more expensive still.

Construction sites are not friendly places to deploy IoT devices. There is a lot of concrete, a lot of steel, and a completely dynamic environment with people driving metal sheets between devices, or simply unplugging them because that is apparently what one does when faced with an unknown device. Ultimately, this is a bad place for tech products in general, and RF products in particular.

Topologies and environments

There are several different site types on which we tend to be deployed. We can split these up into the very large, for example Hinkley Point C which is 3km x 1km; the very tall, for example many of the towers being built in London which are often sixty stories high; and the very deep, for example the Bond Street Station upgrade which was five levels underground.

Mesh networks allow you to cover very large sites with a single base station (hub)

Given the types of environments described above, you will not get the sort of range you would expect in open air on a construction site. There are also lots of reflective surfaces so multi-path interference is a big problem. On a very large site like Hinkley, therefore, your devices are not going to be easily within range of a base station (and since there is not much infrastructure at the start of a project, placing base stations everywhere can be difficult). A mesh network, therefore, allows you to cover a site by bouncing the signal between devices until you reach those on the edge.

Mesh networks allow you to string repeaters up a tower

On a tower, concrete is typically being poured at the “edge” of the building (i.e. the highest point), as the core and the floors have to be built as the tower goes up. In London, at least, cellular signal only reaches up to about level thirty (on average). Once a building is completed, extra cells are added to give people reception inside it, but until then, nada. This means we can’t move the base station up the tower with the concrete pours. Towers are really awful for cellular connectivity because of the density of concrete elements, especially around the cores. The space constraints, and the fact that there is nothing but air around them, also mean it is hard to lay down infrastructure. Laying a string of repeaters up the core allows us to monitor concrete at the top of the building with a base station down below.

Mesh networks allow you to monitor deep underground

The final case is the most obvious: you simply do not get cellular signal underground. Occasionally there is infrastructure you could plug into, ethernet for example, but that is not always true. Therefore for projects like Crossrail or the London Tube Network, we can sit a base station above-ground, and then string devices down until we get into the tunnels.

Now in all these cases, we need fairly low throughput, as we only need to collect data every 10-20 minutes. Mesh networks inherently introduce latency, but this was not a problem in this application as we are not doing something extremely time-sensitive like oil prospecting or mapping, where the timing of the readings matters greatly (and the data volume is enormous).

Transmission ratio through concrete (C24L = 120mm). Source: NIST (1997)

In 1997, the NIST did a study of the penetration of RF signals into various building materials across a range of frequencies. As one might expect, lower frequencies have far better penetration into materials like concrete. As a small startup, we did not want to have to licence our own spectrum, so we went for one of the open, sub-GHz ISM bands. We chose 868 MHz over 433 MHz, mostly because we had been told that there were a lot of 433 MHz devices on sites already, and we wanted to avoid interference as much as possible.

Therefore we ended up with our choice: a sub-GHz mesh network collecting data every 20 minutes. But why (oh why) did we build our own mesh? The simple (and mostly honest) answer is a mixture of naïvety and bloody-minded optimism (some might say arrogance). The more complex answer was that a lot of the research into mesh networks had been in very controlled environments like the home (e.g. Zigbee) or the lab (e.g. RPL). The environments we were in were not like those, and so we went for a custom design tuned to work in a real-world, industrial environment.

Early prototype Converge Node ft. my hand

Network Design

The network was designed to be very heavily coordinator / worker. The nodes themselves were completely dependent on the hub (or gateway) to provide them with instructions. The hub also provided the bridge between the sub-GHz network and the internet (or at least our servers, via a VPN and a cellular modem). We opted to make the mesh “egalitarian”, meaning that any node can act as either a leaf or a router, and all nodes can collect temperature data. We also made the mesh synchronous, as one of our requirements was power efficiency (try telling a group of engineers on site that they have to trudge around changing batteries every week and see how far you get). A synchronous mesh allows us to have very sleepy nodes which only wake up for small windows to transmit and receive. The mesh code itself was written in Python and ran on a Raspberry Pi connected via GPIO pins to a node which acted as a sub-GHz modem.

Each action was transmitted on a per-node basis by the hub. The scheduler would thus cycle through actions, transmitting each one to all the nodes it knew about. Before we could do any of that, of course, we needed some sort of packet to send and receive.

Packet Structure

The packet structure was fairly simple. The first byte was the message type, which was either 4 (hub → node) or 5 (node → hub); this dictated the direction the packet travelled on the tree. The next two bytes contained the ID of the node through which the packet had just passed (sort of a localised sender). After that, four bytes were dedicated to s_packet, an artefact from the first prototype which was never used in the end (and escaped expungement). Then a single byte contained the content length, followed by the content itself; similarly, following that was the path length, followed by the path. The path was a description of the route the packet would take down the tree, created by the hub at the beginning of transmission, since only the hub knew about the network tree.
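As a rough illustration, that layout can be written down with Python’s struct module. This is a sketch reconstructed from the description above, not the production code: the two-byte size for path entries (matching the sender field) and the helper name encode_packet are assumptions.

import struct

def encode_packet(msg_type, prev_node, content, path):
    """Sketch of the packet layout described above (field sizes assumed).

    msg_type:  1 byte  (4 = hub -> node, 5 = node -> hub)
    prev_node: 2 bytes (ID of the node the packet last passed through)
    s_packet:  4 bytes (legacy field, always zero here)
    content:   1-byte length prefix followed by the payload
    path:      1-byte length prefix followed by 2-byte node IDs
    """
    header = struct.pack('>BHI', msg_type, prev_node, 0)
    body = struct.pack('>B', len(content)) + content
    route = struct.pack('>B', len(path)) + b''.join(
        struct.pack('>H', node_id) for node_id in path
    )
    return header + body + route

Asking node 0x0102 for a reading two hops away might then be encoded as something like encode_packet(4, 0x0000, command_bytes, [0x0101, 0x0102]), with command_bytes standing in for whatever the real command encoding was.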

Now that we have a packet structure, we can look at an example schedule:

def temperature(sleep_time=300):
    """Temperature monitoring schedule"""
    return [
        Action.mesh.make({ 'slot_timeout': 2 }), # discovery
        Action.read_battery.make(), # ask for battery data
        Action.read_temperature.make(), # ask for temperature data
        Action.sleep.make({ 'time': sleep_time }), # sleep
        Action.checkpoint.make(), # persist mesh state
        Action.reschedule.make({ # schedule another round
            'preset': 'temperature',
            'argument': { 'sleep_time': sleep_time },
            'offset': sleep_time,
        }),
        Action.checkpoint.make(), # persist mesh state
    ]
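Conceptually, the hub’s scheduler just walked a list like this and pushed each action out to every node it knew about, one at a time. A minimal sketch (run_schedule, known_nodes and transmit are hypothetical names, not our real API; in practice hub-local actions such as checkpoint and reschedule would be handled by the hub itself rather than transmitted):

def run_schedule(hub, schedule):
    """Cycle through a schedule, sending each action to every known node.

    Illustrative sketch only: hub.known_nodes() and hub.transmit() stand
    in for the real scheduler internals.
    """
    for action in schedule:
        for node_id in hub.known_nodes():
            hub.transmit(node_id, action)  # one action, one node at a time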

Discovering the network


Before any readings can be taken, the hub needs to build up a network graph (which is really a tree). To do so, it first broadcasts a general “is there anybody out there?” Nodes respond in random slots to prevent collisions. Once a node responds, the hub has to tell it that it has been discovered, to stop it responding to any further discovery packets. Once no more nodes respond to the hub’s call, it assumes that everything at this level (depth = 1) has been discovered. The hub then asks each of the nodes on this level, in turn, to broadcast a discovery packet, and so on.
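In other words, discovery is a hub-driven breadth-first search. Here is a rough sketch of that loop; broadcast_discovery and mark_discovered are hypothetical helpers, not the real packet handling.

def discover_network(hub):
    """Hub-driven, level-by-level discovery (illustrative sketch)."""
    tree = {}
    frontier = [hub.id]          # depth 0: the hub broadcasts first
    while frontier:
        next_frontier = []
        for parent in frontier:
            while True:
                # Ask `parent` to broadcast and collect any responders.
                responders = hub.broadcast_discovery(via=parent)  # hypothetical call
                if not responders:
                    break        # nothing new answered via this parent
                for node in responders:
                    hub.mark_discovered(node)   # stop it answering again
                    tree.setdefault(parent, []).append(node)
                    next_frontier.append(node)
        frontier = next_frontier
    return tree                   # parent -> children adjacency of the mesh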

Reading temperature data

Reading temperature data from the mesh

Since only the hub knows the network tree, in order to read data from, say, node 0102, it creates a packet with the path [0101, 0102] and type 4 and sends it to node 0101. Node 0101 sees that it is not the destination node, and therefore passes the packet along to the next node in the path (setting the previous-node ID to its own ID in the process). Node 0102 then receives the packet and, seeing that it is the last node in the path, responds to its command (which is “send me temperature data”). The response packet (now with type 5) is sent back up the tree to the hub.
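The per-node forwarding rule is therefore tiny, something along these lines (handle_packet and its helpers are illustrative names, not our actual firmware API):

def handle_packet(node, packet):
    """Source-routed forwarding as seen from a node (illustrative sketch)."""
    if packet.path[-1] == node.id:
        # We are the destination: execute the command and send the result
        # back up the tree as a type-5 (node -> hub) packet.
        response = node.execute(packet.content)
        node.send_up(response)
    else:
        # We are a relay: stamp ourselves as the previous node and pass
        # the packet on to the next hop in the hub-supplied path.
        next_hop = packet.path[packet.path.index(node.id) + 1]
        packet.prev_node = node.id
        node.forward(packet, to=next_hop)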

But what happens when it breaks?

Dead edge detection

As has been described, the construction site environment is very dynamic, so it was quite possible that an edge which existed at discovery time (e.g. [0101, 0102]) would be interrupted, making it impossible to use that branch to communicate with the end node. The hostility of the environment meant that we already tried up to five times to transmit a packet before failing, so at first the hub would try to send the packet along [0101, 0102] five times. When no response was forthcoming from 0102 (we did not have effective link-level acknowledgements), the hub would then test every edge in the branch, starting from the beginning. The nice thing about this was that we did not actually need to test all the edges: if all the edges up to the ultimate one work, then the dead edge must be the ultimate one, in which case we only need to do approximately (n-2) round trips. The pathological case is when the penultimate edge is broken, in which case (n+2) round trips are required (including the five retries). The broken edge is removed from the tree, and hopefully that node will be discovered on a subsequent meshing cycle.
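The edge-testing walk looked roughly like this. The branch here includes the hub itself, and hub.ping is a hypothetical stand-in for sending a probe along a prefix of the route with the usual retry count:

MAX_RETRIES = 5  # attempts per transmission before we declared failure

def find_dead_edge(hub, branch):
    """Locate the broken edge on a branch (illustrative sketch).

    `branch` is the full route from the hub down to the unresponsive
    node, e.g. [hub_id, 0x0101, 0x0102]. Edges are probed from the root;
    if every hop up to the final one answers, the final edge must be the
    dead one, so it never needs an explicit test.
    """
    for i in range(1, len(branch) - 1):
        # Probe the node at position i via the prefix of the branch.
        if not hub.ping(path=branch[1:i + 1], retries=MAX_RETRIES):
            return (branch[i - 1], branch[i])   # first unreachable hop
    return (branch[-2], branch[-1])             # everything else worked: blame the last edge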

Pros and Cons

A strong coordinator/worker relationship was chosen because of its operational simplicity. This simple algorithm, however, had significant downsides when we deployed it in the field.

The most obvious, repeated problem was that after the initial setup of the mesh it would take a very long time for any new devices to join the network. A new device would have to stay awake, listening for packets, until the sleep cycle completed (which could be up to twenty minutes given our reading frequency). The rest of the network would be completely asleep during this time, until the scheduled wakeup when the hub would send discovery packets and read data. To get around this problem on the initial deployment (when we were setting up infrastructure for the first time on-site) we had a “deployment mode”, in which the mesh cycle time was reduced to ten seconds and no temperature data were read. This was still an irritating problem when adding new devices to the network, which would happen throughout the life of a job.

Having the hub control all routing and discovery meant that, on the one hand, nodes could be very simple. We did not have to manage per-device configuration settings if we wanted to change the reporting frequency, for example, and there was no issue of synchronising network graphs and so on. This simplicity came at a high cost, however. Given the hectic environment on-site, we would often find that our devices had been unplugged, power cables had gone missing, or generators had been relocated or simply shut down. This meant that the network infrastructure, and especially the hub, frequently went down. Centralised control then meant that the nodes would not know what they were supposed to be doing. Every time we had a network outage (simply a hazard of the environment in many cases), we would record no data in the interim, and would therefore have gaps in the timeseries data we were generating for our customers, which was detrimental to the integrity of the analyses we could perform. Since the nodes only produced data when the hub asked for it, they had no mechanism for continuously recording readings and caching them until they could be successfully transmitted.

Another aspect of this centralised control (and one that was designed in on purpose) was that, in the absence of a command from the hub to sleep, the nodes would stay awake, listening for packets. This would quickly drain their batteries. The reason for this design was that, should a device get “out of sync” with the network, it would be re-synchronised on the next cycle (or the one after). Staying awake for one or two cycles while waiting to reconnect was better than becoming permanently uncontactable (duty cycling is another way around this, but it has its own complications).

Finally, the time taken to read all the data from a network scaled badly with the number of devices, since the hub had to ask each device in turn (traversing a branch of the tree for each). Any dead edges would cause five round trips on that branch, locking up the hub and preventing it from querying any other nodes, which would meanwhile be running down their batteries listening for packets. Networks with around twenty devices would sometimes take two minutes or more to gather all the data. When you consider that during this time all the nodes were awake and listening for instructions, it is clear that this reduced their longevity.

Having said all that (and there were a lot of problems: just remembering how tough it was to manage these issues on customers’ sites brings one out in a cold sweat), the system was deployed on several difficult sites (e.g. Crossrail tunnels, towers in central London, large corporate offices). These sites ended up relying on the real-time aspect of the data to reduce their programme cycle times, sometimes by nearly 30%.

Where are we now?

Happily, we invested a lot of time in choosing a new platform and networking algorithm, and have not been using our custom mesh (RIP) for about eighteen months. We still use the same sub-GHz ISM band (~868 MHz), but we now use RPL, an IPv6 routing protocol designed to operate over low-power link layers. This has given us a lot of advantages: we get to use established tools like traceroute6 and ping6 to debug the network; control over the network is less centralised; and nodes push data to our ingest endpoint instead of having it requested by the hub (or “border router” in RPL-speak), so they can cache readings when the network goes down and we no longer lose data during an outage. Getting RPL working on site was far from a walk in the park, however. While there are many cases of large-scale RPL networks working well, almost all of them are in research settings. The realities of being on-site meant it took us a long time, and a lot of tweaking, to get the mesh to behave well and to be resilient to failures. Our latency is also massively reduced (down to a few seconds), and it now takes at most a couple of minutes for a device to join the edge of a large network.

Conclusion

Building our own mesh network was a stressful experience, but also an interesting and very humbling one. We learnt a huge amount about the intricacies of operating a system in a real-life environment: things always work fine in a lab or the home, but industrial environments are of a different scale and have a different set of complications. A lot of that learning informed our later use of more established algorithms like RPL, and the tweaks we had to make to get them working well enough. I can’t say I’d recommend it, but I hope the story proved insightful.
