Updating The Edge
Devices on the edge are highly varied in hardware and capabilities, even within the same technology space. Knowing that, how do we design an efficient, scalable, and reliable solution for updating the software on these devices, all while minimizing downtime for the user? There is no one-size-fits-all solution to this problem, but we have some tools and techniques at our disposal to make solving it easier. Learn exactly why this is an important and difficult problem to solve, and get exposed to tools and strategies you can implement to streamline deployments to your embedded Linux devices on the edge.
VIDEO TRANSCRIPT
Hey, you all. Today I'm going to be talking about update strategies for edge devices. This can be a deceptively difficult problem to solve, especially if you're dealing with a fleet of devices with different capabilities. But there are some tools and design strategies we can implement that address a number of pain points. First, a little bit about me: my name is Kat Cosgrove, and for about a year I was a software engineer on the IoT team here at JFrog before I moved into developer advocacy. Our goal is to bring DevOps to the edge, because it should not still be this difficult to update these devices. In pursuit of this goal, we found a lot of interesting solutions that we could bring into a CI/CD pipeline for embedded Linux devices, and eventually built a rather flashy proof of concept that puts several of these solutions on display. Let's get a little bit of context first. Just how large is the edge? Counting both dumb sensor edge devices and smarter ones, plus gateways, there are an estimated 20.4 billion edge devices today.
It seems like a lot, yeah. So, how are they being updated now? Some of them can't be updated at all; they are throwaway devices, effectively single-use, and when they break they need to be replaced. And a lot of those that are being updated do it in a way that's kind of unwieldy. It's time-consuming, or the infrastructure to support it is complicated, or it can't happen wirelessly at all and requires physical access. That's either expensive for the manufacturer, because they have to send staff out to physically service the devices, or it's irritating for the user.
Users don't want to plug a device in to update it anymore, which is fair; I don't disagree. But in spite of that, the industry is clearly booming. It's 20.4 billion devices. That's not a small, insignificant industry. Clearly, in spite of the fact that most of these devices can't really be updated well, it's making money. So, why should we spend the time and effort to change something that doesn't really appear to be a barrier to success? Because it is really inconvenient, and not just small inconvenient. Our lives are increasingly reliant on edge and IoT computing, with devices taking over larger and larger amounts of work in consumer, industrial, retail, and medical spaces. A lot of you probably work in those industries and are very aware of that. People don't want to spend lengthy amounts of time waiting for a software update to complete in an increasing number of parts of their lives, you know? And they definitely don't want to have to plug the device into a computer to do it manually. It's also dangerous.
Infrequent or non-existent software updates, regardless of the reason, make edge devices a serious security vulnerability in anybody's network. Your consumers, your clients, a business network, a home network: it is a problem. Everyone knows that unpatched software is a problem. This could mean anything from the exposure of user and client data to a malicious third party, to devices being harnessed for a botnet or cryptocurrency mining. This has already happened on a wide scale, and it's continuing to happen today. The safety implications for medical devices are even more extreme. But remember, a few years ago, when a bunch of people's smart home appliances, smart refrigerators and the like, got harnessed as part of a botnet that was DDoSing other services, because they were exposed and difficult to update? That's really dangerous. OK, we're acknowledging that it's a problem. Why are we not fixing this problem? Why does this still exist? We are flat-out still not building for it.
A lot of these devices are just not designed with the ability to be updated. They're expected to run the same software version they shipped with until they break. The update strategy for these devices is flashing them. This is dangerous, because how often do you write a piece of software with zero bugs in it? How confident are you in your QA team to catch every conceivable edge case? I know we all want to think that we're wizards, super reliable, masters of our craft, genius 10x engineers, but that's not reality. It just isn't. You shouldn't be that confident.
There are, on average, between 1 and 25 bugs per thousand lines of code. Bugs are making it into production; you can't gamble on your code being perfect, because you're not perfect. None of us are. Sorry. We also can't really rely on the device's network. The networks hosting these devices are probably unstable, the connections might be intermittent, and the speed of the network may be very slow, so we need to keep the updates really, really small, which probably means they need to ship more frequently. It's also probably not a secure network, so these updates need to be signed in some way, so we know they're trustworthy. Those 20.4 billion devices are also not running on any kind of standardized hardware. Each class of device is running on something specialized, with differences in available communication protocols, memory, storage space, architecture, and operating system sophistication. So, how do we design a system that takes these differences into account, handles them, and allows us to achieve easier deployments to broader classes of devices instead of just a single board or device type? Can we get a one-size-fits-all solution for a market that comes in a lot of different sizes? Probably not, I'm going to be honest with you. But we can build something that gets us one-size-fits-most, or at least something that's flexible enough that it can be applied to your specific situation with minimal effort. First, we have to think about the future.
We need to stop building devices that can't be updated. Throwaway devices are not acceptable anymore. Besides the glaring security problem in having a fleet of devices out there in the wild, collecting data, that can't be patched if there is a problem, the ecological impact of designing single-use electronics that might need to be thrown away as a result of a bug is unacceptable. The way I like to explain just how much of this crappy single-use electronic junk is out there: all of the medals for the Tokyo Olympics that were supposed to happen this summer were made entirely from scrap metal harvested from discarded electronics donated by the people of Japan. That is an absolutely mind-boggling amount of electronic trash, and we should not allow it to continue. It's absolutely unacceptable. We also need to be designing with the philosophy that your product should improve over the course of its usable lifespan. This may not apply to edge devices that are just dumb arrays of sensors passing data along to somewhere else, but it definitely applies to smarter edge devices and gateways. The expected lifespan of some edge devices is 5 to 10 years; they should not be getting worse from the very moment they ship.
Now, I'm not saying that you need to build something that's never going to need to be replaced, because that's not how businesses work, but you shouldn't allow it to immediately degrade. You should be able to improve it for as long as the hardware is not obsolete. So release updates to improve your physical product; I'm not asking for a decade, I'm asking for a couple of years. Support it with measurable improvements until the hardware is obsolete, and then people will replace it. We also need to be building more robustly. Brittle software means a brittle product, and your users aren't going to trust that. A bad update should never break the device, and there needs to be a way to roll back immediately if there is a problem. Be trustworthy. There are a lot of moving parts in deploying updates to the edge, or deploying updates at all these days, but especially to edge devices, and it's much easier for the developer to manage if you have a robust CI/CD pipeline in place. Do your engineers a favor and don't make them do all of this menial, repetitive stuff every single time. Automate it. Let's talk about the proof of concept in question. For swampUP last summer, my team built a proof of concept demonstrating fast, reliable, over-the-air updates for edge devices. We went with a car as our example because it's flashy, and it's not something a lot of us think of as an IoT or edge device, even though it totally is.
Most cars these days have at least two or three computers in them: the infotainment system, plus a computer for the transmission and the brakes at minimum. It's a data center on wheels. And all those computers on board need updates. Since JFrog wouldn't buy us a real car to potentially break during testing, which is fair (I'm not mad at you for that, Shlomi), we had to build our own, inspired by a hackathon I had run a few months earlier. So we had a racing simulator set up in the middle of a large track, complete with pedals, a wheel, a screen for the driver, a racing chair, and a green screen around the track so drivers had something nice to look at. That is me making sure everything is set up before the conference opens. We allowed people to interact with it in one of two ways: as a developer, writing and pushing updates for the car, or as a driver, driving the car while someone else updated it. Yes, we really did let real randos off of the conference center floor write code and push it to our demo. We were very, very confident in this, and it was actually only a problem once. What we found is that usually somebody from marketing wanted to be the developer, with a little help from one of us, and the developer wanted to drive the car, and then the person from marketing wanted to mess with the developer by pushing an update that made the steering really loose, or inverted the video, or something. It was fun for everybody. The car itself is just a heavily modified Donkey Car.
If you aren't familiar with those, and I was not before I ran a hackathon with one, it's a miniature RC car that's been modified with a Raspberry Pi and a camera. The library that enables it to do its thing is called Donkey, hence Donkey Cars. In its most basic form you can build one for about $250 in parts and a couple of hours of your time; it's really not bad. Once everything is set up, you record 10 laps or so on a track marked with brightly colored masking tape or paper on a dark floor. Really, you need anything that creates a lot of contrast; as you can see here, we painted stripes on a black floor. Then you just dump the recorded images and steering data back to your computer, train a model, and hand it back to the car. Donkey provides a CLI that makes all of this really, really easy, but the project itself is super well-documented, so if you want to dig in a little deeper and make some modifications to it, you totally can. They are pretty fun. A whole community exists for modifying and racing these things.
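If you're curious what that record-train-drive loop looks like in practice, here's a rough sketch of the standard Donkey workflow. Command names and flags vary a bit between Donkey versions, so treat this as illustrative rather than exact:

```
# Generate a new car project from the Donkey templates
donkey createcar --path ~/mycar
cd ~/mycar

# Drive manually around the track, recording camera frames plus steering data
python manage.py drive

# Train a model on the recorded laps, then hand it back to the car
python manage.py train --tub ./data --model ./models/pilot.h5
python manage.py drive --model ./models/pilot.h5
```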
They'll throw bigger batteries on them, more powerful motors, nicer wheels, and then just race. It's really cool. The standard camera is just a regular Raspberry Pi camera, but some folks add a second camera for stereo vision, or a lighter camera for better depth perception. Ours was a little bit beefier than the standard, but still not by much; we just swapped out the Raspberry Pi 3B for a 3B+ with a compute module. So it's very nearly a stock Donkey Car. You can build this too. While the interactive part of the demo wasn't as revolutionary from a technical standpoint, it was really fun to build and still fairly complicated. I'm happy to answer questions about this part of the demo as well, but let's talk about the actual software updates. How are cars being updated now? The overwhelming majority of cars can't receive software updates over the air.
I know somebody listening to this drives a Tesla and is going, "well, actually…" Yes, I know, Teslas can do it, but we're not talking about Teslas; we're talking about most cars. Most cars need to be brought into a service center to get a software update, and it can take a long time. Cars that can be updated over the air, which again are in the minority, still take a while to do it. It can sometimes take hours, and the car can't drive while any part of that is going on. Different cars are still using different boards and running on different operating systems, so even within one corner of the market, there is no standardization yet.
Something like 15% of vehicle recalls are a result of bad software that needs to be updated. For example, a couple of years ago, Jaguar had to issue a recall to fix a problem with the brakes on one of its cars, one of its very expensive luxury cars. And another car stranded its owner on the highway for hours, because it could do over-the-air updates, but it wasn't very intelligent about how it handled them and it couldn't roll back a bad one. So this poor person was just trapped in their car on the highway for hours in traffic. That's like my actual nightmare. So this is a very real, very painful, and potentially very dangerous issue that presented us with a range of solvable problems for both the end user and the manufacturer. The combination of all of these issues in one single device made it a great example for proving that we do not have to do things this way anymore. Because we were working with a car, we were very, very motivated to make this happen quickly, reliably, and without the potential for injury to the user.
I know that it wasn't a real car, but we were pretending that little RC car was a real car. Let's look at our workflows and tools. We used two distinct workflows, showing off two distinct strategies, for the proof of concept. For one workflow, we updated the software running on the car without flashing its firmware. It's very quick, it doesn't interrupt the user, and it supports rollbacks. This workflow relied on K3s and Helm. For the other, we updated the firmware itself in just minutes, with the ability to automatically roll back if something goes wrong. This one relied on Mender, Yocto, and Artifactory.
All the updates were scanned for vulnerabilities using Xray, and Pipelines manages the triggering of different events. We'll look at the software workflow first. This is just a quick overview of the technologies at play in a workflow for software updates on a device. We used JFrog Artifactory to manage and store all of our various build artifacts, as well as to handle promoting builds from development to production. Xray is used to scan our package dependencies before release; we want to make sure that we're not accidentally pushing a known security problem. K3s and Helm are used to deploy to the car.
Since it already integrates with the other JFrog products, JFrog Pipelines is used to orchestrate all of this. We'll talk about each tool individually in a little more detail. K3s is just Kubernetes, but 5 less than K8s. It's also the cutest slogan in open source; I love it. K3s is really just lightweight Kubernetes. It's designed for edge devices: it needs only 512MB of RAM, ships as a binary of just 40MB, and has very, very minimal OS requirements. Its packaged dependencies are just containerd, Flannel, CoreDNS, a CNI, and some utilities like iptables and socat. It's wrapped in a launcher that abstracts away some of the more complex issues, like generating TLS certificates for secure communication, and it sets a few options for you. Then we also relied on Helm, the package manager for Kubernetes. I assume that a lot of you are familiar with Helm already, but for the people in the room who are not, I will go over it briefly. Helm uses charts to describe complex applications and allows for an easily repeatable and predictable installation.
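To give you a flavor of chart syntax before we look at ours: a chart is mostly templated Kubernetes manifests. This is a minimal hypothetical sketch, not the demo's actual chart, so every name and value here is made up:

```yaml
# templates/deployment.yaml -- hypothetical sketch, not the chart from the demo
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-steering
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Chart.Name }}
  template:
    metadata:
      labels:
        app: {{ .Chart.Name }}
    spec:
      containers:
        - name: steering-service
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: {{ .Values.service.port }}
```

Once a chart exists, deploying a new version and undoing a bad one are both one-liners, something like "helm upgrade car ./car-chart" and "helm rollback car 1".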
Your charts serve as the final authority on what your application should look like. They're also easy to version, and they support rollbacks if something goes wrong. The snippet on the slide is part of one of the actual Helm charts for this demo; it's deploying a microservice we were using to let us control the miniature car with a racing wheel designed for video games. The syntax here is fairly simple. Like all things in DevOps, ultimately, it is just YAML; run some YAML on it. The result: using just these two open source tools made updates to the applications running on the car pretty quick and fairly efficient. The average time from a developer pushing an update to deployment on the car was about 35 seconds, with no interruptions to the user at all. It can happen while the device is in use.
That's not to say that it necessarily should happen while the device is in use. In the case of a car, I don't really think that in actual practice it should, and for most updates, I think there should be consent from the user involved, or we should at least tell them that it happened. But designing your edge device with this strategy in mind does mean you can update silently. For edge devices without any user interaction at all and no need for a full firmware update, this is a pretty compelling option. If you're building a smart HVAC system, or something people aren't really going to want or need to mess with on a regular basis, this is probably a good option for you. Next, we'll take a look at the firmware workflow. This one was a bit more complicated. Similar to the software workflow, this is an overview of what is going on in the firmware update workflow: Yocto is used to create our builds, Pipelines handles all of the events, and again Xray checks our packages for vulnerabilities or policy violations.
We're also using Artifactory again, but this time for more than just storing binaries; it's handling some of the workload for Yocto, storing its build cache to make things faster later on. Mender is used to handle over-the-air deployment. So let's talk about Mender a little bit: over-the-air updates for embedded Linux devices. Mender ticks several of the boxes we're looking for. All updates are cryptographically signed and verified, it supports automatic rollbacks to a previous build if a failure occurs, and it allows for several distinct installation strategies, but I'm going to be focusing on the dual A/B partition strategy. Using this method, two partitions exist on the device, A and B. Duh. The bootloader is aware of which partition is currently designated as active and which is not, and it boots into the partition designated active. During an update, the inactive partition is updated with your new firmware; the update streams directly to the inactive partition. At that point, the partition holding the update is designated as active, and on reboot, the bootloader flips into that partition instead. Because the partition running the older software still exists, although now it's designated as inactive, Mender offers us an additional layer of security.
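On the publishing side of that, builds get wrapped into signed Mender artifacts. Roughly, and I'm hedging here because the exact flags vary between mender-artifact versions, and all the names below are made up, it looks like this:

```
# Package a rootfs image as a signed Mender artifact (illustrative names only)
mender-artifact write rootfs-image \
  --artifact-name release-2.0 \
  --device-type raspberrypi3 \
  --file core-image-minimal.ext4 \
  --key private.key \
  --output-path release-2.0.mender
```

The device verifies the signature before anything touches the inactive partition, which is what makes updating over an untrusted network tolerable.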
If there is an issue with the new build that wasn't caught in earlier testing and prevents it from booting correctly, Mender can automatically roll back to the previous version by just switching back to that now-inactive partition. So, what this looks like in practice: you get in the car, and you're driving to the grocery store. While you're driving, your car gets an update, and it starts streaming to partition B, say that's the inactive one for now. So it downloads to partition B while you're driving to the grocery store, and this does not interrupt you at all.
You don't notice, aside from maybe your car saying, "Hi, you've got an update. Do you want to go ahead and download this?" You say yes, you drive to the grocery store, and it doesn't interrupt you other than that. So you go in, you do your shopping, you come back out, load up your groceries, and then you turn on the car and drive home. And that's it. While the update may have actually taken the entire 15 minutes it took you to drive to the grocery store, the perception from the user is that it happened instantaneously the next time they rebooted the car. That's pretty cool in and of itself. The use of Mender's A/B strategy alone adds a lot of additional speed and security to your edge deployments. But we still need to address the size of our builds, because they're kind of big. Enter the Yocto Project, providing custom Linux distributions for any hardware architecture. Yocto drastically reduces the resources used on the board and minimizes hardware requirements by building a distribution without the modules that are unnecessary for your particular hardware.
For instance, say you know you're only ever going to have to communicate over Bluetooth. You build a custom Linux distribution that does not include modules for Ethernet and WiFi; they just aren't there from the very beginning. BitBake is used to write recipes for your build, pulling in layers for different hardware configurations and applications. We used a layer for the board itself, a Raspberry Pi 3B+, as well as layers for K3s and Mender. Yocto already provides a pretty wide variety of hardware layers for common configurations to get you started more quickly, and custom layers can be written to isolate specific applications or behaviors if you want. As an example, this is part of the meta layer that adds K3s to our Yocto build. BitBake's syntax is pretty straightforward, and it allows for the execution of both Python and shell scripts in parallel. You don't need to be an expert in Python or shell scripting for this; it's all fairly simplified, so any software engineer with sufficient drive could absolutely put this together.
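Since the slide isn't reproduced in this transcript, here is a hypothetical fragment showing the general shape of a recipe that installs a prebuilt k3s binary. The version, URL details, and checksums are placeholders, not our actual layer:

```
# k3s_1.17.bb -- hypothetical recipe sketch, not the layer from the demo
SUMMARY = "Lightweight Kubernetes for edge devices"
LICENSE = "Apache-2.0"
LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/Apache-2.0;md5=89aea4e17d99a7cacdbeed46a0096b10"

# Fetch the prebuilt k3s binary from the upstream release (URL pattern illustrative)
SRC_URI = "https://github.com/rancher/k3s/releases/download/v${PV}/k3s;name=k3s"
SRC_URI[k3s.sha256sum] = "0000000000000000000000000000000000000000000000000000000000000000"

do_install() {
    # Put the binary on the target image's PATH
    install -d ${D}${bindir}
    install -m 0755 ${WORKDIR}/k3s ${D}${bindir}/k3s
}
```

The real layers are more involved, with systemd units, configuration, and dependencies, but that's the flavor of it.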
The first build will still take a while, but we can speed things up drastically for future builds. Yocto caches everything it builds along the way, its shared state cache, which allows it to do incremental builds; with that cache available, it will only rebuild what's changed. And this is where Artifactory comes into play. We can use a generic repository to store the Yocto build cache and tell Yocto to use that during subsequent builds. This strategy reduced the time required to build by as much as 50% for us. That's a pretty significant speed difference.
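Concretely, pointing Yocto at a remote shared state mirror is a one-line addition to local.conf. The Artifactory URL below is hypothetical; you'd substitute your own generic repository:

```
# local.conf -- check a remote mirror for shared state before rebuilding
SSTATE_MIRRORS = "file://.* https://artifactory.example.com/artifactory/yocto-sstate/PATH;downloadfilename=PATH"
```

Populating that repository is then just a matter of uploading your sstate-cache directory after a build.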
So the result: using Mender to handle deployments and Yocto for our builds, pulling the build cache from Artifactory, the time for a full firmware update was between 5 and 10 minutes after the first build. The build is as small as possible, minimizing potential issues with low bandwidth on the device's network. The updates are signed, satisfying the security requirement, and automatic rollbacks are in place in case of failure. So, we succeeded. I do have a few more tools that didn't make it into our proof of concept but are still very much worth mentioning. First is OSTree. It's basically Git for operating systems. Like Git, OSTree uses a content-addressed object store, with branches to track meaningful file trees. This can be used to update a system, even incrementally, and it also supports rollbacks. There is a Yocto meta layer for this as well that allows over-the-air updates using OSTree and aktualizr, and OSTree is already used by a few different Linux distributions; this is a well-established tool. There's no real reason why we chose Mender over OSTree, other than that we just came across Mender first, I think. Next is LAVA, a testing framework for operating systems on embedded devices. LAVA stands for Linaro Automation and Validation Architecture, and it's a continuous integration system for deploying operating systems onto physical and virtual hardware to run tests. A test can be simple boot testing, bootloader testing, or system-level testing, and results are tracked over time, with data exportable for further analysis.
We have this running in the lab in the Seattle office. LAVA was designed for validation during development, meaning testing whether the code engineers are producing works, in whatever sense that means for you. Depending on context, this can be testing whether changes to the Linux kernel compile and boot, or whether the code produced by GCC is smaller or faster. LAVA is outstanding at device management, with templates in place for more than 100 boards. If you need a device type that LAVA doesn't already know how to support, custom devices can be added, although it can be difficult to fully integrate an unknown device. I will admit that the learning curve with LAVA is pretty steep, but it's worth it if you need to manage a ton of different boards.
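To make that concrete: a LAVA job is a YAML definition describing what to deploy, how to boot it, and which tests to run. This is a minimal sketch of a boot test on an emulated device; the schema details vary by LAVA version, and the URL is a placeholder:

```yaml
# Minimal LAVA job sketch: deploy a root filesystem and boot it under QEMU
device_type: qemu
job_name: boot-test-sketch
visibility: public
priority: medium
timeouts:
  job:
    minutes: 15

actions:
  - deploy:
      to: tmpfs
      images:
        rootfs:
          url: https://example.com/builds/rootfs.ext4.gz
          compression: gz
  - boot:
      method: qemu
      media: tmpfs
```

Alright, let's recap.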
The way a lot of edge and IoT devices are being updated now, if they're updated at all, is kind of broken. It's a gigantic security flaw that needs to be addressed, and it doesn't have to be this way. We have technology available literally right now that handles some of these problems beautifully for edge devices running embedded Linux. A modern DevOps approach and a handful of open source tools can make it as easy to deploy updates and security patches to your edge devices as it is to any other application.
If you only need to update the applications running on the device, K3s and Helm can make things very quick and easy for you. I know Kubernetes has a pretty steep learning curve, and I was scared of it at first too, but K3s is simpler to use, I promise. If you need to update the device's firmware itself, consider Yocto and Mender, or OSTree. Definitely store the Yocto build cache in Artifactory, and make sure Yocto is actually set up to use it. Thank you for taking the time to listen to me today; I hope I brought you some useful information, or at least an interesting perspective. If you have questions, I'll be in the chat answering them for a while. You can also reach me on Twitter @Dixie3Flatline, or you can just shoot me an email at KatC@JFrog.com. Enjoy the rest of the conference.