Leveraging Robot Data in Autonomous Vehicle Development – Carl Quinn, Zoox

Engineers at Zoox are building autonomous vehicles to solve the world’s mobility problems. Their vehicles are robots that capture and generate a lot of data. They collect that data, ingest it, catalog it, and make it available to the rest of the company for many uses, such as performance analysis, event triaging, machine learning, and data mining. In this video, Zoox describes how robot data is different from your typical server log data, and how they manage it and turn it into a platform that benefits development and operational teams at Zoox.

VIDEO TRANSCRIPT

Okay, I’ll go ahead and get started. So, this talk today is about leveraging robot data in autonomous vehicle development.

So, I work at a startup called Zoox. We are building autonomous vehicles, and I’ll talk a little bit today about some of the ways that we use the data that we collect on the vehicle to improve our development process and inform what we do.

So, autonomous vehicles are robots. I don’t know if you know our mission, but Zoox’s mission is to deliver mobility to the world, to dense urban environments, and our goal is to build a full stack, a full solution. So, we are building the vehicles, we are building the AI software that is going to drive the vehicles, and we are going to own the fleet that we build and operate it like a taxi service, a robot taxi, so the whole stack.

That gives us some big advantages that we think are important compared to some of the other companies that are, effectively, just tacking robotics onto existing cars. Those cars were designed to be driven by humans, and if you throw a robot in there to try to drive them, there are a lot of things that don’t work as well.

If you build a vehicle from the ground up to do it right, we think we can do a better job. So, during our development process, we’re working on two forks at the same time, two main prongs of development work.

And, one of them is the vehicle itself, where we have some prototypes. Some of those cars in the video this morning were our cars driving out on the Alameda track. They look like dune buggies. So, those are our prototypes for the actual physical car design. But, most of the work, the hardest part of the problem, is developing the AI software for actually driving.

And, that we do with what are known as level three vehicles. So, I don’t know if you’re familiar with the DOT autonomy levels, but one, two, and three are the levels where humans are involved at some point in driving the car.

So, level three is a car that’s mostly autonomous, but with a human doing the backup. So, that’s what we have here, and this is what most of the other autonomous vehicle companies use for their driving and training model. So, there’s a trained driver in the driver’s seat and a software operator. In our case, we always have a second software operator in the passenger seat taking notes and watching out for certain things.

And then, at some point when our level five vehicles are done and our software is done, we’ll put them together and be out on a public road some day soon.

So, robots are like frogs, or like any other animal, if you think of them like that: they have senses, which for us are sensors, cameras, lidar, radar, time of flight, sonar. Some companies that are making autonomous autopilot cars are betting everything on the camera. In theory, humans can see and we can drive, so, therefore, a car should be able to drive if it can see.

But, the compute power that our brains have compared to what we have even in a stack of GPUs is vastly different. So, to hedge against those limitations, we bring in multiple modalities, lidar, radar, time of flight, sonar, other sensors for short range, and we get a lot of cross-reference information, and a lot of reliable input about the environment that you wouldn’t normally have with just cameras.

And then, compute. So, the robots have compute, and we use CPU and GPU. A lot of the work, especially vision and perception processing, can be done on the GPU. It’s much more efficient for machine learning and that kind of processing.

And then, whenever decisions are made about what to do, those signals go back out and control motors, and actuators, and speed controls, and all sorts of hardware.

And so, every car today has buses all over it, right? So CAN bus, LIN bus, there’s a bunch of different kinds of just normal interconnect on your car that sends signals between sensors and little actuators, controllers, for all different parts of the car. So, our car, our vehicle, is similar.

So, we have a lot of the standard car buses that we’re dealing with at the low level for physical control of things. But, we also have some higher level buses, because you’re not going to run a lidar data stream and camera streams over a CAN bus. So, we also have ethernet buses and other kinds of routing and switching, more like a whole compute network. It’s like a tiny data center stuffed in that car.

So, that brings all of the information into the central computer, or what we call the PCU, which is the main brain of the car. And, within the main brain are dedicated processes. So, this is an interesting architecture. I found it fascinating when I joined, in that there is, effectively, a high speed message bus that runs on this machine, which is a multi-core, multiprocessor machine.

But, the message bus then allows different processes, or software components, to publish and subscribe. So, pub/sub, but at a very high speed, and the messages go across the bus to their intended recipients, who can read the messages back.

They all do their own part: processing images, doing detection, doing classifications, building a 3D model of the world in real time and maintaining that model as the vehicle moves, localization, all sorts of decision-making, planning the robot behavior.
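
To make the pub/sub idea concrete, here is a minimal Python sketch of the pattern, with a toy in-process bus and a hypothetical topic name rather than Zoox’s actual high-speed implementation:

```python
from collections import defaultdict

class MessageBus:
    """Toy in-process pub/sub bus: components register callbacks on named topics."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every component subscribed to this topic.
        for callback in self._subscribers[topic]:
            callback(message)

bus = MessageBus()
# Hypothetical topic name; a planner-like component reacting to perception output.
bus.subscribe("/perception/tracked_objects", lambda msg: print("planner received", msg))
bus.publish("/perception/tracked_objects", {"id": 7, "kind": "pedestrian"})
```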

So, we have a lot of data, and what should we do with it? Let’s capture it somehow. Now, in a traditional server environment, you often end up doing log collection, so your servers are writing, maybe, just to standard out, and it’s just dumps, and you’re trying to write filter scripts or pattern matching to pull out interesting logs and route them through your log stack.

Maybe your servers are better, your systems are better, and they’re writing JSON, and that’s really nice because it’s much easier to parse and you know what you have. But typically, it’s a little bit fuzzy and you’re trying to extract the nuggets out of a stream of raw text, line by line.

But, what we have is a bit different, because what goes over the bus in our system are actually different kinds of structured messages. Most of them are protocol buffers. I don’t know if you’re familiar with Google protobufs.

Some of them are ROS messages, from the older ROS-based open source platform that we used. And, we also use DDS messages, which are an OMG standard. So, those different kinds of systems use different messages depending on where they came from and what they’re being used for.

So, in a lot of ways, taking these messages that are already compact and concise, and already self-contained and parsable, we can just write them to disk in some kind of format, and we already know how to deal with them. We don’t have to try to parse them. We’re just recording. So, at this point, our recording is more like a flight recorder in an airplane.
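
A minimal sketch of that “just record it” idea, assuming simple length-prefixed framing (not Zoox’s actual on-disk format): the already-serialized message bytes are written as-is and read back without any text parsing.

```python
import struct

def write_message(fh, serialized_msg: bytes) -> None:
    """Length-prefix an already-serialized message (protobuf/ROS/DDS bytes) and append it."""
    fh.write(struct.pack("<I", len(serialized_msg)))
    fh.write(serialized_msg)

def read_messages(fh):
    """Yield the raw message bytes back out, one by one, without interpreting them."""
    while header := fh.read(4):
        (length,) = struct.unpack("<I", header)
        yield fh.read(length)
```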

So, we have to control when we want to record. We don’t really want to be recording all the time the car is driving. So, as part of the software operator workflow, the team that’s going out on a mission to do a drive will control the setup of which software components are going to launch for that drive when they start recording.

And then, during the recording the operators will be taking notes, and maybe signaling that something went wrong and writing up a little write-up about it. So, all that information can be packed into the messages, recorded, and saved. They’re also controlling the recording itself, when we begin and when we end, so that’s important.

So, let me show a bit of a scene here. This is what the operators see when they are driving, not the driver but the software operator, and you can see they’re turning on different things. So, they turn on lidar, turn off lidar, the planner information. All of this is a visualization for them to see what the vehicle, what the planner, what all the AI components are actually doing and thinking.

So, it’s a visualization of the messages themselves. The operators can turn them on and off because they don’t always have to see everything. They usually don’t need to see the lidar; they mostly just want to see what the vehicle’s planning on doing. But we record everything. So, all of the data that’s going across the bus will be recorded regardless of what they do with the UI.

These are really fun to watch. I don’t know.

And, you can see the little flags with letters, like Cs or [inaudible 00:09:54] ones. Those are other agents in the scene, and we label them: are they other cars, or are they pedestrians? And, it also shows the modality they were sensed in: vision, lidar, radar. Oftentimes we’ll pick them up with two or three.

But, in the cases where we see things with only one modality, well, that’s a good argument for having more than one. If we were just relying on the camera, there are a lot of things we wouldn’t pick up that we were able to see with the other modalities.

So, what is it that we are collecting? All the messages go across the bus. We don’t collect absolutely every message, because some of them are really big, but we do capture most of them. The main thing is that we capture interesting decisions and lower-bandwidth kinds of perception information. So, localization, the 3D model, prediction, planning, all of those are very important messages.

They’re not particularly huge, so we capture all of that. Sensor input we do reduce a little bit, spatially and temporally. And, for the camera data, we’re capturing stills, but we also capture the video flow. We compress it with H.265, because it’s just too big to record raw video from 16 cameras. That’s just a lot of data.

But also actuator control, and other metadata like maps and routes. Where did the vehicle go? What map was it using? What did the operators do? What mode were they in? What switches did they have on and off? All of that we can collect.

So, we have a lot of data that’s flowing. We’re collecting it, in process, from the bus. In theory, you could try to transmit it up over LTE to the back end. That would be cool if we had that kind of bandwidth. But, in practice, we don’t, so we write it to disk locally, and we’ll drive around collecting all the data on the runs, and when the mission is done, we’ll take it back to our base station and upload it somehow.

So then, how do we want to do that, and in what format should we record it? On the vehicle we’re writing to disk. Because we started out using the open source framework called ROS, the Robot Operating System, we originally used the native format it had, called bag files, which are an indexed database file that you can write messages to. That’s pretty handy, and a lot of tools work really well with it, but it tends to write everything into one file, and the index is built afterwards, after the fact.

So, when we’re writing, writing, writing for the kinds of missions and driving we want to do, you get up to almost a terabyte of data after driving that much. Those files are so big that they’re completely unwieldy and unusable once we try to move them back to the back end.

Copying and dealing with files of that size was just really a problem. So, we changed the way we recorded and took advantage of the basic nature of the messages. The messages all have types. So, protobufs have types and we know the type information for each message. And, they’re also published on topics.

So, like a typical message bus, there’s a topic that is the label of what the stream of messages is all about. And, those topics are named nicely, and then they’ll have timestamps. We have a very accurate clock, so we know when the messages are being published and when they’re being written to disk. So, we can timestamp them and make use of that as a way of organizing it.

So, what we did is invent our own format that we call CHUM. It stands for chunked messages. It’s a funny name. Some people really hate it, but it’s funny and it caught on. The idea of this, I think of it like a strip chart recorder, if you’ve ever seen those old-school ones with all the pens that are wiggling and writing on the paper. It’s sort of the same thing.

So, each topic is being written to its own file, and at the end of every minute we close the file, flush it, and then start the next file in the next directory. So, there is a directory tree with files at the leaves, and the directories are all based on time. And then, the files are the topics for that one-minute window.

So, you have this one minute where you’re writing the topics, and then the window closes. You write, and then move on to the next one. So, the intersection of a topic and a minute gives you a chunk. It works really well because we can write continuously and, in theory, you could be uploading the data from an hour ago at the same time you’re writing new data. We don’t actually do that in practice, but, in theory, we could, because it’s just this continuous flow of writing.
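
A rough sketch of how a topic plus a timestamp could map to a one-minute chunk file; the directory naming here is an assumption for illustration, not the real CHUM layout:

```python
from datetime import datetime, timezone

def chunk_path(root: str, topic: str, timestamp_ns: int) -> str:
    """Map a (topic, timestamp) pair to its one-minute chunk file."""
    t = datetime.fromtimestamp(timestamp_ns / 1e9, tz=timezone.utc)
    minute_dir = t.strftime("%Y/%m/%d/%H/%M")        # directories are time-based
    topic_file = topic.strip("/").replace("/", "_")  # one file per topic per minute
    return f"{root}/{minute_dir}/{topic_file}.chum"

# e.g. /data/run_0042/2019/07/15/19/30/perception_tracked_objects.chum
print(chunk_path("/data/run_0042", "/perception/tracked_objects", 1_563_219_000_000_000_000))
```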

So, we did invent this. This is our own proprietary thing, but we built it on top of open source, lower-level libraries, because we don’t want to reinvent everything all the way down. We built it on top of SSTs, the SSTables, sorted string tables, which are the low-level storage format that Google’s LevelDB database builds on top of.

And, it works really well because SSTs are used in that kind of database as an append-only write system. So, you’re always writing to the end, writing to the end. And then, those little immutable written sequences are wrapped up in an in-memory read database, in effect.

So, we use the same thing. We write the records as just two fields. The timestamp in nanoseconds: when was this record written? And then, the message itself is the rest of the payload, with a little bit of metadata to describe what type this message was, and whether there is any extra data that needed to go with it. Metadata for the message that maybe came over the bus but wasn’t in the message, it was about the message.

Things like the published time versus the written time, and a few other things like that. And, we can always add more extra data as we need to, out of band from the actual message.
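
As a hedged illustration of that record shape, here is one way a timestamp key plus an out-of-band metadata header and the untouched payload could be encoded for an SSTable-style append-only store; the field layout is assumed, not the actual CHUM encoding:

```python
import json
import struct

def encode_record(write_time_ns: int, payload: bytes, metadata: dict) -> tuple[bytes, bytes]:
    """Encode one record as a key/value pair for an append-only SSTable-style store.
    Key: the write timestamp in nanoseconds, big-endian so keys sort in time order.
    Value: a small out-of-band metadata header plus the untouched message payload."""
    key = struct.pack(">Q", write_time_ns)
    meta = json.dumps(metadata).encode()
    value = struct.pack(">I", len(meta)) + meta + payload
    return key, value

key, value = encode_record(
    1_563_219_000_123_456_789,
    b"<serialized protobuf bytes>",
    {"type": "perception.TrackedObject", "publish_time_ns": 1_563_219_000_123_000_000},
)
```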

So then, we need to get the data into our systems. The data’s on the car and we want to get it into the cloud. In our case, cloud is AWS, plus on-prem, our own data centers, so it’s a mix, and we move stuff around depending on what makes sense for performance, cost, and latency to the systems that are going to need the data.

But initially, S3 is a really great storage system, we like it, and that’s our main hosting place where things go. So, what we like to do is take the runs that were created by driving, upload them, get them into the cloud, and get them cataloged with a little bit of metadata pulled out about what happened: when was the run, who was on the run, what vehicle was it on? And, get that little database built as a sideband to the CHUM storage itself.

But, there are some challenges in getting that much data up. It’s often really spiky. Our teams drive around during the day. Some days are really busy and we’re doing a lot of driving. They all tend to get back to the base station at similar times, and they want to turn around and get out again, so you’ll see these big spikes where everybody shows up.

We want to get the data off, and then move on. Other days are not nearly as busy. And this was a year ago; we do a lot more driving now than we did a year ago, but even then there were big spikes. So, there’s demand to get the data off the vehicle and get that vehicle back out again really quickly.

So, we tried a few different things. The vehicles are running, basically, a small server rack of compute, and it has a bunch of SSDs in there, just regular SATA drives we can plug in. That’s our storage. It works really well. The bandwidth writing to a big RAID array of SSDs is really good. That’s a straightforward problem.

It’s also an interesting thing to think about: maybe we can just plug in the ethernet and upload everything. That, in theory, should work. But, in practice, if you’ve got a bunch of vehicles waiting, trying to upload a terabyte of data even over 10 gig ethernet starts to take a while, because you’re never going to get 100% efficiency and there’s going to be something upstream that slows everything down.

So, the 10 gig ethernet ends up being our backup plan. If something’s going wrong with the drives, or if they don’t need the vehicle right away, we’ll sometimes use the 10 gig.

But, using the SSDs themselves was great. We have kiosks that we set up at the base stations, which are just very simple Linux boxes with drive bays: auto-mount the drives, look at what’s on them, upload them, clean them up. So, basically doing a safe move into the cloud. It’s just a dumb machine that’s pumping the data up into the cloud. It works pretty well, but there are a few things that can go wrong.
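
A minimal sketch of such a kiosk-style “safe move”, assuming S3 via boto3 and hypothetical bucket and key names: upload, verify the object size, and only then delete the local copy.

```python
import os
import boto3

s3 = boto3.client("s3")

def safe_move(local_path: str, bucket: str, key: str) -> None:
    """Upload a chunk file, confirm it landed with the expected size, then remove it locally."""
    s3.upload_file(local_path, bucket, key)
    remote_size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    if remote_size != os.path.getsize(local_path):
        raise RuntimeError(f"upload size mismatch for {local_path}; keeping local copy")
    os.remove(local_path)

# safe_move("/mnt/run_0042/2019/07/15/19/30/planner.chum",
#           "example-chum-bucket", "runs/run_0042/2019/07/15/19/30/planner.chum")
```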

I guess the most likely thing that goes wrong is the physical connection of those drives in and out of those drive bays. They’re really not made for that. They start to wear out, the pins bend, they don’t seat all the way. So, people plug them in and walk away and nothing will happen. Or, they’ll get unmounted, maybe not cleanly unmounted before they’re pulled, and then some of the boot information on them is corrupted, and they seem okay but they don’t work.

There’s lots of things going on. So, there’s a lot of operator and runbook stuff that has to happen where people know, “Okay, I did this, I did this. How do we address these things?” And, the other one too: the network could go down, and now we’re stuck, or we’re just clogged up with so many uploads going on at once that everybody’s blocked up waiting.

And, that usually happens when all the people that know what’s going on are on vacation and I’m the one in charge and I’m on pager duty and, “Oh. What’s going on? I’ve got to figure it out.”

So, of course, that’s fine, but it does happen occasionally. So, we try to find other solutions, right? How do you mitigate the problem of plugging these SATA drives in all the time? Our current solution, which is working really well, is USB 3.1 drives. These have really good capacity, so we can write everything we need onto one drive instead of having eight drives. They’re rugged, and the connectors don’t wear out nearly as fast as the SATA pins.

So, this has really been a big boon to the reliability of getting the disks uploaded without a lot of human intervention, right? Things tend to just work. And, we still have the network if we need it.

So, once we upload the data, we have backend systems that will scan the CHUM, look at what was there, look at the metadata, and then catalog it and break it up into what we call runs. Every sequence of recording, from when the operator got in and turned it on, and we’re recording data, we call that a run. So, beginning to end, that’s a run. It had a purpose, it had a driver, so all the metadata about that.

So, we catalog all of those, and for each one we can extract a bunch of metadata. So, we look at everything that’s related to the… Let me try my little pointer. Ooh. Yeah. Okay. Look at that.

The metadata for the run tells you everything about the run: the time, what mode it was in, the vehicle, the purpose of the run, and where it currently is in the pipeline of work. And, we draw a little picture to show where it was and where it happened. So, this is five blocks from here. That’s where we mostly drive, up off Broadway and north from there.

We also scan for… This is still working pretty well… incidents that happen. So, some messages will contain metadata about things that happened: either the operator took some notes, just arbitrary notes, because they noticed something that somebody should look at.

Or, things that were an issue, like the driver had to disengage because the vehicle was moving, maybe, too far to the edge, or they didn’t know if it was going to react right to something that was happening.

So, those get pulled out and cataloged, and we actually put those, we call them triage events, into JIRA, which is an abuse of JIRA. We beat it up a lot, but it works really well, because it ends up fitting nicely with the developer workflow and releases of the software, and we can say, “Okay, what issues are fixed?” And, they can group them together and create meta-issues that are then representative of how to reproduce the bug.
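
As a rough sketch of that flow, a triage event could be turned into a ticket with the standard JIRA client library; the project key, issue type, and event fields below are invented for illustration:

```python
from jira import JIRA  # third-party 'jira' client library

def file_triage_event(client: JIRA, run_id: str, event: dict):
    """Turn a disengagement or operator note pulled out of a run into a JIRA ticket."""
    summary = f"[{run_id}] {event['category']}: {event['note'][:60]}"
    description = (
        f"Run: {run_id}\n"
        f"Timestamp (ns): {event['timestamp_ns']}\n"
        f"Operator note: {event['note']}\n"
    )
    return client.create_issue(fields={
        "project": {"key": "TRIAGE"},       # hypothetical project key
        "summary": summary,
        "description": description,
        "issuetype": {"name": "Bug"},
    })
```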

So, we have all this data, it’s all cataloged and ready to go. What are we going to do with it? I think about the data in a couple of different ways, and I keep rethinking what to call these, but the framing that works best in my mind is that we work with the data in two ways. One is message-oriented, where we just use the data as it was written, by components that are maybe even the same components running offline.

If we want to reproduce a problem, we can actually run a partial software stack. Maybe we’re running the planner and we feed it perception information from an actual drive, and the planner thinks it’s in the car driving and it’s making decisions.

So, we can use that for regression testing, or developers can use it to debug a problem on their machine. It’s actually hard to debug a problem in real time in a robot. But, if you can reproduce it perfectly, because you have the exact data that came into it, you can do that offline, reproduce it, fix it, write a regression test, and never break it again.
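
A minimal sketch of that offline replay idea, with a hypothetical records iterable and planner interface standing in for the real components:

```python
def replay_run(records, planner, start_ns=None, end_ns=None):
    """Feed recorded perception messages back into a planner component offline,
    as if it were on the vehicle, and collect the decisions it makes."""
    decisions = []
    for timestamp_ns, topic, message in records:
        if start_ns is not None and timestamp_ns < start_ns:
            continue
        if end_ns is not None and timestamp_ns > end_ns:
            break
        if topic.startswith("/perception/") or topic.startswith("/localization/"):
            decisions.append(planner.on_message(topic, message, timestamp_ns))
    return decisions

# A regression test can then replay the exact run that triggered a bug and assert
# that the fixed planner no longer produces the bad decision.
```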

And then, the other way we look at the data is analytic. I think “analytic” is a good way to describe it: asking questions after the fact, where the data is there but not really organized to answer your question specifically. We’re not writing a record that answers your question, because you thought of the question later.

So, it’s more like traditional data analytics, asking for business information or other metrics and measurements, performance things that might come later on, where you’re aggregating across lots of data. So, I’ll go into a little more detail on these.

So, for the message data, this is the bulk of our traffic. We, basically, keep that in the data center, still in the same CHUM form. We may move it into different storage for hot-access data, and we make it co-located, more closely available to the big compute clusters that we have.

So, we have, effectively, big supercomputer clusters, lots of racks of Linux boxes stuffed full of GPUs. They’re doing all sorts of things like machine learning, running all of our tests, and all the other things that need data.

So, some of the things we can do, just by looking at some of the simple meta messages on the bus, have to do with timing. We can look at the timing between messages, and we can produce reports so that teams can see how fast their components are running. Are they keeping up with their allotted tick time, the cadence? Are they producing their output within the time window they’re supposed to? Performance. What kind of CPU?

So, once you can correlate that, and I think that’s the next slide, there’s a little more detail on performance. But, yeah, you can look at a lot of information directly plotted with some simple tools. Our core team develops these tools, which give a quick look at some of the primordial messages that are being used all over the vehicle.
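
A simple sketch of that kind of timing report, computed from the write timestamps of one topic’s messages; the thresholds and field names are assumptions:

```python
import statistics

def cadence_report(write_times_ns, expected_period_ms):
    """Summarize whether a component kept up with its allotted tick time,
    given the write timestamps (in nanoseconds) of its output messages."""
    gaps_ms = [(b - a) / 1e6 for a, b in zip(write_times_ns, write_times_ns[1:])]
    return {
        "mean_ms": statistics.mean(gaps_ms),
        "max_ms": max(gaps_ms),
        "missed_ticks": sum(g > 1.5 * expected_period_ms for g in gaps_ms),
    }

# e.g. cadence_report(planner_timestamps, expected_period_ms=100) for a 10 Hz component.
```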

But, we can also cross-reference some of the same timing information in the messages with CPU behavior, standard CPU analysis and dumps, so that we can correlate CPU usage and flame graphs with timing, and delays, and what was happening where. Because lots of cores are running, they’re all doing different things, they’re all sending messages, it’s really tricky to debug unless you have really good tools to dig in and correlate. It could be causality: what message was sent, and what thing triggered, maybe, a big CPU spike in another component.

And, of course, we use the data that we collect. All of the data from the lidar especially, and probably radar too, and vision, are aggregated together over lots and lots of recordings to build localization information. So, using GPS and other systems combined with what we see in lidar, and averaging that out over a long process, we can create very, very accurate digital maps that tell us where things are, aligned to each other within a centimeter or two, over the whole city area that we drive.

And, playback. We can also take those messages that we recorded and play them back, and effectively render them in different ways. I’ll run the movie in a second, but what you can see on the left is the camera capture, with an overlay put on top of it. All the information for the overlay, like where the boxes were going to be in that space, and which sensor modalities picked up the different agents, was all recorded.

All of the information for the 3D imagery on the right, all those agents and what kind they were, and what all their plans were, was also recorded, and then overlaid on top of the map that we run on.

So, this is just one of our published little video montages showing what the vehicle’s doing, and we can lay out these video generations based on whatever the teams need. This is one we built mostly to show off, so it’s fun to watch. You can see, the pink agents are pedestrians, the blue agents are vehicles, and orange is bicycles or scooters.

And, on the right you can see the planner. On top, it shows the signals indicating what the vehicle is intending to do, if it’s going around double-parked cars, and the bottom is the same but from the top view. The red gate shows that it knows it’s going to stop there. And then, the green lines are the prediction vectors coming out of every agent, so we can see the prediction of where we think the pedestrians or other drivers are going to go. Okay.

And, one of the things I mentioned earlier was, when something happens, a disengagement, generally the operator will have taken notes, and the disengagement would be recorded in a message. We will bring that around, gather that information, and then save it in JIRA. And then, the teams can take a look at the JIRA tickets.

QA might look at these first, or engineers might look at them right away, and decide when they want to go look at it. So, they have everything they need to know here, all the metadata they need to go out and grab the CHUM and reproduce the problem that occurred, and they can look at that.

We can categorize it and we can ticket it, we can play it back and see what happened in different views, and we can generate different looks. We have ways of laying out windows where we can watch what happens and look at metrics and timing of messages, and all sorts of diagnostic information that happened at the same time.

And, we can save those and snip them off and turn them into tests that we run over, and over, and over again to validate things. We can also take those log tests, tests based on real data, and, partially automated and partially by hand, craft tests that are more synthetic, which we can then use to reproduce the problem in the same way, but then add some additional fuzzing. Let’s say, what if that pedestrian walked a little faster, or a little slower, or went to the right a little bit more? So, we can expand and multiply out the cases that a particular test can cover.
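
A small sketch of that fuzzing idea, multiplying one hand-crafted scenario into several variants by scaling pedestrian speed; the scenario structure is invented for illustration:

```python
import copy

def fuzz_pedestrian_speed(base_scenario: dict, factors=(0.75, 1.0, 1.25, 1.5)):
    """Yield several variants of one scenario with scaled pedestrian speeds."""
    for factor in factors:
        scenario = copy.deepcopy(base_scenario)
        for agent in scenario["agents"]:
            if agent["kind"] == "pedestrian":
                agent["speed_mps"] *= factor
        scenario["name"] = f"{base_scenario['name']}_ped_x{factor}"
        yield scenario
```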

Yeah, and I think I already mentioned a lot of these: developers debugging, having all of the data available for people to just debug things on their desktop, or to run them by writing jobs in our job queue. Machine learning training, of course, and labeling for identifying things.

And then, finally, the last bit, which I’ll cover just a little, is analytics. For a long time we would just run scripts, pull the data out of CHUM, and directly look at it with Python. We can do analytics that way, and it works really well for ad hoc queries. If an engineer wants to look at a lot of data and just pull out little threads of information across lots of topics over a long period of time, they can do that using some nice Python analytics tools that have direct access to the CHUM data. But we wanted to go a little more formal and leverage some of the tools that are out there for data analytics.

So, we now proactively ETL a bunch of the CHUM data that we’ve found to be interesting and put it into a columnar store. We’re using Redshift now; it could be whatever works in the future. But, given that we have this common set of data that’s very useful for lots of different work, we can get that into the columnar store on a regular cadence at ingest, and then use it to produce all sorts of different kinds of reports.
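
A rough sketch of that kind of ETL step, projecting a few fields out of decoded messages into flat rows and staging them as CSV for a columnar store; the field names and table are assumptions, not the real schema:

```python
import csv
import io

FIELDS = ["event_time_ns", "run_id", "vehicle", "category", "lat", "lon"]

def flatten_events(records):
    """Project interesting fields from decoded messages into flat, columnar-friendly rows."""
    for timestamp_ns, msg in records:
        yield {
            "event_time_ns": timestamp_ns,
            "run_id": msg["run_id"],
            "vehicle": msg["vehicle"],
            "category": msg["category"],
            "lat": msg["pose"]["lat"],
            "lon": msg["pose"]["lon"],
        }

def to_csv(rows) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# The CSV would then be staged in S3 and loaded into Redshift with something like:
#   COPY triage_events FROM 's3://example-bucket/etl/triage_events.csv'
#   IAM_ROLE '...' CSV IGNOREHEADER 1;
```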

These are intentionally blurred out a bit, but the idea is, it can tell us how well we’re doing on a particular route in San Francisco. Let’s say we’re driving a certain route that’s challenging, and we want to say, “Well, how well are we doing on that this week? With each release over time, are we getting better?” Or, certain situations occur: jaywalkers, or bicycles that like to run stop signs. Where do we find those?

We can then track that, gather that information out of the raw data, make it available, visualize it, and track it on maps, creating heat maps of where things are.

We can also use it to generate our DMV report that we have to do every year. And so, that’s basically it. Thank you.

 
