How we improved our Xray DB sync process

Batel Zohar | Developer Advocate and Noam Shemesh | Software Architect

You may hear the idea of liquid software and as DevOps and Developers we would like to configure and develop everything super quickly.

So, when something like the “simple” process of Xray’s DB Sync comes up, we needed to think outside the box and find interesting and innovative solutions to reduce the time it takes from 16 hours down to 2 hours.

Video Transcript

Hello, everyone, welcome to swamp up. Today we are going to talk about the excellent DB sync process improvements. We are going to explain what is x ray, what is the DB sync process in x ray and what we need to improve it. Together with me is Batel Zohar, she works with JFrog for four years now, she is a developer advocate and before that she worked in the support team. So it means that says she worked with the customers and with the developers, so she’s familiar with the customers environments and with the code itself. And Noam Shemesh is my amazing partner for today, he is also working for the last four years at JFrog.

 He is part of the CTO team and he has an amazing knowledge from the front end site to the ED side. And he also writing some code. So thank you. Noam, Thank you Batel, let’s start. So first of all, what is JFrog XRay?

 XRay is a tool for dev sec ops that provides you the information about the vulnerabilities, compliances and licenses inside your organization. The artifacts are stored in Artifactory and they are part of the JFrog platform so we have the full solution that can secure our binaries and our artifacts. So how can we install JFrog XRay? JFrog XRay supports different installation types like RPM, Docker, Kubernetes and we may add more in the future. We also have two different versions for on-prem and SaaS solutions which is handling the DB sync process differently. So remember this, we’re going to talk about this and what are the differences in the next slide. So what is XRay DBC?

 To scan vulnerabilities and verify licenses compliance XRay uses a big database that contains many public components vulnerabilities and licenses. When installing the XRay for the firs time there is an initial DBC in the download and it stores all the public data that is going to XRay which is kind of a central knowledge for public information. And yes, today the compress site of the database is approximately 6-7 gigabytes on and our data analysts are working very very hard to make more and more vulnerabilities and components to make sure that we have much more data to find in case that we have any security vulnerability so it’s growing every day.

 Yes, and after we take the 6 gigabytes of the compressed data we need to open that to extract it and we find 45 gigabytes of data that we need to insert to the database. And this data is coming into each one of the installations in 100 megabytes zip files. And each zip file has JSON files and the number is not constant and we’ll see in a moment why.

 Each JSON file has 2000 objects when objects are either components or vulnerabilities. And the size of the object is not fixed so that’s the reason that the JSON file’s size is different is not fixed as well, and zip files can be either with one JSON file or with 150 JSON files and we have both cases. And before we did any improvements, the minimum requirement it took 15 hours to process the full DB sync process and it’s a lot, it’s a lot of times and we couldn’t allow it to the customers. It’s a lot of time to process. So we tried to think about how to solve this problem and we had a few solutions, like the following: the first solution was to distribute the database that’s already there as database to [inaudible], instead of free creating it on the user end.

 Yes, but if we go this way, we might need to have different versions of the data sets which data set is going to match the prosperous version and we support a lot of versions because we don’t… we just tell the customers that they need to install for example, prosperous version 9.6 and further and so, for this reason, it means that for each version of prosperous, we need to have a different datacenter and it might be a lot because we don’t know what are the versions of the customers.

 Of course, we can limit the version to PR supporting but we didn’t want to go this way, at least not at the beginning. Yes, and it will also make the installation much bigger, right? It will weigh more and…

 Yes, the install will be much bigger because we don’t know what version it’s going to be and we need to maybe we need to determine the version after this and have the binaries ready for the customers to download Just like a DB sync, but we still need to add a lot of images.

 Alright, so maybe we can just use an online server to get information on components, vulnerabilities and everything we need for DB sync? Yes, this is a great solution. Actually’ this is what we do in the cloud. So you know this is a great solution. But for on-prem customers, it’s going to be tricky because we need to have very low latency to get the information on the public components and the public vulnerabilities and if we need to have this low latency, it means that we need to have servers everywhere that provide the same information who practically built SCDN in order to just provide this while there are other options so, it doesn’t mean that we didn’t we are not going to head this way, but we wanted to avoid this as long as possible and we could find something else, so we avoided this at the end.

 Alright, so maybe a [inaudible] database for components that are being used less than distribute the common components, kind of mixing. Yes, so exactly, this is kind of the middle ground because we can take the common components and put them locally and then have only, you know, a gigabyte of data or something like that and all the other more exotic components can be downloaded by request. So we could go this way. But if we are doing that it means that it’s going to be harder to support offline on-prem installations and this is something that we also support because we can’t tell the customer, just install XRay go with another command, download the files and then install your data into your database with a different command. And you can do that on two different servers so it means that you don’t have to be connected to the internet and we could say that okay, we are keeping the same process that we have right now and for this type of customer to have 15 hours time to process the DB sync but we found a better way because we found a way also to solve the issue for the customers that are doing the offline sync and we’re going to talk about this now.

 Eventually, we went with the incremental and we will see that in how we improve the DB sync in this presentation so let’s talk about it. So we started our investigation in the first tool that’s called Pprof. Pprof is a CPU performer developed by Google to analyze multi-threaded applications. Pprof comes with the ability to generate visualization of a problems on a graph.

 The result showed that X-Ray was spending a lot of time on the database operation. And here is an example for Pprof graph and what it looks like. Yes and this is a very good example because this is something that we use a lot of times in order to find issues in the DB sync process.

 This is flame graph, this is something that probably you can find all other languages you just need to find the right tool and there are also there are other features for Pprof so make sure to check it out. And you can see here, for example, that the two ad user issue is doing three calls to the database.

 One, two, three. And because of that, we are not sure that that’s what we want because maybe we need to do three calls to the database but maybe we just want to add a user issue, why do we want to do so many calls? So we need to check this out.

 We don’t know if there is an issue there but we need to check this out. Also you can see on the right side that for the add licenses issue we have many more calls, we have 8 or 9 calls for the same from another ad. And probably there is an issue, we wanted to check this out. And this is how we started, we saw there is something that’s doing a lot of work and after we found it, we went to the call and tried to think if all the calls are necessary.

 This is an example, of course, and one thing that we found is that we do a lot of single row inserts into a database. And while we could do many batch inserts to a database. So, for example, if we have components so we can insert one component but we can’t also consolidate a lot of components and insert them in one batch.

 In order to do that, we took we added a channel that gets a lot of components from a lot of places while they have a similar context so it means that it’s not really global context because we have a few channels that get the information and do it in parallel so for example, for one file we can create one channel and for another file we can create another channel. And we take this channel and drain it when half a second passed or 100 entries are ready to be inserted to the database and this worked very well because it made sure that the data was still working, because we do that pretty frequently. And also, pushed a lot of data at once, so we did many less round trips to the database. So this is the first improvement but we couldn’t do this improvement without without doing one thing that is very important, we had to separate the type of the DB sync.

 We started we had before before we started to improve the process is one DB sync, we didn’t have a separation between the daily DB sync and the initial DB sync so it means that we did a lot of improvements for the data DB sync because this is something that we do every day and we didn’t invest so much time in the initial DB sync because it wasn’t even separated, we didn’t we couldn’t do specific improvements, only for the initial DB sync so first thing was to separate them. And then we could do the first improvement which is separating the inserts and now, we can do another improvement which is optimistic insert where instead of just selecting and check if we need to install or update a component we can try to insert and if we have an issue, we are going to fall back to the update.

 But this is we can do that also for the daily DB sync but with the daily DB sync there is a much higher chance that we are going to update and it could take more time. So we wanted to do that for the because it’s going to take more time because we need to do insert and then select and then update instead of just select and update.

 But if we know that the database is empty, we can do the insert and be 99% or even more than that, sure that it’s going to succeed unless there is an error that is not related to the existing of the data. So this is the second improvement, the third improvement was to add more types of workers the workers are something that takes the data and processes it in parallel.

 We add entities workers in the database and entities workers that take the data and push into the database with a single insert like the first improvement. And we wanted to add more workers because when we saw the graph of the CPU, we saw that it’s kind of a sinus graph which working very hard at the beginning and then when it got to the end of the JSON file it stopped working because it’s only processing one entry.

 One component. And because it does only one entry it doesn’t use the CPU, it doesn’t utilize the CPU as we wanted it to utilize. So one improvement here is to add JSON file workers. And then we can take a lot of JSONs in parallel and add in parallel entities in these JSONs. But here we have another problem, if you remember we said at the beginning that a zip file can contain one JSON, or it can contain 150 JSONs so it’s good that we have the JSON file workers but it’s not enough because the zip file workers can help us do many zips in parallel and here we had a different issue, because we didn’t know at the time of the processing if we can do the zip file workers in parallel, if we can run them in parallel because for daily DB sync, the zip files must be synchronized, they must we must insert the previous day of data before we insert the current day of data.

 We don’t want to lose data or something, right? We want to make sure that we have all the installations. -Yes.

 For example, if we have an entity from yesterday that has, for example, if I updated entity A and if I insert it, so entity A on the day before and today I’m going to do another change on entity A then what will happen if I’m going to process the current day before the previous day is that I’m going to insert new data today, and then process the previous data and update the new data so it means that I’m ending with all-data in the database even though I shouldn’t. So For the daily DB sync, we worked only with one worker on the zip file and for the initial DB sync we could work with more workers, with two or more workers. So this is a big change that we did and we could see when we did it we could see the CPU graph that is a straight line which means that the CPU is working all the time in order to process the DB sync and it works very well, so…

 Let’s see. So actually the next improvement is two improvements.

 We push them into one slide because this is post-processing of something but there is no relation between them so one post-processing is the post-processing of the DB sync itself when we did the post-processing of the DB sync, we wanted to create the indexes instead of creating the indexes during the install. And this is another thing that we could do, only because we did a separation of the daily DB sync and the initial DB sync.

 What I mean here is that, if you go to the process cheat-sheet on how to insert a lot of data into the database fast you will see that one of the first recommendations is to create the indexes at the end and not during the insert process because the building of the index is much faster when you finish the data insert. And another thing that is related is that we wanted to create the indexes in parallel this is a great improvement that we saw, but it’s supported only for process that is newer than, I think version 11. And after we did that, and we had to check what is the version of the database so it means now that we are doing different work for different versions, so we have other places that we do that but it means that the DB sync is going to be faster for newer version of prosperous. And the other type of the post-processing which is unrelated to the first one is the post-processing when we do an insert of a component or a vulnerability when we do that for the daily DB sync we are going to do a reverse lookup and see if we have a vulnerability that we… if we need to update something according to the new vulnerability or the updated vulnerability the same goes for the component. And if we need to do that, we need to run another thing but we need to check if we need to do and this check is something that is working with a lot of us, but we know that the database is empty so there is no reason to do this check. And this is another great improvement that we did only because we did the separation between the daily DB sync and the initial DB sync.

 Alright, so our graph looks like that. First of all, we have to migrate to prosperous because before that we worked with a different database, right? We have Ongor or something like that. Then we improve the queries to re-use [inaudible] objects and you can see that easily the graph just decreased a bit. Then we also added sub-component stables/inserts as the first improvement that we talked about at the beginning, so we are just using batch, inserting one by one. And it decreases to 10 hours. Which is good, but it’s not good enough. And then we go with the optimistic insert before update and we also added the components and vulnerability segregation and it took only 4 hours to get a full DB sync which is a huge improvement from 16 hours, almost 16 hours. And then we also had JSON and zip file workers, that’s working great and it helped us with the CPU like Noah mentioned. And we’re also dropping a few indexes before initial DB sync so now we are in only two hours to get a full DB sync so we added the post-processing of the pull out indexes and post-processing when using a component and you can see that it’s taking us only 56 minutes to get a full DB sync which is an amazing improvement from 15 hours. And it’s not only that, whenever we’re using a better [inaudible] and we are archiving 9K of AI ops and read and write faster so we’ll get better solutions and it takes us only 36 minutes to get a full DB sync which is an amazing improvement and you can see that it’s not really related to the software only for the [inaudible] and then we’ll get a better solution, super fast DB sync, , and I guess that we are all really happy, we have more secure environment and I think it’s a great time for questions, thank you very much guys.

 Thank you, thank you everyone for joining.

 You can find us on Twitter or, our emails here here and on LinkedIn, feel free to contact us, ask us any question, drop any comments.

 Yes, exactly, We are here for you guys, thank you very, very much for joining.

 Feel free to reach out.

 Thank you.

 

Release Fast Or Die