How We Improved our Xray DB Sync Process [swampUP 2021]

June 27, 2021

< 1 min read



You may hear the idea of liquid software and as DevOps / Developers we would like to configure/develop everything super quickly. Learn More: https://jfrog.co/3gRFf8h

So, when something like a “simple” process of DB Sync comes up, we have to think outside the box and find interesting solutions to reduce the time it takes from 16 hours to 2 hours.

Speakers

Batel Zohar

Enterprise Solution Lead

Batel Zohar is a Developer Advocate for JFrog and has a background in DevOps support engineering, web development, and embedded software engineering. Prior to this, Batel served as an Enterprise Solutions Lead on a dedicated team that accompanies and assists large customers through the architectural implementation of the JFrog platform. She loves her dogs, plays guitar, and is a fan of Marvel’s movies.
Noam Shemesh

    Noam Shemesh

    CTO Office Architect @ JFrog

    Noam has over 15 years in tech, working in software architecture, development and leadership. Noam loves experiencing technologies from frontend to operating systems and optimizing everything from distributed multi-nodes to a single query.

    Video Transcript

    Hello, everyone, welcome to swamp up.
    Today we are going to talk about the excellent DB sync process improvements.
    We are going to explain what is x ray, what is the DB sync process in x ray and what we need to improve it.
    Together with me is Batel Zohar,
    she works with JFrog for four years now,
    she is a developer advocate and before that she worked in the support team.
    So it means that says she worked with the customers and with the developers,
    so she’s familiar with the customers environments and with the code itself.
    And Noam Shemesh is my amazing partner for today,
    he is also working for the last four years at JFrog.
    He is part of the CTO team
    and he has an amazing knowledge from the front end site to the ED side.
    And he also writing some code. So thank you. Noam,
    Thank you Batel, let’s start.
    So first of all, what is JFrog X-Ray?
    X-Ray is a tool for dev sec ops that provides you the information about the vulnerabilities,
    compliances and licenses inside your organization.
    The artifacts are stored in artifactory and they are part of the JFrog platform so we have the full solution
    that can secure our binaries and our artifacts.
    So how can we install JFrog X-Ray?
    JFrog X-Ray supports different installation types like RPM,
    Docker, Kubernetes and we may add more in the future.
    We also have two different versions
    for on-prem and SaaS solutions which is handling the DB sync process differently.
    So remember this, we’re going to talk about this and what are the differences in the next slide.
    So what is X-Ray DBC?
    To scan vulnerabilities
    and verify licenses compliance
    X-Ray uses a big database that contains many public components
    vulnerabilities and licenses.
    When installing the X-Ray for the firs time
    there is an initial DBC in the download and it stores all the public data that is going to X-Ray
    which is kind of
    a central knowledge for public information.
    And yes, today the compress site of the database is approximately 6-7 gigabytes on
    and our data analysts are working very very hard to make more and more vulnerabilities and components
    to make sure that we have much more data to find in case that we have any security vulnerability
    so it’s growing every day.
    Yes, and after we take the 6 gigabytes of the compressed data
    we need to open that
    to extract it and we find
    45 gigabytes of
    data that we need to insert to the database.
    And this data is
    coming into each one of the installations
    in 100 megabytes zip files.
    And each zip file
    has JSON files
    and the number is not constant and we’ll see in a moment why.
    Each JSON file
    has 2000 objects
    when objects are either components or vulnerabilities.
    And
    the size of the object is not fixed
    so that’s the reason that
    the JSON file’s size is different
    is not fixed as well, and zip files can be
    either with one JSON file
    or with 150 JSON files
    and we have both cases.
    And before we did any improvements, the minimum requirement
    it took 15 hours to process
    the full DB sync
    process
    and it’s a lot, it’s a lot of times
    and we couldn’t allow it to the customers.
    It’s a lot of time to process.
    So we tried to think about how to solve this problem and we had a few solutions, like the following:
    the first solution was to distribute the database that’s already there
    as database to [inaudible], instead of free creating it on the user end.
    Yes, but if we go this way, we might need to have different versions of the data sets
    which data set is going to match the prosperous version
    and we support a lot of versions because we don’t…
    we just tell the customers that they need to install
    for example, prosperous version 9.6 and further
    and so, for this reason, it means that for each version of prosperous, we need to have a different datacenter
    and it might be a lot
    because we don’t know what are the versions of the customers.
    Of course, we can limit the version to PR supporting but we
    didn’t want to go this way, at least not at the beginning.
    Yes, and it will also make the installation much bigger, right?
    It will weigh more and…
    Yes, the install will be much bigger because we don’t know what version it’s going to be and we need to
    maybe we need to determine the version after this and have the
    binaries ready for the customers to download
    Just like a DB sync, but we still need to add a lot of images.
    Alright, so maybe we can just use an online server to get information
    on components, vulnerabilities and everything we need for DB sync?
    Yes, this is a great solution. Actually’ this is what we do in the cloud.
    So you know this is a great solution.
    But
    for on-prem customers, it’s going to be tricky because
    we need to have very low latency to get the information on the public
    components and the public vulnerabilities
    and if we need to have this low latency, it means that we need to have servers everywhere that provide the same information
    who practically built SCDN in order to just provide this
    while there are other options so, it doesn’t mean that we didn’t
    we are not going to head this way, but
    we wanted to avoid this as long as possible
    and we could find something else, so
    we avoided this at the end.
    Alright, so maybe a [inaudible] database for components that are being used less
    then distribute the common components, kind of mixing.
    Yes, so exactly, this is kind of the
    middle ground because we can take the common
    components and put them locally and then
    have only, you know, a gigabyte of data or something like that
    and all the other more exotic components can
    be downloaded by request.
    So we could go this way.
    But if we are doing that
    it means that
    it’s going to be harder to support offline on-prem installations and
    this is something that we also support
    because we can’t tell the customer, just install X-Ray
    go with another command, download the files and then install your data into your database with a different command.
    And you can do that on two different servers
    so it means that you don’t have to be connected to the internet
    and we could say that
    okay, we are keeping the same process that we have right now and for this type of customer to have
    15 hours time
    to process the DB sync
    but we found a better way because we found a way also to solve the issue for the customers that are doing the offline sync
    and we’re going to talk about this now.
    Eventually, we went with the incremental and we will see that in how we improve the DB sync
    in this presentation
    so let’s talk about it.
    So we started our investigation
    in the first tool that’s called Pprof.
    Pprof is a CPU performer developed by Google to analyze multi-threaded applications.
    Pprof comes with the ability to generate visualization of a problems on a graph.
    The result showed that X-Ray was spending a lot of time on the database operation.
    And here is an example for Pprof graph and what it looks like.
    Yes and this is a very good example because this is something that we use a lot of times
    in order to find issues in the DB sync process.
    This is flame graph, this is something that
    probably you can find all other languages
    you just need to find the right tool
    and there are
    also there are other features for Pprof so
    make sure to check it out.
    And you can see here, for example, that the two ad user issue
    is doing three calls to the database.
    One, two, three.
    And because of that, we are not sure that that’s what we want because
    maybe we need to do three calls to the database but
    maybe we just want to add a user issue, why do we want to do so many calls? So
    we need to check this out.
    We don’t know if there is an issue there but we need to check this out.
    Also you can see on the right side that for the add licenses issue we have
    many more calls, we have
    8 or 9 calls for the same
    from another ad.
    And
    probably there is an issue, we wanted to check this out.
    And this is how we started, we saw there is something that’s doing a lot of work
    and after we found it, we went to the call and tried to think
    if all the calls are necessary.
    This is an example, of course, and
    one thing that we found is that we do a lot of
    single row inserts into a database.
    And while we could do many batch inserts to a database.
    So, for example, if we have components so we can insert one component but we can’t
    also consolidate a lot of components and insert them in one batch.
    In order to do that, we took
    we added a channel that gets
    a lot of components from
    a lot of places
    while they have a similar context
    so it means that it’s not really global context
    because
    we have a few channels that get the information and do it in parallel
    so for example, for one file we can create one channel and for another file we can create another channel.
    And we take this channel and drain it
    when
    half a second passed or 100 entries are
    ready to be inserted to the database
    and this worked very well because
    it made sure that the data was still working, because
    we do that pretty frequently.
    And also, pushed a lot of
    data at once, so we did many less round trips to the database.
    So this is the first improvement but we couldn’t do this improvement without
    without doing one thing that is very important, we
    had to separate the type of the DB sync.
    We started we had before
    before we started to improve the process
    is one DB sync, we didn’t have a separation between the daily DB sync and the initial DB sync so it means that
    we did a lot of improvements for the data DB sync because this is something that we do every day
    and we didn’t invest so much time in the initial DB sync because
    it wasn’t even separated, we didn’t
    we couldn’t do specific improvements, only for the initial DB sync so
    first thing was to separate them.
    And then we could do the first improvement
    which is separating the inserts
    and now, we can do another improvement
    which is optimistic insert
    where instead of just
    selecting and check if we need to install or update a component
    we can try to insert
    and if we have an issue, we are going to fall back to the update.
    But this is
    we can do that also for the daily DB sync but with the daily DB sync there is a much higher chance that we are going to update
    and it could take more time.
    So we wanted to do that for the
    because it’s going to take more time because we need to do insert and then select and then update instead of just select and update.
    But if we know that the database is empty, we can do the insert
    and be 99% or even more than that, sure that it’s going to succeed
    unless there is an error that is not related to the existing of the data.
    So this is the second improvement, the third improvement
    was to
    add more types of workers
    the workers are something that takes the data and processes it in parallel.
    We add entities workers in the database and
    entities workers that take the data and push
    into the database with a single insert
    like the first improvement.
    And we wanted to add more workers because
    when we saw the graph of the CPU, we saw that
    it’s kind of a sinus graph which
    working very hard at the beginning and then
    when it got to the end of the JSON file
    it stopped working because it’s only processing one entry.
    One component.
    And because it does only one entry
    it doesn’t use the CPU, it doesn’t utilize the CPU as we wanted it to utilize.
    So one improvement here is to add JSON file workers.
    And then we can take a lot of JSONs
    in parallel and add in parallel entities
    in these JSONs.
    But here we have another problem, if you remember we said at the beginning that
    a zip file can contain
    one JSON, or it can contain
    150 JSONs so
    it’s good that we have the JSON file workers but it’s not enough because
    the zip file workers can
    help us do many zips in parallel
    and here we had a different issue, because
    we didn’t know at the time of the processing
    if we can do the zip file workers in parallel, if we can run them in parallel
    because for daily DB sync, the zip files must be synchronized, they must
    we must insert the previous day of data before we insert the current day of data.
    We don’t want to lose data or something, right? We want to make sure that we have
    all the installations. -Yes.
    For example, if we have an entity from yesterday that
    has, for example, if I updated entity A
    and if I insert it, so entity A
    on the day before and today
    I’m going to
    do another change on entity A
    then
    what will happen if I’m going to process the current day before the previous day is that I’m going to insert new data
    today, and then
    process the previous data and update the new data
    so it means that I’m ending with
    all-data in the database even though I shouldn’t.
    So
    for the daily DB sync, we worked only with one worker on the zip file and
    for the initial DB sync we could work with more workers, with two or more workers.
    So this is a big change that we did and we could see when we did it
    we could see the CPU graph that is a straight line
    which means that the CPU is working all the time
    in order to process the DB sync
    and
    it works very well, so…
    Let’s see.
    So actually the next improvement is two improvements.
    We push them into one slide because
    this is post-processing of something but there is no relation between them
    so one post-processing is the post-processing of the DB sync itself
    when we did the post-processing of the DB sync, we wanted to create the indexes
    instead of creating the indexes during the install.
    And this is another thing that we could do, only because we did a separation of the daily
    DB sync and the initial DB sync.
    What I mean here is that, if you go to the process
    cheat-sheet on how to insert a lot of data into the database fast
    you will see that one of the first recommendations is to
    create the indexes at the end and not during the insert process because
    the building of the index is much faster when you finish the data insert.
    And another thing that is related
    is that we
    wanted to create the indexes in parallel
    this is a great improvement that we saw, but
    it’s supported only for process that is newer than, I think version 11.
    And after we did that,
    and we had to check what is the version of the database so it means now that we are
    doing different work for different versions, so
    we have other places that we do that but it means that
    the DB sync is going to be faster for
    newer version of prosperous.
    And the other type of the post-processing which is unrelated to the first one
    is the post-processing when we do an insert of a component or a vulnerability
    when we do that for the daily DB sync we are going to do a reverse lookup
    and see if we have a vulnerability that we…
    if we need to update something according to the new vulnerability or the updated vulnerability
    the same goes for the component.
    And if we need to do that, we need to run
    another thing but we need to check if we need to do
    and this check
    is something that is working with a lot of us, but we know that the database is empty
    so there is no reason to do this check.
    And this is another great improvement that we did only because we did the separation between the daily DB sync
    and the initial DB sync.
    Alright, so our graph looks like that.
    First of all, we have to migrate to prosperous because before that
    we worked with a different database, right? We have Ongor or something like that.
    Then we improve the queries
    to re-use [inaudible] objects and you can see that easily the graph just decreased a bit.
    Then we also added sub-component stables/inserts
    as the first improvement that we talked about at the beginning, so we are just using batch, inserting one by one.
    And it decreases to 10 hours.
    Which is good, but it’s not good enough.
    And then we go with the optimistic insert before update
    and we also added the components and vulnerability segregation and it took only 4 hours
    to get a full DB sync
    which is a huge improvement
    from 16 hours, almost 16 hours.
    And then we also had JSON and zip file workers,
    that’s working great and it helped us with the CPU like Noah mentioned.
    And we’re also dropping a few indexes before initial DB sync
    so now we are in only two hours to get a full DB sync
    so we added the post-processing of the pull out indexes
    and post-processing when using a component
    and you can see that it’s taking us only 56 minutes to get a full DB sync
    which is an amazing improvement from 15 hours.
    and it’s not only that, whenever we’re using a better [inaudible]
    and we are archiving 9K of AI ops and read and write faster
    so we’ll get better solutions and it takes us only 36 minutes
    to get a full DB sync which is an amazing improvement and you can see that it’s not really related to the software
    only for the [inaudible]
    and then we’ll get a better solution,
    super fast DB sync,,
    and I guess that we are all really happy, we have more secure environment
    and I think it’s a great time for questions, thank you very much guys.
    Thank you, thank you everyone for joining.
    You can find us
    on Twitter or, our emails here
    here and on LinkedIn, feel free to contact us, ask us any question, drop any comments.
    Yes, exactly, We are here for you guys, thank you very, very much for joining.
    Feel free to reach out.
    Thank you.