Virtual Repository Sizing Best Practices

Patrick Russell
2019-09-05 22:03

    The sad story of "/repo" 

A common desired end state for Artifactory is a single URL for all artifact downloads. That way, whatever the client type (Gradle, Maven, P2, etc.), every build pulls the same binary files and build configuration stays simple.

It also means no more managing the contents of many Virtual Repositories: everybody uses one virtual that has "everything".

While this use case sounds easy to implement in theory, in practice a huge monolithic Virtual Repository causes many problems. Artifactory used to support this setup, but it no longer does.

A Virtual Repository should not aggregate more than 20 repositories unless you are prepared to encounter serious problems.

/repo – The Global Virtual

In Artifactory 2 and Artifactory 3, there was a Global Virtual Repository called "/repo". It aggregated all local and remote repositories found within the Artifactory application.

Things started off well enough, as Artifactory was initially a small and specialized application. Users would point their builds to "/repo" and know they would be able to pull any Java-based package in Artifactory.

However, problems soon began to manifest.

Builds would pull the wrong artifact from the wrong repository. The Maven Snapshot version would jump wildly depending on which path was resolved first. Downloads would take forever as the repository had to aggregate so many paths.

The repository was permanently disabled in Artifactory 4.7.3 because it conflicted with Artifactory's evolving usage philosophy. What lessons can be learned from this incident?

The Problem with Large Virtuals

Virtual Repositories aggregate other repositories. This is supposed to scale up smoothly, but problems can appear once a virtual aggregates more than 20 repositories (the size of the included repositories affects this number).

Every time a Virtual Repository serves an artifact, Artifactory has to work out which underlying repository the response should come from. Virtuals use this resolution algorithm:

1. Serve any files found in local repositories first
2. Serve any files found in remote repository caches second
3. Download and serve remote artifacts last

To determine which repository has the artifact, Artifactory always has to check each and every repository in the virtual. The "?trace" API call can be used to view this logic in action. 
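
The resolution order above can be sketched as a toy model in Python. This is only an illustration, not Artifactory's implementation: the Repo class, the resolve function, and the repository names are all hypothetical, and the sketch stops at the first hit, so its check count is a lower bound on the work a real virtual performs.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Hypothetical stand-in for one repository aggregated by a virtual."""
    name: str
    kind: str  # "local" or "remote"
    stored: set = field(default_factory=set)    # local files, or a remote's cache
    upstream: set = field(default_factory=set)  # files only on the remote endpoint


def resolve(virtual, path):
    """Return (repo name, source, checks performed) for the requested path."""
    checks = 0
    # 1. Serve any file found in a local repository first.
    for repo in virtual:
        if repo.kind == "local":
            checks += 1
            if path in repo.stored:
                return repo.name, "local", checks
    # 2. Serve any file found in a remote repository cache second.
    for repo in virtual:
        if repo.kind == "remote":
            checks += 1
            if path in repo.stored:
                return repo.name, "cache", checks
    # 3. Download from a remote endpoint last, then keep the file in its cache.
    for repo in virtual:
        if repo.kind == "remote":
            checks += 1
            if path in repo.upstream:
                repo.stored.add(path)
                return repo.name, "remote-download", checks
    return None, None, checks


virtual = [
    Repo("libs-release-local", "local", stored={"a.jar"}),
    Repo("jcenter", "remote", stored={"b.jar"}, upstream={"b.jar", "c.jar"}),
    Repo("central", "remote", upstream={"d.jar"}),
]
print(resolve(virtual, "a.jar"))  # found in the first local repository
print(resolve(virtual, "d.jar"))  # only on the last remote endpoint
```

In this model, a file that lives only on the last remote endpoint is found after every local repository, every remote cache, and every earlier remote endpoint has been checked, so the number of checks grows with each repository added to the virtual.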

The worst-case scenario from this algorithm description should be clear: What if the artifact is in the last remote repository in the Virtual list?

In that case, the download takes far longer through the virtual than it would when pulling directly from the remote repository. A huge Virtual also causes other, more subtle problems.

UI Performance Problems

Because the Artifacts browser has to show everything within a folder in Artifactory, virtual repositories need to perform a nested check to determine what's displayed.

[Example of a 2-second loading time with just a JCenter remote repository]

This means checking each remote endpoint as well as every local folder the Virtual aggregates. 

Expect a click on a large Virtual Repository folder in the UI to take upwards of 30 seconds, because in most cases the virtual has to complete many HEAD requests against multiple remote endpoints.
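
A rough back-of-the-envelope model shows how these delays add up. It is a sketch under loud assumptions: it treats the checks as sequential, borrows the 2-second figure from the single-remote example above as the cost of each remote endpoint, and uses an invented 0.01 seconds per local folder. The function name and parameters are hypothetical, and the results are illustrations, not measurements.

```python
def estimate_listing_seconds(local_count, remote_count,
                             seconds_per_local=0.01, seconds_per_remote=2.0):
    """Estimate the time to list one folder of a virtual repository.

    Assumes each aggregated repository is checked one after another and
    that every remote endpoint costs one HEAD round trip. Both default
    timings are assumptions chosen for illustration only.
    """
    return local_count * seconds_per_local + remote_count * seconds_per_remote


# One remote repository, as in the single-remote example above.
print(estimate_listing_seconds(local_count=0, remote_count=1))
# A large virtual aggregating 5 local and 15 remote repositories.
print(estimate_listing_seconds(local_count=5, remote_count=15))
```

Under these assumptions the remote endpoints dominate the total, so removing remotes from a virtual should shorten browsing far more than removing locals.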

Aggregation Issues

Virtual repositories for different repository types will have different logic handlers. For example, if two Maven Snapshot builds share the same path, the virtual serves whichever file comes first in the resolution order, which may be the "incorrect" one for a given build.

Swapping the repositories might fix this particular pathing issue, but it could break other builds using the same large Virtual.
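
The effect of resolution order on a shared path can be illustrated with a small Python sketch. The repository names, the artifact path, and the serve function are hypothetical, and the model only captures "first repository in the list wins", not Artifactory's actual Maven snapshot handling.

```python
def serve(virtual_order, repos, path):
    """Return (repo name, content) from the first repository that has the path."""
    for name in virtual_order:
        if path in repos[name]:
            return name, repos[name][path]
    return None


# Two teams publish different builds under the same snapshot path.
path = "com/example/app/1.0-SNAPSHOT/app-1.0-SNAPSHOT.jar"
repos = {
    "team-a-snapshots": {path: "build from team A"},
    "team-b-snapshots": {path: "build from team B"},
}

# Team B's build never surfaces while team A's repository is listed first.
print(serve(["team-a-snapshots", "team-b-snapshots"], repos, path))
# Swapping the order fixes team B but now hides team A's build instead.
print(serve(["team-b-snapshots", "team-a-snapshots"], repos, path))
```

Whichever order is chosen, one of the two builds is unreachable through the shared virtual, which is why reordering only moves the problem.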

Alternatives to large virtuals

The good news is that there is an easy solution to this problem: Use more specific, smaller virtual repositories.

Artifactory places few limits on the number of virtual repositories that can be created. Each development team can craft its own virtual to suit the needs of its build or project. This delegates the problem to the party most interested in maintaining the solution.
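
As a sketch of what per-team virtuals might look like, the Python below builds one small repository configuration per team and rejects any virtual that aggregates more than 20 repositories. The field names (key, rclass, packageType, repositories) are assumed to follow Artifactory's repository configuration JSON, and the team and repository names are hypothetical. The 20-repository check is this article's guideline, not a limit Artifactory enforces, and the sketch only prints the payloads instead of sending them anywhere.

```python
import json

MAX_AGGREGATED = 20  # guideline from this article, not enforced by Artifactory


def virtual_repo_config(key, package_type, repositories):
    """Build a configuration payload for one small, team-specific virtual."""
    if len(repositories) > MAX_AGGREGATED:
        raise ValueError(
            f"{key} aggregates {len(repositories)} repositories; "
            f"keep it at {MAX_AGGREGATED} or fewer"
        )
    return {
        "key": key,
        "rclass": "virtual",
        "packageType": package_type,
        "repositories": list(repositories),  # listed in resolution order
    }


# Hypothetical teams, each aggregating only what its own builds need.
teams = {
    "payments-maven": ["payments-release-local", "payments-snapshot-local", "jcenter"],
    "mobile-gradle": ["mobile-release-local", "jcenter"],
}

for key, members in teams.items():
    package_type = "gradle" if key.endswith("gradle") else "maven"
    print(json.dumps(virtual_repo_config(key, package_type, members), indent=2))
```

Keeping each list short also keeps the resolution order easy to reason about, since every team can see exactly which repositories its builds might pull from.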