How Remote Repository Metadata works

Patrick Russell
2019-09-28 00:19

Offline backups of remote caches

Artifactory has a great set of Remote Repository types for pretty much any package manager. The system underlying this proxying mechanism is actually quite complex.

The main problem is how to handle upstream metadata. Normally, Artifactory's Remote Repository system relies on the client to make the correct request. This is how the Generic Remote Repository functions: the downstream client must make a correct GET request to download the remote package.

For other package types, the clients are more sophisticated and require special care. Each package manager type uses a metadata system to see which packages and dependencies to download.

Some examples include Maven's "maven-metadata.xml" file, PyPI's "simple" HTML page, and Yum's "repodata/*" files. Since each of these package managers uses its own metadata syntax, Artifactory has to simplify the system.

Enter the Metadata Cache Retrieval Period.

Metadata Cache Retrieval Period

This setting can be found in the Artifactory Admin -> Repositories -> Remote -> <Remote-repo> -> Advanced menu. 

Its definition reads:

"This value refers to the number of seconds to cache metadata files before checking for newer versions on remote server. A value of 0 indicates no caching."

This setting affects each repository type differently. For example, with a value of 600, a Maven repository checks the remote server for a new "maven-metadata.xml" file at most once every 600 seconds. In PyPI repositories, the "simple" HTML page is refreshed on the same schedule.

This way, Artifactory does not need to read the metadata file. It just passes the "fresh" metadata to the client, which does the processing to make its requests. This looks like the following sequence, using PyPI as an example:

1. A user runs a "pip install" against Artifactory

    a. The pip client requests the "artifactory.com/artifactory/api/pypi/pypi-remote/simple" HTML page

    b. If the Metadata Cache Retrieval Period has expired, Artifactory re-downloads the HTML page from the remote site and serves the updated copy, which adds a slight delay. If the period has not expired, the cached file is sent instead.

2. Based on this metadata, the pip client requests the package binaries

    a. Artifactory caches the file and returns it to the client
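
The expiry check in step 1b can be sketched as a small time-based cache. This is a toy model and not Artifactory's actual implementation: the MetadataCache class, its fetch_upstream callback, and the injectable clock are all hypothetical names introduced for illustration. It shows only the rule described above, including the documented behavior that a value of 0 disables caching.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class MetadataCache:
    """Toy model of a time-based metadata cache (not Artifactory's real code)."""

    retrieval_period_secs: float
    fetch_upstream: Callable[[str], str]  # hypothetical upstream fetcher
    clock: Callable[[], float] = time.monotonic
    _entries: Dict[str, Tuple[float, str]] = field(default_factory=dict)

    def get(self, path: str) -> str:
        now = self.clock()
        cached = self._entries.get(path)
        if cached is not None:
            fetched_at, body = cached
            # Serve the cached copy only while the period has not expired.
            # With a period of 0 this is never true, so nothing is cached.
            if now - fetched_at < self.retrieval_period_secs:
                return body
        # Expired or never fetched: re-download, then remember the copy.
        body = self.fetch_upstream(path)
        self._entries[path] = (now, body)
        return body

    def zap(self) -> None:
        """Forget all cached metadata so the next request goes upstream."""
        self._entries.clear()
```

With a 600 second period, two requests made 599 seconds apart share one upstream fetch, while a request at 600 seconds or later triggers a new one. Anything published upstream inside that window stays invisible to the client until the period expires or the cache is zapped.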

However, this can present a problem. For example, suppose a developer uploads a package to a remote site and tries to download it through Artifactory right after the upload. In that situation, the package would not yet be listed in Artifactory's cached copy of the metadata, and they'd see a 404 Not Found error.

The solution is built into the Artifactory UI. The "Zap cache" function expires the cached metadata immediately, so the next request retrieves and serves the updated metadata.

The Metadata Cache Retrieval Period can also be tuned to trade metadata freshness against download performance.

Users concerned about accurate, up-to-date metadata should lower the setting, possibly to "0" to disable caching. This slows down most pulls (Artifactory has to fetch fresh metadata from the remote site every time), but ensures the most accurate data is always used.

Users concerned about performance should increase this setting. Doing so means Artifactory uses its cache more often than not, at the cost of accuracy. 
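
Besides the UI, the setting can in principle be changed through Artifactory's repository configuration REST API. The sketch below only builds the request and does not send it. It assumes the "Update Repository Configuration" endpoint (POST to api/repositories/<repo-key>) accepts a partial JSON body and that the period is stored in the "retrievalCachePeriodSecs" field; verify both against the REST documentation for your Artifactory version. The host name is a placeholder and authentication is left out.

```python
import json
import urllib.request


def build_retrieval_period_update(
    base_url: str, repo_key: str, seconds: int
) -> urllib.request.Request:
    """Build (but do not send) a request that sets the metadata cache period.

    Assumption: the endpoint path and the "retrievalCachePeriodSecs" field
    name match your Artifactory version's repository configuration API.
    """
    if seconds < 0:
        raise ValueError("the retrieval period cannot be negative")
    payload = {"retrievalCachePeriodSecs": seconds}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/repositories/{repo_key}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and repository key; 0 means "no caching".
request = build_retrieval_period_update(
    "https://artifactory.example.com/artifactory", "pypi-remote", 0
)
```

Sending the request would require admin credentials, for example via urllib.request.urlopen with an Authorization header added to the request.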

Deleted packages and Artifactory's Cache

The above section covers remote uploads, but what if a remote repository deletes a package? If the binary was cached in Artifactory beforehand, shouldn't it still be available for download?

The answer is actually "No". Because Artifactory is trying to mirror the remote site, its metadata is served directly from the upstream source. Since that metadata no longer lists the deleted package, the client cannot find it and reports a 404 Not Found error.

However, the package was not truly lost! 

To "Undelete" the package, it can be copied to a local Artifactory repository. There, Artifactory's metadata calculation jobs can make a correct listing for the package manager.
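
The same copy can be scripted instead of done by hand. As a rough sketch, the helper below builds the URL for Artifactory's "Copy Item" REST call (a POST to api/copy/<source>?to=<target>). It assumes the cached artifacts of a remote repository are addressed through a repository key with the "-cache" suffix and that the caller has permission to copy; the helper name, host, and repository keys are hypothetical.

```python
from urllib.parse import quote


def build_copy_url(
    base_url: str, remote_repo_key: str, artifact_path: str, local_repo_key: str
) -> str:
    """Return the URL to POST to in order to copy a cached artifact.

    Assumption: the remote repository's cache is exposed under the key
    "<remote_repo_key>-cache".
    """
    path = artifact_path.lstrip("/")
    source = f"{remote_repo_key}-cache/{path}"
    target = f"/{local_repo_key}/{path}"
    # quote() keeps "/" intact and escapes characters unsafe in a URL.
    return f"{base_url.rstrip('/')}/api/copy/{quote(source)}?to={quote(target)}"


url = build_copy_url(
    "https://artifactory.example.com/artifactory",
    "pypi-remote",
    "packages/demo/demo-1.0.tar.gz",
    "pypi-local",
)
```

An authenticated POST to the resulting URL would perform the copy, after which the local repository's metadata calculation picks up the package.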

Since it is tedious to have to go to the UI and manually copy packages to start using them again, there is a better solution.

Using an Artifactory User Plugin and a Virtual Repository, you can ensure that even deleted files are preserved.

First, download and install the remoteBackup User Plugin from the official JFrog GitHub. It needs a JSON configuration file to be present in order to function as intended.

Second, create a local Artifactory repository of the same package type as the remote, and configure the plugin to copy the remote cache to the local.

Finally, create a virtual repository and have remote clients use its URL instead.

With this system, cached files are copied to the local repository, so any file that is later deleted on the remote site remains available there. The local repository's metadata will be seamlessly merged with the remote's metadata.

There are some drawbacks to this approach. Usually packages are deleted for a good reason, and there may be some risk in continuing to use a deprecated package. Package overwrites will also never appear, as the old local files will be served first. 
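
The overwrite drawback follows from resolution order. The toy model below is not Artifactory code: it treats each repository as a dictionary and assumes, as described above, that the virtual repository consults the local backup before the remote. All names and contents are made up for illustration.

```python
from typing import Dict, List, Optional

Repository = Dict[str, bytes]  # maps an artifact path to its content


def resolve(path: str, ordered_repos: List[Repository]) -> Optional[bytes]:
    """Return the first match, checking repositories in the given order."""
    for repo in ordered_repos:
        if path in repo:
            return repo[path]
    return None


# Hypothetical state: the backup holds the original upload, while the
# remote site has since overwritten the same path with new content.
local_backup: Repository = {"demo/demo-1.0.tar.gz": b"original upload"}
remote: Repository = {"demo/demo-1.0.tar.gz": b"overwritten upstream"}
virtual = [local_backup, remote]  # local first, as in the setup above

served = resolve("demo/demo-1.0.tar.gz", virtual)  # b"original upload"
```

Deleting the path from the remote dictionary changes nothing for the client, which is the desired "undelete" behavior. The same ordering is what hides the upstream overwrite.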

Artifactory's remote repository system is surprisingly complex, and these solutions should help in optimizing performance and stability of that system.