Artifactory Cleanup Best Practices

Patrick Russell
2019-08-23 07:13

Artifactory makes great use of Checksum Based Storage, but this mechanism cannot replace regular artifact cleanup duties. Software development can be messy, and many of the artifacts uploaded to Artifactory are never used again.

For example, many CI/CD builds are triggered by source control commits, and the resulting snapshot builds are never actually downloaded once they are sent to Artifactory.

Given the dynamic nature of software development, most organizations have their own data retention policies. It is up to you to determine what data can be cleaned, but there are built-in tools to cover the majority of cases.

In general, there are three kinds of techniques used to manage artifact storage in Artifactory:

– Limiting how many snapshots are kept
– Clearing oversized caches
– Deleting unused artifacts

Limiting how many snapshots are kept

Artifactory features a built-in mechanism to limit how many "snapshots" of a build are kept. The purpose of this system is to ensure that "release" artifacts are promoted out of the snapshots repository before they are overwritten.

Artifactory supports the "Max Unique Snapshots" setting for six repository types:

– Maven
– Gradle
– Docker
– NuGet
– Ivy
– SBT

Artifactory tracks the number of snapshots using the Artifactory Layout system. This means that users need to follow a predefined pattern when they upload their snapshot artifacts (most clients handle this automatically).

For example, this Maven JAR file is recognized as a part of snapshot run number 3:

jfrog/hello/1.0.5-SNAPSHOT/hello-1.0.5-20190620.224837-3.jar

Most CLI clients upload using a specific pattern, and Artifactory's default Layouts should cover those cases. You can customize the layouts of these repository types to handle custom upload paths if needed.

To enable this in Artifactory, set "Max Unique Snapshots" in the local repository settings.

When the setting is enabled, uploads beyond the Max Unique Snapshots limit cause the oldest snapshots to be deleted during the next build run.

The snapshot with the highest number is always the latest one.
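
If you manage many repositories, the same setting can also be applied through the repository configuration REST API instead of the UI. The following is a minimal sketch using Python's requests library; the repository name, credentials, and the value of 5 are placeholders, and the "maxUniqueSnapshots" field name should be verified against the repository configuration JSON of your Artifactory version.

import requests

ARTIFACTORY_URL = "http://localhost:8081/artifactory"  # adjust to your instance
AUTH = ("admin", "password")  # placeholder credentials

# Keep only the 5 most recent unique snapshots per artifact in this repository.
resp = requests.post(
    f"{ARTIFACTORY_URL}/api/repositories/libs-snapshot-local",
    json={"maxUniqueSnapshots": 5},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.status_code, resp.text)

A value of 0 (the default) leaves the number of snapshots unlimited.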

Clearing oversized caches

Artifactory's remote repositories store downloaded files in a cache. Usually, keeping the entire cache around is beneficial as it speeds up downloads. However, if the artifacts used by projects change, it might be worth clearing the cache periodically.

There are built-in systems that support this in Artifactory. To enable automatic cache cleanup, go to the "Advanced" section of the remote repository settings.

You can set the number of hours before a cached artifact is considered unused in the "Unused Artifacts Cleanup Period" field, for example 12 hours.

This does not mean that artifacts will be deleted as soon as 12 hours have passed. Instead, the setting marks an artifact as "unused" internally.

There is a separate job, found under Admin -> Advanced -> Maintenance, called "Cleanup Unused Cached Artifacts" which performs the actual deletion. By default, this cron job runs once every day.
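
The cleanup period can likewise be set through the repository configuration REST API. Below is a rough sketch, assuming a remote repository named "jcenter-remote" and placeholder credentials; "unusedArtifactsCleanupPeriodHours" is the JSON counterpart of the UI field, but confirm it against your version's repository configuration JSON.

import requests

ARTIFACTORY_URL = "http://localhost:8081/artifactory"  # adjust to your instance
AUTH = ("admin", "password")  # placeholder credentials

# Mark cached artifacts as "unused" 12 hours after their last download.
# The actual deletion still happens when the "Cleanup Unused Cached
# Artifacts" maintenance job runs.
resp = requests.post(
    f"{ARTIFACTORY_URL}/api/repositories/jcenter-remote",
    json={"unusedArtifactsCleanupPeriodHours": 12},
    auth=AUTH,
)
resp.raise_for_status()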

Deleting unused artifacts

On its own, Artifactory normally does not delete binaries automatically. There are exceptions, such as the mechanisms already discussed in this article.

That being said, a lot of storage space can be saved by deleting artifacts that have not been downloaded for a long period of time. The best way to automatically clean unused files is to implement an Artifactory User Plugin.

One of the most popular User Plugins that JFrog has developed is the "artifactCleanup" plugin. It runs on a cron schedule and automatically deletes any artifact that has not been downloaded for "X" number of days.
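
Plugin executions can also be triggered on demand through the plugin execution REST API, which is useful for testing a policy with a dry run before scheduling it. The sketch below assumes the execution name is "cleanup" and that the plugin accepts timeUnit, timeInterval, repos, and dryRun parameters; both the execution name and the parameter names differ between plugin versions, so check the README of the version you installed.

import requests

ARTIFACTORY_URL = "http://localhost:8081/artifactory"  # adjust to your instance
AUTH = ("admin", "password")  # placeholder credentials

# Dry run: report (but do not delete) artifacts not downloaded in 6 months.
params = "timeUnit=month|timeInterval=6|repos=libs-snapshot-local|dryRun=true"
resp = requests.post(
    f"{ARTIFACTORY_URL}/api/plugins/execute/cleanup",
    params={"params": params},
    auth=AUTH,
)
print(resp.status_code, resp.text)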

If you need to customize the plugin further, you can change the Artifactory Query Language statement in the code:

 def aql = "items.find({\"repo\":\"" + repoKey + "\",\"type\": \"any\",\"@cleanup.skip\":\"true\"}).include(\"repo\", \"path\", \"name\", \"type\")"
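
The query above suggests that artifacts carrying a "cleanup.skip" property are treated specially, and the plugin appears to use that property to exclude items from deletion. If there are paths the plugin should never touch, one approach is to set the property through the item properties REST API. A minimal sketch, assuming the property name is honored by your plugin version and using a placeholder path:

import requests

ARTIFACTORY_URL = "http://localhost:8081/artifactory"  # adjust to your instance
AUTH = ("admin", "password")  # placeholder credentials

# Recursively tag a folder so the plugin's query excludes it from cleanup.
item_path = "libs-snapshot-local/jfrog/hello"  # placeholder path
resp = requests.put(
    f"{ARTIFACTORY_URL}/api/storage/{item_path}",
    params={"properties": "cleanup.skip=true", "recursive": "1"},
    auth=AUTH,
)
resp.raise_for_status()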

One thing to note: artifactCleanup does not work on Docker Repositories. 

Docker image layers are stored as separate artifacts within an "image" folder. If a layer is already cached by most Docker clients, it won't get downloaded often. Because of this difference in behavior, there is a separate "cleanDockerImages" plugin that is recommended instead.

It relies on the download count of the manifest.json file, which is always downloaded when a "docker pull" occurs.
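
To preview what such a policy would affect before running any plugin, you can issue a similar query yourself through the AQL REST API. The following is an illustrative sketch, not the plugin's actual logic; it assumes a Docker repository named "docker-local" and a one-month window.

import requests

ARTIFACTORY_URL = "http://localhost:8081/artifactory"  # adjust to your instance
AUTH = ("admin", "password")  # placeholder credentials

# List manifest.json files whose last download was more than a month ago,
# i.e. image tags that have not been pulled recently. Manifests that were
# never downloaded at all would need a separate query.
aql = ('items.find({"repo":"docker-local","name":"manifest.json",'
       '"stat.downloaded":{"$before":"1mo"}})'
       '.include("repo","path","name")')
resp = requests.post(
    f"{ARTIFACTORY_URL}/api/search/aql",
    data=aql,
    headers={"Content-Type": "text/plain"},
    auth=AUTH,
)
print(resp.json())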