The Artifactory Storage Solution Explained

Adi Vizgan
2021-09-13 11:30

To store artifacts in a resource-efficient manner, Artifactory uses checksum-based storage.

How It Works

When a file is deployed to Artifactory, calculating its SHA-1 checksum is one of the first actions performed. The file is then saved to the storage backend, with the checksum used as the file name.

Specifically, the file is saved into the Artifactory filestore location, in a directory named after the first two characters of the file's checksum.

For example, a file whose checksum is ac3f5e56… will be stored in the directory named ac, a file whose checksum is dfe12a4b… will be stored in the directory named df, and so forth. The example below shows the d4 directory, which contains two artifacts whose checksums start with d4:

[Image: listing of the d4 directory in the filestore]
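
The naming scheme described above can be illustrated with a short Python sketch. This is not Artifactory's code: the function names `sha1_of_file` and `filestore_path`, and the filestore root passed to them, are hypothetical and exist only to show how a storage path could be derived from a checksum.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large artifacts are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def filestore_path(filestore_root, checksum):
    """Build <root>/<first two characters of checksum>/<checksum>.

    Hypothetical helper: it mirrors the layout described in this article,
    not an actual Artifactory function.
    """
    return Path(filestore_root) / checksum[:2] / checksum
```

Under these assumptions, any checksum that starts with d4 resolves to a file inside the d4 directory below the filestore root, and the file itself is named after the full checksum.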

In parallel, Artifactory creates a database entry that maps the file's checksum to its original filename and to its path in the repository it was uploaded to. This manner of binary storage optimizes many Artifactory operations, because they can be implemented as simple database transactions rather than as manipulation of the actual files. And because the checksum is effectively unique to each binary's content, this method of file storage and management prevents duplicate files and helps protect against corruption, since a stored file can always be verified against its checksum.
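
To make the relationship between the database mapping and the stored binaries more concrete, here is a minimal in-memory sketch. It assumes two plain dictionaries stand in for the filestore and the database; the class `ChecksumStore` and its methods are hypothetical and do not reflect Artifactory's actual schema or API.

```python
import hashlib


class ChecksumStore:
    """Toy model of checksum-based storage (illustrative assumption only)."""

    def __init__(self):
        self.binaries = {}  # checksum -> content; stands in for the filestore
        self.paths = {}     # (repository, path) -> checksum; stands in for the database

    def deploy(self, repository, path, content):
        """Store content once per checksum and map the repository path to it."""
        checksum = hashlib.sha1(content).hexdigest()
        if checksum not in self.binaries:
            # Only the first deployment of a given content writes a binary.
            self.binaries[checksum] = content
        self.paths[(repository, path)] = checksum
        return checksum

    def copy(self, source, destination):
        """Copy an artifact by adding a mapping; no binary is duplicated."""
        self.paths[destination] = self.paths[source]

    def read(self, repository, path):
        """Resolve a repository path to its content through the checksum."""
        return self.binaries[self.paths[(repository, path)]]
```

In this model, deploying the same content under two different paths, or copying an artifact to another repository, adds entries to `paths` while `binaries` still holds a single copy.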

Garbage Collection

Artifactory uses checksum-based storage to ensure that each binary file is stored only once. When a new file is deployed, Artifactory checks whether a binary with the same checksum already exists and, if so, links the repository path to that binary. When a repository path is deleted, Artifactory does not delete the binary, since other paths may still use it. However, once all paths pointing to a binary have been deleted, the binary is no longer in use.

To make sure the system does not become clogged with unused binaries, Artifactory periodically runs a Garbage Collection function that identifies unused binaries (i.e., binaries whose paths have all been deleted) and removes them from the filestore. By default, Garbage Collection runs every four (4) hours, on a schedule controlled by a cron expression. For SaaS customers, Garbage Collection runs every twelve (12) hours by default.
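
The idea behind Garbage Collection can be sketched as follows. This is a simplified illustration, not Artifactory's implementation: it assumes the filestore and the path mapping are plain dictionaries, and the function name `collect_garbage` is hypothetical.

```python
def collect_garbage(binaries, paths):
    """Remove binaries that no repository path references anymore.

    binaries: dict mapping checksum -> content (stands in for the filestore)
    paths:    dict mapping (repository, path) -> checksum (stands in for the database)
    Returns the list of checksums that were removed.
    """
    referenced = set(paths.values())
    # A binary is unused once no remaining path maps to its checksum.
    unused = [checksum for checksum in binaries if checksum not in referenced]
    for checksum in unused:
        del binaries[checksum]
    return unused


# Two paths share one binary; deleting a single path must not remove it.
binaries = {"d4aa": b"shared content", "acbb": b"other content"}
paths = {
    ("repo-a", "x.jar"): "d4aa",
    ("repo-b", "x.jar"): "d4aa",
    ("repo-a", "y.jar"): "acbb",
}
del paths[("repo-a", "x.jar")]   # binary d4aa is still referenced by repo-b
del paths[("repo-a", "y.jar")]   # binary acbb is now unreferenced
removed = collect_garbage(binaries, paths)
```

Deleting a path only changes the mapping, so it is cheap; the binary itself is removed later, during the collection pass, once nothing points to it. The short checksums in this example are made-up placeholders rather than real SHA-1 values.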