What kind of storage solution does Artifactory implement?

Adi Vizgan
2019-04-23 11:47

Subject

Artifactory implements checksum-based storage in order to store artifacts in a resource-efficient way.

Description

When a file is deployed to Artifactory, its SHA1 checksum is calculated and the file is renamed to that checksum. The file is then placed in the configured filestore, in a directory named after the first two characters of the checksum. For example, a file whose checksum is "ac3f5e56…" would be stored in the "ac" directory; a file whose checksum is "dfe12a4b…" would be stored in the "df" directory, etc. The example below shows the "d4" directory, which contains two artifacts whose checksums start with "d4":
[Image: contents of the "d4" filestore directory]
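
The layout described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not Artifactory's actual code: the function name `filestore_path` and the filestore root `/data/filestore` are assumptions made for the example.

```python
import hashlib
from pathlib import PurePosixPath


def filestore_path(filestore_root: PurePosixPath, content: bytes) -> PurePosixPath:
    """Return where a binary would live under a checksum-based layout.

    The file is named after its SHA1 checksum and placed in a directory
    named after the first two characters of that checksum.
    """
    sha1 = hashlib.sha1(content).hexdigest()
    return filestore_root / sha1[:2] / sha1


# Hypothetical filestore root, used only for illustration.
example = filestore_path(PurePosixPath("/data/filestore"), b"hello")
print(example)
```

Because the path depends only on the content, deploying the same bytes twice always resolves to the same location in the filestore.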

In parallel, Artifactory creates a database entry mapping the file's checksum to the path it was uploaded to in a repository.
Storing binaries this way optimizes many operations in Artifactory, since they are implemented as simple DB transactions rather than by actually manipulating files.

Since the checksum is unique to each binary's content, this way of storing and managing files prevents duplication and guards against corruption of files.
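
To make the idea concrete, here is a minimal in-memory sketch of the two pieces: a store of binaries keyed by checksum and a mapping from repository paths to checksums. The class `ChecksumStore` and its method names are hypothetical, and the dictionaries merely stand in for the filestore and the database; the real implementation is not shown in this article.

```python
import hashlib


class ChecksumStore:
    """Toy model of checksum-based storage (illustrative only)."""

    def __init__(self) -> None:
        self.binaries: dict[str, bytes] = {}  # checksum -> content (stands in for the filestore)
        self.paths: dict[str, str] = {}       # repository path -> checksum (stands in for the DB)

    def deploy(self, path: str, content: bytes) -> str:
        sha1 = hashlib.sha1(content).hexdigest()
        # Store the binary only if this checksum has not been seen before.
        self.binaries.setdefault(sha1, content)
        self.paths[path] = sha1
        return sha1

    def copy(self, src: str, dst: str) -> None:
        # A copy only adds a mapping entry; no binary is duplicated.
        self.paths[dst] = self.paths[src]

    def move(self, src: str, dst: str) -> None:
        # A move only rewrites the mapping entry.
        self.paths[dst] = self.paths.pop(src)

    def read(self, path: str) -> bytes:
        return self.binaries[self.paths[path]]


store = ChecksumStore()
store.deploy("libs-release/app/1.0/app.jar", b"same bytes")
store.deploy("libs-snapshot/app/1.0/app.jar", b"same bytes")
store.copy("libs-release/app/1.0/app.jar", "archive/app.jar")
print(len(store.paths), "paths,", len(store.binaries), "binary")
```

In this model three repository paths share a single stored binary, and copy and move touch only the mapping, which mirrors why such operations can be handled as database transactions.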

Garbage Collection

Artifactory uses checksum-based storage to ensure that each binary file is stored only once.

When a new file is deployed, Artifactory checks whether a binary with the same checksum already exists and, if so, links the repository path to this binary. Upon deletion of a repository path, Artifactory does not delete the binary, since it may be used by other paths. However, once all paths pointing to a binary are deleted, the file is no longer in use. To make sure the system does not become clogged with unused binaries, Artifactory periodically runs a "Garbage Collection" to identify unused ("deleted") binaries and dispose of them from the filestore. By default, Garbage Collection runs every 4 hours and is controlled by a cron expression. For SaaS customers, Garbage Collection runs every 12 hours by default.
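
The collection step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function `collect_garbage` and the two dictionaries are hypothetical stand-ins for the filestore and the path-to-checksum mapping, and the checksums are shortened placeholders rather than real SHA1 values.

```python
def collect_garbage(binaries: dict[str, bytes], paths: dict[str, str]) -> list[str]:
    """Remove binaries that no repository path references any more.

    Returns the checksums that were disposed of.
    """
    referenced = set(paths.values())
    unused = [checksum for checksum in binaries if checksum not in referenced]
    for checksum in unused:
        del binaries[checksum]
    return unused


# Placeholder checksums for illustration only.
binaries = {"d4aa": b"first binary", "d4bb": b"second binary"}
paths = {
    "repo-a/file.bin": "d4aa",
    "repo-b/file.bin": "d4aa",
    "repo-a/other.bin": "d4bb",
}

# Deleting one of two paths leaves the shared binary in use.
del paths["repo-a/file.bin"]
removed_first = collect_garbage(binaries, paths)

# Deleting the last path to "d4bb" makes that binary collectable.
del paths["repo-a/other.bin"]
removed_second = collect_garbage(binaries, paths)
print(removed_first, removed_second)
```

Deleting a path is cheap because it only removes a mapping entry; the disk space is reclaimed later, when a collection run finds that no path refers to the binary.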