Checksum-based Storage

Artifactory was built from the ground up for optimal management of binaries, with the capability to support any package format that emerges in the software development domain. One of the key features enabling these characteristics is checksum-based storage.

Checksum-based storage relies on two of the fundamental components that Artifactory uses to manage binaries: a filestore and a database. The filestore is where binaries are physically stored. Artifactory supports several storage solutions, including a local filesystem, a network filesystem, and cloud providers such as S3, Google Cloud Storage and Microsoft Azure. The database maps a file's SHA-1 checksum to its physical storage, and many operations on files within repositories are implemented as transactions in the database. Accessing binaries through a database using checksum-based storage optimizes many aspects of repository management.

Deduplication: Even if a file exists in many places within an organization’s repositories, it is actually physically stored only once. Multiple copies of a file are represented by corresponding references in the database to the single copy that is in the filestore. As a result the overall filestore size can be significantly reduced.
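The deduplication idea can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Artifactory's actual implementation: the `ChecksumStore` class, its `deploy` and `read` methods, and the repository paths are hypothetical names chosen for illustration.

```python
import hashlib


class ChecksumStore:
    """Toy model: a filestore keyed by SHA-1 plus a path-to-checksum 'database'."""

    def __init__(self):
        self.filestore = {}  # SHA-1 hex digest -> file bytes (one physical copy)
        self.database = {}   # repository path -> SHA-1 hex digest (a reference)

    def deploy(self, path, content):
        sha1 = hashlib.sha1(content).hexdigest()
        # Store the bytes only if this checksum has not been seen before.
        self.filestore.setdefault(sha1, content)
        self.database[path] = sha1
        return sha1

    def read(self, path):
        # Resolve the path to a checksum, then the checksum to the bytes.
        return self.filestore[self.database[path]]


store = ChecksumStore()
jar = b"example binary content"
store.deploy("libs-release/com/acme/app/1.0/app-1.0.jar", jar)
store.deploy("team-b-local/vendor/app-1.0.jar", jar)

print(len(store.database))   # 2 paths reference the file
print(len(store.filestore))  # 1 physical copy is stored
```

Both paths resolve to the same bytes, while the filestore holds them once.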

Instantaneous copy and move: Copy and move operations are virtually instantaneous because they involve no activity in the filesystem, only adding and removing references in the database.
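To show why copy and move need not touch stored bytes, here is a minimal sketch that treats the database as a plain mapping from path to checksum. The function names, the paths, and the placeholder checksum string are assumptions made for illustration only.

```python
def copy_artifact(database, src, dst):
    """Copy by adding a second reference to the same checksum."""
    database[dst] = database[src]


def move_artifact(database, src, dst):
    """Move by re-pointing the reference; the stored bytes are untouched."""
    database[dst] = database.pop(src)


# path -> checksum; the checksum value is a placeholder string
database = {"repo-a/app-1.0.jar": "sha1-of-app-jar"}

copy_artifact(database, "repo-a/app-1.0.jar", "repo-b/app-1.0.jar")
move_artifact(database, "repo-a/app-1.0.jar", "archive/app-1.0.jar")

print(sorted(database))  # ['archive/app-1.0.jar', 'repo-b/app-1.0.jar']
```

The cost of each operation is independent of the file size, since only the mapping changes.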

Efficient uploads, downloads and replication: Before moving files from one location to another, Artifactory sends checksum headers. If the files already exist in the destination, they are not transferred even if they exist under a different path.
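The checksum-first exchange described above can be simulated without any network code. The following is a sketch under assumptions: the `Destination` class and the `upload` function are hypothetical, and a method call stands in for the checksum headers that a real client would send.

```python
import hashlib


class Destination:
    """Toy stand-in for the receiving side of a transfer."""

    def __init__(self):
        self.filestore = {}      # SHA-1 -> bytes
        self.database = {}       # path -> SHA-1
        self.bytes_received = 0  # counts bytes actually transferred

    def has_checksum(self, sha1):
        return sha1 in self.filestore

    def link(self, path, sha1):
        # The content is already present: record a new reference only.
        self.database[path] = sha1

    def receive(self, path, content):
        sha1 = hashlib.sha1(content).hexdigest()
        self.filestore[sha1] = content
        self.database[path] = sha1
        self.bytes_received += len(content)


def upload(dest, path, content):
    """Send the checksum first; send the bytes only if they are missing."""
    sha1 = hashlib.sha1(content).hexdigest()
    if dest.has_checksum(sha1):
        dest.link(path, sha1)
        return "linked"
    dest.receive(path, content)
    return "transferred"


dest = Destination()
payload = b"x" * 1024
print(upload(dest, "repo-a/lib.tgz", payload))        # transferred
print(upload(dest, "repo-b/other/lib.tgz", payload))  # linked
print(dest.bytes_received)                            # 1024
```

The second upload targets a different path, yet no bytes are sent because the destination already holds content with the same checksum.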

Filesystem performance: Using checksum-based storage removes the need to ever write-lock the filesystem. Files are never overwritten, and they are only truly deleted in the background during garbage collection processes that run when the filesystem is idle.
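Deferred deletion can be sketched as a sweep over checksums that no path references any more. This is a simplified illustration, not Artifactory's garbage collector: `collect_garbage`, the short placeholder checksums, and the dictionary-based filestore are all assumptions.

```python
def collect_garbage(filestore, database):
    """Delete stored files whose checksum is no longer referenced by any path."""
    referenced = set(database.values())
    unreferenced = [sha1 for sha1 in filestore if sha1 not in referenced]
    for sha1 in unreferenced:
        del filestore[sha1]
    return unreferenced


# Placeholder checksums "aaa" and "bbb" stand in for real SHA-1 digests.
filestore = {"aaa": b"old build", "bbb": b"current build"}
database = {"repo/app-1.0.jar": "aaa", "repo/app-2.0.jar": "bbb"}

# Deleting an artifact only removes its reference from the database.
del database["repo/app-1.0.jar"]
print(sorted(filestore))                     # ['aaa', 'bbb']

# The bytes disappear later, when the background sweep runs.
print(collect_garbage(filestore, database))  # ['aaa']
print(sorted(filestore))                     # ['bbb']
```

Because a delete is just a reference removal, it never has to wait on the filesystem; the sweep can be scheduled for a quiet period.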

Fast and robust search: With checksum-based storage, all repository information and artifact metadata are stored in optimized database indexes. This means the data is always up to date, and searching through it is extremely fast.
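As a rough illustration of index-backed search, the sketch below uses an in-memory SQLite database. The `nodes` table, its columns, the index name, and the sample rows are hypothetical and do not reflect Artifactory's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per artifact path, with its checksum.
conn.execute("CREATE TABLE nodes (repo TEXT, path TEXT, name TEXT, sha1 TEXT)")
conn.execute("CREATE INDEX nodes_sha1_idx ON nodes (sha1)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        ("libs-release", "com/acme/app/1.0", "app-1.0.jar", "aaa111"),
        ("team-b-local", "vendor", "app-1.0.jar", "aaa111"),
        ("npm-local", "left-pad/-", "left-pad-1.3.0.tgz", "bbb222"),
    ],
)


def find_by_checksum(sha1):
    """Return every repository path that references the given checksum."""
    rows = conn.execute(
        "SELECT repo, path, name FROM nodes WHERE sha1 = ? ORDER BY repo",
        (sha1,),
    )
    return [f"{repo}/{path}/{name}" for repo, path, name in rows]


for location in find_by_checksum("aaa111"):
    print(location)
```

The query is answered from the database alone, so finding every location of a file does not require walking the filestore.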

Flexible layout: With checksum-based storage, the database is a layer of indirection between the actual storage and the displayed layout. This means that any layout can be supported, whether it is one of the standard layouts such as Maven1, Maven2, npm, NuGet, Gradle or Ivy, or a custom layout that a user specifies.

The benefits of checksum-based storage are substantial. It improves performance when accessing binaries, significantly reduces filestore usage, and supports any packaging format that may emerge on the market, which makes it a significant factor in optimizing your CI/CD workflow.