ARTIFACTORY: The complete Garbage Collection Guide

Nir Shervi
2023-01-22 11:07

Introduction

This guide describes how the JFrog Artifactory Garbage Collection job works and provides tuning advice and answers to frequently asked questions.

Key terms:

1. Binaries:
Binaries Size equals the amount of physical storage occupied by the binaries (files) in the filestore. Each binary is stored only once due to Artifactory’s checksum-based storage, even if it has multiple copies (artifacts).
2. Artifacts:
Artifacts Size equals the amount of storage that would be occupied if every artifact was saved as a new file in the filestore, without checksum-based storage.

Below is an example of the same artifact (left side) that was deployed to Artifactory twice, each time to a different repository. On the right-hand side, we see that the binary was saved only once in the filestore.
User-added image
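
The difference between the two sizes can be illustrated with a small in-memory model. This is only a sketch of the idea of checksum-based storage, not Artifactory's implementation: the function name, the repository paths and the payload are hypothetical, and a dictionary keyed by SHA-1 stands in for the filestore.

```python
import hashlib


def storage_sizes(deployments):
    """Return (binaries_size, artifacts_size) in bytes for (path, content) pairs."""
    binaries = {}       # checksum -> size; each distinct binary is stored once
    artifacts_size = 0  # every deployed path counts, duplicates included
    for _path, content in deployments:
        checksum = hashlib.sha1(content).hexdigest()
        binaries[checksum] = len(content)
        artifacts_size += len(content)
    return sum(binaries.values()), artifacts_size


# The same file deployed to two different (hypothetical) repositories.
payload = b"same bytes deployed twice"
deployments = [
    ("repo-a/app.jar", payload),
    ("repo-b/app.jar", payload),
]
binaries_size, artifacts_size = storage_sizes(deployments)
print(binaries_size, artifacts_size)
```

In this model the Artifacts Size is twice the Binaries Size, because the second deployment adds an artifact but no new binary.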
 

Artifactory Garbage Collector

The Garbage Collector's job is to clear out binary files from storage that don't have matching artifacts and free up disk space.

User-added image: Diagram of the Garbage Collection process

The Artifactory Garbage Collector has two strategies: Small Garbage Collection (from Artifactory version 6.12.0) and Full Garbage Collection (since day one). In both strategies, Artifactory uses one or more database queries to determine which binaries should be removed from the filestore by comparing the artifacts table and the binaries table.

Small Garbage Collection

This task runs on every GC execution and involves searching the trash can for artifacts whose Retention Period has expired.

A binary will be removed from storage, along with its references in the nodes and binaries database tables, if no other copies of the corresponding artifact exist.

If there is still even a single copy of this artifact in another path or repository, only the reference from the nodes table will be removed. Artifactory will keep the binary and its corresponding entry in the binaries table.
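
The rule described above can be sketched as follows. This is an illustrative model under stated assumptions, not Artifactory code: two dictionaries stand in for the nodes and binaries tables, and the function name and return strings are hypothetical.

```python
def expire_trash_entry(nodes, binaries, node_id):
    """Remove one expired node; delete its binary only if no other node references it."""
    sha1 = nodes.pop(node_id)       # the node reference is always removed
    if sha1 not in nodes.values():  # no remaining copies of this artifact
        binaries.pop(sha1, None)    # remove the binaries entry (and, in reality, the file)
        return "binary deleted"
    return "binary kept"


nodes = {1: "abc", 2: "abc", 3: "def"}  # node_id -> sha1 of the binary
binaries = {"abc": 100, "def": 200}     # sha1 -> size in bytes

first = expire_trash_entry(nodes, binaries, 1)   # another copy (node 2) still exists
second = expire_trash_entry(nodes, binaries, 2)  # last copy is gone
print(first, second)
```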

Full Garbage Collection

This task is initiated on every 20th Garbage Collection run (configurable, default: 20). It includes the Small Garbage Collection task, cleans up the Archive Indexes, and optimizes system storage.

Note: The Full Garbage Collection job may consume a lot of system resources of both Artifactory and its Database.

Since Artifactory version 7.29.8, the Full Garbage Collection uses a batch cleanup mechanism to improve performance. The batch size and the number of sub-iterations are configurable; see the tuning section of this guide for more information.
 

Good to Know:

  1. The Small Garbage Collector won't clean up binaries if artifacts are manually deleted in bulk from the Trash Can. You will have to wait until the 20th iteration of the GC when the Full GC will be triggered.
  2. For a binary to qualify for Garbage Collection, it must have a reference in the database's binaries table. Any files that weren't deployed to your storage via Artifactory won't be removed by the Garbage Collector. For such cases, you may use the Prune Unreferenced Data feature.
  3. The Small Garbage Collector won’t work if the Trash Can is disabled in Artifactory (Administration panel  → Artifactory → Settings → “Enable Trash Can”).

 

How to trigger the Garbage Collection

Rest API

Garbage Collection can be triggered with a REST API call. To trigger the Full Garbage Collection, execute the REST API call 20 times.

Example:

curl -u username:password -X POST "http://<ARTIFACTORY-URL>/artifactory/api/system/storage/gc"
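
For scripted use, the same call can be issued from Python's standard library. The sketch below is a minimal example, assuming HTTP basic authentication as in the curl command above; the function names, the host name and the credentials are hypothetical placeholders. The request is built separately from the code that sends it, so it can be inspected without contacting a server.

```python
import base64
import urllib.request

GC_ENDPOINT = "/artifactory/api/system/storage/gc"


def build_gc_request(base_url, username, password):
    """Build (but do not send) the POST request that triggers one GC run."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        base_url.rstrip("/") + GC_ENDPOINT,
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


def trigger_gc(base_url, username, password, times=1):
    """Send the trigger request `times` times and return the HTTP status codes."""
    statuses = []
    for _ in range(times):
        request = build_gc_request(base_url, username, password)
        with urllib.request.urlopen(request) as response:
            statuses.append(response.status)
    return statuses


# Hypothetical host and credentials; nothing is sent until trigger_gc() is called.
req = build_gc_request("http://artifactory.example.com", "admin", "secret")
print(req.get_method(), req.full_url)
```

Calling trigger_gc(..., times=20) mirrors the advice above to repeat the call 20 times.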

UI

Log in as an Artifactory user with administrative permissions. Navigate to the Administration panel → Artifactory → Maintenance → Click on “Run Now”, as shown below.
 

User-added image

Verification and Monitoring

To verify that the Small Garbage Collection job was executed, search for the following output in the artifactory-service.log or console.log logs:

2022-11-07T17:09:51.474Z [jfrt ] [INFO ] [38dc43ddf24cdacc] [.s.d.b.s.BinaryServiceImpl:728] [24cdacc|art-exec-138] - Triggering Garbage Collection
2022-11-07T17:09:51.475Z [jfrt ] [INFO ] [38dc43ddf24cdacc] [.s.d.b.s.g.GarbageCollector:66] [24cdacc|art-exec-138] - Starting GC strategy 'TRASH_AND_BINARIES'
2022-11-07T17:09:51.476Z [jfrt ] [INFO ] [38dc43ddf24cdacc] [.s.d.b.s.g.GarbageCollector:68] [24cdacc|art-exec-138] - Finished GC Strategy 'TRASH_AND_BINARIES'

To verify that the Full Garbage Collection job was executed, search for the following output in the artifactory-service.log or console.log logs:

2021-06-03T19:00:52.167Z [jfrt ] [INFO ] [2b5d4bc1dd3e2430] [.s.b.s.GarbageCollectorInfo:96] [art-exec-2270397    ] - Storage garbage collector report:
Number of binaries:      470,507
Total execution time:    49.93 secs
Candidates for deletion: 124
Checksums deleted:       123
Binaries deleted:        123
Total size freed:        15.80 GB
Current total size:      18.74 TB
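
When tracking these reports over time, it can help to extract the fields programmatically. The parser below is a rough sketch that assumes the report keeps the "Label: value" layout shown above; the function name and the regular expression are illustrative and not part of any JFrog tooling.

```python
import re

# A report field is a line such as "Binaries deleted:        123".
REPORT_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?):\s+(.+?)\s*$")


def parse_gc_report(text):
    """Return the report fields as a dict of label -> raw string value."""
    fields = {}
    for line in text.splitlines():
        match = REPORT_LINE.match(line)
        if match:  # the timestamped header line does not match and is skipped
            fields[match.group(1)] = match.group(2)
    return fields


sample = """\
2021-06-03T19:00:52.167Z [jfrt ] [INFO ] [2b5d4bc1dd3e2430] [.s.b.s.GarbageCollectorInfo:96] [art-exec-2270397    ] - Storage garbage collector report:
Number of binaries:      470,507
Total execution time:    49.93 secs
Candidates for deletion: 124
Checksums deleted:       123
Binaries deleted:        123
Total size freed:        15.80 GB
Current total size:      18.74 TB
"""
report = parse_gc_report(sample)
binaries_deleted = int(report["Binaries deleted"].replace(",", ""))
print(report["Total size freed"], binaries_deleted)
```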

To keep track of Artifacts and Binaries sizes after Garbage Collection execution, navigate to the Administration panel → Monitoring  →  Storage Status page in the Artifactory UI as a user with admin privileges or use the REST API call.
 

FAQ

Why is the Binaries Size greater than the Artifacts Size?

When an artifact is deleted, its database reference is deleted immediately, but the binary stays in the filestore until the next GC run.

A Binaries Size greater than the Artifacts Size indicates that the GC might not be working properly or might not be running at all.

When the Artifacts Size is greater than or equal to the Binaries Size, the GC operates as expected.

Below is an example of Artifactory Storage Status that indicates that the GC doesn’t work properly or fast enough (Binaries Size greater than Artifacts Size):

User-added image
 

How many binaries are eligible for Full Garbage Collection, and how much space should be freed?

Run the following SQL query on the Artifactory database:

SELECT COUNT(b.sha1),
       SUM(b.bin_length) AS binaries_size_in_bytes
FROM binaries b
WHERE NOT EXISTS
  (SELECT n.node_id
   FROM nodes n
   WHERE n.sha1_actual = b.sha1);
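
To see what the query returns before running it against a production database, it can be tried on a toy dataset. The sketch below uses an in-memory SQLite database as a stand-in; the two tables are reduced to the columns the query names (the real Artifactory tables have more columns), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE binaries (sha1 TEXT PRIMARY KEY, bin_length INTEGER);
    CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, sha1_actual TEXT);
    -- Three binaries in the filestore, only 'aaa' is still referenced.
    INSERT INTO binaries VALUES ('aaa', 100), ('bbb', 250), ('ccc', 4000);
    INSERT INTO nodes VALUES (1, 'aaa'), (2, 'aaa');
""")

# Same query as above: binaries with no node pointing at them.
count, size_in_bytes = conn.execute("""
    SELECT COUNT(b.sha1), SUM(b.bin_length) AS binaries_size_in_bytes
    FROM binaries b
    WHERE NOT EXISTS
      (SELECT n.node_id FROM nodes n WHERE n.sha1_actual = b.sha1)
""").fetchone()
print(count, size_in_bytes)
conn.close()
```

In this toy dataset, 'bbb' and 'ccc' have no matching node, so they are the binaries a Full GC would be expected to remove.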
 

How to tune the Garbage Collection?

Each of the system properties listed below can be configured in the $JFROG_HOME/artifactory/var/etc/artifactory/artifactory.system.properties file.

Be sure to restart Artifactory for the changes to take effect.

 

1. Scheduling the Garbage Collection

To schedule the Garbage Collection using a CRON expression, navigate to the Administration module → Artifactory → Maintenance. By default, GC runs every four hours, resulting in six iterations per day.
User-added image

2. Tuning the number of worker threads

Binaries cleanup is multithreaded and can be configured as below:

artifactory.gc.numberOfWorkersThreads=3

Note: When using Microsoft SQL Server, Garbage Collection is single-threaded regardless of the system property above.

 

3. Number of Small Garbage Collection runs (Artifactory 6.12.0 and above)

The following property can be used to configure the number of Small Garbage Collection runs that occur between Full GC runs:

artifactory.gc.skipFullGcBetweenMinorIterations=20 
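
Combined with the GC schedule, this value determines how often a Full GC happens. The arithmetic can be sketched as below, assuming (as described earlier in this guide) that every 20th run is a Full GC and that runs are counted from 1; the function names are hypothetical and the exact counting inside Artifactory may differ.

```python
def full_gc_interval_hours(hours_between_runs=4, full_gc_every=20):
    """Hours between two Full GC runs for a fixed GC schedule."""
    return hours_between_runs * full_gc_every


def is_full_gc_run(run_number, full_gc_every=20):
    """True if the given run (counted from 1) is a Full GC under this model."""
    return run_number % full_gc_every == 0


hours = full_gc_interval_hours()  # default schedule: one run every four hours
print(hours, hours / 24)          # hours and days between Full GC runs
print([n for n in range(1, 61) if is_full_gc_run(n)])
```

With the default schedule this works out to one Full GC roughly every 80 hours, so lowering the property value or scheduling GC more often shortens the wait.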

4. Disable sorted deletion of binaries (Artifactory 7.31.10 and above)

By default, the Full GC deletes binaries in descending order of size (largest to smallest). The sorting happens in the database, which can affect database performance. To disable the sorting, set the property below to true.

artifactory.gc.skipOrderByFullGc=true

 

5. Configure the Full GC batch size and the number of iterations (Artifactory 7.29.8 and above) 

By default, the Full Garbage Collection batch size is set to 10,000.

artifactory.gc.binariesToDeleteBatchSize=10000

The property below controls the number of Full Garbage Collection sub-iterations.

artifactory.gc.binariesToDeleteIterationAmount=20

Even if there are additional binaries to clean up, the Garbage Collection will stop once the above-mentioned value is reached, and the following message will be logged to indicate this.

2022-08-01 00:38:27,233Z [jfrt ] [WARN ] [a83991fe168767bb] [.s.d.b.s.BinaryServiceImpl:681] [art-exec-1072408 ] - The GC is stopping due to maximum iterations reached
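
The interaction of the two properties can be modelled with a short loop. This is a sketch under an assumption that is not confirmed by this guide: each sub-iteration processes at most one batch, so a single Full GC would remove at most batch size × iterations binaries (200,000 with the defaults). The function name is hypothetical and the deletion itself is only counted, not performed.

```python
def run_full_gc(candidates, batch_size=10000, max_iterations=20):
    """Process candidates in batches; stop after max_iterations even if work remains.

    Returns (deleted, remaining) counts.
    """
    remaining = list(candidates)
    deleted = 0
    for _ in range(max_iterations):
        if not remaining:
            break
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        deleted += len(batch)  # stand-in for deleting the batch from the filestore
    return deleted, len(remaining)


# 250,000 candidates with the default settings: the iteration cap is hit first.
deleted, left_over = run_full_gc(range(250_000))
print(deleted, left_over)
```

Under this model, leftover candidates wait for the next Full GC, which is when raising either property would matter.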