Why does the storage info REST API count duplicate Docker layers, and how can we avoid it?

Batel Tova
2020-12-29 15:20

As mentioned in this JIRA ticket, the behavior of the storage info REST API has been reported in the past. It's worth noting that the size reported in this part of the response does count duplicate layers, and this is intentional.
This call uses the same information that is displayed on the Storage page of the Artifactory UI:
[Screenshot: Storage page of the Artifactory UI]
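For reference, the call can be sketched in Python as follows. This is a minimal sketch, not an official client: it assumes the endpoint is `api/storageinfo` under the Artifactory base URL and that the response contains a `repositoriesSummaryList` whose entries carry `repoKey` and `usedSpace`. Verify both against the REST API documentation for your version. The helper names `fetch_storage_info` and `used_space_by_repo` are hypothetical.

```python
import base64
import json
import urllib.request


def fetch_storage_info(base_url, user, password):
    """Fetch the storage summary (assumed endpoint: api/storageinfo)."""
    req = urllib.request.Request(base_url + "api/storageinfo")
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def used_space_by_repo(storage_info):
    """Map each repository key to its reported used space.

    The field names are assumptions about the response shape. The values
    are artifacts sizes, so duplicate binaries are counted in every
    repository that references them.
    """
    return {
        entry["repoKey"]: entry["usedSpace"]
        for entry in storage_info.get("repositoriesSummaryList", [])
    }
```

Calling `used_space_by_repo(fetch_storage_info('http://localhost:8081/artifactory/', 'admin', 'password'))` would return the per-repository figures discussed here.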

As you may already know, there are two different sizes: binaries size (orange) and artifacts size (green). The binaries size counts each file only once, and it's the value you want for the repository size, which is represented by the blue line. As you can see in that table, the sizes of all the repositories add up to the total artifacts size, which is why the call counts duplicates in the response.
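The difference between the two sizes can be illustrated with a small Python sketch. The inventory below is entirely hypothetical (repository names, checksums, and sizes are made up); it only shows why per-repository sizes sum to the artifacts size rather than the binaries size.

```python
# Hypothetical inventory: (repo, checksum, size in bytes)
ITEMS = [
    ("docker-a", "layer1", 100),
    ("docker-a", "layer2", 50),
    ("docker-b", "layer1", 100),  # same layer referenced from a second repo
]


def artifacts_size(items):
    """Every reference is counted, so shared binaries are counted repeatedly."""
    return sum(size for _, _, size in items)


def binaries_size(items):
    """Each checksum is counted once, no matter how many repos reference it."""
    unique = {checksum: size for _, checksum, size in items}
    return sum(unique.values())


def artifacts_size_per_repo(items):
    """Per-repository totals; these add up to the artifacts size."""
    totals = {}
    for repo, _, size in items:
        totals[repo] = totals.get(repo, 0) + size
    return totals
```

With this data the artifacts size is 250 bytes and the binaries size is 150 bytes, while the per-repository totals (150 and 100) add up to the artifacts size.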

The reason we don't also provide a binaries size per repository is that, technically, no such value exists. You could delete an entire repository that contains 1 TB of artifacts and, if the same binaries are referenced from other repositories, not free a single megabyte of space. However, we understand that a deduplicated size of the artifacts can be useful, and it can be calculated using AQL.
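To make the deletion scenario concrete, here is a minimal Python sketch that estimates how many bytes would actually be freed by deleting one repository. The data and the function name `reclaimable_bytes` are hypothetical; the idea is simply that a binary is only freed when no other repository still references its checksum.

```python
from collections import defaultdict

# Hypothetical inventory: (repo, checksum, size in bytes)
SAMPLE_ITEMS = [
    ("repo-a", "aaa", 100),
    ("repo-a", "bbb", 50),
    ("repo-b", "aaa", 100),  # "aaa" is shared between repo-a and repo-b
]


def reclaimable_bytes(items, repo_to_delete):
    """Sum the sizes of checksums referenced only by repo_to_delete."""
    owners = defaultdict(set)  # checksum -> set of repos referencing it
    sizes = {}                 # checksum -> size in bytes
    for repo, checksum, size in items:
        owners[checksum].add(repo)
        sizes[checksum] = size
    return sum(
        sizes[checksum]
        for checksum, repos in owners.items()
        if repos == {repo_to_delete}
    )
```

Here deleting `repo-b` would free nothing, because its only binary is still referenced from `repo-a`, while deleting `repo-a` would free only the 50 bytes that are unique to it.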

For example, this Python 3 script gets the size of each repository with each checksum counted once and prints out the results:

from collections import defaultdict  # convenient data structure for grouping

import requests  # third-party library for API requests

base_url = 'http://localhost:8081/artifactory/'  # your Artifactory instance


def audit():
    headers = {'content-type': 'text/plain'}  # AQL queries are sent as plain text
    # Query to find all artifacts, returning only the fields we need
    data = 'items.find({"name":{"$match":"*"}}).include("actual_sha1", "repo", "size")'

    # Execute the query (replace the credentials with your own)
    resp = requests.post(base_url + 'api/search/aql', auth=('admin', 'password'),
                         headers=headers, data=data)
    resp.raise_for_status()
    results = resp.json()["results"]  # parse the JSON body instead of using eval()

    # repos[repo][checksum] = size; keying by checksum counts each binary once per repo
    repos = defaultdict(dict)
    for item in results:
        repos[item["repo"]][item["actual_sha1"]] = int(item["size"])

    for repo, artifacts in repos.items():
        print("Storage per Artifact for Repo {}".format(repo))
        for checksum, size in artifacts.items():
            print("[{}] -- Checksum:{} -- Size:{}".format(repo, checksum, size))

    # Deduplicated total per repo: the sum of the unique checksums' sizes
    for repo, artifacts in repos.items():
        print("Repo {} uses a total of {} bytes.".format(repo, sum(artifacts.values())))


if __name__ == '__main__':
    audit()
Here is an example of the output:

Storage per Artifact for Repo test-generic
[test-generic] -- Checksum:3ae3f83349b04656faa27ae59b2287c06bdc428b -- Size:423232
[test-generic] -- Checksum:591d8d38b865ab1ef4218120779f78ec950d97b0 -- Size:3045251
[test-generic] -- Checksum:a113a4b034a150990514c3f0c6f1c0f2b72384a5 -- Size:1410580
Storage per Artifact for Repo nuget-local
[nuget-local] -- Checksum:3b71f43ff30f4b15b5cd85dd9e95ebc7e84eb5a3 -- Size:1048576
[...more data...]
Storage per Artifact for Repo debian-remote-cache
[debian-remote-cache] -- Checksum:a113a4b034a150990514c3f0c6f1c0f2b72384a5 -- Size:1410580
Storage per Artifact for Repo jcenter-cache
[jcenter-cache] -- Checksum:b899da20a0f408d00cfe32a268458fb401d5d698 -- Size:1546674
Storage per Artifact for Repo test-generic-2
[test-generic-2] -- Checksum:e6bbc45386305b92f08f894deb1b47c66bd3d815 -- Size:788
Repo npm-remote-cache uses a total of 2699 bytes.
Repo docker-remote-cache uses a total of 553 bytes.
Repo pypi-local uses a total of 163 bytes.
Repo libs-release-local uses a total of 34 bytes.
Repo conan-local uses a total of 791405 bytes.
Repo pypi-remote-cache uses a total of 17592 bytes.
Repo debian-local uses a total of 713 bytes.
Repo test-generic uses a total of 423232 bytes.
Repo nuget-local uses a total of 1048576 bytes.
Repo test-cache uses a total of 5836 bytes.
Repo debian-remote-cache uses a total of 1410580 bytes.
Repo jcenter-cache uses a total of 1546674 bytes.
Repo test-generic-2 uses a total of 788 bytes.