Why Large Repository Push Replications Can Fail

Yehuda Hadad
2023-01-22 11:07

Note: Although this article addresses issues related to push replication, the same troubleshooting advice applies to pull replication. In that case, your target server is the one from which data is pulled, and your source server is the one pulling the artifacts.

Before you can start to debug an issue like this, you need to understand how the replication mechanism works. When a repository replication is triggered, either by cron or manually, Artifactory first queries your destination repository for the full file list. This is a JSON document that lists your entire repository content. Using this information, Artifactory then determines which files should be replicated and which should be deleted (the latter, assuming you've enabled the sync delete option).

A failure can occur if your target repository contains a lot of files (i.e., a huge file list) or if a lot of files in your source repository are missing from the target and need to be replicated. The reason is that the target Artifactory returns the file list as a stream, and the source Artifactory checks each entry on the fly to decide whether a file needs to be replicated or deleted. While the source is busy replicating, it is not reading from that stream.
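The comparison described above can be sketched in Python. This is a simplified model rather than Artifactory's actual implementation: the function name, the (path, checksum) entry shape, and the use of checksums to detect changed files are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def plan_replication(
    source_files: Dict[str, str],
    target_stream: Iterable[Tuple[str, str]],
    sync_deletes: bool = False,
) -> Tuple[List[str], List[str]]:
    """Return (paths to replicate, paths to delete on the target).

    source_files maps path -> checksum on the source (hypothetical shape).
    target_stream yields (path, checksum) entries from the target file list.
    """
    to_replicate: List[str] = []
    to_delete: List[str] = []
    seen = set()

    # Entries are consumed one at a time, mirroring the "on the fly" check.
    for path, checksum in target_stream:
        seen.add(path)
        if path not in source_files:
            # Present only on the target: delete it if sync deletes is on.
            if sync_deletes:
                to_delete.append(path)
        elif source_files[path] != checksum:
            # Present on both sides but different: replicate again.
            to_replicate.append(path)

    # Anything the target never listed is missing there and must be pushed.
    to_replicate.extend(p for p in source_files if p not in seen)
    return to_replicate, to_delete


source = {"a.jar": "1", "b.jar": "2", "c.jar": "3"}
target = iter([("a.jar", "1"), ("b.jar", "x"), ("old.jar", "9")])
replicate, delete = plan_replication(source, target, sync_deletes=True)
print(replicate, delete)
```

The sketch only collects the work to be done. In the real process the transfers happen while the stream is still open, which is why a long list or many pending transfers can leave the connection idle long enough to hit a timeout.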

With this understanding of how the replication process functions and what could trigger a failure, you're now in a better position to investigate the actual cause of the failure you're confronting. There are two scenarios you might be facing:

Note: In addition to the parameters mentioned below, in both scenarios you should also increase the socket timeout configured in the replication settings so that it matches the timeout values you configure.

1. Your source Artifactory is communicating directly with your target Artifactory server without the use of a reverse proxy or load balancer in between

In this case, the issue is most likely related to the embedded Tomcat timeout. Your source Artifactory starts to replicate artifacts and does not read from the file list stream while it does so. Meanwhile, the timeout configured on your target server is triggered and the connection is closed, because no data was streamed. You can resolve this issue by increasing the timeout value. To do so, add the following parameter to the Artifactory connector section in the $ART_HOME/tomcat/conf/server.xml file:

connectionTimeout="<milliseconds>"

We recommend starting with a high value (e.g., 3 hours) and monitoring your replications thereafter. Note that Tomcat reads connectionTimeout in milliseconds, so 3 hours is 10800000. Your observations will tell you whether it is best to decrease or increase this setting.
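As a minimal sketch of this change, the following Python snippet converts a duration in hours to the milliseconds that Tomcat expects and sets the attribute on a connector. The sample server.xml content, the port 8081 match, and the function names are assumptions for illustration; your real connector will carry more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed server.xml used only for this example.
SAMPLE_SERVER_XML = """<Server port="8015" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8081" sendReasonPhrase="true"/>
  </Service>
</Server>"""


def hours_to_millis(hours: float) -> int:
    """Convert hours to milliseconds, the unit Tomcat uses for connectionTimeout."""
    return int(hours * 60 * 60 * 1000)


def set_connection_timeout(server_xml: str, port: str, hours: float) -> str:
    """Return server_xml with connectionTimeout set on the connector for `port`."""
    root = ET.fromstring(server_xml)
    for connector in root.iter("Connector"):
        if connector.get("port") == port:
            connector.set("connectionTimeout", str(hours_to_millis(hours)))
    return ET.tostring(root, encoding="unicode")


updated = set_connection_timeout(SAMPLE_SERVER_XML, "8081", 3)
print(updated)
```

ElementTree discards XML comments when it parses and re-serializes a file, so for a real server.xml you may prefer to make the edit by hand and use a script like this only to compute the value or check it afterwards.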

2. Your source Artifactory is communicating with your target Artifactory using a reverse proxy or load balancer in between

This situation is a bit more complicated, as you'll need to identify where precisely your connection is getting closed and reset:

  • Between the target server and the load balancer/reverse proxy
  • Between the reverse proxy and the source server

To identify where your connection is getting closed and reset, first look at the request log on your target Artifactory server. If the stream was interrupted between the target server and the reverse proxy/load balancer, you'll see connection reset errors in artifactory.log and no file list request entry in request.log. In that case, the issue is similar to the one described in Scenario #1 and the same solution should resolve the problem.

If, however, you do find an entry in request.log that matches the file list request, then your issue is probably occurring between the reverse proxy/load balancer and the source server.

In that case, you'll need to manually obtain the file list from the target repository using the file list REST API (the same GET request that appears in the access log example below). After running this, you'll receive the entire stream. Check the size of this file and compare it to the size returned over your reverse proxy/load balancer connection. In other words, you'll be checking whether your reverse proxy transferred the entire file list or only part of it. This information can be seen, for example, in the following Nginx access.log entry (the returned size is the bytes field):

ip = 10.132.0.27 user = "admin" local_time = "23/Apr/2019:12:12:46 +0000" host = 10.132.0.27 request = "GET /artifactory/api/storage/generic-local/?list&deep=1&listFolders=1&mdTimestamps=1&statsTimestamps=1&includeRootPath=1 HTTP/1.1" status = 200 bytes = 281785230 upstream = "172.18.0.2:8081" upstream_time = 371.541 request_time = 475.211 referer = "-" UA = "Artifactory/6.9.1"

If your reverse proxy/load balancer returned the entire stream to the source server, then your issue is with another portion of your network, which will require further investigation with your company's networking team.

If you see that only part of the file list was returned to your source server, then the problem is probably with your reverse proxy/load balancer timeout settings. Once again, the issue is similar to the one described in Scenario #1 and the same kind of solution applies: increase the timeout, this time on the reverse proxy/load balancer. Examples of parameters you can edit in Nginx:

  • client_body_timeout: Defines a timeout for reading client request body.
  • send_timeout: Controls the amount of time that Artifactory will be able to hold the stream and not read from it.
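The size comparison described in this scenario can be sketched in Python as follows. The key = value layout is specific to the custom Nginx log format shown in the access.log example, and the function names and expected-size argument are assumptions for illustration. The expected size is the size of the file list you fetched directly from the target.

```python
import re
from typing import Optional

# Access log entry from the example above (custom key = value log format).
LOG_LINE = (
    'ip = 10.132.0.27 user = "admin" '
    'local_time = "23/Apr/2019:12:12:46 +0000" host = 10.132.0.27 '
    'request = "GET /artifactory/api/storage/generic-local/'
    '?list&deep=1&listFolders=1&mdTimestamps=1&statsTimestamps=1'
    '&includeRootPath=1 HTTP/1.1" status = 200 bytes = 281785230 '
    'upstream = "172.18.0.2:8081" upstream_time = 371.541 '
    'request_time = 475.211 referer = "-" UA = "Artifactory/6.9.1"'
)


def bytes_sent(log_line: str) -> Optional[int]:
    """Extract the value of the bytes field, or None if it is absent."""
    match = re.search(r"\bbytes = (\d+)", log_line)
    return int(match.group(1)) if match else None


def is_truncated(log_line: str, expected_size: int) -> bool:
    """True if the proxy sent fewer bytes than the full file list contains."""
    sent = bytes_sent(log_line)
    if sent is None:
        raise ValueError("no bytes field found in log line")
    return sent < expected_size


print(bytes_sent(LOG_LINE))
print(is_truncated(LOG_LINE, expected_size=300_000_000))
```

If is_truncated returns True for the file list request, the reverse proxy/load balancer timeouts are the first place to look. If it returns False, the full list reached the source and the problem lies elsewhere in the network.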

Related Errors:

2019-03-04 14:27:56,362 [http-nio-127.0.0.1-8081-exec-446] [ERROR] (o.a.r.r.a.ArtifactResource:292) - Could not retrieve list
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Connection reset by peer
2019-04-22 06:52:44,712 [art-exec-1232043] [ERROR] (o.a.a.c.BasicStatusHolder:211) - Error occurred while performing folder replication for 'nuget-local': Connection reset
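The log triage for the reverse proxy/load balancer scenario can be roughly sketched in Python. This is a heuristic under stated assumptions: the function name is hypothetical, a file list request is assumed to be recognizable in request.log by the /api/storage/<repository> path, and the sample request.log line is invented for the example and does not reflect the real request.log format.

```python
from typing import Iterable


def locate_connection_drop(
    artifactory_log: Iterable[str],
    request_log: Iterable[str],
    repo_key: str,
) -> str:
    """Guess where the file list stream was cut, based on target-side logs."""
    reset_seen = any("Connection reset" in line for line in artifactory_log)

    # Assumption: the file list request shows up with this path in request.log.
    marker = f"/api/storage/{repo_key}"
    list_request_seen = any(marker in line for line in request_log)

    if list_request_seen:
        # The target served the request, so look downstream of it.
        return "between reverse proxy/load balancer and source server"
    if reset_seen:
        # Reset errors without a logged request point at the target side.
        return "between target server and reverse proxy/load balancer"
    return "inconclusive"


artifactory_lines = [
    "2019-04-22 06:52:44,712 [art-exec-1232043] [ERROR] "
    "(o.a.a.c.BasicStatusHolder:211) - Error occurred while performing "
    "folder replication for 'nuget-local': Connection reset",
]
# Invented placeholder line, not the real request.log format.
request_lines = ["GET /api/storage/nuget-local/ 200"]

print(locate_connection_drop(artifactory_lines, [], "nuget-local"))
print(locate_connection_drop(artifactory_lines, request_lines, "nuget-local"))
```

Treat the result as a starting point only. Other storage API calls for the same repository would also match the path check, so confirm the match against the timestamp of the failed replication.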