Why does replication of my large repositories fail in push replication?

Yehuda Hadad
2019-05-08 08:11

How full replication works:
Before debugging the issue, let's first understand how the replication mechanism works.
When repository replication is triggered, either by cron or manually, Artifactory first queries the destination repository for the full file list. The file list is an XML file that includes the entire repository content. Using this information, Artifactory checks which files should be replicated and which should be deleted (if the sync deletes option is enabled).

The issue appears when the target repository contains a lot of files, which means a huge file list, or when the source repository has a lot of files that are missing on the target and need to be replicated.
The reason is that the target Artifactory returns the file list as a stream, and the source Artifactory checks each entry on the fly to decide whether a file is missing and needs to be replicated or deleted. While the source is busy replicating or deleting, it is not reading from the stream, so the connection sits idle and can hit a timeout.
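To make the on-the-fly check concrete, here is a minimal sketch of how a source could walk a streamed file list and decide what to deploy or delete. It is an illustration only, not Artifactory's actual implementation: the function name plan_replication, the action labels, and the assumption that the target listing arrives sorted by path are all hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def plan_replication(
    source_paths: Set[str],
    target_listing: Iterable[str],
    sync_deletes: bool = False,
) -> Iterator[Tuple[str, str]]:
    """Yield ("deploy" | "delete", path) while consuming the target listing.

    Assumption: target_listing is sorted the same way as sorted(source_paths).
    """
    source = iter(sorted(source_paths))
    pending = next(source, None)
    for target_path in target_listing:  # one entry at a time, as from a stream
        # Source files that sort before the current target entry are missing
        # on the target and must be deployed.
        while pending is not None and pending < target_path:
            yield ("deploy", pending)
            pending = next(source, None)
        if pending == target_path:
            pending = next(source, None)  # present on both sides
        elif sync_deletes:
            yield ("delete", target_path)  # exists only on the target
    # Anything left on the source side is missing on the target.
    while pending is not None:
        yield ("deploy", pending)
        pending = next(source, None)


if __name__ == "__main__":
    actions = plan_replication(
        {"a.jar", "b.jar", "d.jar"}, ["b.jar", "c.jar", "d.jar"], sync_deletes=True
    )
    for action in actions:
        print(action)
```

Because this is a generator, the caller performs each deploy or delete before the next listing entry is read. That pause in reading is exactly the idle period during which the other side can time out.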

Now that we know how the process works, we can start investigating the cause of the issue. There are two scenarios:
Note: In addition to the parameters mentioned below, in both scenarios, increase the socket timeout configured in the replication settings to match the timeout values you configure.

1. The source Artifactory communicates directly with the target Artifactory server (no reverse proxy/LB in the way):
In this scenario, the issue is most likely related to the embedded Tomcat timeout. It happens because the source Artifactory starts replicating artifacts and does not read from the stream during that time.
The target server reaches the configured timeout and closes the connection, as no data was streamed.
This issue can be solved by increasing the timeout value. To do so, add the following parameter to the Artifactory connector section in the $ART_HOME/tomcat/conf/server.xml file: connectionTimeout="<milliseconds>" (Tomcat reads this value in milliseconds, not seconds).
As a start, we recommend setting the value to a high number (like 3 hours, which is 10800000 milliseconds) and monitoring the replications. Then you will know whether you can decrease the value or need to increase it further.
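Tomcat interprets connectionTimeout in milliseconds, so the conversion is easy to get wrong. The small helper below is only a convenience sketch; the function names are hypothetical and it simply formats the attribute text to paste into the connector element.

```python
def connection_timeout_ms(hours: float) -> int:
    """Convert a timeout in hours to the milliseconds Tomcat expects."""
    return int(hours * 60 * 60 * 1000)


def connector_attribute(hours: float) -> str:
    """Format the attribute as it would appear in the server.xml connector."""
    return f'connectionTimeout="{connection_timeout_ms(hours)}"'


if __name__ == "__main__":
    print(connector_attribute(3))  # connectionTimeout="10800000"
```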

2. The source Artifactory communicates with the target Artifactory through a reverse proxy/load balancer:
This situation is a bit more complicated, since we need to understand where the connection gets closed/reset:
a. Between the target server and the load balancer/reverse proxy.
b. Between the reverse proxy and the source server.
To find where the connection gets closed/reset, first look at the request log on the target Artifactory server. If the stream was cut between the target server and the reverse proxy/load balancer, we will see connection reset errors in artifactory.log and no entry for the file list request in the request.log file.
If the request log contains an entry that matches the file list request, the issue is probably between the reverse proxy/load balancer and the source server. Otherwise, the issue matches scenario #1 and the same solution should resolve it.
If we found an entry for the file list request in the target server's request.log, we need to collect the file list from the target repository manually, using the same REST API call that appears in the log example below. After receiving the entire stream, check the size of the file and compare it to the response size reported by the reverse proxy/load balancer.
In other words, we check whether the reverse proxy transferred the entire file list or only part of it. For example, in the Nginx access.log file, the returned size is the bytes value:

ip = user = "admin" local_time = "23/Apr/2019:12:12:46 +0000" host = request = "GET /artifactory/api/storage/generic-local/?list&deep=1&listFolders=1&mdTimestamps=1&statsTimestamps=1&includeRootPath=1 HTTP/1.1" status = 200 bytes = 281785230 upstream = "" upstream_time = 371.541 request_time = 475.211 referer = "-" UA = "Artifactory/6.9.1"
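The size comparison can be scripted. The sketch below is one possible approach under several assumptions: the access log uses the same "bytes = <number>" layout as the example line above, the file list is fetched directly from the target (bypassing the proxy) with basic authentication, and no compression changes the size in transit. The function names and the base_url, repo, user, and password parameters are hypothetical.

```python
import base64
import io
import re
import urllib.request
from typing import BinaryIO


def bytes_sent_from_log(line: str) -> int:
    """Extract the 'bytes = N' value from an access log line in the layout above."""
    match = re.search(r"\bbytes = (\d+)", line)
    if match is None:
        raise ValueError("no 'bytes = <number>' field found in log line")
    return int(match.group(1))


def count_stream_bytes(stream: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Read a stream to the end in chunks and return its total size in bytes."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)


def fetch_file_list_size(base_url: str, repo: str, user: str, password: str) -> int:
    """Download the full file list from the target and return its size."""
    url = (
        f"{base_url}/api/storage/{repo}/?list&deep=1&listFolders=1"
        "&mdTimestamps=1&statsTimestamps=1&includeRootPath=1"
    )
    request = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        return count_stream_bytes(response)


def proxy_sent_full_list(proxy_bytes: int, full_list_bytes: int) -> bool:
    """True if the proxy returned at least as many bytes as the full list."""
    return proxy_bytes >= full_list_bytes


if __name__ == "__main__":
    sample = 'status = 200 bytes = 281785230 upstream = "" upstream_time = 371.541'
    print(bytes_sent_from_log(sample))
    print(count_stream_bytes(io.BytesIO(b"x" * 1000)))
```

A typical use would be to call fetch_file_list_size against the target server, call bytes_sent_from_log on the matching access log line, and pass both numbers to proxy_sent_full_list.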

If the reverse proxy/load balancer returned the entire stream to the source server, then the issue is in another part of the network and needs further investigation with your company's networking team.
If only part of the file list was returned to the source server, then the issue is probably with the reverse proxy/load balancer timeout settings. Again, as a first step we recommend increasing the timeouts to a high number (like 3 hours) and monitoring the replication; afterwards, you can tune the settings to the environment's needs. Examples of parameters to edit in Nginx:
client_body_timeout - Defines a timeout for reading the client request body.
send_timeout - Controls how long Artifactory can hold the stream without reading from it (Nginx applies this timeout between two successive write operations to the client).
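The decision steps for scenario 2 can be summarized as a small function. This is only a sketch of the reasoning in this article; the function name, parameter names, and return labels are hypothetical.

```python
from typing import Optional


def locate_failure(
    request_log_has_file_list_entry: bool,
    proxy_bytes_sent: Optional[int] = None,
    full_list_bytes: Optional[int] = None,
) -> str:
    """Return a label for where the connection was most likely cut."""
    if not request_log_has_file_list_entry:
        # No request.log entry on the target: the stream was cut between the
        # target and the proxy, so handle it like scenario 1.
        return "target-to-proxy"
    if proxy_bytes_sent is None or full_list_bytes is None:
        # Both sizes are needed before the comparison can be made.
        return "need-sizes"
    if proxy_bytes_sent < full_list_bytes:
        # The proxy returned only part of the list: review its timeouts.
        return "proxy-to-source"
    # The proxy returned everything: look elsewhere in the network.
    return "other-network"


if __name__ == "__main__":
    print(locate_failure(True, proxy_bytes_sent=1000, full_list_bytes=281785230))
```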

NOTE: Although this KB was written for push replication, the same troubleshooting applies to pull replication. In that case, the target server is the server being pulled from, and the source server is the server that pulls the artifacts.

Related Errors:

2019-03-04 14:27:56,362 [http-nio-] [ERROR] (o.a.r.r.a.ArtifactResource:292) - Could not retrieve list
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Connection reset by peer 

2019-04-22 06:52:44,712 [art-exec-1232043] [ERROR] (o.a.a.c.BasicStatusHolder:211) - Error occurred while performing folder replication for 'nuget-local': Connection reset