How to re-cluster RabbitMQ HA for Xray HA

Loren Yeung
2019-07-08 22:43

Subject

Xray HA requires one RabbitMQ broker per node: if you have 2 Xray nodes, you will have 2 RabbitMQ brokers. During setup, if the required ports are not open or the machines cannot communicate properly, RabbitMQ may fail to cluster correctly, resulting in lost messages and an unusable Xray HA.

Affected Versions

Affects Xray version 2.0 and above

Details

Running $rabbitmqctl cluster_status shows that each node is its own cluster, rather than a member of one cluster of 2 RabbitMQ nodes. The cluster_name values do not match, and running_nodes lists only 1 node, the node itself:

Node 1
root@eplus-xray:/# rabbitmqctl cluster_status

Cluster status of node 'rabbit@eplus-xray' ...
[{nodes,[{disc,['rabbit@eplus-xray']}]},
 {running_nodes,['rabbit@eplus-xray']},
 {cluster_name,<<"rabbit@eplus-xray">>},
 {partitions,[]},
 {alarms,[{'rabbit@eplus-xray',[]}]}]

Node 2
root@eplus-xray-2:/# rabbitmqctl cluster_status

Cluster status of node 'rabbit@eplus-xray-2' ...
[{nodes,[{disc,['rabbit@eplus-xray-2']}]},
 {running_nodes,['rabbit@eplus-xray-2']},
 {cluster_name,<<"rabbit@eplus-xray-2">>},
 {partitions,[]},
 {alarms,[{'rabbit@eplus-xray-2',[]}]}]
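
To confirm this symptom from a script, one option is to count the entries in running_nodes. The following is a minimal sketch, assuming the older Erlang-term output format shown above (newer RabbitMQ releases print a different layout). count_running_nodes is a hypothetical helper name, not part of RabbitMQ.

```shell
#!/usr/bin/env bash
# Count the node names inside the running_nodes tuple of old-style
# (Erlang term) `rabbitmqctl cluster_status` output read from stdin.
count_running_nodes() {
  # Join all lines first so a list wrapped across lines still matches,
  # isolate the running_nodes list, then count the quoted node names.
  tr -d '\n' \
    | grep -o '{running_nodes,\[[^]]*]' \
    | grep -o "'[^']*'" \
    | wc -l \
    | tr -d ' '
}
```

Usage: rabbitmqctl cluster_status | count_running_nodes. On a correctly clustered 2-node setup the result should be 2; a result of 1 on both nodes matches the broken state described here.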

Errors such as the following may be logged when you edit the indexed repository list and save:

[2018/05/22 23:34:39 UTC] [EROR] (jfrog.com/xray/service/permission_service.(*PermissionService).updateOtherNodesAndUiClients:719) 
Failed to reload permissions cache on other nodes: timeout waiting for reply to sync action 'ReloadPermissionCache' from nodes [eplus-xray]
[2018/05/22 23:34:39 UTC] [EROR] (jfrog.com/xray/service/permission_service.(*PermissionService).RemoveResources:397) 
Failed to update other nodes / ui clients: timeout waiting for reply to sync action 'ReloadPermissionCache' from nodes [eplus-xray]
[2018/05/22 23:34:39 UTC] [EROR] (jfrog.com/xray/handlers/binary_managers.BinManagerHandler.SendRepos:490) 
Failed to remove specific repositories from permissions :timeout waiting for reply to sync action 'ReloadPermissionCache' from nodes [eplus-xray]

Resolution

1. Open the following ports so that the RabbitMQ nodes can communicate: 4369 (epmd, the discovery service used by RabbitMQ), 5672 (RabbitMQ), 25672 (RabbitMQ HA communication port), and 15672 (dashboard UI; optional, but helpful). See https://www.rabbitmq.com/clustering.html#selinux-ports
2. Ensure that each machine can reach all of those ports on the other machine via telnet (curl may not work, as RabbitMQ uses AMQP rather than HTTP).
3. Attempt to cluster the nodes again. Run the following on the 'secondary node', where <Hostname> is the short hostname of the primary node, which can be retrieved by running $hostname -s on the primary:
3a. stop rabbit app: $rabbitmqctl stop_app
3b. cluster: $rabbitmqctl join_cluster rabbit@<Hostname>
3c. restart app: $rabbitmqctl start_app
3d. remirror the queues: $rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'

If you get an error about the Erlang distribution being incorrect, the Erlang cookie at /var/lib/rabbitmq/.erlang.cookie does not match between the RabbitMQ nodes. The cookie contents must be identical on every node. Make sure they match (copy the cookie from the primary), then restart the service: $systemctl restart rabbitmq-server.service
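
One way to compare the cookies without displaying the secret itself is to compare a digest of the file on each node. This is a minimal sketch, assuming sha256sum is available and that the cookie is at the path named above; cookie_digest is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print only the SHA-256 digest of an Erlang cookie file, so the value
# can be compared across nodes without exposing the cookie contents.
# The default path is the one named in this article.
cookie_digest() {
  local cookie_file="${1:-/var/lib/rabbitmq/.erlang.cookie}"
  sha256sum "$cookie_file" | awk '{print $1}'
}
```

Run cookie_digest as a user that can read the cookie (typically root) on each node. If the printed digests differ, the cookies do not match.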

Then, similarly to the steps above, but with a reset before joining:
3a. stop rabbit app: $rabbitmqctl stop_app
3b. reset it: $rabbitmqctl reset
3c. cluster: $rabbitmqctl join_cluster rabbit@<Hostname>
3d. restart app: $rabbitmqctl start_app
3e. remirror the queues: $rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'
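
The sequence above can be wrapped in a small function so that it stops at the first failing command. This is a sketch rather than an official procedure: recluster is a hypothetical name, and the argument is assumed to be the short hostname of the primary node.

```shell
#!/usr/bin/env bash
# Re-join this (secondary) node to the primary's cluster.
# Usage: recluster <primary-short-hostname>
recluster() {
  local primary="${1:-}"
  if [ -z "$primary" ]; then
    echo "usage: recluster <primary-short-hostname>" >&2
    return 2
  fi
  rabbitmqctl stop_app || return 1
  # reset removes this node from any cluster and wipes its local broker data
  rabbitmqctl reset || return 1
  rabbitmqctl join_cluster "rabbit@${primary}" || return 1
  rabbitmqctl start_app || return 1
  # re-apply the mirroring policy used in this article
  rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'
}
```

For the hosts in this article, the call on the secondary would be recluster eplus-xray. Because reset discards the node's local RabbitMQ state, run it only on the node that is being re-joined.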

4. Check that $rabbitmqctl cluster_status on both nodes shows both RabbitMQ nodes running. In the RabbitMQ dashboard, check that multiple nodes are now listed and that the queue features show "ha-all" mirroring.
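
If the nodes still report separate clusters at this point, it may be worth repeating the port check from step 2 from each machine. The following sketch uses bash's built-in /dev/tcp redirection in place of telnet; probe and check_rabbit_ports are hypothetical helper names, and the 3-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Try to open a TCP connection to host:port; succeeds only if the
# connection is established within 3 seconds.
probe() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report the state of each RabbitMQ port listed in step 1 for one host.
# Returns non-zero if any port could not be reached.
check_rabbit_ports() {
  local host="$1" port failed=0
  for port in 4369 5672 25672 15672; do
    if probe "$host" "$port"; then
      echo "$port open"
    else
      echo "$port unreachable"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, running check_rabbit_ports eplus-xray-2 on the first node and check_rabbit_ports eplus-xray on the second covers both directions. Since 15672 is optional, an unreachable result for that port alone may be acceptable.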

You can also run a DB sync and follow its progress by accessing each server's UI directly. The progress bars should update on both servers at (nearly) the same time.

Both nodes will now report the same cluster membership:

# rabbitmqctl cluster_status
Cluster status of node 'rabbit@eplus-xray-2' ...
[{nodes,[{disc,['rabbit@eplus-xray','rabbit@eplus-xray-2']}]},
 {running_nodes,['rabbit@eplus-xray','rabbit@eplus-xray-2']},
 {cluster_name,<<"rabbit@eplus-xray">>},
 {partitions,[]},
 {alarms,[{'rabbit@eplus-xray',[]},{'rabbit@eplus-xray-2',[]}]}]