That's a total bummer. We have not had this issue with the 6 Boxx machines we have linked.
I am curious though, do you have all of your cores on each machine set to be workers?
One thing I have found helpful is to choke back the number of cores available as workers, which gives each machine some breathing room while it handles tasks such as 'transferring job' or 'stitching'. I don't know if it is actually helping, but we have had more successes and fewer failures since we implemented that.
For example:
Boxx 1 (Designated as Manager) 20 out of 24 cores available as worker
Boxx 2 (Worker only) 22 out of 24 cores available as worker
Boxx 3 (Worker only) 18 out of 20 cores available as worker
etc.
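
If it helps to work out the numbers for the rest of your machines, here is a rough Python sketch of the rule of thumb I am using. The reserve sizes (4 cores held back on the manager, 2 on each worker) are just what falls out of my example numbers above, and the function and variable names are made up for illustration; none of this is a setting from the software itself.

```python
# Hypothetical helper: reserve sizes are inferred from the example above,
# not taken from any actual setting in the software.
MANAGER_RESERVE = 4  # cores held back on the machine that also acts as manager
WORKER_RESERVE = 2   # cores held back on worker-only machines


def worker_cores(total_cores: int, is_manager: bool) -> int:
    """Return how many cores to expose as workers, leaving some headroom."""
    reserve = MANAGER_RESERVE if is_manager else WORKER_RESERVE
    # Always leave at least one core as a worker, even on very small machines.
    return max(1, total_cores - reserve)


# (name, total cores, is this the manager?)
machines = [
    ("Boxx 1", 24, True),
    ("Boxx 2", 24, False),
    ("Boxx 3", 20, False),
]

for name, total, is_manager in machines:
    cores = worker_cores(total, is_manager)
    print(f"{name}: {cores} out of {total} cores available as worker")
```

You would still enter the resulting numbers by hand on each machine; this only does the arithmetic.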