Disconnection problems with Network Rendering 6.2

Started by Nicklas Holmgren, September 29, 2016, 03:21:16 AM



Nicklas Holmgren

Hi!

I am having various problems with our Keyshot Network Rendering setup. Our circumstances aren't ideal, since budget constraints have left us with mismatched program versions:

Master computer:
Win7, Keyshot 4.3 Pro, Network Rendering 6.2 (have Keyshot 6 installed but not Pro version so can't queue animations from that version).
Master computer also has slave running.

Slaves:
Win10, Network Rendering 6.2, Keyshot 5 HD.

I only need to draw CPU power from the slave computers; queueing jobs from them is not relevant to me. Everything I do is queued from the master computer. We had Keyshot Network Rendering 5.3 running for a full year without any issues, but with the license renewal we were forced to upgrade to 6.2, and that is when all the issues started.

Now, the problems. Sometimes it works fine for a whole day, but very often problems start after one of the slave computers has been set down to 2 or 4 cores while people are using it. When the slave computer is turned back up to All cores at the end of the day, it starts an endless cycle of disconnections: it connects with 8 cores, and after the "preparing" step it drops off the list of slaves, then repeats this forever. The same is true of the slave app on the master computer, which means the master can't connect to itself.

I can fix this by spending approximately one to two hours every time restarting all the computers and applications: watchdogs, master, queue, slave, slave tray etc. I restart everything in different orders and eventually, after more than 20 attempts, it starts working correctly for another day or so. The order doesn't seem to matter. Going into the config tool and shutting down and restarting the service there doesn't help either. The same problem happens just as often if you try to set a slave down from All cores to something lower: it starts the repeating disconnects and displays 8 cores every time it pops in, even though the tray is set to 4, for instance.

That's one thing. Another is the queue apparently losing connection. On the master computer, the queue app will sometimes stop displaying what's actually going on, and checking the queue on a different computer shows progress having gone further than what is shown on the master computer. Checking the actual frames saved to the hard drive indicates that the queue app on the master computer is the one with the wrong information. It's as if the queue app on the master computer loses its connection to the master, which, again, is on the same computer.

Everything is run as admin on all computers, on the same subnet, with an admin account. I have tried both autodetect and manual host settings, and it works equally badly in both cases. The keyshot6_network_master.exe DOS window pretty much constantly displays these as its final rows:
"Could not listen on state port: 4777"
"Could not listen on state port: 4778"

Even when service is performing as it should.

I know little to nothing about how to fix networking issues. Could it theoretically be a firewall blocking the ports? If anyone recognizes these issues please drop me a line. I'm losing 2 hours of sleep every night while working toward a sharp deadline with no room for delay.
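For anyone hitting the same "Could not listen on state port" message, one quick check is whether something else on the machine is already holding the port. The following is a minimal Python sketch, not anything shipped with Keyshot: the function name `can_listen` is made up for illustration, and it assumes the state ports are plain TCP ports, which the log message does not confirm.

```python
import socket


def can_listen(port, host="0.0.0.0"):
    """Return True if this process can open a TCP listening socket on (host, port).

    A False result usually means another process already holds the port.
    Assumption: the state port is TCP; the log message does not say so.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            sock.listen(1)
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # Ports taken from the two log lines quoted above.
    for port in (4777, 4778):
        state = "free" if can_listen(port) else "already in use or not bindable"
        print(f"port {port}: {state}")
```

Run it with the network rendering services stopped. If a port still reports as in use, some other process on that machine is listening on it.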

Edit: Additional problems noticed during the day:
Sending a small job to the queue takes about an hour to reach 100%. The job is 100 frames and the job file is 3 MB. I am sending it from Keyshot 4.3 to the master, which is on the same computer. I did it once already and it failed. The firewall is now completely turned off.

Edit 2: The job mentioned above failed at close to 100% - "An error occurred deleting job 421". This is the only mention of job 421 in the logs:
to sep 29 17:56:34 2016: [II] JobManager::savePriorities()
to sep 29 17:56:34 2016: [EE] Could not open region state file for writing: "C:/KeyshotFiles/KS6resourcefiles/Master/421/regions.bjson"


Before the upgrade to Network 6.2 it would render animations frame by frame in chronological order, which at least made it easy to back up the frames and cancel a job when needed in order to restart and re-queue jobs. Now it renders, for example, frames 1-8, 11-56, 82-117 etc. It leaves gaps and seems to go only semi-chronologically. I don't know if that is by design or a symptom of the major issues we're having. As a secondary effect, it makes it hellish to handle backups of jobs now that we are having these issues.
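When frames arrive out of order like this, a small script can at least show which frames are done and which are still missing before a job is cancelled and re-queued. This is only a sketch: `summarize_frames` and `missing_frames` are hypothetical helper names, and it assumes you have already collected the finished frame numbers as integers, for example from the output file names.

```python
def summarize_frames(frames):
    """Collapse frame numbers into a compact string such as '1-8, 11-56'."""
    ordered = sorted(set(frames))
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for n in ordered[1:]:
        if n == prev + 1:
            # Still inside the current consecutive run.
            prev = n
            continue
        ranges.append((start, prev))
        start = prev = n
    ranges.append((start, prev))
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def missing_frames(frames, first, last):
    """Return the sorted frame numbers in [first, last] that are not in frames."""
    done = set(frames)
    return [n for n in range(first, last + 1) if n not in done]
```

Passing the result of `missing_frames` to `summarize_frames` gives the ranges that still need to be rendered.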

Thank you,
Nicklas

Nicklas Holmgren

Update:

I was helped by great support from Luxion! If anyone happens upon issues like this in the future try the following.

The problem was twofold:
1. Using Network Rendering 6 with Keyshot 4 creates elevated levels of log entries, to the point where the master had no time to manage the queue because it was busy maintaining around 500 MB of constantly updating txt files. Solution: In the network configurator, turn off log entries or set the level to Critical or above. The log dump is on the Warning tier, so anything above that resolved it.
2. Ghosts of previous Network Slaves were running on some machines. Solution: Just end all the tasks and delete the old program folders.
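To see whether logging has grown to this kind of size on your own master, you can total up the text files in the log folder. The sketch below is an assumption-heavy illustration: `log_usage` is a made-up helper, and the folder you pass in is whatever location your installation writes its logs to, which is not stated in this thread.

```python
from pathlib import Path


def log_usage(log_dir, pattern="*.txt"):
    """Return (total_bytes, files) for files matching pattern under log_dir.

    files is a list of (path, size_in_bytes) tuples, largest first.
    """
    entries = [
        (p, p.stat().st_size)
        for p in Path(log_dir).rglob(pattern)
        if p.is_file()
    ]
    entries.sort(key=lambda item: item[1], reverse=True)
    total = sum(size for _, size in entries)
    return total, entries
```

A total in the hundreds of megabytes, as in the case above, would suggest lowering the log level in the network configurator.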

Thanks to Luxion for the help!