Jenkins remoting may fail to maintain the connection between the master and a slave, and in such cases it can report a cryptic stack trace in the build console or the Jenkins log. This page gives some guidance on collecting adequate information to help diagnose the problem.
- What your users saw: normally, the people whose builds failed are the first to notice the problem, as they see exceptions show up in the build log. Did the build terminate abnormally? Did it report an EOF?
- Failure mode on the master: on the master, the details of the connection loss are captured in the slave log file, not the build log. So you'll have to go find the archived slave launch logs under
$JENKINS_HOME/logs/slaves/*/slave.log*, which record the 10 most recent connections to that slave. This information is also captured in the support bundle from the Support Core Plugin under
/nodes/slave/*/launchLog/*. If you look at the bottom of these files, you can see how the last 10 connections to the slave have ended.
- Failure mode on the client: collect the slave node's jenkins-*.out.log and jenkins-*.err.log. This is where the stdout and stderr of a JNLP slave go. See what error is reported there. It could be "java.net.SocketTimeoutException: Read timed out", an OutOfMemoryError, a HotSpot JVM crash dump, etc.
- Determine which side is failing first. A "read timed out" is not fatal as far as TCP is concerned, but a TCP reset is. This lets us know which side detected the problem first.
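The log-collection steps above can be sketched as shell commands. The node name and slave-side log locations are illustrative; adjust them to your installation:

```shell
# Hypothetical defaults; point these at your actual installation.
JENKINS_HOME="${JENKINS_HOME:-/var/lib/jenkins}"
SLAVE_NAME="${SLAVE_NAME:-my-slave}"   # assumed node name

# Master side: list and inspect the archived slave launch logs.
ls -lt "$JENKINS_HOME/logs/slaves/$SLAVE_NAME"/slave.log* 2>/dev/null | head
tail -n 50 "$JENKINS_HOME/logs/slaves/$SLAVE_NAME/slave.log" 2>/dev/null

# Slave side: scan the JNLP slave's stdout/stderr logs for common failures.
grep -nE 'SocketTimeoutException|OutOfMemoryError|EOF' \
    jenkins-*.out.log jenkins-*.err.log 2>/dev/null || true
```

Comparing timestamps between the master-side and slave-side hits is usually enough to tell which end detected the failure first.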
Disable ping thread and get to the root cause
One common cause of connection loss is a forced connection shutdown by the ping thread. When the ping thread detects that it's not getting a reply back within 4 minutes, it terminates the connection to prevent an infinite hang. This leaves a message to that effect in the slave launch log.
If you've identified that this is how your connection is lost, then you need to dig deeper and understand what's causing the ping reply to be so late. This diagnosis usually requires disabling the ping thread, so that you can let a slave hang and exhibit the problem without the channel getting killed. Then use tools like
jstack to obtain diagnostic information from the slave JVM.
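As a sketch of the jstack step, something like the following can be run on the slave host while it is hung. It assumes the slave was started from slave.jar; adjust the pattern to match your slave's command line. (Disabling the ping thread itself is done via a system property on the master; the exact property name has varied across Jenkins versions, so check the documentation for your version rather than copying one from here.)

```shell
# Locate the slave JVM with the JDK's jps, then capture a thread dump.
# The 'slave\.jar|remoting' pattern is an assumption about how the
# slave process was launched.
PID=$(jps -l 2>/dev/null | awk '/slave\.jar|remoting/ {print $1; exit}')
if [ -n "$PID" ]; then
    jstack -l "$PID" > "slave-threaddump-$(date +%s).txt"
else
    echo "no slave JVM found by jps" 1>&2
fi
```

Take several dumps a few seconds apart; a thread that stays in the same non-network frame across dumps is the one holding up the ping reply.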
If you see that one side is hanging on a write to the network while the other side is hanging on a read from the network, it indicates a network communication problem. For example, if you unplug the network cable to the slave, TCP packets will fail to arrive in a timely fashion, and the ping timeout fires before TCP itself times out.
Pay particular attention to what
SynchronousCommandTransport$ReaderThread.run is doing, as this is the thread responsible for reading remoting requests that arrive from the network and putting the tasks into other threads' queues. If this thread is blocked on anything other than reading from the network, it should be filed as a bug.
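To check this in a captured thread dump, grep for the class name. The dump file name here is a placeholder for whatever jstack wrote:

```shell
# Show the remoting reader thread's stack frames from a jstack dump.
# "slave-threaddump.txt" is a hypothetical file name.
grep -A 10 'SynchronousCommandTransport' slave-threaddump.txt 2>/dev/null || true
# In a healthy channel the frames under it end in a blocking socket read,
# e.g. java.net.SocketInputStream.socketRead0. A lock wait, a write, or
# application code in this thread is worth reporting as a bug.
```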
SSH Slaves plugin v1.9 contains improvements that help diagnose this connection loss problem. Users who are seeing the problem are encouraged to upgrade to this version.
To better understand the failure mode of an SSH slave connection, it is helpful to learn the layering of protocols in SSH.
- At the base level, an SSH connection is a TCP connection. This is referred to as the "SSH connection."
- On top of that, there is an SSH session: a data stream inside an SSH connection. In the Jenkins context, a single SSH session inside an SSH connection is connected to the slave process and controls it.
- On top of that, there is the Jenkins remoting channel. This Jenkins protocol implements the general communication framework between two Java processes (the Jenkins master and the slave).
When a connection to an SSH slave is lost, the first thing you need to figure out is which layer the failure originated from. The following three are the major sources:
- Ping timeout. See the section above for more discussion of ping problems.
- Loss of the SSH socket connection due to a TCP problem. For example, if sshd or a firewall in between sends a TCP RESET packet to abort the TCP connection, it fails in this way. SSH channels and remoting run inside the SSH connection, so when the SSH connection is lost, they are lost as well. We think this is primarily a network problem, but we are still gathering the various failure modes that fall into this category. Look for the "Socket connection to SSH server was lost" message, which indicates that the failure came from the socket layer. One way to reproduce this failure mode is to "kill -9" sshd on the slave.
- Abrupt loss of the slave JVM process. If the slave experiences a violent JVM crash (such as a segfault, "kill -9", or hitting a GC limit), the slave process quits and the remoting channel is lost, but the SSH channel and the SSH connection stay intact. The SSH channel will report back any dying message from the slave process (such as a HotSpot crash dump printed to stdout), and it will also subsequently report the exit code of the slave process. The Jenkins master waits up to 3 seconds for this to come in, then closes down the SSH connection. Look for the "Slave JVM has terminated" message, which indicates this failure mode; a segfaulting slave is one example.
When the slave process is killed with "kill -9", the message additionally contains "Exit signal=KILL".
There was a bug in remoting that prevented stdout from the slave process from being captured on the master. Kohsuke hasn't determined exactly when this regression was introduced, but he thinks it was around remoting 2.38 (which appeared in Jenkins 1.560). You can tell whether your connection is affected by this bug by looking for the message "Evacuated stdout" in the slave launch log. If you do not see this message, you are affected. This problem was resolved in remoting 2.48 (which will appear in Jenkins 1.590). If you are running an affected version of Jenkins, you will not see the HotSpot crash dump message.
If you use the option "Launch agent via execution of command on the master", be aware that the input and output streams of this command are used for the communication between the master and the remote slave. Do not pollute the output stream with additional output, and do not consume anything from the input stream.
Usually, the launcher command will be a shell script on the server that does some preparation and then uses ssh to execute slave.jar on the remote slave.
During the preparation, avoid additional ssh invocations to the remote slave, because ssh will consume bytes from the input stream even if the remote command does not. See this Stack Overflow answer for details and workarounds.
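The hazard can be demonstrated without ssh at all. In this sketch, `head` stands in for a preparation command that reads stdin, and the `< /dev/null` redirection is exactly what ssh's `-n` option does:

```shell
# A preparation step that reads stdin swallows bytes meant for remoting:
printf 'remoting-handshake\n' | {
    head -n 1 > /dev/null   # stands in for a stray "ssh slave prep-cmd"
    cat                     # the real slave command now sees nothing
}

# Fix: starve the preparation step of stdin (ssh -n does the same thing):
printf 'remoting-handshake\n' | {
    head -n 1 < /dev/null > /dev/null   # reads nothing from the pipe
    cat                                 # handshake arrives intact
}
```

In a real launcher script this means writing preparation steps as `ssh -n host cmd` (or `cmd < /dev/null`), sending diagnostics to stderr, and making the final `ssh host java -jar slave.jar` the only command that touches stdin and stdout.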