This script monitors nodes and reconnects any that are offline, unless they were disconnected manually.
You can disable the email notification if you do not need it. Thanks to the author of the email notification code.
You can run this script in the Jenkins Script Console, but it is a better idea to create a Jenkins job that runs it periodically (for example, hourly). The Groovy Plugin is required to run the script from a Jenkins job.
Also see Display Information About Nodes
import hudson.model.*
import hudson.node_monitors.*
import hudson.slaves.*
import java.util.concurrent.*
import javax.mail.internet.*
import javax.mail.*
import javax.activation.*

jenkins = Hudson.instance

// Send an email notification about an offline slave.
// Replace the upper-case placeholders with real values before running.
def sendMail (slave, cause) {
  message = slave + " slave is down. Check http://JENKINS_HOSTNAME:JENKINS_PORT/computer/" + slave + "\nBecause " + cause
  subject = slave + " slave is offline"
  toAddress = "JENKINS_ADMIN@YOUR_DOMAIN"
  fromAddress = "JENKINS@YOUR_DOMAIN"
  host = "SMTP_SERVER"
  port = "SMTP_PORT"

  Properties mprops = new Properties();
  mprops.setProperty("mail.transport.protocol","smtp");
  mprops.setProperty("mail.host",host);
  mprops.setProperty("mail.smtp.port",port);

  Session lSession = Session.getDefaultInstance(mprops,null);
  MimeMessage msg = new MimeMessage(lSession);

  //tokenize out the recipients in case they came in as a list
  StringTokenizer tok = new StringTokenizer(toAddress,";");
  ArrayList emailTos = new ArrayList();
  while(tok.hasMoreElements()){
    emailTos.add(new InternetAddress(tok.nextElement().toString()));
  }
  InternetAddress[] to = new InternetAddress[emailTos.size()];
  to = (InternetAddress[]) emailTos.toArray(to);
  msg.setRecipients(MimeMessage.RecipientType.TO,to);
  msg.setFrom(new InternetAddress(fromAddress));
  msg.setSubject(subject);
  msg.setText(message)

  // Transport.send is static and opens and closes its own connection.
  Transport.send(msg);
}

// Fetch the environment of a node, giving up after 2 seconds.
def getEnviron(computer) {
  def env
  def thread = Thread.start("Getting env from ${computer.name}", { env = computer.environment })
  thread.join(2000)
  if (thread.isAlive()) thread.interrupt()
  env
}

// A node counts as accessible if its PATH variable could be read.
def slaveAccessible(computer) {
  getEnviron(computer)?.get('PATH') != null
}

def numberOfflineNodes = 0
def numberNodes = 0
for (slave in jenkins.slaves) {
  def computer = slave.computer
  numberNodes ++
  println ""
  println "Checking computer ${computer.name}:"
  def isOK = (slaveAccessible(computer) && !computer.offline)
  if (isOK) {
    println "\t\tOK, got PATH back from slave ${computer.name}."
    println('\tcomputer.isOffline: ' + slave.getComputer().isOffline());
    println('\tcomputer.isTemporarilyOffline: ' + slave.getComputer().isTemporarilyOffline());
    println('\tcomputer.getOfflineCause: ' + slave.getComputer().getOfflineCause());
    println('\tcomputer.offline: ' + computer.offline);
  } else {
    numberOfflineNodes ++
    println " ERROR: can't get PATH from slave ${computer.name}."
    println('\tcomputer.isOffline: ' + slave.getComputer().isOffline());
    println('\tcomputer.isTemporarilyOffline: ' + slave.getComputer().isTemporarilyOffline());
    println('\tcomputer.getOfflineCause: ' + slave.getComputer().getOfflineCause());
    println('\tcomputer.offline: ' + computer.offline);
    sendMail(computer.name, slave.getComputer().getOfflineCause().toString())
    if (slave.getComputer().isTemporarilyOffline()) {
      // Leave nodes alone if somebody disconnected them manually.
      if (!slave.getComputer().getOfflineCause().toString().contains("Disconnected by")) {
        computer.setTemporarilyOffline(false, slave.getComputer().getOfflineCause())
      }
    } else {
      // Force a reconnect attempt.
      computer.connect(true)
    }
  }
}
println ("Number of Offline Nodes: " + numberOfflineNodes)
println ("Number of Nodes: " + numberNodes)
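The recovery branch at the end of the script boils down to a three-way decision. The sketch below restates it as a pure function so the behaviour can be checked without a Jenkins instance. The function name `decideAction` and the action labels are made up for illustration; they are not part of the script or of the Jenkins API.

```groovy
// Hypothetical helper mirroring the final if/else of the script.
// 'temporarilyOffline' stands for computer.isTemporarilyOffline(),
// 'causeText' for computer.getOfflineCause().toString().
def decideAction(boolean temporarilyOffline, String causeText) {
    if (temporarilyOffline) {
        // A cause containing "Disconnected by" is treated as a manual action.
        return causeText.contains('Disconnected by') ? 'LEAVE_ALONE' : 'BRING_ONLINE'
    }
    // Not marked temporarily offline: the script calls computer.connect(true).
    return 'RECONNECT'
}

println decideAction(true, 'Disconnected by admin')
println decideAction(true, 'Disk space is too low')
println decideAction(false, 'Connection was broken')
```

Note that in the script itself the email is sent before this decision is made, so a manually disconnected node still triggers a notification.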
10 Comments
Ilya I
I finally managed to get this script working, but this is the result I get:
I don't see any anomalies with slave "aurora". It works just fine. What can I do to fix the problem?
yury milchenko
That is strange. Whenever I get an ERROR, I always see: computer.isOffline: true
Ilya I
Well, I also find it strange. But I don't seem to have enough knowledge to identify the cause. My workaround is
If you wish to investigate, I can provide any details.
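One possible explanation, not confirmed in this thread: the script reports ERROR when the node is offline or when its environment could not be fetched within the 2000 ms `thread.join` timeout. A node that is online but slow to answer would therefore show ERROR together with `computer.isOffline: false`. The Jenkins-free sketch below reproduces only that timeout pattern. `fetchWithTimeout` and the closures passed to it are hypothetical stand-ins for `getEnviron` and `computer.environment`.

```groovy
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for getEnviron(): run 'fetch' in a thread, give up after 'timeoutMs'.
def fetchWithTimeout(long timeoutMs, Closure fetch) {
    def result = null
    def thread = Thread.start {
        try {
            result = fetch()
        } catch (InterruptedException ignored) {
            // Interrupted after the timeout: leave result as null.
        }
    }
    thread.join(timeoutMs)
    if (thread.isAlive()) thread.interrupt()
    result
}

// A node that answers immediately versus one that needs longer than the timeout.
def fast = fetchWithTimeout(2000) { -> [PATH: '/usr/bin'] }
def slow = fetchWithTimeout(100) { -> TimeUnit.MILLISECONDS.sleep(3000); [PATH: '/usr/bin'] }

println "fast node accessible: ${fast?.get('PATH') != null}"
println "slow node accessible: ${slow?.get('PATH') != null}"
```

If this is what happens on "aurora", raising the timeout in `getEnviron` would be one thing to try.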
Sudhakar Shanmugam
It's unable to restart the slave and ends up with the error below.
Checking computer node4:
ERROR: can't get PATH from slave node4.
computer.isOffline: true
computer.isTemporarilyOffline: false
computer.getOfflineCause: Connection was broken: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:35)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
at java.io.InputStream.read(InputStream.java:101)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at hudson.remoting.Command.readFrom(Command.java:92)
at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:70)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
computer.offline: true
Checking computer node5:
OK, got PATH back from slave node5.
computer.isOffline: false
computer.isTemporarilyOffline: false
computer.getOfflineCause: null
computer.offline: false
Number of Offline Nodes: 1
Number of Nodes: 5
Finished: SUCCESS
The slave is connected through JNLP:
java -jar slave.jar -noCertificateCheck -jnlpUrl https://node1:9443/computer/node4/slave-agent.jnlp -jnlpCredentials
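A likely reason, stated here as an assumption: with a JNLP launch the connection is opened from the agent machine (the `java -jar slave.jar` command above), so `computer.connect(true)` on the master may have nothing it can start. Jenkins exposes this through `Computer.isLaunchSupported()`. The sketch below separates offline nodes the master could relaunch from those that need a restart on the agent side. `splitOfflineNodes` and the stub data are hypothetical; in a system Groovy script the input would be the `slave.computer` objects.

```groovy
// Works on any objects exposing 'name', 'offline' and 'launchSupported'.
def splitOfflineNodes(computers) {
    def offline = computers.findAll { it.offline }
    [relaunchable: offline.findAll { it.launchSupported }.collect { it.name },
     agentSide   : offline.findAll { !it.launchSupported }.collect { it.name }]
}

// Stub data for illustration: node names come from the log above, the flags are assumed.
def stubs = [
    [name: 'node4', offline: true,  launchSupported: false],
    [name: 'node5', offline: false, launchSupported: true],
]
println splitOfflineNodes(stubs)
```

Nodes that end up in `agentSide` would need the agent process restarted on the slave machine itself, for example by a service wrapper or a scheduled task there.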
Jyothi Bosle
Hi, the script is unable to start the slave and ends up with the error below:
Checking computer testnode:
ERROR: can't get PATH from slave testnode.
computer.isOffline: true
computer.isTemporarilyOffline: false
computer.getOfflineCause: Connection was broken: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@1844460name=testnode
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
... 6 more
computer.offline: true
FATAL: Domain contains illegal character
javax.mail.internet.AddressException: Domain contains illegal character in string ``JENKINS_ADMIN@YOUR_DOMAIN''
at javax.mail.internet.InternetAddress.checkAddress(InternetAddress.java:1269)
at javax.mail.internet.InternetAddress.parse(InternetAddress.java:1091)
at javax.mail.internet.InternetAddress.parse(InternetAddress.java:633)
at javax.mail.internet.InternetAddress.<init>(InternetAddress.java:111)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:77)
at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:102)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:54)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:182)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:190)
at Script1.sendMail(Script1.groovy:35)
at Script1$sendMail.callCurrent(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:46)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
at Script1.run(Script1.groovy:88)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:650)
at groovy.lang.GroovyShell.evaluate(GroovyShell.java:636)
at hudson.plugins.groovy.SystemGroovy.perform(SystemGroovy.java:98)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
at hudson.model.Build$BuildExecution.build(Build.java:205)
at hudson.model.Build$BuildExecution.doRun(Build.java:162)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
at hudson.model.Run.execute(Run.java:1738)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:410)
BFA Scanning build for known causes...
BFA No failure causes found
BFA Done. 0s
Started calculate disk usage of build
Finished Calculation of disk usage of build in 0 seconds
Started calculate disk usage of workspace
Finished Calculation of disk usage of workspace in 0 seconds
Finished: FAILURE
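The `AddressException` in this trace is raised for the literal string `JENKINS_ADMIN@YOUR_DOMAIN`, which suggests the placeholder values in `sendMail` had not been replaced yet. Because `sendMail` runs before the reconnect attempt, the exception also stops the script before `computer.connect(true)` is reached. A small guard could report unset placeholders up front. `findPlaceholders` is a hypothetical helper, and the token list is simply the set of upper-case placeholders that appear in the script.

```groovy
// Hypothetical pre-flight check: list the settings that still contain placeholder tokens.
def findPlaceholders(Map settings) {
    def tokens = ['JENKINS_HOSTNAME', 'JENKINS_PORT', 'JENKINS_ADMIN',
                  'YOUR_DOMAIN', 'SMTP_SERVER', 'SMTP_PORT']
    settings.findAll { key, value ->
        tokens.any { token -> value.toString().contains(token) }
    }.keySet().toList()
}

def settings = [
    toAddress  : 'JENKINS_ADMIN@YOUR_DOMAIN',  // still the placeholder from the script
    fromAddress: 'jenkins@example.com',         // example value, not from the original
    host       : 'SMTP_SERVER',                 // still the placeholder from the script
    port       : '25',                          // example value, not from the original
]

def unset = findPlaceholders(settings)
if (unset) {
    println "Replace the placeholder values for: ${unset.join(', ')}"
}
```

Alternatively, removing the `sendMail(...)` call disables the notification and lets the script proceed to the reconnect step.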
Mahadevaswamy HN
Hi,
The same script worked for me, but it only lists ONLINE and OFFLINE nodes. It won't restart the OFFLINE nodes. Could you please provide the code for restarting OFFLINE nodes?
Thanks in advance.
Br,
Mahadev
Valentina Ancona
Hi guys,
With the latest upgrade of the Groovy plugin, "setTemporarilyOffline" throws a RejectedAccessException due to the new security fencing.
Has anyone found a solution for that?
Thanks.
Valentina
Pavel Kudrys
Hi folks,
I managed to get this script working on my Jenkins configuration. Now I would like to take it to a somewhat different level.
I have configured multiple jobs (one per slave), which automatically restart slaves every night. From time to time, one particular slave gets disconnected and the easiest solution is to manually restart the slave machine. And this is what I would like to do automatically, using the above script and my existing "slave restart" jobs.
My idea is to somehow get the list of "offline" slaves and trigger the "restart job" (or jobs) based on that list. I thought about using either the "Parameterized Trigger Plugin" or the "Inject environment variables" plugin. But I don't have a clue how to get the list of offline slaves (e.g. a comma-separated list of slave names) from the Groovy script and pass it to another build step. Any ideas? Thank you in advance.
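One way this could be approached, assuming the later build step can read a properties file: collect the names of the offline nodes, join them with commas, and write them to a file as a single key. The helper names, the key `OFFLINE_SLAVES` and the temporary file are all assumptions made for this sketch. In a system Groovy script the input would be the `slave.computer` objects, and the file would more usefully live in the job workspace.

```groovy
// Hypothetical helper: 'computers' is any collection of objects with 'name' and 'offline'.
def offlineList(computers) {
    computers.findAll { it.offline }.collect { it.name }.join(',')
}

// Writes KEY=value in properties format so a later build step can pick it up.
def writeOfflineList(File target, computers) {
    target.text = "OFFLINE_SLAVES=${offlineList(computers)}\n"
    target
}

// Stub data for illustration only.
def stubs = [
    [name: 'node4',  offline: true],
    [name: 'node5',  offline: false],
    [name: 'aurora', offline: true],
]
def out = writeOfflineList(File.createTempFile('offline-slaves', '.properties'), stubs)
println out.text
```

The downstream "restart" job would then need to split the value on commas and act on each name.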
sri devi
Hi,
Thank you for the script. I have a similar issue, but I want to do the same for a single slave (disconnect and reconnect) that is connected to the Jenkins master through JNLP. Please help me with the script, as I am new to Groovy scripting.
Thanks & Regards,
Sridevi
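A minimal sketch of the disconnect and reconnect sequence for one node, under stated assumptions: `bounce` is a hypothetical helper, and the `Expando` object only imitates a Jenkins computer so the sequence can be tried outside Jenkins. In a system Groovy script the real object would come from `jenkins.model.Jenkins.instance.getComputer('NODE_NAME')`, and the cause would be an `hudson.slaves.OfflineCause`, for example `new OfflineCause.ByCLI('scripted reconnect')`.

```groovy
// Hypothetical helper: disconnect one node, wait for that to finish, then reconnect.
def bounce(computer, cause) {
    // In Jenkins, disconnect() returns a Future; wait for it if one is returned.
    computer.disconnect(cause)?.get()
    // true asks for a fresh connection attempt, as in the script above.
    computer.connect(true)
}

// Stand-in for a Jenkins computer that just records what was called.
def calls = []
def fake = new Expando(
    disconnect: { cause -> calls << "disconnect (${cause})".toString(); null },
    connect   : { force -> calls << "connect (force=${force})".toString(); null }
)

bounce(fake, 'scripted reconnect')
println calls
```

For a slave connected through JNLP, the reconnect most likely has to be started from the slave machine, so on such a node this sketch may only achieve the disconnect half.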
Rakesh Nambiar
Is there a plugin that does this instead of the script?