Monitor and Restart Offline Slaves

This script monitors offline nodes and restarts them, unless they were disconnected manually.


You can, of course, disable the email notification. Thanks to the original author of the email notification code.
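One way to make the notification optional is to route the call through a small guard. The sketch below also catches mail errors, because the script calls sendMail before it attempts the restart, so an exception in sendMail ends the run early. The names `notifyEnabled` and `tryNotify` are hypothetical and not part of the script on this page; the closures stand in for the real sendMail call.

```groovy
// Hypothetical switch: set to false to silence notifications entirely.
def notifyEnabled = true

// Runs the notification closure if enabled.
// Returns true if it ran without throwing, false otherwise.
def tryNotify(boolean enabled, Closure action) {
    if (!enabled) return false
    try {
        action.call()
        return true
    } catch (Exception e) {
        // Log and carry on so the caller can still try to reconnect the node.
        println "WARN: notification failed, continuing: ${e.message}"
        return false
    }
}

// Stand-ins for sendMail(...) so the sketch runs on its own.
def ok = tryNotify(notifyEnabled) { println "mail would be sent here" }
def failed = tryNotify(notifyEnabled) { throw new IllegalArgumentException("bad address") }
println "ok=${ok}, failed=${failed}"
```

In the script itself, the call would then read something like `tryNotify(notifyEnabled) { sendMail(computer.name, computer.offlineCause.toString()) }`.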

You can run this script in the Jenkins Script Console, but it is a better idea to create a Jenkins job that runs it periodically (for example, hourly). Running it from a job requires the Groovy Plugin.

Also see Display Information About Nodes

import hudson.model.*
import hudson.node_monitors.*
import hudson.slaves.*
import java.util.concurrent.*
import javax.mail.*
import javax.mail.internet.*
import javax.activation.*

// Script-wide handle to the Jenkins instance
jenkins = Hudson.instance


def sendMail (slave, cause) {
 // Replace the UPPER_CASE placeholders with real values before running the script.
 def message = slave + " slave is down. Check http://JENKINS_HOSTNAME:JENKINS_PORT/computer/" + slave + "\nBecause " + cause
 def subject = slave + " slave is offline"
 def toAddress = "JENKINS_ADMIN@YOUR_DOMAIN"   // several recipients may be separated by ";"
 def fromAddress = "JENKINS@YOUR_DOMAIN"
 def host = "SMTP_SERVER"
 def port = "SMTP_PORT"

 Properties mprops = new Properties();
 mprops.setProperty("mail.transport.protocol","smtp");
 mprops.setProperty("mail.host",host);
 mprops.setProperty("mail.smtp.port",port);

 // getInstance always uses mprops; getDefaultInstance may return a shared session created elsewhere
 Session lSession = Session.getInstance(mprops);
 MimeMessage msg = new MimeMessage(lSession);

 //tokenize out the recipients in case they came in as a list
 StringTokenizer tok = new StringTokenizer(toAddress,";");
 ArrayList emailTos = new ArrayList();
 while(tok.hasMoreElements()){
   emailTos.add(new InternetAddress(tok.nextElement().toString()));
 }
 InternetAddress[] to = new InternetAddress[emailTos.size()];
 to = (InternetAddress[]) emailTos.toArray(to);
 msg.setRecipients(MimeMessage.RecipientType.TO,to);
 msg.setFrom(new InternetAddress(fromAddress));
 msg.setSubject(subject);
 msg.setText(message)

 // Transport.send is static: it opens, uses and closes its own connection
 Transport.send(msg);
}


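// Fetch the node's environment on a separate thread, waiting at most 2 seconds.
// Returns null if the node does not answer in time.
def getEnviron(computer) {
   def env
   def thread = Thread.start("Getting env from ${computer.name}", { env = computer.environment })
   thread.join(2000)
   if (thread.isAlive()) thread.interrupt()
   env
}

// A node counts as accessible if its environment came back and contains PATH.
def slaveAccessible(computer) {
   getEnviron(computer)?.get('PATH') != null
}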


def numberOfflineNodes = 0
def numberNodes = 0
for (slave in jenkins.slaves) {
   def computer = slave.computer
   numberNodes++
   println ""
   println "Checking computer ${computer.name}:"
   def isOK = (slaveAccessible(computer) && !computer.offline)
   if (isOK) {
     println "\t\tOK, got PATH back from slave ${computer.name}."
   } else {
     numberOfflineNodes++
     println "  ERROR: can't get PATH from slave ${computer.name}."
   }
   // Print the node state in both cases
   println('\tcomputer.isOffline: ' + computer.isOffline())
   println('\tcomputer.isTemporarilyOffline: ' + computer.isTemporarilyOffline())
   println('\tcomputer.getOfflineCause: ' + computer.getOfflineCause())
   println('\tcomputer.offline: ' + computer.offline)
   if (!isOK) {
     sendMail(computer.name, computer.getOfflineCause().toString())
     if (computer.isTemporarilyOffline()) {
       // Leave nodes alone that a user took offline manually ("Disconnected by ...")
       if (!computer.getOfflineCause().toString().contains("Disconnected by")) {
         computer.setTemporarilyOffline(false, computer.getOfflineCause())
       }
     } else {
       // Not marked temporarily offline: try to reconnect
       computer.connect(true)
     }
   }
}
println ("Number of Offline Nodes: " + numberOfflineNodes)
println ("Number of Nodes: " + numberNodes)
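The script's slaveAccessible check fails both when the node does not answer within the 2-second wait and when it answers without a PATH entry, so the `ERROR: can't get PATH` line cannot tell the two cases apart. The following is a minimal sketch of a timed call that reports which case occurred. `fetchWithTimeout` is a hypothetical helper, not part of the script, and the closures at the bottom stand in for `computer.environment` so the sketch runs outside Jenkins.

```groovy
import java.util.concurrent.Callable
import java.util.concurrent.FutureTask
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException

// Runs `fetch` on a worker thread and waits at most `timeoutMillis`.
// Returns a map: [value: <result or null>, timedOut: <true|false>].
def fetchWithTimeout(Closure fetch, long timeoutMillis) {
    def task = new FutureTask(fetch as Callable)
    def worker = new Thread(task, "fetch-with-timeout")
    worker.daemon = true
    worker.start()
    try {
        return [value: task.get(timeoutMillis, TimeUnit.MILLISECONDS), timedOut: false]
    } catch (TimeoutException ignored) {
        task.cancel(true)   // interrupts the worker thread
        return [value: null, timedOut: true]
    }
}

// Stand-ins for `computer.environment`: one quick answer, one that hangs.
def fast = fetchWithTimeout({ -> [PATH: '/usr/bin'] }, 2000)
def slow = fetchWithTimeout({ -> Thread.sleep(5000); [PATH: '/usr/bin'] }, 200)
println "fast: timedOut=${fast.timedOut}, PATH=${fast.value?.get('PATH')}"
println "slow: timedOut=${slow.timedOut}"
```

Inside the script, `fetchWithTimeout({ -> computer.environment }, 2000)` would play the role of getEnviron, and printing `timedOut` next to the other status lines would show whether a slow answer, rather than a missing PATH, caused the error.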

9 Comments

  1. I finally managed to get this script working, but this is the result I get:

    07:52:05 Building on master in workspace /var/lib/jenkins/workspace/check nodes
    07:52:05
    07:52:05 Checking computer demo:
    07:52:05 		OK, got PATH back from slave demo.
    07:52:05 	computer.isOffline: false
    07:52:05 	computer.isTemporarilyOffline: false
    07:52:05 	computer.getOfflineCause: null
    07:52:05 	computer.offline: false
    07:52:05
    07:52:05 Checking computer production:
    07:52:05 		OK, got PATH back from slave production.
    07:52:05 	computer.isOffline: false
    07:52:05 	computer.isTemporarilyOffline: false
    07:52:05 	computer.getOfflineCause: null
    07:52:05 	computer.offline: false
    07:52:05
    07:52:05 Checking computer aurora:
    07:52:05   ERROR: can't get PATH from slave aurora.
    07:52:05 	computer.isOffline: false
    07:52:05 	computer.isTemporarilyOffline: false
    07:52:05 	computer.getOfflineCause: null
    07:52:05 	computer.offline: false
    07:52:06 Number of Offline Nodes: 1
    07:52:06 Number of Nodes: 3
    07:52:06 Finished: SUCCESS
    

    I don't see any anomalies with slave "aurora". It works just fine. What can I do to fix the problem?

    1. That is strange. Whenever I get ERROR, I always see computer.isOffline: true

      1. Well, I also find it strange, but I don't have enough knowledge to identify the cause. My workaround is

        // def isOK = (slaveAccessible(computer) && !computer.offline)
          def isOK = (!computer.offline)
        

        If you wish to investigate, I can provide any details.

        1. It is unable to restart the slave and ends up with the error below.

          Checking computer node4:
          ERROR: can't get PATH from slave node4.
          computer.isOffline: true
          computer.isTemporarilyOffline: false
          computer.getOfflineCause: Connection was broken: java.io.IOException: Connection reset by peer
          at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
          at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
          at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
          at sun.nio.ch.IOUtil.read(IOUtil.java:197)
          at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
          at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:35)
          at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
          at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
          at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
          at java.io.InputStream.read(InputStream.java:101)
          at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
          at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
          at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
          at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
          at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
          at hudson.remoting.Command.readFrom(Command.java:92)
          at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:70)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

          computer.offline: true

          Checking computer node5:
          OK, got PATH back from slave node5.
          computer.isOffline: false
          computer.isTemporarilyOffline: false
          computer.getOfflineCause: null
          computer.offline: false
          Number of Offline Nodes: 1
          Number of Nodes: 5
          Finished: SUCCESS

          The slave is connected through JNLP:

          java -jar slave.jar -noCertificateCheck -jnlpUrl https://node1:9443/computer/node4/slave-agent.jnlp -jnlpCredentials

  2. Hi, the script is unable to start the slave and ends up with the error below:

    Checking computer testnode:
    ERROR: can't get PATH from slave testnode.
    computer.isOffline: true
    computer.isTemporarilyOffline: false
    computer.getOfflineCause: Connection was broken: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@1844460name=testnode
    at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
    Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
    at sun.nio.ch.SocketDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(Unknown Source)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
    at sun.nio.ch.IOUtil.read(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
    at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
    at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
    at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
    ... 6 more

    computer.offline: true
    FATAL: Domain contains illegal character
    javax.mail.internet.AddressException: Domain contains illegal character in string ``JENKINS_ADMIN@YOUR_DOMAIN''
    at javax.mail.internet.InternetAddress.checkAddress(InternetAddress.java:1269)
    at javax.mail.internet.InternetAddress.parse(InternetAddress.java:1091)
    at javax.mail.internet.InternetAddress.parse(InternetAddress.java:633)
    at javax.mail.internet.InternetAddress.<init>(InternetAddress.java:111)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:77)
    at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:102)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:54)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:182)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:190)
    at Script1.sendMail(Script1.groovy:35)
    at Script1$sendMail.callCurrent(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:46)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
    at Script1.run(Script1.groovy:88)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:650)
    at groovy.lang.GroovyShell.evaluate(GroovyShell.java:636)
    at hudson.plugins.groovy.SystemGroovy.perform(SystemGroovy.java:98)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
    at hudson.model.Build$BuildExecution.build(Build.java:205)
    at hudson.model.Build$BuildExecution.doRun(Build.java:162)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
    at hudson.model.Run.execute(Run.java:1738)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:410)
    BFA Scanning build for known causes...
    BFA No failure causes found
    BFA Done. 0s
    Started calculate disk usage of build
    Finished Calculation of disk usage of build in 0 seconds
    Started calculate disk usage of workspace
    Finished Calculation of disk usage of workspace in 0 seconds
    Finished: FAILURE

  3. Hi,

    The same script worked for me, but it only lists ONLINE and OFFLINE nodes. It does not restart the OFFLINE nodes. Could you please provide the code for restarting OFFLINE nodes?

    Thanks in advance.

    Br,

    Mahadev

  4. Hi guys,

    With the latest upgrade of the Groovy plugin, "setTemporarilyOffline" throws a RejectedAccessException due to the new security fencing.

    Has anyone found a solution for that?

    Thanks.

    Valentina

  5. Hi folks,

    I managed to get this script working on my Jenkins configuration. Now I would like to take it to a somewhat different level.

    I have configured multiple jobs (one per slave), which automatically restart slaves every night. From time to time, one particular slave gets disconnected and the easiest solution is to manually restart the slave machine. And this is what I would like to do automatically, using the above script and my existing "slave restart" jobs.

    My idea is to somehow get the list of "offline" slaves and trigger the "restart job" (or jobs) based on that list. I thought about using either the "Parametrized Trigger Plugin" or the "Inject environment variables" plugin, but I don't have a clue how to get the list of offline slaves (e.g. a comma-separated list of slave names) from the Groovy script and pass it to another build step. Any ideas? Thank you in advance.

  6. Hi,

    Thank you for the script. I have a similar issue, but I want to do the same for a single slave (disconnect and reconnect) that is connected to the Jenkins master through JNLP. Please help me with the script, as I am new to Groovy scripting.

    Thanks & Regards,

    Sridevi 
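On the question in comment 5 about producing a comma-separated list of offline slaves: a minimal sketch, assuming a hypothetical helper `offlineNodeNames` that accepts anything exposing `name` and `offline` properties. The sample maps are stand-in data so the sketch runs outside Jenkins. Inside the script, the computers of `jenkins.slaves` could be passed instead.

```groovy
// Returns the names of all offline entries as one comma-separated string.
// Works on any objects (or maps) exposing `name` and `offline`.
def offlineNodeNames(nodes) {
    nodes.findAll { it.offline }.collect { it.name }.join(',')
}

// Stand-in data; in Jenkins this would be the computers of jenkins.slaves.
def sample = [
    [name: 'demo',   offline: false],
    [name: 'aurora', offline: true],
    [name: 'node4',  offline: true],
]

// A KEY=value line is a common way to hand a value to a later build step.
println "OFFLINE_NODES=" + offlineNodeNames(sample)   // prints OFFLINE_NODES=aurora,node4
```

How the value reaches the next build step depends on the plugin used. Writing the `OFFLINE_NODES=...` line to a properties file in the workspace is one option to check against the documentation of the plugins mentioned in the comment.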

     
