Jenkins Infrastructure : 20150903 Wiki Outage

Confluence became unresponsive after running well for 8 weeks. Socket connections to wiki.jenkins-ci.org are not being picked up.

The Apache mod_status output (status.html) suggests that request handling is stuck in Confluence. Note that none of the pending requests are for rendering wiki pages, which seems to rule out the cache layer as the cause.
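As a rough illustration of how a mod_status scoreboard can be read for this kind of check, here is a minimal Python sketch. It assumes the scoreboard string has been copied out of the status page by hand; the function names are hypothetical and are not part of any tooling used in this incident. The state letters follow the scoreboard key that mod_status prints.

```python
from collections import Counter

# Scoreboard key as printed by Apache mod_status.
SCOREBOARD_KEY = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def summarize_scoreboard(scoreboard):
    """Count how many slots are in each state, keyed by description."""
    counts = Counter(scoreboard.strip())
    return {
        SCOREBOARD_KEY.get(state, "unknown (%s)" % state): count
        for state, count in counts.items()
    }


def fraction_in_state(scoreboard, state="W"):
    """Fraction of allocated workers (open slots excluded) in one state."""
    workers = [ch for ch in scoreboard.strip() if ch != "."]
    if not workers:
        return 0.0
    return sum(1 for ch in workers if ch == state) / len(workers)
```

A high fraction of workers in "W" (sending reply) with few idle workers would be consistent with requests waiting on the backend, although the scoreboard alone does not prove which backend is responsible.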

The tail of stdout (stacktrace.txt) suggests that the JVM heap filled up. See the heap summary at the end and the "GC overhead limit exceeded" message close to the top.
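For future outages, a quick way to locate such messages in a captured stdout file is to scan for `java.lang.OutOfMemoryError` lines. The sketch below is only an illustration under that assumption; `find_oom_lines` is a hypothetical helper name, and the pattern covers only the standard JVM error prefix, not Confluence-specific log formats.

```python
import re

# Matches the standard JVM error prefix and captures the reason, if present.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<reason>.*))?")


def find_oom_lines(lines):
    """Return (line_number, reason) for each line reporting an OutOfMemoryError."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = OOM_PATTERN.search(line)
        if match:
            reason = (match.group("reason") or "").strip()
            hits.append((number, reason))
    return hits
```

Because the function accepts any iterable of lines, an open file handle for a log such as stacktrace.txt can be passed to it directly.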

The Memory RSS view for the past month (outage-1mo.png) shows the memory size slowly creeping up over time.

The same metrics view during the outage is attached as outpage.png.

Attachments:

stacktrace.txt (text/plain)
status.html (text/html)
outpage.png (image/png)
outage-1mo.png (image/png)