Preventing outages: Monitoring your log files through Powershell

If you can monitor your log files for certain errors or warning messages and get a notification if these errors/warnings occur, you might be able to prevent outages that your users would notice. I’m currently using this method to prevent frequent problems I had at a customer:

The issue: IBM Connections is the Intranet homepage for the customer. Sometimes parts of this homepage go blank. Blogs won’t load

Explanation: The customer uses IBM Content Engagement Center for their Intranet homepage. This homepage shows a couple of Blog articles. However, sometimes the Blogs application has trouble committing events to the SQL database, leading to “hung threads”. The log will show messages like these:

[13-3-18 9:38:13:696 CET] 000000f4 ThreadMonitor W   WSVR0605W: Thread "WebContainer : 1" (00000162) has been active for 711627 milliseconds and may be hung.  There is/are 11 thread(s) in total in the server that may be hung.
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:140)
    at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1782)
....
    at org.apache.roller.business.ibatis.IBatisPersistenceStrategy.update(IBatisPersistenceStrategy.java:415)
    at org.apache.roller.business.ibatis.IBatisWeblogManagerImpl.updateWeblogEntryHitcount(IBatisWeblogManagerImpl.java:1424)
    at org.apache.roller.business.runnable.EntryHitCountProcessingJob.processHits(EntryHitCountProcessingJob.java:77)
    at org.apache.roller.business.EntryHitCountQueue.processHit(EntryHitCountQueue.java:116)
    at org.apache.roller.ui.core.util.RequestUtil.processHit(RequestUtil.java:470)

When the total number of hung threads reaches the maximum number of threads available for the blogs application, the application becomes unavailable leading to the white homepage. The only way to free up the threads then is to kill the Java process that the Blogs application runs in. Killing the JVM with the first hung thread, usually means that the threads on the cluster partners are committed and freed up, so killing one JVM is enough.

Solution: We opened a PMR for this problem with IBM. Though multiple fixes have been proposed by IBM and implemented, none have fixed the problem so far.

Workaround: Now we get to the core of this article. As said in the explanation, the Blogs application doesn’t immediately become unavailable when the first hung thread message appears. It takes between 60 – 90 minutes between the first occurence and the Blogs application going blank.  Of course constantly monitoring the logs of all blogs servers would be undoable, but if we would have a script to do this for us, we could intervene before the Blogs application becomes unavailable. Luckily Chad Scott made a script (for us?) that does exactly that. As this customer runs Connections on Windows, the script was written in Powershell. Regretfully, Chad, being American, wrote the script with US date settings in mind. I therefore changed the script and added an extra variable ‘usdateFormat’, which if set to $false, will allow the script to pick up non US dates. Also I made the SMTP port a variable, as ours is 465. Here’s the script:

# ===== Change these as needed =====
$logFileDir = "C:\IBM\WebSphere\AppServer\profiles\AppSrv01\logs\BlogsServer01"
#$logFilePattern = "trace*.log"
$logFilePattern = "SystemOut*.log"
$progressFile = "C:\IBM\scripts\temp\progress.txt"
$matchString = "WSVR0605W"
$smtpServer = "smtp.ims.insim.biz"
$recipientAddress = @("<emailAddress1>", "<emailAddress2>")
$senderAddress = "sam@nn-group.com"
$subject = "New alert at BlogsServer01 at "
$usdateFormat = $false
$smtpPort = "25"
$debug = $true

# ===== Don't change these =====
$lastTimestampProcessed = $null
if ($usdateFormat -eq $true) {
#US date format
$timestampRegex = "([1-9]|1[012])/([0-9]|1[0-9]|2[0-9]|3[01])/([0-9][0-9]) ([0-9]|1[0-9]|2[01234]):([012345][0-9]):([012345][0-9]):([0-9][0-9][0-9])"
} else {
#Non US date format
$timestampRegex = "(0|)([0-9]|1[0-9]|2[0-9]|3[01])\/(0|)([0-9]|1[012])\/([0-9][0-9]) (0[0-9]|1[0-9]|2[01234]):([012345][0-9]):([012345][0-9]):([0-9][0-9][0-9])"
}
$timeFormat = "d/M/yy H:mm:ss:fff"
$sendEmail = $false
$body = $null

function log {
param([string]$text)
Write-Host $text
}

function debug {
param([string]$text)
if ($debug -eq $true) {
Write-Host $text
}
}

# Get the log files
$logFiles = Get-ChildItem "$logFileDir\$logFilePattern"

# Exit if there are no log files
if ( $($logFiles | Measure-Object).Count -eq 0 ) {
log("No log files found matching pattern $logFileDir\$logFilePattern. Exiting")
Exit
}

# Get the timestamp in progress.txt or create the file if it doesn't exist. Save the original value
# in case it needs to be reverted due to error sending alert email. Note that it is a string object here.
if (Test-Path $progressFile) {
debug("Getting last processed timestamp from $progressFile")
$lastTimestampProcessed = $originalLastTimestampProcessed = Get-Content $progressFile
debug($lastTimestampProcessed)
} else {
debug("Creating $progressFile")
$null >> $progressFile
}

# See if there are any occurrences more recent than the last timestamp processed
ForEach ($logFile in $logFiles) {
debug("Processing $logFile")
Select-String $logFile -Pattern $matchString |
ForEach {
$_.Line | Select-String -Pattern $timestampRegex | 
ForEach {
$thisTimestamp = [datetime]::ParseExact($_.Matches[0].Value, $timeFormat, [System.Globalization.CultureInfo]::InvariantCulture)
if ($thisTimestamp -gt $lastTimestampProcessed) {
$lastTimestampProcessed = $thisTimestamp
debug("New most recent timestamp is $lastTimestampProcessed")
Set-Content $progressFile $lastTimestampProcessed.ToString("yyyy/MM/dd HH:mm:ss.fff")
$body = $_.Line
$sendEmail = $true
} else {
debug("Skipping match because timestamp $thisTimestamp is not more recent than $lastTimestampProcessed")
}
}
}
}

# Send an email if necessary. In rare cases the message send will fail due to SMTP issues. In those cases, the progress file needs to be reset
# So any new watched messages that were found will trigger an alert on the next run
if ($sendEmail) {
log("Sending email to $recipientAddress")
Try {
Send-MailMessage -SmtpServer $smtpServer -To $recipientAddress -From $senderAddress -Port $smtpPort -Subject "$subject $lastTimestampProcessed" -Body $body
} Catch {
if ($originalLastTimestampProcessed -eq $null) {
log("Error sending email. Deleting progress file since there was no timestamp from a prior run")
Remove-Item $progressFile
} else {
log("Error sending email. Reverting last timestamp")
Set-Content $progressFile $originalLastTimestampProcessed
}
}
} else {
log("No new alerts found")
}

We used Task scheduler to run the script every 10 minutes. Since this script runs, we haven’t had a single outage of the Blogs application. Obviously, by playing with the matchString, you could use this script for any message in the WebSphere logs (or any other text log for that matter) that you’d like.