OmniCheck: version 6.0.0 (last update 2008/03/31)

Overview
Changes
Installation
Configuration File
Rule File
Errors

Overview

OmniCheck is a Perl script designed to monitor the logfiles of processes (or the direct output of processes), perform regular-expression pattern matches against that data, and take a specified notification action.

OmniCheck was originally implemented on HP/UX 9.05, and has been successfully ported to all current revisions of the following operating systems:

In short, if the operating system can run Perl version 5.004 patch 4 or later, it can run OmniCheck. If you know of a system where this assertion fails, please post a message in the project's Help forum so I can fix whatever is going wrong.

Changes from last version

Installation

Management via iterative process (cron)

  1. Create the installation directory (if not already present), referred to hereafter as {install_dir}:
       mkdir /opt/omnicheck   # or wherever you wish
  2. Create the logs directory (if being used):
       mkdir /opt/omnicheck/logs   # or wherever you wish
  3. Create a configuration file, or copy one of the samples from this site (see Configuration for details):
       cd {install_dir}
       vi configfile
  4. Create the rule file(s) in the directory noted as 'home' in the configuration file (see Rule files for details):
       vi {install_dir}/rules.{nodename}
  5. Begin monitoring in the selected tool (cron):
       /usr/local/bin/perl {install_dir}/omnicheck -F {install_dir}/configfile

Command-Line Options

The command-line options are as follows:
-F or --config-file
Mandatory: provides the filename of the configuration file.
-F /oap/omnicheck.config

-I or --init
Mandatory for manual initialization: causes OmniCheck to establish a baseline byte offset for each file to be monitored. Initialization now occurs automatically on first invocation, so this command-line option is deprecated.

-Z or --zero
Optional, used during manual initialization: causes OmniCheck to set the byte offset for each file to be monitored to zero.

--re_init
Optional: causes OmniCheck to overwrite a previous initialization.

-P or --persistent
Mandatory for daemon execution: causes OmniCheck to enter an event loop in which each pass takes no less than the number of seconds given by the interval configuration file entry (see below). If a pass takes longer than that, the next pass starts as soon as it ends; if time is left over, OmniCheck sleeps for the remainder. To wake it up early, send the process a SIGHUP.

NOTE: -P may not be used with files whose log entries span multiple physical lines (see the IRS configuration entry).
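The timing rule for persistent mode can be illustrated with a small Python sketch. This is not OmniCheck's Perl code, and the names (event_loop, time_left, wake_up) are hypothetical. The sketch shows the following behaviour:
  • Each pass is padded with sleep up to the interval.
  • An overrunning pass is followed immediately by the next one.
  • A SIGHUP ends the sleep early.

```python
import signal
import threading
import time

# Set by the SIGHUP handler so that a sleeping loop wakes up early.
wake_up = threading.Event()


def _on_sighup(signum, frame):
    wake_up.set()


def time_left(pass_started, pass_ended, interval):
    """Seconds left to sleep after a pass; 0 if the pass overran the interval."""
    return max(0.0, interval - (pass_ended - pass_started))


def event_loop(do_pass, interval, max_passes=None):
    """Run do_pass() repeatedly; each pass lasts at least `interval` seconds."""
    # signal.signal() only works in the main thread, and SIGHUP is POSIX-only.
    if hasattr(signal, "SIGHUP") and threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGHUP, _on_sighup)
    passes = 0
    while max_passes is None or passes < max_passes:
        started = time.monotonic()
        do_pass()
        passes += 1
        remaining = time_left(started, time.monotonic(), interval)
        if remaining > 0:
            # Sleep for the rest of the interval unless SIGHUP sets the event.
            wake_up.wait(remaining)
        wake_up.clear()
    return passes
```

For example, with an interval of 300, a pass that takes 100 seconds is followed by up to 200 seconds of sleep. A pass that takes 400 seconds is followed immediately by the next pass.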

-D or --debug value
Optional: sends debugging log statements to OmniCheck's err file. Possible values:
  • debug
  • info
  • notice (default value)
  • warn
  • err
  • crit
  • alert
  • emer

-H or --help
Optional: displays a command-line help screen.

-T or --test
Optional: enables test input mode. In this mode, OmniCheck reads standard input as the content to be monitored; all notifications are simulated and sent to STDOUT. When testing data intended for a single block of a multiple-block configuration, use the -B/--block option (described below).

-B or --block
Optional: test block value. When used with -T/--test (described above), this option will restrict the activity of OmniCheck to the configuration block indicated by this option. If not used during test input mode, the first, or 'main', block will be used.

Configuration

The configuration file contains all the necessary information to properly run an instance of OmniCheck. Below is a list of the available entries:
process
Mandatory: the unique name for this instance of OmniCheck. It is used as part of various filenames. Try to be descriptive of what is being monitored.
process: syslog
process: sapi-apps

home
Mandatory: the directory in which OmniCheck expects to find the rulefiles.
home: /oap
home: /opt/omnicheck

name
Mandatory: the name used when OmniCheck sends mails and pages. The mnemonic NODE can be used to represent the nodename of the host.
name: foobar
name: NODE

file
Mandatory: the file(s) that this instance/block of OmniCheck will monitor. This value can be any of the following:
  • a single file
  • a path to an executable program or script (prefaced with #!)
  • a fileglob matching multiple files
  • a filelist containing multiple files (prefaced with @)
file: /usr/adm/syslog/syslog.log
file: #!/bin/df -k
file: /oap/logs/*.err
file: @/opt/omnicheck/file.list
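The four kinds of 'file' value can be told apart by their leading characters. Below is a minimal Python sketch under one assumption: any value containing the shell glob characters *, ? or [ is treated as a fileglob. The function name classify_file_entry is hypothetical.

```python
def classify_file_entry(value):
    """Classify a 'file:' value as a program, filelist, glob, or single file."""
    value = value.strip()
    if value.startswith("#!"):
        # Executable whose output is monitored, e.g. "#!/bin/df -k".
        return ("program", value[2:].strip())
    if value.startswith("@"):
        # Filelist: a file that names the files to monitor.
        return ("filelist", value[1:].strip())
    if any(ch in value for ch in "*?["):
        # Shell-style fileglob matching multiple files.
        return ("glob", value)
    return ("file", value)
```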

oldfile
Recommended: the filename of the previous rotation of a single file listed in the 'file' configuration entry. OmniCheck will parse any data that was written to this file after the last run but before the file was rotated. This feature only works when a single file is being monitored in the block.
oldfile: /usr/adm/oSYSLOG
oldfile: /oap/logs/.old/process-a.err
oldfile: #!/oap/calc_oldfile_name.sh
See the section on integrating oldfiles into filelists (below) for documentation on specifying oldfiles for files within a filelist.
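How an oldfile fits together with the saved byte offset (the tellfile) can be sketched in Python. This is an illustration under stated assumptions, not OmniCheck's algorithm:
  • A rotation is detected when the current file is smaller than the saved offset.
  • When that happens, the unread tail of the oldfile is read before the new file.
  • Compressed oldfiles are read with Python's gzip module, standing in for the external gzip binary named in the 'gzip' entry.
The function name read_new_data is hypothetical.

```python
import gzip
import os


def read_new_data(path, oldfile, offset):
    """Return (new_bytes, new_offset) for a monitored file and its saved byte offset.

    Assumption: rotation is detected when the current file is smaller than
    the saved offset; the unread tail of the rotated oldfile is read first.
    """
    chunks = []
    if os.path.getsize(path) < offset:
        if oldfile and os.path.exists(oldfile):
            # Python's gzip module stands in for the external gzip binary here.
            opener = gzip.open if oldfile.endswith(".gz") else open
            with opener(oldfile, "rb") as fh:
                fh.seek(offset)  # the offset counts uncompressed bytes
                chunks.append(fh.read())
        offset = 0
    with open(path, "rb") as fh:
        fh.seek(offset)
        chunks.append(fh.read())
        offset = fh.tell()
    return b"".join(chunks), offset
```

The returned offset is what would be written back to the tellfile for the next run.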

gzip
Mandatory if 'oldfiles' are compressed with GNU zip: the full pathname of the gzip binary, which OmniCheck uses to read the compressed 'oldfiles'.
gzip: /usr/bin/gzip

tmpdir
Optional: the directory OmniCheck uses to store working files, such as the tellfile, spoolfile and lockfile. If "tmpdir" is not set in the configuration file, the most OS-appropriate directory will be used.

NOTE: Do not set 'tmpdir' to /tmp on Solaris systems, as that filesystem is cleared on reboot, erasing your tellfiles.

tmpdir: /usr/tmp
tmpdir: /oap

block
Optional: the name used when you want to monitor one file differently than another. The first block is always the "main" block, and the configuration file entries "process", "home", and "tmpdir" in the "main" block are copied into each subsequent block.
block: MAILDB_M01A_SYB
block: dsl_reg_db
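Splitting a configuration into blocks, and copying 'process', 'home' and 'tmpdir' from the 'main' block into later ones, can be sketched in Python. The sketch makes the following assumptions:
  • Lines before the first 'block:' entry belong to 'main'.
  • A value a block sets itself is kept rather than overwritten.
  • Repeated 'rules' lines are appended.
  • #include directives are not processed.
The name parse_blocks is hypothetical.

```python
INHERITED = ("process", "home", "tmpdir")  # copied from 'main' into later blocks


def parse_blocks(text):
    """Split 'key: value' configuration text into blocks (a simplified sketch)."""
    blocks = {"main": {}}
    order = ["main"]
    current = blocks["main"]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, comment, or #include directive (not handled here)
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "block":
            if value not in blocks:
                blocks[value] = {}
                order.append(value)
            current = blocks[value]
        elif key == "rules" and key in current:
            current[key] += " " + value  # rulefiles may be listed on several lines
        else:
            current[key] = value
    main = blocks["main"]
    for name in order[1:]:
        for key in INHERITED:
            if key in main:
                # Assumption: a block's own value, if present, is kept.
                blocks[name].setdefault(key, main[key])
    return blocks
```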

rules
Mandatory: the filename(s) containing the patterns and associated actions (rules). One or more rulefiles can be listed on a single 'rules' entry, or each rulefile can be listed on its own line; either way, the files are read in the order they appear in the configuration file, and the rules within them are processed in the same order. Rulefiles can be referenced either relative to the home directory or by absolute path:
# relative to the 'home' directory

rules: rules.local rules.group rules.global
rules: rules.process_1 rules.process_group rules.all_processes

# absolute

rules: /opt/omnicheck/special/rules.special
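Resolving a 'rules' value into paths while keeping the listed order can be sketched in a few lines of Python. POSIX path rules are assumed, and the name rulefile_paths is hypothetical.

```python
import posixpath


def rulefile_paths(rules_value, home):
    """Resolve the 'rules' entry into paths, preserving the listed order."""
    return [
        name if posixpath.isabs(name) else posixpath.join(home, name)
        for name in rules_value.split()
    ]
```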

logs
Optional: the directory to which OmniCheck will write its omnicheck.err and omnicheck.out log files. If not defined, the home directory will be used.
logs: /oap/logs
logs: /usr/local/omnicheck/logs

out
Optional: the filename to use for STDOUT of OmniCheck. Unix date(1)-style mnemonics can be added to provide an auto-rotation feature. The default value is omnicheck.out.
out: omnicheck_%Y%m%d.out
out: omnicheck_%H:%M:%S.out

err
Optional: the filename to use for STDERR of OmniCheck. Unix date(1)-style mnemonics can be added to provide an auto-rotation feature. The default value is omnicheck.err.
err: omnicheck_%Y-%m-%d.err
err: omnicheck_%H:%M:%S.err
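The date(1)-style mnemonics behave like strftime(3) format codes, so the rotated filename can be shown with a short Python sketch. The function name log_path is hypothetical, and joining onto the 'logs' directory follows the 'logs' entry above.

```python
import posixpath
from datetime import datetime


def log_path(logs_dir, name_pattern, when=None):
    """Expand date-style mnemonics in an 'out'/'err' filename and join it to logs_dir."""
    when = when or datetime.now()
    return posixpath.join(logs_dir, when.strftime(name_pattern))
```

With %Y%m%d a new file begins each day. With %H:%M:%S a new file begins every second, which is probably more rotation than most sites want.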

debug
Optional: the level of messages to generate in the omnicheck.err file. Messages at the chosen level and at all more severe levels are written; for example, 'crit' also includes 'alert' and 'emer'.
# highest level = most log entries

debug: debug 
debug: info

# notice is default production level

debug: notice 
debug: warn
debug: err 
debug: crit
debug: alert
debug: emer
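The level filtering amounts to comparing positions in the ordered list. In the Python sketch below, the ordering comes from the list above, while the function name should_log is hypothetical.

```python
# Ordered from most verbose to most severe, as in the list above.
LEVELS = ["debug", "info", "notice", "warn", "err", "crit", "alert", "emer"]


def should_log(message_level, configured="notice"):
    """True if a message at message_level is written under the configured debug level."""
    return LEVELS.index(message_level) >= LEVELS.index(configured)
```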

interval
Mandatory for running in persistent (daemon) mode: the minimum length of time, in seconds, that OmniCheck spends in each pass of its event loop. If a pass takes longer than interval seconds, the next pass starts immediately after it ends.
interval: 300

IRS
Optional: defines an input record separator (IRS) for logs that consist of multiple lines. When this configuration entry is used, it should contain a regex matching the beginning of a new multiple-line log entry. OmniCheck will match each line of the new content for this pattern, and if found, will begin a new record with this line. Note: the multiple-line records will contain newlines.
IRS: ^\d\d\d\d-\d\d-\d\d-\d\d\.\d\d\.\d\d\.\d+
IRS: ^[A-Z] [A-Z][a-z][a-z] [ \d]\d \d\d:\d\d:\d\d \d\d\d\d 

An optional :trim tag can be appended to the IRS entry to match a set of lines, trim off the matching part of each (such as a date/time stamp), and concatenate the remainders into one line.

IRS: ^\d\d\d\d-\d\d-\d\d-\d\d\.\d\d\.\d\d\.\d+ :trim
IRS: ^[A-Z] [A-Z][a-z][a-z] [ \d]\d \d\d:\d\d:\d\d \d\d\d\d :trim

NOTE: IRS may not be used in persistent mode.
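Grouping lines into records with an IRS pattern (without :trim) can be sketched in Python as follows. The sketch relies on Python's re module accepting the simple Perl-style patterns shown above. It keeps any lines seen before the first match in the first record, whereas OmniCheck may warn about such lines instead. The name split_records is hypothetical.

```python
import re


def split_records(lines, irs):
    """Group lines into multi-line records; a line matching irs starts a new record.

    Simplification: lines seen before the first match stay in the first record.
    """
    start = re.compile(irs)
    records, current = [], []
    for line in lines:
        if start.search(line) and current:
            records.append("\n".join(current))  # records keep their newlines
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    return records
```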

Flags to control OmniCheck's function:

production
Recommended: tells OmniCheck whether the monitored file(s) is 'in production' or not. A true value would be 'on', 'yes', or '1', whereas a false value would be 'off', 'no', or '0' (zero). When the production value is false, pages are downgraded to mails, and actions that would contact the oncall now contact the admin.
production: no  # or 0 or off
production: yes # or 1 or on

farm
Recommended: tells OmniCheck whether the monitored file(s) is part of a redundant set of objects or not.

The theory is that a single component of a farm can endure a failure without adversely impacting the farm as a whole.

See 'production' above for true and false values. When the farm value is true, the effect is the same as if the production value were false.

farm: no  # or 0 or off
farm: yes # or 1 or on

maint
Recommended: tells OmniCheck whether the monitored file(s) is under maintenance work or not. See 'production' above for true and false values. When the maint value is true, no notifications will be sent.
maint: no  # or 0 or off
maint: yes # or 1 or on
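The combined effect of production, farm and maint on a single action can be sketched in Python. The rules come from the three entries above, with these assumptions:
  • Only the first word of each value is read, so trailing comments are ignored.
  • The comparison is case-insensitive.
The names flag and effective_action are hypothetical.

```python
TRUE_VALUES = {"on", "yes", "1"}
FALSE_VALUES = {"off", "no", "0"}


def flag(value):
    """Interpret a production/farm/maint value; only the first word is used."""
    word = value.split()[0].lower()  # case-insensitivity is an assumption
    if word in TRUE_VALUES:
        return True
    if word in FALSE_VALUES:
        return False
    raise ValueError(f"not a true/false value: {value!r}")


def effective_action(action, recipient, production, farm, maint):
    """Apply the maint, production, and farm rules; None means nothing is sent."""
    if maint:
        return None
    if not production or farm:
        # Rule downgraded: pages become mails, oncall becomes admin.
        if action == "page":
            action = "mail"
        if recipient == "oncall":
            recipient = "admin"
    return action, recipient
```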

quiet
Optional: tells OmniCheck when to be 'quiet', and not send any alerts. It follows the structure of crontab to provide values for the minute, hour, day, month, and day-of-week. Any trailing values not assigned are assumed to match all possible values (*).
quiet: * 15-19 * * *   # no alerts between 3:00pm and 7:59pm
quiet: * 15-19         # same as above
quiet: * * * * 0,6     # no alerts on Saturday or Sunday
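A 'quiet' window can be checked against a timestamp with a small Python sketch of crontab-style matching. The sketch makes these assumptions:
  • Only '*', single numbers, ranges (a-b) and comma lists are supported; step values such as */5 are not.
  • Day-of-week uses crontab numbering, with Sunday = 0.
The name is_quiet is hypothetical.

```python
from datetime import datetime


def _field_matches(spec, value):
    """Match one crontab-style field: '*', a number, a range 'a-b', or a comma list."""
    if spec == "*":
        return True
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def is_quiet(spec, when):
    """True if `when` (a datetime) falls inside a 'quiet:' window."""
    fields = spec.split("#", 1)[0].split()  # drop a trailing comment
    fields += ["*"] * (5 - len(fields))     # unassigned trailing fields match all
    values = [when.minute, when.hour, when.day, when.month,
              when.isoweekday() % 7]        # crontab numbering: Sunday = 0
    return all(_field_matches(f, v) for f, v in zip(fields, values))
```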

Required Entries for sending mail or pages

smtphost
the name of the host handling SMTP traffic for your site, or the path to an SMTP-capable binary on your system. Any options the binary needs must be included.
smtphost: localhost
smtphost: relay.mail.here.com
smtphost: /usr/lib/sendmail -t
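The two delivery routes (a local sendmail-style program, or an SMTP connection on port 25) can be sketched with Python's standard library. The sketch assumes that a value starting with '/' is a program and anything else is a host name. The names mail_transport and send_mail are hypothetical.

```python
import shlex
import smtplib
import subprocess
from email.message import EmailMessage


def mail_transport(smtphost):
    """'program' for a binary path such as '/usr/lib/sendmail -t', else 'port'."""
    return "program" if smtphost.startswith("/") else "port"


def send_mail(smtphost, sender, recipient, subject, body):
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    if mail_transport(smtphost) == "program":
        # Pipe the whole message, headers included, to the binary (e.g. sendmail -t).
        subprocess.run(shlex.split(smtphost), input=msg.as_bytes(), check=True)
    else:
        # Connect to the SMTP daemon on port 25 of the named host.
        with smtplib.SMTP(smtphost, 25) as smtp:
            smtp.send_message(msg)
```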

pagerhost
the name of the host that will translate an email into a pager message.
pagerhost: pager.foo.com
pagerhost: page.mail.here.com

admin
the name of the administrator for the file/system being monitored. Valid values are a bare Unix username, a fully-formed email address, a simple file containing either a username or an email address, or an executable script/program whose output is either a username or an email address. The value of this entry replaces action references to 'admin'.
admin: jblow
admin: jblow@here.com
admin: /oap/omnicheck_admin
admin: #!/opt/omnicheck/get_admin.sh

oncall
the name of the oncall personnel for the file/system being monitored. Valid values are a bare Unix username, a fully-formed email address, a simple file containing either a username or an email address, or an executable script/program whose output is either a username or an email address. The value of this entry replaces action references to 'oncall'.
oncall: jblow
oncall: jblow@here.com
oncall: /oap/omnicheck_oncall
oncall: #!/opt/omnicheck/get_oncall.sh
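Resolving an 'admin' or 'oncall' value into a recipient can be sketched in Python as follows. The sketch assumes a value is read as a file only when such a file exists; otherwise it is used as-is, whether a username or an email address. The name resolve_recipient is hypothetical.

```python
import os
import shlex
import subprocess


def resolve_recipient(value):
    """Turn an admin/oncall value into a username or email address."""
    value = value.strip()
    if value.startswith("#!"):
        # Executable script/program: use its output.
        result = subprocess.run(shlex.split(value[2:]), capture_output=True,
                                text=True, check=True)
        return result.stdout.strip()
    if os.path.isfile(value):
        # Simple file containing a username or email address.
        with open(value) as fh:
            return fh.read().strip()
    # Bare username or fully-formed email address, used as-is.
    return value
```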

organization
Optional: identifies the group to which an instance of OmniCheck belongs, and can be used to invoke different actions within a single rule (see the organization-sensitive actions under Rule files).
organization: QA_Team
organization: NorthAm.Prod
organization: Foobar

fqdn
May be mandatory: Certain Unix-based architectures do not provide proper hostname identification (you know who you are). For those systems, you can provide a name to use for mail and page events.
fqdn: foobar.db.foo.com
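The hostname lookup that 'fqdn' overrides can be sketched in Python with the socket module. The sketch resolves the hostname to an address, then the address back to a name. This mirrors the gethostbyname()/gethostbyaddr() warnings listed under Errors, but the fallback to the bare hostname on failure is an assumption. The name resolve_fqdn is hypothetical.

```python
import socket


def resolve_fqdn(configured=None):
    """Return the configured fqdn, else try to derive one from the hostname."""
    if configured:
        return configured
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
        name, aliases, _ = socket.gethostbyaddr(address)
    except OSError:
        # Resolution failed; fall back to the bare hostname.
        return hostname
    # Prefer a dotted (fully-qualified) name among the canonical name and aliases.
    for candidate in [name] + aliases:
        if "." in candidate:
            return candidate
    return hostname
```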

Method of integrating oldfiles into filelists

Configuration data for 'oldfiles' can be added to the contents of a filelist (a 'file' configuration value that starts with an at-sign, @): on each line, follow the monitored file with a tab and then its oldfile. Only single files can have their oldfiles specified in this manner.
syslog.log{tab}syslog.log.0
syslog.log{tab}syslog.log.gz
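Reading such a filelist into (file, oldfile) pairs can be sketched in Python. Blank lines are skipped, and a line without a tab has no oldfile. The name parse_filelist is hypothetical.

```python
def parse_filelist(lines):
    """Return (file, oldfile-or-None) pairs from filelist lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        name, _, oldfile = line.partition("\t")
        entries.append((name.strip(), oldfile.strip() or None))
    return entries
```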

Rule files

Rule files are the core of OmniCheck: they provide the patterns to use, and the actions to take when a pattern matches the data being monitored. Rules must be separated from one another by one or more blank lines, and each rule has two parts: the pattern and the actions.
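The rule layout (a pattern line, then action lines, ended by a blank line or end of file) can be sketched as a Python parser. Comments and other special lines that a real rulefile may contain are not handled. The name parse_rules is hypothetical.

```python
def parse_rules(text):
    """Split rulefile text into (pattern, [actions]) pairs.

    A rule is a pattern line followed by action lines; a blank line or the
    end of the file completes the rule.
    """
    rules, current = [], []
    for line in text.splitlines() + [""]:  # sentinel: end of file completes the last rule
        if line.strip():
            current.append(line.rstrip())
        elif current:
            rules.append((current[0], current[1:]))
            current = []
    return rules
```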

Patterns

The pattern follows Perl's regular expression syntax, with some additional features. The following are in order of precedence.
The patterns must follow proper Perl regular expression syntax. Any occurrence of these special characters in the data to monitor must be escaped with a backslash (\) in the pattern. Any pattern with non-escaped special characters will be considered corrupt, will not be used by OmniCheck, and will be noted in the report file (if enabled) and/or in the debug output (if enabled).
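Screening out corrupt patterns amounts to compiling each one and reporting failures. The Python sketch below uses the re module as a stand-in for Perl's regex engine; the two dialects are close but not identical. The report text mimics the "Corrupt pattern" report message listed under Errors. The name load_patterns is hypothetical.

```python
import re


def load_patterns(patterns, report):
    """Compile each pattern; record and skip any that fail to compile."""
    usable = []
    for p in patterns:
        try:
            usable.append(re.compile(p))
        except re.error:
            report.append(f"Corrupt pattern: /{p}/")
    return usable
```

When literal text is needed inside a pattern, Python's re.escape() shows which characters need a backslash.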

Actions

The actions list what to do when a pattern matches. The available actions are:

Altering when actions act

OmniCheck can be instructed to take a specified action only if a specific number of lines match the pattern. This is known as a threshold within OmniCheck, and its syntax is:
if >= 10 mail admin ; test messages

The valid relations are:

There must be a space after the word if and after the numeric value. Space between the relation and numeric value is optional.
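Parsing and evaluating a threshold line can be sketched in Python. The spacing rules follow the text above: a space is required after if and after the number, and is optional between the relation and the number. The set of relations is an assumption, since the list of valid relations is not reproduced here; Perl-style numeric comparisons are used. The names parse_threshold and threshold_met are hypothetical.

```python
import operator
import re

# Assumed relation set; the documented list of valid relations is not shown here.
RELATIONS = {">=": operator.ge, ">": operator.gt, "<=": operator.le,
             "<": operator.lt, "==": operator.eq, "!=": operator.ne}

# "if", space, relation, optional space, number, space, action, recipient ; subject
THRESHOLD_RE = re.compile(r"^if (>=|<=|==|!=|>|<) ?(\d+) (\w+) (\S+)\s*;\s*(.*)$")


def parse_threshold(line):
    """Return (relation, number, action, recipient, subject), or None if malformed."""
    m = THRESHOLD_RE.match(line.strip())
    if not m:
        return None
    rel, num, action, recipient, subject = m.groups()
    return RELATIONS[rel], int(num), action, recipient, subject


def threshold_met(parsed, matching_lines):
    relation, number = parsed[0], parsed[1]
    return relation(matching_lines, number)
```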

Actions can be coded to activate only when a specific organization is using the rule file. This feature tests the 'organization' configuration file entry against the condition in the rule file. In the following example, the FOO organization will get a "host issue" mail, the BAR organization will get a "fix me" page to its oncall, and everyone else will ignore the pattern:

<pattern>
if org eq "FOO" mail admin-team ; host issue
elsif org eq 'BAR' page oncall ; fix me
else ignore admin ; not important
endif
Use either single (') or double (") quotes to surround the organization value within the rule.

If there are actions outside, or after, an if block, they will always take effect. In this example, only instances used by the FOO organization will get the "host issue" mail, but all instances will send the "fix tomorrow" mail:

<pattern>
if org eq "FOO" mail admin-team ; host issue
endif
mail admin ; fix tomorrow
NOTE: Organization-sensitive actions and thresholded actions are currently mutually exclusive, i.e., you cannot do this:
if (org eq "FOO" && >= 10) mail admin ; FOO and over 10 
if (>= 10 || org eq "BAR") mail admin ; over 10 or BAR
If this feature is requested, the effort will be made to work out the parsing logic. For now, however, you can use one or the other.
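Selecting the actions that apply to one organization can be sketched in Python. The sketch follows the two examples above: the first matching branch of an if/elsif/else block wins, and actions outside a block always apply. It handles only the org eq "..." conditions shown, with no nesting. The name select_actions is hypothetical.

```python
import re

# if/elsif, "org eq", a single- or double-quoted value, then the action.
COND_RE = re.compile(r"""^(if|elsif) org eq (["'])(.*?)\2 (.+)$""")


def select_actions(action_lines, org):
    """Return the action lines that apply when 'organization' is org."""
    selected = []
    in_block = False  # inside an if ... endif block
    taken = False     # a branch of the current block has already matched
    for line in action_lines:
        line = line.strip()
        m = COND_RE.match(line)
        if m:
            keyword, _, value, action = m.groups()
            if keyword == "if":
                in_block, taken = True, False
            if not taken and org == value:
                selected.append(action)
                taken = True
        elif in_block and line.startswith("else "):
            if not taken:
                selected.append(line[len("else "):])
                taken = True
        elif line == "endif":
            in_block, taken = False, False
        else:
            selected.append(line)  # outside any block: always applies
    return selected
```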

Pattern-action interaction

OmniCheck can capture sections of the lines that match the pattern and use them in the actions as pieces of the subject of a mail. Parentheses are used to surround the section of pattern to capture, then number variables ($1, $2, etc.) are used to insert the captured values into the action:
ftpd\[\d+\]: FTP LOGIN FROM (\d+\.\d+\.\d+\.\d+) as (\w+)
mail admin ; FTP from $1 as $2

Note: pattern-action interactions are now functional, so that patterns like this:

Error: (\w+) ... Description: "([^"]+)"
will now capture the data within both sets of parentheses and provide it as expected.
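Filling $1, $2 and so on from a match can be sketched in Python. The sketch leaves a reference to a nonexistent group untouched, and that behaviour is an assumption. The name fill_captures is hypothetical.

```python
import re


def fill_captures(template, match):
    """Replace $1, $2, ... in an action subject with the pattern's captured groups."""
    def substitute(ref):
        n = int(ref.group(1))
        if n > match.re.groups:
            return ref.group(0)  # no such group: leave the reference as written
        return match.group(n) or ""
    return re.sub(r"\$(\d+)", substitute, template)
```

Applied to the ftpd example, a line such as "ftpd[123]: FTP LOGIN FROM 10.0.0.1 as anonymous" turns "FTP from $1 as $2" into "FTP from 10.0.0.1 as anonymous".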

Thresholding Use

To use the thresholding feature, you need to tag the pattern with a label. The label must be two or more alphanumeric characters long, the first being alphabetic (think variable name), and is followed by a double pound sign (##), similar to the pattern expiration feature.

Then, you need to add a threshold to your actions:

Alpha##foobar
if 30/day mail admin ; lots of foobar
if over 30/day mail admin ; lots of foobar
if under 30/day mail admin ; not enough foobar

On each iteration, OmniCheck stores the number of matches for every tagged pattern in a .thresh file in the tmpdir directory, timestamped to when the matches occurred. The .thresh file is kept manageable by trimming entries older than twice the largest threshold time value found in the rule files. When the control word 'over' is used, the action is invoked if the number of matches for the pattern (including those in the current iteration) is greater than or equal to the quantity per time unit specified in the action; otherwise, it is not.

When the control word 'under' is used, the action is invoked if the number of matches for the pattern (including those in the current iteration) is less than or equal to the quantity per time unit specified in the action; otherwise, it is not.

The default control is 'over'.
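The counting logic above can be sketched in Python with an in-memory history in place of the .thresh file. The sketch makes these assumptions:
  • The time-unit names and lengths (min, hour, day) are hypothetical, since the list of available units is not reproduced here.
  • Only the '[over|under] N/unit' part of an action is parsed.
The names parse_spec and ThresholdHistory are hypothetical.

```python
import re

# Hypothetical unit names and lengths in seconds.
UNIT_SECONDS = {"min": 60, "hour": 3600, "day": 86400}

SPEC_RE = re.compile(r"^(?:(over|under) )?(\d+)/(\w+)$")


def parse_spec(spec):
    """Parse '[over|under] N/unit' into (control, quantity, window_seconds)."""
    m = SPEC_RE.match(spec.strip())
    if not m:
        raise ValueError(f"bad threshold: {spec!r}")
    control, quantity, unit = m.groups()
    return control or "over", int(quantity), UNIT_SECONDS[unit]  # 'over' is the default


class ThresholdHistory:
    """In-memory stand-in for the .thresh file: match timestamps per label."""

    def __init__(self):
        self.hits = {}

    def record(self, label, count, now):
        self.hits.setdefault(label, []).extend([now] * count)

    def trim(self, max_window, now):
        # Drop entries older than twice the largest threshold window.
        cutoff = now - 2 * max_window
        for label, stamps in self.hits.items():
            self.hits[label] = [t for t in stamps if t >= cutoff]

    def triggered(self, label, spec, now):
        control, quantity, window = parse_spec(spec)
        count = sum(1 for t in self.hits.get(label, []) if t > now - window)
        return count >= quantity if control == "over" else count <= quantity
```

For example, 25 matches followed an hour later by 5 more meet "30/day", because the default control 'over' compares with greater-than-or-equal.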

Labels can contain variables that are filled in with captured sections from the pattern, just as in actions.

Alpha_$1##foo: (\w+)
if 30/day mail admin ; lots of foobar
if over 30/day mail admin ; lots of foobar
if under 30/day mail admin ; lots of foobar
The available time units are:

Errors and Messages

Here is a list of log entries that may occur in omnicheck.err, depending on the value of the debug configuration file directive:

Critical Errors (crit)

OmniCheck cannot run without -F option...exiting
OmniCheck requires a configuration file to run, and that file's name is provided via the -F command-line option. Failing to provide that command-line option will cause OmniCheck to terminate.
cannot open configuration file {string}: {string}
OmniCheck must be able to read its configuration file. Check the ownership and/or the permissions on the file and restart.
cannot open tellfile {string} for writing ... exiting
The tellfile cannot be opened for writing by the user-id running OmniCheck. Check the permissions on the tellfile and restart.
cannot use persistent mode with multi-line records... exiting
An attempt was made to analyze multi-line records in persistent (daemon) mode; this combination was designed out of OmniCheck to solve a memory leak. To monitor multi-line records, run OmniCheck in iterative (cron) mode.
tellfile {string} not owned by user
The tellfile is not owned by the user-id running OmniCheck (it is likely owned by 'root'). Change the ownership of the tellfile and restart.
previous OmniCheck instance detected
Only one instance of OmniCheck using the same configuration file can be active at any one time. If a second instance is started, it will detect the running instance and terminate.
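One common way to enforce a single instance is an exclusively created lockfile, and the tmpdir entry mentions a lockfile among OmniCheck's working files. The Python sketch below shows that general technique under stated assumptions, not OmniCheck's implementation:
  • Creating the file with O_CREAT | O_EXCL fails if another instance already holds the lock.
  • Stale locks left by a crashed process are not handled.
The names acquire_lock and release_lock are hypothetical.

```python
import os


def acquire_lock(lockfile):
    """Create the lockfile exclusively; return False if another instance holds it."""
    try:
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(f"{os.getpid()}\n")  # record the PID of the running instance
    return True


def release_lock(lockfile):
    try:
        os.remove(lockfile)
    except FileNotFoundError:
        pass
```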

Error Messages (err)

already initialized: use --re_init
This error will occur when attempting to initialize a previously-initialized instance of OmniCheck. As the message states, to do this requires the --re_init command-line option.
cannot open new tellfile {string} for writing ... exiting
OmniCheck creates a new tellfile when it updates the information to track the byte offsets of the files it monitors. This message indicates that the directory permissions and/or ownership does not allow the user-id running OmniCheck to create this file. Check the directory permissions and/or ownership and restart.
cannot open tellfile {string} for reading
This message indicates that the file permissions and/or ownership for the tellfile do not allow the user-id running OmniCheck to open the file for reading. Check the file permissions and/or ownership and restart.
cannot open tellfile {string} for writing
This message indicates that the file permissions and/or ownership for the tellfile do not allow the user-id running OmniCheck to open the file for writing. Check the file permissions and/or ownership and restart.
could not connect to {string} on port 25 ... mail not sent
This message indicates that an attempt to connect to a sendmail daemon on port 25 failed, and the mail or page event that was going to use that connection will fail. Check the status of the sendmail daemon and re-attempt.

Informational Messages (info)

{string} appended with {string}
Appended a pre-existing configuration directive with additional data.
action {string}
Found an action command within a rule.
block {string}
Changed to a new block within the configuration.
ext_pattern /{string}/
Found an extended pattern.
found #include file {string}
Found an external file to be read inline into the configuration.
found new rule
Found the start of a new rule.
pass 1
First pass through the configuration: processing #include and #!include directives.
pass 2
Second pass through the configuration: processing directives into their appropriate blocks.
pass 3
Third pass through the configuration: processing globs, filelists, file content substitutions, file execution substitutions, and mnemonic (OS and NODE) substitutions.
pass 4
Fourth and final pass through the configuration: trickle directives from 'main' block to other blocks.
pattern /{string}/
Found the pattern for the rule.
reading config file {string}
Reading the configuration file.
rule complete
Reached the end of a rule (blank line or rulefile EOF).
{string} replaced with {string}
Replaced a pre-existing configuration directive with new data.

Notices (notice)

DST period: not using cache
This message indicates that at this particular time of the year, Daylight Saving Time may or may not be in effect, so any cached value for the appropriate timezone offset will not be used.
block processing pattern {string}
This message indicates that OmniCheck is in a multi-line log processing mode, and that {string} is what separates one log entry from another.
line processing
This message indicates that OmniCheck is in a single-line log processing mode.
sending mail via local sendmail program
This message indicates that mail from OmniCheck will be sent using the local sendmail program.
sending mail via sendmail port
This message indicates that mail from OmniCheck will be sent using a sendmail daemon, running either on the local system or elsewhere.
cached TZ offset: {string}
This message indicates that the cached Daylight Saving Time timezone offset will be used.

Warnings (warn)

exec action cannot run {string}
The external program associated with an 'exec' action cannot be executed, either because it does not have its execute bit set for the user-id running OmniCheck, or for some other reason. Check the permissions on the external program.
cannot run {string} to read output
The external program associated with a text feed for monitoring cannot be executed, either because it does not have its execute bit set for the user-id running OmniCheck, or for some other reason. Check the permissions on the external program.
#!include failure: cannot execute {string} ... ignored
This message indicates that an external program was intended to be used to provide a portion of the OmniCheck configuration file, but was not executable. Check the permissions and/or ownership of the external program.
#include failure: cannot read {string} ... ignored
This message indicates that an external file was intended to be used to provide a portion of the OmniCheck configuration file, but was not readable. Check the permissions and/or ownership of the external file.
{string} cannot be read: setting {string}:{string} to null
This message indicates that an external file was intended to be used to provide a list of files to be monitored, but was not readable. Check the permissions and/or ownership of the external file.
Couldn't resolve {string}: {string}
This message indicates that the IP address of the system could not be resolved using the Perl gethostbyname() routine. This is used in a routine to obtain the fully-qualified domain name of the system from the hostname.
Couldn't re-resolve {string}: {string}
This message indicates that the fully-qualified domain name of the system could not be resolved using the Perl gethostbyaddr() routine.
{string} cannot be executed: setting {string}:{string} to null
This message indicates that an external program was intended to be used to provide a value for a configuration file directive, but was not executable. Check the permissions and/or ownership of the external program.
cannot open rulefile {string}: {string}
This message indicates that a rulefile could not be opened for reading. Check the permissions and/or ownership of the rulefile.
should never be here - {string} !~ {string}
This message indicates that a line in a monitored file with multi-line log entries does not have the appropriate prefacing pattern, like the header of a Sybase log entry.
cannot open monitored file {string}: {string}
This message indicates that OmniCheck cannot open a monitored file for reading. Check the permissions and/or ownership of the monitored file.

Report Messages

These messages will occur in the omnicheck.out file as they occur during the normal operation of OmniCheck:
Conflict: /{string}/
pre-empts /{string}/
This two-line report message indicates that one rule pattern will match all of the lines that are intended to be matched by a later pattern. The solution is to move the rule with the second pattern to a point in the rulefile before the rule with the first pattern.
Corrupt pattern: /{string}/
This report message indicates that a pattern contains some sort of illegal character that will cause it to fail when used as a pattern. The solution is to correct the error in the pattern.
OmniCheck initialized for application {string}
This report message indicates that OmniCheck has been initialized successfully.
No pattern conflicts detected
This report message indicates that no one pattern will match log entries intended to be matched by a later pattern.
Pattern overlap analysis
This report message indicates the start of the pattern overlap analysis, determining whether one pattern will match log entries intended to be matched by a later pattern.
ignored
This report message indicates that the log entries matched by the current pattern will be ignored.
maintenance mode - no actions taken
This report message indicates that the current block of this instance of OmniCheck is in a maintenance mode, and will not send any notifications (mail, page, file, or exec).
pattern {string} contains metachar
This report message indicates that a pattern contains some sort of illegal character that will cause it to fail when used as a pattern. The solution is to correct the error in the pattern.
rule downgraded
This report message indicates that based on the state of either the production or farm configuration directives within the current block, the actions and recipients within the current rule will be downgraded from page to mail, and from notifying the oncall to notifying the administrator.
threshhold not met for {string}
This report message indicates that a threshold of a certain number of matching log entries was configured for the current rule, and that the number of log entries matched by this pattern at this time did not meet or exceed that number.
{number} log entries in {string}
This report message indicates how many new log entries were found in the monitored file.
sent page to {string} (title: {string})
This report message indicates that a page message was sent.
sent mail to {string} (title: {string}
This report message indicates that a mail message was sent.
executed '{string}'
This report message indicates that log entries were sent to an external program as input.
filed data to {string}
This report message indicates that log entries were appended to an external file.
found {number} entries matching /{string}/
This report message indicates that a number of log entries were found in the monitored file that matched the pattern of the current rule.