Agent (10.2.0.4) crashes often and intermittently on a 64-bit site with many databases (10.2.0.3): too many open files, emagent.trc gives ‘health check’ error

In the past, the agent gave messages like this:

Number files opened by Agent is 1140.
Most of these files turned out to be $ORACLE_HOME/dbs/hc<instance>_.dat, which the agent had loaded into memory many times over.

I checked this by first finding the process ID of the agent:
–> ps -ef |grep emagent
Then I looked at the memory map of the agent process (here, for example, 23645):
–> pmap -x 23645
That turned out to be a very, very long list, so I hit the jackpot…
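
To put a number on that "very long list", the check above can be sketched as two small shell functions. This is only a sketch: the function names are made up for illustration, the pattern assumes the health check files start with "hc" and end in ".dat" as seen above, and the /proc lookup assumes Linux.

```shell
#!/bin/sh
# Count health check file mappings in pmap output read from stdin.
# Assumption: the mapped file name starts with "hc" and ends in ".dat".
count_hc_mappings() {
    # grep -c exits non-zero when it counts 0 lines, hence the "|| true"
    grep -Ec '(^|[ /])hc[^/ ]*\.dat$' || true
}

# Count the open file descriptors of a process via /proc (Linux only).
count_open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Usage, run as the owner of the agent (23645 is the example PID from above):
#   pmap -x 23645 | count_hc_mappings
#   count_open_fds 23645
```

If the first number is in the hundreds while there are only a handful of instances on the machine, the same files are being mapped over and over.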

Excerpt from emagent.trc ($AGENT_HOME/sysman/log):

2008-04-22 15:29:20,221 Thread-4094679984 ERROR engine: [oracle_database,<rep_database>,health_check] :
nmeegd_GetMetricData failed : Instance Health Check initialization failed due to one of the
following causes: the owner of the EM agent process is not same as the owner of the Oracle
instance processes; the owner of the EM agent process is not part of the dba group; or the
database version is not 10g (10.1.0.2) and above.
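
To get an idea of how often the agent logs this error, the trace file can be searched for it. A minimal sketch, assuming the ERROR lines look like the excerpt above; the function name is hypothetical.

```shell
#!/bin/sh
# Count the ERROR lines that mention health_check in a trace file.
# $1 is the path to the trace file, e.g. $AGENT_HOME/sysman/log/emagent.trc
count_health_check_errors() {
    # grep -c exits non-zero when it counts 0 lines, hence the "|| true"
    grep -c 'ERROR.*health_check' "$1" || true
}

# Usage:
#   count_health_check_errors "$AGENT_HOME/sysman/log/emagent.trc"
```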

Cause: Bug 5872000 – HEALTHCHECK ERROR OCCURS FOR 32BIT DATABASE ON 64BIT OS DUE TO BUG4526916 FIX.
The Healthcheck file, $RDBMS_HOME/dbs/hc_.dat, differs in size from the memory structure the Agent uses to read it. The database creates this file at startup if it is not present.

This happens when, for example, the database is 10.2.0.4 and the agent 10.2.0.3, or vice versa.
In the latter case it is solved by upgrading the database to 10.2.0.4 or 11.1.0.7.

Possible solutions in my case:

1. Apply Patch 5872000 to 32-bit or 64-bit databases on 64-bit machines.
The patch needs to be applied on top of databases from 10.1 up to 10.2.0.3, and on top of 11.1.0.6.
After applying the patch, you may need to remove the following file from the DATABASE $ORACLE_HOME/dbs directory before starting the database: hc_.dat
NOTE: The database creates this file at startup if it is not present, and the agent uses it for the Healthcheck metric. Letting the database recreate the file at the first startup after the patch ensures it is the correct one for the agent.
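
Rather than deleting the file outright, it can be moved aside so there is something to fall back on. A small sketch of that idea: the function name and the ".bak.<timestamp>" suffix are my own choices, and you pass in the full path of the health check file yourself. Do this while the database is down, after the patch has been applied.

```shell
#!/bin/sh
# Move a health check file aside instead of deleting it.
# $1 is the full path to the file in the database $ORACLE_HOME/dbs directory.
set_aside_hc_file() {
    hc_file=$1
    if [ ! -f "$hc_file" ]; then
        echo "nothing to do: $hc_file not found" >&2
        return 0
    fi
    # The backup suffix is arbitrary; a timestamp keeps older copies apart.
    mv "$hc_file" "$hc_file.bak.$(date +%Y%m%d%H%M%S)"
}
```

The database then creates a fresh file at the next startup, as described in the note above.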

An easier way for the time being:
2. Disable the Healthcheck metric per database in Grid Control.
This does not affect the monitoring of whether the database is still up, for example (I tested this first).
On the Metric and Policy Settings page, Metrics tab, you will not see a metric called Health Check, even if you choose All metrics instead of the default Metrics with thresholds in the drop-down list titled View.

The Health Check metric is a composite metric which includes 7 metrics:
Instance Status
Instance State
Maintenance
Mounted
State Description
Unavailable
Unmounted

a. Go to the Home Page of the database for which you want to disable this metric.
In the Related Links pane at the bottom, click on the link Metric and Policy Settings.
b. Go to the metric Instance Status (or to any other metric belonging to the Health Check metric).
Click on the link in the column “Frequency Schedule”: 15 seconds by default.
c. On the Edit Collection Settings page:
Press the Disable button to disable this metric collection, or
change the collection frequency and any other value you want to change on this page.
Note: at the bottom of the page you will see a sheet titled “Affected Metrics”, which lists all the metrics that will be changed in the same way as the current metric.
You will notice that all the metrics belonging to the Health Check metric are listed there.
Hence they will all be disabled, or they will all get the same new collection frequency as the one you are updating.
d. Click on Continue, then on OK, so that these changes are saved in the repository.
e. Then click on OK once the Update confirmation is received.
A new collection file for this database will be created in the Monitoring Agent of this target in the directory $ORACLE_HOME/sysman/emd/collection.

Oh, and don’t forget to stop and start the agent on the target nodes after you’ve done this.
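
The restart itself is just emctl, run as the owner of the agent. A small sketch, assuming AGENT_HOME points at the Oracle home of the agent; the function wrapper is only there for convenience.

```shell
#!/bin/sh
# Stop and start the agent, then show its status.
# Assumption: AGENT_HOME is set to the Oracle home of the agent.
restart_agent() {
    emctl_bin="${AGENT_HOME:?AGENT_HOME is not set}/bin/emctl"
    "$emctl_bin" stop agent
    "$emctl_bin" start agent
    "$emctl_bin" status agent
}

# Usage:
#   restart_agent
```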

Grid 10.2.0.5:

Mr Akhtar Tiwana checked this issue with Oracle Support, and they suggested removing the warning and critical thresholds for the health check metrics (making them NULL), which has the same effect. The functionality to disable these metrics has apparently been taken away in Grid 10.2.0.5.

Used sources:
564617.1 Agent Fails on Instance Health Check Following Upgrade To 10.2.0.4
566607.1 Healthcheck Metric Collection Fails Since Agent was Upgraded to 10.2.0.4 on Linux x86-64 platform