Multipath timeout issues with extended 11.2.0.2 – cluster setup. Part II

The second and final post about an issue with a RAC configuration with two SANs. The problem was an I/O freeze of several minutes when crashing one of the two SANs. I ended the first post with a ‘cliffhanger’ because we had a solution but had not tested it yet. Now we have tested it.

Let's start with a recap of the first post.

Setup:

3 HP DL380 G6 systems with a basic RHEL 5u5 x86_64 installation (2 x RAC clusternodes, 1 x NFS-voting-node)

2 HP EVA 6400 SAN systems with 2 controllers each (resulting in 8 paths per device)

Oracle 11.2.0.2

Test: power off one SAN. Default result / problem: an I/O freeze of several minutes. Oracle didn’t like it and started to evict, shut down and start up, which is expected behaviour after such a long I/O freeze. But this is not what you want when installing RAC with two SANs….

By |January 3rd, 2012|Categories: Database, RAC|Tags: , , , |0 Comments

Param ‘_datafile_write_errors_crash_instance’, TRUE or FALSE?

Since 11.2.0.2 there’s a new parameter, “_datafile_write_errors_crash_instance”, which controls whether the instance crashes when a write error on a datafile occurs. But should I use it or not? The official text on this parameter:

This fix introduces a notable change in behaviour in that
from 11.2.0.2 onwards an I/O write error to a datafile will
now crash the instance.

Before this fix I/O errors to datafiles not in the system tablespace
offline the respective datafiles when the database is in archivelog mode.
This behavior is not always desirable. Some customers would prefer
that the instance crash due to a datafile write error.

By |August 26th, 2011|Categories: Database, RAC|Tags: , , |0 Comments

Multipath timeout issues with extended 11.2.0.2 – cluster setup

We were setting up a 2-node Oracle Grid Infrastructure (RAC) extended cluster on top of RHEL 5.5 according to the standard Oracle documentation, with of course a third NFS node as voting node. We also use ASM to create “host-based” mirrored block devices for the Oracle software.

The setup is as follows:

3 HP DL380 G6 systems with a basic RHEL 5u5 x86_64 installation (2 x RAC clusternodes, 1 x NFS-voting-node)

2 HP EVA 6400 SAN systems with 2 controllers each (resulting in 8 paths per device)

Oracle 11.2.0.2

We chose this configuration instead of a configuration with Data Guard because of our strict requirement on failover time in case of a node or SAN disaster: it should be within 30 seconds. This post raises the question whether we made the right decision….

The following analysis and testing, by the way, has been the effort of my colleague Chris Verhoef, a former Red Hat consultant:

With this setup we are facing the issue that if we lose a complete SAN, the I/Os to the ASM diskgroups are blocked for approximately 3 to 4 minutes. Oracle does not like this. About 70 seconds into a freeze, the rdbms starts to reboot (expected behaviour). To shorten this time we have done some testing with the following parameters:

checker timeout

no_path_retry

dev_loss_tmo
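
To get a feeling for how these parameters add up, here is a rough back-of-the-envelope sketch in Python. It is not what we ran in the tests. It assumes a simplified model in which the stall equals the time until the paths are marked failed plus the time multipath keeps queueing I/O (roughly no_path_retry times the multipath polling_interval, a setting not mentioned above). The function names and the candidate values are made up for illustration; only the 70 and 30 second figures come from the text.

```python
"""Crude estimate of the I/O stall after losing a SAN (illustrative only)."""

RDBMS_REBOOT_AFTER_S = 70   # from the text: rdbms starts to reboot after ~70 s
FAILOVER_TARGET_S = 30      # from the text: failover should be within 30 s


def queueing_seconds(no_path_retry: int, polling_interval_s: int) -> int:
    """Approximate seconds multipath keeps queueing I/O once all paths failed."""
    return no_path_retry * polling_interval_s


def estimated_stall_seconds(path_failure_detect_s: int,
                            no_path_retry: int,
                            polling_interval_s: int) -> int:
    """Assumed model: path failure detection time plus queueing time."""
    return path_failure_detect_s + queueing_seconds(no_path_retry, polling_interval_s)


def verdict(stall_s: int) -> str:
    """Compare an estimated stall against the two thresholds from the text."""
    if stall_s <= FAILOVER_TARGET_S:
        return "within failover target"
    if stall_s < RDBMS_REBOOT_AFTER_S:
        return "over target, but below the rdbms reboot threshold"
    return "rdbms reboot likely"


if __name__ == "__main__":
    # Hypothetical settings: (detection seconds, no_path_retry, polling_interval)
    candidates = {
        "slow": (60, 12, 10),
        "tight": (10, 2, 5),
    }
    for name, (detect, retries, interval) in candidates.items():
        stall = estimated_stall_seconds(detect, retries, interval)
        print(f"{name}: ~{stall} s, {verdict(stall)}")
```

The detection time is kept as a single input on purpose: in practice it depends on the checker timeout and dev_loss_tmo together with driver behaviour, so it is better measured than calculated.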

By |August 25th, 2011|Categories: Database, RAC|Tags: , , , |0 Comments

Red Hat 6 and Oracle, status of certification

Red Hat 6 has been around for a while, so what about certification with Oracle, and when? Nothing yet on the Oracle support site, no press releases (maybe I missed one..). But Red Hat had a blog post about it a while ago (August 2011):

We’re pleased to announce that on Tuesday, August 9, we formally submitted to Oracle full certification test results of the Oracle 11gR2 database (Single Instance and RAC (including ASM) for x86 and x86-64) on Red Hat Enterprise Linux 6. Oracle database certification is a self-certification program whereby operating system vendors perform extensive testing and submit the results to Oracle for audit and approval.

By |August 24th, 2011|Categories: Database|Tags: , , |1 Comment

Errors in alert log [NI cryptographic checksum mismatch] TNS-12599

Using rdbms 11.2 as the repository for our Grid Control environment, I noticed a lot of the same errors in the alert file of the repository and/or the target database:

NI cryptographic checksum mismatch error: 12599.
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.2.0 – Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.2.0 – Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.2.0 – Production
Time: 24-MAR-2011 11:58:31
Tracing not turned on.
Tns error struct:
ns main err code: 12599

TNS-12599: TNS:cryptographic checksum mismatch
ns secondary err code: 2526
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0

Note: 1150874.1 gives the cause, workaround and solution:

By |May 25th, 2011|Categories: Database, grid control|0 Comments