Since 126.96.36.199 there’s a new parameter, “_datafile_write_errors_crash_instance” to prevent the intance to crash when a write error on a datafile occurs . But.. should I use this or not. The official text of this parameter:
This fix introduces a notable change in behaviour in that
from 188.8.131.52 onwards an I/O write error to a datafile will
now crash the instance.
Before this fix I/O errors to datafiles not in the system tablespace
offline the respective datafiles when the database is in archivelog mode.
This behavior is not always desirable. Some customers would prefer
that the instance crash due to a datafile write error.
This fix introduces a new hidden parameter to control if the instance
should crash on a write error or not:
With this fix:
If _datafile_write_errors_crash_instance = TRUE (default) then
any write to a datafile which fails due to an IO error causes
an instance crash.
If _datafile_write_errors_crash_instance = FALSE then the behaviour
reverts to the previous behaviour (before this fix) such that
a write error to a datafile offlines the file (provided the DB is
in archivelog mode and the file is not in SYSTEM tablespace in
which case the instance is aborted)
When you ask Oracle for advice, you get the following answer:
20+ years ago a feature was added to Oracle to offline a datafile when there was an error writing a dirty buffer to it and it was not part of the system tablespace. At that time it made sense to do this since neither RAC or even
OPS was implemented and storage arrays did not exist. Then the most likelycause of an I/O error was a problem with the direct attached disk drive holding the datafile. By offlining the datafile the database might be able to continue running. Customers assumed that a disk failure would require restoring a backup and doing a media recovery so taking the file offline might improve availability. High availability was not expected.
Today almost all customers use highly available storage arrays accessible from multiple hosts. Now most I/O errors are either transient or are local to the host that encounters them. Real disk failures are hidden by the storage array redundancy. Customers expect a disk failure to have no effect on the operation of the database.
Unfortunately the code to offline a datafile on an I/O error is still there. The effect is that an error on one node in a cluster offlines the datafile and usually takes down the entire application on all nodes or even crashes all instances if the problem is with an undo tablespace. For example dismounting a file system on one node in a cluster makes that node get I/O errors for the files on that file system. This makes a mistake on one node take down the entire cluster.
Offlining a datafile on a write error is even a problem with single instance. Most I/O errors today will go away if the database is restarted on another machine or if the current machine is rebooted. However if the I/O error took a datafile offline, then the administrator must do a media recovery to make the application function again. This is an unusual procedure that takes awhile.
If the database instances do not crash it takes longer for the administrator to find out that the application is not working even though the database appears to be up and running. This is a problem with both RAC and single
Question: One concern is that a failed datafile write to a non-critical tablespace will bring down the database when it occurs in the only open instance.
It is true that there may be some situations where taking the file offline would be better. On the other hand there are cases where crashing in single instance is better because rebooting the server or restarting the instance will bring it up sooner with no need for manual intervention. Since we have to choose without knowing much about the system we have to base our choice on the odds of the failure being one case or the other. Twenty years ago a datafile was on one disk, almost all I/O errors were disk failures and a disk failure always meant doing media recovery. In that situation taking the datafile offline was clearly the right thing to do, even if the tablespace was critical to the application – it was going to need media recovery in any case.
Today systems are much different.– Storage arrays and mirroring mean that disk failures almost never require media recovery. I/O write errors usually stop happening when the system is reinitialized.
– Many customers have mechanisms like CRS to automatically restart the database, possibly on a different node.
Now it is much more likely that restarting the instance will resolve the problem without doing any media recovery, and it will happen automatically. The chance that the application can continue running with the offline datafile has always been slight, but when media recovery was going to be required anyway there was no harm in trying to offline the file. Now there is a lot of harm in offlining the file since it prevents automatic recovery and requires an administrator to perform tasks he is unfamiliar with. Today crashing the instance has a better chance of getting the application running sooner.