
Recover Oracle database after disk loss

 
PURPOSE
-------
 
This article walks you through some of the common
recovery techniques to use after a disk failure.
 
SCOPE & APPLICATION
-------------------
 
All Oracle Support Analysts, DBAs and Consultants who have a role
to play in recovering an Oracle database.
 
Loss due to Disk Failure
------------------------
What can we lose due to disk failure:
A) Control files
B) Redo log files
C) Archivelog files
D) Datafiles
E) Parameter file or SPFILE
F) Oracle software installation
 
Detecting disk failure
-----------------------
1) Run copy utilities like "dd" on Unix to read the suspect files or devices
2) If using RAID mechanisms like RAID 5, parity information may mask 
    the disk failure and a more rigorous check would be needed
3) As always, check the operating system log files
4) Another obvious case is when the disk cannot be seen
    or mounted by the OS
5) On the Oracle side, run dbv if the affected file is a datafile
6) The best way to detect disk failure is by running hardware 
diagnostic tools and OS-specific disk utilities.
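 
The dbv check in step 5 is run from the OS prompt. A minimal sketch, with an
illustrative file name and block size:
 
   $ dbv file=/oracle/oradata/users01.dbf blocksize=8192
 
Any blocks reported as corrupt or failing in the dbv summary point to damage
in the underlying storage.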
 
Next Action
------------
Once the type of failure is identified, the next step is to rectify it.
 
Options could be:
(1) Replace the corrupted disk with a new one and mount them with 
     the same name (say /oracle or D:\)
(2) Replace the corrupted disk with a new one and mount them with 
     a different name (say /oracle1 as the new mount point)
(3) Decide to use another existing disk mounted with a different name
     (say /oracle_new)
 
The most common methods are (1) and (3).
 
Oracle Recovery
---------------
Once the disk problem is sorted, the next step is to perform recovery
at the Oracle level. This would depend on the type of file that is lost (see
"Loss due to Disk Failure" section) and also on the type of disk recovery done
as mentioned in the "Next Action" section above.
 
(A) Control Files
------------------
Normally, we have multiplexing of controlfiles and they are expected to be
placed in different disks.
 
If one or more controlfiles are lost, mount will fail as shown below:
SQL> startup
Oracle Instance started
....
ORA-00205: error in identifying controlfile, check alert log for more info
 
You can verify the controlfile copies using:
SQL> select * from v$controlfile;
 
   **If at least one copy of the controlfile is unaffected by the disk failure
   and the database was shut down cleanly:
   (a) Copy a good copy of the controlfile to the missing location
   (b) Start the database 
 
   Alternatively, remove the lost control file location specified in the
   init parameter control_files and start the database.
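 
   As a sketch, if the copy on the failed disk was control02.ctl, the init.ora
   can be edited to list only the surviving copies (paths here are illustrative):
 
   control_files = ('/disk1/oradata/control01.ctl',
                    '/disk3/oradata/control03.ctl')
 
   SQL> startup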
 
   **If all copies of the controlfile are lost due to the disk failure, then:
   Check for a backup controlfile. Backup controlfile is normally taken using 
   either of the following commands:
   (a) SQL> alter database backup controlfile to '/backup/control.ctl';
    -- This would have created a binary backup of the current controlfile --
 
    -->If the backup was done in binary format as mentioned above, restore the 
       file to the lost controlfile locations using OS copying utilities.
    --> SQL> startup mount;
    --> SQL> recover database using backup controlfile;
    --> SQL> alter database open;
 
   (b) SQL> alter database backup controlfile to trace;
    -- This would have created a readable trace file containing create controlfile
    script --
 
    --> Edit the trace file created (check user_dump_dest for the location) and
        retain the SQL commands alone. Save this to a file say cr_ctrl.sql
    --> Run the script
    
    SQL> @cr_ctrl
 
    This would create the controlfile, recover database and open the database.
 
    ** If no copy of the controlfile or backup is available, then create a controlfile
    creation script using the datafile and redo log file information. Ensure that the
    file names are listed in the correct order as in FILE$.
    Then the steps would be similar to the one followed with cr_ctrl.sql script.
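 
    A skeleton of such a creation script is sketched below; the database name,
    file names and sizes are illustrative and must be replaced with the actual
    values, with the datafiles listed in FILE$ order:
 
    CREATE CONTROLFILE REUSE DATABASE "PROD" NORESETLOGS ARCHIVELOG
        LOGFILE GROUP 1 '/oracle/oradata/redo01.log' SIZE 50M,
                GROUP 2 '/oracle/oradata/redo02.log' SIZE 50M
        DATAFILE '/oracle/oradata/system01.dbf',
                 '/oracle/oradata/users01.dbf';
    RECOVER DATABASE;
    ALTER DATABASE OPEN;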
 
 
Note that all controlfile related SQL maintenance operations are done in the 
database nomount state
 
 
(B) Redo logs
    ---------
In normal cases, we would not have backups of online redo log files. But the 
inactive logfile changes could already have been checkpointed on the datafiles
and even archive log files may be available.
 
SQL> startup mount
     Oracle Instance Started
     Database mounted
     ORA-00313: open failed for members of log group 1 of thread 1
     ORA-00312: online log 1 thread 1: '/(path)/REDO01.LOG'
     ORA-27041: unable to open file
     OSD-04002: unable to open file
     O/S-Error: (OS 2) The system cannot find the file specified.
 
** Verify if the lost redolog file is Current or not.
     SQL> select * from v$log;
     SQL> select * from v$logfile; 
 
     --> If the lost redo log is an Inactive logfile, you can clear the logfile:
 
     SQL> alter database clear logfile '/(path)/REDO01.LOG';
 
     Alternatively, you can drop the logfile if you have at least two other   
     logfile groups:
     SQL> alter database drop logfile group 1;
 
     
     --> If the logfile is the Current logfile, then do the following:
     SQL> recover database until cancel;
         
     Type Cancel when prompted
 
     SQL> alter database open resetlogs;
 
     
     The 'recover database until cancel' command can fail with the following 
     errors:
     ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error 
     below
     ORA-01194: file 1 needs more recovery to be consistent
     ORA-01110: data file 1: '/(Path)/SYSTEM01.DBF'
 
     In this case , restore an old backup of the database files and apply the
     archive logs to perform incomplete recovery.
     --> restore old backup
     SQL> startup mount
     SQL> recover database until cancel using backup controlfile;
     SQL> alter database open resetlogs;
 
 
If the database is in noarchivelog mode and the ORA-1547, ORA-1194 and ORA-1110 errors occur, then you would have to restore from an old backup and start the database.
 
 
Note that all redo log maintenance operations are done in the database mount state
 
 
(C) Archive logs
-----------------
If only previously generated archive log files have been lost, then there is
little cause for panic.
** Backup the current database files using hot or cold backup which would ensure
that you would not need the missing archive logs
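 
As a sketch, a minimal hot backup of a single tablespace could look like the
following (tablespace and file names are illustrative; repeat for every
tablespace in the database):
 
   SQL> alter tablespace users begin backup;
   $ cp /oracle/oradata/users01.dbf /backup/
   SQL> alter tablespace users end backup;
   SQL> alter system archive log current;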
 
(D) Datafiles
--------------
This obviously is the biggest loss.
 
(1) If only a few sectors are damaged, then you would get ORA-1578 when 
accessing those blocks.
 --> Identify the object name and type whose block is corrupted by querying dba_extents
 --> Based on the object type, perform appropriate recovery
 --> Check metalink Note:28814.1 for resolving this error
 
(2) If the entire disk is lost, then one or more datafiles may need to be 
recovered.
  SQL> startup
  ORACLE instance started.
  ...
  Database mounted.
  ORA-01157: cannot identify/lock data file 3 - see DBWR trace file
  ORA-01110: data file 3: '/(path)/USERS01.DBF'
 
Other possible errors are ORA-00376 and ORA-1113
 
The views and queries to identify the datafiles would be:
   SQL> select file#,name,status from v$datafile;
   SQL> select file#,online,error from v$recover_file;
 
 
** If restoring to a replaced disk mounted with the same name, then :
  (1) Restore the affected datafile(s) using OS copy/restore commands from the 
      previous backup
  (2) Perform recovery based on the type of datafile affected namely SYSTEM, 
      ROLLBACK or UNDO, TEMP, DATA or INDEX.
  (3) The recover commands could be 'recover database', 'recover tablespace'
      or 'recover datafile' based on the loss and the database state
 
** If restoring to a different mount point, then :
  (1) Restore the files to the new location from a previous backup
  (2) SQL> STARTUP MOUNT
  (3) SQL> alter database rename file '/old path_name' to 'new path_name';     
      -- Do this renaming for all datafiles affected. --
  (4) Perform recovery based on the type of datafile affected namely SYSTEM, 
      ROLLBACK or UNDO, TEMP, DATA or INDEX.
  (5) The recover commands could be 'recover database', 'recover tablespace'
      or 'recover datafile' based on the loss and the database state
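 
Putting steps (2) to (5) together for a single datafile moved to a new mount
point, a sketch with illustrative paths:
 
   SQL> startup mount
   SQL> alter database rename file '/oracle/users01.dbf' to '/oracle_new/users01.dbf';
   SQL> recover datafile '/oracle_new/users01.dbf';
   SQL> alter database datafile '/oracle_new/users01.dbf' online;
   SQL> alter database open;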
 
The detailed steps of recovery based on the datafile lost and the Oracle error 
are outlined in the articles referenced at the end of this note.
 
 
  NOARCHIVELOG DATABASE
  =====================
  The loss mentioned in (A),(B) and (D) would be different in this case
  wherever archive logs are involved. 
 
  We will discuss the datafile loss scenarios here:
 
  (a) If the datafile lost is a SYSTEM datafile, restore the complete
      database from the previous backup and start the database.
  (b) If the datafile lost is Rollback related datafile with active transactions,
      restore from the previous backup and start the database.
  (c) If the datafile contains rollback with no active rollback segments, you can
      offline the datafile (after commenting the rollback_segments parameter 
      assuming that they are private rollback segments) and open the database. 
  (d) If the datafile is temporary, offline the datafile and open the database. 
      Drop the tablespace and recreate the tablespace.
  (e) If the datafile is DATA or INDEX, 
      **Offline the tablespace and start the database.
      **If you have a previous backup, restore it to a separate location.
      **Then export the objects in the affected tablespace ( using User or 
        table level export).
      **Create the tablespace in the original database.
      **Import the objects exported above.
 
      If the database is 8i or above, you can also use Transportable tablespace
      feature.
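 
      A sketch of scenario (e), assuming the backup was restored as a separate
      database and the affected tablespace holds objects owned by user SCOTT
      (all names and sizes are illustrative):
 
      -- against the restored copy:
      $ exp system/<password> owner=scott file=scott.dmp
      -- in the original database:
      SQL> create tablespace data01 datafile '/oracle/data01.dbf' size 100M;
      $ imp system/<password> file=scott.dmp fromuser=scott touser=scott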
 
 
(E) Parameter file
    ---------------
This is not a major loss and can be easily restored. Options are:
  (1) If there is a backup, restore the file
  (2) If there is no backup, copy sample file or create a new file and add the 
      required parameters. Ensure that the parameters db_name, control_files,
       db_block_size, compatible are set correctly
  (3) If the spfile is lost, you can create it from the init parameter file if it is available
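 
  Step (3) is a one-liner once a pfile is available; conversely, a lost pfile
  can be regenerated from a surviving spfile (the path is illustrative):
 
  SQL> create spfile from pfile='/oracle/dbs/initPROD.ora';
  SQL> create pfile='/oracle/dbs/initPROD.ora' from spfile;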
 
 
(F) Oracle Software Installation
    ----------------------------
There are two ways to recover from this scenario:
  (1) If there is a backup of the Oracle home and Oracle Inventory, restore
      them to the respective directories. Note that if you change the Oracle Home, 
      the inventory would not be aware of this new path and you would not be
      able to apply patchsets. Also restore to the same OS user and group.
 
  (2) Perform a fresh Install, bringing it to the same patchset level
 
 
PRACTICAL SCENARIO
==================
 
In most cases, when a disk is lost, more than one type of file could be lost.
The recovery in this scenario would be:
  (1) A combination of each of these data loss recovery scenarios
  (2) Perform an entire database restore from the most recent backup and apply
      archive logs to perform recovery. This is the preferred method 
      but can be time consuming.
 
 

Recover A Lost Oracle Datafile With No Backup

Problem Description: 
==================== 
 
You have inadvertently lost a datafile at the OS level and there are no current 
backups. 
You are in archivelog mode.
You have ALL Archivelogs available since the datafile was created initially (creation date). 
 
 
Problem Explanation: 
==================== 
 
Since there are no backups, the database cannot be opened without this file 
unless the datafile is offline dropped and its tablespace dropped. If this is 
an important file and tablespace, this is not a valid option.
 
 
Problem References: 
=================== 
 
Oracle 7 Backup and Recovery Workshop Student Guide, Failure Scenario 14 
 
 
Search Words: 
============= 
 
ORA-1110, lost datafile, file not found.
 
 
 
Solution Description: 
===================== 
 
This file has to be recreated and recovered. Do the following:
 
1) Go to svrmgrl and connect internal.
 
2) SVRMGR>shutdown immediate. (If this hangs, issue shutdown abort)
 
3) SVRMGR>startup mount 
 
4) SVRMGR> select * from v$recover_file;
 
 
  SAMPLE:
 
  FILE#      ONLINE  ERROR              CHANGE#    TIME
  ---------- ------- ------------------ ---------- --------------------
          11 OFFLINE FILE NOT FOUND              0 01/01/88 00:00:00
 
  (Noting the file number that was reported in the error)
 
 
5) SVRMGR> select * from v$datafile where FILE#=11;
 
  SAMPLE:
 
  FILE#      STATUS  ENABLED    CHECKPOINT BYTES      CREATE_BYT NAME
  ---------- ------- ---------- ---------- ---------- ---------- --------
          11 RECOVER READ WRITE 4.9392E+12          0      10240 /tmp/sample.dbf
 
  (Note the status is RECOVER and the CREATE_BYTE size)
  (Note the NAME)
 
 
6) Recreate the datafile.
 
SVRMGR> alter database create datafile '/tmp/sample.dbf'
as '/tmp/sample.dbf' size 10240 reuse;
 
(Note that the file "created" and the file created "as" are
the same file. The "size" needs to be the same size as it
was when it was created.)
 
7) Check to see that it was successful.
 
SVRMGR> select * from v$datafile where FILE#=11;
 
8) Bring the file online.
 
SVRMGR> alter database datafile '/tmp/sample.dbf' online;
 
9) Recover the datafile.
 
SVRMGR> Recover database;
 
Note: During recovery, all archived redo logs written to since the original 
datafile was created must be applied to the new, empty version of the 
lost datafile.
 
 
10) Enjoy!!
 
SVRMGR> alter database open;
 
 
Solution Explanation: 
===================== 
 
Recreating the file and recovering it rewrites it to the OS and brings it up to 
date.   
 
 
Solution References: 
==================== 
 
Oracle 7 Backup and Recovery Workshop Student Guide, Failure Scenario 14
 

How to Recover an Oracle Database Having Added a Datafile Since Last Backup

NOTE: In the images and/or the document content below, the user information and environment 
data used represents fictitious data from the Oracle sample schema(s), Public Documentation 
delivered with an Oracle database product or other training material.  Any similarity to actual 
environments, actual persons, living or dead, is purely coincidental and not intended in any manner.
 
 
 
HOW TO RECOVER A DATABASE HAVING ADDED A DATAFILE SINCE THE LAST BACKUP
-----------------------------------------------------------------------
 
This bulletin outlines the steps required in performing database recovery
having added a datafile to the database since the last backup was taken. 
Section A is applicable to Oracle release 7.x. Section B applies only to
Oracle releases 7.3.x and above.
 
PLEASE READ THROUGH ALL STEPS AND WARNINGS BEFORE ATTEMPTING TO USE THIS
BULLETIN.
 
 
A. Current controlfile, backup of datafile exists (Oracle release 7.x)
   ===================================================================
 
 A valid (either hot or cold) backup of the datafiles exists, except for the
 datafile created since the backup was taken. The current controlfile exists. 
 The database is in archivelog mode (see note (c) at bottom of page).
 
 1. Restore ONLY the datafiles (those that have been lost or damaged) from the 
    last hot or cold backup. The current online redo logs and control file(s) 
    must be intact.
 
 2. Mount the database
 
 3. Create a new datafile using the 'ALTER DATABASE CREATE DATAFILE' command.
 
    a. The datafile can be created with the same name as the original
       file. For example,
 
       SQLDBA> alter database create datafile
            2> '/oracle/dbs/testtbs.dbf';
       Statement processed.
 
    b. The datafile can be created with a different filename to the original. 
       This option might be chosen if the original file was lost due to disk 
       failure and the failed disk was still unavailable; the new file would 
       then be created on a different device. For example,
 
       SQLDBA> alter database create datafile
            2> '/oracle/dbs/testtbs.dbf'
            3> as
            4> '/oracle2/dbs/testtbs.dbf';
       Statement processed.
 
       The above command creates a new datafile on a different device. The file
       is created using information, stored in the control file, from the 
       original file. The command implicitly renames the filename in the 
       control file.
   
       NOTE: IT IS VERY IMPORTANT TO SPECIFY THE CORRECT FILENAME WHEN
             RECREATING THE LOST DATAFILE. IF YOU SPECIFY AN EXISTING
             ORACLE DATAFILE, THAT DATAFILE WILL BE INITIALISED AND WILL
             ITSELF REQUIRE RECOVERY.
 
 4. Recover the database.
 
    SQLDBA> recover database
    ORA-00279: Change 6677 generated at 06/03/97 15:20:24 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000074.arc
    ORA-00280: Change 6677 for thread 1 is in sequence #74
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
    
    At this point the recovery procedure will wait for the user to supply the
    information requested regarding the name and location of the archived log
    files. For example, entering AUTO directs Oracle to apply the suggested 
    redo log and any others that it requires to recover the datafiles.
 
    Applying suggested logfile...
    Log applied.
              :
              :
    <Application of further redo logs>
              :
              :
    Media recovery complete.
 
 5. Open the database
 
    SQLDBA> alter database open;
    Statement processed.
 
 
 
B. Old controlfile, no backup of datafile (Oracle release 7.3.x and above)
   =======================================================================
 
 A valid (either hot or cold) backup of the datafiles exists, except for the
 datafile created since the backup was taken. The controlfile is a backup from
 before the creation of the new datafile. The database is in archivelog mode 
 (see note (c) at bottom of page).
 
 NOTE : 'svrmgrl' has been replaced by SQL*Plus starting from Oracle8i,
        so the 'SVRMGR>' prompt is then replaced by 'SQL>'
 
 1. Restore the datafiles (those that have been lost or damaged) from the 
    last hot or cold backup. Also restore the old copy of the controlfile.
    The current online redo logs must be intact.
 
 2. Mount the database
 
 3. Start media recovery, specifying backup controlfile
 
    SVRMGR> recover database using backup controlfile
    ORA-00279: Change 6677 generated at 06/03/97 15:20:24 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000074.arc
    ORA-00280: Change 6677 for thread 1 is in sequence #74
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
    At this point, apply the archived logs as requested. Eventually Oracle
    will encounter redo to be applied to the non-existent datafile. The 
    recovery session will exit with the following message, and will return
    the user to the Server Manager prompt:
 
    ORA-00283: Recovery session canceled due to errors
    ORA-01244: unnamed datafile(s) added to controlfile by media recovery
    ORA-01110: data file 5: '/oracle/dbs/testtbs.dbf'
 
 4. Recreate the missing datafile. To do this, select the relevant filename 
    from v$datafile:
 
    SVRMGR> select name from v$datafile where file#=5;
    NAME
    -------------------------------------------------------
    UNNAMED0005
 
    Now recreate the file:
 
    SVRMGR> alter database create datafile
         2> 'UNNAMED0005'
         3> as
         4> '/oracle/dbs/testtbs.dbf';
 
 
 
 5. Restart recovery
 
    SVRMGR> recover database using backup controlfile
    ORA-00279: Change 6747 generated at 09/24/97 16:57:18 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000079.arc
    ORA-00280: Change 6747 for thread 1 is in sequence #79
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
    Apply archived logs as requested. Prior to Oracle8, recovery must apply
    the complete log which was current at the time of the datafile creation
    (in the above example, this would be log sequence 79). A recovery to a
    point in time before the end of this log would result in errors:
 
    ORA-01196: file 1 is inconsistent due to a failed media recovery session
    ORA-01110: data file 1: '/oracle/dbs/systbs.dbf'
 
    If this happens, re-recover the database and ensure that the complete log
    is applied (plus any further redo if required). This limitation does
    not exist from Oracle 8.0+.
 
    Eventually, Oracle will request the archived log corresponding to the 
    current online log. It does this because the (backup) controlfile has no 
    knowledge of the current log sequence. If an attempt is made to apply the 
    suggested log, the recovery session will exit with the following message:
 
    ORA-00308: cannot open archived log '/oracle/dbs/arch/arch000084.arc'
    ORA-07360: sfifi: stat error, unable to obtain information about file.
    SVR4 Error: 2: No such file or directory
 
    At this stage, simply restart the recovery session and apply the current
    online log. The best way to do this is to try applying the online redo 
    logs one by one until Oracle completes media recovery:
 
    SVRMGR> recover database using backup controlfile
    ORA-00279: Change 6763 generated at 09/24/97 16:57:59 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000084.arc
    ORA-00280: Change 6763 for thread 1 is in sequence #84
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
    /oracle/dbs/log2.dbf
    Log applied.
    Media recovery complete.
 
 6. Open the database
 
    SVRMGR> alter database open resetlogs;
 
    The resetlogs option must be chosen to resynchronize the controlfile. 
 
    
NOTES:
======
 
a) These techniques can be used whether the database was closed either 
   cleanly or uncleanly (aborted).
 
b) If the database is recovered using an incomplete recovery technique (either
   time-based, cancel-based, or change-based), and is recovered to a point in
   time before the datafile was originally created, any references to that
   datafile will be removed from the database when the database is opened.
 
   Oracle handles this situation as follows:
 
   - The 'alter database create datafile....' command creates a reference in 
     the controlfile for the datafile.
   - Incomplete recovery terminates before applying redo that would create a
     corresponding row for the datafile in the file$ dictionary table.
   - When the database is opened, Oracle detects an inconsistency between file$
     and the controlfile and resolves in favor of file$, deleting the entry
     from the controlfile. 
 
c) It may be possible to recover the datafile using this technique even if the
   database is not in archivelog mode. However, this relies on the required 
   redo being available in the online redo logs.
   

Recreating a missing Oracle datafile with no backups

GOAL
How to recreate a datafile that is missing at the operating system level. Missing/inaccessible files may be reported with one or more of these errors:
 
ORA-01116: error in opening database file %s
ORA-27041: unable to open file
ORA-01157: cannot identify/lock data file %s - see DBWR trace file
ORA-01119: error in creating database file '%s'
 
 
No backup or copy of the datafile is required. We only need the redo logs starting from the time of the datafile creation to the current point in time.
 
Note: this scenario does not apply to plugged-in datafiles; such files need to be plugged in again from their source.
 
SOLUTION
When a datafile goes missing at the operating system level, you would normally need to restore and recover it from a backup. If you do not have backups of this datafile, but do have redo logs you can still create and recover the datafile. You only need the redo logs starting from the datafile creation time to now.
 
Prior to 10g, you would use the following SQL commands:
 
SQL> alter database create datafile '<missing name>' as '<missing name>';
SQL> recover datafile '<missing name>';
SQL> alter database datafile '<missing name>' online;
As of 10g, you can also do this in RMAN.
 
1) RMAN will create the datafile if there are no backups or copies of this datafile:
 
 
 
RMAN> restore datafile <missing file id>;
 
 
2) Recover the newly created datafile:
 
RMAN> recover datafile <missing file id>;
 
 
3) Bring it online:
 
RMAN> sql 'alter database datafile <missing file id> online';
 
 
 
Example:
 
RMAN> list copy of datafile 6;
 
specification does not match any datafile copy in the repository
 
RMAN> list backup of datafile 6;
 
specification does not match any backup in the repository
 
RMAN> restore datafile 6;
 
Starting restore at 14 JUL 10 10:20:02
using channel ORA_DISK_1
 
creating datafile file number=6 name=/opt/app/oracle/oradata/ORA112/datafile/o1_mf_leng_ts_63t08t64_.dbf
restore not done; all files read only, offline, or already restored
Finished restore at 14 JUL 10 10:20:05
 
RMAN> recover datafile 6;
 
Starting recover at 14 JUL 10 10:21:02
using channel ORA_DISK_1
 
starting media recovery
media recovery complete, elapsed time: 00:00:00
 
Finished recover at 14 JUL 10 10:21:02
 
RMAN> sql 'alter database datafile 6 online';
 
sql statement: alter database datafile 6 online

How to Recover from a Lost or Deleted Oracle Datafile with Different Scenarios

PURPOSE
This article explains the various scenarios for ORA-01157 and how to avoid them.
 
 
SCOPE & APPLICATION
 
This article is intended for Oracle Support Analysts, Oracle Consultants and
Database Administrators.
 
 
TROUBLESHOOTING STEPS
How to Recover from a Lost Datafile in Different Scenarios
 
In the event of a lost datafile or when the file cannot be accessed an ORA-01157
is reported followed by ORA-01110.
 
Besides this, you may encounter error ORA-07360 : sfifi: stat error, unable to
obtain information about file. A DBWR trace file is also generated in the
background_dump_dest directory. An attempt to shut down the database in normal
or immediate mode will result in ORA-01116, ORA-01110 and possibly ORA-07368.
 
This article discusses various scenarios that may be causing this error and the  
solution/workaround for these.
 
Throughout this note we refer to "backups" but if you have a valid physical standby database
you may also use the standby database's datafiles to recover the primary database.
 
 
Datafile not found by Oracle
 
- Unintentionally renamed or moved at the Operating System (OS) level.
  Simply restore the file to its original location and recover it
 
- Intentionally moved/renamed at OS level.
  You are re-organising the datafile layout across various disks at the OS.
  After moving/renaming the file you will have to rename the file at database
  level, and recover it.
 
Note:115424.1 How to Rename or Move Datafiles and Logfiles
 
 
Datafile damaged/deleted
 
If the file is damaged or deleted, an attempt to start the database will result
in ORA-01157 and ORA-01110. Then, depending upon the type of datafile lost,
different action needs to be taken. Check for a faulty hard disk: the file may
have become corrupt due to a failing disk. Replace the bad disk or create the
file on a healthy disk.
 
Lost datafile could be in one of the following:
 
1. Temporary tablespace
  
   If the datafile belongs to a temporary tablespace, you will have to simply offline
   drop the datafile and then drop the tablespace with including contents option.
   Thereafter, re-create the temporary tablespace.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
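 
   A sketch of recreating a lost dictionary-managed temporary tablespace (names
   and sizes are illustrative; on 8i and above a locally managed temporary
   tablespace would instead be recreated with CREATE TEMPORARY TABLESPACE ...
   TEMPFILE):
 
   SQL> alter database datafile '/oracle/oradata/temp01.dbf' offline drop;
   SQL> alter database open;
   SQL> drop tablespace temp including contents;
   SQL> create tablespace temp datafile '/oracle/oradata/temp01.dbf' size 100M
        reuse temporary;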
 
 
2. Read Only Tablespace
  
   In this case you will have to restore the most recent backup of the read-only
   datafile. No media recovery is required as read-only tablespaces are not
   modified. Note however that media recovery will be required under the following conditions:
 
   a. The tablespace was in read-write mode when the last backup was taken
      and was made read-only afterwards.
 
   b. The tablespace was in read-only mode when last backup was taken and
      was made read-write in between and then again made read only
 
   In either of the above cases you will have to restore the file and do a media
   recovery using RECOVER DATAFILE statement. Apply all the necessary archived redo
   logs until you get the message "Media Recovery Complete".
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
 
3. User Tablespace
  
   Two options are available:
 
   a. Recreate the user tablespace.
      If all the objects in the tablespace can be re-created (recent export is
      available; tables can be re-populated using scripts; SQL*Loader etc)
      Then, offline drop the datafile, drop the tablespace with including
      contents option. Thereafter, re-create the tablespace and re-create
      the objects in it.
 
   b. Restore file from backup and do a media recovery.
       Database has to be in archivelog mode. If the database is in NOARCHIVELOG
      mode, you will only succeed in recovering the datafile if the redo to be
      applied to it is within the range of your online redo logs.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
 
4. Index Tablespace
  
   Two options are available:
 
   a. Recreate the Index tablespace
      If the index can be easily re-created using script or manual CREATE INDEX
       statement, then the best option is to offline drop the datafile, drop the
       index tablespace, and re-create it along with all its indexes.
 
   b. Restore file from backup and do a media recovery.
      If the index tablespace cannot be easily re-created, then restore the
      lost datafile from a valid backup and then do a media recovery on it.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
5. System (and/or Sysaux) Tablespace
  
   a. Restore from a valid backup and perform a media recovery on it
 
   b. Rebuild the database.
      If neither backup of the datafile nor the full database backup is
      available, then rebuild database using full export, user level/table
      level export, scripts, SQL*Loader, standby etc. to re-create and
      re-populate the database.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
 
6. Undo Tablespace
  
   While handling situation with lost datafile of an undo tablespace you need to
   be extra cautious so as not to lose active transactions in the undo segments.
 
   The preferred option in this case is to restore the datafile from backup and
   perform media recovery.
 
      i.  If the database was cleanly shutdown.
          Ensure that database was cleanly shutdown in NORMAL or IMMEDIATE mode.
          Update your init file with "undo_management=manual"
          Restart the database
          Drop and recreate the undo tablespace
          Update your init file with "undo_management=auto"
          Restart the database
 
      ii. If the database was NOT cleanly shutdown.
          If the database was shutdown aborted or crashed, you may not be able to drop
          the datafile as the undo segments may contain active transactions.
          You will need to restore the file from a backup
          and perform a media recovery.
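 
   The clean-shutdown case (i) can be sketched as follows (the tablespace and
   file names are illustrative):
 
   -- init.ora: undo_management = manual
   SQL> startup mount
   SQL> alter database datafile '/oracle/oradata/undotbs01.dbf' offline drop;
   SQL> alter database open;
   SQL> drop tablespace undotbs1 including contents;
   SQL> create undo tablespace undotbs1
        datafile '/oracle/oradata/undotbs01.dbf' size 500M reuse;
   -- init.ora: undo_management = auto, then restart the database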
 
7. Lost Controlfiles and Online Redo Logs
  
   If the datafiles are in a consistent state, not needing media recovery, but
   you have lost all the controlfiles and online redo logs, then attempting to
   create the controlfile using a script will fail with complaints about the
   missing redo logs. In this case use the RESETLOGS option of the create
   controlfile script and then open the database with the RESETLOGS option.
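 
   A sketch of the RESETLOGS variant of the creation script (the database name,
   file names and sizes are illustrative; the log files listed here will be
   recreated when the database is opened with RESETLOGS):
 
   CREATE CONTROLFILE REUSE DATABASE "PROD" RESETLOGS ARCHIVELOG
       LOGFILE GROUP 1 '/oracle/oradata/redo01.log' SIZE 50M,
               GROUP 2 '/oracle/oradata/redo02.log' SIZE 50M
       DATAFILE '/oracle/oradata/system01.dbf';
   ALTER DATABASE OPEN RESETLOGS;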
 
 
8. Lost datafile and no backup
  
   If there are no backups of the lost datafile, then you can re-create the
   datafile with the same size as the original file and then apply all the
   archived redologs written since the original datafile was created to the new
   version of the lost datafile.
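
   The approach can be sketched as follows (the path is hypothetical; this works
   only if every archived redolog generated since the original file was created
   is still available):

```sql
-- Recreate the lost file as a new, empty version of itself
ALTER DATABASE CREATE DATAFILE '/<disk_path>/users02.dbf'
  AS '/<disk_path>/users02.dbf';

-- Apply all redo generated since the original file was created
RECOVER DATAFILE '/<disk_path>/users02.dbf';

ALTER DATABASE DATAFILE '/<disk_path>/users02.dbf' ONLINE;
```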
 
 
Note: Restore and recovery from backup should be the first and preferred
         option for cases 2 - 6.
 
   Note:1060605.6 Lost datafile and no backup.

Oracle ORA-1157 Troubleshooting

PURPOSE
This note is intended to list the common reasons and solutions for the ORA-1157 error.
 
SCOPE
NOTE: In the images and/or the document content below, the user information and environment data used represents fictitious data from the Oracle sample schema(s), Public Documentation delivered with an Oracle database product or other training material. Any similarity to actual environments, actual persons, living or dead, is purely coincidental and not intended in any manner.
 
 
This article is intended for Oracle Support and Oracle database administrators.
 
DETAILS
Oracle Error: ORA-1157
An ORA-01157 is issued whenever Oracle attempts to access a file but cannot find or lock the file.
Error Explanation:
 
01157, 00000, "cannot identify/lock data file %s - see DBWR trace file"
 
Cause: The background process was either unable to find one of the data files or failed to lock it because the file was already in use. The database will prohibit access to this file but other files will be unaffected. However the first instance to open the database will need to access all online data files. Accompanying error from the operating system describes why the file could not be identified.
 
Action: Have operating system make file available to database. Then either open the database or do ALTER SYSTEM CHECK DATAFILES.
ORA-01157 errors are usually followed by ORA-01110 and possibly an Oracle operating system layer error such as ORA-07360. A DBWR trace file is generated in the background_dump_dest directory.
For Example, on Solaris platform, the following errors will appear: 
 
ORA-01157: cannot identify/lock data file 19 - see DBWR trace file
ORA-01110: data file 19: '/<disk_path>/users02.dbf'
From the DBWR trace file:
 
ORA-01157: cannot identify/lock data file 19 - see DBWR trace file
ORA-01110: data file 19: '/<disk_path>/users02.dbf'
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory
Additional information: 3
 
Common Causes and Solutions for ORA-1157
 Note: Throughout this note we refer to "backups" but if you have a valid physical standby database
you may also use the standby database's datafiles to recover the primary database.
 
1.  The datafile does exist, but Oracle cannot find it.
 
The datafile may have been renamed at the operating system level, moved to a different directory or disk drive either intentionally or unintentionally.
 
In this case, restore and recover the datafile or move the datafile back to its original name and location.
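
If the file was merely moved or renamed at the OS level, repointing the controlfile is often enough. A sketch (paths are hypothetical):

```sql
-- With the database mounted:
ALTER DATABASE RENAME FILE '/<old_path>/users02.dbf'
  TO '/<new_path>/users02.dbf';

ALTER DATABASE OPEN;
```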
 
2.  The datafile does not exist or is unusable by Oracle. The datafile has been physically removed or damaged to an extent that Oracle cannot recognize it anymore.
 
For example, the datafile might be truncated or overwritten, in which case
ORA-27046 will accompany ORA-1157 error.
 
For example:
 
ORA-27046: file size is not a multiple of logical block size
 
In this case, the user has two options:
 
A. Recreate the tablespace that the datafile belongs to.
 
 
This option is best suited for TEMPORARY tablespaces (since they do not contain permanent data), and can also be used for USERS and INDEX tablespaces.
 
It is also an option for UNDO tablespaces if the database had been SHUTDOWN CLEANLY, so that no active transactions remain in the rollback segments of this tablespace.
 
If the tablespace is the SYSTEM tablespace, then this amounts to recreating or rebuilding the database.
 
This method is most helpful when reasonably recent exports of the objects in the tablespace are available, or when the tables in the tablespace can be repopulated by running a script or program, loading the data through SQL*Loader, etc.
 
The steps involved are:
 
1. If the database is down, mount it.
 
STARTUP MOUNT;
 
2. Offline drop the datafile.
 
ALTER DATABASE DATAFILE 'full_path_file_name' OFFLINE DROP;
 
3. If the database is at mount, open it.
 
ALTER DATABASE OPEN;
 
4. Drop the user tablespace.
 
DROP TABLESPACE tablespace_name INCLUDING CONTENTS;
 
Note: You can stop at this step if you no longer want the
tablespace in the database.
 
5. Recreate the tablespace.
 
CREATE TABLESPACE tablespace_name DATAFILE 'datafile_full_path_name' SIZE required_size;
 
6. Recreate all the previously existing objects in the tablespace.
 
This can be done using the creation scripts for the objects in that tablespace or using the recent export dump available for that tablespace objects.
 
 
B. Recover the datafile using normal recovery procedures.
 
This option is best suited for READ ONLY tablespaces and for USERS, INDEX tablespaces where recreating is not a feasible option.
 
If the tablespace is of type UNDO, then this is the method to be used if the database was not SHUTDOWN CLEANLY
(that is, if SHUTDOWN ABORT had been used or the database had crashed).
 
If the tablespace is SYSTEM, then this is the recommended method, provided backups and archivelogs are available. If the database is in
NOARCHIVELOG mode, then you can recover only if the required changes are present in the ONLINE redologs.
 
In many situations, recreating the user tablespace is impossible or too laborious. The solution then is to restore the lost datafile from a backup
and do media recovery on it. If the database is in NOARCHIVELOG mode, you will only succeed in recovering the datafile if the redo to be applied
to the datafile is within the range of the online logs.
 
This method would be ideal for READ ONLY tablespaces. If the tablespace was not switched to READ-WRITE after backup was taken and if the tablespace was
READ ONLY at the time of backup, then recovery is just restoring the backup of this tablespace.
 
These are the steps:
 
1. Restore the lost file from a backup.
 
2. If the database is down, mount it. 
 
STARTUP MOUNT;
 
3. Issue the following query:
 
SELECT V1.GROUP#, MEMBER, SEQUENCE#, FIRST_CHANGE#
FROM V$LOG V1, V$LOGFILE V2
WHERE V1.GROUP# = V2.GROUP#;
 
This will list all your online redolog files and their respective sequence and first change numbers.
 
4. If the database is in NOARCHIVELOG mode, issue the query:
 
SELECT FILE#, CHANGE# FROM V$RECOVER_FILE;
 
If the CHANGE# is GREATER than the minimum FIRST_CHANGE# of your logs, the datafile can be recovered. Just keep in mind that all the logs to be
applied will be online logs, and move on to step 5.
 
If the CHANGE# is LESS than the minimum FIRST_CHANGE# of your logs, the file cannot be recovered. Your options at this point would be to restore
the most recent full backup (and thus lose all changes to the database since) or recreate the tablespace as explained in scenario a.
 
 
5. Recover the datafile: 
 
RECOVER DATAFILE 'full_path_file_name' ;
 
6. Confirm each of the logs that you are prompted for until you receive the message "Media Recovery Complete". If you are prompted for a non-existing
archived log, Oracle probably needs one or more of the online logs to proceed with the recovery. Compare the sequence number referenced in the
ORA-280 message with the sequence numbers of your online logs. Then enter the full path name of one of the members of the redo group whose sequence
number matches the one you are being asked for. Keep entering online logs as requested until you receive the message "Media Recovery Complete" .
 
7. If the database is at mount, open it:
 
ALTER DATABASE OPEN;
 
Operating Systems (OS) Tempfiles missing:
 
When using TEMPORARY tablespaces with tempfiles, the absence of the tempfile at the OS level can cause ORA-1157. Since Oracle does not checkpoint tempfiles, the database can be opened even with missing tempfiles.
 
The solution in this case would be to drop the logical tempfile and add a new one.
 
For example:
 
select * from dba_objects order by object_name;
*
ERROR at line 1:
 
ORA-01157: cannot identify/lock data file 1026 - see DBWR trace file
ORA-01110: data file 1026: '/<disk_path>/temp2_01.tmp'
Solution:
 
alter database tempfile '/<disk_path>/temp2_01.tmp' drop;
 
select tablespace_name, file_name from dba_temp_files;
 
alter tablespace temp2 add tempfile '/<disk_path>/temp2_01.tmp' size 5m;
 
 
 
ORA-1157 due to OS issues/3rd party software
1. When trying to access Quick I/O files with vxfddstat or other applications, you may get an error message similar to "Cannot open file".
 
Oracle may return an error message similar to:
 
 
ORA-01157: cannot identify data file 1 - file not found
ORA-01110: data file 1: '/<disk_path>/system01.dbf'
The users need to contact Veritas support in this case. To access their support site, point your web browser to:
 
 
Click on: 'Product Listing'
Click on: 'File System for UNIX'
Enter 'Oracle' and click 'Search' to view relevant information from their Knowledge Base.
 
2. It is possible to get this error on HP if the kernel parameter nflocks is not set high enough. This might prevent Oracle from locking the required datafiles. Controlfile recreation might fail with ORA-27041 and ORA-1157 for the same reasons.
 
There may be more errors in the trace files in dump directory such as :
 
ORA-27086: skgfglk: unable to lock file - already in use
OR
 
ORA-01157: cannot identify/lock data file 263 - see DBWR trace file
ORA-01110: data file 263: '/<disk_path>/system01.dbf'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
OR
 
 
ORA-07445: exception encountered: core dump [%s] [%s] [%s] [%s] [%s] [%s]
ORA-01110: data file %s: '%s'
ORA-01242: data file suffered media failure: database in NOARCHIVELOG mode
ORA-01115: IO error reading block from file %s (block # %s)
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 3
To resolve these issues, increase the relevant kernel parameters on HP. The recommended settings would be:
 
nproc 4096 Max Number of Processes
nfile 63488 Max Number of Open Files
nflocks 4096 Max Number of File Locks
 
Refer to the OS Installation Guide for more information on these parameters.
 
To resolve the ORA-27041 'File table overflow' on Solaris, experience shows the system settings below resolved the issue:
 
/etc/system parameters
set vxfs:vxfs_ninode=692416
set ncsize=519312
 
Please note that tuning these parameters is outside the scope of Oracle Support.
For further info refer to Sun document available on sunsolve.sun.com (76671) which explains ncsize tuning.
And, tuning vxfs:vxfs_ninode is Veritas-specific, please post questions directly to Veritas.
 
3. It is possible to get ORA-1157, if the datafiles that Oracle requires are locked by some other process, for example a backup software might be locking
the files for backup.
 
On MS Windows, the user can get any of the following errors:
 
ORA-01157: signalled during alter database open
ORA-01157: can not identify datafile
ORA-01110: data file 10: '<disk_path>\index01.dbf'
ORA-27047: Unable to read header of file
OSD-04006: Read file failure
Error 33: process can not access file
The operating system error 33 is an error_lock_violation indicating that a portion of the data file is locked by another NT process.
 
OR
 
ORA-1157 - cannot identify datafile - file not found
ORA-1110 - datafile 11 '<disk_path>\index02.dbf'
ORA-9202 - sfifi: error identifying file
OSD-4006(OS 203) - The System could not find the environment option that was entered
 
 
Other error combinations that may show up in the alert log include:
 
ORA-1115 - IO error reading block from file %s (block # %s)
ORA-1110 - datafile 11 '<disk_path>\index02.dbf'
ORA-9206 - sfrfb: error reading from file
OSD-4006(OS 203) - The System Could not find the environment option that was entered
 
 
OR
 
ORA-1242 - data file suffered media failure: database in NOARCHIVELOG mode
ORA-1114 - IO error writing block to file <name> block #
ORA-9205 - sfqio: error reading or writing to disk
OSD-4016(OS 33) - The process cannot access the file because another process has locked a portion of the file.
 
Additionally, the following errors will appear:
 
KCF: write/open error dba=0x703473d block=0x3473d online=1
file=7 /<disk_path>/users01.dbf
error=9211 txt: 'OSD-4008 : WriteFile error (OS 203) - The System Could not find the environment option that was entered
 
In some cases, the alert log may also show errors such as:
 
Instance terminating due to error 1110.
Instance terminated by background process PID=<pid>
 
OR
 
background process TERMINATING INSTANCE DUE TO ERROR 472
 
ORA 472 - PMON process terminated with error
 
The following events may also be reported in the Microsoft Windows event viewer:
 
23 Error ReadFile() failure.
25 Error WriteFile() failure.
 
If this is a cold backup, wait until the backup is done and then
start up the database, or end the backup and start up the database.
 
Alternatively, the solution is that the backup software should be configured
such that it does not lock open files.
 
Refer to the BACKUP software documentation for information on how to do this.
 
For example, the Seagate Backup Exec Software, should be configured with
'Read and Not lock' option to take online backups.
 
This is true on some Unix platforms also.
For example, the IBM AIX ADSM backup utility can fall over before a successful run was performed, thereby holding the lock on Oracle datafiles.
 
The solution in this case would be to clear the lock on the datafile manually.
 
i. Run $ ps -ef | grep <SID> - look for an existing process on a datafile
 
ii. Do $ kill -9 on the process id
 
 
 
4. ORA-1157 is possible if the Oracle datafiles were copied to a directory using the File Manager on Windows.
This is true if the filenames are longer than the usual 8.3 format,
that is, file names longer than 8 characters or extensions longer than 3 characters.
 
To avoid this problem when copying files on Windows 95/98, DO NOT use File Manager. If the files have long file names (e.g. longer than the
standard 8.3 file name convention), File Manager will rename the files with an 8.3 file name.
 
To preserve long filenames, Explorer should be used to copy files. If File Manager was already used and the files have a tilde (~) in the file name, the
files need to be renamed to their original names.
 
5. ORA-1157 is possible when using a Network Appliance. Network Appliance acquires locks on the datafiles for certain operations.
These locks may be retained by the Network Appliance due to an instance or host machine failure. In these cases, the locks must be manually released from the Network
Appliance by the System Administrator.
 
The command for this would be:
 
As root on the netapp, from the prompt:
 
rc_toggle_basic
 
sm_mon -l <hostname>
 
The hostname being referenced in the above command should be the machine name where the Oracle instance is running from.
 
6. ORA-1157 is possible if Oracle files were restored as a different user.
 
After restoring a datafile and issuing RECOVER DATAFILE, the error ORA-1157 (cannot identify datafile - file not found) can occur. Oracle does not appear
to recognize the restored datafile despite the fact that:
 
- The datafile exists at the OS.
- select * from v$datafile shows the correct path for the datafile.
- "alter system check datafiles" is successful.
- backup control file to trace shows full path to datafile is correct.
 
In this case, it could be a permissions problem at the OS level.
 
Check the permissions of the datafiles. When the datafile was restored, it may have been done by a user other than Oracle, such as root.
In this case, the file can be seen by Oracle, but not accessed: Oracle cannot read and write the datafile because it does not own the file.
Changing the ownership of the file to the Oracle user will allow the recovery to succeed.
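
A quick pre-check along these lines can save a failed recovery attempt (a sketch; the helper name and demo path are made up):

```shell
# Hypothetical helper: confirm the current OS user can both read and
# write a restored datafile before attempting RECOVER DATAFILE on it.
check_datafile_access() {
  f="$1"
  if [ -r "$f" ] && [ -w "$f" ]; then
    echo "OK: $f is readable and writable"
  else
    echo "FAIL: $f - fix ownership, e.g.: chown oracle:oinstall $f"
  fi
}

# Demo on a scratch file standing in for the restored datafile
touch /tmp/demo_users02.dbf
check_datafile_access /tmp/demo_users02.dbf   # prints the OK line
rm -f /tmp/demo_users02.dbf
```

Run it as the oracle OS user so the check reflects the permissions Oracle itself will see.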
 
 
ORA-1157 - Other possibilities
1. Corrupt controlfile can cause ORA-1157
 
Users can get ORA-1157 in some extreme cases, where, the controlfile becomes corrupt.
 
One possible type of corruption that can result in these errors is having a trailing blank after the filename in the control file. This can be seen by
examining a text version of the control file produced using the 'ALTER DATABASE BACKUP CONTROLFILE TO TRACE' command.
(The trace file will be placed in the location specified by the user_dump_dest initialization parameter).
 
Example:
--------
 
'/<disk_path>/index1.dbf ' -- corrupt
'/<disk_path>/index02.dbf' -- non-corrupt
 
In this case, the solution would be to try using the other controlfile copies (if they are not corrupt) by changing the CONTROL_FILES parameter in
the init.ora to exclude the corrupt control file. If all the control files are corrupt, then the controlfile can be recreated.
It is also possible to have other 'unexpected' control characters embedded in the control file.
 
For example:
 
"^J" may be present in the datafile name.
 
The above solution will hold good in this case also.
 
 
2. Check if any tablespace/datafile was added on the primary after the standby was set up.
 
If so, and the datafiles were not automatically created on the standby site, create the datafile(s) on the standby database manually. Once the files exist, recovery can continue.
 
For example:
 
The redo does not create a new datafile for you. The create datafile command, from startup mount, is:
 
alter database create datafile 'datafile_full_path_name';
 
 
3. RMAN restore can produce 'fake' ORA-1157 errors in the alert.log
 
During an RMAN restore operation, it is possible to encounter the following error(s) in the alert log:
 
ORA-01157: cannot identify/lock data file N - see DBWR trace file
ORA-01110: data file N: 'filename'
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory
This problem occurs if RMAN is restoring datafiles that have been removed from disk prior to the restore operation.
 
These errors (which occur multiple times for each affected datafile) can be a source of concern and confusion for DBAs. However, they are entirely
expected and, apart from monitoring the size of the alert log (and archiving it if necessary), are nothing to worry about.
 
4. ORA-1157 with newly installed seed database.
 
If a seed database was used, Oracle will copy the datafiles from the CD to disk and try to start up the database. But if the datafiles on the CD are corrupt,
that is, if the CD is damaged, this can result in an ORA-1157 error.
 
The solution would be to use a new set of CDs or to create the database using manual creation scripts.

Troubleshoot Oracle clusterware Grid Infrastructure Startup Issues

PURPOSE
 
This note is to provide reference to troubleshoot 11gR2 and 12c Grid Infrastructure clusterware startup issues. It applies to issues in both new environments (during root.sh or rootupgrade.sh) and unhealthy existing environments.  To look specifically at root.sh issues, see note 1053970.1 for more information. 
 
SCOPE
This document is intended for Clusterware/RAC Database Administrators and Oracle support engineers. 
 
DETAILS
Start up sequence:
In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd asm etc), and crsd starts agents that start user resources (database, SCAN, listener etc).
 
For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1
 
 
Cluster status
 
To find out cluster and daemon status:
 
$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
 
$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1
 
For 11.2.0.2 and above, there will be two more processes:
 
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1
For 11.2.0.3 onward in non-Exadata, ora.diskmon will be offline:
 
ora.diskmon
      1        OFFLINE  OFFLINE       rac1
 
For 12c onward, ora.storage is introduced: 
 
ora.storage
1 ONLINE ONLINE racnode1 STABLE
 
 
 
To start an offline daemon - if ora.crsd is OFFLINE:
 
$GRID_HOME/bin/crsctl start res ora.crsd -init
 
 
Case 1: OHASD does not start
 
As ohasd.bin is responsible for starting up all other clusterware processes directly or indirectly, it needs to start up properly for the rest of the stack to come up. If ohasd.bin is not up, when checking its status, CRS-4639 (Could not contact Oracle High Availability Services) will be reported; if ohasd.bin is already up, CRS-4640 will be reported when another startup attempt is made; and if it fails to start, the following will be reported:
 
 
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
 
 
Automatic ohasd.bin start up depends on the following:
 
1. OS is at appropriate run level:
 
The OS needs to be at the specified run level before CRS will try to start up.
 
To find out at which run level the clusterware needs to come up:
 
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd will be configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up. Oracle Linux 7 (and Red Hat Linux 7) uses systemd to manage start/stop services (example: /etc/systemd/system/oracle-ohasd.service).
 
The above example shows that CRS is supposed to run at run levels 3 and 5; please note that, depending on the platform, CRS comes up at different run levels.
 
To find out current run level:
 
who -r
 
 
2. "init.ohasd run" is up
 
On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, the init process (pid 1, /sbin/init on Linux, Solaris and HP-UX, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:
 
 
ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd will be configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up.
 
If any rc Snn script (located in rcn.d, for example S98gcstartup) is stuck, the init process may not start "/etc/init.d/init.ohasd run"; please engage the OS vendor to find out why the relevant Snn script is stuck.
 
The error "[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." may be reported if init.ohasd fails to start on time.
 
If the SA cannot identify the reason why init.ohasd is not starting, the following can be used as a very short-term workaround:
 
 cd <location-of-init.ohasd>
 nohup ./init.ohasd run &
 
 
3. Clusterware auto start is enabled - it's enabled by default
 
By default CRS is enabled for auto start upon node reboot; to enable it:
 
$GRID_HOME/bin/crsctl enable crs
 
To verify whether it's currently enabled or not:
 
$GRID_HOME/bin/crsctl config crs
 
If the following is in OS messages file
 
Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr
 
The reason is that the file does not exist or is not accessible; the cause can be that someone modified it manually, or that the wrong opatch was used to apply a GI patch (i.e. opatch for Solaris x64 used to apply a patch on Linux).
 
 
 
 
4. syslogd is up and OS is able to execute init script S96ohasd
 
The OS may get stuck with some other Snn script while the node is coming up, and thus never get the chance to execute S96ohasd; if that's the case, the following message will not be in the OS messages file:
 
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
 
If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.
 
To find out whether the OS is able to execute S96ohasd while the node is coming up, modify S96ohasd:
 
From:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
To:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/touch /tmp/ohasd.start."`date`"
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
After a node reboot, if you don't see /tmp/ohasd.start.timestamp get created, it means the OS got stuck with some other Snn script. If you do see /tmp/ohasd.start.timestamp but not "Oracle HA daemon is enabled for autostart" in messages, likely syslogd is not fully up. In both cases, you will need to engage the System Administrator to find the issue at OS level. For the latter case, a workaround is to "sleep" for about 2 minutes; modify ohasd:
 
From:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
To:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/sleep 120
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
5. The file system on which GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd has executed, the following messages should be in the OS messages file:
 
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"
 
 
If you see the first line, but not the last line, likely the filesystem containing the GRID_HOME was not online when S96ohasd was executed.
 
 
6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid
 
 
ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr
 
If the OLR is inaccessible or corrupted, likely ohasd.log will have similar messages like following:
 
 
..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR
 
OR
 
 
..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR
 
OR
 
 
..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user
 
OR
 
 
ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resources, and "ocrconfig -local -manualbackup" fails
 
OR
 
 
..
2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool
 
------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED
 
2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.
 
 
The solution is to restore a good backup of OLR with "ocrconfig -local -restore <ocr_backup_name>".
By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
 
7. ohasd.bin is able to access network socket files:
 
 
2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))
 
2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state
 
In a Grid Infrastructure cluster environment, ohasd related socket files should be owned by root, but in an Oracle Restart environment they should be owned by the grid user; refer to the "Network Socket File Location, Ownership and Permission" section for example output.
 
8. ohasd.bin is able to access log file location:
 
OS messages/syslog shows:
 
Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.
 
Refer to "Log File Location, Ownership and Permission" section for example output, if the expected directory is missing, create it with proper ownership and permission.
 
9. ohasd may fail to start on SUSE Linux after a node reboot; refer to note 1325718.1 - OHASD not Starting After Reboot on SLES
 
10. OHASD fails to start: "ps -ef| grep ohasd.bin" shows ohasd.bin is started, but nothing is written to $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes, and truss shows it is looping, closing file handles that were never opened:
 
 
..
15058/1:         0.1995 close(2147483646)                               Err#9 EBADF
15058/1:         0.1996 close(2147483645)                               Err#9 EBADF
..
 
Call stack of ohasd.bin from pstack shows the following:
 
_close  sclssutl_closefiledescriptors  main ..
 
The cause is bug 11834289, which is fixed in 11.2.0.3 and above. Another symptom of the bug is that clusterware processes may fail to start with the same call stack and truss output (looping on the OS call "close"). If the bug is hit while starting other resources, "CRS-5802: Unable to start the agent process" can show up as well.
 
11. Other potential causes/solutions listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device
 
12. ohasd.bin starts fine; however, "crsctl check crs" shows only the following and nothing else:
 
CRS-4638: Oracle High Availability Services is online
And "crsctl stat res -p -init" shows nothing
 
The cause is a corrupted OLR; refer to note 1193643.1 to restore it.
 
13. On EL7/OL7: note 1959008.1 - Install of Clusterware fails while running root.sh on OL7 - ohasd fails to start 
 
14. For EL7/OL7, patch 25606616 is needed: TRACKING BUG TO PROVIDE GI FIXES FOR OL7
 
15. If ohasd still fails to start, refer to ohasd.log in <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log
 
 
 
 
Case 2: OHASD Agents do not start
 
OHASD.BIN will spawn four agents/monitors to start resources:
 
  oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
  orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
  cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)
 
If ohasd.bin cannot start any of the above agents properly, the clusterware will not come to a healthy state.
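Whether the four agent/monitor processes above are running can be checked quickly. A minimal sketch (process names are taken from the list above; "pgrep -f" matches against the full command line):

```shell
# Sketch: report which ohasd-spawned agent processes are running.
check_ohasd_agents() {
    for p in oraagent.bin orarootagent.bin cssdagent cssdmonitor; do
        if pgrep -f "$p" >/dev/null 2>&1; then
            echo "$p: running"
        else
            echo "$p: NOT running"
        fi
    done
}
```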
 
1. A common cause of agent failure is that the log file or log directory for the agents does not have the proper ownership or permission.
 
Refer to below section "Log File Location, Ownership and Permission" for general reference.
 
One example: "rootcrs.pl -patch/-postpatch" was not executed during manual patching, resulting in agent start failure: 
 
2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/orarootagent]
 
2015-02-25 15:43:54.382154 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
 
2015-02-25 15:43:54.384105 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]
 
 
The solution is to execute the missed steps.
 
 
 
2. If an agent binary (oraagent.bin, orarootagent.bin etc.) is corrupted, the agent will not start, and the related resources will not come up:
 
2011-05-03 11:11:13.189
[ohasd(25303)]CRS-5828:Could not start agent '/ocw/grid/bin/orarootagent_grid'. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.
 
 
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of 'ora.diskmon' on 'racnode1' failed
 
..
 
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed
..
2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]
 
The solution is to compare the agent binary with the one on a "good" node and restore a good copy.
 
truss/strace of ohasd shows the agent binary is corrupted:
32555 17:38:15.953355 execve("/ocw/grid/bin/orarootagent.bin",
["/opt/grid/product/112020/grid/bi"...], [/* 38 vars */]) = 0
..
32555 17:38:15.954151 --- SIGBUS (Bus error) @ 0 (0) ---  
 
3. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.
 
4. Refer to: note 1964240.1 - CRS-5823:Could not initialize agent framework
 
 
 
Case 3: OCSSD.BIN does not start
 
Successful cssd.bin startup depends on the following:
 
1. GPnP profile is accessible - gpnpd needs to be fully up to serve profile
 
If ocssd.bin gets the profile successfully, ocssd.log will likely have messages similar to the following:
 
2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""
 
Otherwise, messages like the following will show in ocssd.log:
 
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)
The solution is to ensure gpnpd is up and running properly.
 
 
2. Voting Disk is accessible
 
In 11gR2, ocssd.bin discovers voting disks using the settings in the GPnP profile; if not enough voting disks can be identified, ocssd.bin aborts itself.
 
2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread
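The "Successful discovery of 0 disks" above means nothing matched the discovery string. The count can be emulated with a simple glob check. A minimal sketch (the pattern in the usage example is illustrative):

```shell
# Sketch: count files matching a voting-file discovery string.
# $1 = discovery pattern, e.g. "/share/storage/di*" (illustrative).
count_votedisks() {
    n=0
    for f in $1; do            # unquoted on purpose: let the shell expand the glob
        [ -e "$f" ] && n=$((n+1))
    done
    echo "$n"
}

# Example: count_votedisks "/share/storage/di*"   # 0 reproduces the log above
```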
 
ocssd.bin may fail to come up with the following error if all nodes went down while a voting file change was in progress:
 
2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0
 
The solution is to start ocssd.bin in exclusive mode following note 1364971.1.
 
 
If the voting disk is located on a non-ASM device, ownership and permissions should be:
 
-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1
 
3. Network is functional and name resolution is working:
 
If ocssd.bin can't bind to any network, ocssd.log will likely have messages like the following:
 
2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp
 
 
If there is a connectivity issue on the private network (including multicast being off), ocssd.log will likely have messages like the following:
 
2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..  >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
 
To validate the network, please refer to note 1054902.1
If CSSD could not start after a network change, please also check that the network interface name matches the GPnP profile definition ("gpnptool get") for cluster_interconnect.
 
In 11.2.0.1, ocssd.bin may bind to the public network if the private network is unavailable.
 
4. Vendor clusterware is up (if using vendor clusterware)
 
Grid Infrastructure provides full clusterware functionality and does not require vendor clusterware; however, if you happen to have Grid Infrastructure on top of vendor clusterware in your environment, the vendor clusterware needs to come up fully before CRS can be started. To verify, as the grid user:
 
$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode2    0
 
If the vendor clusterware is not fully up, ocssd.log will likely have messages similar to the following:
 
2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon
 
Before the clusterware is installed, execute the command below as grid user:
 
$INSTALL_SOURCE/install/lsnodes -v
 
 
One issue on hp-ux: note 2130230.1 - Grid infrastructure startup fails due to vendor Clusterware did not start (HP-UX Service guard)
 
 
 
5. Command "crsctl" being executed from wrong GRID_HOME
 
Command "crsctl" must be executed from the correct GRID_HOME to start the stack; otherwise, messages similar to the following will be reported:
 
2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread#
 
 
Case 4: CRSD.BIN does not start
If "crsctl stat res -t -init" shows that ora.crsd is in intermediate state, and this is not the first node where crsd is starting, then a likely cause is that the local crsd.bin is not able to talk to the master crsd.bin.
In this case, the master crsd.bin is likely having a problem, so killing the master crsd.bin is a likely solution. 
Issue "grep MASTER crsd.trc" to find out the node where the master crsd.bin is running, then kill the crsd.bin on that master node.
The killed crsd.bin will respawn automatically, and the master role will be transferred to a crsd.bin on another node.
 
 
Successful crsd.bin startup depends on the following:
 
1. ocssd is fully up
 
If ocssd.bin is not fully up, crsd.log will show messages like following:
 
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
 
 
2. OCR is accessible
 
If the OCR is located on ASM, the ora.asm resource (ASM instance) must be up and the diskgroup for the OCR must be mounted; if not, crsd.log will likely show messages like:
 
2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
 
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26
 
Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.
 
If the OCR is located on a non-ASM device, expected ownership and permissions are:
 
-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr
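The mode can be compared against the expectation above with `stat`. A minimal sketch, assuming GNU coreutils (the path in the usage example is illustrative):

```shell
# Sketch: check a storage file's numeric mode against the expected
# value (640 corresponds to -rw-r----- above). Uses GNU "stat -c".
check_mode() {
    # $1 = file, $2 = expected octal mode
    mode=$(stat -c '%a' "$1" 2>/dev/null) || { echo "UNREADABLE: $1"; return 1; }
    if [ "$mode" = "$2" ]; then
        echo "OK: $1 mode $mode"
    else
        echo "BAD: $1 mode $mode, expected $2"
    fi
}

# Example: check_mode /share/storage/ocr 640
```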
 
If the OCR is located on a non-ASM device and the device is unavailable, crsd.log will likely show messages similar to the following:
 
2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26
 
 
If the OCR is corrupted, likely crsd.log will show messages like the following:
 
2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26
 
 
If the owner or group of the grid user got changed, crsd.log will likely show the following even though ASM is available:
 
2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
 
2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]
 
 
If the oracle binary in GRID_HOME has the wrong ownership or permission (regardless of whether ASM is up and running), or if the grid user cannot write in ORACLE_BASE, crsd.log will likely show the following:
 
2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR]
[  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge
 
2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact
 
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7]
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable
2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326]
2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage.
2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device
2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26]
2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact
 
2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact
 
2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26
 
The expected ownership and permission of oracle binary in GRID_HOME should be:
 
-rwsr-s--x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle
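The setuid/setgid bits in the listing above are easy to lose after a manual copy of the binary. A minimal check sketch (the path in the usage example is illustrative):

```shell
# Sketch: verify the setuid bit on the oracle binary (the first
# permission triplet should read "rws", as in -rwsr-s--x above).
has_setuid() {
    perms=$(ls -l "$1" 2>/dev/null | awk '{print $1}')
    case "$perms" in
        -rws*) echo "setuid OK: $perms" ;;
        "")    echo "NOT FOUND: $1" ;;
        *)     echo "setuid MISSING: $perms" ;;
    esac
}

# Example: has_setuid /ocw/grid/bin/oracle
# -rwsr-s--x corresponds to octal mode 6751.
```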
 
If the OCR or its mirror is unavailable (for example, ASM is up but the diskgroup for the OCR/mirror is unmounted), crsd.log will likely show the following:
 
2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...
 
 
3. crsd.bin pid file exists and points to running crsd.bin process
 
If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:
 
 
2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai
 
The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually, as the grid user, with the "touch" command, and then restart the resource ora.crsd.
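The "touch" step can be expressed as a small helper. A minimal sketch (the GRID_HOME path and host name in the usage example are illustrative):

```shell
# Sketch: recreate the missing CRSD pid file, then restart ora.crsd.
# $1 = GRID_HOME, $2 = host name (illustrative assumptions).
recreate_crsd_pidfile() {
    touch "$1/crs/init/$2.pid" && echo "created $1/crs/init/$2.pid"
}

# Example (touch as the grid user, crsctl as root):
#   recreate_crsd_pidfile /ocw/grid racnode1
#   /ocw/grid/bin/crsctl stop  res ora.crsd -init
#   /ocw/grid/bin/crsctl start res ora.crsd -init
```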
 
If the pid file does exist, but the PID in it references a running process which is NOT the crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:
 
2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535
 
 
To verify on OS level:
 
ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> Note process 1535 is not crsd.bin
 
The solution is to create an empty pid file and to restart the resource ora.crsd, as root:
 
 
# > $GRID_HOME/crs/init/racnode1.pid
# $GRID_HOME/bin/crsctl stop res ora.crsd -init
# $GRID_HOME/bin/crsctl start res ora.crsd -init
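The stale-pid check above can be scripted. A minimal sketch (the pid-file path and daemon name in the usage example are illustrative):

```shell
# Sketch: does the PID in a pid file belong to the expected daemon?
# $1 = pid file, $2 = expected process name, e.g. crsd.bin.
pid_matches() {
    [ -s "$1" ] || { echo "EMPTY/MISSING pid file: $1"; return 1; }
    pid=$(cat "$1")
    comm=$(ps -p "$pid" -o comm= 2>/dev/null)
    if [ -z "$comm" ]; then
        echo "pid $pid not running"
    elif [ "$comm" = "$2" ]; then
        echo "MATCH: pid $pid is $2"
    else
        echo "MISMATCH: pid $pid is $comm, not $2"   # e.g. iscsid in the case above
    fi
}

# Example: pid_matches /ocw/grid/crs/init/racnode1.pid crsd.bin
```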
 
 
4. Network is functional and name resolution is working:
 
If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:
 
 
2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44
 
Or:
 
 
2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!
 
Or:
 
 
2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)
 
To validate the network, please refer to note 1054902.1
 
5. The crsd executables (crsd.bin and crsd in GRID_HOME/bin) have the correct ownership/permission and have not been manually modified; a simple way to check is to compare the output of "ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin" with a "good" node.
 
 
6. crsd may not start due to the following:
 
note 1552472.1 -CRSD Will Not Start Following a Node Reboot: crsd.log reports: clsclisten: op 65 failed and/or Unable to get E2E port
note 1684332.1 - GI crsd Fails to Start: clsclisten: op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)
 
 
 
7. To troubleshoot further, refer to note 1323698.1 - Troubleshooting CRSD Start up Issue
 
 
Case 5: GPNPD.BIN does not start
1. Name Resolution is not working
 
gpnpd.bin fails with following error in gpnpd.log:
 
 
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"
 
In the above example, make sure the current node is able to ping "node2" and that there is no firewall between them.
 
2. Bug 10105195
 
Due to Bug 10105195, gpnpd dispatch is single-threaded and can be blocked by network scanning etc. The bug is fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above; refer to note 10105195.8 for more details.
 
 
Case 6: Various other daemons do not start
Common causes:
 
1. Log file or directory for the daemon doesn't have appropriate ownership or permission
 
If the log file or log directory for the daemon doesn't have proper ownership or permissions, usually there is no new info in the log file and the timestamp remains the same while the daemon tries to come up.
 
Refer to below section "Log File Location, Ownership and Permission" for general reference.
 
 
2. Network socket file doesn't have appropriate ownership or permission
 
In this case, the daemon log will show messages like:
 
2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
 
2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
 
 
 
3. OLR is corrupted
 
In this case, the daemon log will show messages like the following (this example is a case where ora.ctssd fails to start):
 
2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting
 
 
 
The solution is to restore a good copy of the OLR; see note 1193643.1.
 
 
 
4.  Other cases:
 
note 1087521.1 - CTSS Daemon Aborting With "op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)" 
 
 
 
Case 7: CRSD Agents do not start
 
CRSD.BIN will spawn two agents to start up user resources; the two agents share the same names and binaries as the ohasd.bin agents:
 
  orarootagent: responsible for ora.netn.network, ora.nodename.vip, ora.scann.vip and  ora.gns
  oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resource etc
 
To find out the user resource status:
 
$GRID_HOME/bin/crsctl stat res -t
 
 
If crsd.bin cannot start any of the above agents properly, user resources may not come up. 
 
1. A common cause of agent failure is that the log file or log directory for the agents does not have proper ownership or permissions.
 
Refer to below section "Log File Location, Ownership and Permission" for general reference.
 
2. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.
 
 
Case 8: HAIP does not start
HAIP may fail to start with various errors, e.g.: 
 
[ohasd(891)]CRS-2807:Resource 'ora.cluster_interconnect.haip' failed to start automatically.
Refer to note 1210883.1 for more details of HAIP 
 
Network and Naming Resolution Verification
 
CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.
 
To validate network and name resolution setup, please refer to note 1054902.1
 
 
Log File Location, Ownership and Permission
 
Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.
 
In Grid Infrastructure cluster environment:
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owners rdbmsap and rdbmsar, here is what it looks like under $GRID_HOME/log in a cluster environment:
 
 
drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root    
    -rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd     
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd     
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd      
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd    
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd    
    drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd     
    drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg       
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm        
Please note that most log files in a sub-directory inherit the ownership of the parent directory; the listing above is only a general reference for spotting unexpected recursive ownership or permission changes inside the CRS home. If you have a working node of the same version, use it as the reference.
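One practical way to use a working node as the reference is to snapshot the ownership and permissions on each node and diff the two files; a minimal sketch, demonstrated here on a scratch directory (point LOGDIR at $GRID_HOME/log for a real run):

```shell
# Snapshot mode/owner/group of every entry under the log tree.
LOGDIR=${LOGDIR:-/tmp/demo_grid_log}   # use $GRID_HOME/log in practice
mkdir -p "$LOGDIR/rac1/agent/crsd"     # scratch tree for illustration only
find "$LOGDIR" -printf '%M %u %g %p\n' | sort -k4 > "/tmp/log_perms_$(hostname -s).txt"
# Repeat on the working node, then compare:
# diff /tmp/log_perms_rac1.txt /tmp/log_perms_rac2.txt
```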
 
 
In Oracle Restart environment:
Here's what it looks like under $GRID_HOME/log:
 
drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
  drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs
  drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag
  drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 admin
    drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent
      drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
      drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
        drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
    -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Nov  2  2009 client
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 crsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 cssd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 ctssd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 diskmon
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 evmd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gipcd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 gnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gpnpd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 mdnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 ohasd
    drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 srvm
 
 
For 12.1.0.2 onward, refer to note 1915729.1 - Oracle Clusterware Diagnostic and Alert Log Moved to ADR
 
 
 
Network Socket File Location, Ownership and Permission
 
Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle
 
When a socket file has unexpected ownership or permissions, the daemon log file (e.g. evmd.log) will usually contain the following:
 
 
2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))
 
2011-06-18 14:07:28.545: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lexxxDBG_EVMD))
2011-06-18 14:07:28.545: [  clsdmt][515]Terminating process
2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai
 
 
And the following error may be reported:
 
 
CRS-5017: The resource action "ora.evmd start" encountered the following error:
CRS-2674: Start of 'ora.evmd' on 'racnode1' failed
..
 
The solution is to stop GI as root (crsctl stop crs -f), clean up the socket files, and restart GI.
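A hedged sketch of that sequence (run as root on the affected node; only remove the .oracle directories that actually exist on your platform):

```shell
# Stop the entire GI stack forcefully, remove stale socket files, restart.
$GRID_HOME/bin/crsctl stop crs -f
rm -rf /tmp/.oracle /var/tmp/.oracle /usr/tmp/.oracle
$GRID_HOME/bin/crsctl start crs
```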
 
 
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and clustername eotcs
 
In Grid Infrastructure cluster environment:
Below is an example output from cluster environment:
 
 
drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle
 
./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth
 
 
In Oracle Restart environment:
And below is an example output from Oracle Restart environment:
 
 
drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle
 
./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug  1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31  2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL
 
 
 
Diagnostic file collection
 
If the issue can't be identified with this note, please run $GRID_HOME/bin/diagcollection.sh as root on all nodes, and upload all .gz files it generates in the current directory.
 

 

Is it possible to access an Oracle ASM diskgroup if the underlying disk partition table is corrupted?

We had an issue with one of the RAC nodes going down, and it was reported that a missing LUN/partition table caused it. The ASM, database and CRS logs from the time of the issue were analyzed; the analysis does not reveal any evidence pointing to a missing partition table or even block corruption. In fact, the database was up during the said period. Further, the team analyzed the OS logs covering the time of the issue and beyond, but did not find evidence that the partition table was missing or damaged during that period.
 
So, we would like to know that,
 
a) Can ASM work for days with partition table on cache while partition table on disk is damaged?
b) What is periodic update/syncing protocol between ASM cache and disk for partition information?
 
We would appreciate your looking into the queries above and responding back, so we can take the analysis further, find the RCA, and take preventive measures to avoid the same in future.
 
 
 
 
 
ASM can continue to function when the ASM disk header on disk is corrupted or lost. ASM keeps the ASM disk header in the (ASM instance) cache, not the disk partition table.
 
If the disk partition table is lost/damaged/deleted/modified, the ASM disk header may or may not be affected. For example, if the disk partition table is deleted, that would not affect the ASM disk header at all. ASM disk group would stay mounted and all would seem to be fine. You would be able to dismount the disk group just fine. But the next time you try to mount the disk group, ASM will not be able to find the disk header as it will look at the beginning of the disk. The disk header, and in fact all other data, including ASM metadata, will be there, but ASM will not be able to find it. If this was the only thing that happened, i.e. lost or deleted disk partition table, the fix is very easy. Just recreate the partition table on that disk.
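Since recreating the partition table is only easy if the original layout is known, it is worth keeping a dump of it; a minimal sketch using sfdisk (the device name and backup path are illustrative):

```shell
# Dump the partition table of the device backing an ASM disk (run as root).
sfdisk -d /dev/sdb > /root/sdb_partition_table.txt   # /dev/sdb is illustrative
# If the partition table is later deleted, restore it with:
# sfdisk /dev/sdb < /root/sdb_partition_table.txt
```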
 
But if the partition table and ASM disk header are in fact damaged, say someone overwrote the first megabyte of the disk, then again ASM disk group will stay mounted and all would appear to be fine. That is because ASM keeps the ASM disk header in the cache. But if you were to dismount the disk group, you would not be able to mount it again as the ASM metadata is now truly lost.
 
 
 
 
"What happens when we insert data?" Remember also that the ASM instance only keeps track of where database data file extents are stored and does not take part in actual database IO. Each database instance caches the ASM file information it gets and uses this information to perform the IO directly to the data files. Everything would work until such time as ASM was unable to provide the database with needed file information.
 

Generic check for error Oracle ASM ORA-15042

Error ORA-15042 means ASM cannot mount the diskgroup because some of its member disks are not visible to ASM.
 
 
 
How to find which disk is missing:
 
-----------------------------------------------------
 
- Scan the ASM alert log for the last successful mount of that diskgroup, where it lists the disks with which the group was mounted. Search for the keyword below:
 
SUCCESS: diskgroup <diskgroup name> was mounted
 
 
 
- From that point up to the time of the issue, scan the ASM alert log for disks that were added or dropped.
 
- Finally, make a list of the path and name of every disk that was a member of the diskgroup before the issue occurred.
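The scans above can be scripted with grep; a minimal sketch, run here against a tiny synthetic alert-log fragment for illustration (point ALERT at your real alert_+ASM<n>.log):

```shell
# Synthetic fragment standing in for the real ASM alert log.
ALERT=/tmp/alert_sample.log
cat > "$ALERT" <<'EOF'
NOTE: Assigning number (1,0) to disk (ORCL:DATA01P)
SUCCESS: diskgroup DATA was mounted
NOTE: cache opening disk 0 of grp 1: DATA01P label:DATA01P
EOF
# List the successful mount plus disk assignment/open/drop activity.
grep -nE 'SUCCESS: diskgroup DATA was mounted|Assigning number|cache opening disk|dropping disk' "$ALERT"
```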
 
 
 
Diagnosis:
 
----------------
 
- Check that all disks have the correct permissions (660) and ownership. Owner/group should be grid:asmadmin (asmadmin is SS_ASM_GRP in $GRID_HOME/rdbms/lib/config.c).
 
- Check that the disks are accessible as the grid OS user, for example:
 
dd if=<path to asm disk> of=/tmp/<devicename>.dd bs=1048576 count=10
 
- If multipath is in use, check that the multipath base devices are presented correctly.
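For the multipath check, the multipath maps and the state of their base paths can be listed directly; a hedged sketch (requires device-mapper-multipath and root):

```shell
# Show every multipath map with its underlying paths and their states.
multipath -ll
# Flag any failed or faulty paths.
multipath -ll | grep -iE 'failed|faulty' && echo "bad paths found"
```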
 
 
 
Possible cause:
 
----------------------------
 
1. Disk permissions are not correct.
 
2. The disk accessed by ASM is available, but its base device is not available.
 
3. A duplicate device is found for the same disk.
 
4. The partition of the disk was modified, so the ASM disk header is no longer located in the first 4K block at the start of the disk.
 
5. At least the first 4K block of the disk was erased externally, i.e. outside of Oracle.
 
6. In the case of ASMLIB, the multipath disks are visible but the ASMLIB disks are not.
 
 
 
Solution
 
----------------
 
1. Correct the permissions.
 
2. Ask your system administrator to present the base device.
 
3. Check with the system administrator why more than one device is showing up for the same ASM disk and being discovered by ASM.
 
4. Check with the system administrator to correct the partition table so that it points to the correct location of the ASM disk header.
 
5. If only the ASM disk header is corrupted/wiped out and the remaining ASM metadata is intact, it can be repaired; please raise an SR with Oracle Support. Otherwise you need to recreate the diskgroup.
 
6.
 
- Check that the label is present by running the command below:
 
kfed read /dev/oracleasm/disks/<label name> | grep kfdhdb.driver.provstr
 
- Check that the ASMLIB drivers are loaded.
 
- Check that the ASMLIB configuration file /etc/sysconfig/oracleasm has the proper entries.
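A hedged sketch of the three ASMLIB checks above (run as root on a node with ASMLib installed):

```shell
/etc/init.d/oracleasm status      # driver loaded and /dev/oracleasm mounted?
/etc/init.d/oracleasm listdisks   # labels ASMLib can see
cat /etc/sysconfig/oracleasm      # ORACLEASM_* configuration entries
```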
 
 
 
 
 
 Also, please check the following document:
 
 
 
ORA-15063 / ORA-15042 - TROUBLESHOOTING STEPS BEFORE OPENING a SR to Oracle Support (Doc ID 1535996.1)
 
 

KFED Reports “KFBTYP_INVALID” & OS Metadata [EFI PART] In Oracle ASMLIB Disk /ASM disk Member (ASM Disk Overlapping : Scenario #1).

APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 12.1.0.1 [Release 10.2 to 12.1]
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Information in this document applies to any platform.
SYMPTOMS
1) ASM diskgroup cannot be mounted due to the next error:
 
SQL> alter diskgroup reco mount;
alter diskgroup reco mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "2" is missing from group number "2"
 
 
 
CAUSE
1) Kfed reports:
 
[oracle@fcomtaep2 disks]$ kfed read ASMRECO03
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
7FC18D899400 00000000 00000000 00000000 00000000 [................]
  Repeat 27 times
7FC18D8995C0 FEEE0001 0001FFFF FFFF0000 00000FFF [................]
7FC18D8995D0 00000000 00000000 00000000 00000000 [................]
  Repeat 1 times
7FC18D8995F0 00000000 00000000 00000000 AA550000 [..............U.]
7FC18D899600 20494645 54524150 00010000 0000005C [EFI PART....\...]
7FC18D899610 BD82BBB3 00000000 00000001 00000000 [................]
7FC18D899620 0FFFFFFF 00000000 00000022 00000000 [........".......]
7FC18D899630 0FFFFFDE 00000000 FD8857E5 42D7B49B [.........W.....B]
7FC18D899640 0901FA87 6B3DB5AA 00000002 00000000 [......=k........]
7FC18D899650 00000080 00000080 FE48EB77 00000000 [........w.H.....]
7FC18D899660 00000000 00000000 00000000 00000000 [................]
  Repeat 25 times
7FC18D899800 EBD0A0A2 4433B9E5 B668C087 C79926B7 [......3D..h..&..]
7FC18D899810 5381F6DF 4626F988 0E4F468D D78D3B28 [...S..&F.FO.(;..]
7FC18D899820 000007A1 00000000 0FFFF85F 00000000 [........_.......]
7FC18D899830 00000000 00000000 00720070 006D0069 [........p.r.i.m.]
7FC18D899840 00720061 00000079 00000000 00000000 [a.r.y...........]
7FC18D899850 00000000 00000000 00000000 00000000 [................]
 Repeat 186 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
 
2) Above, kfed reports that the “ASMRECO03” disk was overlapped by the OS; it shows “EFI PART”, which is OS partition metadata:
 
7FC18D899600 20494645 54524150 00010000 0000005C [EFI PART....\...]
 
3) The underlying disk associated with the “ASMRECO03” ASMLIB disk was reformatted by root and assigned to the OS filesystem.
 
4) This action corrupted/destroyed the RECO diskgroup.
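This kind of overlap can also be spotted without kfed by scanning the first blocks of the device for foreign OS signatures; a minimal sketch, demonstrated against a scratch file standing in for the real /dev device:

```shell
# Scratch file standing in for the suspect device (e.g. /dev/sdX).
DISK=/tmp/fakedisk.img
printf 'xxxx EFI PART xxxx' > "$DISK"
# Read the first 32K and look for non-ASM signatures written by the OS.
dd if="$DISK" bs=512 count=64 2>/dev/null | strings | grep -E 'EFI PART|LVM2|LABELONE|NTFS'
```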
 
 
 
 
SOLUTION
The diskgroup needs to be recreated, because the “ASMRECO03” disk was overlapped by the OS filesystem.
 

Oracle ASM KFED Reports “KFBTYP_INVALID” & OS Metadata [LVM2 001] In "/dev/" Disk /ASM disk Member (ASM Disk Overlapping : Scenario #2).

APPLIES TO:
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 12.1.0.1 [Release 10.2 to 12.1]
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Generic (Platform Independent)
SYMPTOMS
1) ASM diskgroup cannot be mounted due to the next error:
 
 
SQL> alter diskgroup <DGNAME> mount;
alter diskgroup <DGNAME> mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing from group number "2"
 
 
2) ASM reports block corruption:
 
 
Wed Apr 09 01:26:05 2014
NOTE: SMON starting instance recovery for group <DGNAME> domain 2 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.469914
NOTE: SMON skipping disk 1 - no header
NOTE: starting recovery of thread=2 ckpt=2.10251 group=2 (<DGNAME>)
WARNING: ASM recovery read a corrupted ACD block 21004
NOTE: a corrupted block was dumped to the trace file
ORA-15196: invalid ASM block header [kfr.c:8098] [endian_kfbh] [3] [21004] [0 != 1]
ERROR: ASM recovery failed to read ACD block 21004
NOTE: cache initiating offline of disk 1 group <DGNAME>
NOTE: process _smon_+asm1 (26726) initiating offline of disk 1.3915939526 (<DGNAME>_0001) with mask 0x7e in group 2
 
 
CAUSE
1) The “/dev/<DISK #1>” (<DGNAME>_0001) disk was overlapped by an OS volume; it shows OS metadata associated with the “LVM2 001” logical volume (all the ASM metadata was wiped out):
 
 
$ kfed read <DGNAME>_0001_<DISK #1>.dump | head -25
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
2ABD671E9400 00000000 00000000 00000000 00000000 [................]
  Repeat 31 times
2ABD671E9600 4542414C 454E4F4C 00000001 00000000 [LABELONE........]
2ABD671E9610 E4E1DDB1 00000020 324D564C 31303020 [.... ...LVM2 001]
2ABD671E9620 50365A77 71327874 34303156 4B4E6136 [wZ6Ptx2qV1046aNK]
2ABD671E9630 35395159 5147634C 487A5A38 63575A37 [YQ95LcGQ8ZzH7ZWc]
2ABD671E9640 00000000 00000019 00030000 00000000 [................]
2ABD671E9650 00000000 00000000 00000000 00000000 [................]
2ABD671E9660 00000000 00000000 00001000 00000000 [................]
2ABD671E9670 0002F000 00000000 00000000 00000000 [................]
2ABD671E9680 00000000 00000000 00000000 00000000 [................]
  Repeat 215 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
 
2) The “/dev/<DISK #1>” disk was used to create a new logical OS volume while it was already assigned to an ASM diskgroup.
 
 
3) This overlapping corrupted the "/dev/<DISK #1>" (<DGNAME>_0001) disk.
 
 
SOLUTION
 
The <DGNAME> diskgroup needs to be recreated and the database files restored from backup, because the diskgroup was overlapped by the OS. In other words, the corruption came from outside Oracle; it cannot be repaired, since the OS volume overwrote the data in the “/dev/<DISK #1>” disk.

Oracle ASM disk group is not mounted on second node|| showing corruption

Hello Experts,
 
 
Envirionment :
OS: RHEL 5.6
Oracle :11.2.0.3 + PSU 5
 
 
I had an issue with a disk group called DATA. The disk group is mounted successfully on the first node of a two-node RAC. When I tried to mount the disk group on the second node, I got:
 
ASMCMD> mount data
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA" (DBD ERROR: OCIStmtExecute)
ASMCMD>
 
I have verified the permissions and compared them with the first node. Everything looks correct.
 
When I ran kfed to read the disk from the second node, I got the following error complaining about corruption. If I run the same command on the first node of the cluster, it succeeds. I am not sure how that can happen if both nodes are reading the same device.
 
 
db3: /opt/app/oracle/diag/asm/+asm/+ASM2/trace/amdu_2013_05_31_18_33_53 # kfed read /dev/raw/raw4
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
2B48D7FCC400 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
 
Please advise, so we can find the root cause.
 
 
1) It seems that the block device names bound to the raw devices in question are no longer referencing the original raw devices; they were relocated/reassigned. This situation usually occurs when new disks are added to the system.
 
2) Please obtain the following output from both the affected and the healthy node and provide it to us:
 
Healthy node:
======================================================
script /tmp/node_ok.txt
 
raw -qa
 
cat /proc/partitions
 
exit
======================================================
 
Affected node:
 
======================================================
script /tmp/node_bad.txt
 
raw -qa
 
cat /proc/partitions
 
exit
======================================================
 
 
 
 
 
1) The best approach to guarantee device persistence (as our colleague Natalka mentioned previously) is to use ASMLIB, as described below:
 
 
Device Persistence with Oracle Linux ASMLib  
 
ASMLib is a support library for the Automatic Storage Management feature of Oracle Database 10g. Oracle provides a Linux specific implementation of this library. This document describes some advantages this ASMLib brings to Linux system administration. 
 
Device Persistence with Oracle Linux ASMLib 
 
 
 
This document describes some advantages the Linux-specific ASM library provided by Oracle (herein "ASMLib") brings to the administration of a Linux system running Oracle. Linux often presents the challenge of disk name persistence. Change the storage configuration, and a disk that appeared as /dev/sdg yesterday can appear as /dev/sdh after a reboot today. How can these changes be isolated so that they do not affect ASM?
 
Why Not Let ASM Scan All Disks?                                        
 
 
ASM scans all disks it is allowed to discover (via the asm_diskstring). Why not scan all the disks and let ASM determine which ones it cares about, rather than worrying about disk name persistence?
 
 
The question is notionally correct. If you pass /dev/sd* to ASM, and ASM can read the devices, ASM can indeed pick out its disks regardless of whether /dev/sdg has changed to /dev/sdh on this particular boot.
 
However, to read these devices, ASM has to have permission to read them. That means ASM has to have user or group ownership on all /dev/sd* devices, including any system disks. Most system administrators do not want the oracle user owning system disks just so ASM can ignore them. The potential for mistakes (a DBA writing over the /home volume, etc.) is way too high.
 
 
ASMLib vs UDev or DevLabel 
 
There are various methods to provide names that do not change, including devlabel and udev. What does ASMLib provide that these solutions do not?
 
The bigger problem is not specifically a persistent name - it is matching that name to a set of permissions. It doesn't matter if /dev/sdg is now /dev/sdh, as long as the new /dev/sdh has oracle:dba ownership and the new /dev/sdg - which used to be /dev/sdf - has the ownership the old /dev/sdf used to have. The easiest way to ensure that permissions are correct is persistent naming. If a disk always appears as the same name, you can always apply the same permissions to it without worrying. In addition, you can then exclude names that match system disks. Even if the permissions are right, a system administrator isn't going to want ASM scanning system disks every time.
 
Now, udev or devlabel can handle keeping sdg as sdg (or /dev/mydisk, whatever). What does ASMLib add? A few things, actually. With ASMLib, there is a simple command to label a disk for ASM. With udev, you'll have to modify the udev configuration file for each disk you add. You'll have to determine a unique id to match the disk and learn the udev configuration syntax.
 
The name is also human-readable. With an Apple XServe RAID, why have a disk named /dev/sdg when it can be DRAWER1DISK2? ASMLib can also list all disks, whereas with udev you have to either know in your head that sdg, sdf, and sdj are for ASM, or you have to provide names. With ASMLib, there is no chance of ASM itself scanning system disks. In fact, ASMLib never modifies the system's names for disks. ASMLib never uses the name "/dev/sdg". After querying the disks at boot time, it provides its own access to the devices with permissions for Oracle. /dev/sdg is still owned by root:root, and the oracle user still cannot access the device by that name.
 
The configuration is persistent. Reinstall a system and your udev configuration is gone. ASMLib's labels are not. With udev, you have to copy the configuration over to the other nodes in a RAC. If you have sixteen nodes, you have to copy each configuration change to all sixteen nodes. Whether you use udev or devlabel, you have to set the permissions properly on all sixteen nodes. ASMLib just requires one invocation of "/etc/init.d/oracleasm scandisks" to pick up all changes made on the other node.
 
These are just a few of the benefits ASMLib brings to device persistence.
 
                                    
2) An example about how to setup the ASMLIB is described in the following document (I wrote it):
 
 
Note: 580153.1 How To Setup ASM on Linux Using ASMLIB Disks, Raw Devices, Block Devices or UDEV Devices?

 

Oracle ASM ORA-15063 / ORA-15042 - TROUBLESHOOTING STEPS BEFORE OPENING a SR to Oracle Support

APPLIES TO:
Oracle Database - Enterprise Edition
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Information in this document applies to any platform.
 
 
 
PURPOSE
 
 
Self-debugging steps when a diskgroup cannot be mounted due to error ORA-15063:
 
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "%s"
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "%" is missing 
 
 
TROUBLESHOOTING STEPS
SECTION A - Getting started
Start by referring to NOTE 452770.1 "TROUBLESHOOTING - ASM disk not found/visible/discovered issues".
First, identify all disks that are part of the affected diskgroup by looking at the last successful mount in alert_+ASM*.log.
 
You should search for a section as below:
SQL> ALTER DISKGROUP <DGNAME1> MOUNT /* asm agent *//* {0:0:214} */
NOTE: cache registered group DATA number=1 incarn=0x44bef6bb
NOTE: cache began mount (not first) of group DATA number=1 incarn=0x44bef6bb
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,0) to disk (ORCL:DATA01P)
NOTE: Assigning number (1,1) to disk (ORCL:DATA02P)
NOTE: Assigning number (1,2) to disk (ORCL:DATA03P)
NOTE: Assigning number (1,3) to disk (ORCL:DATA04P)
NOTE: Assigning number (1,4) to disk (ORCL:DATA05P)
..
NOTE: cache opening disk 0 of grp 1: DATA01P label:DATA01P
NOTE: cache opening disk 1 of grp 1: DATA02P label:DATA02P
..
SUCCESS: DISKGROUP <DGNAME1> was mounted
 
 
NOTE: When ASMLIB is not used the path to ASM disk is specified within the mount section:
 NOTE: cache opening disk 1 of grp 1: REDO3_0001 path:/dev/mpath/3600601600ba12c00d4b784363e69e211 
 NOTE: cache opening disk 2 of grp 1: REDO3_0002 path:/dev/mpath/3600601600ba12c00d4b784363e69e212 
 ...
 
 
Isolate the device(s) reported as "missing" as note 452770.1 suggests.
 
Finally start your checks as follow:
 
A1) Check whether any IO/storage/multipathing errors are reported in the OS logs - investigate and fix them.
This step is mandatory, as ORA-15063/ORA-15042 are usually caused by underlying IO/storage errors.
 
A2) Check whether the devices used by ASM disks are properly presented and configured at the OS level.
If "ORA-15075: disk(s) are not visible cluster-wide" is additionally reported, make sure that all devices are visible cluster-wide.
 
A3) Check that all ASM disks have appropriate permissions (e.g. they should be owned by the grid owner).
If the ownership of the ASM disk(s) has been changed for whatever reason, please correct it.
 
A4) Check if/how the "missing" device(s) is reported when querying V$ASM_DISK
-----------------------------------------------------------------------------------
If the device(s) is reported with status:
 
=> "PROVISIONED/CANDIDATE" - this means the header of the ASM disk(s) is damaged.
 
    -> Investigate the I/O problems behind the corruption - see step A1. Oracle never wipes out its own metadata: a checksum is computed for every write before it is accepted.
 
    -> check the header status, in order to confirm the damage:   
$> kfed read <path_to_your_missing_devices>
       
        kfbh.endian:                          0 ; 0x000: 0x00
        kfbh.hard:                            0 ; 0x001: 0x00
        kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
        kfbh.datfmt:                          0 ; 0x003: 0x00
        kfbh.block.blk:                       0 ; 0x004: blk=0
        kfbh.block.obj:                       0 ; 0x008: file=0
        ....
         
    ->  try to repair the header and see if diskgroup can be mounted:                  
$> kfed repair <path_to_your_missing_devices>
 
    -> Check whether there are additional corruptions reported by ASM (e.g. ORA-15196) or by your database, as I/O/storage problems could affect more than one block.
    If any corruption is seen, please open an SR with Oracle Support.
 
 
 NOTE:  
 1) When a non-default AU size is used, ausz=<au_size> must be specified with each kfed command.
 2) "kfed repair" works only on releases that keep a disk header backup (10.2.0.5, 11.1.0.7, 11.2 and onwards).
 
 
=> "UNKNOWN/IGNORED" - this means the ASM disk(s) is not seen at the OS level.
    -> Review steps A1, A2 and A3.
-----------------------------------------------------------------------------------   
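The two kfed reads above can be scripted across every disk of the diskgroup. Below is a minimal sketch: the helper only parses kfed output, and the driving loop at the end (with a placeholder path) is an assumption about where your disks live:

```shell
# check_asm_header: read `kfed read <device>` output on stdin and classify the
# header as OK (KFBTYP_DISKHEAD), DAMAGED (KFBTYP_INVALID) or UNKNOWN.
check_asm_header() {
  awk '
    /kfbh.type/ && /KFBTYP_DISKHEAD/ { print "OK";      found = 1 }
    /kfbh.type/ && /KFBTYP_INVALID/  { print "DAMAGED"; found = 1 }
    END { if (!found) print "UNKNOWN" }
  '
}

# Example driving loop (path is a placeholder):
# for d in <your_path_to_asm_disks>/*; do
#   echo "$d: $(kfed read "$d" | check_asm_header)"
# done
```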
 
A5) Check whether asm_diskstring is still properly set.
 
On Windows configurations, you can also refer to NOTE 880061.1 "ASM Is Unable To Detect SCSI Disks On Windows"    
      
SECTION B - ASMLIB is used
When ASMLIB is used, follow the steps above (Section A) and also check the errors associated with ORA-15063:
 
B1) ORA-15183 Unable to initialize the ASMLIB in oracle/ORA-15183: ASMLIB initialization error [driver/agent not installed]
 
Refer: NOTE 340519.1 Cannot Start ASM Ora-15063/ORA-15183
 
B2) ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
   
Check your ASMLIB health:
 
 => correctness of the installed RPMs
 
 => correctness of symlinks - all nodes should show:
   
    # ls -l  /etc/sysconfig/oracleasm
       lrwxrwxrwx 1 root root 24 Sep 18 22:10 /etc/sysconfig/oracleasm -> oracleasm-_dev_oracleas
       
 => correctness of the ASMLIB configuration (/etc/sysconfig/oracleasm) when multipathing is used:
 
     # ORACLEASM_SCANORDER: Matching patterns to order disk scanning
        ORACLEASM_SCANORDER="dm"
     # ORACLEASM_SCANEXCLUDE: Matching patterns to exclude disks from scan
        ORACLEASM_SCANEXCLUDE="sd"
 
B3) Check if ASMLIB disks are listed under /dev/oracleasm/disks
 
=> Devices under /dev/oracleasm/disks/* must be reported as dm devices (not single-path sd* devices) on all nodes. If not, please correct that (see step B2).
   
$> ls -al /dev/oracleasm/disks
 
brw-rw---- 1 grid dba 253, 29 Feb 12 11:44 /dev/oracleasm/disks/DATA01P
brw-rw---- 1 grid dba 253, 35 Feb 12 11:44 /dev/oracleasm/disks/DATA02P
brw-rw---- 1 grid dba 253, 27 Feb 15 16:04 /dev/oracleasm/disks/DATA03P
brw-rw---- 1 grid dba 253, 24 Feb 12 11:44 /dev/oracleasm/disks/DATA04P
brw-rw---- 1 grid dba 253, 25 Feb 12 11:44 /dev/oracleasm/disks/DATA05P
 
 
=> If one of your ASMLIB disks is missing from the above output, first try to re-scan the devices, as root:
 # /etc/init.d/oracleasm scandisks
 
 
=> If the ASMLIB disk(s) is still missing from /dev/oracleasm/disks, engage your sysadmin to investigate (see steps A1, A2, A3).
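A quick way to spot a single-path device in that listing is to look at the block-device major number. The helper below assumes the common device-mapper major of 253, which is assigned dynamically, so confirm it on your system first with `grep device-mapper /proc/devices`:

```shell
# check_dm_major: read `ls -al /dev/oracleasm/disks` output on stdin and flag
# any entry whose block-device major is not the device-mapper major.
# NOTE: 253 is the usual dm major, but it is assigned dynamically --
#       verify with: grep device-mapper /proc/devices
check_dm_major() {
  dm_major="${1:-253}"
  awk -v maj="$dm_major" '
    /^b/ {                          # block-device entries only
      m = $5; sub(/,$/, "", m)      # strip trailing comma from the major field
      print $NF, (m == maj ? "dm" : "NOT-dm")
    }
  '
}

# Example: ls -al /dev/oracleasm/disks | check_dm_major
```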
 
B4) Check if ASMLIB disk(s) has the correct ASMLIB stamp and status:
 
 $> kfed read <ASMLIB_device> |grep provstr
      kfdhdb.driver.provstr: ORCLDISK<diskname> ; 0x000: length=20
 
 $> kfed read <ASMLIB_device> | egrep 'kfbh.type|kfdhdb.dskname|kfdhdb.hdrsts'
      kfbh.type:      1 ; 0x002: KFBTYP_DISKHEAD 
      kfdhdb.dskname: DATA01P ; 0x028: length=14
      kfdhdb.hdrsts:  3 ; 0x027: KFDHDR_MEMBER     
     
=> If the output is "kfdhdb.driver.provstr: ORCLCLRD" (but kfdhdb.hdrsts = KFDHDR_MEMBER and kfbh.type = KFBTYP_DISKHEAD), then your disk was deleted using "oracleasm deletedisk".
 
 
 
=> If kfbh.type = KFBTYP_INVALID, see step A4 and check whether "kfed repair" can fix the problem.
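These provstr checks can also be wrapped in a helper that parses kfed output (a sketch; it encodes only the ORCLDISK/ORCLCLRD stamp states described above):

```shell
# check_asmlib_stamp: read `kfed read <device>` output on stdin and report the
# state of the ASMLIB provision string.
check_asmlib_stamp() {
  awk '
    /kfdhdb.driver.provstr/ {
      if      ($2 ~ /^ORCLCLRD/) print "CLEARED (oracleasm deletedisk was run)"
      else if ($2 ~ /^ORCLDISK/) print "STAMPED"
      else                       print "NO-ASMLIB-STAMP"
    }
  '
}

# Example: kfed read <ASMLIB_device> | check_asmlib_stamp
```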
 
 
B5) Also refer to the documents below:
 
NOTE: 398622.1     ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
NOTE: 1384504.1   Mount ASM Disk Group Fails : ORA-15186, ORA-15025, ORA-15063  
NOTE: 967461.1    "Multipath: error getting device" seen in OS log causes ASM/ASMlib to shutdown by itself
NOTE: 1526920.1   ORA-15186 ORA-15063 on node 2
SECTION C - Additional notes to review
If the above checks are done but the error still persists, please also review the notes below, depending on your configuration/situation:
 
NOTE:  577526.1     ORA-15063 ASM Discovered An Insufficient Number Of Disks For Diskgroup using NetApp Storage
NOTE:  784776.1     ORA-15063 When Mounting a Diskgroup After Storage Cloning ( BCV / Split Mirror / SRDF / HDS / Flash Copy )
NOTE:  555918.1     ORA-15038 On Diskgroup Mount After Node Eviction
NOTE:  1484723.1   ASM Candidate Raw Device Is Not Presented As A RAC Cluster Wide Shared character Devices On Unix.
NOTE:  1534211.1   ORA-15017 and ORA-15063 errors for unused diskgroups in 11.2
NOTE:  1487443.1   Mounting Diskgroup Fails With ORA-15063 and V$ASM_DISK Shows PROVISIONED
NOTE:  742832.1     AIX:After changing Multipathing drivers from RDAC to MPIO ASM discovered an insufficient number of disks
NOTE:  1276913.1   Unable to discover or use raw devices for ASM in HP-UX Itanium in 11.2.0.2 ( ORA-15063 )
SECTION D - Information to collect when opening an SR
If you are not able to fix the problem on your own, please collect the information below and raise an SR with Oracle Support
 
D1) alert_+ASM*.log (from all nodes if RAC)
 
D2) script#1 from NOTE 470211.1 How To Gather/Backup ASM Metadata In A Formatted Manner version 10.1, 10.2, 11.1 & 11.2?
 
D3) KFED reports
 
 
#!/bin/sh
rm -f /tmp/kfed_DH.out /tmp/kfed_BK.out
for i in <your_path_to_asm_disks>/*
do
  echo "$i" >> /tmp/kfed_DH.out
  kfed read "$i" >> /tmp/kfed_DH.out                  # disk header
  echo "$i" >> /tmp/kfed_BK.out
  kfed read "$i" aun=1 blkn=254 >> /tmp/kfed_BK.out   # disk header backup copy
done
 
Run kfed.sh as the GRID/ASM owner. Upload /tmp/kfed_DH.out and /tmp/kfed_BK.out.
! Pay attention to non-default AU sizes - if a non-default AU size is used, you must specify it (ausz=<au_size>). (See note 1485597.1 "ASM tools used by Support : KFOD, KFED, AMDU".)
 
 
D4) ASMLIB information
NOTE : 869526.1 Collecting The Required Information For Support To Troubleshoot ASM/ASMLIB Issues.
 
D5) List of your ASM devices
 
   $> ls -al <path_to_ASM_devices>
 
D6) OS logs (from all nodes if this is RAC configuration)
 
SECTION E - Disk is reported as MISSING after a failed disk addition
 If you are facing ORA-15063 after a failed disk addition, please collect the information below and raise an SR with Oracle Support
 
E1) alert_+ASM*.log (from all nodes if RAC)
 
E2) script#1 from NOTE 470211.1 How To Gather/Backup ASM Metadata In A Formatted Manner version 10.1, 10.2, 11.1 & 11.2?
 
E3) KFED reports
#!/bin/sh
rm -f /tmp/kfed_*.out
for i in <your_path_to_asm_disks>/*
do
  echo "$i" >> /tmp/kfed_DH.out
  kfed read "$i" >> /tmp/kfed_DH.out                  # disk header
  echo "$i" >> /tmp/kfed_BK.out
  kfed read "$i" aun=1 blkn=254 >> /tmp/kfed_BK.out   # disk header backup copy
  echo "$i" >> /tmp/kfed_PST.out
  kfed read "$i" aun=1 blkn=2 >> /tmp/kfed_PST.out    # PST
  echo "$i" >> /tmp/kfed_FS.out
  kfed read "$i" blkn=1 >> /tmp/kfed_FS.out           # free space table
  echo "$i" >> /tmp/kfed_FD.out
  kfed read "$i" aun=2 blkn=1 >> /tmp/kfed_FD.out     # file directory
  echo "$i" >> /tmp/kfed_DD.out
  kfed read "$i" aun=2 blkn=0 >> /tmp/kfed_DD.out     # disk directory; with a large number of disks more blocks might be needed -> Oracle Support may ask for them later
done
 
Run kfed.sh as the GRID/ASM owner. Upload /tmp/kfed_*.out.
! Pay attention to non-default AU sizes - if a non-default AU size is used, you must specify it (ausz=<au_size>). (See note 1485597.1 "ASM tools used by Support : KFOD, KFED, AMDU".)
 
 
E4) AMDU output
 
amdu -diskstring '<ASM_DISKSTRING>' -dump '<DISKGROUP_NAME>' -noimage
amdu -diskstring '<ASM_DISKSTRING>' -print <DISKGROUP_NAME>.F2.V0.C2 > DG.amdu
#### F2.V0.C2 --> This extracts information for up to 16 disks only. If there is a larger number of disks, a larger dump is needed.
 

TROUBLESHOOTING - Oracle ASM disk not found/visible/discovered issues

APPLIES TO:
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Information in this document applies to any platform.
PURPOSE
This note will assist in troubleshooting disk not found / visible / discovered issues with ASM
 
Typically this means that the disk in question is not found in the v$asm_disk view
 
Common errors indicating that a disk is missing/not found are :
 
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "%s"
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "%s" is missing
 
TROUBLESHOOTING STEPS
If an existing diskgroup is suddenly missing a disk ...
 
Determine what disk is missing
 
If on non-RAC ... or ALL RAC nodes exhibit these errors
 
    a) Locate and open the ASM alert log (this may require multiple logs for RAC)
    b) Locate the last successful mount of the diskgroup (this will show a list of disks)
    c) Locate each successful ALTER DISKGROUP ... ADD DISK since that time
    d) Combine the lists of disks found in b) and c) above
    e) Compare this list against those shown in V$ASM_DISK and determine the missing disk(s)
 
If on RAC and at least one node can mount the diskgroup
 
    Compare the V$ASM_DISK entries on the node that can mount the diskgroup to those on the nodes that cannot; this will show which disk(s) is missing
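The comparison can be done mechanically with `comm` on two spooled path lists (a sketch; the spool files are assumptions — e.g. `select path from v$asm_disk;` spooled to a file on each node):

```shell
# missing_disks: given two disk-path lists (one per node), print the paths
# the first node sees but the second does not.
missing_disks() {
  good="$1"; bad="$2"
  sort "$good" > /tmp/good.sorted
  sort "$bad"  > /tmp/bad.sorted
  comm -23 /tmp/good.sorted /tmp/bad.sorted   # lines only in the first file
}

# Example: missing_disks node1_paths.txt node2_paths.txt
```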
 
 
The following is a set of steps that will assist in resolving disk not found issues
 
1) ASM_DISKSTRING is not set correctly
 
    Examine the ASM_DISKSTRING setting in the parameter file or via SHOW PARAMETER
    If ASM_DISKSTRING is NOT SET ... then the following default is used
 
Default ASM_DISKSTRING per OS
 
    Operating System                      Default Search String
    =====================================================
    Solaris (32/64 bit)                   /dev/rdsk/*
    Windows NT/XP                         \\.\orcldisk*
    Linux (32/64 bit)                     /dev/raw/*
    Linux (ASMLIB)                        ORCL:*
    Linux (ASMLIB)                        /dev/oracleasm/disks/* (as a workaround)
    HP-UX                                 /dev/rdsk/*
    Tru64 UNIX                            /dev/rdisk/*
    AIX                                   /dev/rhdisk*
 
    If ASM_DISKSTRING is SET ... then verify that the setting includes the disks that need to be seen by ASM
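To confirm what a given ASM_DISKSTRING actually matches at the OS level, the glob can be expanded by the shell (a sketch; pass the string quoted so the function, not your interactive shell, expands it):

```shell
# list_diskstring_matches: expand an ASM_DISKSTRING-style glob and print every
# path that exists, so you can verify the string covers the needed disks.
list_diskstring_matches() {
  for p in $1; do                 # intentionally unquoted: let the glob expand
    [ -e "$p" ] && echo "$p"
  done
}

# Example: list_diskstring_matches '/dev/oracleasm/disks/*'
```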
 
 
2) Operating system drive ownership
 
     Make sure that the disk is owned by the OS user who installed the ASM Oracle Home ... and that the disk is mounted correctly (with the correct owner)
 
3) Operating system drive permissions
 
   Make sure that the permissions are set correctly at the disk level ... 660 is normal ... but if there are problems use 777 as a test
 
4) RAC is being used
 
     If RAC is in use ... then ALL disks need to be visible on all nodes where ASM is / will be running ... before an attempt is made to add the disks to a diskgroup
 
5) Use OS utilities to determine which disk cannot be found
 
TRUSSing or STRACEing the RBAL process while selecting * from v$asm_disk can often show errors in the path of the command
 
EXAMPLE:
=========
SESSION #1
 
strace -f -o /tmp/rbal.trc -p <OS pid of RBAL process>
  <OR>
truss -ef -o /tmp/rbal.out -p <OS pid for RBAL process>
 
SESSION #2
 
select * from v$asm_disk
 
SESSION #3
 
<CTRL-C>
 
Examine the rbal.out for errors: For example,
 
1147090: 1871929: chdir("dev/") = 0
1147090: 1871929: statx("rhdisk8, ", 0x0FFFFFFFFFFFAA80, 176, 010) Err#2 ENOENT
 
<< This says that rhdisk8 cannot be found >>
 
 
NOTE ... If a crash occurred during an add or drop of a disk ... then the disk(s) in question may still be part of the diskgroup ... so all of the steps above need to take these disks into consideration
 
PORT SPECIFIC ISSUES
 
1) HEWLETT PACKARD (HP)
 
If this problem is occurring on HP (RISC or Itanium) ... and all of the above are not helping ... and the customer is using HP Logical Volume Manager (LVM) then
 
  Note.433770.1 Cannot Discover Disks in ASM After Upgrade on 10.2.0.3 on HP-UX  Itanium
  Note.434500.1  When Starting A Database With The 10.2.0.3 Executables, Error "ORA-15059 invalid device type for ASM disk"
 
2) IBM AIX
 
  Note.353761.1 Assigning a Physical Volume ID (PVID) To An Existing ASM Disk Corrupts the ASM Disk Header
 
NOTE:1174604.1 - ASM Is Not Detecting EMC PowerPath Raw Devices Or Regular Raw Devices On AIX
 
3) SOLARIS
 
  Note.368840.1 ASM does not discover disk on Solaris platform
 
4) LINUX
 
  Note.457369.1 ASM is Unable to Detect ASMLIB Disks/Devices.

How To Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) Oracle ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onwards

APPLIES TO:
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Information in this document applies to any platform.
GOAL
The present document provides an example about “how to restore/repair/fix an overwritten ASM Disk Header (first 4K) on 11.1.0.7 and Onwards”.
 
SOLUTION
A copy of the ASM disk header (first 4K) exists on 10.2.0.5, 11.1.0.7, 11.2 and onwards. It can be used to restore a valid ASM disk header, assuming only the first 4K of the disk were affected/overwritten. To restore the ASM disk header (assuming the automatic ASM disk header backup is in good shape), perform the next steps:
 
 
1) Backup the first 50MB of the affected disk (this step is mandatory):
 
$> dd if=<full path affected disk name> of=/tmp/<affected disk name>.dump bs=1048576 count=50
 
Example:
 
[grid@dbaasm ~]$ dd if=/dev/oracleasm/disks/ASMDISK2 of=/tmp/ASMDISK2.dump bs=1048576 count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 0.667474 seconds, 78.5 MB/s
 
 
Where "/dev/oracleasm/disks/ASMDISK2" is the affected ASM disk member.
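Should a later repair attempt make things worse, this saved image can be written back over the start of the disk. A minimal sketch (note that if= and of= are swapped relative to the backup command, and the device path is illustrative — triple-check both before running against a real disk):

```shell
# restore_header_backup: write a previously saved dd image back over the start
# of a device -- the reverse of the backup in step 1. conv=notrunc leaves the
# rest of the target untouched.
restore_header_backup() {
  dump="$1"; target="$2"
  dd if="$dump" of="$target" bs=1048576 conv=notrunc 2>/dev/null
}

# Example (illustrative path -- verify if=/of= before running!):
# restore_header_backup /tmp/ASMDISK2.dump /dev/oracleasm/disks/ASMDISK2
```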
 
2)  Collect the Allocation Unit Size (kfdhdb.ausize) from another healthy disk member (from the same affected diskgroup):
 
$> <ASM Oracle Home>/bin/kfed read <full path healthy disk name> | egrep 'ausize|dsknum|dskname|grpname|fgname'  
 
Example:
 
[grid@dbaasm ~]$ kfed read /dev/oracleasm/disks/ASMDISK1  | egrep 'ausize|dsknum|dskname|grpname|fgname'  
 
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.dskname:                ASMDISK1 ; 0x028: length=8
kfdhdb.grpname:                    DATA ; 0x048: length=4
kfdhdb.fgname:                 FG1_SAN1 ; 0x068: length=8
kfdhdb.ausize:                  2097152 ; 0x0bc: 0x00200000
 
 
Note: In this example, the diskgroup was created using an AU_SIZE=2M (2097152 ) & "/dev/oracleasm/disks/ASMDISK1" is the healthy ASM disk member .
 
3) Then restore the ASM disk header from backup as follows:
 
$> <ASM Oracle Home>/bin/kfed repair <full path affected disk name> ausz=<AU size from point #2>
 
Example:
 
[grid@dbaasm ~]$ kfed repair /dev/oracleasm/disks/ASMDISK2  ausz=2097152
 
4) Verify that the ASM disk header in the affected disk was recreated/restored:
 
$> <ASM Oracle Home>/bin/kfed read <full path affected disk name>  | head -40
 
Example:
 
[grid@dbaasm ~]$  kfed read /dev/oracleasm/disks/ASMDISK2  | head -40
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483650 ; 0x008: disk=2
kfbh.check:                  4052202307 ; 0x00c: 0xf187b343
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKASMDISK2 ; 0x000: length=16
kfdhdb.driver.reserved[0]:   1145918273 ; 0x008: 0x444d5341
kfdhdb.driver.reserved[1]:    843797321 ; 0x00c: 0x324b5349
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                186647296 ; 0x020: 0x0b200300
kfdhdb.dsknum:                        2 ; 0x024: 0x0002
kfdhdb.grptyp:                        2 ; 0x026: KFDGTP_NORMAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:                ASMDISK2 ; 0x028: length=8
kfdhdb.grpname:                    DATA ; 0x048: length=4
kfdhdb.fgname:                 FG2_SAN2 ; 0x068: length=8
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             32974423 ; 0x0a8: HOUR=0x17 DAYS=0x12 MNTH=0x9 YEAR=0x7dc
kfdhdb.crestmp.lo:           1180930048 ; 0x0ac: USEC=0x0 MSEC=0xe4 SECS=0x26 MINS=0x11
kfdhdb.mntstmp.hi:             33003184 ; 0x0b0: HOUR=0x10 DAYS=0x15 MNTH=0x5 YEAR=0x7de
kfdhdb.mntstmp.lo:           1230240768 ; 0x0b4: USEC=0x0 MSEC=0xff SECS=0x15 MINS=0x12
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  2097152 ; 0x0bc: 0x00200000
kfdhdb.mfact:                    228480 ; 0x0c0: 0x00037c80
kfdhdb.dsksize:                    9769 ; 0x0c4: 0x00002629
kfdhdb.pmcnt:                         2 ; 0x0c8: 0x00000002
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn:                      0 ; 0x0d4: 0x00000000
kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
.
.
.
.
 
5) Finally, mount the diskgroup:
 
SQL> alter diskgroup <diskgroup name> mount ;
 
Example:
 
 
SQL> alter diskgroup DATA mount;
 
Diskgroup altered.
 
 
Notes
Note 1: The solution provided in this document will work if the following conditions are true:
 
   a) Only the first 4K of the affected disk were overwritten/wiped out/overlapped.
 
   b) ASM disk header backup is in good shape.
 
 
 
Note 2: If this solution does not solve your problem, then do not attempt additional steps/actions on the affected diskgroup; please engage Oracle Support through a new Service Request to determine the root cause and possible solutions.
 
 
 
Note 3: An ASM disk with a corrupted "Disk Header" will report the following output: 
 
[grid@dbaasm ~]$ kfed read /dev/oracleasm/disks/ASMDISK2
 
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

Best Practice : Corruption in Oracle ASM Header

APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.4 and later
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Information in this document applies to any platform.
PURPOSE
The purpose of this note is to provide a summary of Best Practices to be adopted while handling Corruption in ASM Header
 
SCOPE
This note applies to Grid Infrastructure for both Clustered and Standalone ASM environment.
 
DETAILS
I. My Oracle Support Document:  How To Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onwards
 
Document:1088867.1: This document provides an example about “how to restore/repair/fix an overwritten ASM Disk Header (first 4K) on 11.1.0.7 and Onwards”.
 
 
 
II. My Oracle Support Document: ASM Corruption: Case #1: How To Fix The ASM Disk HEADER_STATUS From FORMER or PROVISIONED to MEMBER
 
Document: 1448799.1: This document explains, in detail, the steps required (with an example) to fix the ASM disk header from FORMER or Provisioned to MEMBER under the below scenarios:
 
A) Diskgroup was dropped by accident using the "SQL> drop diskgroup <DG name>;" statement.
 
B) Or under strange situations as described in the next bug:
 
Bug:13331814 ASM DISKS TURNED INTO FORMER WHILE DISKGROUP IS MOUNTED.
 
REFERENCES
NOTE:1088867.1 - How To Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onwards
NOTE:1448799.1 - ASM Corruption: Case #1: How To Fix The ASM Disk HEADER_STATUS From FORMER or PROVISIONED To MEMBER.

Bug 13331814 : Oracle ASM DISKS TURNED INTO FORMER WHILE DISKGROUP IS MOUNTED.

Abstract: ASM DISKS TURNED INTO FORMER WHILE DISKGROUP IS MOUNTED.
 
PROBLEM:
--------
While the 2 diskgroups were mounted on 3 of 4 ASM instances, 1 member disk 
(of each diskgroup) turned into FORMER, the disks associated with both 
diskgroups appear as MEMBER (3) & FORMER(2) (note the diskgroups were mounted 
while 2 disks  turned into FORMER):
======================================================
GROUP_NUMBER DISK_NUMBER HEADER_STATU MODE_ST OS_MB TOTAL_MB FREE_MB NAME 
FAILGROUP PATH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 0 MEMBER ONLINE 20468 20468 19335 DG_SCUT_ARCH1_0000 DG_SCUT_ARCH1_0000 
/dev/rdsk/c3t60A980006466654F476F64317648572Fd0s6
3 1 MEMBER ONLINE 20468 20468 19341 DG_SCUT_ARCH1_0001 DG_SCUT_ARCH1_0001 
/dev/rdsk/c3t60A980006466654F476F64317656586Dd0s6
3 2 FORMER ONLINE 20468 20468 19342 DG_SCUT_ARCH1_0002 DG_SCUT_ARCH1_0002 
/dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6   <(== HERE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14 0 MEMBER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0000 DG_MDSUT_ARCH2_0000 
/dev/rdsk/c3t60A980006466654C436F643955687064d0s6
14 1 FORMER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0001 DG_MDSUT_ARCH2_0001 
/dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6  <(== HERE
======================================================
 
DIAGNOSTIC ANALYSIS:
--------------------
Both disks show a FORMER status:
======================================================
+ASM:oracle> kfed read c3t60A980006466654F476F643176575A4Fd0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:      DG_SCUT_ARCH1_0002 ; 0x028: length=18
kfdhdb.grpname:           DG_SCUT_ARCH1 ; 0x048: length=13
======================================================
 
+ASM:oracle> kfed read c3t60A980006466654C436F6439556B4E37d0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:     DG_MDSUT_ARCH2_0001 ; 0x028: length=19
kfdhdb.grpname:          DG_MDSUT_ARCH2 ; 0x048: length=14
======================================================
 
 
1) This an 11.2.0.2.0 ASM configuration (4 RAC nodes).
 
2) DG_MDSUT_ARCH2 & DG_SCUT_ARCH1 diskgroups were mounted on 4 ASM instances:
======================================================
GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB OFFLINE_DISKS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 DG_SCUT_ARCH1 MOUNTED EXTERN 40936 38676 0
14 DG_MDSUT_ARCH2 MOUNTED EXTERN 20468 20378 0
======================================================
 
3) While the 2 diskgroups were mounted on 3 of 4 ASM instances, 1 member disk 
(of each diskgroup) turned into FORMER, the disks associated with both 
diskgroups appear as MEMBER (3) & FORMER(2) (note the diskgroups were mounted 
while 2 disks  turned into FORMER):
======================================================
GROUP_NUMBER DISK_NUMBER HEADER_STATU MODE_ST OS_MB TOTAL_MB FREE_MB NAME 
FAILGROUP PATH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 0 MEMBER ONLINE 20468 20468 19335 DG_SCUT_ARCH1_0000 DG_SCUT_ARCH1_0000 
/dev/rdsk/c3t60A980006466654F476F64317648572Fd0s6
3 1 MEMBER ONLINE 20468 20468 19341 DG_SCUT_ARCH1_0001 DG_SCUT_ARCH1_0001 
/dev/rdsk/c3t60A980006466654F476F64317656586Dd0s6
3 2 FORMER ONLINE 20468 20468 19342 DG_SCUT_ARCH1_0002 DG_SCUT_ARCH1_0002 
/dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6   <(== HERE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14 0 MEMBER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0000 DG_MDSUT_ARCH2_0000 
/dev/rdsk/c3t60A980006466654C436F643955687064d0s6
14 1 FORMER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0001 DG_MDSUT_ARCH2_0001 
/dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6  <(== HERE
======================================================
 
4) Diskgroups were not mounted on node #4 (+ASM4), since node #4 was 
evicted. After the node reboot, the 2 diskgroups (DG_MDSUT_ARCH2 
& DG_SCUT_ARCH1) cannot be mounted on this node. The 2 disks then showed 
header_status = 'FORMER'. 
 
 
5) Since this is ASM 11.2.0.2, I asked to fix the disks headers as follow 
(without success):
======================================================
5.1) Please backup all the database contained in the DG_SCUT_ARCH1 & 
DG_MDSUT_ARCH2 diskgroups and validate the backups.
 
5.2) Then shutdown all the database instances referencing the DG_SCUT_ARCH1 & 
DG_MDSUT_ARCH2 diskgroups (from all the nodes).
 
5.3) Then dismount the DG_SCUT_ARCH1 & DG_MDSUT_ARCH2 diskgroups from all the 
ASM instances and keep them dismounted on all the nodes:
======================================================
SQL> alter diskgroup  DG_MDSUT_ARCH2 dismount;
 
SQL>  alter diskgroup DG_SCUT_ARCH1  dismount;
======================================================
 
5.4) Then fix the disks as follow:
======================================================
$>kfed repair /dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6
======================================================
$> kfed repair /dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6
======================================================
 
5.5) Then mount the 2 diskgroup on all the ASM instances:
======================================================
SQL> alter diskgroup  DG_MDSUT_ARCH2 mount;
 
SQL>  alter diskgroup DG_SCUT_ARCH1  mount;
======================================================
 
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
 
SQL> alter diskgroup DG_MDSUT_ARCH2 mount;
alter diskgroup DG_MDSUT_ARCH2 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing from group number "4"
 
 
SQL>  alter diskgroup DG_SCUT_ARCH1 mount;
alter diskgroup DG_SCUT_ARCH1 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "2" is missing from group number "4"
 
======================================================
*** 10/31/11 11:55 am ***
 
6) Then asked for the first 50 MB of the affected disks:
 
======================================================
$> dd if=/dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6 
of=/tmp/c3t60A980006466654F476F643176575A4Fd0s6.dump bs=1048576 count=50
 
 
$> dd if=/dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6 
of=/tmp/c3t60A980006466654C436F6439556B4E37d0s6.dump bs=1048576 count=50
======================================================
 
7) Then dump the disk header using kfed, but both disks showed a FORMER 
status:
======================================================
+ASM:oracle> kfed read c3t60A980006466654F476F643176575A4Fd0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:      DG_SCUT_ARCH1_0002 ; 0x028: length=18
kfdhdb.grpname:           DG_SCUT_ARCH1 ; 0x048: length=13
======================================================
 
+ASM:oracle> kfed read c3t60A980006466654C436F6439556B4E37d0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:     DG_MDSUT_ARCH2_0001 ; 0x028: length=19
kfdhdb.grpname:          DG_MDSUT_ARCH2 ; 0x048: length=14
======================================================
 
8) So, I edited the disk headers to change the status from FORMER to MEMBER:
======================================================
 
From:
 
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
======================================================
 
To:
 
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
 
======================================================
 
9) Then attempted to repair the disk header on both disks using the fixed 
header, as follows:
======================================================
kfed merge /dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6   
text=c3t60A980006466654C436F6439556B4E37d0s6_affected_DG_MDSUT_ARCH2_0001_kfed
_fix.txt
======================================================
 
kfed merge /dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6   
text=c3t60A980006466654F476F643176575A4Fd0s6_affected_DG_SCUT_ARCH1_0002_kfed_
fix.txt
======================================================
 
10) But the diskgroups could not be mounted due to a new ORA-15203 error (I 
double-checked they are using the correct 11.2.0.2 ASM release to mount the 
diskgroups):
======================================================
 
SQL> alter diskgroup DG_MDSUT_ARCH2 mount;
 
alter diskgroup DG_MDSUT_ARCH2 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15203: diskgroup DG_MDSUT_ARCH2 contains disks from an incompatible 
version
of ASM
======================================================
 
 
SQL> SQL> alter diskgroup DG_SCUT_ARCH1 mount;
alter diskgroup DG_SCUT_ARCH1 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15203: diskgroup DG_SCUT_ARCH1 contains disks from an incompatible 
version
of ASM
======================================================
 
11) We need help from the Development team on 2 things:
 
11.1) Why did the 2 disks turn into FORMER?
 
11.2) Is there any option to fix the ASM header, since "kfed repair" is not 
working and the manual header-fix approach is also having issues?

Oracle bbed Block Browser /Editor


Oracle bbed is a utility which allows browsing and editing of disk data structures maintained by the Oracle RDBMS kernel. 

The targeted user of the block browser/editor is a person who is knowledgeable of these disk data structures and has an understanding of the dependencies between them.

 

BROWSE: read directly from disk, regardless of whether the instance is up or down.

EDIT: unrecoverable - all edits are written to disk immediately (write-through).

Blocks can be viewed and modified in hexadecimal for all file and block types supported by Oracle. When modifying data in this mode, no range or dependency checking is done.

 

The block editor is an externally invoked facility (similar to SQL*DBA or DBVERIFY) that supports a set of optional command-line parameters at invocation.

 

bbed [parameters]

where the parameters can consist of the following (defaults shown in brackets):

parameters:==

DATAFILE = filename

BLOCKSIZE = blocksize [2048]

MODE = BROWSE/EDIT

REVERT = y[es]/n[o]

SPOOL = y[es]/n[o]

LISTFILE = contains filenames, w/optional sizes

CMDFILE = bbed command filename

LOGFILE = log filename [log.bbd]

BIFILE = before-image filename [bifile.bbd]

PARFILE = parameter filename
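A typical invocation is driven from a parameter file. The sketch below creates a minimal one (the datafile path is illustrative; on Unix releases that ship bbed, the binary must first be relinked from $ORACLE_HOME/rdbms/lib):

```shell
# Create a minimal bbed parameter file (the datafile path is illustrative).
cat > /tmp/bbed.par <<'EOF'
DATAFILE=/u01/oradata/ORCL/users01.dbf
BLOCKSIZE=8192
MODE=BROWSE
LOGFILE=/tmp/bbed.log
EOF

# Then start the editor against it:
# bbed PARFILE=/tmp/bbed.par
```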

 

 

 

 

DBRECOVER for MS SQL Server

DBRECOVER For Oracle Version 2009


What's New?
 

Enhancements:

 

  1. Users can easily examine how many rows of a table are recoverable 
  2. DataBridge now supports deleted rows 
  3. Users can scan the database in dictionary mode
  4. Users can use extent mode to recover tables in dictionary mode
  5. The bootstrap objects can be loaded without a segment header

 

New features (translated):

Users can find out how many rows of a table are recoverable via the table-examination feature.

Rows removed by DELETE can be restored directly into the database via DataBridge.

All database objects can be scanned in dictionary mode, in order to recover dropped tables.

Tables can be scanned in extent mode while in dictionary mode, to handle corrupted segment headers.

The system bootstrap tables can be loaded in extent mode.

 

https://zcdn.parnassusdata.com/dbrecover-for-oracle2009.zip

 

 

 
