
Recover Oracle database after disk loss

 
PURPOSE
-------
 
This article walks you through some of the common
recovery techniques to use after a disk failure.
 
SCOPE & APPLICATION
-------------------
 
All Oracle Support Analysts, DBAs and Consultants who have a role
to play in recovering an Oracle database.
 
Loss due to Disk Failure
------------------------
What can we lose due to disk failure:
A) Control files
B) Redo log files
C) Archivelog files
D) Datafiles
E) Parameter file or SPFILE
F) Oracle software installation
 
Detecting disk failure
-----------------------
1) Run copy utilities like "dd" on Unix to read the suspect files or devices
2) If using RAID mechanisms like RAID 5, parity information may mask 
    the disk failure and a more rigorous check would be needed
3) As always, check the operating system log files
4) Another obvious case is when the disk cannot be seen
    or mounted by the OS
5) On the Oracle side, run dbv if the affected file is a datafile
6) The best way to detect disk failure is by running hardware 
diagnostic tools and OS-specific disk utilities.
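 
The dbv check in step 5 is run from the OS prompt. A minimal sketch, with an
illustrative file name and block size:
 
   $ dbv file=/oracle/oradata/users01.dbf blocksize=8192
 
Any blocks reported as corrupt or failing in the dbv summary point to damage
in the underlying storage.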
 
Next Action
------------
Once the type of failure is identified, the next step is to rectify it.
 
Options could be:
(1) Replace the corrupted disk with a new one and mount them with 
     the same name (say /oracle or D:\)
(2) Replace the corrupted disk with a new one and mount them with 
     a different name (say /oracle1 as the new mount point)
(3) Decide to use another existing disk mounted with a different name
     (say /oracle_new)
 
The most common methods are (1) and (3).
 
Oracle Recovery
---------------
Once the disk problem is sorted, the next step is to perform recovery
at the Oracle level. This would depend on the type of file that is lost (see
"Loss due to Disk Failure" section) and also on the type of disk recovery done
as mentioned in the "Next Action" section above.
 
(A) Control Files
------------------
Normally, we have multiplexing of controlfiles and they are expected to be
placed in different disks.
 
If one or more controlfiles are lost, mount will fail as shown below:
SQL> startup
Oracle Instance started
....
ORA-00205: error in identifying controlfile, check alert log for more info
 
You can verify the controlfile copies using:
SQL> select * from v$controlfile;
 
   **If at least one copy of the controlfile is unaffected by the disk failure
   and the database was shut down cleanly:
   (a) Copy a good copy of the controlfile to the missing location
   (b) Start the database 
 
   Alternatively, remove the lost control file location specified in the
   init parameter control_files and start the database.
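 
   As a sketch, if the copy on the failed disk was control02.ctl, the init.ora
   can be edited to list only the surviving copies (paths here are illustrative):
 
   control_files = ('/disk1/oradata/control01.ctl',
                    '/disk3/oradata/control03.ctl')
 
   SQL> startup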
 
   **If all copies of the controlfile are lost due to the disk failure, then:
   Check for a backup controlfile. Backup controlfile is normally taken using 
   either of the following commands:
   (a) SQL> alter database backup controlfile to '/backup/control.ctl';
    -- This would have created a binary backup of the current controlfile --
 
    -->If the backup was done in binary format as mentioned above, restore the 
       file to the lost controlfile locations using OS copying utilities.
    --> SQL> startup mount;
    --> SQL> recover database using backup controlfile;
    --> SQL> alter database open;
 
   (b) SQL> alter database backup controlfile to trace;
    -- This would have created a readable trace file containing create controlfile
    script --
 
    --> Edit the trace file created (check user_dump_dest for the location) and
        retain the SQL commands alone. Save this to a file say cr_ctrl.sql
    --> Run the script
    
    SQL> @cr_ctrl
 
    This would create the controlfile, recover database and open the database.
 
    ** If no copy of the controlfile or backup is available, then create a controlfile
    creation script using the datafile and redo log file information. Ensure that the
    file names are listed in the correct order as in FILE$.
    Then the steps would be similar to the one followed with cr_ctrl.sql script.
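 
    A skeleton of such a creation script is sketched below; the database name,
    file names and sizes are illustrative and must be replaced with the actual
    values, with the datafiles listed in FILE$ order:
 
    CREATE CONTROLFILE REUSE DATABASE "PROD" NORESETLOGS ARCHIVELOG
        LOGFILE GROUP 1 '/oracle/oradata/redo01.log' SIZE 50M,
                GROUP 2 '/oracle/oradata/redo02.log' SIZE 50M
        DATAFILE '/oracle/oradata/system01.dbf',
                 '/oracle/oradata/users01.dbf';
    RECOVER DATABASE;
    ALTER DATABASE OPEN;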
 
 
Note that all controlfile related SQL maintenance operations are done in the 
database nomount state
 
 
(B) Redo logs
    ---------
In normal cases, we would not have backups of online redo log files. But the 
inactive logfile changes could already have been checkpointed on the datafiles
and even archive log files may be available.
 
SQL> startup mount
     Oracle Instance Started
     Database mounted
     ORA-00313: open failed for members of log group 1 of thread 1
     ORA-00312: online log 1 thread 1: '/(path)/REDO01.LOG'
     ORA-27041: unable to open file
     OSD-04002: unable to open file
     O/S-Error: (OS 2) The system cannot find the file specified.
 
** Verify if the lost redolog file is Current or not.
     SQL> select * from v$log;
     SQL> select * from v$logfile; 
 
     --> If the lost redo log is an Inactive logfile, you can clear the logfile:
 
     SQL> alter database clear logfile '/(path)/REDO01.LOG';
 
     Alternatively, you can drop the logfile if you have at least two other   
     logfile groups:
     SQL> alter database drop logfile group 1;
 
     
     --> If the logfile is the Current logfile, then do the following:
     SQL> recover database until cancel;
         
     Type Cancel when prompted
 
     SQL> alter database open resetlogs;
 
     
     The 'recover database until cancel' command can fail with the following 
     errors:
     ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error 
     below
     ORA-01194: file 1 needs more recovery to be consistent
     ORA-01110: data file 1: '/(Path)/SYSTEM01.DBF'
 
     In this case , restore an old backup of the database files and apply the
     archive logs to perform incomplete recovery.
     --> restore old backup
     SQL> startup mount
     SQL> recover database until cancel using backup controlfile;
     SQL> alter database open resetlogs;
 
 
If the database is in noarchivelog mode and the ORA-1547, ORA-1194 and ORA-1110 errors occur, then you would have to restore from an old backup and start the database.
 
 
Note that all redo log maintenance operations are done in the database mount state
 
 
(C) Archive logs
-----------------
If only previously generated archive log files have been lost, then there is
little cause for panic.
** Backup the current database files using hot or cold backup which would ensure
that you would not need the missing archive logs
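 
As a sketch, a minimal hot backup of a single tablespace could look like the
following (tablespace and file names are illustrative; repeat for every
tablespace in the database):
 
   SQL> alter tablespace users begin backup;
   $ cp /oracle/oradata/users01.dbf /backup/
   SQL> alter tablespace users end backup;
   SQL> alter system archive log current;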
 
(D) Datafiles
--------------
This obviously is the biggest loss.
 
(1) If only a few sectors are damaged, then you would get ORA-1578 when 
accessing those blocks.
 --> Identify the object name and type whose block is corrupted by querying dba_extents
 --> Based on the object type, perform appropriate recovery
 --> Check metalink Note:28814.1 for resolving this error
 
(2) If the entire disk is lost, then one or more datafiles may need to be 
recovered.
  SQL> startup
  ORACLE instance started.
  ...
  Database mounted.
  ORA-01157: cannot identify/lock data file 3 - see DBWR trace file
  ORA-01110: data file 3: '/(path)/USERS01.DBF'
 
Other possible errors are ORA-00376 and ORA-1113
 
The views and queries to identify the datafiles would be:
   SQL> select file#,name,status from v$datafile;
   SQL> select file#,online,error from v$recover_file;
 
 
** If restoring to a replaced disk mounted with the same name, then :
  (1) Restore the affected datafile(s) using OS copy/restore commands from the 
      previous backup
  (2) Perform recovery based on the type of datafile affected namely SYSTEM, 
      ROLLBACK or UNDO, TEMP, DATA or INDEX.
  (3) The recover commands could be 'recover database', 'recover tablespace'
      or 'recover datafile' based on the loss and the database state
 
** If restoring to a different mount point, then :
  (1) Restore the files to the new location from a previous backup
  (2) SQL> STARTUP MOUNT
  (3) SQL> alter database rename file '/old path_name' to 'new path_name';     
      -- Do this renaming for all datafiles affected. --
  (4) Perform recovery based on the type of datafile affected namely SYSTEM, 
      ROLLBACK or UNDO, TEMP, DATA or INDEX.
  (5) The recover commands could be 'recover database', 'recover tablespace'
      or 'recover datafile' based on the loss and the database state
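 
Putting steps (2) to (5) together for a single datafile moved to a new mount
point, a sketch with illustrative paths:
 
   SQL> startup mount
   SQL> alter database rename file '/oracle/users01.dbf' to '/oracle_new/users01.dbf';
   SQL> recover datafile '/oracle_new/users01.dbf';
   SQL> alter database datafile '/oracle_new/users01.dbf' online;
   SQL> alter database open;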
 
The detailed steps of recovery based on the datafile lost and the Oracle error 
are outlined in the articles referenced at the end of this note.
 
 
  NOARCHIVELOG DATABASE
  =====================
  The loss mentioned in (A),(B) and (D) would be different in this case
  wherever archive logs are involved. 
 
  We will discuss the datafile loss scenarios here:
 
  (a) If the datafile lost is a SYSTEM datafile, restore the complete
      database from the previous backup and start the database.
  (b) If the datafile lost is Rollback related datafile with active transactions,
      restore from the previous backup and start the database.
  (c) If the datafile contains rollback with no active rollback segments, you can
      offline the datafile (after commenting the rollback_segments parameter 
      assuming that they are private rollback segments) and open the database. 
  (d) If the datafile is temporary, offline the datafile and open the database. 
      Drop the tablespace and recreate the tablespace.
  (e) If the datafile is DATA or INDEX, 
      **Offline the tablespace and start the database.
      **If you have a previous backup, restore it to a separate location.
      **Then export the objects in the affected tablespace ( using User or 
        table level export).
      **Create the tablespace in the original database.
      **Import the objects exported above.
 
      If the database is 8i or above, you can also use Transportable tablespace
      feature.
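 
      A sketch of scenario (e), assuming the backup was restored as a separate
      database and the affected tablespace holds objects owned by user SCOTT
      (all names and sizes are illustrative):
 
      -- against the restored copy:
      $ exp system/<password> owner=scott file=scott.dmp
      -- in the original database:
      SQL> create tablespace data01 datafile '/oracle/data01.dbf' size 100M;
      $ imp system/<password> file=scott.dmp fromuser=scott touser=scott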
 
 
(E) Parameter file
    ---------------
This is not a major loss and can be easily restored. Options are:
  (1) If there is a backup, restore the file
  (2) If there is no backup, copy sample file or create a new file and add the 
      required parameters. Ensure that the parameters db_name, control_files,
       db_block_size, compatible are set correctly
  (3) If the spfile is lost, you can create it from the init parameter file if it is available
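 
  Step (3) is a one-liner once a pfile is available; conversely, a lost pfile
  can be regenerated from a surviving spfile (the path is illustrative):
 
  SQL> create spfile from pfile='/oracle/dbs/initPROD.ora';
  SQL> create pfile='/oracle/dbs/initPROD.ora' from spfile;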
 
 
(F) Oracle Software Installation
    ----------------------------
There are two ways to recover from this scenario:
  (1) If there is a backup of the Oracle home and Oracle Inventory, restore
      them to the respective directories. Note that if you change the Oracle Home, 
      the inventory would not be aware of this new path and you would not be
      able to apply patchsets. Also restore to the same OS user and group.
 
  (2) Perform a fresh Install, bringing it to the same patchset level
 
 
PRACTICAL SCENARIO
==================
 
In most cases, when a disk is lost, more than one type of file could be lost.
The recovery in this scenario would be:
  (1) A combination of each of these data loss recovery scenarios
  (2) Perform an entire database restore from the most recent backup and apply
      archive logs to perform recovery. This is the preferred method 
      but can be time consuming.
 
 

Recover A Lost Oracle Datafile With No Backup

Problem Description: 
==================== 
 
You have inadvertently lost a datafile at the OS level and there are no current 
backups. 
You are in archivelog mode.
You have ALL Archivelogs available since the datafile was created initially (creation date). 
 
 
Problem Explanation: 
==================== 
 
Since there are no backups, the database cannot be opened without this file 
unless the datafile is offline dropped and its tablespace dropped. If this is 
an important file and tablespace, this is not a valid option.
 
 
Problem References: 
=================== 
 
Oracle 7 Backup and Recovery Workshop Student Guide, Failure Scenario 14 
 
 
Search Words: 
============= 
 
ORA-1110, lost datafile, file not found.
 
 
 
Solution Description: 
===================== 
 
This file has to be recreated and recovered. Do the following:
 
1) Go to svrmgrl and connect internal.
 
2) SVRMGR>shutdown immediate. (If this hangs, issue shutdown abort)
 
3) SVRMGR>startup mount 
 
4) SVRMGR> select * from v$recover_file;
 
 
  SAMPLE:
 
  FILE#      ONLINE  ERROR              CHANGE#    TIME
  ---------- ------- ------------------ ---------- --------------------
          11 OFFLINE FILE NOT FOUND              0 01/01/88 00:00:00
 
  (Noting the file number that was reported in the error)
 
 
5) SVRMGR> select * from v$datafile where FILE#=11;
 
  SAMPLE:
 
  FILE#      STATUS  ENABLED    CHECKPOINT BYTES      CREATE_BYT NAME
  ---------- ------- ---------- ---------- ---------- ---------- --------
          11 RECOVER READ WRITE 4.9392E+12          0      10240 /tmp/sample.dbf
 
  (Note the status is RECOVER and the CREATE_BYTE size)
  (Note the NAME)
 
 
6) Recreate the datafile.
 
SVRMGR> alter database create datafile '/tmp/sample.dbf'
as '/tmp/sample.dbf' size 10240 reuse;
 
(Note that the file "created" and the file created "as" are
the same file. The "size" needs to be the same size as it
was when it was created.)
 
7) Check to see that it was successful.
 
SVRMGR> select * from v$datafile where FILE#=11;
 
8) Bring the file online.
 
SVRMGR> alter database datafile '/tmp/sample.dbf' online;
 
9) Recover the datafile.
 
SVRMGR> Recover database;
 
Note: During recovery, all archived redo logs written to since the original 
datafile was created must be applied to the new, empty version of the 
lost datafile.
 
 
10) Enjoy!!
 
SVRMGR> alter database open;
 
 
Solution Explanation: 
===================== 
 
Recreating the file and recovering it rewrites it to the OS and brings it up to 
date.   
 
 
Solution References: 
==================== 
 
Oracle 7 Backup and Recovery Workshop Student Guide, Failure Scenario 14
 

How to Recover an Oracle Database Having Added a Datafile Since Last Backup

NOTE: In the images and/or the document content below, the user information and environment 
data used represents fictitious data from the Oracle sample schema(s), Public Documentation 
delivered with an Oracle database product or other training material.  Any similarity to actual 
environments, actual persons, living or dead, is purely coincidental and not intended in any manner.
 
 
 
HOW TO RECOVER A DATABASE HAVING ADDED A DATAFILE SINCE THE LAST BACKUP
-----------------------------------------------------------------------
 
This bulletin outlines the steps required in performing database recovery
having added a datafile to the database since the last backup was taken. 
Section A is applicable to Oracle release 7.x. Section B applies only to
Oracle releases 7.3.x and above.
 
PLEASE READ THROUGH ALL STEPS AND WARNINGS BEFORE ATTEMPTING TO USE THIS
BULLETIN.
 
 
A. Current controlfile, backup of datafile exists (Oracle release 7.x)
   ===================================================================
 
 A valid (either hot or cold) backup of the datafiles exists, except for the
 datafile created since the backup was taken. The current controlfile exists. 
 The database is in archivelog mode (see note (c) at bottom of page).
 
 1. Restore ONLY the datafiles (those that have been lost or damaged) from the 
    last hot or cold backup. The current online redo logs and control file(s) 
    must be intact.
 
 2. Mount the database
 
 3. Create a new datafile using the 'ALTER DATABASE CREATE DATAFILE' command.
 
    a. The datafile can be created with the same name as the original
       file. For example,
 
       SQLDBA> alter database create datafile
            2> '/oracle/dbs/testtbs.dbf';
       Statement processed.
 
    b. The datafile can be created with a different filename to the original. 
       This option might be chosen if the original file was lost due to disk 
       failure and the failed disk was still unavailable; the new file would 
       then be created on a different device. For example,
 
       SQLDBA> alter database create datafile
            2> '/oracle/dbs/testtbs.dbf'
            3> as
            4> '/oracle2/dbs/testtbs.dbf';
       Statement processed.
 
       The above command creates a new datafile on a different device. The file
       is created using information, stored in the control file, from the 
       original file. The command implicitly renames the filename in the 
       control file.
   
       NOTE: IT IS VERY IMPORTANT TO SPECIFY THE CORRECT FILENAME WHEN
             RECREATING THE LOST DATAFILE. IF YOU SPECIFY AN EXISTING
             ORACLE DATAFILE, THAT DATAFILE WILL BE INITIALISED AND WILL
             ITSELF REQUIRE RECOVERY.
 
 4. Recover the database.
 
    SQLDBA> recover database
    ORA-00279: Change 6677 generated at 06/03/97 15:20:24 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000074.arc
    ORA-00280: Change 6677 for thread 1 is in sequence #74
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
    
    At this point the recovery procedure will wait for the user to supply the
    information requested regarding the name and location of the archived log
    files. For example, entering AUTO directs Oracle to apply the suggested 
    redo log and any others that it requires to recover the datafiles.
 
    Applying suggested logfile...
    Log applied.
              :
              :
    <Application of further redo logs>
              :
              :
    Media recovery complete.
 
 5. Open the database
 
    SQLDBA> alter database open;
    Statement processed.
 
 
 
B. Old controlfile, no backup of datafile (Oracle release 7.3.x and above)
   =======================================================================
 
 A valid (either hot or cold) backup of the datafiles exists, except for the
 datafile created since the backup was taken. The controlfile is a backup from
 before the creation of the new datafile. The database is in archivelog mode 
 (see note (c) at bottom of page).
 
 NOTE : 'svrmgrl' has been replaced by SQL*Plus starting from Oracle8i,
        so the 'SVRMGR>' prompt is then replaced by 'SQL>'
 
 1. Restore the datafiles (those that have been lost or damaged) from the 
    last hot or cold backup. Also restore the old copy of the controlfile.
    The current online redo logs must be intact.
 
 2. Mount the database
 
 3. Start media recovery, specifying backup controlfile
 
    SVRMGR> recover database using backup controlfile
    ORA-00279: Change 6677 generated at 06/03/97 15:20:24 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000074.arc
    ORA-00280: Change 6677 for thread 1 is in sequence #74
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
    At this point, apply the archived logs as requested. Eventually Oracle
    will encounter redo to be applied to the non-existent datafile. The 
    recovery session will exit with the following message, and will return
    the user to the Server Manager prompt:
 
    ORA-00283: Recovery session canceled due to errors
    ORA-01244: unnamed datafile(s) added to controlfile by media recovery
    ORA-01110: data file 5: '/oracle/dbs/testtbs.dbf'
 
 4. Recreate the missing datafile. To do this, select the relevant filename 
    from v$datafile:
 
    SVRMGR> select name from v$datafile where file#=5;
    NAME
    -------------------------------------------------------
    UNNAMED0005
 
    Now recreate the file:
 
    SVRMGR> alter database create datafile
         2> 'UNNAMED0005'
         3> as
         4> '/oracle/dbs/testtbs.dbf';
 
 
 
 5. Restart recovery
 
    SVRMGR> recover database using backup controlfile
    ORA-00279: Change 6747 generated at 09/24/97 16:57:18 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000079.arc
    ORA-00280: Change 6747 for thread 1 is in sequence #79
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
 
    Apply archived logs as requested. Prior to Oracle8, recovery must apply
    the complete log which was current at the time of the datafile creation
    (in the above example, this would be log sequence 79). A recovery to a
    point in time before the end of this log would result in errors:
 
    ORA-01196: file 1 is inconsistent due to a failed media recovery session
    ORA-01110: data file 1: '/oracle/dbs/systbs.dbf'
 
    If this happens, re-recover the database and ensure that the complete log
    is applied (plus any further redo if required). This limitation does
    not exist from Oracle 8.0+.
 
    Eventually, Oracle will request the archived log corresponding to the 
    current online log. It does this because the (backup) controlfile has no 
    knowledge of the current log sequence. If an attempt is made to apply the 
    suggested log, the recovery session will exit with the following message:
 
    ORA-00308: cannot open archived log '/oracle/dbs/arch/arch000084.arc'
    ORA-07360: sfifi: stat error, unable to obtain information about file.
    SVR4 Error: 2: No such file or directory
 
    At this stage, simply restart the recovery session and apply the current
    online log. The best way to do this is to try applying the online redo 
    logs one by one until Oracle completes media recovery:
 
    SVRMGR> recover database using backup controlfile
    ORA-00279: Change 6763 generated at 09/24/97 16:57:59 needed for thread 1
    ORA-00289: Suggestion : /oracle/dbs/arch/arch000084.arc
    ORA-00280: Change 6763 for thread 1 is in sequence #84
    Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
    /oracle/dbs/log2.dbf
    Log applied.
    Media recovery complete.
 
 6. Open the database
 
    SVRMGR> alter database open resetlogs;
 
    The resetlogs option must be chosen to resynchronize the controlfile. 
 
    
NOTES:
======
 
a) These techniques can be used whether the database was closed either 
   cleanly or uncleanly (aborted).
 
b) If the database is recovered using an incomplete recovery technique (either
   time-based, cancel-based, or change-based), and is recovered to a point in
   time before the datafile was originally created, any references to that
   datafile will be removed from the database when the database is opened.
 
   Oracle handles this situation as follows:
 
   - The 'alter database create datafile....' command creates a reference in 
     the controlfile for the datafile.
   - Incomplete recovery terminates before applying redo that would create a
     corresponding row for the datafile in the file$ dictionary table.
   - When the database is opened, Oracle detects an inconsistency between file$
     and the controlfile and resolves in favor of file$, deleting the entry
     from the controlfile. 
 
c) It may be possible to recover the datafile using this technique even if the
   database is not in archivelog mode. However, this relies on the required 
   redo being available in the online redo logs.
   

Recreating a missing Oracle datafile with no backups

GOAL
How to recreate a datafile that is missing at the operating system level. Missing/inaccessible files may be reported with one or more of these errors:
 
ORA-01116: error in opening database file %s
ORA-27041: unable to open file
ORA-01157: cannot identify/lock data file %s - see DBWR trace file
ORA-01119: error in creating database file '%s'
 
 
No backup or copy of the datafile is required. We only need the redo logs starting from the time of the datafile creation to the current point in time.
 
Note: this scenario does not apply to plugged-in datafiles; such files need to be plugged in again from their source.
 
SOLUTION
When a datafile goes missing at the operating system level, you would normally need to restore and recover it from a backup. If you do not have backups of this datafile, but do have redo logs you can still create and recover the datafile. You only need the redo logs starting from the datafile creation time to now.
 
Prior to 10g, you would use the following SQL commands:
 
SQL> alter database create datafile '<missing name>' as '<missing name>';
SQL> recover datafile '<missing name>';
SQL> alter database datafile '<missing name>' online;
As of 10g, you can also do this in RMAN.
 
1) RMAN will create the datafile if there are no backups or copies of this datafile:
 
 
 
RMAN> restore datafile <missing file id>;
 
 
2) Recover the newly created datafile:
 
RMAN> recover datafile <missing file id>;
 
 
3) Bring it online:
 
RMAN> sql 'alter database datafile <missing file id> online';
 
 
 
Example:
 
RMAN> list copy of datafile 6;
 
specification does not match any datafile copy in the repository
 
RMAN> list backup of datafile 6;
 
specification does not match any backup in the repository
 
RMAN> restore datafile 6;
 
Starting restore at 14 JUL 10 10:20:02
using channel ORA_DISK_1
 
creating datafile file number=6 name=/opt/app/oracle/oradata/ORA112/datafile/o1_mf_leng_ts_63t08t64_.dbf
restore not done; all files read only, offline, or already restored
Finished restore at 14 JUL 10 10:20:05
 
RMAN> recover datafile 6;
 
Starting recover at 14 JUL 10 10:21:02
using channel ORA_DISK_1
 
starting media recovery
media recovery complete, elapsed time: 00:00:00
 
Finished recover at 14 JUL 10 10:21:02
 
RMAN> sql 'alter database datafile 6 online';
 
sql statement: alter database datafile 6 online

How to Recover from a Lost or Deleted Oracle Datafile with Different Scenarios

PURPOSE
This article explains the various scenarios for ORA-01157 and how to avoid them.
 
 
SCOPE & APPLICATION
 
This article is intended for Oracle Support Analysts, Oracle Consultants and
Database Administrators.
 
 
TROUBLESHOOTING STEPS
How to Recover from a Lost Datafile in Different Scenarios
 
In the event of a lost datafile or when the file cannot be accessed an ORA-01157
is reported followed by ORA-01110.
 
Besides this, you may encounter error ORA-07360 : sfifi: stat error, unable to
obtain information about file. A DBWR trace file is also generated in the
background_dump_dest directory. An attempt to shut down the database in normal
or immediate mode will result in ORA-01116, ORA-01110 and possibly ORA-07368.
 
This article discusses various scenarios that may be causing this error and the  
solution/workaround for these.
 
Throughout this note we refer to "backups" but if you have a valid physical standby database
you may also use the standby database's datafiles to recover the primary database.
 
 
Datafile not found by Oracle
 
- Unintentionally renamed or moved at the Operating System (OS) level.
  Simply restore the file to its original location and recover it
 
- Intentionally moved/renamed at OS level.
  You are re-organising the datafile layout across various disks at the OS.
  After moving/renaming the file you will have to rename the file at database
  level, and recover it.
 
Note:115424.1 How to Rename or Move Datafiles and Logfiles
 
 
Datafile damaged/deleted
 
If the file is damaged or deleted, an attempt to start the database will result
in ORA-01157 and ORA-01110. Then, depending upon the type of datafile lost,
different action needs to be taken. Check for a faulty hard disk: the file may
have become corrupt due to a failing disk. Replace the bad disk or create the
file on a healthy disk.
 
Lost datafile could be in one of the following:
 
1. Temporary tablespace
  
   If the datafile belongs to a temporary tablespace, you will have to simply offline
   drop the datafile and then drop the tablespace with including contents option.
   Thereafter, re-create the temporary tablespace.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
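 
   A sketch of recreating a lost dictionary-managed temporary tablespace (names
   and sizes are illustrative; on 8i and above a locally managed temporary
   tablespace would instead be recreated with CREATE TEMPORARY TABLESPACE ...
   TEMPFILE):
 
   SQL> alter database datafile '/oracle/oradata/temp01.dbf' offline drop;
   SQL> alter database open;
   SQL> drop tablespace temp including contents;
   SQL> create tablespace temp datafile '/oracle/oradata/temp01.dbf' size 100M
        reuse temporary;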
 
 
2. Read Only Tablespace
  
   In this case you will have to restore the most recent backup of the read-only
   datafile. No media recovery is required as read-only tablespaces are not
   modified. Note however that media recovery will be required under the following conditions:
 
   a. The tablespace was in read-write mode when the last backup was taken
      and was made read-only afterwards.
 
   b. The tablespace was in read-only mode when last backup was taken and
      was made read-write in between and then again made read only
 
   In either of the above cases you will have to restore the file and do a media
   recovery using RECOVER DATAFILE statement. Apply all the necessary archived redo
   logs until you get the message "Media Recovery Complete".
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
 
3. User Tablespace
  
   Two options are available:
 
   a. Recreate the user tablespace.
      If all the objects in the tablespace can be re-created (recent export is
      available; tables can be re-populated using scripts; SQL*Loader etc)
      Then, offline drop the datafile, drop the tablespace with including
      contents option. Thereafter, re-create the tablespace and re-create
      the objects in it.
 
   b. Restore file from backup and do a media recovery.
       Database has to be in archivelog mode. If the database is in NOARCHIVELOG
      mode, you will only succeed in recovering the datafile if the redo to be
      applied to it is within the range of your online redo logs.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
 
4. Index Tablespace
  
   Two options are available:
 
   a. Recreate the Index tablespace
      If the index can be easily re-created using script or manual CREATE INDEX
       statement, then the best option is to offline drop the datafile, drop the
       index tablespace, and re-create it along with all its indexes.
 
   b. Restore file from backup and do a media recovery.
      If the index tablespace cannot be easily re-created, then restore the
      lost datafile from a valid backup and then do a media recovery on it.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
5. System (and/or Sysaux) Tablespace
  
   a. Restore from a valid backup and perform a media recovery on it
 
   b. Rebuild the database.
      If neither backup of the datafile nor the full database backup is
      available, then rebuild database using full export, user level/table
      level export, scripts, SQL*Loader, standby etc. to re-create and
      re-populate the database.
 
   Note.184327.1 Common Causes and Solutions on ORA-1157 Error Found in Backup & Recovery
 
 
6. Undo Tablespace
  
   While handling situation with lost datafile of an undo tablespace you need to
   be extra cautious so as not to lose active transactions in the undo segments.
 
   The preferred option in this case is to restore the datafile from backup and
   perform media recovery.
 
      i.  If the database was cleanly shutdown.
          Ensure that database was cleanly shutdown in NORMAL or IMMEDIATE mode.
          Update your init file with "undo_management=manual"
          Restart the database
          Drop and recreate the undo tablespace
          Update your init file with "undo_management=auto"
          Restart the database
 
      ii. If the database was NOT cleanly shutdown.
          If the database was shutdown aborted or crashed, you may not be able to drop
          the datafile as the undo segments may contain active transactions.
          You will need to restore the file from a backup
          and perform a media recovery.
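 
   The clean-shutdown case (i) can be sketched as follows (the tablespace and
   file names are illustrative):
 
   -- init.ora: undo_management = manual
   SQL> startup mount
   SQL> alter database datafile '/oracle/oradata/undotbs01.dbf' offline drop;
   SQL> alter database open;
   SQL> drop tablespace undotbs1 including contents;
   SQL> create undo tablespace undotbs1
        datafile '/oracle/oradata/undotbs01.dbf' size 500M reuse;
   -- init.ora: undo_management = auto, then restart the database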
 
7. Lost Controlfiles and Online Redo Logs
  
   If the datafiles are in a consistent state, not needing media recovery, but
   you have lost all the controlfiles and online redo logs, then attempting to
   create the controlfile using a script will fail with complaints about the
   missing redo logs. In this case use the RESETLOGS option of the create
   controlfile script and then open the database with the RESETLOGS option.
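 
   A sketch of the RESETLOGS variant of the creation script (the database name,
   file names and sizes are illustrative; the log files listed here will be
   recreated when the database is opened with RESETLOGS):
 
   CREATE CONTROLFILE REUSE DATABASE "PROD" RESETLOGS ARCHIVELOG
       LOGFILE GROUP 1 '/oracle/oradata/redo01.log' SIZE 50M,
               GROUP 2 '/oracle/oradata/redo02.log' SIZE 50M
       DATAFILE '/oracle/oradata/system01.dbf';
   ALTER DATABASE OPEN RESETLOGS;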
 
 
8. Lost datafile and no backup
  
   If there are no backups of the lost datafile, then you can re-create the
   datafile with the same size as the original file and then apply all the
   archived redologs written since the original datafile was created to the new
   version of the lost datafile.
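
   The approach can be sketched as follows (the path is hypothetical; this works
   only if every archived redolog generated since the original file was created
   is still available):

```sql
-- Recreate the lost file as a new, empty version of itself
ALTER DATABASE CREATE DATAFILE '/<disk_path>/users02.dbf'
  AS '/<disk_path>/users02.dbf';

-- Apply all redo generated since the original file was created
RECOVER DATAFILE '/<disk_path>/users02.dbf';

ALTER DATABASE DATAFILE '/<disk_path>/users02.dbf' ONLINE;
```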
 
 
Note: Restore and recovery from backup should be the first and preferred
         option for cases 2 - 6.
 
   Note:1060605.6 Lost datafile and no backup.

Oracle ORA-1157 Troubleshooting

PURPOSE
This note is intended to list the common reasons and solutions for the ORA-1157 error.
 
SCOPE
NOTE: In the images and/or the document content below, the user information and environment data used represents fictitious data from the Oracle sample schema(s), Public Documentation delivered with an Oracle database product or other training material. Any similarity to actual environments, actual persons, living or dead, is purely coincidental and not intended in any manner.
 
 
This article is intended for Oracle Support and Oracle database administrators.
 
DETAILS
Oracle Error: ORA-1157
An ORA-01157 is issued whenever Oracle attempts to access a file but cannot find or lock the file.
Error Explanation:
 
01157, 00000, "cannot identify/lock data file %s - see DBWR trace file"
 
Cause: The background process was either unable to find one of the data files or failed to lock it because the file was already in use. The database will prohibit access to this file but other files will be unaffected. However the first instance to open the database will need to access all online data files. Accompanying error from the operating system describes why the file could not be identified.
 
Action: Have operating system make file available to database. Then either open the database or do ALTER SYSTEM CHECK DATAFILES.
ORA-01157 errors are usually followed by ORA-01110 and possibly an Oracle operating system layer error such as ORA-07360. A DBWR trace file is generated in the background_dump_dest directory.
For Example, on Solaris platform, the following errors will appear: 
 
ORA-01157: cannot identify/lock data file 19 - see DBWR trace file
ORA-01110: data file 19: '/<disk_path>/users02.dbf'
From the DBWR trace file:
 
ORA-01157: cannot identify/lock data file 19 - see DBWR trace file
ORA-01110: data file 19: '/<disk_path>/users02.dbf'
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory
Additional information: 3
 
Common Causes and Solutions for ORA-1157
 Note: Throughout this note we refer to "backups" but if you have a valid physical standby database
you may also use the standby database's datafiles to recover the primary database.
 
1.  The datafile does exist, but Oracle cannot find it.
 
The datafile may have been renamed at the operating system level, moved to a different directory or disk drive either intentionally or unintentionally.
 
In this case, restore and recover the datafile or move the datafile back to its original name and location.
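
If the file was merely moved or renamed at the OS level, repointing the controlfile is often enough. A sketch (paths are hypothetical):

```sql
-- With the database mounted:
ALTER DATABASE RENAME FILE '/<old_path>/users02.dbf'
  TO '/<new_path>/users02.dbf';

ALTER DATABASE OPEN;
```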
 
2.  The datafile does not exist or is unusable by Oracle. The datafile has been physically removed or damaged to an extent that Oracle cannot recognize it anymore.
 
For example, the datafile might be truncated or overwritten, in which case
ORA-27046 will accompany ORA-1157 error.
 
For example:
 
ORA-27046: file size is not a multiple of logical block size
 
In this case, the user has two options:
 
A. Recreate the tablespace that the datafile belongs to.
 
 
This option is best suited for TEMPORARY tablespaces (since they do not contain permanent data), and can also be used for USERS and INDEX tablespaces.
 
It is also an option for UNDO tablespaces if the database had been SHUTDOWN CLEANLY, so that no active transactions remain in the rollback segments of this tablespace.
 
If the tablespace is the SYSTEM tablespace, then this amounts to recreating or rebuilding the database.
 
This method is most helpful when reasonably recent exports of the objects in the tablespace are available, or when the tables in the tablespace can be repopulated by running a script or program, loading the data through SQL*Loader, etc.
 
The steps involved are:
 
1. If the database is down, mount it.
 
STARTUP MOUNT;
 
2. Offline drop the datafile.
 
ALTER DATABASE DATAFILE 'full_path_file_name' OFFLINE DROP;
 
3. If the database is at mount, open it.
 
ALTER DATABASE OPEN;
 
4. Drop the user tablespace.
 
DROP TABLESPACE tablespace_name INCLUDING CONTENTS;
 
Note: You can stop at this step if you no longer want the
tablespace in the database.
 
5. Recreate the tablespace.
 
CREATE TABLESPACE tablespace_name DATAFILE 'datafile_full_path_name' SIZE required_size;
 
6. Recreate all the previously existing objects in the tablespace.
 
This can be done using the creation scripts for the objects in that tablespace or using the recent export dump available for that tablespace objects.
 
 
B. Recover the datafile using normal recovery procedures.
 
This option is best suited for READ ONLY tablespaces and for USERS, INDEX tablespaces where recreating is not a feasible option.
 
If the tablespace is of type UNDO, then this is the method to be used if the database was not SHUTDOWN CLEANLY
(that is, if SHUTDOWN ABORT had been used or the database had crashed).
 
If the tablespace is SYSTEM, then this is the recommended method, provided backups and archivelogs are available. If the database is in
NOARCHIVELOG mode, then you can recover only if the required changes are present in the ONLINE redologs.
 
In many situations, recreating the user tablespace is impossible or too laborious. The solution then is to restore the lost datafile from a backup
and do media recovery on it. If the database is in NOARCHIVELOG mode, you will only succeed in recovering the datafile if the redo to be applied
to the datafile is within the range of the online logs.
 
This method would be ideal for READ ONLY tablespaces. If the tablespace was not switched to READ-WRITE after backup was taken and if the tablespace was
READ ONLY at the time of backup, then recovery is just restoring the backup of this tablespace.
 
These are the steps:
 
1. Restore the lost file from a backup.
 
2. If the database is down, mount it. 
 
STARTUP MOUNT;
 
3. Issue the following query:
 
SELECT V1.GROUP#, MEMBER, SEQUENCE#, FIRST_CHANGE#
FROM V$LOG V1, V$LOGFILE V2
WHERE V1.GROUP# = V2.GROUP#;
 
This will list all your online redolog files and their respective sequence and first change numbers.
 
4. If the database is in NOARCHIVELOG mode, issue the query:
 
SELECT FILE#, CHANGE# FROM V$RECOVER_FILE;
 
If the CHANGE# is GREATER than the minimum FIRST_CHANGE# of your logs, the datafile can be recovered. Just keep in mind that all the logs to be
applied will be online logs, and move on to step 5.
 
If the CHANGE# is LESS than the minimum FIRST_CHANGE# of your logs, the file cannot be recovered. Your options at this point would be to restore
the most recent full backup (and thus lose all changes to the database since) or recreate the tablespace as explained in scenario a.
 
 
5. Recover the datafile: 
 
RECOVER DATAFILE 'full_path_file_name' ;
 
6. Confirm each of the logs that you are prompted for until you receive the message "Media Recovery Complete". If you are prompted for a non-existing
archived log, Oracle probably needs one or more of the online logs to proceed with the recovery. Compare the sequence number referenced in the
ORA-280 message with the sequence numbers of your online logs. Then enter the full path name of one of the members of the redo group whose sequence
number matches the one you are being asked for. Keep entering online logs as requested until you receive the message "Media Recovery Complete" .
 
7. If the database is at mount, open it:
 
ALTER DATABASE OPEN;
 
Operating Systems (OS) Tempfiles missing:
 
When using TEMPORARY tablespaces with tempfiles, the absence of the tempfile at the OS level can cause ORA-1157. Since Oracle does not checkpoint tempfiles, the database can be opened even with missing tempfiles.
 
The solution in this case would be to drop the logical tempfile and add a new one.
 
For example:
 
select * from dba_objects order by object_name;
*
ERROR at line 1:
 
ORA-01157: cannot identify/lock data file 1026 - see DBWR trace file
ORA-01110: data file 1026: '/<disk_path>/temp2_01.tmp'
Solution:
 
alter database tempfile '/<disk_path>/temp2_01.tmp' drop;
 
select tablespace_name, file_name from dba_temp_files;
 
alter tablespace temp2 add tempfile '/<disk_path>/temp2_01.tmp' size 5m;
 
 
 
ORA-1157 due to OS issues/3rd party software
1. When trying to access Quick I/O files with vxfddstat or other applications, you may get an error message similar to "Cannot open file".
 
Oracle may return an error message similar to:
 
 
ORA-01157: cannot identify data file 1 - file not found
ORA-01110: data file 1: '/<disk_path>/system01.dbf'
The users need to contact Veritas support in this case. To access their support site, point your web browser to:
 
 
Click on: 'Product Listing'
Click on: 'File System for UNIX'
Enter 'Oracle' and click 'Search' to view relevant information from their Knowledge Base.
 
2. It is possible to get this error on HP if the kernel parameter nflocks is not set high enough. This might prevent Oracle from locking the required datafiles. Controlfile recreation might fail with ORA-27041 and ORA-1157 for the same reasons.
 
There may be more errors in the trace files in dump directory such as :
 
ORA-27086: skgfglk: unable to lock file - already in use
OR
 
ORA-01157: cannot identify/lock data file 263 - see DBWR trace file
ORA-01110: data file 263: '/<disk_path>/system01.dbf'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
OR
 
 
ORA-07445: exception encountered: core dump [%s] [%s] [%s] [%s] [%s] [%s]
ORA-01110: data file %s: '%s'
ORA-01242: data file suffered media failure: database in NOARCHIVELOG mode
ORA-01115: IO error reading block from file %s (block # %s)
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 3
To resolve these issues, increase the relevant kernel parameters on HP. The recommended settings would be:
 
nproc 4096 Max Number of Processes
nfile 63488 Max Number of Open Files
nflocks 4096 Max Number of File Locks
 
Refer to the OS Installation Guide for more information on these parameters.
 
To resolve the ORA-27041 'File table overflow' on Solaris, experience shows the system settings below resolved the issue:
 
/etc/system parameters
set vxfs:vxfs_ninode=692416
set ncsize=519312
 
Please note that tuning these parameters is outside the scope of Oracle Support.
For further info refer to Sun document available on sunsolve.sun.com (76671) which explains ncsize tuning.
And, tuning vxfs:vxfs_ninode is Veritas-specific, please post questions directly to Veritas.
 
3. It is possible to get ORA-1157, if the datafiles that Oracle requires are locked by some other process, for example a backup software might be locking
the files for backup.
 
On MS Windows, the user can get any of the following errors:
 
ORA-01157: signalled during alter database open
ORA-01157: can not identify datafile
ORA-01110: data file 10: '<disk_path>\index01.dbf'
ORA-27047: Unable to read header of file
OSD-04006: Read file failure
Error 33: process can not access file
The operating system error 33 is an error_lock_violation indicating that a portion of the data file is locked by another NT process.
 
OR
 
ORA-1157 - cannot identify datafile - file not found
ORA-1110 - datafile 11 '<disk_path>\index02.dbf'
ORA-9202 - sfifi: error identifying file
OSD-4006(OS 203) - The System could not find the environment option that was entered
 
 
Other error combinations that may show up in the alert log include:
 
ORA-1115 - IO error reading block from file %s (block # %s)
ORA-1110 - datafile 11 '<disk_path>\index02.dbf'
ORA-9206 - sfrfb: error reading from file
OSD-4006(OS 203) - The System Could not find the environment option that was entered
 
 
OR
 
ORA-1242 - data file suffered media failure: database in NOARCHIVELOG mode
ORA-1114 - IO error writing block to file <name> block #
ORA-9205 - sfqio: error reading or writing to disk
OSD-4016(OS 33) - The process cannot access the file because another process has locked a portion of the file.
 
Additionally, the following errors will appear:
 
KCF: write/open error dba=0x703473d block=0x3473d online=1
file=7 /<disk_path>/users01.dbf
error=9211 txt: 'OSD-4008 : WriteFile error (OS 203) - The System Could not find the environment option that was entered
 
In some cases, the alert log may also show errors such as:
 
Instance terminating due to error 1110.
Instance terminated by background process PID=<pid>
 
OR
 
background process TERMINATING INSTANCE DUE TO ERROR 472
 
ORA 472 - PMON process terminated with error
 
The following events may also be reported in the Microsoft Windows event viewer:
 
23 Error ReadFile() failure.
25 Error WriteFile() failure.
 
If this is a cold backup, wait until the backup is done and then
start up the database, or end the backup and start up the database.
 
Alternatively, the solution is that the backup software should be configured
such that it does not lock open files.
 
Refer to the BACKUP software documentation for information on how to do this.
 
For example, the Seagate Backup Exec Software, should be configured with
'Read and Not lock' option to take online backups.
 
This is true on some Unix platforms also.
For example, the IBM AIX ADSM backup utility can fall over before a successful run was performed, thereby holding the lock on Oracle datafiles.
 
The solution in this case would be to clear the lock on the datafile manually.
 
i. Run $ ps -ef | grep <SID> - look for an existing process on a datafile
 
ii. Do $ kill -9 on the process id
 
 
 
4. ORA-1157 is possible if the Oracle datafiles were copied to a directory using the File Manager on Windows.
This is true if the filenames are longer than the usual 8.3 format,
that is, file names longer than 8 characters or extensions longer than 3 characters.
 
To avoid this problem when copying files on Windows 95/98, DO NOT use File Manager. If the files have long file names (e.g. longer than the
standard 8.3 file name convention), File Manager will rename the files with an 8.3 file name.
 
To preserve long filenames, Explorer should be used to copy files. If File Manager was already used and the files have a tilde (~) in the file name, the
files need to be renamed to their original names.
 
5. ORA-1157 is possible when using a Network Appliance. Network Appliance acquires locks on the datafiles for certain operations.
These locks may be retained by the Network Appliance due to an instance or host machine failure. In these cases, the locks must be manually released from the Network
Appliance by the System Administrator.
 
The command for this would be:
 
As root on the netapp, from the prompt:
 
rc_toggle_basic
 
sm_mon -l <hostname>
 
The hostname being referenced in the above command should be the machine name where the Oracle instance is running from.
 
6. ORA-1157 is possible if Oracle files were restored as a different user.
 
After restoring a datafile and issuing RECOVER DATAFILE, the error ORA-1157 (cannot identify datafile - file not found) can occur. Oracle does not appear
to recognize the restored datafile despite the fact that:
 
- The datafile exists at the OS.
- select * from v$datafile shows the correct path for the datafile.
- "alter system check datafiles" is successful.
- backup control file to trace shows full path to datafile is correct.
 
In this case, it could be a permissions problem at the OS level.
 
Check the permissions of the datafiles. When the datafile was restored, it may have been done by a user other than Oracle, such as root.
In this case, the file can be seen by Oracle, but not accessed: Oracle cannot read and write the datafile because it does not own the file.
Changing the ownership of the file to the Oracle user will allow the recovery to succeed.
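
A quick pre-check along these lines can save a failed recovery attempt (a sketch; the helper name and demo path are made up):

```shell
# Hypothetical helper: confirm the current OS user can both read and
# write a restored datafile before attempting RECOVER DATAFILE on it.
check_datafile_access() {
  f="$1"
  if [ -r "$f" ] && [ -w "$f" ]; then
    echo "OK: $f is readable and writable"
  else
    echo "FAIL: $f - fix ownership, e.g.: chown oracle:oinstall $f"
  fi
}

# Demo on a scratch file standing in for the restored datafile
touch /tmp/demo_users02.dbf
check_datafile_access /tmp/demo_users02.dbf   # prints the OK line
rm -f /tmp/demo_users02.dbf
```

Run it as the oracle OS user so the check reflects the permissions Oracle itself will see.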
 
 
ORA-1157 - Other possibilities
1. Corrupt controlfile can cause ORA-1157
 
Users can get ORA-1157 in some extreme cases, where, the controlfile becomes corrupt.
 
One possible type of corruption that can result in these errors is having a trailing blank after the filename in the control file. This can be seen by
examining a text version of the control file produced using the 'ALTER DATABASE BACKUP CONTROLFILE TO TRACE' command.
(The trace file will be placed in the location specified by the user_dump_dest initialization parameter).
 
Example:
--------
 
'/<disk_path>/index1.dbf ' -- corrupt
'/<disk_path>/index02.dbf' -- non-corrupt
 
In this case, the solution would be to try using the other controlfile copies (if they are not corrupt) by changing the CONTROL_FILES parameter in
the init.ora to exclude the corrupt control file. If all the control files are corrupt, then the controlfile can be recreated.
It is also possible to have other 'unexpected' control characters embedded in the control file.
 
For example:
 
"^J" may be present in the datafile name.
 
The above solution will hold good in this case also.
 
 
2. Check if any tablespace/datafile was added on the primary after the standby was set up.
 
If so, and the datafiles were not automatically created on the standby site, create the datafile(s) on the standby database manually. Once the files exist, recovery can continue.
 
For example:
 
The redo does not create a new datafile for you. The create datafile command, from startup mount, is:
 
alter database create datafile 'datafile_full_path_name';
 
 
3. RMAN restore can produce 'fake' ORA-1157 errors in the alert.log
 
During an RMAN restore operation, it is possible to encounter the following error(s) in the alert log:
 
ORA-01157: cannot identify/lock data file N - see DBWR trace file
ORA-01110: data file N: 'filename'
ORA-27037: unable to obtain file status
SVR4 Error: 2: No such file or directory
This problem occurs if RMAN is restoring datafiles that have been removed from disk prior to the restore operation.
 
These errors (which occur multiple times for each affected datafile) can be a source of concern and confusion for DBAs. However, they are entirely
expected and, apart from monitoring the size of the alert log (and archiving it if necessary), are nothing to worry about.
 
4. ORA-1157 with newly installed seed database.
 
If a seed database was used, Oracle will copy the datafiles from the CD to disk and try to start up the database. But if the datafiles on the CD are corrupt,
that is, if the CD is damaged, this can result in an ORA-1157 error.
 
The solution would be to use a new set of CDs or to create the database using manual creation scripts.

Troubleshoot Oracle clusterware Grid Infrastructure Startup Issues

PURPOSE
 
This note is to provide reference to troubleshoot 11gR2 and 12c Grid Infrastructure clusterware startup issues. It applies to issues in both new environments (during root.sh or rootupgrade.sh) and unhealthy existing environments.  To look specifically at root.sh issues, see note 1053970.1 for more information. 
 
SCOPE
This document is intended for Clusterware/RAC Database Administrators and Oracle support engineers. 
 
DETAILS
Start up sequence:
In a nutshell, the operating system starts ohasd, ohasd starts agents to start up daemons (gipcd, mdnsd, gpnpd, ctssd, ocssd, crsd, evmd asm etc), and crsd starts agents that start user resources (database, SCAN, listener etc).
 
For detailed Grid Infrastructure clusterware startup sequence, please refer to note 1053147.1
 
 
Cluster status
 
To find out cluster and daemon status:
 
$GRID_HOME/bin/crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
 
$GRID_HOME/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       rac1                  Started
ora.crsd
      1        ONLINE  ONLINE       rac1
ora.cssd
      1        ONLINE  ONLINE       rac1
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  ONLINE       rac1                  OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       rac1
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  ONLINE       rac1
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1
 
For 11.2.0.2 and above, there will be two more processes:
 
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       rac1
ora.crf
      1        ONLINE  ONLINE       rac1
For 11.2.0.3 onward in non-Exadata, ora.diskmon will be offline:
 
ora.diskmon
      1        OFFLINE  OFFLINE       rac1
 
For 12c onward, ora.storage is introduced: 
 
ora.storage
1 ONLINE ONLINE racnode1 STABLE
 
 
 
To start an offline daemon - if ora.crsd is OFFLINE:
 
$GRID_HOME/bin/crsctl start res ora.crsd -init
 
 
Case 1: OHASD does not start
 
As ohasd.bin is responsible for starting up all other clusterware processes directly or indirectly, it needs to start up properly for the rest of the stack to come up. If ohasd.bin is not up, when checking its status, CRS-4639 (Could not contact Oracle High Availability Services) will be reported; if ohasd.bin is already up, CRS-4640 will be reported when another startup attempt is made; and if it fails to start, the following will be reported:
 
 
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
 
 
Automatic ohasd.bin start up depends on the following:
 
1. OS is at appropriate run level:
 
The OS needs to be at the specified run level before CRS will try to start up.
 
To find out at which run level the clusterware needs to come up:
 
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd will be configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up. Oracle Linux 7 (and Red Hat Linux 7) uses systemd to manage start/stop services (example: /etc/systemd/system/oracle-ohasd.service).
 
The above example shows that CRS is supposed to run at run levels 3 and 5; please note that, depending on the platform, CRS comes up at different run levels.
 
To find out current run level:
 
who -r
 
 
2. "init.ohasd run" is up
 
On Linux/UNIX, as "init.ohasd run" is configured in /etc/inittab, the init process (pid 1, /sbin/init on Linux, Solaris and HP-UX, /usr/sbin/init on AIX) will start and respawn "init.ohasd run" if it fails. Without "init.ohasd run" up and running, ohasd.bin will not start:
 
 
ps -ef|grep init.ohasd|grep -v grep
root      2279     1  0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
Note: Oracle Linux 6 (OL6) or Red Hat Linux 6 (RHEL6) has deprecated inittab; instead, init.ohasd will be configured via upstart in /etc/init/oracle-ohasd.conf, however, the process "/etc/init.d/init.ohasd run" should still be up.
 
If any rc Snn script (located in rcn.d, for example S98gcstartup) is stuck, the init process may not start "/etc/init.d/init.ohasd run"; please engage the OS vendor to find out why the relevant Snn script is stuck.
 
The error "[ohasd(<pid>)] CRS-0715:Oracle High Availability Service has timed out waiting for init.ohasd to be started." may be reported if init.ohasd fails to start on time.
 
If the SA cannot identify the reason why init.ohasd is not starting, the following can be used as a very short-term workaround:
 
 cd <location-of-init.ohasd>
 nohup ./init.ohasd run &
 
 
3. Clusterware auto start is enabled - it's enabled by default
 
By default CRS is enabled for auto start upon node reboot; to enable it:
 
$GRID_HOME/bin/crsctl enable crs
 
To verify whether it's currently enabled or not:
 
$GRID_HOME/bin/crsctl config crs
 
If the following is in OS messages file
 
Feb 29 16:20:36 racnode1 logger: Oracle Cluster Ready Services startup disabled.
Feb 29 16:20:36 racnode1 logger: Could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr
 
The reason is that the file does not exist or is not accessible; the cause can be that someone modified it manually, or that the wrong opatch was used to apply a GI patch (i.e. opatch for Solaris x64 used to apply a patch on Linux).
 
 
 
 
4. syslogd is up and OS is able to execute init script S96ohasd
 
The OS may get stuck with some other Snn script while the node is coming up, and thus never get the chance to execute S96ohasd; if that's the case, the following message will not be in the OS messages file:
 
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
 
If you don't see the above message, the other possibility is that syslogd (/usr/sbin/syslogd) is not fully up. Grid may fail to come up in that case as well. This may not apply to AIX.
 
To find out whether the OS is able to execute S96ohasd while the node is coming up, modify S96ohasd:
 
From:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
To:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/touch /tmp/ohasd.start."`date`"
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
After a node reboot, if you don't see /tmp/ohasd.start.timestamp get created, it means the OS got stuck with some other Snn script. If you do see /tmp/ohasd.start.timestamp but not "Oracle HA daemon is enabled for autostart" in messages, likely syslogd is not fully up. In both cases, you will need to engage the System Administrator to find the issue at OS level. For the latter case, a workaround is to "sleep" for about 2 minutes; modify ohasd:
 
From:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
To:
 
    case `$CAT $AUTOSTARTFILE` in
      enable*)
        /bin/sleep 120
        $LOGERR "Oracle HA daemon is enabled for autostart."
 
5. The file system on which GRID_HOME resides is online when init script S96ohasd is executed; once S96ohasd has executed, the following messages should be in the OS messages file:
 
Jan 20 20:46:51 rac1 logger: Oracle HA daemon is enabled for autostart.
..
Jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -I/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin "reboot"
 
 
If you see the first line, but not the last line, likely the filesystem containing the GRID_HOME was not online when S96ohasd was executed.
 
 
6. Oracle Local Registry (OLR, $GRID_HOME/cdata/${HOSTNAME}.olr) is accessible and valid
 
 
ls -l $GRID_HOME/cdata/*.olr
-rw------- 1 root  oinstall 272756736 Feb  2 18:20 rac1.olr
 
If the OLR is inaccessible or corrupted, likely ohasd.log will have similar messages like following:
 
 
..
2010-01-24 22:59:10.470: [ default][1373676464] Initializing OLR
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:6m':failed in stat OCR file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.472: [  OCROSD][1373676464]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-01-24 22:59:10.473: [  OCRRAW][1373676464]proprinit: Could not open raw device
2010-01-24 22:59:10.473: [  OCRAPI][1373676464]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [  CRSOCR][1373676464] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] OLR initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 22:59:10.474: [ default][1373676464][PANIC] OHASD exiting; Could not init OLR
 
OR
 
 
..
2010-01-24 23:01:46.275: [  OCROSD][1228334000]utread:3: Problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]propriogid:1_1: Failed to read the whole bootblock. Assumes invalid format.
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprioini: all disks are not OCR/OLR formatted
2010-01-24 23:01:46.275: [  OCRRAW][1228334000]proprinit: Could not open raw device
2010-01-24 23:01:46.275: [  OCRAPI][1228334000]a_init:16!: Backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [  CRSOCR][1228334000] OCR context init failure.  Error: PROCL-26: Error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] OLR initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]Created alert : (:OHAS00106:) :  Failed to initialize Oracle Local Registry
2010-01-24 23:01:46.277: [ default][1228334000][PANIC] OHASD exiting; Could not init OLR
 
OR
 
 
..
2010-11-07 03:00:08.932: [ default][1] Created alert : (:OHAS00102:) : OHASD is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][PANIC] OHASD exiting: must be run as privileged user
 
OR
 
 
ohasd.bin comes up but the output of "crsctl stat res -t -init" shows no resources, and "ocrconfig -local -manualbackup" fails
 
OR
 
 
..
2010-08-04 13:13:11.102: [   CRSPE][35] Resources parsed
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has been registered with the PE data model
2010-08-04 13:13:11.103: [   CRSPE][35] STARTUPCMD_REQ = false:
2010-08-04 13:13:11.103: [   CRSPE][35] Server [] has changed state from [Invalid/unitialized] to [VISIBLE]
2010-08-04 13:13:11.103: [  CRSOCR][31] Multi Write Batch processing...
2010-08-04 13:13:11.103: [ default][35] Dump State Starting ...
..
2010-08-04 13:13:11.112: [   CRSPE][35] SERVERS:
:VISIBLE:address{{Absolute|Node:0|Process:-1|Type:1}}; recovered state:VISIBLE. Assigned to no pool
 
------------- SERVER POOLS:
Free [min:0][max:-1][importance:0] NO SERVERS ASSIGNED
 
2010-08-04 13:13:11.113: [   CRSPE][35] Dumping ICE contents...:ICE operation count: 0
2010-08-04 13:13:11.113: [ default][35] Dump State Done.
 
 
The solution is to restore a good backup of OLR with "ocrconfig -local -restore <ocr_backup_name>".
By default, OLR will be backed up to $GRID_HOME/cdata/$HOST/backup_$TIME_STAMP.olr once installation is complete.
 
7. ohasd.bin is able to access network socket files:
 
 
2010-06-29 10:31:01.570: [ COMMCRS][1206901056]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))
 
2010-06-29 10:31:01.571: [  OCRSRV][1217390912]th_listen: CLSCLISTEN failed clsc_ret= 3, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2010-06-29 10:31:01.571: [  OCRSRV][3267002960]th_init: Local listener did not reach valid state
 
In a Grid Infrastructure cluster environment, ohasd related socket files should be owned by root, but in an Oracle Restart environment they should be owned by the grid user; refer to the "Network Socket File Location, Ownership and Permission" section for example output.
 
8. ohasd.bin is able to access log file location:
 
OS messages/syslog shows:
 
Feb 20 10:47:08 racnode1 OHASD[9566]: OHASD exiting; Directory /ocw/grid/log/racnode1/ohasd not found.
 
Refer to "Log File Location, Ownership and Permission" section for example output, if the expected directory is missing, create it with proper ownership and permission.
 
9. ohasd may fail to start on SUSE Linux after a node reboot; refer to note 1325718.1 - OHASD not Starting After Reboot on SLES
 
10. OHASD fails to start: "ps -ef| grep ohasd.bin" shows ohasd.bin is started, but nothing is written to $GRID_HOME/log/<node>/ohasd/ohasd.log for many minutes, and truss shows it is looping, closing file handles that were never opened:
 
 
..
15058/1:         0.1995 close(2147483646)                               Err#9 EBADF
15058/1:         0.1996 close(2147483645)                               Err#9 EBADF
..
 
Call stack of ohasd.bin from pstack shows the following:
 
_close  sclssutl_closefiledescriptors  main ..
 
The cause is bug 11834289, which is fixed in 11.2.0.3 and above. Another symptom of the bug is that clusterware processes may fail to start with the same call stack and truss output (looping on the OS call "close"). If the bug is hit while starting other resources, "CRS-5802: Unable to start the agent process" can show up as well.
 
11. Other potential causes/solutions listed in note 1069182.1 - OHASD Failed to Start: Inappropriate ioctl for device
 
12. ohasd.bin starts fine; however, "crsctl check crs" shows only the following and nothing else:
 
CRS-4638: Oracle High Availability Services is online
And "crsctl stat res -p -init" shows nothing
 
The cause is a corrupted OLR; refer to note 1193643.1 to restore it.
 
13. On EL7/OL7: note 1959008.1 - Install of Clusterware fails while running root.sh on OL7 - ohasd fails to start 
 
14. For EL7/OL7, patch 25606616 is needed: TRACKING BUG TO PROVIDE GI FIXES FOR OL7
 
15. If ohasd still fails to start, refer to ohasd.log in <grid-home>/log/<nodename>/ohasd/ohasd.log and ohasdOUT.log
 
 
 
 
Case 2: OHASD Agents do not start
 
OHASD.BIN will spawn four agents/monitors to start resources:
 
  oraagent: responsible for ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd etc
  orarootagent: responsible for ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs etc
  cssdagent / cssdmonitor: responsible for ora.cssd(for ocssd.bin) and ora.cssdmonitor(for cssdmonitor itself)
 
If ohasd.bin cannot start any of the above agents properly, the clusterware will not come to a healthy state.
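Whether the four agent/monitor processes above are running can be checked quickly. A minimal sketch (process names are taken from the list above; "pgrep -f" matches against the full command line):

```shell
# Sketch: report which ohasd-spawned agent processes are running.
check_ohasd_agents() {
    for p in oraagent.bin orarootagent.bin cssdagent cssdmonitor; do
        if pgrep -f "$p" >/dev/null 2>&1; then
            echo "$p: running"
        else
            echo "$p: NOT running"
        fi
    done
}
```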
 
1. A common cause of agent failure is that the log file or log directory for the agents does not have the proper ownership or permission.
 
Refer to below section "Log File Location, Ownership and Permission" for general reference.
 
One example: "rootcrs.pl -patch/-postpatch" was not executed during manual patching, resulting in agent start failure: 
 
2015-02-25 15:43:54.350806 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/orarootagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/orarootagent]
 
2015-02-25 15:43:54.382154 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
 
2015-02-25 15:43:54.384105 : CRSMAIN:3294918400: {0:0:2} {0:0:2} Created alert : (:CRSAGF00123:) : Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]
 
 
The solution is to execute the missed steps.
 
 
 
2. If an agent binary (oraagent.bin, orarootagent.bin etc.) is corrupted, the agent will not start, and the related resources will not come up:
 
2011-05-03 11:11:13.189
[ohasd(25303)]CRS-5828:Could not start agent '/ocw/grid/bin/orarootagent_grid'. Details at (:CRSAGF00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.
 
 
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Created alert : (:CRSAGF00130:) :  Failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Agfw Proxy Server sending the last reply to PE for message:RESOURCE_START[ora.diskmon 1 1] ID 4098:403
2011-05-03 12:03:17.491: [    AGFW][1117866336] {0:0:184} Can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} Fatal Error from AGFW Proxy: Unable to start the agent process
2011-05-03 12:03:17.492: [   CRSPE][1128372576] {0:0:184} CRS-2674: Start of 'ora.diskmon' on 'racnode1' failed
 
..
 
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdagent Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00126:) :  Agent start failed
..
2011-06-27 22:34:57.806: [    AGFW][1131669824] {0:0:2} Created alert : (:CRSAGF00123:) :  Failed to start the agent process: /ocw/grid/bin/cssdmonitor Category: -1 Operation: fail Loc: canexec2 OS error: 0 Other : no exe permission, file [/ocw/grid/bin/cssdmonitor]
 
The solution is to compare the agent binary with the one on a "good" node and restore a good copy.
 
truss/strace of ohasd shows the agent binary is corrupted:
32555 17:38:15.953355 execve("/ocw/grid/bin/orarootagent.bin",
["/opt/grid/product/112020/grid/bi"...], [/* 38 vars */]) = 0
..
32555 17:38:15.954151 --- SIGBUS (Bus error) @ 0 (0) ---  
 
3. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.
 
4. Refer to: note 1964240.1 - CRS-5823:Could not initialize agent framework
 
 
 
Case 3: OCSSD.BIN does not start
 
Successful cssd.bin startup depends on the following:
 
1. GPnP profile is accessible - gpnpd needs to be fully up to serve profile
 
If ocssd.bin gets the profile successfully, ocssd.log will likely have messages similar to the following:
 
2010-02-02 18:00:16.251: [    GPnP][408926240]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "ipc://GPNPD_rac1", try 4 of 500...
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileVerifyForCall: [at clsgpnp.c:1867] Result: (87) CLSGPNP_SIG_VALPEER. Profile verified.  prf=0x165160d0
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileGetSequenceRef: [at clsgpnp.c:841] Result: (0) CLSGPNP_OK. seq of p=0x165160d0 is '6'=6
2010-02-02 18:00:16.263: [    GPnP][408926240]clsgpnp_profileCallUrlInt: [at clsgpnp.c:2186] Result: (0) CLSGPNP_OK. Successful get-profile CALL to remote "ipc://GPNPD_rac1" disco ""
 
Otherwise, messages like the following will show in ocssd.log:
 
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1100] GIPC gipcretConnectionRefused (29) gipcConnect(ipc-ipc://GPNPD_rac1)
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnpm_connect: [at clsgpnpm.c:1101] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "ipc://GPNPD_rac1"
2010-02-03 22:26:17.057: [    GPnP][3852126240]clsgpnp_getProfileEx: [at clsgpnp.c:546] Result: (13) CLSGPNP_NO_DAEMON. Can't get GPnP service profile from local GPnP daemon
2010-02-03 22:26:17.057: [ default][3852126240]Cannot get GPnP profile. Error CLSGPNP_NO_DAEMON (GPNPD daemon is not running).
2010-02-03 22:26:17.057: [    CSSD][3852126240]clsgpnp_getProfile failed, rc(13)
The solution is to ensure gpnpd is up and running properly.
 
 
2. Voting Disk is accessible
 
In 11gR2, ocssd.bin discovers voting disks using the settings in the GPnP profile; if not enough voting disks can be identified, ocssd.bin aborts itself.
 
2010-02-03 22:37:22.212: [    CSSD][2330355744]clssnmReadDiscoveryProfile: voting file discovery string(/share/storage/di*)
..
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvDiskVerify: Successful discovery of 0 disks
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2010-02-03 22:37:22.227: [    CSSD][1145538880]clssnmvFindInitialConfigs: No voting files found
2010-02-03 22:37:22.228: [    CSSD][1145538880]###################################
2010-02-03 22:37:22.228: [    CSSD][1145538880]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread
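The "Successful discovery of 0 disks" above means nothing matched the discovery string. The count can be emulated with a simple glob check. A minimal sketch (the pattern in the usage example is illustrative):

```shell
# Sketch: count files matching a voting-file discovery string.
# $1 = discovery pattern, e.g. "/share/storage/di*" (illustrative).
count_votedisks() {
    n=0
    for f in $1; do            # unquoted on purpose: let the shell expand the glob
        [ -e "$f" ] && n=$((n+1))
    done
    echo "$n"
}

# Example: count_votedisks "/share/storage/di*"   # 0 reproduces the log above
```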
 
ocssd.bin may fail to come up with the following error if all nodes went down while a voting file change was in progress:
 
2010-05-02 03:11:19.033: [    CSSD][1197668093]clssnmCompleteInitVFDiscovery: Detected voting file add in progress for CIN 0:1134513465:0, waiting for configuration to complete 0:1134513098:0
 
The solution is to start ocssd.bin in exclusive mode following note 1364971.1.
 
 
If the voting disk is located on a non-ASM device, ownership and permissions should be:
 
-rw-r----- 1 ogrid oinstall 21004288 Feb  4 09:13 votedisk1
 
3. Network is functional and name resolution is working:
 
If ocssd.bin can't bind to any network, ocssd.log will likely have messages like the following:
 
2010-02-03 23:26:25.804: [GIPCXCPT][1206540320]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2010-02-03 23:26:25.804: [GIPCGMOD][1206540320]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ]  failed to determine host from clsinet, using default
..
2010-02-03 23:26:25.810: [    CSSD][1206540320]clsssclsnrsetup: gipcEndpoint failed, rc 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://rac1:nm_eotcs- ret 39
2010-02-03 23:26:25.811: [    CSSD][1206540320]clssscmain: failed to open gipc endp
 
 
If there is a connectivity issue on the private network (including multicast being off), ocssd.log will likely have messages like the following:
 
2010-09-20 11:52:54.014: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 453, LATS 328297844, lastSeqNo 452, uniqueness 1284979488, timestamp 1284979973/329344894
2010-09-20 11:52:54.016: [    CSSD][1078421824]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
..  >>>> after a long delay
2010-09-20 12:02:39.578: [    CSSD][1103055168]clssnmvDHBValidateNCopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 180441784, wrtcnt, 1037, LATS 328883434, lastSeqNo 1036, uniqueness 1284979488, timestamp 1284980558/329930254
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0xe1ad870)
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssgmShutDown: Received abortive shutdown request from client.
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
2010-09-20 12:02:39.895: [    CSSD][1107286336]clssscExit: CSSD aborting from thread GMClientListener
2010-09-20 12:02:39.895: [    CSSD][1107286336]###################################
 
To validate the network, please refer to note 1054902.1
If CSSD could not start after a network change, please also check that the network interface name matches the GPnP profile definition ("gpnptool get") for cluster_interconnect.
 
In 11.2.0.1, ocssd.bin may bind to the public network if the private network is unavailable.
 
4. Vendor clusterware is up (if using vendor clusterware)
 
Grid Infrastructure provides full clusterware functionality and does not require vendor clusterware; however, if you happen to have Grid Infrastructure on top of vendor clusterware in your environment, the vendor clusterware needs to come up fully before CRS can be started. To verify, as the grid user:
 
$GRID_HOME/bin/lsnodes -n
racnode1    1
racnode2    0
 
If the vendor clusterware is not fully up, ocssd.log will likely have messages similar to the following:
 
2010-08-30 18:28:13.207: [    CSSD][36]clssnm_skgxninit: skgxncin failed, will retry
2010-08-30 18:28:14.207: [    CSSD][36]clssnm_skgxnmon: skgxn init failed
2010-08-30 18:28:14.208: [    CSSD][36]###################################
2010-08-30 18:28:14.208: [    CSSD][36]clssscExit: CSSD signal 11 in thread skgxnmon
 
Before the clusterware is installed, execute the command below as grid user:
 
$INSTALL_SOURCE/install/lsnodes -v
 
 
One issue on hp-ux: note 2130230.1 - Grid infrastructure startup fails due to vendor Clusterware did not start (HP-UX Service guard)
 
 
 
5. Command "crsctl" being executed from wrong GRID_HOME
 
Command "crsctl" must be executed from the correct GRID_HOME to start the stack; otherwise, messages similar to the following will be reported:
 
2012-11-14 10:21:44.014: [    CSSD][1086675264]ASSERT clssnm1.c 3248
2012-11-14 10:21:44.014: [    CSSD][1086675264](:CSSNM00056:)clssnmvStartDiscovery: Terminating because of the release version(11.2.0.2.0) of this node being lesser than the active version(11.2.0.3.0) that the cluster is at
2012-11-14 10:21:44.014: [    CSSD][1086675264]###################################
2012-11-14 10:21:44.014: [    CSSD][1086675264]clssscExit: CSSD aborting from thread clssnmvDDiscThread#
 
 
Case 4: CRSD.BIN does not start
If "crsctl stat res -t -init" shows that ora.crsd is in intermediate state, and this is not the first node where crsd is starting, then a likely cause is that the local crsd.bin is not able to talk to the master crsd.bin.
In this case, the master crsd.bin is likely having a problem, so killing the master crsd.bin is a likely solution. 
Issue "grep MASTER crsd.trc" to find out the node where the master crsd.bin is running, then kill the crsd.bin on that master node.
The killed crsd.bin will respawn automatically, and the master role will be transferred to a crsd.bin on another node.
 
 
Successful crsd.bin startup depends on the following:
 
1. ocssd is fully up
 
If ocssd.bin is not fully up, crsd.log will show messages like following:
 
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clssscConnect: gipc request failed with 29 (0x16)
2010-02-03 22:37:51.638: [ CSSCLNT][1548456880]clsssInitNative: connect failed, rc 29
2010-02-03 22:37:51.639: [  CRSRTI][1548456880] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
 
 
2. OCR is accessible
 
If the OCR is located on ASM, the ora.asm resource (ASM instance) must be up and the diskgroup for the OCR must be mounted; if not, crsd.log will likely show messages like:
 
2010-02-03 22:22:55.186: [  OCRASM][2603807664]proprasmo: Error in open/create file in dg [GI]
[  OCRASM][2603807664]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
 
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: kgfoCheckMount returned [7]
2010-02-03 22:22:55.189: [  OCRASM][2603807664]proprasmo: The ASM instance is down
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: Failed to open [+GI]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprioo: No OCR/OLR devices are usable
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRRAW][2603807664]proprinit: Could not open raw device
2010-02-03 22:22:55.190: [  OCRASM][2603807664]proprasmcl: asmhandle is NULL
2010-02-03 22:22:55.190: [  OCRAPI][2603807664]a_init:16!: Backend init unsuccessful : [26]
2010-02-03 22:22:55.190: [  CRSOCR][2603807664] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge
ORA-15077: could not locate ASM instance serving a required diskgroup
] [7]
2010-02-03 22:22:55.190: [    CRSD][2603807664][PANIC] CRSD exiting: Could not init OCR, code: 26
 
Note: in 11.2 ASM starts before crsd.bin, and brings up the diskgroup automatically if it contains the OCR.
 
If the OCR is located on a non-ASM device, expected ownership and permissions are:
 
-rw-r----- 1 root  oinstall  272756736 Feb  3 23:24 ocr
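The mode can be compared against the expectation above with `stat`. A minimal sketch, assuming GNU coreutils (the path in the usage example is illustrative):

```shell
# Sketch: check a storage file's numeric mode against the expected
# value (640 corresponds to -rw-r----- above). Uses GNU "stat -c".
check_mode() {
    # $1 = file, $2 = expected octal mode
    mode=$(stat -c '%a' "$1" 2>/dev/null) || { echo "UNREADABLE: $1"; return 1; }
    if [ "$mode" = "$2" ]; then
        echo "OK: $1 mode $mode"
    else
        echo "BAD: $1 mode $mode, expected $2"
    fi
}

# Example: check_mode /share/storage/ocr 640
```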
 
If the OCR is located on a non-ASM device and the device is unavailable, crsd.log will likely show messages similar to the following:
 
2010-02-03 23:14:33.583: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:33.583: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:33.583: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:6m':failed in stat OCR file/disk /share/storage/ocr, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCROSD][2346668976]utopen:7:failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2010-02-03 23:14:34.587: [  OCRRAW][2346668976]proprinit: Could not open raw device
2010-02-03 23:14:34.587: [ default][2346668976]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:14:35.589: [    CRSD][2346668976][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26
 
 
If the OCR is corrupted, likely crsd.log will show messages like the following:
 
2010-02-03 23:19:38.417: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]propriogid:1_2: INVALID FORMAT
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprioini: all disks are not OCR/OLR formatted
2010-02-03 23:19:39.429: [  OCRRAW][3360863152]proprinit: Could not open raw device
2010-02-03 23:19:39.429: [ default][3360863152]a_init:7!: Backend init unsuccessful : [26]
2010-02-03 23:19:40.432: [    CRSD][3360863152][PANIC] CRSD exiting: OCR device cannot be initialized, error: 1:26
 
 
If the owner or group of the grid user got changed, crsd.log will likely show the following even though ASM is available:
 
2010-03-10 11:45:12.510: [  OCRASM][611467760]proprasmo: Error in open/create file in dg [SYSTEMDG]
[  OCRASM][611467760]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
 
2010-03-10 11:45:12.528: [  OCRASM][611467760]proprasmo: kgfoCheckMount returned [7]
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmo: The ASM instance is down
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: Failed to open [+SYSTEMDG]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprioo: No OCR/OLR devices are usable
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRRAW][611467760]proprinit: Could not open raw device
2010-03-10 11:45:12.529: [  OCRASM][611467760]proprasmcl: asmhandle is NULL
2010-03-10 11:45:12.529: [  OCRAPI][611467760]a_init:16!: Backend init unsuccessful : [26]
2010-03-10 11:45:12.530: [  CRSOCR][611467760] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=1031, loc=kgfokge
ORA-01031: insufficient privileges
] [7]
 
 
If the oracle binary in GRID_HOME has the wrong ownership or permission (regardless of whether ASM is up and running), or if the grid user cannot write in ORACLE_BASE, crsd.log will likely show the following:
 
2012-03-04 21:34:23.139: [  OCRASM][3301265904]proprasmo: Error in open/create file in dg [OCR]
[  OCRASM][3301265904]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=12547, loc=kgfokge
 
2012-03-04 21:34:23.139: [  OCRASM][3301265904]ASM Error Stack : ORA-12547: TNS:lost contact
 
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: kgfoCheckMount returned [7]
2012-03-04 21:34:23.633: [  OCRASM][3301265904]proprasmo: The ASM instance is down
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2012-03-04 21:34:23.634: [  OCRRAW][3301265904]proprioo: No OCR/OLR devices are usable
2012-03-04 21:34:23.635: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.636: [    GIPC][3301265904] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5326]
2012-03-04 21:34:23.639: [ default][3301265904]clsvactversion:4: Retrieving Active Version from local storage.
2012-03-04 21:34:23.643: [  OCRRAW][3301265904]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2012-03-04 21:34:23.645: [  OCRRAW][3301265904]proprinit: Could not open raw device
2012-03-04 21:34:23.646: [  OCRASM][3301265904]proprasmcl: asmhandle is NULL
2012-03-04 21:34:23.650: [  OCRAPI][3301265904]a_init:16!: Backend init unsuccessful : [26]
2012-03-04 21:34:23.651: [  CRSOCR][3301265904] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact
 
2012-03-04 21:34:23.652: [ CRSMAIN][3301265904] Created alert : (:CRSD00111:) :  Could not init OCR, error: PROC-26: Error while accessing the physical storage
ORA-12547: TNS:lost contact
 
2012-03-04 21:34:23.652: [    CRSD][3301265904][PANIC] CRSD exiting: Could not init OCR, code: 26
 
The expected ownership and permission of oracle binary in GRID_HOME should be:
 
-rwsr-s--x 1 grid oinstall 184431149 Feb  2 20:37 /ocw/grid/bin/oracle
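The setuid/setgid bits in the listing above are easy to lose after a manual copy of the binary. A minimal check sketch (the path in the usage example is illustrative):

```shell
# Sketch: verify the setuid bit on the oracle binary (the first
# permission triplet should read "rws", as in -rwsr-s--x above).
has_setuid() {
    perms=$(ls -l "$1" 2>/dev/null | awk '{print $1}')
    case "$perms" in
        -rws*) echo "setuid OK: $perms" ;;
        "")    echo "NOT FOUND: $1" ;;
        *)     echo "setuid MISSING: $perms" ;;
    esac
}

# Example: has_setuid /ocw/grid/bin/oracle
# -rwsr-s--x corresponds to octal mode 6751.
```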
 
If the OCR or its mirror is unavailable (for example, ASM is up but the diskgroup for the OCR/mirror is unmounted), crsd.log will likely show the following:
 
2010-05-11 11:16:38.578: [  OCRASM][18]proprasmo: Error in open/create file in dg [OCRMIR]
[  OCRASM][18]SLOS : SLOS: cat=8, opn=kgfoOpenFile01, dep=15056, loc=kgfokge
ORA-17503: ksfdopn:DGOpenFile05 Failed to open file +OCRMIR.255.4294967295
ORA-17503: ksfdopn:2 Failed to open file +OCRMIR.255.4294967295
ORA-15001: diskgroup "OCRMIR
..
2010-05-11 11:16:38.647: [  OCRASM][18]proprasmo: kgfoCheckMount returned [6]
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmo: The ASM disk group OCRMIR is not found or not mounted
2010-05-11 11:16:38.648: [  OCRASM][18]proprasmdvch: Failed to open OCR location [+OCRMIR] error [26]
2010-05-11 11:16:38.648: [  OCRRAW][18]propriodvch: Error  [8] returned device check for [+OCRMIR]
2010-05-11 11:16:38.648: [  OCRRAW][18]dev_replace: non-master could not verify the new disk (8)
[  OCRSRV][18]proath_invalidate_action: Failed to replace [+OCRMIR] [8]
[  OCRAPI][18]procr_ctx_set_invalid_no_abort: ctx set to invalid
..
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91: Comparing device hash ids between local and master failed
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Local dev (1862408427, 1028247821, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:91 Master dev (1862408427, 1859478705, 0, 0, 0)
2010-05-11 11:16:46.587: [  OCRMAS][19]th_master:9: Shutdown CacheLocal. my hash ids don't match
[  OCRAPI][19]procr_ctx_set_invalid_no_abort: ctx set to invalid
[  OCRAPI][19]procr_ctx_set_invalid: aborting...
2010-05-11 11:16:46.587: [    CRSD][19] Dump State Starting ...
 
 
3. crsd.bin pid file exists and points to running crsd.bin process
 
If the pid file does not exist, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:
 
 
2010-02-14 17:40:57.927: [ora.crsd][1243486528] [check] PID FILE doesn't exist.
..
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Creating PID [30269] file for home /ocw/grid host racnode1 bin crs to /ocw/grid/crs/init/
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Error3 -2 writing PID [30269] to the file []
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Failed to record pid for CRSD
2010-02-14 17:41:57.927: [  clsdmt][1092499776]Terminating process
2010-02-14 17:41:57.927: [ default][1092499776] CRSD exiting on stop request from clsdms_thdmai
 
The solution is to create a dummy pid file ($GRID_HOME/crs/init/$HOST.pid) manually, as the grid user, with the "touch" command, and then restart the resource ora.crsd.
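The "touch" step can be expressed as a small helper. A minimal sketch (the GRID_HOME path and host name in the usage example are illustrative):

```shell
# Sketch: recreate the missing CRSD pid file, then restart ora.crsd.
# $1 = GRID_HOME, $2 = host name (illustrative assumptions).
recreate_crsd_pidfile() {
    touch "$1/crs/init/$2.pid" && echo "created $1/crs/init/$2.pid"
}

# Example (touch as the grid user, crsctl as root):
#   recreate_crsd_pidfile /ocw/grid racnode1
#   /ocw/grid/bin/crsctl stop  res ora.crsd -init
#   /ocw/grid/bin/crsctl start res ora.crsd -init
```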
 
If the pid file does exist, but the PID in it references a running process which is NOT the crsd.bin process, $GRID_HOME/log/$HOST/agent/ohasd/orarootagent_root/orarootagent_root.log will have messages similar to the following:
 
2011-04-06 15:53:38.777: [ora.crsd][1160390976] [check] PID will be looked for in /ocw/grid/crs/init/racnode1.pid
2011-04-06 15:53:38.778: [ora.crsd][1160390976] [check] PID which will be monitored will be 1535                               >> 1535 is output of "cat /ocw/grid/crs/init/racnode1.pid"
2011-04-06 15:53:38.965: [ COMMCRS][1191860544]clsc_connect: (0x2aaab400b0b0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD))
[  clsdmc][1160390976]Fail to connect (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_CRSD)) with status 9
2011-04-06 15:53:38.966: [ora.crsd][1160390976] [check] Error = error 9 encountered when connecting to CRSD
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Calling PID check for daemon
2011-04-06 15:53:39.023: [ora.crsd][1160390976] [check] Trying to check PID = 1535
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] PID check returned ONLINE CLSDM returned OFFLINE
2011-04-06 15:53:39.203: [ora.crsd][1160390976] [check] DaemonAgent::check returned 5
2011-04-06 15:53:39.203: [    AGFW][1160390976] check for resource: ora.crsd 1 1 completed with status: FAILED
2011-04-06 15:53:39.203: [    AGFW][1170880832] ora.crsd 1 1 state changed from: UNKNOWN to: FAILED
..
2011-04-06 15:54:10.511: [    AGFW][1167522112] ora.crsd 1 1 state changed from: UNKNOWN to: CLEANING
..
2011-04-06 15:54:10.513: [ora.crsd][1146542400] [clean] Trying to stop PID = 1535
..
2011-04-06 15:54:11.514: [ora.crsd][1146542400] [clean] Trying to check PID = 1535
 
 
To verify on OS level:
 
ls -l /ocw/grid/crs/init/*pid
-rwxr-xr-x 1 ogrid oinstall 5 Feb 17 11:00 /ocw/grid/crs/init/racnode1.pid
cat /ocw/grid/crs/init/*pid
1535
ps -ef| grep 1535
root      1535     1  0 Mar30 ?        00:00:00 iscsid                  >> Note process 1535 is not crsd.bin
 
The solution is to create an empty pid file and to restart the resource ora.crsd, as root:
 
 
# > $GRID_HOME/crs/init/racnode1.pid
# $GRID_HOME/bin/crsctl stop res ora.crsd -init
# $GRID_HOME/bin/crsctl start res ora.crsd -init
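The stale-pid check above can be scripted. A minimal sketch (the pid-file path and daemon name in the usage example are illustrative):

```shell
# Sketch: does the PID in a pid file belong to the expected daemon?
# $1 = pid file, $2 = expected process name, e.g. crsd.bin.
pid_matches() {
    [ -s "$1" ] || { echo "EMPTY/MISSING pid file: $1"; return 1; }
    pid=$(cat "$1")
    comm=$(ps -p "$pid" -o comm= 2>/dev/null)
    if [ -z "$comm" ]; then
        echo "pid $pid not running"
    elif [ "$comm" = "$2" ]; then
        echo "MATCH: pid $pid is $2"
    else
        echo "MISMATCH: pid $pid is $comm, not $2"   # e.g. iscsid in the case above
    fi
}

# Example: pid_matches /ocw/grid/crs/init/racnode1.pid crsd.bin
```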
 
 
4. Network is functional and name resolution is working:
 
If the network is not fully functioning, ocssd.bin may still come up, but crsd.bin may fail and the crsd.log will show messages like:
 
 
2010-02-03 23:34:28.412: [    GPnP][2235814832]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=867, tl=3, f=0
2010-02-03 23:34:28.428: [  OCRAPI][2235814832]clsu_get_private_ip_addresses: no ip addresses found.
..
2010-02-03 23:34:28.434: [  OCRAPI][2235814832]a_init:13!: Clusterware init unsuccessful : [44]
2010-02-03 23:34:28.434: [  CRSOCR][2235814832] OCR context init failure.  Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-02-03 23:34:28.434: [    CRSD][2235814832][PANIC] CRSD exiting: Could not init OCR, code: 44
 
Or:
 
 
2009-12-10 06:28:31.974: [  OCRMAS][20]proath_connect_master:1: could not connect to master  clsc_ret1 = 9, clsc_ret2 = 9
2009-12-10 06:28:31.974: [  OCRMAS][20]th_master:11: Could not connect to the new master
2009-12-10 06:29:01.450: [ CRSMAIN][2] Policy Engine is not initialized yet!
2009-12-10 06:29:31.489: [ CRSMAIN][2] Policy Engine is not initialized yet!
 
Or:
 
 
2009-12-31 00:42:08.110: [ COMMCRS][10]clsc_receive: (102b03250) Error receiving, ns (12535, 12560), transport (505, 145, 0)
 
To validate the network, please refer to note 1054902.1
 
5. The crsd executables (crsd.bin and crsd in GRID_HOME/bin) have the correct ownership/permission and have not been manually modified; a simple way to check is to compare the output of "ls -l <grid-home>/bin/crsd <grid-home>/bin/crsd.bin" with a "good" node.
 
 
6. crsd may not start due to the following:
 
note 1552472.1 -CRSD Will Not Start Following a Node Reboot: crsd.log reports: clsclisten: op 65 failed and/or Unable to get E2E port
note 1684332.1 - GI crsd Fails to Start: clsclisten: op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)
 
 
 
7. To troubleshoot further, refer to note 1323698.1 - Troubleshooting CRSD Start up Issue
 
 
Case 5: GPNPD.BIN does not start
1. Name Resolution is not working
 
gpnpd.bin fails with following error in gpnpd.log:
 
 
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_exchange: [at clsgpnpm.c:1175] Calling "tcp://node2:9393", try 1 of 3...
2010-05-13 12:48:11.540: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1015] ENTRY
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1066] GIPC gipcretFail (1) gipcConnect(tcp-tcp://node2:9393)
2010-05-13 12:48:11.541: [    GPnP][1171126592]clsgpnpm_connect: [at clsgpnpm.c:1067] Result: (48) CLSGPNP_COMM_ERR. Failed to connect to call url "tcp://node2:9393"
 
In the above example, make sure the current node is able to ping "node2" and that there is no firewall between them.
 
2. Bug 10105195
 
Due to Bug 10105195, gpnpd dispatch is single-threaded and can be blocked by network scanning etc. The bug is fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above; refer to note 10105195.8 for more details.
 
 
Case 6: Various other daemons do not start
Common causes:
 
1. Log file or directory for the daemon doesn't have appropriate ownership or permission
 
If the log file or log directory for the daemon doesn't have proper ownership or permissions, usually there is no new info in the log file and the timestamp remains the same while the daemon tries to come up.
 
Refer to below section "Log File Location, Ownership and Permission" for general reference.
 
 
2. Network socket file doesn't have appropriate ownership or permission
 
In this case, the daemon log will show messages like:
 
2010-02-02 12:55:20.485: [ COMMCRS][1121433920]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
 
2010-02-02 12:55:20.485: [  clsdmt][1110944064]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=rac1DBG_GIPCD))
 
 
 
3. OLR is corrupted
 
In this case, the daemon log will show messages like the following (this example is a case where ora.ctssd fails to start):
 
2012-07-22 00:15:16.565: [ default][1]clsvactversion:4: Retrieving Active Version from local storage.
2012-07-22 00:15:16.575: [    CTSS][1]clsctss_r_av3: Invalid active version [] retrieved from OLR. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1](:ctss_init16:): Error [19] retrieving active version. Returns [19].
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS init failed [19]
2012-07-22 00:15:16.585: [    CTSS][1]ctss_main: CTSS daemon aborting [19].
2012-07-22 00:15:16.585: [    CTSS][1]CTSS daemon aborting
 
 
 
The solution is to restore a good copy of the OLR; see note 1193643.1.
 
 
 
4.  Other cases:
 
note 1087521.1 - CTSS Daemon Aborting With "op 65 failed, NSerr (12560, 0), transport: (583, 0, 0)" 
 
 
 
Case 7: CRSD Agents do not start
 
CRSD.BIN will spawn two agents to start up user resources; the two agents share the same names and binaries as the ohasd.bin agents:
 
  orarootagent: responsible for ora.netn.network, ora.nodename.vip, ora.scann.vip and  ora.gns
  oraagent: responsible for ora.asm, ora.eons, ora.ons, listener, SCAN listener, diskgroup, database, service resource etc
 
To find out the user resource status:
 
$GRID_HOME/bin/crsctl stat res -t
 
 
If crsd.bin cannot start any of the above agents properly, user resources may not come up. 
 
1. A common cause of agent failure is that the log file or log directory for the agents does not have proper ownership or permissions.
 
Refer to below section "Log File Location, Ownership and Permission" for general reference.
 
2. Agent may fail to start due to bug 11834289 with error "CRS-5802: Unable to start the agent process", refer to Section "OHASD does not start"  #10 for details.
 
 
Case 8: HAIP does not start
HAIP may fail to start with various errors, e.g.: 
 
[ohasd(891)]CRS-2807:Resource 'ora.cluster_interconnect.haip' failed to start automatically.
Refer to note 1210883.1 for more details of HAIP 
 
Network and Naming Resolution Verification
 
CRS depends on a fully functional network and name resolution. If the network or name resolution is not fully functioning, CRS may not come up successfully.
 
To validate network and name resolution setup, please refer to note 1054902.1
 
 
Log File Location, Ownership and Permission
 
Appropriate ownership and permission of sub-directories and files in $GRID_HOME/log is critical for CRS components to come up properly.
 
In Grid Infrastructure cluster environment:
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and two separate RDBMS owners rdbmsap and rdbmsar, here is what it looks like under $GRID_HOME/log in a cluster environment:
 
 
drwxrwxr-x 5 grid oinstall 4096 Dec  6 09:20 log
  drwxr-xr-x  2 grid oinstall 4096 Dec  6 08:36 crs
  drwxr-xr-t 17 root   oinstall 4096 Dec  6 09:22 rac1
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:20 admin
    drwxrwxr-t 4 root   oinstall  4096 Dec  6 09:20 agent
      drwxrwxrwt 7 root    oinstall 4096 Jan 26 18:15 crsd
        drwxr-xr-t 2 grid  oinstall 4096 Dec  6 09:40 application_grid
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 oraagent_grid
        drwxr-xr-t 2 rdbmsap oinstall 4096 Jan 26 18:15 oraagent_rdbmsap
        drwxr-xr-t 2 rdbmsar oinstall 4096 Jan 26 18:15 oraagent_rdbmsar
        drwxr-xr-t 2 grid  oinstall 4096 Jan 26 18:15 ora_oc4j_type_grid
        drwxr-xr-t 2 root    root     4096 Jan 26 20:09 orarootagent_root
      drwxrwxr-t 6 root oinstall 4096 Dec  6 09:24 ohasd
        drwxr-xr-t 2 grid oinstall 4096 Jan 26 18:14 oraagent_grid
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdagent_root
        drwxr-xr-t 2 root   root     4096 Dec  6 09:24 oracssdmonitor_root
        drwxr-xr-t 2 root   root     4096 Jan 26 18:14 orarootagent_root    
    -rw-rw-r-- 1 root root     12931 Jan 26 21:30 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:44 client
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 crsd
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:24 cssd
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:24 ctssd
    drwxr-x--- 2 grid oinstall  4096 Jan 26 18:14 diskmon
    drwxr-x--- 2 grid oinstall  4096 Dec  6 09:25 evmd     
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:20 gipcd     
    drwxr-x--- 2 root oinstall  4096 Dec  6 09:20 gnsd      
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:58 gpnpd    
    drwxr-x--- 2 grid oinstall  4096 Jan 26 21:19 mdnsd    
    drwxr-x--- 2 root oinstall  4096 Jan 26 21:20 ohasd     
    drwxrwxr-t 5 grid oinstall  4096 Dec  6 09:34 racg       
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Dec  6 09:20 racgmain
    drwxr-x--- 2 grid oinstall  4096 Jan 26 20:57 srvm        
Please note that most log files in a sub-directory inherit the ownership of the parent directory; the listing above is only a general reference for spotting unexpected recursive ownership or permission changes inside the CRS home. If you have a working node of the same version, use it as the reference.
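One practical way to use a working node as the reference is to snapshot the ownership and permissions on each node and diff the two files; a minimal sketch, demonstrated here on a scratch directory (point LOGDIR at $GRID_HOME/log for a real run):

```shell
# Snapshot mode/owner/group of every entry under the log tree.
LOGDIR=${LOGDIR:-/tmp/demo_grid_log}   # use $GRID_HOME/log in practice
mkdir -p "$LOGDIR/rac1/agent/crsd"     # scratch tree for illustration only
find "$LOGDIR" -printf '%M %u %g %p\n' | sort -k4 > "/tmp/log_perms_$(hostname -s).txt"
# Repeat on the working node, then compare:
# diff /tmp/log_perms_rac1.txt /tmp/log_perms_rac2.txt
```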
 
 
In Oracle Restart environment:
Here's what it looks like under $GRID_HOME/log:
 
drwxrwxr-x 5 grid oinstall 4096 Oct 31  2009 log
  drwxr-xr-x  2 grid oinstall 4096 Oct 31  2009 crs
  drwxr-xr-x  3 grid oinstall 4096 Oct 31  2009 diag
  drwxr-xr-t 17 root   oinstall 4096 Oct 31  2009 rac1
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 admin
    drwxrwxr-t 4 root   oinstall  4096 Oct 31  2009 agent
      drwxrwxrwt 2 root oinstall 4096 Oct 31  2009 crsd
      drwxrwxr-t 8 root oinstall 4096 Jul 14 08:15 ohasd
        drwxr-xr-x 2 grid oinstall 4096 Aug  5 13:40 oraagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  2 07:11 oracssdagent_grid
        drwxr-xr-x 2 grid oinstall 4096 Aug  3 21:13 orarootagent_grid
    -rwxr-xr-x 1 grid oinstall 13782 Aug  1 17:23 alertrac1.log
    drwxr-x--- 2 grid oinstall  4096 Nov  2  2009 client
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 crsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 cssd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 ctssd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 diskmon
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 evmd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gipcd
    drwxr-x--- 2 root   oinstall  4096 Oct 31  2009 gnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 gpnpd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 mdnsd
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 ohasd
    drwxrwxr-t 5 grid oinstall  4096 Oct 31  2009 racg
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgeut
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgevtf
      drwxrwxrwt 2 grid oinstall 4096 Oct 31  2009 racgmain
    drwxr-x--- 2 grid oinstall  4096 Oct 31  2009 srvm
 
 
For 12.1.0.2 onward, refer to note 1915729.1 - Oracle Clusterware Diagnostic and Alert Log Moved to ADR
 
 
 
Network Socket File Location, Ownership and Permission
 
Network socket files can be located in /tmp/.oracle, /var/tmp/.oracle or /usr/tmp/.oracle
 
When a socket file has unexpected ownership or permissions, the daemon log file (e.g. evmd.log) will usually contain the following:
 
 
2011-06-18 14:07:28.545: [ COMMCRS][772]clsclisten: Permission denied for (ADDRESS=(PROTOCOL=ipc)(KEY=racnode1DBG_EVMD))
 
2011-06-18 14:07:28.545: [  clsdmt][515]Fail to listen to (ADDRESS=(PROTOCOL=ipc)(KEY=lexxxDBG_EVMD))
2011-06-18 14:07:28.545: [  clsdmt][515]Terminating process
2011-06-18 14:07:28.559: [ default][515] EVMD exiting on stop request from clsdms_thdmai
 
 
And the following error may be reported:
 
 
CRS-5017: The resource action "ora.evmd start" encountered the following error:
CRS-2674: Start of 'ora.evmd' on 'racnode1' failed
..
 
The solution is to stop GI as root (crsctl stop crs -f), clean up the socket files, and restart GI.
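A hedged sketch of that sequence (run as root on the affected node; only remove the .oracle directories that actually exist on your platform):

```shell
# Stop the entire GI stack forcefully, remove stale socket files, restart.
$GRID_HOME/bin/crsctl stop crs -f
rm -rf /tmp/.oracle /var/tmp/.oracle /usr/tmp/.oracle
$GRID_HOME/bin/crsctl start crs
```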
 
 
Assuming a Grid Infrastructure environment with node name rac1, CRS owner grid, and clustername eotcs
 
In Grid Infrastructure cluster environment:
Below is an example output from cluster environment:
 
 
drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle
 
./.oracle:
drwxrwxrwt 2 root  oinstall 4096 Feb  2 21:25 .
srwxrwx--- 1 grid oinstall    0 Feb  2 18:00 master_diskmon
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 mdnsd
-rw-r--r-- 1 grid oinstall    5 Feb  2 18:00 mdnsd.pid
prw-r--r-- 1 root  root        0 Feb  2 13:33 npohasd
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 ora_gipc_GPNPD_rac1
-rw-r--r-- 1 grid oinstall    0 Feb  2 13:34 ora_gipc_GPNPD_rac1_lock
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11724.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:39 s#11735.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 13:45 s#12339.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6275.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6276.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.1
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 s#6278.2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sAevm
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sCevm
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sCRSD_UI_SOCKET
srwxrwxrwx 1 root  root        0 Feb  2 21:25 srac1DBG_CRSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_CSSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_CTSSD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_EVMD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GIPCD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_GPNPD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 srac1DBG_MDNSD
srwxrwxrwx 1 root  root        0 Feb  2 18:00 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN2
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:01 sLISTENER_SCAN3
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1_eotcs_lock
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1
-rw-r--r-- 1 grid oinstall    0 Feb  2 18:00 sOracle_CSS_LclLstnr_eotcs_1_lock
srwxrwxrwx 1 root  root        0 Feb  2 18:01 sora_crsqs
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROC
srwxrwxrwx 1 root  root        0 Feb  2 18:00 sprocr_local_conn_0_PROL
srwxrwxrwx 1 grid oinstall    0 Feb  2 18:00 sSYSTEM.evm.acceptor.auth
 
 
In Oracle Restart environment:
And below is an example output from Oracle Restart environment:
 
 
drwxrwxrwt  2 root oinstall 4096 Feb  2 21:25 .oracle
 
./.oracle:
srwxrwx--- 1 grid oinstall 0 Aug  1 17:23 master_diskmon
prw-r--r-- 1 grid oinstall 0 Oct 31  2009 npohasd
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.1
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 s#14478.2
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.1
srwxrwxrwx 1 grid oinstall 0 Jul 14 08:02 s#2266.2
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.1
srwxrwxrwx 1 grid oinstall 0 Jul  7 10:59 s#2269.2
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.1
srwxrwxrwx 1 grid oinstall 0 Jul 31 22:10 s#2313.2
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.1
srwxrwxrwx 1 grid oinstall 0 Jun 29 21:58 s#2851.2
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sCRSD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_CSSD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 srac1DBG_OHASD
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sEXTPROC1521
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1_localhost_lock
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sOCSSD_LL_rac1__lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_IPC_SOCKET_11
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sOHASD_UI_SOCKET
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1
-rw-r--r-- 1 grid oinstall 0 Aug  1 17:23 sgrid_CSS_LclLstnr_localhost_1_lock
srwxrwxrwx 1 grid oinstall 0 Aug  1 17:23 sprocr_local_conn_0_PROL
 
 
 
Diagnostic file collection
 
If the issue can't be identified with this note, please run $GRID_HOME/bin/diagcollection.sh as root on all nodes, and upload all .gz files it generates in the current directory.
 

 

Is it possible to access an Oracle ASM diskgroup if the underlying disk partition table is corrupted?

We had an issue with one of the RAC nodes going down, and it was reported that a missing LUN/partition table caused it. The ASM, database and CRS logs from the time of the issue were analyzed; the analysis does not reveal any evidence pointing to a missing partition table or even block corruption. In fact, the database was up during the said period. Further, the team analyzed the OS logs covering the time of the issue and beyond, but did not find evidence that the partition table was missing or damaged during that period.
 
So, we would like to know that,
 
a) Can ASM work for days with partition table on cache while partition table on disk is damaged?
b) What is periodic update/syncing protocol between ASM cache and disk for partition information?
 
We would appreciate your looking into the queries above and responding back, so we can take the analysis further, find the RCA, and take preventive measures to avoid the same in future.
 
 
 
 
 
ASM can continue to function when the ASM disk header on disk is corrupted or lost. ASM keeps the ASM disk header in the (ASM instance) cache, not the disk partition table.
 
If the disk partition table is lost/damaged/deleted/modified, the ASM disk header may or may not be affected. For example, if the disk partition table is deleted, that would not affect the ASM disk header at all. ASM disk group would stay mounted and all would seem to be fine. You would be able to dismount the disk group just fine. But the next time you try to mount the disk group, ASM will not be able to find the disk header as it will look at the beginning of the disk. The disk header, and in fact all other data, including ASM metadata, will be there, but ASM will not be able to find it. If this was the only thing that happened, i.e. lost or deleted disk partition table, the fix is very easy. Just recreate the partition table on that disk.
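Since recreating the partition table is only easy if the original layout is known, it is worth keeping a dump of it; a minimal sketch using sfdisk (the device name and backup path are illustrative):

```shell
# Dump the partition table of the device backing an ASM disk (run as root).
sfdisk -d /dev/sdb > /root/sdb_partition_table.txt   # /dev/sdb is illustrative
# If the partition table is later deleted, restore it with:
# sfdisk /dev/sdb < /root/sdb_partition_table.txt
```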
 
But if the partition table and ASM disk header are in fact damaged, say someone overwrote the first megabyte of the disk, then again ASM disk group will stay mounted and all would appear to be fine. That is because ASM keeps the ASM disk header in the cache. But if you were to dismount the disk group, you would not be able to mount it again as the ASM metadata is now truly lost.
 
 
 
 
"What happens when we insert data?" Remember also that the ASM instance only keeps track of where database data file extents are stored and does not take part in actual database IO. Each database instance caches the ASM file information it gets and uses this information to perform the IO directly to the data files. Everything would work until such time as ASM was unable to provide the database with needed file information.
 

Generic check for error Oracle ASM ORA-15042

Error ORA-15042 means ASM cannot mount the diskgroup because some of its member disks are not visible to ASM.
 
 
 
How to find which disk is missing:
 
-----------------------------------------------------
 
- Scan the ASM alert log for the last successful mount of that diskgroup, where it lists the disks with which the group was mounted. Search for the keyword below:
 
SUCCESS: diskgroup <diskgroup name> was mounted
 
 
 
- From that point up to the time of the issue, scan the ASM alert log for disks that were added or dropped.
 
- Finally, make a list of the path and name of every disk that was a member of the diskgroup before the issue occurred.
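The scans above can be scripted with grep; a minimal sketch, run here against a tiny synthetic alert-log fragment for illustration (point ALERT at your real alert_+ASM<n>.log):

```shell
# Synthetic fragment standing in for the real ASM alert log.
ALERT=/tmp/alert_sample.log
cat > "$ALERT" <<'EOF'
NOTE: Assigning number (1,0) to disk (ORCL:DATA01P)
SUCCESS: diskgroup DATA was mounted
NOTE: cache opening disk 0 of grp 1: DATA01P label:DATA01P
EOF
# List the successful mount plus disk assignment/open/drop activity.
grep -nE 'SUCCESS: diskgroup DATA was mounted|Assigning number|cache opening disk|dropping disk' "$ALERT"
```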
 
 
 
Diagnosis:
 
----------------
 
- Check that all disks have the correct permissions (660) and ownership. Owner/group should be grid:asmadmin (asmadmin is SS_ASM_GRP in $GRID_HOME/rdbms/lib/config.c).
 
- Check that the disks are accessible as the grid OS user, for example:
 
dd if=<path to asm disk> of=/tmp/<devicename>.dd bs=1048576 count=10
 
- If multipath is in use, check that the multipath base devices are presented correctly.
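For the multipath check, the multipath maps and the state of their base paths can be listed directly; a hedged sketch (requires device-mapper-multipath and root):

```shell
# Show every multipath map with its underlying paths and their states.
multipath -ll
# Flag any failed or faulty paths.
multipath -ll | grep -iE 'failed|faulty' && echo "bad paths found"
```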
 
 
 
Possible cause:
 
----------------------------
 
1. Disk permissions are not correct.
 
2. The disk accessed by ASM is available, but its base device is not available.
 
3. A duplicate device is found for the same disk.
 
4. The partition of the disk was modified, so the ASM disk header is no longer located in the first 4K block at the start of the disk.
 
5. At least the first 4K block of the disk was erased externally, i.e. outside of Oracle.
 
6. In the case of ASMLIB, the multipath disks are visible but the ASMLIB disks are not.
 
 
 
Solution
 
----------------
 
1. Correct the permissions.
 
2. Ask your system administrator to present the base device.
 
3. Check with the system administrator why more than one device is showing up for the same ASM disk and being discovered by ASM.
 
4. Check with the system administrator to correct the partition table so that it points to the correct location of the ASM disk header.
 
5. If only the ASM disk header is corrupted/wiped out and the remaining ASM metadata is intact, it can be repaired; please raise an SR with Oracle Support. Otherwise you need to recreate the diskgroup.
 
6.
 
- Check that the label is present by running the command below:
 
kfed read /dev/oracleasm/disks/<label name> | grep kfdhdb.driver.provstr
 
- Check that the ASMLIB drivers are loaded.
 
- Check that the ASMLIB configuration file /etc/sysconfig/oracleasm has the proper entries.
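A hedged sketch of the three ASMLIB checks above (run as root on a node with ASMLib installed):

```shell
/etc/init.d/oracleasm status      # driver loaded and /dev/oracleasm mounted?
/etc/init.d/oracleasm listdisks   # labels ASMLib can see
cat /etc/sysconfig/oracleasm      # ORACLEASM_* configuration entries
```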
 
 
 
 
 
 Also, please check the following document:
 
 
 
ORA-15063 / ORA-15042 - TROUBLESHOOTING STEPS BEFORE OPENING a SR to Oracle Support (Doc ID 1535996.1)
 
 

KFED Reports “KFBTYP_INVALID” & OS Metadata [EFI PART] In Oracle ASMLIB Disk /ASM disk Member (ASM Disk Overlapping : Scenario #1).

APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 12.1.0.1 [Release 10.2 to 12.1]
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Information in this document applies to any platform.
SYMPTOMS
1) ASM diskgroup cannot be mounted due to the next error:
 
SQL> alter diskgroup reco mount;
alter diskgroup reco mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "2" is missing from group number "2"
 
 
 
CAUSE
1) Kfed reports:
 
[oracle@fcomtaep2 disks]$ kfed read ASMRECO03
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
7FC18D899400 00000000 00000000 00000000 00000000 [................]
  Repeat 27 times
7FC18D8995C0 FEEE0001 0001FFFF FFFF0000 00000FFF [................]
7FC18D8995D0 00000000 00000000 00000000 00000000 [................]
  Repeat 1 times
7FC18D8995F0 00000000 00000000 00000000 AA550000 [..............U.]
7FC18D899600 20494645 54524150 00010000 0000005C [EFI PART....\...]
7FC18D899610 BD82BBB3 00000000 00000001 00000000 [................]
7FC18D899620 0FFFFFFF 00000000 00000022 00000000 [........".......]
7FC18D899630 0FFFFFDE 00000000 FD8857E5 42D7B49B [.........W.....B]
7FC18D899640 0901FA87 6B3DB5AA 00000002 00000000 [......=k........]
7FC18D899650 00000080 00000080 FE48EB77 00000000 [........w.H.....]
7FC18D899660 00000000 00000000 00000000 00000000 [................]
  Repeat 25 times
7FC18D899800 EBD0A0A2 4433B9E5 B668C087 C79926B7 [......3D..h..&..]
7FC18D899810 5381F6DF 4626F988 0E4F468D D78D3B28 [...S..&F.FO.(;..]
7FC18D899820 000007A1 00000000 0FFFF85F 00000000 [........_.......]
7FC18D899830 00000000 00000000 00720070 006D0069 [........p.r.i.m.]
7FC18D899840 00720061 00000079 00000000 00000000 [a.r.y...........]
7FC18D899850 00000000 00000000 00000000 00000000 [................]
 Repeat 186 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
 
2) Above, kfed reports that the “ASMRECO03” disk was overlapped by the OS; it shows “EFI PART”, which is OS partition metadata:
 
7FC18D899600 20494645 54524150 00010000 0000005C [EFI PART....\...]
 
3) The underlying disk associated with the “ASMRECO03” ASMLIB disk was reformatted by root and assigned to the OS filesystem.
 
4) This action corrupted/destroyed the RECO diskgroup.
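This kind of overlap can also be spotted without kfed by scanning the first blocks of the device for foreign OS signatures; a minimal sketch, demonstrated against a scratch file standing in for the real /dev device:

```shell
# Scratch file standing in for the suspect device (e.g. /dev/sdX).
DISK=/tmp/fakedisk.img
printf 'xxxx EFI PART xxxx' > "$DISK"
# Read the first 32K and look for non-ASM signatures written by the OS.
dd if="$DISK" bs=512 count=64 2>/dev/null | strings | grep -E 'EFI PART|LVM2|LABELONE|NTFS'
```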
 
 
 
 
SOLUTION
The diskgroup needs to be recreated, because the “ASMRECO03” disk was overlapped by the OS filesystem.
 

Oracle ASM KFED Reports “KFBTYP_INVALID” & OS Metadata [LVM2 001] In "/dev/" Disk /ASM disk Member (ASM Disk Overlapping : Scenario #2).

APPLIES TO:
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 12.1.0.1 [Release 10.2 to 12.1]
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Generic (Platform Independent)
SYMPTOMS
1) ASM diskgroup cannot be mounted due to the next error:
 
 
SQL> alter diskgroup <DGNAME> mount;
alter diskgroup <DGNAME> mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing from group number "2"
 
 
2) ASM reports block corruption:
 
 
Wed Apr 09 01:26:05 2014
NOTE: SMON starting instance recovery for group <DGNAME> domain 2 (mounted)
NOTE: F1X0 found on disk 0 au 2 fcn 0.469914
NOTE: SMON skipping disk 1 - no header
NOTE: starting recovery of thread=2 ckpt=2.10251 group=2 (<DGNAME>)
WARNING: ASM recovery read a corrupted ACD block 21004
NOTE: a corrupted block was dumped to the trace file
ORA-15196: invalid ASM block header [kfr.c:8098] [endian_kfbh] [3] [21004] [0 != 1]
ERROR: ASM recovery failed to read ACD block 21004
NOTE: cache initiating offline of disk 1 group <DGNAME>
NOTE: process _smon_+asm1 (26726) initiating offline of disk 1.3915939526 (<DGNAME>_0001) with mask 0x7e in group 2
 
 
CAUSE
1) The “/dev/<DISK #1>” (<DGNAME>_0001) disk was overlapped by an OS volume; it shows OS metadata associated with the “LVM2 001” logical volume (all the ASM metadata was wiped out):
 
 
$ kfed read <DGNAME>_0001_<DISK #1>.dump | head -25
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: blk=0
kfbh.block.obj: 0 ; 0x008: file=0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
2ABD671E9400 00000000 00000000 00000000 00000000 [................]
  Repeat 31 times
2ABD671E9600 4542414C 454E4F4C 00000001 00000000 [LABELONE........]
2ABD671E9610 E4E1DDB1 00000020 324D564C 31303020 [.... ...LVM2 001]
2ABD671E9620 50365A77 71327874 34303156 4B4E6136 [wZ6Ptx2qV1046aNK]
2ABD671E9630 35395159 5147634C 487A5A38 63575A37 [YQ95LcGQ8ZzH7ZWc]
2ABD671E9640 00000000 00000019 00030000 00000000 [................]
2ABD671E9650 00000000 00000000 00000000 00000000 [................]
2ABD671E9660 00000000 00000000 00001000 00000000 [................]
2ABD671E9670 0002F000 00000000 00000000 00000000 [................]
2ABD671E9680 00000000 00000000 00000000 00000000 [................]
  Repeat 215 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
 
2) The “/dev/<DISK #1>” disk was used to create a new logical OS volume while it was already assigned to an ASM diskgroup.
 
 
3) This overlapping corrupted the "/dev/<DISK #1>" (<DGNAME>_0001) disk.
 
 
SOLUTION
 
The <DGNAME> diskgroup needs to be recreated and the database files restored from backup, because the diskgroup was overlapped by the OS. In other words, the corruption came from outside Oracle; it cannot be repaired, since the OS volume overwrote the data in the “/dev/<DISK #1>” disk.

Oracle ASM disk group is not mounted on second node|| showing corruption

Hello Experts,
 
 
Envirionment :
OS: RHEL 5.6
Oracle :11.2.0.3 + PSU 5
 
 
I had an issue with a disk group called DATA. The disk group is mounted successfully on the first node of a two-node RAC. When I tried to mount the disk group on the second node, I got:
 
ASMCMD> mount data
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA" (DBD ERROR: OCIStmtExecute)
ASMCMD>
 
I have verified the permissions and compared them with the first node. Everything looks correct.
 
When I ran kfed to read the disk from the second node, I got the following error complaining about corruption. If I run the same command on the first node of the cluster, it succeeds. I am not sure how that can happen if both nodes are reading the same device.
 
 
db3: /opt/app/oracle/diag/asm/+asm/+ASM2/trace/amdu_2013_05_31_18_33_53 # kfed read /dev/raw/raw4
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
2B48D7FCC400 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
 
Please advise, so we can find the root cause.
 
 
1) It seems that the block device names bound to the raw devices in question are no longer referencing the original raw devices; they were relocated/reassigned. This situation usually occurs when new disks are added to the system.
 
2) Please obtain the following output from both the affected and the healthy node and provide it to us:
 
Healthy node:
======================================================
script /tmp/node_ok.txt
 
raw -qa
 
cat /proc/partitions
 
exit
======================================================
 
Affected node:
 
======================================================
script /tmp/node_bad.txt
 
raw -qa
 
cat /proc/partitions
 
exit
======================================================
 
 
 
 
 
1) The best approach to guarantee device persistence (as our colleague Natalka mentioned previously) is to use ASMLIB, as described below:
 
 
Device Persistence with Oracle Linux ASMLib  
 
ASMLib is a support library for the Automatic Storage Management feature of Oracle Database 10g. Oracle provides a Linux specific implementation of this library. This document describes some advantages this ASMLib brings to Linux system administration. 
 
Device Persistence with Oracle Linux ASMLib 
 
 
 
This document describes some advantages the Linux-specific ASM library provided by Oracle (herein "ASMLib") brings to the administration of a Linux system running Oracle. Linux often presents the challenge of disk name persistence. Change the storage configuration, and a disk that appeared as /dev/sdg yesterday can appear as /dev/sdh after a reboot today. How can these changes be isolated so that they do not affect ASM?
 
Why Not Let ASM Scan All Disks?                                        
 
 
ASM scans all disks it is allowed to discover (via the asm_diskstring). Why not scan all the disks and let ASM determine which ones it cares about, rather than worrying about disk name persistence?
 
 
The question is notionally correct. If you pass /dev/sd* to ASM, and ASM can read the devices, ASM can indeed pick out its disks regardless of whether /dev/sdg has changed to /dev/sdh on this particular boot.
 
However, to read these devices, ASM has to have permission to read them. That means ASM has to have user or group ownership on all /dev/sd* devices, including any system disks. Most system administrators do not want the oracle user owning system disks just so ASM can ignore them. The potential for mistakes (a DBA writing over the /home volume, etc.) is way too high.
 
 
ASMLib vs UDev or DevLabel 
 
There are various methods to provide names that do not change, including devlabel and udev. What does ASMLib provide that these solutions do not?
 
The bigger problem is not specifically a persistent name - it is matching that name to a set of permissions. It doesn't matter if /dev/sdg is now /dev/sdh, as long as the new /dev/sdh has oracle:dba ownership and the new /dev/sdg - which used to be /dev/sdf - has the ownership the old /dev/sdf used to have. The easiest way to ensure that permissions are correct is persistent naming. If a disk always appears as the same name, you can always apply the same permissions to it without worrying. In addition, you can then exclude names that match system disks. Even if the permissions are right, a system administrator isn't going to want ASM scanning system disks every time.
 
Now, udev or devlabel can handle keeping sdg as sdg (or /dev/mydisk, whatever). What does ASMLib add? A few things, actually. With ASMLib, there is a simple command to label a disk for ASM. With udev, you'll have to modify the udev configuration file for each disk you add. You'll have to determine a unique id to match the disk and learn the udev configuration syntax.
 
The name is also human-readable. With an Apple XServe RAID, why have a disk named /dev/sdg when it can be DRAWER1DISK2? ASMLib can also list all disks, whereas with udev you have to either know in your head that sdg, sdf, and sdj are for ASM, or you have to provide names. With ASMLib, there is no chance of ASM itself scanning system disks. In fact, ASMLib never modifies the system's names for disks. ASMLib never uses the name "/dev/sdg". After querying the disks at boot time, it provides its own access to the devices with permissions for Oracle. /dev/sdg is still owned by root:root, and the oracle user still cannot access the device by that name.
 
The configuration is persistent. Reinstall a system and your udev configuration is gone. ASMLib's labels are not. With udev, you have to copy the configuration over to the other nodes in a RAC. If you have sixteen nodes, you have to copy each configuration change to all sixteen nodes. Whether you use udev or devlabel, you have to set the permissions properly on all sixteen nodes. ASMLib just requires one invocation of "/etc/init.d/oracleasm scandisks" to pick up all changes made on the other node.
 
These are just a few of the benefits ASMLib brings to device persistence.
 
                                    
2) An example about how to setup the ASMLIB is described in the following document (I wrote it):
 
 
Note: 580153.1 How To Setup ASM on Linux Using ASMLIB Disks, Raw Devices, Block Devices or UDEV Devices?

 

Oracle ASM ORA-15063 / ORA-15042 - TROUBLESHOOTING STEPS BEFORE OPENING a SR to Oracle Support

APPLIES TO:
Oracle Database - Enterprise Edition
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Information in this document applies to any platform.
 
 
 
PURPOSE
 
 
Self-debugging steps when a diskgroup cannot be mounted due to error ORA-15063:
 
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "%s"
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "%" is missing 
 
 
TROUBLESHOOTING STEPS
SECTION A - Getting started
Start by referring to NOTE 452770.1 "TROUBLESHOOTING - ASM disk not found/visible/discovered issues".
First, identify all disks that are part of the affected diskgroup by looking at the last successful mount in alert_+ASM*.log.
 
You should search for a section as below:
SQL> ALTER DISKGROUP <DGNAME1> MOUNT /* asm agent *//* {0:0:214} */
NOTE: cache registered group DATA number=1 incarn=0x44bef6bb
NOTE: cache began mount (not first) of group DATA number=1 incarn=0x44bef6bb
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,0) to disk (ORCL:DATA01P)
NOTE: Assigning number (1,1) to disk (ORCL:DATA02P)
NOTE: Assigning number (1,2) to disk (ORCL:DATA03P)
NOTE: Assigning number (1,3) to disk (ORCL:DATA04P)
NOTE: Assigning number (1,4) to disk (ORCL:DATA05P)
..
NOTE: cache opening disk 0 of grp 1: DATA01P label:DATA01P
NOTE: cache opening disk 1 of grp 1: DATA02P label:DATA02P
..
SUCCESS: DISKGROUP <DGNAME1> was mounted
 
 
NOTE: When ASMLIB is not used the path to ASM disk is specified within the mount section:
 NOTE: cache opening disk 1 of grp 1: REDO3_0001 path:/dev/mpath/3600601600ba12c00d4b784363e69e211 
 NOTE: cache opening disk 2 of grp 1: REDO3_0002 path:/dev/mpath/3600601600ba12c00d4b784363e69e212 
 ...
 
 
Isolate the device(s) reported as "missing" as note 452770.1 suggests.
 
Finally start your checks as follow:
 
A1) Check whether any IO/storage/multipathing errors are reported in the OS logs - investigate and fix them.
This step is mandatory, as ORA-15063/ORA-15042 are usually caused by underlying IO/storage errors.
 
A2) Check whether the devices used by ASM disks are properly presented and configured at the OS level.
If "ORA-15075: disk(s) are not visible cluster-wide" is additionally reported, make sure that all devices are visible cluster-wide.
 
A3) Check that all ASM disks have appropriate permissions (e.g. they should be owned by the grid owner).
If the ownership of the ASM disk(s) has been changed for whatever reason, please correct it.
 
A4) Check if/how the "missing" device(s) is reported when querying V$ASM_DISK
-----------------------------------------------------------------------------------
If the device(s) is reported with status:
 
=> "PROVISIONED/CANDIDATE" - this means the header of the ASM disk(s) is damaged.
 
    -> Investigate the I/O problems behind the corruption - see step A1. Oracle never wipes out its own metadata: a checksum is computed for every write before it is accepted.
 
    -> check the header status, in order to confirm the damage:   
$> kfed read <path_to_your_missing_devices>
       
        kfbh.endian:                          0 ; 0x000: 0x00
        kfbh.hard:                            0 ; 0x001: 0x00
        kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
        kfbh.datfmt:                          0 ; 0x003: 0x00
        kfbh.block.blk:                       0 ; 0x004: blk=0
        kfbh.block.obj:                       0 ; 0x008: file=0
        ....
         
    ->  try to repair the header and see if diskgroup can be mounted:                  
$> kfed repair <path_to_your_missing_devices>
 
    -> Check whether there are additional corruptions reported by ASM (e.g. ORA-15196) or by your database, as I/O/storage problems could affect more than one block.
    If any corruption is seen, please open an SR with Oracle Support.
 
 
 NOTE:  
 1) When a non-default AU size is used, ausz=<au_size> must be specified with each kfed command.
 2) "kfed repair" works only on releases that keep a disk header backup (10.2.0.5, 11.1.0.7, 11.2 and onwards).
 
 
=> "UNKNOWN/IGNORED" - this means the ASM disk(s) is not seen at the OS level.
    -> Review steps A1, A2 and A3.
-----------------------------------------------------------------------------------   
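The two kfed reads above can be scripted across every disk of the diskgroup. Below is a minimal sketch: the helper only parses kfed output, and the driving loop at the end (with a placeholder path) is an assumption about where your disks live:

```shell
# check_asm_header: read `kfed read <device>` output on stdin and classify the
# header as OK (KFBTYP_DISKHEAD), DAMAGED (KFBTYP_INVALID) or UNKNOWN.
check_asm_header() {
  awk '
    /kfbh.type/ && /KFBTYP_DISKHEAD/ { print "OK";      found = 1 }
    /kfbh.type/ && /KFBTYP_INVALID/  { print "DAMAGED"; found = 1 }
    END { if (!found) print "UNKNOWN" }
  '
}

# Example driving loop (path is a placeholder):
# for d in <your_path_to_asm_disks>/*; do
#   echo "$d: $(kfed read "$d" | check_asm_header)"
# done
```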
 
A5) Check whether asm_diskstring is still properly set.
 
On Windows configurations, you can also refer to NOTE 880061.1 "ASM Is Unable To Detect SCSI Disks On Windows"    
      
SECTION B - ASMLIB is used
When ASMLIB is used, follow the steps above (Section A) and also check the errors associated with ORA-15063:
 
B1) ORA-15183 Unable to initialize the ASMLIB in oracle/ORA-15183: ASMLIB initialization error [driver/agent not installed]
 
Refer: NOTE 340519.1 Cannot Start ASM Ora-15063/ORA-15183
 
B2) ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
   
Check your ASMLIB health:
 
 => correctness of the installed RPMs
 
 => correctness of symlinks - all nodes should show:
   
    # ls -l  /etc/sysconfig/oracleasm
       lrwxrwxrwx 1 root root 24 Sep 18 22:10 /etc/sysconfig/oracleasm -> oracleasm-_dev_oracleas
       
 => correctness of the ASMLIB configuration (/etc/sysconfig/oracleasm) when multipathing is used:
 
     # ORACLEASM_SCANORDER: Matching patterns to order disk scanning
        ORACLEASM_SCANORDER="dm"
     # ORACLEASM_SCANEXCLUDE: Matching patterns to exclude disks from scan
        ORACLEASM_SCANEXCLUDE="sd"
 
B3) Check if ASMLIB disks are listed under /dev/oracleasm/disks
 
=> Devices under /dev/oracleasm/disks/* must be reported as dm devices (not single-path sd* devices) on all nodes. If not, please correct that (see step B2).
   
$> ls -al /dev/oracleasm/disks
 
brw-rw---- 1 grid dba 253, 29 Feb 12 11:44 /dev/oracleasm/disks/DATA01P
brw-rw---- 1 grid dba 253, 35 Feb 12 11:44 /dev/oracleasm/disks/DATA02P
brw-rw---- 1 grid dba 253, 27 Feb 15 16:04 /dev/oracleasm/disks/DATA03P
brw-rw---- 1 grid dba 253, 24 Feb 12 11:44 /dev/oracleasm/disks/DATA04P
brw-rw---- 1 grid dba 253, 25 Feb 12 11:44 /dev/oracleasm/disks/DATA05P
 
 
=> If one of your ASMLIB disks is missing from the above output, first try to re-scan the devices, as root:
 # /etc/init.d/oracleasm scandisks
 
 
=> If the ASMLIB disk(s) is still missing from /dev/oracleasm/disks, engage your sysadmin to investigate (see steps A1, A2, A3).
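A quick way to spot a single-path device in that listing is to look at the block-device major number. The helper below assumes the common device-mapper major of 253, which is assigned dynamically, so confirm it on your system first with `grep device-mapper /proc/devices`:

```shell
# check_dm_major: read `ls -al /dev/oracleasm/disks` output on stdin and flag
# any entry whose block-device major is not the device-mapper major.
# NOTE: 253 is the usual dm major, but it is assigned dynamically --
#       verify with: grep device-mapper /proc/devices
check_dm_major() {
  dm_major="${1:-253}"
  awk -v maj="$dm_major" '
    /^b/ {                          # block-device entries only
      m = $5; sub(/,$/, "", m)      # strip trailing comma from the major field
      print $NF, (m == maj ? "dm" : "NOT-dm")
    }
  '
}

# Example: ls -al /dev/oracleasm/disks | check_dm_major
```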
 
B4) Check if ASMLIB disk(s) has the correct ASMLIB stamp and status:
 
 $> kfed read <ASMLIB_device> |grep provstr
      kfdhdb.driver.provstr: ORCLDISK<diskname> ; 0x000: length=20
 
 $> kfed read <ASMLIB_device> | egrep 'kfbh.type|kfdhdb.dskname|kfdhdb.hdrsts'
      kfbh.type:      1 ; 0x002: KFBTYP_DISKHEAD 
      kfdhdb.dskname: DATA01P ; 0x028: length=14
      kfdhdb.hdrsts:  3 ; 0x027: KFDHDR_MEMBER     
     
=> If the output is "kfdhdb.driver.provstr: ORCLCLRD" (but kfdhdb.hdrsts = KFDHDR_MEMBER and kfbh.type = KFBTYP_DISKHEAD), then your disk was deleted using "oracleasm deletedisk".
 
 
 
=> If kfbh.type = KFBTYP_INVALID, see step A4 and check whether "kfed repair" can fix the problem.
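These provstr checks can also be wrapped in a helper that parses kfed output (a sketch; it encodes only the ORCLDISK/ORCLCLRD stamp states described above):

```shell
# check_asmlib_stamp: read `kfed read <device>` output on stdin and report the
# state of the ASMLIB provision string.
check_asmlib_stamp() {
  awk '
    /kfdhdb.driver.provstr/ {
      if      ($2 ~ /^ORCLCLRD/) print "CLEARED (oracleasm deletedisk was run)"
      else if ($2 ~ /^ORCLDISK/) print "STAMPED"
      else                       print "NO-ASMLIB-STAMP"
    }
  '
}

# Example: kfed read <ASMLIB_device> | check_asmlib_stamp
```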
 
 
B5) Also refer to the documents below:
 
NOTE: 398622.1     ORA-15186: ASMLIB error function = [asm_open], error = [1], mesg = [Operation not permitted]
NOTE: 1384504.1   Mount ASM Disk Group Fails : ORA-15186, ORA-15025, ORA-15063  
NOTE: 967461.1    "Multipath: error getting device" seen in OS log causes ASM/ASMlib to shutdown by itself
NOTE: 1526920.1   ORA-15186 ORA-15063 on node 2
SECTION C - Additional notes to review
If the above checks are done but the error still persists, please also review the notes below, depending on your configuration/situation:
 
NOTE:  577526.1     ORA-15063 ASM Discovered An Insufficient Number Of Disks For Diskgroup using NetApp Storage
NOTE:  784776.1     ORA-15063 When Mounting a Diskgroup After Storage Cloning ( BCV / Split Mirror / SRDF / HDS / Flash Copy )
NOTE:  555918.1     ORA-15038 On Diskgroup Mount After Node Eviction
NOTE:  1484723.1   ASM Candidate Raw Device Is Not Presented As A RAC Cluster Wide Shared character Devices On Unix.
NOTE:  1534211.1   ORA-15017 and ORA-15063 errors for unused diskgroups in 11.2
NOTE:  1487443.1   Mounting Diskgroup Fails With ORA-15063 and V$ASM_DISK Shows PROVISIONED
NOTE:  742832.1     AIX:After changing Multipathing drivers from RDAC to MPIO ASM discovered an insufficient number of disks
NOTE:  1276913.1   Unable to discover or use raw devices for ASM in HP-UX Itanium in 11.2.0.2 ( ORA-15063 )
SECTION D - Information to collect when opening an SR
If you are not able to fix the problem on your own, please collect the information below and raise an SR with Oracle Support
 
D1) alert_+ASM*.log (from all nodes if RAC)
 
D2) script#1 from NOTE 470211.1 How To Gather/Backup ASM Metadata In A Formatted Manner version 10.1, 10.2, 11.1 & 11.2?
 
D3) KFED reports
 
 
#!/bin/sh
rm -f /tmp/kfed_DH.out /tmp/kfed_BK.out
for i in <your_path_to_asm_disks>/*
do
  echo "$i" >> /tmp/kfed_DH.out
  kfed read "$i" >> /tmp/kfed_DH.out                  # disk header
  echo "$i" >> /tmp/kfed_BK.out
  kfed read "$i" aun=1 blkn=254 >> /tmp/kfed_BK.out   # disk header backup copy
done
 
Run kfed.sh as the GRID/ASM owner. Upload /tmp/kfed_DH.out and /tmp/kfed_BK.out.
! Pay attention to non-default AU sizes - if a non-default AU size is used, you must specify it (ausz=<au_size>). (See note 1485597.1 "ASM tools used by Support : KFOD, KFED, AMDU".)
 
 
D4) ASMLIB information
NOTE : 869526.1 Collecting The Required Information For Support To Troubleshoot ASM/ASMLIB Issues.
 
D5) List of your ASM devices
 
   $> ls -al <path_to_ASM_devices>
 
D6) OS logs (from all nodes if this is RAC configuration)
 
SECTION E - Disk is reported as MISSING after a failed disk addition
 If you are facing ORA-15063 after a failed disk addition, please collect the information below and raise an SR with Oracle Support
 
E1) alert_+ASM*.log (from all nodes if RAC)
 
E2) script#1 from NOTE 470211.1 How To Gather/Backup ASM Metadata In A Formatted Manner version 10.1, 10.2, 11.1 & 11.2?
 
E3) KFED reports
#!/bin/sh
rm -f /tmp/kfed_*.out
for i in <your_path_to_asm_disks>/*
do
  echo "$i" >> /tmp/kfed_DH.out
  kfed read "$i" >> /tmp/kfed_DH.out                  # disk header
  echo "$i" >> /tmp/kfed_BK.out
  kfed read "$i" aun=1 blkn=254 >> /tmp/kfed_BK.out   # disk header backup copy
  echo "$i" >> /tmp/kfed_PST.out
  kfed read "$i" aun=1 blkn=2 >> /tmp/kfed_PST.out    # PST
  echo "$i" >> /tmp/kfed_FS.out
  kfed read "$i" blkn=1 >> /tmp/kfed_FS.out           # free space table
  echo "$i" >> /tmp/kfed_FD.out
  kfed read "$i" aun=2 blkn=1 >> /tmp/kfed_FD.out     # file directory
  echo "$i" >> /tmp/kfed_DD.out
  kfed read "$i" aun=2 blkn=0 >> /tmp/kfed_DD.out     # disk directory; with a large number of disks more blocks might be needed -> Oracle Support may ask for them later
done
 
Run kfed.sh as the GRID/ASM owner. Upload /tmp/kfed_*.out.
! Pay attention to non-default AU sizes - if a non-default AU size is used, you must specify it (ausz=<au_size>). (See note 1485597.1 "ASM tools used by Support : KFOD, KFED, AMDU".)
 
 
E4) AMDU output
 
amdu -diskstring '<ASM_DISKSTRING>' -dump '<DISKGROUP_NAME>' -noimage
amdu -diskstring '<ASM_DISKSTRING>' -print <DISKGROUP_NAME>.F2.V0.C2 > DG.amdu
#### F2.V0.C2 --> This extracts information for up to 16 disks only. If there is a larger number of disks, a larger dump is needed.
 

TROUBLESHOOTING - Oracle ASM disk not found/visible/discovered issues

APPLIES TO:
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Information in this document applies to any platform.
PURPOSE
This note will assist in troubleshooting disk not found / visible / discovered issues with ASM
 
Typically this means that the disk in question is not found in the v$asm_disk view
 
Common errors indicating that a disk is missing/not found are :
 
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "%s"
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "%s" is missing
 
TROUBLESHOOTING STEPS
If an existing diskgroup is suddenly missing a disk ...
 
Determine what disk is missing
 
If on non-RAC ... or ALL RAC nodes exhibit these errors
 
    a) Locate and open the ASM alert log (this may require multiple logs for RAC)
    b) Locate the last successful mount of the diskgroup (this will show a list of disks)
    c) Locate each successful ALTER DISKGROUP ... ADD DISK since that time
    d) Combine the lists of disks found in b) and c) above
    e) Compare this list against those shown in V$ASM_DISK and determine the missing disk(s)
 
If on RAC and at least one node can mount the diskgroup
 
    Compare the V$ASM_DISK entries on the node that can mount the diskgroup to those on the nodes that cannot; this will show which disk(s) is missing
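The comparison can be done mechanically with `comm` on two spooled path lists (a sketch; the spool files are assumptions — e.g. `select path from v$asm_disk;` spooled to a file on each node):

```shell
# missing_disks: given two disk-path lists (one per node), print the paths
# the first node sees but the second does not.
missing_disks() {
  good="$1"; bad="$2"
  sort "$good" > /tmp/good.sorted
  sort "$bad"  > /tmp/bad.sorted
  comm -23 /tmp/good.sorted /tmp/bad.sorted   # lines only in the first file
}

# Example: missing_disks node1_paths.txt node2_paths.txt
```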
 
 
The following is a set of steps that will assist in resolving disk not found issues
 
1) ASM_DISKSTRING is not set correctly
 
    Examine the ASM_DISKSTRING setting in the parameter file or via SHOW PARAMETER
    If ASM_DISKSTRING is NOT SET ... then the following default is used
 
Default ASM_DISKSTRING per OS
 
    Operating System                      Default Search String
    =====================================================
    Solaris (32/64 bit)                   /dev/rdsk/*
    Windows NT/XP                         \\.\orcldisk*
    Linux (32/64 bit)                     /dev/raw/*
    Linux (ASMLIB)                        ORCL:*
    Linux (ASMLIB)                        /dev/oracleasm/disks/* (as a workaround)
    HP-UX                                 /dev/rdsk/*
    Tru64 UNIX                            /dev/rdisk/*
    AIX                                   /dev/rhdisk*
 
    If ASM_DISKSTRING is SET ... then verify that the setting includes the disks that need to be seen by ASM
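To confirm what a given ASM_DISKSTRING actually matches at the OS level, the glob can be expanded by the shell (a sketch; pass the string quoted so the function, not your interactive shell, expands it):

```shell
# list_diskstring_matches: expand an ASM_DISKSTRING-style glob and print every
# path that exists, so you can verify the string covers the needed disks.
list_diskstring_matches() {
  for p in $1; do                 # intentionally unquoted: let the glob expand
    [ -e "$p" ] && echo "$p"
  done
}

# Example: list_diskstring_matches '/dev/oracleasm/disks/*'
```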
 
 
2) Operating system drive ownership
 
     Make sure that the disk is owned by the OS user who installed the ASM Oracle Home ... and that the disk is mounted correctly (with the correct owner)
 
3) Operating system drive permissions
 
   Make sure that the permissions are set correctly at the disk level ... 660 is normal ... but if there are problems use 777 as a test
 
4) RAC is being used
 
     If RAC is in use ... then ALL disks need to be visible on all nodes where ASM is / will be running ... before an attempt is made to add the disks to a diskgroup
 
5) Use OS utilities to determine which disk cannot be found
 
TRUSSing or STRACEing the RBAL process while selecting * from v$asm_disk can often show errors in the path of the command
 
EXAMPLE:
=========
SESSION #1
 
strace -f -o /tmp/rbal.trc -p <OS pid of RBAL process>
  <OR>
truss -ef -o /tmp/rbal.out -p <OS pid for RBAL process>
 
SESSION #2
 
select * from v$asm_disk
 
SESSION #3
 
<CTRL-C>
 
Examine the rbal.out for errors: For example,
 
1147090: 1871929: chdir("dev/") = 0
1147090: 1871929: statx("rhdisk8, ", 0x0FFFFFFFFFFFAA80, 176, 010) Err#2 ENOENT
 
<< This says that rhdisk8 cannot be found >>
 
 
NOTE ... If a crash occurred during an add or drop of a disk ... then the disk(s) in question may still be part of the diskgroup ... so all of the steps above need to take these disks into consideration
 
PORT SPECIFIC ISSUES
 
1) HEWLETT PACKARD (HP)
 
If this problem is occurring on HP (RISC or Itanium) ... and all of the above are not helping ... and the customer is using HP Logical Volume Manager (LVM) then
 
  Note.433770.1 Cannot Discover Disks in ASM After Upgrade on 10.2.0.3 on HP-UX  Itanium
  Note.434500.1  When Starting A Database With The 10.2.0.3 Executables, Error "ORA-15059 invalid device type for ASM disk"
 
2) IBM AIX
 
  Note.353761.1 Assigning a Physical Volume ID (PVID) To An Existing ASM Disk Corrupts the ASM Disk Header
 
NOTE:1174604.1 - ASM Is Not Detecting EMC PowerPath Raw Devices Or Regular Raw Devices On AIX
 
3) SOLARIS
 
  Note.368840.1 ASM does not discover disk on Solaris platform
 
4) LINUX
 
  Note.457369.1 ASM is Unable to Detect ASMLIB Disks/Devices.

How To Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) Oracle ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onwards

APPLIES TO:
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Information in this document applies to any platform.
GOAL
The present document provides an example about “how to restore/repair/fix an overwritten ASM Disk Header (first 4K) on 11.1.0.7 and Onwards”.
 
SOLUTION
A copy of the ASM disk header (first 4K) exists on 10.2.0.5, 11.1.0.7, 11.2 and onwards. It can be used to restore a valid ASM disk header, assuming only the first 4K of the disk were affected/overwritten. To restore the ASM disk header (assuming the automatic ASM disk header backup is in good shape), perform the next steps:
 
 
1) Backup the first 50MB of the affected disk (this step is mandatory):
 
$> dd if=<full path affected disk name> of=/tmp/<affected disk name>.dump bs=1048576 count=50
 
Example:
 
[grid@dbaasm ~]$ dd if=/dev/oracleasm/disks/ASMDISK2 of=/tmp/ASMDISK2.dump bs=1048576 count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 0.667474 seconds, 78.5 MB/s
 
 
Where "/dev/oracleasm/disks/ASMDISK2" is the affected ASM disk member.
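Should a later repair attempt make things worse, this saved image can be written back over the start of the disk. A minimal sketch (note that if= and of= are swapped relative to the backup command, and the device path is illustrative — triple-check both before running against a real disk):

```shell
# restore_header_backup: write a previously saved dd image back over the start
# of a device -- the reverse of the backup in step 1. conv=notrunc leaves the
# rest of the target untouched.
restore_header_backup() {
  dump="$1"; target="$2"
  dd if="$dump" of="$target" bs=1048576 conv=notrunc 2>/dev/null
}

# Example (illustrative path -- verify if=/of= before running!):
# restore_header_backup /tmp/ASMDISK2.dump /dev/oracleasm/disks/ASMDISK2
```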
 
2)  Collect the Allocation Unit Size (kfdhdb.ausize) from another healthy disk member (from the same affected diskgroup):
 
$> <ASM Oracle Home>/bin/kfed read <full path healthy disk name> | egrep 'ausize|dsknum|dskname|grpname|fgname'  
 
Example:
 
[grid@dbaasm ~]$ kfed read /dev/oracleasm/disks/ASMDISK1  | egrep 'ausize|dsknum|dskname|grpname|fgname'  
 
kfdhdb.dsknum:                        0 ; 0x024: 0x0000
kfdhdb.dskname:                ASMDISK1 ; 0x028: length=8
kfdhdb.grpname:                    DATA ; 0x048: length=4
kfdhdb.fgname:                 FG1_SAN1 ; 0x068: length=8
kfdhdb.ausize:                  2097152 ; 0x0bc: 0x00200000
 
 
Note: In this example, the diskgroup was created using an AU_SIZE=2M (2097152 ) & "/dev/oracleasm/disks/ASMDISK1" is the healthy ASM disk member .
 
3) Then restore the ASM disk header from backup as follows:
 
$> <ASM Oracle Home>/bin/kfed repair <full path affected disk name> ausz=<AU size from point #2>
 
Example:
 
[grid@dbaasm ~]$ kfed repair /dev/oracleasm/disks/ASMDISK2  ausz=2097152
 
4) Verify that the ASM disk header in the affected disk was recreated/restored:
 
$> <ASM Oracle Home>/bin/kfed read <full path affected disk name>  | head -40
 
Example:
 
[grid@dbaasm ~]$  kfed read /dev/oracleasm/disks/ASMDISK2  | head -40
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483650 ; 0x008: disk=2
kfbh.check:                  4052202307 ; 0x00c: 0xf187b343
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr: ORCLDISKASMDISK2 ; 0x000: length=16
kfdhdb.driver.reserved[0]:   1145918273 ; 0x008: 0x444d5341
kfdhdb.driver.reserved[1]:    843797321 ; 0x00c: 0x324b5349
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                186647296 ; 0x020: 0x0b200300
kfdhdb.dsknum:                        2 ; 0x024: 0x0002
kfdhdb.grptyp:                        2 ; 0x026: KFDGTP_NORMAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:                ASMDISK2 ; 0x028: length=8
kfdhdb.grpname:                    DATA ; 0x048: length=4
kfdhdb.fgname:                 FG2_SAN2 ; 0x068: length=8
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             32974423 ; 0x0a8: HOUR=0x17 DAYS=0x12 MNTH=0x9 YEAR=0x7dc
kfdhdb.crestmp.lo:           1180930048 ; 0x0ac: USEC=0x0 MSEC=0xe4 SECS=0x26 MINS=0x11
kfdhdb.mntstmp.hi:             33003184 ; 0x0b0: HOUR=0x10 DAYS=0x15 MNTH=0x5 YEAR=0x7de
kfdhdb.mntstmp.lo:           1230240768 ; 0x0b4: USEC=0x0 MSEC=0xff SECS=0x15 MINS=0x12
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  2097152 ; 0x0bc: 0x00200000
kfdhdb.mfact:                    228480 ; 0x0c0: 0x00037c80
kfdhdb.dsksize:                    9769 ; 0x0c4: 0x00002629
kfdhdb.pmcnt:                         2 ; 0x0c8: 0x00000002
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn:                      0 ; 0x0d4: 0x00000000
kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
.
.
.
.
 
5) Finally, mount the diskgroup:
 
SQL> alter diskgroup <diskgroup name> mount ;
 
Example:
 
 
SQL> alter diskgroup DATA mount;
 
Diskgroup altered.
 
 
Notes
Note 1: The solution provided in this document will work if the following conditions are true:
 
   a) Only the first 4K of the affected disk were overwritten/wiped out/overlapped.
 
   b) ASM disk header backup is in good shape.
 
 
 
Note 2: If this solution does not solve your problem, then do not attempt additional steps/actions on the affected diskgroup; please engage Oracle Support through a new Service Request to determine the root cause and possible solutions.
 
 
 
Note 3: An ASM disk with a corrupted "Disk Header" will report the following output: 
 
[grid@dbaasm ~]$ kfed read /dev/oracleasm/disks/ASMDISK2
 
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

Best Practice : Corruption in Oracle ASM Header

APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.4 and later
Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Information in this document applies to any platform.
PURPOSE
The purpose of this note is to provide a summary of Best Practices to be adopted while handling Corruption in ASM Header
 
SCOPE
This note applies to Grid Infrastructure for both Clustered and Standalone ASM environment.
 
DETAILS
I. My Oracle Support Document:  How To Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onwards
 
Document:1088867.1: This document provides an example about “how to restore/repair/fix an overwritten ASM Disk Header (first 4K) on 11.1.0.7 and Onwards”.
 
 
 
II. My Oracle Support Document: ASM Corruption: Case #1: How To Fix The ASM Disk HEADER_STATUS From FORMER or PROVISIONED to MEMBER
 
Document: 1448799.1: This document explains, in detail, the steps required (with an example) to fix the ASM disk header from FORMER or Provisioned to MEMBER under the below scenarios:
 
A) Diskgroup was dropped by accident using the "SQL> drop diskgroup <DG name>;" statement.
 
B) Or under strange situations as described in the next bug:
 
Bug:13331814 ASM DISKS TURNED INTO FORMER WHILE DISKGROUP IS MOUNTED.
 
REFERENCES
NOTE:1088867.1 - How To Restore/Repair/Fix An Overwritten (KFBTYP_INVALID) ASM Disk Header (First 4K) 10.2.0.5, 11.1.0.7, 11.2 And Onwards
NOTE:1448799.1 - ASM Corruption: Case #1: How To Fix The ASM Disk HEADER_STATUS From FORMER or PROVISIONED To MEMBER.

Bug 13331814 : Oracle ASM DISKS TURNED INTO FORMER WHILE DISKGROUP IS MOUNTED.

Abstract: ASM DISKS TURNED INTO FORMER WHILE DISKGROUP IS MOUNTED.
 
PROBLEM:
--------
While the 2 diskgroups were mounted on 3 of 4 ASM instances, 1 member disk 
(of each diskgroup) turned into FORMER, the disks associated with both 
diskgroups appear as MEMBER (3) & FORMER(2) (note the diskgroups were mounted 
while 2 disks  turned into FORMER):
======================================================
GROUP_NUMBER DISK_NUMBER HEADER_STATU MODE_ST OS_MB TOTAL_MB FREE_MB NAME 
FAILGROUP PATH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 0 MEMBER ONLINE 20468 20468 19335 DG_SCUT_ARCH1_0000 DG_SCUT_ARCH1_0000 
/dev/rdsk/c3t60A980006466654F476F64317648572Fd0s6
3 1 MEMBER ONLINE 20468 20468 19341 DG_SCUT_ARCH1_0001 DG_SCUT_ARCH1_0001 
/dev/rdsk/c3t60A980006466654F476F64317656586Dd0s6
3 2 FORMER ONLINE 20468 20468 19342 DG_SCUT_ARCH1_0002 DG_SCUT_ARCH1_0002 
/dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6   <(== HERE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14 0 MEMBER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0000 DG_MDSUT_ARCH2_0000 
/dev/rdsk/c3t60A980006466654C436F643955687064d0s6
14 1 FORMER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0001 DG_MDSUT_ARCH2_0001 
/dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6  <(== HERE
======================================================
 
DIAGNOSTIC ANALYSIS:
--------------------
Both disks show a FORMER status:
======================================================
+ASM:oracle> kfed read c3t60A980006466654F476F643176575A4Fd0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:      DG_SCUT_ARCH1_0002 ; 0x028: length=18
kfdhdb.grpname:           DG_SCUT_ARCH1 ; 0x048: length=13
======================================================
 
+ASM:oracle> kfed read c3t60A980006466654C436F6439556B4E37d0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:     DG_MDSUT_ARCH2_0001 ; 0x028: length=19
kfdhdb.grpname:          DG_MDSUT_ARCH2 ; 0x048: length=14
======================================================
 
 
1) This an 11.2.0.2.0 ASM configuration (4 RAC nodes).
 
2) DG_MDSUT_ARCH2 & DG_SCUT_ARCH1 diskgroups were mounted on 4 ASM instances:
======================================================
GROUP_NUMBER NAME STATE TYPE TOTAL_MB FREE_MB OFFLINE_DISKS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 DG_SCUT_ARCH1 MOUNTED EXTERN 40936 38676 0
14 DG_MDSUT_ARCH2 MOUNTED EXTERN 20468 20378 0
======================================================
 
3) While the 2 diskgroups were mounted on 3 of 4 ASM instances, 1 member disk 
(of each diskgroup) turned into FORMER, the disks associated with both 
diskgroups appear as MEMBER (3) & FORMER(2) (note the diskgroups were mounted 
while 2 disks  turned into FORMER):
======================================================
GROUP_NUMBER DISK_NUMBER HEADER_STATU MODE_ST OS_MB TOTAL_MB FREE_MB NAME 
FAILGROUP PATH
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 0 MEMBER ONLINE 20468 20468 19335 DG_SCUT_ARCH1_0000 DG_SCUT_ARCH1_0000 
/dev/rdsk/c3t60A980006466654F476F64317648572Fd0s6
3 1 MEMBER ONLINE 20468 20468 19341 DG_SCUT_ARCH1_0001 DG_SCUT_ARCH1_0001 
/dev/rdsk/c3t60A980006466654F476F64317656586Dd0s6
3 2 FORMER ONLINE 20468 20468 19342 DG_SCUT_ARCH1_0002 DG_SCUT_ARCH1_0002 
/dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6   <(== HERE
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14 0 MEMBER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0000 DG_MDSUT_ARCH2_0000 
/dev/rdsk/c3t60A980006466654C436F643955687064d0s6
14 1 FORMER ONLINE 20468 20468 20378 DG_MDSUT_ARCH2_0001 DG_MDSUT_ARCH2_0001 
/dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6  <(== HERE
======================================================
 
4) Diskgroups were not mounted on node #4 (+ASM4), since node #4 was 
evicted. After the node reboot, the 2 diskgroups (DG_MDSUT_ARCH2 
& DG_SCUT_ARCH1) cannot be mounted on this node. The 2 disks then showed 
header_status = 'FORMER'. 
 
 
5) Since this is ASM 11.2.0.2, I asked to fix the disks headers as follow 
(without success):
======================================================
5.1) Please backup all the database contained in the DG_SCUT_ARCH1 & 
DG_MDSUT_ARCH2 diskgroups and validate the backups.
 
5.2) Then shutdown all the database instances referencing the DG_SCUT_ARCH1 & 
DG_MDSUT_ARCH2 diskgroups (from all the nodes).
 
5.3) Then dismount the DG_SCUT_ARCH1 & DG_MDSUT_ARCH2 diskgroups from all the 
ASM instances and keep them dismounted on all the nodes:
======================================================
SQL> alter diskgroup  DG_MDSUT_ARCH2 dismount;
 
SQL>  alter diskgroup DG_SCUT_ARCH1  dismount;
======================================================
 
5.4) Then fix the disks as follow:
======================================================
$>kfed repair /dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6
======================================================
$> kfed repair /dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6
======================================================
 
5.5) Then mount the 2 diskgroup on all the ASM instances:
======================================================
SQL> alter diskgroup  DG_MDSUT_ARCH2 mount;
 
SQL>  alter diskgroup DG_SCUT_ARCH1  mount;
======================================================
 
 
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
 
SQL> alter diskgroup DG_MDSUT_ARCH2 mount;
alter diskgroup DG_MDSUT_ARCH2 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "1" is missing from group number "4"
 
 
SQL>  alter diskgroup DG_SCUT_ARCH1 mount;
alter diskgroup DG_SCUT_ARCH1 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "2" is missing from group number "4"
 
======================================================
*** 10/31/11 11:55 am ***
 
6) Then asked for the first 50 MB of the affected disks:
 
======================================================
$> dd if=/dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6 
of=/tmp/c3t60A980006466654F476F643176575A4Fd0s6.dump bs=1048576 count=50
 
 
$> dd if=/dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6 
of=/tmp/c3t60A980006466654C436F6439556B4E37d0s6.dump bs=1048576 count=50
======================================================
 
7) Then dump the disk header using kfed, but both disks showed a FORMER 
status:
======================================================
+ASM:oracle> kfed read c3t60A980006466654F476F643176575A4Fd0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:      DG_SCUT_ARCH1_0002 ; 0x028: length=18
kfdhdb.grpname:           DG_SCUT_ARCH1 ; 0x048: length=13
======================================================
 
+ASM:oracle> kfed read c3t60A980006466654C436F6439556B4E37d0s6.dump | egrep 
'grpname|dskname|hdrsts'
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
kfdhdb.dskname:     DG_MDSUT_ARCH2_0001 ; 0x028: length=19
kfdhdb.grpname:          DG_MDSUT_ARCH2 ; 0x048: length=14
======================================================
 
8) So, I edited the disk headers to change the status from FORMER to MEMBER:
======================================================
 
From:
 
kfdhdb.hdrsts:                        4 ; 0x027: KFDHDR_FORMER
======================================================
 
To:
 
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
 
======================================================
 
9) Then attempted to repair the disk header on both disks using the fixed 
header, as follows:
======================================================
kfed merge /dev/rdsk/c3t60A980006466654C436F6439556B4E37d0s6   
text=c3t60A980006466654C436F6439556B4E37d0s6_affected_DG_MDSUT_ARCH2_0001_kfed
_fix.txt
======================================================
 
kfed merge /dev/rdsk/c3t60A980006466654F476F643176575A4Fd0s6   
text=c3t60A980006466654F476F643176575A4Fd0s6_affected_DG_SCUT_ARCH1_0002_kfed_
fix.txt
======================================================
 
10) But the diskgroups could not be mounted due to a new ORA-15203 error (I 
double-checked they are using the correct 11.2.0.2 ASM release to mount the 
diskgroups):
======================================================
 
SQL> alter diskgroup DG_MDSUT_ARCH2 mount;
 
alter diskgroup DG_MDSUT_ARCH2 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15203: diskgroup DG_MDSUT_ARCH2 contains disks from an incompatible 
version
of ASM
======================================================
 
 
SQL> SQL> alter diskgroup DG_SCUT_ARCH1 mount;
alter diskgroup DG_SCUT_ARCH1 mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15203: diskgroup DG_SCUT_ARCH1 contains disks from an incompatible 
version
of ASM
======================================================
 
11) We need help from the Development team on 2 things:
 
11.1) Why did the 2 disks turn into FORMER?
 
11.2) Is there any option to fix the ASM header, since "kfed repair" is not 
working and the manual header-fix approach is also having issues?

Oracle bbed Block Browser /Editor


Oracle bbed is a utility which allows browsing and editing of disk data structures maintained by the Oracle RDBMS kernel. 

The targeted user of the block browser/editor is a person who is knowledgeable of these disk data structures and has an understanding of the dependencies between them.

 

BROWSE: read directly from disk, regardless of whether the instance is up or down.

EDIT: unrecoverable - all edits are written to disk immediately (write-through).

Blocks can be viewed and modified in hexadecimal for all file and block types supported by Oracle. When modifying data in this mode, no range or dependency checking is done.

 

The block editor is an externally invoked facility (similar to SQL*DBA or DBVERIFY) that supports a set of optional command-line parameters at invocation.

 

bbed [parameters]

where the parameters can consist of the following (defaults shown in brackets):

parameters:==

DATAFILE = filename

BLOCKSIZE = blocksize [2048]

MODE = BROWSE/EDIT

REVERT = y[es]/n[o]

SPOOL = y[es]/n[o]

LISTFILE = contains filenames, w/optional sizes

CMDFILE = bbed command filename

LOGFILE = log filename [log.bbd]

BIFILE = before-image filename [bifile.bbd]

PARFILE = parameter filename
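A typical invocation is driven from a parameter file. The sketch below creates a minimal one (the datafile path is illustrative; on Unix releases that ship bbed, the binary must first be relinked from $ORACLE_HOME/rdbms/lib):

```shell
# Create a minimal bbed parameter file (the datafile path is illustrative).
cat > /tmp/bbed.par <<'EOF'
DATAFILE=/u01/oradata/ORCL/users01.dbf
BLOCKSIZE=8192
MODE=BROWSE
LOGFILE=/tmp/bbed.log
EOF

# Then start the editor against it:
# bbed PARFILE=/tmp/bbed.par
```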

 

 

 

 

DBRECOVER for MS SQL Server

DBRECOVER For Oracle Version 2009


What's New?
 

Enhancements:

 

  1. Users can easily examine how many rows of a table are recoverable 
  2. DataBridge now supports deleted rows 
  3. Users can scan the database in dictionary mode
  4. Users can use extent mode to recover tables in dictionary mode
  5. The bootstrap objects can be loaded without a segment header

 

New features (translated):

Users can find out how many rows of a table are recoverable via the table-examination feature.

Rows removed by DELETE can be restored directly into the database via DataBridge.

All database objects can be scanned in dictionary mode, in order to recover dropped tables.

Tables can be scanned in extent mode while in dictionary mode, to handle corrupted segment headers.

The system bootstrap tables can be loaded in extent mode.

 

https://zcdn.parnassusdata.com/dbrecover-for-oracle2009.zip

 

 

 
