Archives October 2020

Failover simulation in dataguard

Failover is a emergency practice to avoid any point of failure,business impact due to loss of service and acts as a high availability.There is data loss from the point of failure

The issue occurs due to unexpected event like disaster,war,datacenter destruction etc

So we have two objectives : RTO(recovery time objective) and RPO(recovery point objecive)

RTO: how much application downtime can be compromised by business??

RPO: how much data loss can be compromised by business??

Based on individuals business objective,one can choose the redundancy and resiliency levels

We are going to simulate a failover event ourself in this article

Terms used:

Primary database : orcl –> failed old primary database –>New standby database

Standby database : orcldg –> old standby database –> New primary database

On primary:

Consider a critical financial application is connected to the database and loss of service would lead to loss of revenue

You have to somehow face the challenge when you get a surprise failure on the mission critical database and save the downtime but can tolerate some dataloss

simulate a failure on primary by aborting the instance.

Assume a hurricane damage on the primary datacenter XYZ location

On standby:

Standby database is located on ABC which is different region

Check the status of the standby database

Cancel the MRP and issue the command to failover services to standby as below

Alter database failover to "standby_database_name";

Then open the database for read write operations

Now old standby database became new primary

The log sequence number will be reset after a failover due to unexpected failure on the primary database and inconsistent recovery

Reinstate failed old primary database to new standby:

After the failover,the standby database acts as a primary and primary database acts as a standby after reinstation which is called role transition

Once the new primary(old standby) is up for application connectivity,we can reinstate the old primary database if flashback is enabled on failed primary database

If flashback is not enabled,then we need to rebuild the failed whole primary database from the beginning

Hence it is always recommended to enable flashback on primary database to be on safer side

On failed old primary database:

Get the last scn when primary database failed during disaster and mark the scn number

flashback the database using the marked scn from the previous step and convert the old failed primary to new physical standby database

Finally enable MRP and switch some logfiles on the new primary database(orcldg) to test the synchronization of scn

Note: If the dataguard protection is in max protection or availability mode before failure,then after failover event happen the protection mode become max performance !!

Switchover on dataguard

Switchover is an activity which acts as a role reversal with guaranteed protection to data.

Here primary database acts as a standby and standby database acts as a primary!!

We can perform switchover in situations like maintenance of primary database like patching,migration of a database from one server to another etc..

ON PRIMARY:


[oracle@orcl ~]$ sqlplus / as sysdba

Check the role of database and log shipping status.

Switchover status should be “To standby” and Database role should be “Primary

Check the redolog status which is “current

On primary check log sequence which are applied on standby

Check the archive destinations if they are valid !!

Create a restore point with guaranteed flashback before switchover to avoid any failures

Issue the following command to commit the primary to switchover to standby

Automatically the instance get down

Then startup database in nomount stage and then mount the database in standby mode

Then enable the MRP on new standby and check the status of switchover status on old primary

ON STANDBY:

Check the status of the log shipping on standby database whether primary and standby log sequence are in sync

Check the database role and switchover status of standby database and issue the following command –> commit to switchover to primary and shutdown the database

Then startup in mount stage and again check the status of the database role and switchover status of standby database.Database role should be primary on the new primary !!

switch the logfile multiple times on new primary (old standby)

Check the status of the log sequence in new standby and new primary !!!

ASM disks not visible during grid installation

Solution:

Check for the location of the asm disks on the mountpoint

[oracle@RACFLEX1 ~]$ locate FRA

/dev/disk/by-label/FRA
/dev/disk/by-label/FRA1
/dev/oracleasm/disks/FRA
/dev/oracleasm/disks/FRA1

[oracle@RACFLEX1 ~]$ cd /dev/oracleasm/disks/
[oracle@RACFLEX1 disks]$ ls -lrt
total 0
brw-rw---- 1 grid oinstall 8,  1 Oct 14 12:48 DATA
brw-rw---- 1 grid oinstall 8, 33 Oct 14 12:48 CRS
brw-rw---- 1 grid oinstall 8, 17 Oct 14 12:48 FRA
brw-rw---- 1 grid oinstall 8, 49 Oct 14 12:48 DATA1
brw-rw---- 1 grid oinstall 8, 65 Oct 14 12:48 FRA1
brw-rw---- 1 grid oinstall 8, 81 Oct 14 12:48 CRS1
brw-rw---- 1 grid oinstall 8, 97 Oct 14 12:48 CRS2

[oracle@RACFLEX1 disks]$ pwd
/dev/oracleasm/disks

Specify the same location on the “change discovery path”

ASM Disks are visible now cool!!!

This is a prerequisite condition to test whether the system kernel version is at least “2.6.39”. 
Error:
PRVF-7524 : Kernel version is not consistent across all the nodes. Kernel version = "3.8.13-68.3.4.el6uek.x86_64" found on nodes: racflex2. Kernel version = "3.8.13-118.48.1.el6uek.x86_64" found on nodes: racflex1.  - Cause:  The operating system kernel versions do not match across cluster nodes.  - Action:  Update the kernel version where necessary to have all cluster nodes running the same kernel version. 

  
Verification WARNING result on node: racflex1 
Expected Value
 : 2.6.39
Actual Value
 : 3.8.13-118.48.1.el6uek.x86_64
 
Verification WARNING result on node: racflex2 
Expected Value
 : 2.6.39
Actual Value
 : 3.8.13-68.3.4.el6uek.x86_64
 

Solution:

This error is a cause of bug.Ignore this error !!

[INS-30132] Initial setup required for the execution of installer validations failed on nodes: racflex2

Solution: Setup passwordless ssh or ssh equivalency on all the RAC nodes and give permissions to oracle user and oinstall group.Give rwx permissions to /tmp directory.

Refer this post for setting up user equivalence https://wordpress.com/block-editor/post/alphaoragroup.com/90

[root@RACFLEX2 /]# chmod -R 777 /tmp/
[root@RACFLEX2 /]# su - oracle
[oracle@RACFLEX2 ~]$
[oracle@RACFLEX2 ~]$
[oracle@RACFLEX2 ~]$ cd /
[oracle@RACFLEX2 /]$ ls -lrt /tmp/
total 32
drwxrwxrwx  3    501 oinstall 4096 Oct 13 13:36 OraInstall2020-10-13_01-15-27PM
drwxrwxrwx  3    501 oinstall 4096 Oct 13 15:40 OraInstall2020-10-13_03-36-40PM
drwxrwxrwx  3    501 oinstall 4096 Oct 13 17:45 OraInstall2020-10-13_05-42-34PM
drwxrwxrwx  3    501 oinstall 4096 Oct 13 21:43 OraInstall2020-10-13_09-41-04PM
drwxr-xr-x  3 oracle oinstall 4096 Oct 14 14:46 CVU_12.1.0.2.0_oracle
drwxr-xr-x  3 oracle oinstall 4096 Oct 14 14:46 OraInstall2020-10-14_01-46-46PM

Once we grant permission,we can solve this error during grid installation!!

prvf-4123-inconsistent-user-ids-found-for-user-oracle

Solution:

User id of oracle on node 1 mismatch with node 2.

I tried to add id 500 for oracle user on both nodes.I find 500 userid on node 1 and 501 userid on node 2!!

[oracle@RACFLEX1 grid]$ id oracle
uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba),502(asmadmin),503(asmdba),504(asmoper)

[root@RACFLEX2 oracle]# id oracle
uid=501(oracle) gid=500(oinstall) groups=500(oinstall),501(dba),502(asmadmin),503(asmdba),504(asmoper)
                                                                                             

I try to modify the id on node 2 as 500 but couldnot do that.

[root@RACFLEX2 oracle]# usermod -u 500 -g oinstall -G dba,asmdba,asmadmin,asmoper oracle                                                                                                             
usermod: UID '500' already exists
[root@RACFLEX2 oracle]# userdel oracle
userdel: user oracle is currently used by process 4626
[root@RACFLEX2 oracle]# kill -9 4626

Some other user is using the same id .I have to figure out which user?

[root@RACFLEX2 ~]# cat /etc/passwd|grep 500 
grid:x:500:500::/home/grid:/bin/bash
oracle:x:501:500::/home/oracle:/bin/bash

I check password file and found out grid user using the same id.But i dont want the grid user anymore.So i deleted grid user and readd oracle user with id 500

[root@RACFLEX2 ~]# userdel grid
[root@RACFLEX2 ~]# usermod -u 500 -g oinstall -G dba,asmdba,asmadmin,asmoper ora                                                                                                             cle
[root@RACFLEX2 ~]# id oracle
uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba),502(asmadmin),50                                                                                                             3(asmdba),504(asmoper)

[INS-30515] OCR disks size and count by redundancy

In oracle RAC,OCR file is accessed by clusterware to start the crsd agents and processes.

Have you wondered,how much size and how many disks should we allocate according to different redundancy??

External redundancy:

for external redundancy, we should atleast have diskspace of 5.5GB and 1 disk required for redundancy in OCR diskgroup

Normal redundancy:

For normal redundancy, we should have atleast 11GB diskspace and 3 disks required for redundancy in OCR diskgroup

High redundancy:

For high redundancy, we should have atleast 16.5GB diskspace and 5 disks required for redundancy in OCR diskgroup !!

keep in mind that for 12c grid version,we require below for OCR diskgroup configuration

external – minimum 5.5G disk space with 1 disk or more

normal – minimum 11G disk space with 3 disk or more

high – minimum 16.5G disk space with 5 disk or more

error: %pre() scriptlet failed, exit status 4error: install: %pre scriptlet failed (2), skipping cell-11.2.3.2.1_LINUX.X64_130109-1

Solution:

[root@EXDBADM001 cellbits]# rpm -qa|grep jdk

jdk-1.5.0_15-fcs.x86_64 <<<<<<<

There is a mismatch in format of rpm installation package

Create a macros file on the configuration directory and specify the correct format!! and retry rpm install with nodeps option. 
eg:rpm -Uvh xxx.rpm --nodeps

[root@EXDBADM001 cellbits]# cd ~/.

[root@EXDBADM001 cellbits]# cat ~/.rpmmacros
%_query_all_fmt %%{name}-%%{version}-%%{release}
11.2.0.4 runInstaller: [INS-06006] Passwordless SSH connectivity not set up between the following nodes(s)

Solution:

Edit /etc/hosts file and specify hostname of both nodes in uppercase and bounce the RAC nodes.
eg: RAC1 should be rac1
ORA-16038 – ORA-19809 – ORA-00312 – ORA-03113 – ORA-00257: archiver error

Archivelogs are filling up fast on your FRA.Soon there will be a space crunch on your disk unless you get rid of the archives from the disk.

This issue is a nightmare in case of very high transaction databases like telecom or banking applications.

ORA-16038: log 2 sequence# 113 cannot be archived
ORA-19809: limit exceeded for recovery files
ORA-00312: online log 2 thread 1: '/app01/base/oradata/orcl12x/redo02.log'
Sun Oct 11 04:21:13 2020

I get the above errors on my alert logfile

We can increase the FRA to a high value and take a backup of all archivelog files with delete input

I couldnot open my database due to the space crunch and redologs couldnot be archived but i can mount it successfully

SQL> startup;
ORACLE instance started.

Total System Global Area 1644167168 bytes
Fixed Size                  2925024 bytes
Variable Size             973082144 bytes
Database Buffers          654311424 bytes
Redo Buffers               13848576 bytes
Database mounted.
ORA-03113: end-of-file on communication channel
Process ID: 3906
Session ID: 1 Serial number: 7537

When i checked the files that occupy lot of space,i can find backuppiece and archivelog files filling up space

ARCHIVED LOG                      53655639                         0
            115          0


FILE_TYPE               PERCENT_SPACE_USED PERCENT_SPACE_RECLAIMABLE
----------------------- ------------------ -------------------------
NUMBER_OF_FILES     CON_ID
--------------- ----------
BACKUP PIECE                    19701765.1                         0
              5          0

i removed obsolete backups and deleted old backuppieces to release some space

RMAN> delete noprompt obsolete;

RMAN retention policy will be applied to the command
RMAN retention policy is set to redundancy 1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=7 device type=DISK
Deleting the following obsolete backups and copies:
Type                 Key    Completion Time    Filename/Handle
-------------------- ------ ------------------ --------------------
Backup Set           4      29-SEP-20
  Backup Piece       4      29-SEP-20          /home/oracle/FRA/ORCL12X/backupset/2020_09_29/o1_mf_annnn_TAG20200929T070113_hq64zsf0_.bkp

FRA size is also not sufficient to hold the archives

In logs i can see lot of ora errors related to archive

ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim 50577408 bytes disk space from 10000 limit
Sun Oct 11 04:21:13 2020
ARC1: Error 19809 Creating archive log file to '/home/oracle/FRA/ORCL12X/archivelog/2020_10_11/o1_mf_1_114_%u_.arc'
Sun Oct 11 04:21:13 2020
Errors in file /app01/base/diag/rdbms/orcl12x/orcl12x/trace/orcl12x_ora_4441.trc:
ORA-16038: log 2 sequence# 113 cannot be archived
ORA-19809: limit exceeded for recovery files
ORA-00312: online log 2 thread 1: '/app01/base/oradata/orcl12x/redo02.log'

SQL> show parameter db_recovery

NAME                                 TYPE
------------------------------------ --------------------------------
VALUE
------------------------------
db_recovery_file_dest                string
/data01
db_recovery_file_dest_size           big integer
7000M
SQL> alter system set db_recovery_file_dest_size=10000;

System altered.

I couldnot take archive backup as well .So i have to increase my FRA and change the destination to a mountpoint with more space than current FRA directory

RMAN-03009: failure of backup command on ORA_DISK_1 channel at 10/11/2020 04:53:35
ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim 2143025664 bytes disk space from 8388608000 limit

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   11G  7.2G  61% /
tmpfs           2.0G  1.1G  953M  52% /dev/shm
/dev/sda2        15G   11G  3.5G  75% /app01
/dev/sda5        12G  6.7G  4.2G  62% /data01

SQL> alter system set db_recovery_file_dest_size=12000m;

System altered.

SQL> show parameter db_recovery

NAME                                 TYPE
------------------------------------ --------------------------------
VALUE
------------------------------
db_recovery_file_dest                string
/home/oracle/FRA
db_recovery_file_dest_size           big integer
12000M

So i changed mountpoint for FRA with more seats and increased the FRA size

After increasing the FRA size,archive backup was successful and issue resolved !!!

RMAN> crosscheck archivelog all;

released channel: ORA_DISK_1
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=48 device type=DISK
validation succeeded for archived log
archived log file name=/home/oracle/FRA/ORCL12X/archivelog/2020_10_11/o1_mf_1_151_hr5ljhlb_.arc RECID=160 STAMP=1053493328
validation succeeded for archived log

RMAN> backup archivelog all delete input;

Starting backup at 11-OCT-20
current log archived
using target database control file instead of recovery catalog
..........................
Starting Control File and SPFILE Autobackup at 11-OCT-20
piece handle=/home/oracle/FRA/ORCL12X/autobackup/2020_10_11/o1_mf_s_1053493634_hr5lt4nt_.bkp comment=NONE
Finished Control File and SPFILE Autobackup at 11-OCT-20