RAC node eviction – IPC Send timeout detected

In oracle RAC, node eviction is one of the major issues which has to occur to maintain consistency of data.

They occur due to following reasons

  • lack of network heartbeat
  • lack of disk heartbeat
  • resource starvation (CPU,memory and IO)
  • due to bug
  • manual kill of critical background processes like css,cssdagent and cssdmonitor etc…

IPC timeout error occured in one of the instance due to lack of network heartbeat from one of the nodes.

There is IPC send error from node 1 where node 2 didnot receive acknowledgement after specific timeout.Hence node 1 was evicted from the cluster to prevent the independent IO writes to node 1 itself without the intervention or coordination of node 2.This process can be called as IO fencing or split brain syndrome

In this case, LMS process which is responsible for transfering the datablocks between RAC instance as a result of cache fusion. So it is very important to allocate the LMS process before building a RAC setup based on the number of database instances on the cluster. Number of LMS process should always be less than number of CPU cores. Below initialization parameter indirectly determines the number of LMS process. And if you observe the tracefile one below ,lot of IPC communication errors occured due to network heartbeat which was caused by resource starvation of CPU due to LMS process.

There are three LMS process in this RAC database which consumed heavy CPU due to high cluster waits with enqueue SQ contention

If you closely observe the logs,it clearly shows that 52 parallel servers which generated tracefile exdbx1_p052_15699.trc marked with * which was the starting point of the IPC error. This shows that when there was high CPU consumption from one of the batch jobs, there were resource starvation for other users or jobs who need it! Post that you can see lot of lms tracefiles with 15153 process id which all show cluster wait stacks

SQL> show parameter GCS

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
gcs_server_processes                 integer     0
IPC Send timeout detected. Receiver ospid 15699 [
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_p052_15699.trc: *********************
2021-03-22 13:44:11.710000 -07:00
IPC Send timeout detected. Receiver ospid 15153 [
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_lms2_15153.trc:
IPC Send timeout detected. Receiver ospid 15153 [
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_lms2_15153.trc:
IPC Send timeout detected. Receiver ospid 15153 [
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_lms2_15153.trc:
IPC Send timeout detected. Receiver ospid 15153 [
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_lms2_15153.trc:
IPC Send timeout detected. Receiver ospid 15153 [
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_lms2_15153.trc:
2021-03-22 13:44:12.712000 -07:00

......
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_lms2_15153.trc:
2021-03-22 13:44:25.696000 -07:00
IPC Send timeout detected. Receiver ospid 15153 [
Errors in file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_lms2_15153.trc:
2021-03-22 13:44:31.182000 -07:00
LMS2 (ospid: 15153) has detected no messaging activity from instance 2
LMS2 (ospid: 15153) has detected no messaging activity from instance 3
LMS2 (ospid: 15153) issues an IMR to resolve the situation
LMS2 (ospid: 15153) issues an IMR to resolve the situation
Please check LMS2 trace file for more detail.
2021-03-22 13:45:14.331000 -07:00
Instance termination initiated by instance 2 with reason 1.
Instance 2 received a reconfig event from its cluster manager indicating that this instance is supposed to be down
Please check instance 2's alert log and LMON trace file for more details.
Please also examine the CSS log files.
LMON (ospid: 15134): terminating the instance due to error 481
System state dump requested by (instance=1, osid=15134 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /opt/app/oracle/diag/rdbms/exdbx/exdbx1/trace/exdbx1_diag_15103_20210322134514.trc
2021-03-22 13:45:15.407000 -07:00
License high water mark = 309
2021-03-22 13:45:19.368000 -07:00
Instance terminated by LMON, pid = 15134
USER (ospid: 809): terminating the instance
Instance terminated by USER, pid = 809
2021-03-22 13:45:25.236000 -07:00
Starting ORACLE instance (normal)
2021-03-22 13:45:29.238000 -07:00
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 48
Number of processor cores in the system is 6
Number of processor sockets in the system is 1
2021-03-22 13:45:29.419:
[USER(3772)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error).
kggpnpInit: failed to init gpnp
  WARNING: No cluster interconnect has been specified. Depending on
           the communication driver configured Oracle cluster traffic
           may be directed to the public interface of this machine.
           Oracle recommends that RAC clustered databases be configured
           with a private interconnect for enhanced security and
           performance.

Top wait events observed moments before the node eviction from node 1 .You can see the top most wait event is enq: SQ – contention with high wait time followed by CPU usage

SQL> select
   nvl(event, 'on cpu') as event,
   count(*)               as total_wait_time
from
   dba_hist_active_sess_history
where
   to_char(sample_time,'dd-mm-yy hh24:mi:ss') like '%22-03-21 13:%'
group by
   event
order by
   total_wait_time desc;  2    3    4    5    6    7    8    9   10   11

EVENT                                                                           TOTAL_WAIT_TIME
---------------------------------------------------------------- ------------------------------
enq: SQ - contention                                                                       5364
on cpu                                                                                     3381
gc cr failure                                                                              2106
gc cr request                                                                              1889
gc buffer busy acquire                                                                     1597

Top wait events observed moments before the node eviction from node 2 . Same events can be observed from other node which send network heartbeat requests to the evicted node

SQL> SQL> SQL> SQL> SQL> SQL> select
   nvl(event, 'on cpu') as event,
   count(*)               as total_wait_time
from
   dba_hist_active_sess_history
where
   to_char(sample_time,'dd-mm-yy hh24:mi:ss') like '%22-03-21 13:44%'
group by
   event
order by
   total_wait_time desc;  2    3    4    5    6    7    8    9   10   11


EVENT                                                                           TOTAL_WAIT_TIME
---------------------------------------------------------------- ------------------------------
enq: SQ - contention                                                                        841
gc cr failure                                                                               610
gc buffer busy acquire                                                                      349
gc current retry                                                                            244
row cache lock                                                                              207

Alertlog shows that node 1 didnot acknowledge node 2 until certain time and post the timeout all the other nodes killed the sick node (node 1) from the cluster

IPC Send timeout to 1.3 inc 84 for msg type 36 from opid 75
2021-03-22 13:45:14.277000 -07:00
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Dumping diagnostic data in directory=[cdmp_20210322134514], requested by (instance=1, osid=15134 (LMON)), summary=[abnormal instance termination].
2021-03-22 13:45:20.038000 -07:00
Reconfiguration started (old inc 84, new inc 88)
List of instances:
 2 3 (myinst: 2)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 * domain 0 not valid according to instance 3
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 0: 22 GCS shadows cancelled, 6 closed, 0 Xw survived
 LMS 1: 23 GCS shadows cancelled, 2 closed, 0 Xw survived
 LMS 2: 104 GCS shadows cancelled, 17 closed, 0 Xw survived
2021-03-22 13:45:22.037000 -07:00
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Instance recovery: looking for dead threads
minact-scn: Inst 2 is now the master inc#:88 mmon proc-id:17544 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0014.109f659b gcalc-scn:0x0014.10a90896
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:88 new-inc#:88
Beginning instance recovery of 1 threads
2021-03-22 13:45:24.375000 -07:00
IPC Send timeout detected. Sender: ospid 9568 [oracle@exdbadm02]
Receiver: inst 1 binc -450541871 ospid 15153
2021-03-22 13:45:25.754000 -07:00
 Submitted all GCS remote-cache requests
2021-03-22 13:45:26.913000 -07:00
 Post SMON to start 1st pass IR
 Fix write in gcs resources
2021-03-22 13:45:28.203000 -07:00
Reconfiguration complete
2021-03-22 13:45:30.385000 -07:00
 parallel recovery started with 32 processes
Started redo scan

After digging to trace dump , we found that one of the SQL query which had a high value sequence was not cached in memory and causing cluster waits due to high CPU used by LMS.

Some of the frames in trace shows error

dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst1()+124        CALL     skdstdst()           FFFFFFFF7FFF7F20 ?
                                                   000000003 ? 1021E2AA0 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFCF290 ?
                                                   000000000 ?
ksedst()+52          CALL     ksedst1()            00010D800 ? 00010D800 ?
                                                   10DB4B000 ? 00010DB53 ?
                                                   10DB4B8F0 ? 10DB5386C ?
dbkedDefDump()+1984  CALL     ksedst()             000000001 ? 10DB76000 ?
                                                   00010DB76 ? 10DB53000 ?
                                                   00010D800 ? 00010DB53 ?
ksdxdmp()+2132       PTR_CALL dbkedDefDump()       000000004 ? 00010DB76 ?
                                                   000000001 ? 000000001 ?
                                                   000000001 ? 00010D800 ?
ksdxfdmp()+116       CALL     ksdxdmp()            3800055C0 ? 000000001 ?
                                                   100FE2758 ? 000100800 ?
                                                   00010DB4B ? 000101C00 ?
ksdxcb()+3592        PTR_CALL ksdxfdmp()           FFFFFFFF7FFF9158 ?
                                                   000000011 ? 000000003 ?
                                                   FFFFFFFF7FFF9018 ?
                                                   FFFFFFFF7FFF8F78 ?
                                                   000000001 ?
sspuser()+140        PTR_CALL ksdxcb()             000000026 ? 10DB4BE00 ?
                                                   000000008 ? 000000003 ?
                                                   1A10C18268 ? 000101FE4 ?
__sighndlr()+12      PTR_CALL sspuser()            000000000 ? 000000001 ?
                                                   10DB4B634 ?
                                                   FFFFFFFF7FFF9400 ?
                                                   10DB4B000 ? 00010DB4B ?
call_user_handler()  CALL     __sighndlr()         000000011 ?
+868                                               FFFFFFFF7FFF9A20 ?
                                                   FFFFFFFF7FFF9720 ?
                                                   107196320 ? 000000000 ?
                                                   FFFFFFFF7A48A000 ?
sigacthandler()+92   CALL     call_user_handler()  FFFFFFFF7D102A40 ?
                                                   FFFFFFFF7FFF9A20 ?
                                                   09FBEF057 ? 000000000 ?
                                                   000000000 ? 000000011 ?
__pollsys()+8        PTR_CALL 0000000000000000     000000011 ?
                                                   FFFFFFFF7FFF9A20 ?
                                                   FFFFFFFF7FFF9720 ?
                                                   00014DC94 ?
                                                   FFFFFFFF7A48A000 ?
                                                   000000011 ?
_pollsys()+232       FRM_LESS __pollsys()          00000005B ? 000000004 ?
                                                   FFFFFFFF7FFF9C80 ?
                                                   000000000 ? 00000000B ?
                                                   00000000B ?
_poll()+140          CALL     _pollsys()           FFFFFFFF7FFF9DF0 ?
                                                   000000004 ?
                                                   FFFFFFFF7FFF9C80 ?
                                                   000000000 ? 000000001 ?
                                                   000000000 ?
ssskgxp_poll()+36    PTR_CALL _poll()              FFFFFFFF7FFF9DF0 ?
                                                   000000004 ? 000F42400 ?
                                                   00003D090 ? 000000010 ?
                                                   000000010 ?
sskgxp_selectex()+2  CALL     ssskgxp_poll()       FFFFFFFF7FFFB380 ?
24                                                 FFFFFFFF78B28A30 ?
                                                   FFFFFFFF7FFF9DF0 ?
                                                   000000004 ? 000000010 ?
                                                   10DB481C0 ?
skgxpiwait()+6456    CALL     sskgxp_selectex()    FFFFFFFF7FFFB380 ?
                                                   000000018 ?
                                                   FFFFFFFF7FFFA538 ?
                                                   000000004 ? 000000010 ?
                                                   000000000 ?
skgxpwaiti()+5044    CALL     skgxpiwait()         000000001 ?
                                                   FFFFFFFF78B28A30 ?
                                                   FFFFFFFF7D30C0E0 ?
                                                   000000010 ?
                                                   FFFFFFFF78B29B38 ?
                                                   FFFFFFFF78B29B88 ?
skgxpwait()+984      CALL     skgxpwaiti()         000000001 ?
                                                   FFFFFFFF78B28A30 ?
                                                   000001380 ?
                                                   FFFFFFFF7FFFB458 ?
                                                   00000001E ? 000000000 ?
ksxpwait()+3360      PTR_CALL skgxpwait()          FFFFFFFF7FFFB380 ?
                                                   FFFFFFFF78B28A30 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFB458 ?
                                                   00000001E ? 000000000 ?
ksliwat()+10676      CALL     ksxpwait()           00000001E ? 10DB50180 ?
                                                   FFFFFFFF7FFFBB34 ?
                                                   0000013E0 ? 0000013C8 ?
                                                   000000000 ?
kslwait()+240        CALL     ksliwat()            000000000 ? 1A58FB73D0 ?
                                                   1A10C183C0 ? 000000001 ?
                                                   1A10C18268 ? 10DB4BE00 ?
ksxprcv_int()+1420   CALL     kslwait()            000000003 ?
                                                   FFFFFFFF7FFFBC18 ?
                                                   1A58FB73D0 ? 1A58FB73D0 ?
                                                   000000000 ? 000000000 ?
ksxprcvimd()+80      CALL     ksxprcv_int()        FFFFFFFF7889FC98 ?
                                                   000000002 ? 0001009E7 ?
                                                   10DB4BE00 ?
                                                   FFFFFFFF7FFFC90C ?
                                                   0000014C0 ?
kjctr_rksxp()+272    CALL     ksxprcvimd()         000000000 ? 00000001E ?
                                                   00000001E ? 000000000 ?
                                                   000000000 ? 00000004C ?
kjctrcv()+316        CALL     kjctr_rksxp()        000000010 ? 1A3FF4DA60 ?
                                                   000000000 ? 00000000A ?
                                                   000030000 ? 000000000 ?
kjcsrmg()+88         CALL     kjctrcv()            FFFFFFFF7889FB60 ?
                                                   1A3FF4DA60 ?
                                                   FFFFFFFF78B247A8 ?
                                                   000000001 ? 00000001E ?
                                                   00000004C ?
kjmsm()+7128         CALL     kjcsrmg()            FFFFFFFF78B247A8 ?
                                                   00010DB59 ? 00000001E ?
                                                   00000004C ? 000000000 ?
                                                   000000001 ?
ksbrdp()+1720        PTR_CALL kjmsm()              380015000 ? 000000000 ?
                                                   000000000 ? 000000000 ?

opirip()+1680        CALL     ksbrdp()             000000001 ? 000001768 ?
                                                   380017000 ? 380014F20 ?
                                                   100FD7398 ? 000380000 ?
opidrv()+748         CALL     opirip()             000004D28 ? 10DB4BE00 ?
                                                   0001020AA ? 38000A2E8 ?
                                                   10DB72000 ? 10DB72000 ?
sou2o()+88           CALL     opidrv()             10DB50000 ? 000000000 ?
                                                   10DB6E098 ? 000000032 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFE2A0 ?
opimai_real()+512    CALL     sou2o()              FFFFFFFF7FFFE278 ?
                                                   000000032 ? 000000004 ?
                                                   FFFFFFFF7FFFE2A0 ?
                                                   10DB72C00 ? 00010D800 ?
ssthrdmain()+324     PTR_CALL opimai_real()        FFFFFFFF7FFFEAE2 ?
                                                   FFFFFFFF7FFFEA8B ?
                                                   FFFFFFFF7DC4B650 ?
                                                   FFFFFFFF7DC4B650 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFE218 ?
main()+316           CALL     ssthrdmain()         00010D800 ? 00010DB7E ?
                                                   10DB7E000 ? 000000003 ?
                                                   00010DB7E ? 10DD2BD60 ?
_start()+380         CALL     main()               000000003 ? 000000002 ?
                                                   000000000 ?
                                                   FFFFFFFF7FFFE530 ?
                                                   FFFFFFFF7FFFE640 ?
                                                   000002800 ?

========== FRAME [4] (ksdxdmp()+2132 -> dbkedDefDump()) ==========
defined by frame pointers 0xffffffff7fff84d0  and 0xffffffff7fff8110
CALL TYPE: PTR_CALL   ERROR SIGNALED: yes   COMPONENT: VOS
%l0 0000000101DCB000 %l1 0000000000100800 %l2 000000010DB76000
%l3 000000010DB76000 %l4 000000000010DB53 %l5 0000000000000008
%l6 0000000000000000 %l7 000000000010D800 %i0 0000000000000004
%i1 000000000010DB76 %i2 0000000000000001 %i3 0000000000000001
%i4 0000000000000001 %i5 000000000010D800 %fp FFFFFFFF7FFF7CD1
rtn-pc 0000000108899694 argd FFFFFFFF7FFF8190 stret FFFFFFFF7FFF8190
xtraarg FFFFFFFF7FFF7D81 locals FFFFFFFF7FFF81C8
Dump of memory from 0xffffffff7fff8110 to 0xffffffff7fff84d0
FFFFFFFF7FFF8110 00000001 01DCB000 00000000 00100800  [................]
FFFFFFFF7FFF8120 00000001 0DB76000 00000001 0DB76000  [......`.......`.]
FFFFFFFF7FFF8130 00000000 0010DB53 00000000 00000008  [.......S........]
FFFFFFFF7FFF8140 00000000 00000000 00000000 0010D800  [................]
FFFFFFFF7FFF8150 00000000 00000004 00000000 0010DB76  [...............v]
FFFFFFFF7FFF8160 00000000 00000001 00000000 00000001  [................]
FFFFFFFF7FFF8170 00000000 00000001 00000000 0010D800  [................]
FFFFFFFF7FFF8180 FFFFFFFF 7FFF7CD1 00000001 08899694  [......|.........]
FFFFFFFF7FFF8190 FFFFFFFF 7FFF8F90 FFFFFFFF 7FFF9030  [...............0]
FFFFFFFF7FFF81A0 00000000 00000000 00000001 01774970  [.............wIp]
FFFFFFFF7FFF81B0 FFFFFFFF 7FFF7AFC 00000000 00100E2B  [......z........+]
FFFFFFFF7FFF81C0 00000001 00E2B40C 00000000 00000000  [................]
FFFFFFFF7FFF81D0 00000001 01DCB938 00000000 00000000  [.......8........]
FFFFFFFF7FFF81E0 00000000 00000013 00000000 00000001  [................]
FFFFFFFF7FFF81F0 00000000 00000014 00000000 00000000  [................]
FFFFFFFF7FFF8200 00000000 00000000 00000000 00000000  [................]
FFFFFFFF7FFF8210 00000000 00000000 00000001 0DB4BE00  [................]
FFFFFFFF7FFF8220 00000001 0DB4BE00 00000001 0DD37E40  [..............~@]
FFFFFFFF7FFF8230 00000000 00000001 00000000 00000001  [................]
FFFFFFFF7FFF8240 FFFFFFFF 7FFF7C41 00000001 07EB78F0  [......|A......x.]
FFFFFFFF7FFF8250 FFFFFFFF 7FFF825D 00000001 3100AD78  [.......]....1..x]
FFFFFFFF7FFF8260 00000001 0010DB76 00000001 0DB76000  [.......v......`.]
FFFFFFFF7FFF8270 FFFFFFFF 7FFF8588 FFFFFFFF 7FFF82D8  [................]
FFFFFFFF7FFF8280 00000000 00000000 00000000 00000001  [................]
FFFFFFFF7FFF8290 00000008 0DB57000 FFFFFFFF 7FFF82AF  [......p.........]
FFFFFFFF7FFF82A0 00000001 01EA54E6 00000001 0DB4BE00  [......T.........]
FFFFFFFF7FFF82B0 00000001 0DB418D0 00000000 00000004  [................]
FFFFFFFF7FFF82C0 00000000 00101400 00000000 00000000  [................]
FFFFFFFF7FFF82D0 00000000 00000000 00000000 0000001D  [................]
FFFFFFFF7FFF82E0 FFFFFFFF 7FFF7C61 00000001 00000000  [......|a........]
FFFFFFFF7FFF82F0 00026124 E595716A 00000000 00000002  [..a$..qj........]
FFFFFFFF7FFF8300 00000001 01774990 00000001 01774BF0  [.....wI......wK.]


========== FRAME [10] (sigacthandler()+92 -> call_user_handler()) ==========
defined by frame pointers 0xffffffff7fff95b0  and 0xffffffff7fff94c0
CALL TYPE: CALL   ERROR SIGNALED: yes   COMPONENT: (null)
%l0 000000000000000C %l1 00000000000000FF %l2 00000000000000FF
%l3 000000000000000C %l4 00000000FFFFFFF7 %l5 FFFFFFFF7FFF9720
%l6 000000009FBEF057 %l7 0000000000000000 %i0 FFFFFFFF7D102A40
%i1 FFFFFFFF7FFF9A20 %i2 000000009FBEF057 %i3 0000000000000000
%i4 0000000000000000 %i5 0000000000000011 %fp FFFFFFFF7FFF8DB1
rtn-pc FFFFFFFF7A33C3C4 argd FFFFFFFF7FFF9540 stret FFFFFFFF7FFF9540
xtraarg FFFFFFFF7FFF8E61 locals FFFFFFFF7FFF9578
Dump of memory from 0xffffffff7fff94c0 to 0xffffffff7fff9630
FFFFFFFF7FFF94C0 00000000 0000000C 00000000 000000FF  [................]
FFFFFFFF7FFF94D0 00000000 000000FF 00000000 0000000C  [................]
FFFFFFFF7FFF94E0 00000000 FFFFFFF7 FFFFFFFF 7FFF9720  [............... ]
FFFFFFFF7FFF94F0 00000000 9FBEF057 00000000 00000000  [.......W........]
FFFFFFFF7FFF9500 FFFFFFFF 7D102A40 FFFFFFFF 7FFF9A20  [....}.*@....... ]
FFFFFFFF7FFF9510 00000000 9FBEF057 00000000 00000000  [.......W........]
FFFFFFFF7FFF9520 00000000 00000000 00000000 00000011  [................]
FFFFFFFF7FFF9530 FFFFFFFF 7FFF8DB1 FFFFFFFF 7A33C3C4  [............z3..]
FFFFFFFF7FFF9540 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
FFFFFFFF7FFF9570 FFFFFFFF 7A48A000 FFFFFFFF 7FFF95B0  [....zH..........]
FFFFFFFF7FFF9580 0000000C 2F6F7074 00000001 07196320  [..../opt......c ]
FFFFFFFF7FFF9590 9FBFF057 FFFFFFF7 000000FF 00000000  [...W............]
FFFFFFFF7FFF95A0 FFFFFFFF 7FFF9720 00000000 00000011  [....... ........]
FFFFFFFF7FFF95B0 FFFFFFFF 7D102A40 00000000 001E0400  [....}.*@........]
FFFFFFFF7FFF95C0 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
FFFFFFFF7FFF95F0 00000000 00000011 FFFFFFFF 7FFF9A20  [............... ]
FFFFFFFF7FFF9600 FFFFFFFF 7FFF9720 00000000 0014DC94  [....... ........]
FFFFFFFF7FFF9610 FFFFFFFF 7A48A000 00000000 00000011  [....zH..........]
FFFFFFFF7FFF9620 FFFFFFFF 7FFF8E61 FFFFFFFF 7A33A528  [.......a....z3.(]

========== FRAME [12] (_pollsys()+232 -> __pollsys()) ==========
defined by frame pointers 0xffffffff7fff9b20  and 0xffffffff7fff7ca8
CALL TYPE: FRM_LESS   ERROR SIGNALED: yes   COMPONENT: (null)
%l0 0000000000000000 %l1 0000000000000000 %l2 0000000000000000
%l3 0000000000000000 %l4 0000000000000000 %l5 0000000000000000
%l6 0000000000000000 %l7 0000000000000000 %i0 000000000000005B
%i1 0000000000000004 %i2 FFFFFFFF7FFF9C80 %i3 0000000000000000
%i4 FFFFFFFF7D3102D0 %i5 0000000000008000 %fp FFFFFFFF7FFF9321
rtn-pc FFFFFFFF7A33A528 argd FFFFFFFF7FFF7D28 stret FFFFFFFF7FFF7D28
xtraarg FFFFFFFF7FFF93D1 locals FFFFFFFF7FFF7D60
Dump of memory from 0xffffffff7fff9720 to 0xffffffff7fff9b20
FFFFFFFF7FFF9720 0000002F 00000000 00000000 00000000  [.../............]
FFFFFFFF7FFF9730 00000000 00000000 00000000 00000000  [................]
FFFFFFFF7FFF9740 FFFFFFFF 7E000000 00000000 02000000  [....~...........]
FFFFFFFF7FFF9750 00000000 00000000 00000001 000054EC  [..............T.]
FFFFFFFF7FFF9760 00000000 00000001 FFFFFFFF 7A34E38C  [............z4..]
FFFFFFFF7FFF9770 FFFFFFFF 7A34E390 00000000 00000000  [....z4..........]
FFFFFFFF7FFF9780 00000000 000000B7 00000000 00000004  [................]
FFFFFFFF7FFF9790 00000000 00000000 00000000 00007A10  [..............z.]
FFFFFFFF7FFF97A0 00000000 000003F0 00000000 00000000  [................]
FFFFFFFF7FFF97B0 FFFFFFFF 7D102A40 00000000 0000005B  [....}.*@.......[]
FFFFFFFF7FFF97C0 00000000 00000004 FFFFFFFF 7FFF9C80  [................]
FFFFFFFF7FFF97D0 00000000 00000000 FFFFFFFF 7D3102D0  [............}1..]
FFFFFFFF7FFF97E0 00000000 00008000 FFFFFFFF 7FFF9321  [...............!]
FFFFFFFF7FFF97F0 FFFFFFFF 7A33A528 00000000 00000082  [....z3.(........]
FFFFFFFF7FFF9800 00000000 00000007 00000000 00000000  [................]
FFFFFFFF7FFF9810 000000FF 7DA1D882 000000FF 00000000  [....}...........]
FFFFFFFF7FFF9820 3FEFE000 00000000 40700000 00000000  [?.......@p......]
FFFFFFFF7FFF9830 3FEFE000 00000000 41E00000 00000000  [?.......A.......]
FFFFFFFF7FFF9840 406FE000 00000000 41741E52 B0000000  [@o......At.R....]
FFFFFFFF7FFF9850 43741E52 B0000000 41EDBDD0 CB000000  [Ct.R....A.......]
FFFFFFFF7FFF9860 3E7AD7F2 9ABCAF48 43741E52 BEDEE866  [>z.....HCt.R...f]
FFFFFFFF7FFF9870 43741E4F E0000000 41EA442E 25200000  [Ct.O....A.D.% ..]
FFFFFFFF7FFF9880 3E7AD7F2 9ABCAF48 43741E4F ED221713  [>z.....HCt.O."..]
FFFFFFFF7FFF9890 52455345 544C4F47 535F5343 4E02C103  [RESETLOGS_SCN...]
FFFFFFFF7FFF98A0 02C11702 C103FFFF 0180FFFF 02C11001  [................]
FFFFFFFF7FFF98B0 80018001 80018001 8001806C 00140002  [...........l....]
FFFFFFFF7FFF98C0 C10F02C1 0F02C117 01800B53 4F555243  [...........SOURC]
FFFFFFFF7FFF98D0 455F4442 494402C1 0302C117 02C103FF  [E_DBID..........]
FFFFFFFF7FFF98E0 FF0180FF FF02C10F 01800180 01800180  [................]
FFFFFFFF7FFF98F0 01800180 AC000100 16001600 4057AE00  [............@W..]
FFFFFFFF7FFF9900 09004057 AE000904 C3020F1E 85170601  [..@W............]
FFFFFFFF7FFF9910 00000000 00000000 00000001 00000420  [............... ]
        Repeat 1 times
FFFFFFFF7FFF9950 00000000 00000000 00000000 00000004  [................]
FFFFFFFF7FFF9960 00000000 00000000 00000000 00000000  [................]
        Repeat 5 times
FFFFFFFF7FFF99C0 33453430 00000000 FFFFFFFF 7FFF9A00  [3E40............]
FFFFFFFF7FFF99D0 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
FFFFFFFF7FFF9A00 00000000 00000000 00000000 00000800  [................]
FFFFFFFF7FFF9A10 00000000 00000000 00000000 00000000  [................]
FFFFFFFF7FFF9A20 00000011 00000000 00000000 00000000  [................]
FFFFFFFF7FFF9A30 00003AFF 00000000 00000064 00000000  [..:........d....]
FFFFFFFF7FFF9A40 00000000 00000000 00000000 00000000  [................]
FFFFFFFF7FFF9A50 0000009C 00000000 00000000 00000000  [................]
FFFFFFFF7FFF9A60 00000000 00000000 00000000 00000000  [................]
  Repeat 11 times


 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
2021-03-22 13:45:45.351000 -07:00
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
2021-03-22 13:45:47.111000 -07:00
Reconfiguration started (old inc 90, new inc 92)
List of instances:
 1 2 3 (myinst: 2)
 Nested reconfiguration detected.
 Global Resource Directory frozen
 Communication channels reestablished
 * domain 0 valid = 1 according to instance 1
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
2021-03-22 13:45:49.430000 -07:00
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
2021-03-22 13:45:58.046000 -07:00
 Submitted all GCS remote-cache requests
2021-03-22 13:46:01.793000 -07:00
 Fix write in gcs resources
2021-03-22 13:46:02.967000 -07:00
Reconfiguration complete
2021-03-22 13:46:05.641000 -07:00
Completed instance recovery at
 Thread 1: logseq 68169, block 1628567, scn 86178879969
 115245 data blocks read, 206718 data blocks written, 344656 redo k-bytes read
2021-03-22 13:46:08.779000 -07:00
Thread 1 advanced to log sequence 68170 (thread recovery)
Redo thread 1 internally disabled at seq 68170 (SMON)
2021-03-22 13:46:10.955000 -07:00
minact-scn: master continuing after IR
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:92 errcnt:0
2021-03-22 13:46:12.354000 -07:00
Archived Log entry 250876 added for thread 1 sequence 68169 ID 0xffffffffcabb04d4 dest 1:
ARC3: Archiving disabled thread 1 sequence 68170
Archived Log entry 250877 added for thread 1 sequence 68170 ID 0xffffffffcabb04d4 dest 1:
SERVER COMPONENT id=UTLRP_BGN

Resolution:

The sequence query looked like below in the alertlog. There were multiple sequences observed with millions of records performing block transfer using GCS

select price_seq.NEXTVAL 
  INTO :B1.totprice 
FROM dual;

Generic resolution for this issue would be to distribute the non critical jobs which are resource intensive during offpeak hours.

( or )

Check the top 5 foreground wait events and analyse each event individually.

In my case ,i should have advised the application guy to use optimum parallel features during peak hours of business to save some CPU for concurrent incoming user requests and i found enq: SQ – contention as the top wait contributor

For the sequence contention ,it is better to cache the high value sequencial queries to buffer cache instead of performing IO from disk in RAC. Because, if remote node requests a block from local node which is unavailable in both master node and slave node memory due to LRU function, then round trip time count will be higher.

Check the statistics for enqueue SQ for sequence using below query. If you find a significant value in the output list of below query on wait time, then it is good to cache the sequences in memory(SGA) for faster access

kish@exdbx<>select * from gv$enqueue_stat where EQ_TYPE='SQ';

 INST_ID   EQ_TYPE         TOTAL_REQ# TOTAL_WAIT# SUCC_REQ#  FAILED_REQ# CUM_WAIT_TIME
---------- --------------- ---------- ----------- ---------- ----------- -------------
         1 SQ                       4           0          4           0             0
         2 SQ                       7           2          7           0            40

Check the sequence sequence name and cache size it accomodate in memory

kish@exdbx<>select sequence_name,cache_size from user_sequences;

SEQUENCE_NAME   CACHE_SIZE
--------------- ----------
APPLY$_DEST_OBJ          0
_ID

APPLY$_ERROR_HA         20
NDLER_SEQUENCE

APPLY$_SOURCE_O          0
BJ_ID

AQ$_ALERT_QT_N          20
AQ$_AQ$_MEM_MC_         20
N

Check the sequence name and its cache size

kish@exdbx<>select sequence_name,cache_size from user_sequences where sequence_name='PRICE_SEQ';

SEQUENCE_NAME   CACHE_SIZE
--------------- ----------
PRICE_SEQ             1000

There is no thumb rule to set value for sequence cache. Analyze your database history to find the optimal value to be set.

kish@exdbx<>alter sequence PRICE_SEQ cache 1000;

Sequence altered.

If the sequence is frequently used all the time , then pin the sequence in cache to avoid age out of sequence as per LRU policy



kish@exdbx<>EXEC sys.DBMS_SHARED_POOL.KEEP('PRICE_SEQ','Q');

PL/SQL procedure successfully completed.

Here ‘Q‘ represents sequence

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s