SOLVE 100% HIGH CPU USAGE ORACLE

The following article consists of suggestions based on personal experience, learning, and my own style of troubleshooting. This page reflects only my own perspective and is not intended to imitate anyone or any organization.

Check top to identify the PID eating up CPU. Note that top will sometimes show processes that consume CPU only momentarily, so do a high-level analysis first: check the load averages and the idle% in the output.

The first thing to check in the top output is the user at the top of the list. It will typically be either root or oracle.

If it is the root user, you may need to involve the system admin to resolve the issue. But if the user is oracle, then it is the DBA who needs to resolve it.

If the load average for the last 5 minutes is high and idle% is 0.0% consistently, your Oracle database may end up with performance issues because other processes lack CPU resource. If you find this kind of behaviour most of the time, it also means your server is CPU bound with 100% CPU used.
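The high-level check above can be scripted. Below is a minimal sketch (assuming a Linux host; the threshold of "load exceeds core count" is a rule of thumb, not an Oracle-mandated limit):

```shell
#!/bin/sh
# Compare the 1-minute load average against the number of CPU cores.
# If load exceeds cores and idle% is near 0, the box is CPU bound.
cores=$(getconf _NPROCESSORS_ONLN)
load1=$(cut -d' ' -f1 /proc/loadavg)
# awk does the floating-point comparison for us
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
    echo "CPU bound: load $load1 exceeds $cores cores"
else
    echo "OK: load $load1 within $cores cores"
fi
```

The same comparison applies to the 5- and 15-minute values if you want to distinguish a momentary spike from a sustained overload.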

Fig – 0.0 – PYRAMID TRIAGE

TOP:

top - 23:04:10 up  1:18,  1 user,  load average: 8.35, 2.17, 0.92
Tasks: 277 total,  25 running, 245 sleeping,   6 stopped,   1 zombie
Cpu(s): 36.6%us, 16.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi, 46.7%si,  0.0%st
Mem:   4723712k total,  3673968k used,  1049744k free,   125640k buffers
Swap:  3172828k total,        0k used,  3172828k free,  1907968k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10739 oracle    16   0 2113m 122m 117m R 36.5  2.7   0:06.07 oracle

In the top output below, perform three observations:

  • Keep an eye on the ‘%id’ column for 0.0 (100% of CPU cores are utilized)
  • Keep an eye on the load averages (1, 5 and 15 minutes) – the most recent value should not be significantly high compared to the older ones. In the example below, 56.07 (1-min load) is significantly high compared to 43.49 (5-min load) and 30.38 (15-min load)
  • Keep an eye on the top PID and USER and check whether it is an oracle, OS or application process
Fig – 1.0 – Top output with high load average and CPU 100%

VMSTAT:

Use vmstat to see whether the CPU run queue exceeds the total CPU core count. Focus on the ‘r’ column under procs in the vmstat output. A persistently high value shows that many concurrent sessions are waiting in the queue for CPU resource. This means you have to either add more CPU capacity, distribute the application workload around the clock, or schedule jobs during the database's idle time.

[oracle@exdbadm01 ~]$ vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
19  0      0 1370648  38384 1932252    0    0   386   116  483 1479  7  7 80  4  3
38  0      0 1354784  38408 1939472    0    0     2    66 1363 1468 35 62  3  0  0
33  0      0 1338144  38432 1945248    0    0     2    66 1298 1372 30 70  0  0  0
43  0      0 1313560  38456 1947764    0    0     2    52 1589 1300 25 75  0  0  0
21  0      0 1301236  38480 1950140    0    0     2    66 1132 1178 28 72  0  0  0
14  0      0 1296888  38496 1953904    0    0     2    67  934 1317 32 68  0  0  0
11  0      0 1291784  38520 1959876    0    0     2    87 1189 1400 39 61  0  0  0
29  0      0 1290676  38536 1961076    0    0     2    82 1367 1368 16 82  2  0  0
18  0      0 1285328  38552 1961964    0    0     2    34 1699 1370 14 86  0  0  0
30  0      0 1282732  38576 1966356    0    0     2    97 1206 1452 31 69  0  0  0
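The ‘r’ column check can be automated. A hedged sketch that averages the run queue over ten 5-second samples (column positions assume the Linux vmstat layout shown above; the first sample is skipped because it is a since-boot average):

```shell
#!/bin/sh
# Average the run-queue ('r', first column) over 10 samples, skipping
# the two header lines plus the first (since-boot) sample.
cores=$(getconf _NPROCESSORS_ONLN)
avg_r=$(vmstat 5 10 | awk 'NR > 3 { sum += $1; n++ } END { if (n) printf "%.1f", sum / n }')
echo "avg run queue: $avg_r (cores: $cores)"
```

If the reported average stays above the core count, sessions are queueing for CPU rather than running.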

Check the plist-sz column of the sar output and plot it as a time-series graph. Visualization often makes it easy to identify spikes in resources or workload during a particular time interval.
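To extract the data for such a plot, the sar task-queue report can be reduced to two columns (timestamp, plist-sz). This is a sketch assuming the common sysstat `sar -q` layout with a 24-hour clock, where plist-sz is the third field; the field index shifts by one under a 12-hour (AM/PM) locale:

```shell
#!/bin/sh
# Extract timestamp and plist-sz (total task count) from 'sar -q'
# for plotting as a time series with gnuplot, Excel, or similar.
sar -q | awk '$1 ~ /^[0-9]/ { print $1, $3 }' > plist.dat
```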

Fig – 1.1 – Time series plot of total process in database

Plot the load average against CPU using sar output to analyze the history of system behaviour with respect to the workload. Looking at the Fig – 1.1 graph, we can easily narrow down the unexpected pile-up of CPU utilization at the right corner of the graph.

Compare the Fig – 1.1 graph with Fig – 1.2. You can observe a correlated pattern:

  • The higher the task count (>600), the higher the CPU load.
  • Up to a point the CPU usage stays low. But when the task count increased from 18:00:01 until 23:50:03, the CPU usage started to climb the hill.
  • In the output below, CPU usage did not increase until 23:20:04 in Fig 1.2; however, the task count started to increase at 17:50:01 itself in Fig 1.1. This shows that the increase in task count (600) at 17:50:01 does not suddenly push the CPU to 100% (idle% 0.0). CPU usage stayed constant for more than 3 hours irrespective of the process count, but once the task count exceeded 600 at 23:10:01, a sudden spike is observed below.
Fig – 1.2 – Load average CPU in server

To understand it clearly, see the graph below. The red line indicates the load; yellow, grey and orange represent the CPU usage.

Fig – 1.3 – Increase in load(>600) after some extent(600) increase the CPU usage drastically

If you are not sure which process consumes high CPU, use the command below to check the top 5 CPU-consuming processes. Even this command will not always nail down the bottleneck, because we need to narrow down a process that eats CPU consistently, not an ad-hoc process that suddenly spikes the CPU and immediately disappears from the system. A process that never makes the toppers list can hide between other temporarily high processes and mislead us, so it's good to be careful in focusing on the right process.

e.g.: Process X consumes high CPU (100%) at the moment and a CPU spike is observed. In the blink of an eye, the process vanishes from top. Process X tempts us to conclude it is the bottleneck. That's not always the case.

Process Y consumes high CPU (80%) at the moment and stays at 80% CPU for the next 20 minutes, but it is not in the toppers list of the top output. Process Y may still be the cause. This is just an illustrated example; in reality, there are various other factors to consider.
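One hedged way to separate a persistent "process Y" from a momentary "process X" is to sample repeatedly and keep only the PIDs that appear near the top in every sample. The sample count and interval below are illustrative:

```shell
#!/bin/sh
# Take 5 samples, 10s apart, of the top 5 CPU consumers, then keep
# only PIDs seen in every sample -- the persistent CPU eaters.
for i in 1 2 3 4 5; do
    ps -eo pcpu,pid,comm --sort=-pcpu | head -6 | tail -5
    sleep 10
done | awk '{ seen[$2]++ } END { for (p in seen) if (seen[p] == 5) print p }'
```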

DRILL DOWN (NAIL THE DATABASE PROCESS):

From the database perspective, there are certain things to rule out to narrow down the bottleneck. They are:

  • RMAN backups
  • Database internal jobs like stats, advisory, OEM etc.
  • SQL query

Let us look at the most common causes of high CPU.

Here we identified the top process, which belongs to the oracle user with PID 10739.

[oracle@exdbadm01 ~]$ ps -eo pcpu,user,pid|sort -nr|head -5
20.1 oracle   10739
 3.0 oracle   10850
 2.5 oracle   10849
 1.6 oracle   10814
 1.0 oracle   10812

Check the PID for more details. Here we identified that the connection is not local but comes in remotely, from SQL Developer, TOAD or a similar client.

[oracle@exdbadm01 ~]$ ps -ef|grep 10739
oracle   10739     1  3 23:03 ?        00:01:15 oracleexdbx1 (LOCAL=NO)
oracle   13024  5505  0 23:38 pts/0    00:00:00 grep 10739

Login to the database and grab the details of the user

kish@exdbx<>select PID,SPID,PNAME,USERNAME,SERIAL#,TERMINAL,PROGRAM from v$process where spid=10739;

   PID     SPID            PNAME           USERNAME         SERIAL#   TERMINAL        PROGRAM
---------- --------------- --------------- --------------- ---------- --------------- ---------------
        81 10739           NULL            oracle                   1 UNKNOWN         oracle@exdbadm0
                                                                                      1

SQL QUERY:

Then you can use the query below to narrow down the session consuming high CPU and determine whether a SQL query is the cause. Here the SQL_ID column shows sql id ‘74b11p5tzdtgy’.

kish@exdbx<>set lines 250
set pages 2000
col username format a15 justify center
col program format a20 justify center
col event format a30 justify center
col sid format 99999 justify center
col SESSION_CPU_USAGE format 99999 justify center
select * from (select z.sid,nvl(z.username,'oracle-bg') as username,nvl(z.SQL_ID,'non-SQL') as SQL_ID,z.EVENT,z.program,round(sum(y.value)/100,6) as "SESSION_CPU_USAGE"
from v$statname x
inner join v$sesstat y on x.STATISTIC# = y.STATISTIC#
inner join v$session z on y.SID = z.SID
where x.name in ('CPU used by this session') and z.status='ACTIVE' group by z.sid,z.username,z.SQL_ID,z.EVENT,z.program order by SESSION_CPU_USAGE desc)
where rownum < 10;

 SID      USERNAME        SQL_ID                 EVENT                    PROGRAM        SESSION_CPU_USAGE
------ --------------- ------------- ------------------------------ -------------------- -----------------
    91 TEST            74b11p5tzdtgy SQL*Net message from client    sqlplus.exe                         75
    14 oracle-bg       non-SQL       gcs remote message             oracle@exdbadm01 (LM                 7
                                                                    S0)

    95 SYS             d5ztjkzzv5nmb SQL*Net message to client      sqlplus@exdbadm01 (T                 6
                                                                    NS V1-V3)

    13 oracle-bg       non-SQL       ges remote message             oracle@exdbadm01 (LM                 5
                                                                    D0)

    62 oracle-bg       non-SQL       rdbms ipc message              oracle@exdbadm01 (CJ                 4
                                                                    Q0)

    63 TEST            9013dtzg5xx47 SQL*Net message from client    sqlplus.exe                          3

In my case, I identified that the highest-CPU session was executing a SQL query, and I correlated the sql id with the sql text and wait event, which showed enqueue waits due to an update statement.

kish@exdbx<>select sq.sql_text,se.event from v$sql sq,v$session se where se.sql_id=sq.sql_id and sq.sql_id='74b11p5tzdtgy';

SQL_TEXT          EVENT
--------------- ----------
update SALES_TA enq: TX -
B_COPY set NULL row lock c
_COLUMN = 'ABC' ontention
 where mod(ORDE
R_ID,3)=1

update SALES_TA SQL*Net me
B_COPY set NULL ssage from
_COLUMN = 'ABC'  client
 where mod(ORDE
R_ID,3)=1

You can also check the PID which causes high CPU from the database, using the SPID column to confirm the user.

kish@exdbx<>select s.sid,s.username,s.sql_id,p.spid from v$process p,v$session s where p.spid=10739 and s.username='TEST';

 SID      USERNAME        SQL_ID     SPID
------ --------------- ------------- ---------------
    35 TEST            9013dtzg5xx47 10739
    36 TEST            9013dtzg5xx47 10739
    44 TEST            9013dtzg5xx47 10739
    49 TEST            9013dtzg5xx47 10739
    51 TEST            74b11p5tzdtgy 10739

.......


kish@exdbx<>select s.sid,p.spid,count(*) from v$process p,v$session s where p.spid=10739 and s.username='TEST' group by s.sid,p.spid;

   SID     SPID                                                      COUNT(*)
---------- --------------- --------------------------------------------------
        36 10739                                                            1
        44 10739                                                            1
        82 10739                                                            1
        51 10739                                                            1
        52 10739                                                            1
        73 10739                                                            1
        57 10739                                                            1
        60 10739                                                            1
        72 10739                                                            1

Causes of high CPU due to SQL query:

HIGH PARALLEL SERVERS:

Check for any parallel processes used by any user from the OS perspective. Here we can tell that the oracle user has spawned parallel slaves; we can grep for the parallel processes using the pattern ora_p0.

[oracle@exdbadm01 ~]$ ps -ef|grep ora_p0|grep -v 'grep'|wc -l
27

output:

oracle   10347     1  0 22:58 ?        00:00:00 ora_p000_exdbx1
oracle   10349     1  0 22:58 ?        00:00:00 ora_p001_exdbx1
oracle   10351     1  0 22:58 ?        00:00:00 ora_p002_exdbx1
oracle   10358     1  0 22:58 ?        00:00:00 ora_p003_exdbx1
oracle   10360     1  0 22:58 ?        00:00:00 ora_p004_exdbx1
oracle   10362     1  0 22:58 ?        00:00:00 ora_p005_exdbx1
oracle   10364     1  0 22:58 ?        00:00:00 ora_p006_exdbx1
oracle   10366     1  0 22:58 ?        00:00:00 ora_p007_exdbx1
oracle   10368     1  0 22:58 ?        00:00:00 ora_p008_exdbx1
......
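The slave count can be compared against the core count directly. A sketch (the process-name pattern follows the ora_p0NN_&lt;instance&gt; naming shown above; the "more slaves than cores" threshold is a rough heuristic):

```shell
#!/bin/sh
# Flag when the number of parallel slave processes exceeds the number
# of CPU cores -- a hint that parallelism is oversized for the box.
cores=$(getconf _NPROCESSORS_ONLN)
slaves=$(ps -ef | grep '[o]ra_p0' | wc -l)
if [ "$slaves" -gt "$cores" ]; then
    echo "parallel slaves ($slaves) exceed cores ($cores)"
fi
```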

Once you confirm from the OS that parallelism is in use, check parallel process usage from the database perspective. Here we observe high parallel activity on a 4-core CPU, which is beyond the workload and is not normal.

kish@exdbx<>select * from v$px_process;

SERVER_NAME      STATUS      PID     SPID               SID      SERIAL#
--------------- --------- ---------- --------------- ---------- ----------
P026               IN USE         79 10515           NULL       NULL
P020            AVAILABLE         73 10502           NULL       NULL
P018            AVAILABLE         71 10498           NULL       NULL
P007            AVAILABLE         58 10366           NULL       NULL
P013            AVAILABLE         64 10379           NULL       NULL
P016            AVAILABLE         69 10494           NULL       NULL
P003            AVAILABLE         54 10358           NULL       NULL
P021               IN USE         74 10504           NULL       NULL
P009               IN USE         60 10371           NULL       NULL
P025            AVAILABLE         78 10513           NULL       NULL
PZ99            AVAILABLE         52 12623           NULL       NULL
P011            AVAILABLE         62 10375           NULL       NULL
P002            AVAILABLE         53 10351           NULL       NULL
P000            AVAILABLE         49 10347           NULL       NULL
P017            AVAILABLE         70 10496           NULL       NULL
P010            AVAILABLE         61 10373           NULL       NULL
P006            AVAILABLE         57 10364           NULL       NULL
P001            AVAILABLE         50 10349           NULL       NULL
P023               IN USE         76 10509           NULL       NULL
P024            AVAILABLE         77 10511           NULL       NULL
P014            AVAILABLE         65 10381           NULL       NULL
P012            AVAILABLE         63 10377           NULL       NULL
P015            AVAILABLE         66 10383           NULL       NULL
P022            AVAILABLE         75 10506           NULL       NULL
P019            AVAILABLE         72 10500           NULL       NULL
P005            AVAILABLE         56 10362           NULL       NULL
P008            AVAILABLE         59 10368           NULL       NULL
P004            AVAILABLE         55 10360           NULL       NULL

28 rows selected.

Now correlate the sid to find the username and sql id behind the parallel consumption.

kish@exdbx<>select s.sid,s.username,s.sql_id,p.spid,p.pid from v$process p,v$session s,v$px_process px where p.spid=px.spid and p.pid=px.pid and s.serial#=px.serial# and s.sid=px.sid and s.username='TEST';

   SID     USERNAME           SQL_ID     SPID               PID
---------- --------------- ------------- --------------- ----------
        80 TEST            9013dtzg5xx47 13630                   40
        81 TEST            9013dtzg5xx47 13632                   45
        87 TEST            a1xbv4d9f876d 13577                   50
        88 TEST            a1xbv4d9f876d 13579                   52
        89 TEST            a1xbv4d9f876d 13581                   53
        57 TEST            a1xbv4d9f876d 13583                   54
        36 TEST            a1xbv4d9f876d 13588                   55
        95 TEST            a1xbv4d9f876d 13590                   56
        49 TEST            a1xbv4d9f876d 13592                   57
        83 TEST            a1xbv4d9f876d 13594                   58
        52 TEST            a1xbv4d9f876d 13596                   59
        84 TEST            a1xbv4d9f876d 13598                   60
        72 TEST            a1xbv4d9f876d 13600                   61
        70 TEST            a1xbv4d9f876d 13602                   62
        85 TEST            a1xbv4d9f876d 13604                   63
        74 TEST            a1xbv4d9f876d 13606                   64
        66 TEST            a1xbv4d9f876d 13608                   65
        86 TEST            a1xbv4d9f876d 13612                   66
       100 TEST            9013dtzg5xx47 13616                   69
        69 TEST            9013dtzg5xx47 13618                   70
        65 TEST            9013dtzg5xx47 13620                   71
        79 TEST            9013dtzg5xx47 13622                   72
        35 TEST            9013dtzg5xx47 13624                   73
        44 TEST            9013dtzg5xx47 13626                   74
        60 TEST            9013dtzg5xx47 13628                   75
        73 TEST            9013dtzg5xx47 13634                   76

26 rows selected.

HIGH CPU DUE TO RMAN BACKUPS:

Check whether backups run during the high-CPU window. Channel allocation from RMAN causes CPU to pile up.

kish@exdbx<>col STATUS format a15
col hrs format 999.99
select
INPUT_TYPE, STATUS,
to_char(START_TIME,'mm/dd/yy hh24:mi') start_time,
to_char(END_TIME,'mm/dd/yy hh24:mi') end_time,
elapsed_seconds/3600 hrs
from V$RMAN_BACKUP_JOB_DETAILS
order by session_key desc;

INPUT_TYPE          STATUS        START_TIME      END_TIME        HRS
--------------- --------------- -------------- -------------- -------
ARCHIVELOG      FAILED          04/12/21 23:15 NULL           NULL
ARCHIVELOG      FAILED          04/06/21 22:13 NULL           NULL

Elapsed: 00:00:00.65

If a backup is currently running, check for ‘MMNL’ in the PROGRAM column of v$session. If that program is among the top CPU contributors, you probably have to keep an eye on the media layer.

SQL> select sid,program,status
  2  from
  3  v$session
  4  where paddr=(select addr from v$process where program like '%MMNL%');

       SID PROGRAM                                          STATUS
---------- ------------------------------------------------ --------
        16 oracle@orcl19x (MMNL)                            ACTIVE

SQL> set lines 250
SQL> set pages 2000
col username format a15 justify center
col program format a20 justify center
col event format a30 justify center
col sid format 99999 justify center
col SESSION_CPU_USAGE format 99999 justify center
select * from
(
select z.sid,nvl(z.username,'oracle-bg') as username,
nvl(z.SQL_ID,'non-SQL') as SQL_ID,
z.EVENT,z.program,round(sum(y.value)/100,6) as "SESSION_CPU_USAGE"
from v$statname x
inner join v$sesstat y on x.STATISTIC# = y.STATISTIC#
inner join v$session z on y.SID = z.SID
where x.name in ('CPU used by this session')
and z.status = 'ACTIVE'
and z.program = 'MMNL'
group by
z.sid,z.username,z.SQL_ID,z.EVENT,z.program order by SESSION_CPU_USAGE desc
)
where rownum < 10;

no rows selected

HIGH CPU DUE TO STATS JOB:

Last but not least! Check for any stats job running on your database under the SYS user that consumes high CPU.

kish@exdbx<>select job from dba_jobs_running where sid =
(select sid from v$session where paddr=
(select addr from v$process where program like '%J000%'));

no rows selected


kish@exdbx<>@colfrmt
Procedure created.
Procedure created.


kish@exdbx<>SELECT JOB_NAME, SCHEDULE_NAME, SCHEDULE_TYPE, ENABLED,to_char(LAST_START_DATE,'dd-mon-yy hh24:mi:ss'),to_char(LAST_RUN_DURATION,'dd-mon-yy hh24:mi:ss'),to_char(NEXT_RUN_DATE,'dd-mon-yy hh24:mi:ss'),MAX_RUN_DURATION
FROM DBA_SCHEDULER_JOBS
WHERE PROGRAM_NAME like '%STAT%';


JOB_NAME        SCHEDULE_NAME   SCHEDULE_TYPE   ENABL TO_CHAR(LAST_START_DATE,'DD TO_CHAR(LAST_RUN_DURATION,' TO_CHAR(NEXT_RUN_DATE,'DD-M MAX_RUN_DURATIO
--------------- --------------- --------------- ----- --------------------------- --------------------------- --------------------------- ---------------
BSLN_MAINTAIN_S BSLN_MAINTAIN_S NAMED           TRUE  07-jun-21 09:23:04          +000000000 00:00:13.294716  13-jun-21 00:00:00        NULL
TATS_JOB        TATS_SCHED

ORA$AT_OS_OPT_S ORA$AT_WGRP_OS  WINDOW_GROUP    TRUE  08-jun-21 22:00:06          NULL                        NULL     NULL
Y_386

=============================================================================================================

kish@exdbx<>select
   log_date,
   job_name,
   status,
   to_char(req_start_date,'dd-mon-yy hh24:mi:ss'),
   to_char(actual_start_date,'dd-mon-yy hh24:mi:ss'),
   run_duration,CPU_USED
from
   dba_scheduler_job_run_details where job_name like '%STAT%'order by to_char(actual_start_date,'dd-mon-yy hh24:mi:ss') desc;
LOG_DATE        JOB_NAME                    STATUS             TO_CHAR(REQ_START_DATE,'DD- TO_CHAR(ACTUAL_START_DATE,' RUN_DURATION    CPU_USED
--------------- --------------- ------------------------------ --------------------------- --------------------------- --------------- ---------------
23-MAY-21 01.06 BSLN_MAINTAIN_S SUCCEEDED                      23-may-21 00:00:00          23-may-21 00:36:18          +000 00:00:02   +000 00:00:00.1
.20.887438 PM + TATS_JOB                                                                                                    7
05:30

18-MAY-21 09.35 BSLN_MAINTAIN_S SUCCEEDED                      16-may-21 00:00:00          18-may-21 09:04:52          +000 00:00:17   +000 00:00:01.7
.09.149467 PM + TATS_JOB                                                                                                    8
05:30

07-JUN-21 09.53 MGMT_STATS_CONF SUCCEEDED                      07-jun-21 21:53:13          07-jun-21 21:53:38          +000 00:00:00   +000 00:00:00.1
.38.693675 PM + IG_JOB_2                                                                                                    2
05:30

07-JUN-21 09.53 MGMT_STATS_CONF SUCCEEDED                      07-jun-21 21:53:13          07-jun-21 21:53:13          +000 00:00:01   +000 00:00:00.0
.15.085393 PM + IG_JOB_1                                                                                                    7
05:30

07-JUN-21 09.53 BSLN_MAINTAIN_S SUCCEEDED                      30-may-21 00:00:00          07-jun-21 09:23:04          +000 00:00:13   +000 00:00:00.6
.17.798825 PM + TATS_JOB                                                                                                    0
05:30

07-JUN-21 09.53 MGMT_STATS_CONF SUCCEEDED                      01-jun-21 01:01:01          07-jun-21 09:23:03          +000 00:00:10   +000 00:00:00.4
.13.875305 PM + IG_JOB                                                                                                      1
05:30


6 rows selected.
======================================================================================================

kish@exdbx<>select OPERATION,TARGET,to_char(START_TIME,'dd-mm-yy hh24:mi:ss'),to_char(END_TIME,'dd-mm-yy hh24:mi:ss') from dba_optstat_operations order by END_TIME desc;

                           OPERATION                             TARGET          TO_CHAR(START_TIM TO_CHAR(END_TIME,
---------------------------------------------------------------- --------------- ----------------- -----------------
purge_stats                                                      NULL            08-06-21 22:10:42 08-06-21 22:10:46
gather_database_stats(auto)                                      NULL            08-06-21 22:00:10 08-06-21 22:10:42
gather_table_stats                                               SYS.WRH$_FILEST 07-06-21 22:22:50 07-06-21 22:22:51
                                                                 ATXS.WRH$_FILES
                                                                 T_577612197_247

As this is not a proactive approach, we have to act immediately to resolve the high-CPU issue, since it may slow down database performance. Kill the PID which causes the high CPU. We don't have any option other than killing the session.

[oracle@exdbadm01 ~]$ kill -9 10739
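Before firing kill -9, it is worth a quick sanity check that the PID still exists and belongs to oracle; killing the wrong Oracle background process by mistake can bring the instance down. A sketch using the example PID from above:

```shell
#!/bin/sh
# Confirm the PID is still alive and owned by oracle before kill -9.
# PID 10739 is the example server process identified earlier.
pid=10739
owner=$(ps -o user= -p "$pid" 2>/dev/null | tr -d ' ')
if [ "$owner" = "oracle" ]; then
    kill -9 "$pid"
else
    echo "PID $pid not found or not owned by oracle; skipping"
fi
```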

HIGH CPU INDICATION DUE TO RESOURCE MANAGER:

Sometimes, if Resource Manager is configured to limit CPU for a resource group, the “resmgr:cpu quantum” wait event will be at the top of the foreground wait events, which also indicates tasks waiting to acquire CPU resource.

Event                    Waits   Time(s)  Avg wait(ms)  % DB time  Wait Class
------------------------ ------- -------- ------------- ---------- ----------
resmgr:cpu quantum         1,596      281           134       3.90  Scheduler
db file scattered read     1,051      778            86       3.81  User I/O

For a long-term solution, we need to tune the SQL queries to consume less CPU and distribute the application workload into non-business hours.

Hope you liked this post!
