“io done” oracle

These type of events happen very rare even in a minimum tuned system. But if there is limited resource with heavy DML operations, then this event is encountered.

Description:

The session’s waiting for an I/O to finish or for a slave process to become ready to submit the I/O request is known as the “io done” time event. Only platforms that do not allow asynchronous I/O experience this event.

Troubleshoot the issue:

1)As per spade, the event “io done” blocked by the lgwr process due to lack of space in log buffer to store the redo changes which cannot be hold longer. As this event happens due to synchronous IO, the next step is to check if async IO is enabled or disabled

  SID TIME               WHO     PROGRAM SQL_ID        USERNAM STATE      W_CHAINS                       WHO_BLOCK_WHO
----- ------------------ ------- ------- ------------- ------- ---------- ------------------------------ --------------------
  321 09-09-22 21:00:48  oracle  LGWR                  SYS     LONG_WAIT  NULL                           |NULL blocks===> 321
                                                                                                         |

  363 09-09-22 21:00:48  sqlplus TNS V1- 5awjvvs8rvx7g SYS     LONG_WAIT  |io done [|]<~~~ log buffer sp |321 blocks===> 363|
                                 V3                                       ace|

    9 09-09-22 21:00:48  oracle  DBW0                  SYS     LONG_WAIT  NULL                           |NULL blocks===> 9|
   22 09-09-22 21:00:48  oracle  ARC1                  SYS     LONG_WAIT  NULL                           |NULL blocks===> 22|
    9 09-09-22 21:00:45  oracle  DBW0                  SYS     LONG_WAIT  NULL                           |NULL blocks===> 9|

Significant “log file switch (archiving needed)” events also observed during the near time. Hence lack of async IO vastly affects the IO performance of database.

W_CHAINS                                                TIMER CHART                SQL_ID         TOTAVGACT     PER%AS
-------------------------------------------------- ---------- -------------------- ------------- ---------- ----------
|log file switch (archiving needed) [|]<~~~ log fi        945  <><><><><><><><><>< 5awjvvs8rvx7g       1.75       12.6
le switch (archiving needed)|                                 ><><><><><><><><><><
                                                              ><><><><><><><><><><
                                                              ><><><><><><><><><><
                                                              ><><><><><><><

|log file switch (archiving needed) [|]<~~~ log fi        451  <><><><><><><><><><                     1.75          6
le switch (archiving needed)|                                 ><><><><><><><><><><
                                                              ><><>

Hang analysis on an archiver process clearly reveals that the issue needs to be focused on IO. Either settings need to be changed from database to influence IO behavior or storage team need to be approached for any disk issues

Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'io done' <===
     Chain 1 Signature Hash: 0x967ac4c9
 [b] Chain 2 Signature: 'CPU or Wait CPU'
     Chain 2 Signature Hash: 0xabe82914
 [c] Chain 3 Signature: 'Log archive I/O'
     Chain 3 Signature Hash: 0x516ee54

In this first chain, DBWR is waiting on IO operation to complete synchronously waiting on db file parallel write. This mean that issue is specifically on DBWR process slowness or LGWR slowness, but overall IO is affected in the database

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (db9zx.db9zx)
                   os id: 4360
              process id: 18, oracle@xhydra (DBW0)
              session id: 9
        session serial #: 51201
                  pdb id: 1 (CDB$ROOT)
    }
    is waiting for 'io done' with wait info: <<== DBWR waiting on io done event
    {
                      p1: 'msg ptr'=0x3
            time in wait: 0.000558 sec
      heur. time in wait: 0.003753 sec
           timeout after: never
                 wait id: 10788813
                blocking: 0 sessions
          current sql_id: 0
             current sql: <none>
             short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-semtimedop()+14<-skgpwwait()+187<-ksliwat()+2218<-kslwaitctx()+188<-ksfv_wait_on_slave()+214<-ksfvsubmit()+4485<-ksfdaio()+3238<-kcfisio()+312<-kcbbdrv()+2144<-ksb_act_run_int()+117<-ksb_act_run()+130<-ksbabs()+746<-ksbrdp()+1123<-opirip()+541<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+243<-0x5541F689495641D7
            wait history:
              * time between current wait and wait #1: 0.000002 sec
              1.       event: 'io done'  <<==== 
                 time waited: 0.003193 sec
                     wait id: 10788812         p1: 'msg ptr'=0x3
              * time between wait #1 and #2: 0.000002 sec
              2.       event: 'io done'
                 time waited: 0.001904 sec
                     wait id: 10788811         p1: 'msg ptr'=0x2
              3.       event: 'db file parallel write'
                 time waited: 0.000003 sec (last interval)
                 time waited: 0.000780 sec (total)
                     wait id: 10788809         p1: 'requests'=0x1
                                               p2: 'interrupt'=0x0
                                               p3: 'timeout'=0x7fffffff
    }

strace on one of the archiver process shows lot of pwrite64 calls which also confirms synchronous writes to disk by archiver. This reminds me of two parameter which can influence the IO calls performed by database process and also to bypass the filesystem cache. Refer the below post on the two valuable parameters which can perform magic on IO calls

[oracle@xhydra ~]$ grep 'pwrite' arc1 |wc -l
855
[oracle@xhydra ~]$ grep 'pwrite' arc1 |head
     0.000025 pwrite64(256, "\25\302\0\0*\3\0\0\247\305\2\0\377\377\1\4mP\0\0\0$\277\v\0\0\0\0e\22uB"..., 16384, 13271040) = 16384
     0.001160 pwrite64(257, "\25\302\0\0*\3\0\0\247\305\2\0\377\377\1\4mP\0\0\0$\277\v\0\0\0\0e\22uB"..., 16384, 13271040) = 16384
     0.001682 pwrite64(256, "\25\302\0\0\22\0\0\0\247\305\2\0\377\377\1\4\326?\0\0\0\0\0\0\0\0\0\0D>\336A"..., 16384, 294912) = 16384
     0.001240 pwrite64(257, "\25\302\0\0\22\0\0\0\247\305\2\0\377\377\1\4\326?\0\0\0\0\0\0\0\0\0\0D>\336A"..., 16384, 294912) = 16384
     0.001574 pwrite64(256, "\25\302\0\0\20\0\0\0\247\305\2\0\377\377\1\4\201\247\0\0\200\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 262144) = 16384
     0.000770 pwrite64(257, "\25\302\0\0\20\0\0\0\247\305\2\0\377\377\1\4\201\247\0\0\200\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 262144) = 16384
     0.001312 pwrite64(256, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\211\272\0\0\0\0\0\0\0\0\0\23D\300?<"..., 16384, 16384) = 16384
     0.001144 pwrite64(257, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\211\272\0\0\0\0\0\0\0\0\0\23D\300?<"..., 16384, 16384) = 16384
     0.000016 pwrite64(11, "\0\"\0\0\0\0\300\377\0\0\0\0\0\0\0\4\361\203\0\0\0\2\0\0\222\337\5\0}|{z"..., 512, 0) = 512
     0.000684 pwrite64(260, "\1\"\0\0\1\0\0\0\33,\0\0\0\200[\262\0\0\0\0\0\0\0\23D\300?<DB9Z"..., 1048064, 512) = 1048064

Both the parameters are disabled unfortunately which cause the repercussions

kIsH@Xhydra<>show parameter disk_async

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
disk_asynch_io                       boolean     FALSE
kIsH@Xhydra<>show parameter filesyste_io

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                 string      none

Solution:

Enable asynchronous IO and direct IO using the below changes


kIsH@Xhydra<>alter system set disk_asynch_io=true scope=spfile;

System altered.
kIsH@Xhydra<>alter system set filesystemio_options=setall scope=spfile;

System altered.

After setting the above parameters, the wait’s related to “io done” are vanished suddenly

W_CHAINS                                                TIMER CHART                SQL_ID         TOTAVGACT     PER%AS
-------------------------------------------------- ---------- -------------------- ------------- ---------- ----------
|enq: CR - block range reuse ckpt [|]<~~~ enq: CR         288  <><><><><><><><><>< 5awjvvs8rvx7g       1.36        4.4
- block range reuse ckpt|                                     ><><><><

|enq: CR - block range reuse ckpt [|]<~~~ reliable         74  <><><>              5awjvvs8rvx7g       1.36        1.1
 message|

Tracing archiver process shows that “aio” and “io_submit” calls are executed, which denotes that async IO is active.

     0.000392 io_submit(0x7f69dcc7c000, 2, [{aio_data=0x7f69dcb5f978, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=256, aio_buf="\25\302\0\0*\3\0\0\t]\3\0\377\377\1\4\231m\0\0\0(\277\v\0\0\0\0\223#vB"..., aio_nbytes=16384, aio_offset=13271040}, {aio_data=0x7f69dcb5fc20, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=257, aio_buf="\25\302\0\0*\3\0\0\t]\3\0\377\377\1\4\231m\0\0\0(\277\v\0\0\0\0\223#vB"..., aio_nbytes=16384, aio_offset=13271040}]) = 2
     0.000297 io_getevents(0x7f69dcc7c000, 2, 128, [{data=0x7f69dcb5f978, obj=0x7f69dcb5f978, res=16384, res2=0}, {data=0x7f69dcb5fc20, obj=0x7f69dcb5fc20, res=16384, res2=0}], {tv_sec=600, tv_nsec=0}) = 2
     0.000606 io_submit(0x7f69dcc7c000, 2, [{aio_data=0x7f69dcb5f978, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=256, aio_buf="\25\302\0\0\22\0\0\0\t]\3\0\377\377\1\4\\9\0\0\0\0\0\0\0\0\0\0D>\336A"..., aio_nbytes=16384, aio_offset=294912}, {aio_data=0x7f69dcb5fc20, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=257, aio_buf="\25\302\0\0\22\0\0\0\t]\3\0\377\377\1\4\\9\0\0\0\0\0\0\0\0\0\0D>\336A"..., aio_nbytes=16384, aio_offset=294912}]) = 2
     0.000266 io_getevents(0x7f69dcc7c000, 2, 128, [{data=0x7f69dcb5f978, obj=0x7f69dcb5f978, res=16384, res2=0}, {data=0x7f69dcb5fc20, obj=0x7f69dcb5fc20, res=16384, res2=0}], {tv_sec=600, tv_nsec=0}) = 2

Leave a Reply