“io done” oracle
This type of event is rare even in a minimally tuned system, but it does show up when resources are limited and heavy DML activity is running.
Description:
The session is waiting either for an I/O to finish or for an I/O slave process to become available to submit the I/O request; this time is accounted under the “io done” wait event. Only platforms on which asynchronous I/O is unavailable or disabled experience this event.
Troubleshoot the issue:
1) As per spade, sessions waiting on “io done” are blocked by the LGWR process, because the log buffer has no free space left to hold the incoming redo changes. Since this event is caused by synchronous I/O, the next step is to check whether asynchronous I/O is enabled or disabled.
SID TIME WHO PROGRAM SQL_ID USERNAM STATE W_CHAINS WHO_BLOCK_WHO
----- ------------------ ------- ------- ------------- ------- ---------- ------------------------------ --------------------
321 09-09-22 21:00:48 oracle LGWR SYS LONG_WAIT NULL |NULL blocks===> 321
|
363 09-09-22 21:00:48 sqlplus TNS V1-V3 5awjvvs8rvx7g SYS LONG_WAIT |io done [|]<~~~ log buffer space| |321 blocks===> 363|
9 09-09-22 21:00:48 oracle DBW0 SYS LONG_WAIT NULL |NULL blocks===> 9|
22 09-09-22 21:00:48 oracle ARC1 SYS LONG_WAIT NULL |NULL blocks===> 22|
9 09-09-22 21:00:45 oracle DBW0 SYS LONG_WAIT NULL |NULL blocks===> 9|
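The WHO_BLOCK_WHO column above describes a simple blocking chain (LGWR, sid 321, blocks the “io done” waiter, sid 363). Walking such a chain to its root blocker can be sketched as below; the session IDs come from the output, but the data structure itself is purely illustrative, not part of the spade script.

```python
# Map each waiting session to the session blocking it, taken from the
# WHO_BLOCK_WHO column above (illustrative only).
blockers = {363: 321}  # sid 363 ("io done") is blocked by LGWR (sid 321)

def root_blocker(sid, blockers):
    """Walk the blocking chain until a session with no blocker is found."""
    seen = set()
    while sid in blockers and sid not in seen:
        seen.add(sid)
        sid = blockers[sid]
    return sid

print(root_blocker(363, blockers))  # -> 321, the LGWR session
```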
Significant “log file switch (archiving needed)” waits were also observed around the same time. Hence the lack of asynchronous I/O vastly affects the I/O performance of the database.
W_CHAINS TIMER CHART SQL_ID TOTAVGACT PER%AS
-------------------------------------------------- ---------- -------------------- ------------- ---------- ----------
|log file switch (archiving needed) [|]<~~~ log fi 945 <><><><><><><><><>< 5awjvvs8rvx7g 1.75 12.6
le switch (archiving needed)| ><><><><><><><><><><
><><><><><><><><><><
><><><><><><><><><><
><><><><><><><
|log file switch (archiving needed) [|]<~~~ log fi 451 <><><><><><><><><>< 1.75 6
le switch (archiving needed)| ><><><><><><><><><><
><><>
Hang analysis on an archiver process clearly reveals that the investigation should focus on I/O: either database settings need to be changed to influence the I/O behavior, or the storage team needs to be approached about possible disk issues.
Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'io done' <===
Chain 1 Signature Hash: 0x967ac4c9
[b] Chain 2 Signature: 'CPU or Wait CPU'
Chain 2 Signature Hash: 0xabe82914
[c] Chain 3 Signature: 'Log archive I/O'
Chain 3 Signature Hash: 0x516ee54
In this first chain, DBWR is waiting synchronously for an I/O operation to complete ('db file parallel write' appears in its wait history). This does not mean the issue is specific to DBWR or LGWR slowness; overall I/O in the database is affected.
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (db9zx.db9zx)
os id: 4360
process id: 18, oracle@xhydra (DBW0)
session id: 9
session serial #: 51201
pdb id: 1 (CDB$ROOT)
}
is waiting for 'io done' with wait info: <<== DBWR waiting on io done event
{
p1: 'msg ptr'=0x3
time in wait: 0.000558 sec
heur. time in wait: 0.003753 sec
timeout after: never
wait id: 10788813
blocking: 0 sessions
current sql_id: 0
current sql: <none>
short stack: ksedsts()+426<-ksdxfstk()+58<-ksdxcb()+872<-sspuser()+200<-__sighandler()<-semtimedop()+14<-skgpwwait()+187<-ksliwat()+2218<-kslwaitctx()+188<-ksfv_wait_on_slave()+214<-ksfvsubmit()+4485<-ksfdaio()+3238<-kcfisio()+312<-kcbbdrv()+2144<-ksb_act_run_int()+117<-ksb_act_run()+130<-ksbabs()+746<-ksbrdp()+1123<-opirip()+541<-opidrv()+581<-sou2o()+165<-opimai_real()+173<-ssthrdmain()+417<-main()+256<-__libc_start_main()+243<-0x5541F689495641D7
wait history:
* time between current wait and wait #1: 0.000002 sec
1. event: 'io done' <<====
time waited: 0.003193 sec
wait id: 10788812 p1: 'msg ptr'=0x3
* time between wait #1 and #2: 0.000002 sec
2. event: 'io done'
time waited: 0.001904 sec
wait id: 10788811 p1: 'msg ptr'=0x2
3. event: 'db file parallel write'
time waited: 0.000003 sec (last interval)
time waited: 0.000780 sec (total)
wait id: 10788809 p1: 'requests'=0x1
p2: 'interrupt'=0x0
p3: 'timeout'=0x7fffffff
}
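The short stack in the dump above is printed callee-first, with frames separated by '<-'. Reversing it gives the caller-to-callee path and makes the I/O-slave wait obvious (ksfdaio submitting through ksfvsubmit and then parking in ksfv_wait_on_slave). A small helper to make such stacks readable (a sketch, not an Oracle tool):

```python
def call_path(short_stack):
    """Split an Oracle short stack on '<-', drop the '+offset' suffixes,
    and reverse so the path reads caller -> callee."""
    frames = [f.split('+')[0].rstrip('()') for f in short_stack.split('<-')]
    return list(reversed(frames))

# Fragment of the short stack from the hang dump above
stack = "kslwaitctx()+188<-ksfv_wait_on_slave()+214<-ksfvsubmit()+4485<-ksfdaio()+3238"
print(' -> '.join(call_path(stack)))
# ksfdaio -> ksfvsubmit -> ksfv_wait_on_slave -> kslwaitctx
```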
strace on one of the archiver processes shows a lot of pwrite64 calls, which also confirms synchronous writes to disk by the archiver. This points to two parameters that can influence the I/O calls performed by database processes and can bypass the filesystem cache. Refer to the post below on these two valuable parameters, which can work magic on I/O calls.
[oracle@xhydra ~]$ grep 'pwrite' arc1 |wc -l
855
[oracle@xhydra ~]$ grep 'pwrite' arc1 |head
0.000025 pwrite64(256, "\25\302\0\0*\3\0\0\247\305\2\0\377\377\1\4mP\0\0\0$\277\v\0\0\0\0e\22uB"..., 16384, 13271040) = 16384
0.001160 pwrite64(257, "\25\302\0\0*\3\0\0\247\305\2\0\377\377\1\4mP\0\0\0$\277\v\0\0\0\0e\22uB"..., 16384, 13271040) = 16384
0.001682 pwrite64(256, "\25\302\0\0\22\0\0\0\247\305\2\0\377\377\1\4\326?\0\0\0\0\0\0\0\0\0\0D>\336A"..., 16384, 294912) = 16384
0.001240 pwrite64(257, "\25\302\0\0\22\0\0\0\247\305\2\0\377\377\1\4\326?\0\0\0\0\0\0\0\0\0\0D>\336A"..., 16384, 294912) = 16384
0.001574 pwrite64(256, "\25\302\0\0\20\0\0\0\247\305\2\0\377\377\1\4\201\247\0\0\200\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 262144) = 16384
0.000770 pwrite64(257, "\25\302\0\0\20\0\0\0\247\305\2\0\377\377\1\4\201\247\0\0\200\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 262144) = 16384
0.001312 pwrite64(256, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\211\272\0\0\0\0\0\0\0\0\0\23D\300?<"..., 16384, 16384) = 16384
0.001144 pwrite64(257, "\25\302\0\0\1\0\0\0\0\0\0\0\0\0\1\4\211\272\0\0\0\0\0\0\0\0\0\23D\300?<"..., 16384, 16384) = 16384
0.000016 pwrite64(11, "\0\"\0\0\0\0\300\377\0\0\0\0\0\0\0\4\361\203\0\0\0\2\0\0\222\337\5\0}|{z"..., 512, 0) = 512
0.000684 pwrite64(260, "\1\"\0\0\1\0\0\0\33,\0\0\0\200[\262\0\0\0\0\0\0\0\23D\300?<DB9Z"..., 1048064, 512) = 1048064
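The grep | wc -l count above can be extended to also total the bytes written. A minimal parser for the strace line format shown above (the regex assumes exactly that format and is illustrative only):

```python
import re

# Matches lines like: 0.000025 pwrite64(256, "...", 16384, 13271040) = 16384
PWRITE = re.compile(r'pwrite64\((\d+),.*?,\s*(\d+),\s*(\d+)\)\s*=\s*(\d+)')

def summarize(lines):
    """Return (call count, total bytes written) for pwrite64 calls."""
    calls, total = 0, 0
    for line in lines:
        m = PWRITE.search(line)
        if m:
            calls += 1
            total += int(m.group(4))  # bytes actually written (return value)
    return calls, total

sample = [
    '0.000025 pwrite64(256, "...", 16384, 13271040) = 16384',
    '0.000016 pwrite64(11, "...", 512, 0) = 512',
]
print(summarize(sample))  # (2, 16896)
```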
Unfortunately, both parameters are disabled, which causes these repercussions:
kIsH@Xhydra<>show parameter disk_async
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
disk_asynch_io boolean FALSE
kIsH@Xhydra<>show parameter filesyste_io
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options string none
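For reference, filesystemio_options controls both behaviors at once: asynch enables asynchronous I/O only, directio enables direct I/O only, setall enables both, and none (the value above) disables both. The mapping comes from the standard Oracle documentation; the code below is just an illustrative lookup:

```python
# (async_io, direct_io) enabled per filesystemio_options value
FSIO = {
    'none':     (False, False),
    'asynch':   (True,  False),
    'directio': (False, True),
    'setall':   (True,  True),
}

def io_modes(value):
    """Return the (async_io, direct_io) pair for a filesystemio_options setting."""
    return FSIO[value.lower()]

print(io_modes('SETALL'))  # (True, True)
```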
Solution:
Enable asynchronous I/O and direct I/O using the changes below (both take effect only after an instance restart, since they are set with scope=spfile):
kIsH@Xhydra<>alter system set disk_asynch_io=true scope=spfile;
System altered.
kIsH@Xhydra<>alter system set filesystemio_options=setall scope=spfile;
System altered.
After setting the above parameters and restarting the instance, the waits related to “io done” vanished:
W_CHAINS TIMER CHART SQL_ID TOTAVGACT PER%AS
-------------------------------------------------- ---------- -------------------- ------------- ---------- ----------
|enq: CR - block range reuse ckpt [|]<~~~ enq: CR 288 <><><><><><><><><>< 5awjvvs8rvx7g 1.36 4.4
- block range reuse ckpt| ><><><><
|enq: CR - block range reuse ckpt [|]<~~~ reliable 74 <><><> 5awjvvs8rvx7g 1.36 1.1
message|
Tracing the archiver process now shows io_submit and io_getevents calls (the Linux native AIO interface), which confirms that asynchronous I/O is active.
0.000392 io_submit(0x7f69dcc7c000, 2, [{aio_data=0x7f69dcb5f978, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=256, aio_buf="\25\302\0\0*\3\0\0\t]\3\0\377\377\1\4\231m\0\0\0(\277\v\0\0\0\0\223#vB"..., aio_nbytes=16384, aio_offset=13271040}, {aio_data=0x7f69dcb5fc20, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=257, aio_buf="\25\302\0\0*\3\0\0\t]\3\0\377\377\1\4\231m\0\0\0(\277\v\0\0\0\0\223#vB"..., aio_nbytes=16384, aio_offset=13271040}]) = 2
0.000297 io_getevents(0x7f69dcc7c000, 2, 128, [{data=0x7f69dcb5f978, obj=0x7f69dcb5f978, res=16384, res2=0}, {data=0x7f69dcb5fc20, obj=0x7f69dcb5fc20, res=16384, res2=0}], {tv_sec=600, tv_nsec=0}) = 2
0.000606 io_submit(0x7f69dcc7c000, 2, [{aio_data=0x7f69dcb5f978, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=256, aio_buf="\25\302\0\0\22\0\0\0\t]\3\0\377\377\1\4\\9\0\0\0\0\0\0\0\0\0\0D>\336A"..., aio_nbytes=16384, aio_offset=294912}, {aio_data=0x7f69dcb5fc20, aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=257, aio_buf="\25\302\0\0\22\0\0\0\t]\3\0\377\377\1\4\\9\0\0\0\0\0\0\0\0\0\0D>\336A"..., aio_nbytes=16384, aio_offset=294912}]) = 2
0.000266 io_getevents(0x7f69dcc7c000, 2, 128, [{data=0x7f69dcb5f978, obj=0x7f69dcb5f978, res=16384, res2=0}, {data=0x7f69dcb5fc20, obj=0x7f69dcb5fc20, res=16384, res2=0}], {tv_sec=600, tv_nsec=0}) = 2
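Comparing the two traces side by side, the change is visible purely at the syscall level: pwrite64 is a synchronous write, while io_submit/io_getevents are the Linux native AIO interface. A small classifier for strace lines (assuming the line formats shown above) makes the before/after comparison mechanical:

```python
ASYNC_CALLS = {'io_submit', 'io_getevents'}
SYNC_CALLS = {'pwrite64', 'pread64', 'write', 'read'}

def classify(line):
    """Tag a strace line as 'async', 'sync', or 'other' by its syscall name."""
    # The syscall name is the first token after the optional relative timestamp.
    for token in line.split():
        name = token.split('(')[0]
        if name in ASYNC_CALLS:
            return 'async'
        if name in SYNC_CALLS:
            return 'sync'
    return 'other'

print(classify('0.000392 io_submit(0x7f69dcc7c000, 2, ...) = 2'))   # async
print(classify('0.000025 pwrite64(256, "...", 16384, 0) = 16384'))  # sync
```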