oracle慎用基于on commit刷新物化视图的方式(一)
背景
上周五,也就是-08-07那天快下班的时候,突然发生一件事。客户反馈xxxx平台登录不上去,当时一开始没意识到是数据库导致的问题。
业务人员在那怀疑这里那里,还是网络ip哪里地区又被封了。后来都排除没发生什么问题,还想着重启Nginx服务器(我也不懂)。
但是维护Nginx的人员刚好请假了。
刚好当时大概是这个事情发生的半个小时之前,监控邮件报出如下错误,
当时以为只是归档目录快满了,简单的备份了下归档,清理掉多余的归档。
排查
问题出现后,虽然没怀疑到数据库的问题,不过我也还是看了一眼数据库。
发现警告日志闪回恢复区有在报使用率超标的问题,有问题,绝对有问题。
除了归档量暴增的问题和刚搭建起来没多久的备库归档也爆掉的问题,没发现其他的报错。
先看看日志切换的频率,和平时同一时段相比,这归档多了100多倍。
让我一度怀疑是应用出现了什么问题导致dml操作剧增导致的。
SELECT TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'999') "00",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'999') "01",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'999') "02",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'999') "03",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'999') "04",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'999') "05",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'999') "06",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'999') "07",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'999') "08",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'999') "09",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'999') "10",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'999') "11",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'999') "12",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'999') "13",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'999') "14",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'999') "15",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'999') "16",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'999') "17",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'999') "18",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'999') "19",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'999') "20",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'999') "21",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'999') "22",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'999') "23"FROMV$LOG_HISTORYGROUP BY TO_CHAR(FIRST_TIME,'YYYY-MM-DD')ORDER BY 1 DESC;
View Code
这个时候我需要定位找出是哪条SQL或者哪些SQL引起的问题。
通过如下语句,可以找出哪些对象有大量数据块变化情况,
SELECT TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24') SNAP_TIME,
DHSO.OBJECT_NAME,SUM(DB_BLOCK_CHANGES_DELTA) BLOCK_CHANGEDFROMDBA_HIST_SEG_STAT DHSS,
DBA_HIST_SEG_STAT_OBJ DHSO,
DBA_HIST_SNAPSHOT DHSWHERE DHS.SNAP_ID =DHSS.SNAP_IDAND DHS.INSTANCE_NUMBER =DHSS.INSTANCE_NUMBERAND DHSS.OBJ# =DHSO.OBJ#AND DHSS.DATAOBJ# =DHSO.DATAOBJ#AND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('-08-07 16:00','YYYY-MM-DD HH24:MI')ANDTO_DATE('-08-07 17:30', 'YYYY-MM-DD HH24:MI')GROUP BY TO_CHAR(BEGIN_INTERVAL_TIME, 'YYYY-MM-DD HH24'),
DHSO.OBJECT_NAME
HAVING SUM(DB_BLOCK_CHANGES_DELTA) > 0
ORDER BY SUM(DB_BLOCK_CHANGES_DELTA) asc;
View Code
部分关键结果如下,
SNAP_TIME |OBJECT_NAME |BLOCK_CHANGED---------------------------------------|------------------------------|-------------
............. |.............. |......-08-07 16 |.............. | 100480
-08-07 16 |.............. | 129184
-08-07 16 |.............. | 137024
-08-07 17 |.............. | 143520
-08-07 17 |.............. | 145440
-08-07 16 |.............. | 228624
-08-07 16 |.............. | 256704
-08-07 17 |.............. | 548336
-08-07 17 |.............. | 55019536
-08-07 17 |.............. | 103429408
-08-07 16 |T_XXXXXXXXXX_INFO | 103997488
-08-07 16 |PK_X_XXXXXX_XXXX | 195532752
210rows selected.
Elapsed:00:00:02.29
在-08-07 16点内,有两个对象的块变化是最多的,分别是
T_XXXXXXXXXX_INFO
PK_X_XXXXXX_XXXX
PK_X_XXXXXX_XXXX很容易就可以查出,这是一个主键索引,对应的表名为X_XXXXXX_XXXX。
赶紧取问题时间段的awr报告,看看是否存在相关的X_XXXXXX_XXXX以及T_XXXXXXXXXX_INFO的sql,顺便也看看等待事件。
结果一眼就看出等待事件存在问题,
EventWaitsTotal Wait Time (sec)Wait Avg(ms)% DB timeWait Class
enq: TX - row lock contention
683
223.1K
326667
39.0
Application
gc buffer busy acquire
10,048
163.8K
16302
28.6
Cluster
enq: JI - contention
23,034
112.7K
4892
19.7
Other
buffer busy waits
1,770
20.9K
11784
3.6
Concurrency
DB CPU
20K
3.5
enq: TX - index contention
591
6630.2
11219
1.2
Concurrency
log file switch (checkpoint incomplete)
481
6268.1
13031
1.1
Configuration
db file scattered read
482,552
4536
9
.8
User I/O
gc current block busy
10,845
3853.8
355
.7
Cluster
direct path read
832,639
3170
4
.6
User I/O
这个系统平时没有存在那么高的TX锁等待,赶紧看看哪个段的行锁等待比较严重,
Segments by Row Lock Waits
OwnerTablespace NameObject NameSubobject NameObj. TypeRow Lock Waits% of Capture
ABCD_EFG
ABCD
X_XXXXXX_XXXX
TABLE
681
53.71
ABCD_XXXXXXXXS
DYCK_FILETRANS
PK_XXXX_XXXXXXXX_TMP
INDEX
321
25.32
刚好和前边块变化最多的对象是同一个,也可以看看块变化,
Segments by DB Blocks Changes
OwnerTablespace NameObject NameSubobject NameObj. TypeDB Block Changes% of Capture
XXXXX_DATA
USERS
PK_X_XXXXXX_XXXX
INDEX
161,129,664
64.72
XXXXX_DATA
USERS
T_XXXXXXXXXX_INFO
TABLE
85,726,272
34.43
都对的上!!!
这个时候在从awr中找找含有X_XXXXXX_XXXX表的SQL,
Elapsed Time (s)ExecutionsElapsed Time per Exec (s)%Total%CPU%IOSQL IdSQL ModuleSQL Text
226,451.96
1,075
210.65
39.58
0.01
0.00
JDBC Thin Client
updateX_XXXXXX_XXXX set ID=:1...
166,564.13
1,368
121.76
29.12
0.14
0.00
JDBC Thin Client
......................
20,730.38
294
70.51
3.62
0.01
0.03
JDBC Thin Client
......................
5,395.58
0
0.94
5.25
94.88
......................
5,395.54
7
770.79
0.94
5.25
94.88
......................
5,205.90
38
137.00
0.91
40.58
58.45
xxxxxxxxx.exe
......................
4,619.25
17,384
0.27
0.81
0.05
0.08
xxxxxxxxxxxxxxxxxx.exe
......................
4,231.15
1
4,231.15
0.74
99.72
0.09
xxxxxxxxxxx.exe
......................
4,049.48
47,215
0.09
0.71
0.14
0.20
xxxxxxxxxxxxxxxxxx.exe
......................
2,707.50
1
2,707.50
0.47
99.87
0.10
xxxx.exe
......................
set lines 150 pages 150col BEGIN_INTERVAL_TIMEfora23
col PLAN_HASH_VALUEfor 9999999999col date_timefora30
col snap_id heading'SnapId'col executions_delta heading "No.of exec"
col sql_profile heading "SQL|Profile" fora7
col date_time heading'Date time'col avg_lio heading'LIO/exec' for 99999999999.99col avg_cputime heading'CPUTIM/exec' for 9999999.99col avg_etime heading'ETIME/exec' for 9999999.99col avg_pio heading'PIO/exec' for 9999999.99col avg_row heading'ROWs/exec' for 9999999.99
SELECT distincts.snap_id ,
PLAN_HASH_VALUE,
to_char(s.BEGIN_INTERVAL_TIME,'yyyy-mm-dd hh24:mi:ss')|| to_char(s.END_INTERVAL_TIME,'-hh24:mi:ss') Date_Time,
SQL.executions_delta,
SQL.buffer_gets_delta/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_lio,--wait_delta,
(SQL.cpu_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_cputime ,
(SQL.elapsed_time_delta/1000000)/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_etime,
SQL.DISK_READS_DELTA/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_pio,
SQL.rows_processed_total/decode(nvl(SQL.executions_delta,0),0,1,SQL.executions_delta) avg_row--,SQL.sql_profile
FROMdba_hist_sqlstat SQL,
dba_hist_snapshot sWHERESQL.instance_number=(select instance_number fromv$instance)and SQL.dbid =(select dbid from v$database)and s.snap_id =SQL.snap_idAND sql_id in('&SQLID') order bys.snap_id/
View Code
结果如下,
SnapId|PLAN_HASH_VALUE|Date time |No. of exec| LIO/exec|CPUTIM/exec| ETIME/exec| PIO/exec| ROWs/exec
----------|---------------|------------------------------|-----------|---------------|-----------|-----------|-----------|-----------
16581| 2297901584|-08-07 13:30:02-14:00:09 | 370| 4.04| .00| .00| .00| 5547.76
16583| 2297901584|-08-07 14:30:16-15:00:24 | 563| 4.02| .00| .00| .00| 3648.02
16584| 2297901584|-08-07 15:00:24-15:30:31 | 567| 4.03| .00| .00| .00| 3623.28
16585| 2297901584|-08-07 15:30:31-16:00:39 | 442| 4.05| .00| .00| .00| 4648.97
16586| 2297901584|-08-07 16:00:39-16:30:47 | 429| 4.28| .00| .59| .01| 4790.84
16587| 2297901584|-08-07 16:30:47-17:01:31 | 366| 5.72| .06| 169.81| .07| 5616.50
16587| 2297901584|-08-07 16:30:47-17:07:50 | 366| 5.72| .06| 169.81| .07| 5616.50
16588| 2297901584|-08-07 17:01:31-17:30:32 | 278| 15.70| .03| 423.38| .15| 7395.38
16588| 2297901584|-08-07 17:07:50-17:30:32 | 278| 15.70| .03| 423.38| .15| 7395.38
16589| 2297901584|-08-07 17:30:32-18:00:41 | 369| 31.79| .07| 1180.75| .00| 5572.59
16590| 2297901584|-08-07 18:00:41-18:30:51 | 535| 44.85| .06| 909.64| .00| 3844.53
16591| 2297901584|-08-07 18:30:51-19:00:00 | 201| 55.29| .01| 192.97| .00| 10233.94
16592| 2297901584|-08-07 19:00:00-19:30:08 | 26| 4.69| .00| 1.26| .00| 79117.23
从8月7日下午1点开始可以查询到记录,一直在下午4点半后开始,sql执行时间急剧增加,但是执行计划确是并没有变化过。
并且从执行次数看,并没有多太过于剧烈的波动。
到这里目前似乎没有什么好办法可以自己追查下去,能确定的是出问题的SQL语句就是95nn49n57dqza,由于update引起大量的TX锁等待,导致xxxx平台登录不上去或者响应很久才出现页面。
而TX锁严重的原因就是一次update的时间从不到1s变到几百s甚至上千s。
这里存在一个问题,便是在看了awr发现tx锁严重,再到确定SQL的过程,由于重心太过明确导致忽略掉了一开始找出哪些对象有大量数据块变化情况那里,除了PK_X_XXXXXX_XXXX外其实还有另外一个对象T_XXXXXXXXXX_INFO。
暂且忽略T_XXXXXXXXXX_INFO顺着一直下去的思路的话,单纯从视图间的关联貌似已经无法继续下去,可以选择在数据库级别做一次ssd,或者10046。
推荐10046,从结果反推回去,10046是可以看出会话除了有执行95nn49n57dqza之外,也是有对T_XXXXXXXXXX_INFO表做insert的操作的。
因为第一时间发现95nn49n57dqza的问题,立马问了业务人员之后,他们有反馈在4点左右有创建一张物化视图T_XXXXXXXXXX_INFO,基表就是X_XXXXXX_XXXX。因此没做10046。
看了创建语句,用了on commit。大致找了下资料,基本确定是这个引起的。当即决定删除这个物化视图。
18:54:45 SYS@dyckdb1(2194)> drop MATERIALIZED VIEWMACRO_DATA.T_ENTERPRISE_INFO;drop MATERIALIZED VIEWMACRO_DATA.T_ENTERPRISE_INFO*ERROR at line1:
ORA-04021: timeout occurred while waiting tolock object
Elapsed:00:00:49.05
--撤销授权操作,不然物化视图删不掉
18:57:45 SYS@dyckdb2(1269)> revoke select on dyck_pub.t_client_info fromMACRO_DATA;^Crevoke select on dyck_pub.t_client_info fromMACRO_DATA*ERROR at line1:
ORA-01013: user requested cancel of currentoperation
Elapsed:00:02:49.57
19:02:04 SYS@dyckdb2(1269)> /
Revokesucceeded.
Elapsed:00:00:09.58
19:02:46 SYS@dyckdb1(2194)> drop MATERIALIZED VIEWMACRO_DATA.T_ENTERPRISE_INFO;
Materializedviewdropped.
Elapsed:00:00:03.82
删除视图,问题解决。
这里提供另外一个查找产生指定时间段内产生redo最多的SQL做参考:
SELECT TO_CHAR(BEGIN_INTERVAL_TIME,'YYYY_MM_DD HH24') WHEN,
DBMS_LOB.SUBSTR(SQL_TEXT,4000,1) SQL,
DHSS.INSTANCE_NUMBER INST_ID,
DHSS.SQL_ID,
EXECUTIONS_DELTA EXEC_DELTA,
ROWS_PROCESSED_DELTA ROWS_PROC_DELTAFROMDBA_HIST_SQLSTAT DHSS,
DBA_HIST_SNAPSHOT DHS,
DBA_HIST_SQLTEXT DHSTWHERE UPPER(DHST.SQL_TEXT) LIKE '%T_CLIENT_INFO%'
AND LTRIM(UPPER(DHST.SQL_TEXT)) NOT LIKE 'SELECT%'
AND DHSS.SNAP_ID=DHS.SNAP_IDAND DHSS.INSTANCE_NUMBER=DHS.INSTANCE_NUMBERAND DHSS.SQL_ID=DHST.SQL_IDAND BEGIN_INTERVAL_TIME BETWEEN TO_DATE('-08-07 16:00','YYYY-MM-DD HH24:MI')AND TO_DATE('-08-07 17:00','YYYY-MM-DD HH24:MI')order by exec_delta;
View Code
通过上边的语句查找是可以发现update基表和insert物化视图的SQL语句的。
接下来可以通过模拟on commit对update影响有多大,顺便看看是否由于物化视图关系是否导致redo量激增。