Subject: Bug:350174 Checking the trace file & follow on steps Creation Date: 13-MAR-1998 Dictionary cluster C_TS# can become corrupt ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This describes the prefered method of dealing with this problem. The bug is alerted in: [NOTE:39694.1] Logging a call (UK only): [NOTE:42969.1] Customer instructions to apply fix [NOTE:43999.1] 7.3.3.6 fixup code INTERNAL ALERT [NOTE:60908.1] TCPATCH fix locations [NOTE:41839.1] (may be out of date) Checking the trace file ~~~~~~~~~~~~~~~~~~~~~~~ On receipt of a trace file from a failed ANALYZE C_TS# we need to confirm if this is a stage 1 corruption, stage 2+ corruption OR a completely different issue. ORA 60 Deadlock Graph? ~~~~~~~~~~~~~~~~~~~~~~ If the ANALYZE produces an ORA-60 deadlock graph it is likely that the dictionary has statistics present. Delete the statistics and repeat the analyze c_ts# validate structure cascade command. "kdcchk: chain pointer XX points to key row with no references but FFK bit not set!" ~~~~~~~~~~~~~~~~~~ If you see this error in the ANALYZE trace file it implies there is a MINOR corruption on the index at the location indicated. This corruption does NOT need patching and does not have any adverse side effects save for wasting some space. HOWEVER, if you do encounter this error the cluster ANALYZE may have terminated prematurely preventing proper checking of the cluster. See [BUG:422050] which allows cluster checking to continue with this corruption present. ** The above warning is included to help stop mis-diagnosis of 350174. Stage 1 corruption: ~~~~~~~~~~~~~~~~~~~ Stage 1 corruption should produce a trace-file of the form: *** SESSION ID:(6.3) 1996.08.07.12.17.23.000 kdcchk: error when looking at index with key: col 0; len 2; (2): c1 04 kdcchk: index points to block 0x04000429 slot 0x0 chain length is 2 kdcchk: chain count wrong 0x04000429.0 chain is 3 index says 2 last entry 0x04000079.0 blockcount = 3 kdavls: kdcchk returns 3 when checking cluster dba 0x04000075 obj 5 kdavls: index dba 0x0400007a obj 6 # of cols in key 1 Max size cluster 32767 The important line is: "chain count wrong 0x04000429.0 chain is 3 index says 2" If the chain count is ONE higher than what the index says this is a first level corruption. THERE MAY BE FURTHER CORRUPTIONS on the cluster but you cannot tell this at this stage. Other information in the trace file: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If there is any other form of output in the tracefile this is either a severe corruption or a different problem. In either case the best advice is to (a) apply the patch then (b) rebuild the database using export/import as soon as possible. Correcting a Stage 1 corruption: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Do *NOT* tell the customer we can patch the situation as we may fix the stage 1 problem to reveal a more serious corruption. Do *NOT* just send these instructions to the customer - send [NOTE:43999.1] **WARNING** The notes here apply to the NEW 350174 patches. The first set of patches do not include fixup code which is described in [BUG:418194]. The 'fixup' patches have ONLY being produced for terminal releases (7.1.6, 7.2.3). **ADDITIONAL WARNING** The NEW 350174 patch is NOT included with either 7.3.2.3 NOR 7.3.3 releases. Although the bug itself is not present in these releases customers who upgrade may inherit a corruption, in which case no patch is be available. The NEW 350174 corrective patch is NOT included in 7.3.4.1/2 even though the PSR README indicates it it. It IS included in 7.3.4.3 and has been briefly tested on Solaris. **EVEN MORE WARNINGS** PSR 7.3.3.6 includes a BAD version of the corrective FIX. ** DO NOT USE EVENT:10668 on 7.3.3.6 ** See [NOTE:60908.1] 1) Make sure the customer has the NEW fix for [BUG:350174]. On Unix do the following: strings -a kdc.o | grep 'correcting count in keydata to' If this does NOT return anything then the patch is the OLD version. It should return: "kdcchk: correcting count in keydata to: %d" Note: Some versions of 'strings' does not support the '-a' flag. Look at the man page for 'strings' for any option that scans the entire file for ASCII strings. For a list of patches: Sanitised list of 350174 fixes: [NOTE:41839.1] Raw Bugmatrix for 350174: 2) The customer should shutdown the database cleanly and take a FULL COLD BACKUP. Shutdown ABORT should NOT be used. 3) Install the patch. See any port specific README on the patch for instructions. 4) Start the database in restricted mode and connect internal. 5) To enable the index block fixing code, set at the session level: e.g. ALTER SESSION SET EVENTS '10668 trace name context forever'; 6) Analyze on the c_ts# cluster: e.g. ANALYZE CLUSTER c_ts# VALIDATE STRUCTURE CASCADE; 7) The trace file generated should indicate that a corruption has been detected and is being fixed. Repeat the ANALYZE command - it should return no errors. If ANALYZE still reports an error then there may be several corruptions: If so run ANALYZE several times. Each time ANALYZE report an error the trace should indicate the 'correcting count' message above. If ANALYZE errors but there is no 'correcting count' in the trace file then either the patch is not installed correctly or there is different problem on the cluster (not a stage 1 corruption). At most you should need to run ANALYZE once for each tablespace on the database (with set). 8) Shutdown normal the database. 9) Disable the event if it was set in init.ora rather than at session level. 10) Startup restrict and connect internal. 11) Repeat the command: ANALYZE CLUSTER c_ts# VALIDATE STRUCTURE CASCADE; This should return no errors. If there are any errors get the trace file for analysis. 12) Shutdown normal and take a FULL COLD BACKUP. It is not advisable to ever use recovery across this operation. Evidence to gather if a stage 2 corruption is suspected ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. Identify any differences between a cluster scan and a cluster index lookup of SYS.TS$: spool ts_index.out select rowid, ts#, name, online$ from sys.ts$ where ts#>=0 order by ts#; spool off spool ts_fts.out select rowid, ts#, name, online$ from sys.ts$ where ts#+0>=0 order by ts#; spool off 2. Attempt to offline and re-online ALL valid non-SYSTEM tablespaces. The following script will output a series of statements that can be used to offline/online all online tablespaces: spool offline_tbs.sql set serveroutput on declare cursor c_tbsname is select tablespace_name from dba_tablespaces where tablespace_name != 'SYSTEM' and status = 'ONLINE'; begin for v_tbsname in c_tbsname loop dbms_output.put_line ('alter tablespace '||v_tbsname.tablespace_name||' offline;'); dbms_output.put_line ('alter tablespace '||v_tbsname.tablespace_name||' online'); end loop; end; / spool off The offline_tbs.sql spool file can now be edited and run to offline/online the tablespaces. Any errors should be noted (error codes and tablespace the error occured on). Correcting Stage 2 corruption ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ** For experienced analysts only. A rebuild is ALWAYS ALWAYS preferred ** ** This should never be repesented as an option to a customer ** From: Andy McGugan and Russell Green This is the documentation for patching the Orphan datablock scenario which I have pulled together after the patching of the Fingerhut DB by Russ and myself (mainly Russ tho'!). I hope it makes sense. Bug 350174 - Patching the orphan datablock scenario =================================================== This scenario is the one that requires the most effort to patch. In this situation the index entry for a datablock has been deleted because (as the cluster chain count is out by 1) the cluster chain count has reached 0, despite there still being a datablock with valid data in it. We are able to patch this because we are expecting to find the deleted index row still in the index block but marked as deleted. * NOTE - ALL PATCHING SHOULD BE VERIFIED ON A COPY OF THE SYSTEM TABLESPACE BEFORE ATTEMPTING TO PATCH A PRODUCTION DATABASE Structure of an Index Leaf block ================================ (From kdx.h) Index blocks are the basic building blocks of B-tree indexes. They store a set of rows just as do ordinary data blocks. Index blocks come in two flavors: branch and leaf. Branch block rows hold pairs used to guide the B-tree search to a row in a leaf block. Leaf block rows hold the pairs stored by the B-tree. An index block is layed-out as follows: The header contains information about the contents of the block such as its level in the b-tree and the number of rows in the block. The branch block header is described by type kdxbr and the leaf block header is described by type kdxle. Let N be the number of rows on an index block. Then each row on the block is identified by an integer in the range 0...N-1 called a slot number (kd_sno). The slot number is used as an index into the block's row index. The row index is an array of N byte offsets (kd_off) to the rows in the block. The byte offsets are relative to the start of the index block header. The key of the row at slot i is less than the key of the row at slot j iff i < j. In this situation we will be patching leaf blocks and therefore any further discussions relate only to index leaf blocks. The layout of an index leaf block is as follows: +-------------------------------+ | Block Header | +-------------------------------+ | Common Index block header | +-------------------------------+ | Leaf block header | +-------------------------------+ | Row Index | +-------------------------------+ | | | Free Space | | | | | | | +-------------------------------+ | Row Data | | | | | | +-------+ | |incseq | +-----------------------+-------+ Index block header common to all index blocks. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ub1 kdxcolev; /* block level; 0 is leaf block ub1 kdxcolok; /* itl of service tx holding block lock */ ub1 kdxcoopc; /* block lock op code */ ub1 kdxconco; /* # columns in key; */ kdxsdc (ub4) kdxcosdc; /* Split or Deleted count */ kd_sno (sb2) kdxconro; /* # rows in the row index */ kd_off (b2) kdxcofbo; /* free space beginning offset */ kd_off (b2) kdxcofeo; /* free space ending offset */ kd_off (b2) kdxcoavs; /* available space (committed)*/ Leaf block header ~~~~~~~~~~~~~~~~~ kd_off (b2) kdxlespl; /* space held by unlocked split holes; */ /* space includes row index entries. */ kd_sno (sb2) kdxlende; /* # holes (rows marked deleted) on block */ /* includes unlocked split holes */ kdba (ub4) kdxlenxt; /* next leaf block */ kdba (ub4) kdxleprv; /* previous leaf block */ ub1 kdxledsz; /* # bytes in KEYDATA */ ub1 kdxlecol; /* # non-key columns (for index-only tables) */ Row Index ~~~~~~~~~ The row index is an array of offsets (kd_off) to the rows in the block. The byte offsets are relative to the start of the index block header. Leaf Block Row Format. ~~~~~~~~~~~~~~~~~~~~~~ A leaf block row is structured as follows: where flag holds row flags, lock holds an itl lock entry, data is a fixed-size data area (KEYDATA), and key is a key stored in linear format. The number of bytes in the KEYDATA is part of the index definition and is also stored in the block header (kdxledsz). If the row has been marked deleted the flag byte will be 0x01. Steps to patching the Orphan Block scenario =========================================== 1. Identify the tablespace number whose index entry has been deleted. 2. Dump the index block using OS utility (e.g. DD on unix) and Oracle Block dump. e.g dd bs=2048 if=system1.f skip=123 count=1 | od -x >rawblockdump alter session set events 'immediate trace name blockdump level 67108987'; Note that this will enable you to identify the areas of the block that require patching and whether the platform in question uses byte swapping. If byte swapping is in use then extra caution is required to ensure that the correct fields are patched in a correct manner. 3. Identify that the index row entry for the missing tablespace still exists in the block and has been marked deleted. (use the DD output as the blockdump will not show it). If it does not then the patching required is out of the scope of this document. 4. Locate the orphan data block for this tablespace using the cluster key for confirmation. Dump this block using Oracle block dump. Note that the rowid in the deleted index row may contain the wrong rowid for the data block (we will need to patch this later) and should not be used for locating the orphan data block. SELECT rowid FROM ts$ WHERE ts#+0 = ; 5. In the orphan datablock, patch the PK and NK fields in the cluster key changing the DBA to point to datablock itself. 6. In the index block, patch the flag byte of the index row to undelete the row. i.e. change the flag byte from 0x01 to 0x00. 7. In the index block, patch the index row entry data field to change the DBA in it to point to the orphan data block. 8. In the index block, patch the row count (kdxconro) to increment it by 1. 9. In the index block, patch the free space beginning offset (kdxconfbo) to allow for the extra entry inserted into the row index (this value should increase by 2). 10. In the index block, patch the row index to insert the offset of the deleted row in the appropriate position. This will require calculating the correct offset, locating the correct position in the row index, inserting it, and then shuffling down by 2 bytes the subsequent row index offsets. Note: The index entries are accessed by the order they appear in the row index. Therefore the new entry must go in the correct position depending on the key value. For example, the entry for TS 50 must go between the entry for TS 49 and TS 51. 11. In the index block, patch the available space in the block (kdxcoavs) reducing it to allow for the row we are undeleting. The correct value for this can be simply obtained by starting the DB and analayzing the the C_TS# cluster which should tell you what the correct value for this field should be. e.g. output from analyze: **** actual free space = 1742 < kdxcoavs = 1757 ---- end index block validation ---- for block 67108987 = 0x400007b 12. Start the database and analyze the C_TS# cluster. There should be no errors!. 13. Examine TS$ (not using the I_TS# index) to check the missing tablespace is now back again. Disclaimer: the author takes no resposibility for failure to patch databases or any subsequent problems that may arise from following this procedure!