Subject: ORA-353 applying archived log during media recovery Type: PATCH Content Type: TEXT/PLAIN Creation Date: 13-FEB-1998 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ORA-353 applying archived log during media recovery ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Initial Broken version: at least 7.1.6 Initial version in which fix will appear: patch on 7.3.4 Release Notes: When using non-async write I/O together with asynchronous read I/O on some platforms, it was possible for archived redo logs to be corrupted. This caused no errors at runtime, but would result in ORA-354, ORA-353, and ORA-334 when applying one during media recovery, or when dumping an archived redo log. This has been fixed. External Description: Under certain circumstances on some platforms, it is possible for archived redo logs to be corrupted during their creation. This causes no errors at runtime, but if you try to apply such an archived redo log during media recovery, you get the errors such as the following: ORA-00354: corrupt redo log block header ORA-00353: log corruption near block 769 change 219835 time 09/05/97 04:57:02 ORA-00334: archived log: '/data/oracle/archive/prod/arch0000015341.arc' The exact circumstances most likely differ by platform, since the cause of the corruption depends on particulars of synchronous and asynchronous I/O, and how the OSD I/O services are used by archiving. Even though most platforms do not support independent configuration of asynchronous reads and writes, there are those for which asynchronous I/O is supported only for raw devices, and not for conventional filesystems. The bug may be present on these platforms. One circumstance that is known is under Solaris 2.x when ASYNC_WRITE = FALSE but ASYNC_READ = TRUE, and LOG_ARCHIVE_BUFFERS > 1 (the defaults are LOG_ARCHIVE_BUFFERS = 4, and both ASYNC_WRITE and ASYNC_READ = TRUE). The bug will _not_ occur under Solaris 2.x when async write I/O is turned on, because aiowrite() is implemented under Solaris 2.x through pwrite(), which is a positioned I/O. Not all platforms allow for independent configuration of async reads and writes, and even those that do may not hit this bug because of the OSD implementation. Impact: Media recovery fails, with attendant data loss. The only option is to use incomplete recovery (cancel-based, SCN-based, or time-based partial recovery). The most insidious thing about this bug is that customers have most likely had corrupted archived redo logs since they began using Oracle with the required configuration; and yet they don't discover that their archived logs are useless until they need them most, namely when they need to recover their database due to some other failure. Workaround: None, once the corruption has occurred. You can only do incomplete recovery, up to the point of the corruption. Under extreme circumstances, it may be possible to patch the recovery logs and progress slightly further (Russell Green has done this), but this is extremely hazardous and definitely not recommended. To prevent the corruption under Solaris 2.x, you can either set LOG_ARCHIVE_BUFFERS to one; or ensure either that ASYNC_WRITE is TRUE, or that ASYNC_READ is set FALSE whenever ASYNC_WRITE is set FALSE. One can proactively detect the corruption by dumping each archived redo log as it gets created with ALTER SYSTEM DUMP LOGFILE. This command uses the same code to detect this type of corruption as does media recovery. One can also compare the archived redo log with its corresponding online redo log, with two caveats. (1) You must do this immediately after the archived log gets created and completely copied, before continuing redo generation causes a log switch that overwrites the online redo log. (2) There are typically a few bytes in the first two blocks of the archived redo that may be different from the corresponding online redo log (the reason for this is unknown at this time). If you determine that you have a corrupted archived log, and the system has not yet overwritten the corresponding online log through a log switch, then you may be able to get a good archived log by archiving again manually with ALTER SYSTEM ARCHIVE LOG. Then again, maybe not. You may get corruption in a different spot, or none at all. If you already have corrupted archive logs, then the best course of action is to obtain a patch with this fix, and take a new full backup of the database in order to restore recoverability going forward. How to determine rediscovery: The symptoms are quite apparent. Look for ORA-354 during media recovery, or during ALTER SYSTEM DUMP LOGFILE. Note that in order to confirm that this bug is the cause of the ORA-354, it is necessary to thoroughly analyze a binary copy of the archived redo log. Other circumstances can also cause corruption to block headers (such as user error, or hardware or operating system defects). The first ub4 of each block is the log sequence number, and the second ub4 is the block number. This bug causes whole chunks of blocks to get written to the wrong location in the file, so that the logical block number recorded in the block header does not match the physical block number (as determined by the relative block location within the file, starting with block zero). Note that blocks zero and one are special, and do not contain redo, and they do not have normal redo block headers (block zero is the OSD file header, and block 1 is the generic Oracle file header). Note also that the block size for redo differs by platform. Under Solaris 2.x, it is 512 bytes. Under Digital Unix, it is 1024 bytes. Some platforms have 2048 bytes per block of redo. In the circumstance I have observed, when a chunk of blocks gets written to the wrong location, the physical location of the file (where that chunk was supposed to get written) will never get allocated to the file, and hence will be dumped as all zeros. This behavior is likely platform dependent. Under Solaris, one finds that the number of blocks allocated to the file (as determined by "du -s" or "ls -s" or the "stat()" system call) does not match the logical file size (as returned by "ls -l" or the "stat()" system call). Note that the size of the file in blocks returned by these commands includes indirect blocks, and hence even healthy files don't quite match up in size, but typically the size in blocks of a corrupted file comes out to something rather less than the corresponding logical size in bytes, because the bug causes the archiving to fail to write certain chunks of blocks, and the operating system does not allocate those chunks to the file. Internal Description: The main driver routine for archiving is kcrzar() in file kcrz.c. This routine opens the online redo logs for input and the archive log for output, then copies the input to the output. The routine also has a collection of buffers that it uses to maximize throughput by using asynchronous I/O requests, so that it can have multiple read and write requests outstanding at any one time, with one read or one write per buffer. Under Solaris 2.x these asynchronous requests are implemented with aioread() and aiowrite(). During a check for whether an asynch read is done, we have this stack trace: kcrzar => sksadn => sfdone => ssfdone => sfapol => aiowait During a synchronous write, we have this stack trace: kcrzar => sksawr => sfqio (= sfmqio) => ssfsync => sfwfb => write In routine kcrzar(), there is an array of structures of type kcrzio which is used to describe the I/O's for each of the buffers that kcrzar() uses to copy the online redo log to the archived copy. The number of these buffers is LOG_ARCHIVE_BUFFERS (default = 4 under Solaris 2.x), and their size is LOG_ARCHIVE_BUFFER_SIZE (default = 32KB under Solaris 2.x). The kcrzio structure contains a member called kcrziwiv, which is supposed to be the IOV for the output file. The IOV structure contains a member called sfiovoffset which is supposed to maintain the current position of the file, and which is the position at which a subsequent normal read() or write() will begin within the file. The bug is that each buffer has its own copy of the IOV, and so the sfiovoffset member is not being maintained properly across all kcrzio buffers. The initial IOV for the output file (the archived log) is created on the stack in routine kcrzar(), filled in by routine sksacre(), and then duplicated for each buffer with sksadup(). Routine kzrzar() passes the IOV for each buffer down to the lower layers to do the I/O. The sfwfb() routine, which is used to do synchronous writes, contains an optimization that avoids doing an llseek() (Solaris) when the file location requested is the same as the supposed current location in the file, as recorded by the sfiovoffset for the IOV that gets passed in. This optimization is missing on some platforms (e.g., Sequent), and the bug will not happen on these platforms (because we always seek to the correct offset in the file prior to writing). Under the circumstance that LOG_ARCHIVE_BUFFERS is more than one, and ASYNC_READ is true, and ASYNC_WRITE is false, it is possible for the reads to finish and their buffers to be written in an almost arbitrary order. This means that we may have to seek backwards and forwards in the output file in order to write these completed read requests to the write location in the output file. Thus it is possible for the current offset for one of the given output IOV's to be incorrect. It _was_ correct after the last write() for that buffer, but after that an earlier (or later) aioread() for another buffer may complete and we have to llseek() backwards (or forwards) in the file so that we can write() it to its correct position. If by coincidence the previous write() for this buffer took us to the position where the current write() is supposed to go, but an intervening aioread() completion caused us to seek in the file to a different spot, then sfwfb() would think that the sfiovoffset for our buffer was correct for the next write(), but the actual offset within the file (as known to the operating system) would be different. The optimization in the sfwfb() OSD routine causes us to skip the llseek() that would have taken us to the correct position in the file, and we write the buffer to the wrong position in the file, corrupting the archived redo log and losing redo for later media recovery. If LOG_ARCHIVE_BUFFERS is only one, then each read must complete before we allow the write, and the buffers go out to the output file in the correct order (sequential order). We never do an llseek() except at the very beginning. If ASYNC_WRITE = TRUE, then under Solaris 2.x there is no problem, because aiowrite() is implemented through lightweight processes and a pwrite() call, which is a positioned write that unconditionally takes us to the correct offset within the output file. This is probably also true on any platform that provides async I/O, since the concept of the "current" offset in the file doesn't make any sense in the context of asynchronous I/O requests. If ASYNC_WRITE = FALSE but so is ASYNC_READ = FALSE, then again there is no problem because each read must finish before we allow the write, and again we write the buffers in a sequential order that can cause no corruption. In this case, we effectively have LOG_ARCHIVE_BUFFERS = 1, because the logic in kcrzar() causes only a single buffer to get used. The bug does not affect the reading action of kcrzar(), even though the input files (online logs) can have multiple copies of the IOV, because the reading is always done in a forward manner, so we always have to seek forward in the input file(s) before issuing the read request. By way of contrast, the writing can take place in almost any order when reading is asynchronous, because a given read request may not complete until after a later one has completed and already been written to the file. Resolution: Before arriving at the final bug fix, we evaluated four completely different fixes to the bug. The ultimate root cause is in the code that handles the sfiov structure, because most of the time it is treated like a structure that describes an I/O request, but the sfiovoffset member makes it look like a structure that is used to describe an open file (i.e., a channel). However, fixing this the Right Way in any Unix OSD would be quite complicated, possibly unsafe, and in any case inappropriate for a bug fix which is never going to appear as anything but a patch on the terminal release of 7.3.4, or a backport to an earlier version. I will now describe the four bug fixes. (1) The Right Thing. This solution requires creating a new SF structure to describe a channel, and modifying sf.h and the routines that handle IOV's and I/O requests to use this structure in the correct manner. However, this would require rather significant changes to sf.h, sf.c, and ssf.c, and possibly routines that call them (which would be a lot). By itself, the changes to sf.h alone would require recompilation of 46 modules (based on cscope results of who includes sf.h). We rejected this solution based on its complexity, possible lack of safety due to the complexity, and the difficulty of repeating the changes across many Unix ports. It is inappropriate for a patch-based bug fix. (2) The Easy Way Out. This solution removes the optimization that avoids the call to llseek() before read() and write() calls in sfrfb() and sfwfb() (module ssf.c). In other words, do a seek before every I/O request. Note that at least one platform already does things this way (Sequent). This fix affects the synchronous case only (since async requests already take the byte offset into the file as one of their arguments), but it could have a performance impact because of the extra system call for every read() and write(). In addition to removing the optimization, we would also remove all references to the sfiovoffset member of the sfiov structure. We rejected this solution based on its performance cost, although it has a certain appeal because of its simplicity and safety. (3) The Intermediate Layer Hack. This solution makes the fix in sksa.h and sksa.c, which is an intermediate OSD layer between the archiver driver in kcrzar() and the low-level I/O layer in sf.c. An existing comment in sksa.h, where we implement sksadup() as a macro that copies one IOV to another, indicates that someone recognized that this problem might exist, and that there would be a need to "link the new IOV to the master". This fix does exactly that. The IOV that gets filled in by sksacre() (to create and open the new archived redo log for output) becomes the master IOV, and we link the copies made by sksadup() back to this master. We consider the sfiovoffset in the master as the correct and up-to-date version. When we need the offset, we copy it from the master to the current IOV prior to passing that IOV to sfqio() (this happens in sksawr() and sksard()). When an I/O request completes in sksadn(), we copy the offset that the SF layer filled in from the current IOV back to the master copy. For symmetry, we do the same thing for reading and writing, creating a master IOV in sksaopn() (where we open the online redo logs for input) and pointing the master back to itself. This is the fix that we ultimately chose as the one that fixes the problem closest to the source of the problem (in the SF layer), but which is sufficiently simple and avoids any additional performance penalty. (4) The Generic Code Hack. This fix is admittedly a hack, because it includes a platform-specific section in the generic code. The reason for this has to do with the confusing semantics of the sfiov structure. On some platforms (such as VMS), there must be a copy of the IOV for each I/O request, because the IOV contains status information specific to each request. On other platforms (such as Unix), there can be only one copy of the IOV structure, because the sfiovoffset member must be kept up to date accurately across all buffers, and so they must share a single copy. On yet other platforms (such as MVS), the IOV gets copied, and then it must later be "released" (routine sksarls) in order to copy certain status information (e.g., I/O statistics) back to the "master" (i.e., the original of the IOV from which the copies were made). Hence the bug fix leaves the original code in place, protected by an #ifdef for the macro symbol SKSA_COPY_ARCH_IOV. At the beginning of the module, we #define this symbol if we are compiling under VMS. Each of these sections has a #else section with the new, bug fix code, which is as follows. We change the definition of the kcrzio structure so that the members called kcrziriv and kcrziwiv are now pointers to an sksaiov, instead of an sksaiov, which is a buffer that is large enough to hold an sfiov. All kcrzio buffers now share the same IOV for the output file (the one created on the stack and filled in by sksacre() when we create the target archived redo log). They also share one IOV per member of the online redo log group. Ultimately, each synchronous I/O request will update the single sfiovoffset to the correct position within the corresponding IOV after a read() or write() has completed. This fix is certainly simpler, safer, and more expedient than the Right Thing, especially in view of the potential seriousness of the bug and the short projected life of 7.3.4. In addition, there is precedent for this type of fix. The kcbl.c module, which implements parallel direct operations (?) such as SQL*Loader and CTAS, faces the same issues with the OSD as does kcrz.c, in that it has multiple buffers from which it issues write requests to the output. It calls routine kcflbi() to do the writes, which contains a VMS-specific section. However, we rejected it based on its requirement of putting port-specific code into the generic code. Testing Done: Bug-specific test, short regress, tkamsuit. Olint has been run. There were several errors in system headers: stdlib.h string.h limits.h sys/time.h time.h stdio.h errno.h unistd.h signal.h grp.h pwd.h There were also many, many existing warnings and errors, mostly relating to argument type conflicts and possible accuracy loss. This bug fix introduced no new Olint errors. Files changed: sksa.h and sksa.c in /vobs/rdbms/src/server/osds/solaris Procedures changed (pertinent): sksard(), sksawr(), sksadn(), sksacre(), sksaopn(). Integration: The fix has been made off the 7.3.4.0.1 release code line. There is no need to merge forward to Oracle 8 because at this point I do not believe that this bug exists there. The kcrzar() routine has been completely rewritten and uses an entirely different OSD interface. NDE Transaction ID: transaction: mjaeger_bug-554516 branch: /main/st_rdbms_7.3_CDM/st_rdbms_mjaeger_bug-554516 Files in /vobs/rdbms/src/server/osds/solaris/: sksa.h@@/main/st_rdbms_7.3_CDM/st_rdbms_mjaeger_bug-554516/1 sksa.c@@/main/st_rdbms_7.3_CDM/st_rdbms_mjaeger_bug-554516/1 Reviewed by: Sumanta Chatterjee (sksa file owner) Herbert van den Bergh (DDR management) Greg Pongracz (recovery development, file owner backup) Andy Hering (DDR recovery) Can this be backported to older releases? (if not, justification) Yes, backporting should be fairly straightforward. --- Defect Analysis ---- History: It appears that this bug has always been there in Oracle 7, because the duplication of the IOV in kcrzar() with sksadup() exists in the 7.1.6 integration code. It is a wonder to me that this has not been a constant problem, although I suppose the combination of async reads and synchronous writes is somewhat unusual. Also, customers only rarely need their archived logs, and most likely rarely if ever verify their validity before they are needed. In all likelihood, any customer with the required configuration has had many corrupted archived redo logs, and never knew it. There are quite a number of bug reports with ORA-354 errors. Bugs 508224, 561281, and 436707 are definitely duplicates. Bugs 393341, 295369, and 244662 don't have enough evidence to tell what the cause was. Bugs 413640 and 398885 were due to user error (moving the archived logs to tape before ARCH was done with its copying). Bugs 452182 and 461458 were corruptions in online redo logs. Sub-component: RECOVERY Suggested Contact: Mark Jaeger (mjaeger.us), Andy Hering (ahering.us), Bill Bridge (wbridge.us), Herbert van den Bergh (hbergh.us), Sumanta Chatterjee (schatter.us). Defect Type: Interface (to OSD). Coding. Developer time to fix defect: 13 days: 4 days to diagnose and determine the problem and find the fix. 5 or more days analyzing previous bug reports to determine the nature of the corruption, which led us to this point. 4 days to decide on the correct fix, perform the fix, test it, and do the necessary regression testing prior to checkin. Additional action taken: None. Lessons learned for bug prevention: Don't make such complicated OSD interfaces. The reason for this bug was the ambiguity of the semantics of the IOV structure. On some platforms (e.g., VMS) it contains per-I/O-request information, and so there must be one IOV per outstanding I/O request; on others it contains per-file information (e.g., Solaris), and so the same data must be shared among multiple I/O requests (namely the "current" offset into the file). This is obviously very problematic for any kind of parallel operation, where there can be multiple I/O requests outstanding for a single file. Fortunately, this interface has been completely rewritten for Oracle 8. Unfortunately, the new interface is also quite complicated. This was a very subtle bug. I don't know how well known are the semantics of the IOV structure and its use within the OSD, which would have allowed one to predict this scenario. In general, it is always hazardous to have information that is maintained in more than one place, because the danger always exists that they will get out of sync. In this case, we tried to remember the position in the file for the next write(), in order to avoid having to do an llseek() each time. This underscores the importance of having only a single copy of the structure that describes the file and its current position. Lessons learned for bug detection: Don't forget to use truss when you suspect a problem with the operating system, in order to find out what we are really asking it to do. Again, a reproducible test case was indispensable. In this case, there was a customer running on Solaris who was able to create corrupted logs almost at will (about half the time).