Subject: Interpreting checkpoints errors Creation Date: 26-JAN-2000 By C. BOURE Oracle Support Services INTERPRETING CHECKPOINT NOT COMPLETE AND CANNOT ALLOCATE NEW LOG MESSAGES IN ALERT.LOG I. Introduction This document explains the mechanism of checkpoints in Oracle7 and Oracle8, 8i and the performance problems tied to their bad init.ora parameter settings. II. What is a Checkpoint. A checkpoint is an Oracle event which synchronizes the content of the current redo log file with the database buffer cache. It offers Oracle the mean for ensuring the consistency of data modified by transactions. The mechanism of writing modified blocks on disk in Oracle is is not synchronized with the commit of the corresponding transactions. Oracle writes the dirty buffers to disk only on certain conditions : - a shadow process must scan more than + of the db_block_buffer parameter. - Every three seconds. - When a checkpoint is produced. The checkpoint must ensure that all the modified buffers in the cache are really written to the corresponding datafiles. It avoids the loss of data which may occur with a crash (instance or disk failure). A checkpoint is realized on four types of events : - At each switch of redo log files. - When the delay for LOG_CHECKPOINT_TIMEOUT is reached. - When the size in bytes corresponding to (LOG_CHECKPOINT_INTERVAL* size of IO OS blocks) is written on the current redo log file. - Directly by the ALTER SYSTEM SWITCH LOGFILE command. - Directly with the ALTER SYSTEM CHECKPOINT command. A checkpoint event may be broken up in two specific tasks: - The DBWR writes the modified buffers in the cache on Oracle datafiles. - The LGWR updates the header of the datafiles and the controlfile. When LGWR realizes this task, he cannot do its normal work, which is to write transactions on redo log files. A specific background process CKPT, may be used to free LGWR from this task. It is started by the CHECKPOINT_PROCESS parameter set at TRUE value in the init.ora file. In 7.3.3 and 7.3.4, it is automatically launched when certain conditions are met : - large value of DB_FILES (50 or greater). - large value of DB_BLOCK_BUFFER (10000 or higher). It is automatically started, on the Oracle8 version, III Tuning Dilemma. The mechanism of checkpoints presents a dilemma for the Oracle DBA, who must find a trade-off point between the global performance of the instance and speed of recovery. The duration of recovery is directly tied to the frequency of checkpoints. The more checkpoints you put, the shorter the recovery is. The recovery is dependant of the redo log writes size since the last checkpoints. When an instance/crash recovery on a database is required, only the transactions written since the last checkpoints must be applied. Depending on the context, you have to choose between security with frequent checkpoints and short recovery time, or global performance of the instance with less frequent checkpoints. The usually performance target performance is choosen, because it represents the main activity on an Oracle production database. IV Checkpoints Errors. As we have seen it previously, the checkpoint activity is realized on four distinct events, and particularly, when the current redo log file is full. When this event is up, Oracle starts the checkpoint, and increment the log sequence. The next log file usable will become the current log file. The following is an example of log switches from the alert log: Fri May 16 17:15:43 1997 Thread 1 advanced to log sequence 1272 Current log# 3 seq# 1272 mem# 0: /prod1/oradata/logs/redologs03.log Thread 1 advanced to log sequence 1273 Current log# 1 seq# 1273 mem# 0: /prod1/oradata/logs/redologs01.log Fri May 16 17:17:25 1997 Thread 1 advanced to log sequence 1274 Current log# 2 seq# 1274 mem# 0: /prod1/oradata/logs/redologs02.log Thread 1 advanced to log sequence 1275 Current log# 3 seq# 1275 mem# 0: /prod1/oradata/logs/redologs03.log Fri May 16 17:20:51 1997 Thread 1 advanced to log sequence 1276 Current log# 1 seq# 1276 mem# 0: /prod1/oradata/logs/redologs01.log Redo log files are used in a circular manner. Depending on the number of redo log groups, Oracle will use successively the different redo log files contain in each group. When it has used the last group, it returns to the first one which must be free. Sometimes, you can see in your alert.log file, the corresponding messages : Thread 1 advanced to log sequence 248 Current log# 2 seq# 248 mem# 0: /prod1/oradata/logs/redologs02.log Thread 1 cannot allocate new log, sequence 249 Checkpoint not complete This message indicates that Oracle want reuse a redo log file, but the corresponding checkpoint associated isn-t terminated. In this case, Oracle must wait until the checkpoint is completely realized. Oracle must be sure that all the modified data associated with the current log file are really written on the disk. If it is not the case, Oracle cannot ensure the consistency of the database. This situation may be encountered particularly when the transactional activity is important. In this case the next condition is verified : REDO_LOG WRITE TIME < CHECKPOINT WRITE TIME In this case, a new checkpoint begins before the end of each previous checkpoint. In this case, the previous checkpoint is linked to the next, and so on. If the transactional activity is constant, this mechanism will occur at each log file switch. The alert log will reflect this situation by displaying the -checkpoint not complete- error message. If we consider having N redo log files, when the N log files are full, the database hangs until the (1,N-1) will be ends. At this condition, the first redo log file will then be available and the checkpoint N may be started. This situation may be also checked by tracing two statistics in BSTAT/ESTAT report.txt file. The two statistics are : - Background checkpoint started. - Background checkpoint completed. This two statistics must be not different by more than once. If it is not the case, your database will be hanging on checkpoints. LGWR will be unable to continue writing the next transactions until the checkpoints is complete. Two reasons may explain this difference : - A frequency of checkpoints which is too high. - A DBWR which writes too slowly. The frequency of checkpoints is regulated by ; - LOG_CHECKPOINT_INTERVAL which is the number of redo log blocks written in the redo log current file before the next checkpoint. - LOG_CHECKPOINT_TIMEOUT which is the duration between two succesive checkpoints. - The size of each redo log file. To reduce the frequency of checkpoints, we must set : - LOG_CHECKPOINT_INTERVAL = Number of OS blocks identical to the redo log file size. It will start a checkpoint only on each switch log file. - LOG_CHECKPOINT_TIMEOUT = 0. It will disable the checkpointing based on time interval. After those two parameters are set correctly, only the switch of redo log will start a new checkpoint. A correct frequency is normally between 15 and 30 minutes. If your frequency is below this level, increase the size of your redo logs until you can obtain this value. When the -Cannot allocate new log- error message isn-t present constantly, we can also add additional redo files which will increase the global time before reusing the same redo file. when the error becomes constant, it will be necessary to increase also the file size of each redo log. Another means of solving this error will be for DBWR to write quickly the dirty buffers on disk. Two parameters are linked in this task : - _DB_BLOCK_WRITE_BATCH - DB_BLOCK_CHECKPOINT_BATCH. _DB_BLOCK_WRITE_BATCH is a hidden parameter which defines the maximum number of blocks which must be written in one batch operation. Depending on the Oracle version, it will be initialized differently by default. DB_BLOCK_CHECKPOINT_BATCH specifies the number of blocks which are dedicated inside the batch size for writing checkpoints. When you want accelerate the checkpoints, it will be necessary to increase this value. The upper boundary will be _DB_BLOCK_WRITE_BATCH. This value must allow to write the current checkpoint before the next checkpoint begins.