A switched-over or failed-over standby is compromised by ORA-01555.
The database was migrated from 10.2.0.4.
Thousands of standby index pages (blocks) corrupted with “Page 424697 failed with check code 6056” as reported by dbv, and the corrupted page count kept increasing. No Active Data Guard is involved. The corruptions were first detected on the activated standby (on the mirror copy of the standby, actually) and confirmed by a dbv scan on the standby.
Opening the standby in read-only mode (no Active Data Guard involved) and querying the corrupted indexes throws ORA-01555 too.
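For reference, opening a non-Active-Data-Guard physical standby read-only looks roughly like this; the exact sequence depends on whether managed recovery is running:

```sql
-- Stop managed recovery first (if it is running)
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

-- Open the standby read-only; querying a corrupted index at this point
-- is what raised ORA-01555 for us
ALTER DATABASE OPEN READ ONLY;

-- Without an Active Data Guard license, restart the instance into mount
-- state before resuming redo apply:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
```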
So far only index blocks are corrupted.
No problems on the primary; the dbv report was clean there.
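A dbv scan of a single datafile looks like this; the file path is an example, and the blocksize must match the database block size:

```shell
# Verify one datafile (8K block size assumed here)
dbv file=/u01/oradata/STBY/users01.dbf blocksize=8192

# Corrupted index blocks show up in the output as, e.g.:
#   Page 424697 failed with check code 6056
```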
Set _ktb_debug_flags = 8 on both the primary and the standby to enable the fix (Metalink Doc ID 8895202.8).
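Being an underscore parameter, _ktb_debug_flags has to be double-quoted; a sketch of how to set it (whether SCOPE=BOTH works dynamically or an SPFILE-only change plus a restart is needed depends on your instance):

```sql
-- Run on both primary and standby; underscore parameters must be quoted
ALTER SYSTEM SET "_ktb_debug_flags" = 8 SCOPE=BOTH;

-- If the instance refuses a dynamic change, stage it in the spfile
-- and bounce the instance:
-- ALTER SYSTEM SET "_ktb_debug_flags" = 8 SCOPE=SPFILE;
```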
THE FOLLOWING DISCOVERIES TOOK TIME AND EFFORT TO MAKE, BETTER READ THEM NOW:
DISCOVERY #1: set compatible=11.2.0 to enable the fix. The Metalink document does not mention that.
DISCOVERY #2: set _ktb_debug_flags = 8 BEFORE querying a corrupted index on the standby database. Once the first ORA-01555 has appeared, the fix has no effect (the block cleanout finalizes the corruption?). The error will keep recurring, and the index will have to be rebuilt.
DISCOVERY #3: that alone does not stop the corruptions; the corrupted-block count keeps increasing. To stop it:
1. Rebuild the affected indexes on the primary database (the ONLINE option may be used; it worked for us).
2. After the rebuild, format the datafile blocks that are no longer part of dba_extents – this makes the dbv report readable again.
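The two steps above, sketched in SQL. The index and tablespace names are examples; the block-formatting step follows the well-known MOS technique of growing a throw-away table until it fills the free space, which forces Oracle to format the never-used blocks that dbv otherwise flags:

```sql
-- 1. Rebuild an affected index on the primary (ONLINE worked for us)
ALTER INDEX app_owner.heavy_used_ix REBUILD ONLINE;

-- 2. Format blocks outside dba_extents: create a dummy table in the
--    affected tablespace and grow it until the free space is exhausted
CREATE TABLE scratch_fill (c CHAR(2000)) PCTFREE 0 TABLESPACE users;

BEGIN
  LOOP
    INSERT INTO scratch_fill
      SELECT 'x' FROM dual CONNECT BY level <= 1000;
    COMMIT;
  END LOOP;
EXCEPTION
  WHEN OTHERS THEN
    NULL;  -- loop ends on ORA-01653 (unable to extend: tablespace full)
END;
/

DROP TABLE scratch_fill PURGE;
```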
Root cause is unknown. Our observations and version of events:
the primary was at compatible=10.2.0, waiting for a maintenance window to raise it to 11.2.0
the standby was set to compatible=11.2.0, as advised by Oracle Support after the corruptions were detected. The corruptions were on the standby and the count was increasing.
the indexes were rebuilt and the corruptions stopped occurring. No problems for a few weeks.
during the maintenance window:
primary was bounced with compatible=11.2.0
a switchover was performed.
the corruptions reappeared on the new standby and the count kept increasing.
Some corrupted indexes were the same as the first time, some were new ones.
Remember: no problems on the old standby (now primary) after the index rebuild on the previous primary (now the current standby).
But the index rebuild on the previous primary was done with compatible=10.2.0.
OK, let's try an index rebuild one more time: it helped. The corruptions were cleaned up and stopped occurring.
11g redo apply on the standby corrupts index blocks when 11g software works on database blocks migrated from 10g. The parameter compatible=10.2.0 is a prerequisite for the corruption to occur.
Rebuilding the indexes on the 11g primary does something that prevents the corruptions from being introduced on the standby and clears the existing corruptions there.
However: if the primary is at compatible=10.2.0, the rebuild still leaves some structural issue in the index. After a switchover, the redo apply process starts corrupting the previous standby's indexes. Note that no problems occur on the primary itself; it takes redo apply to introduce the corruption.
Rebuilding indexes on the new primary stopped the errors.
Not all indexes are affected. It looks like only rather heavily used indexes get corrupted. How heavily? For example, some of our indexes have INITRANS bumped from the default to 20 because of high ITL waits.
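A quick way to spot such candidates is the data dictionary, where INITRANS is exposed as the INI_TRANS column of DBA_INDEXES; the threshold of 20 mirrors our setting and is just an example:

```sql
-- Indexes whose INITRANS was raised from the default: a rough proxy
-- for the heavily used indexes we saw getting corrupted
SELECT owner, index_name, ini_trans
FROM   dba_indexes
WHERE  ini_trans >= 20
ORDER  BY ini_trans DESC;
```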
Why should this problem not occur during restore and recovery from a backup? I see no reason why it would not.
While compatible=10.2.0 is a tempting way to keep the door open for a downgrade in case the migration fails, it comes at a cost for Data Guard.