11g ora-1555 on activated standby after migration from 10g

A switched-over or failed-over standby is compromised by ORA-1555.
Version 11.2.0.4
Migrated from 10.2.0.4

Thousands of standby index pages (blocks) are corrupted with "Page 424697 failed with check code 6056" as reported by dbv, and the corrupted page count keeps increasing. No Active Data Guard is involved. The corruptions were first detected on the activated standby (on the mirror copy of the standby, actually) and confirmed by a dbv scan of the standby.
Opening the standby in read-only mode (no Active Data Guard involved) and querying the corrupted indexes throws ORA-1555 too.

So far only index blocks are corrupted.

No problems on the primary; the dbv report was clean there.
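The dbv scans were along these lines (datafile path, block size and log file name below are placeholders, not our actual values):

```shell
# DBVERIFY scan of one standby datafile; repeat per datafile.
# blocksize must match the tablespace block size.
dbv file=/u01/oradata/stdby/users01.dbf blocksize=8192 logfile=users01.dbv.log
```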

Solution:

Set _ktb_debug_flags = 8 on both primary and standby to enable the fix. Metalink Doc ID: 8895202.8
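A minimal sketch of setting it, assuming an spfile on both sites (hidden parameters must be double-quoted in ALTER SYSTEM):

```sql
-- Run on both primary and standby.
-- SCOPE=BOTH assumes the parameter is dynamically modifiable in your
-- version; if it is not, use SCOPE=SPFILE and bounce the instance.
ALTER SYSTEM SET "_ktb_debug_flags" = 8 SCOPE = BOTH;
```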

THE FOLLOWING DISCOVERIES TOOK TIME AND EFFORT TO MAKE, BETTER READ THEM NOW:

DISCOVERY #1: set compatible = 11.2.0 to enable the fix. The Metalink document does not mention that.
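For reference, compatible is not dynamically modifiable, so raising it takes an spfile change and a restart (and remember it cannot be lowered afterwards):

```sql
-- Static parameter: SPFILE scope only, then bounce the instance.
ALTER SYSTEM SET COMPATIBLE = '11.2.0' SCOPE = SPFILE;
```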

DISCOVERY #2: set _ktb_debug_flags = 8 BEFORE querying a corrupted index on the standby. After the first ORA-1555 pops out, the fix has no effect (the block cleanout finalizes the corruption?). The error will keep popping out, and the index will have to be rebuilt.

DISCOVERY #3: that alone does not stop the corruptions; the corrupted block count keeps increasing. To stop it:

1. Rebuild the affected indexes on the primary database (the ONLINE option worked for us).

2. Format the datafile blocks that are no longer part of dba_extents after the index rebuild; this makes the dbv report readable again.
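A sketch of the two steps. Index, owner and tablespace names are placeholders; the block-formatting step follows the commonly referenced MOS approach of filling the free space in the affected tablespace with a scratch table so that formerly-used blocks get rewritten, then dropping it:

```sql
-- 1. On the primary: rebuild each affected index.
ALTER INDEX app_owner.ix_orders_heavy REBUILD ONLINE;

-- 2. Format the free blocks the rebuild left behind so dbv stops
--    flagging them: create a scratch table in the same tablespace
--    and grow it until the free space is consumed, then drop it.
CREATE TABLE app_owner.scratch_fmt (c CHAR(2000)) TABLESPACE app_idx;
-- repeat (or loop) until the tablespace free space is used up
INSERT INTO app_owner.scratch_fmt
  SELECT 'x' FROM dual CONNECT BY level <= 100000;
COMMIT;
DROP TABLE app_owner.scratch_fmt PURGE;
```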

Root cause:

The root cause is unknown. Our observations and version of events:

the primary was at compatible = 10.2.0, waiting for a maintenance window to raise it to 11.2.0

the standby was set to compatible = 11.2.0, as advised by Oracle Support after the corruptions were detected. The corruptions are on the standby and the count is increasing.

the indexes were rebuilt and the corruptions stopped occurring. No problems for a few weeks.

during the maintenance window:

1. the primary was bounced with compatible = 11.2.0
2. a switchover was performed
3. the corruptions reappeared on the new standby, with the count increasing
4. some of the corrupted indexes were the same as the first time, some were new ones

Remember: there were no problems on the old standby (now the primary) after the index rebuild on the previous primary (now the current standby).

But the index rebuild on the previous primary was done with compatible = 10.2.0.

OK, let's try the index rebuild one more time: it helped. The corruptions were cleaned up and stopped occurring.

Our take:

11g redo apply on the standby corrupts index blocks if the 11g software works on 10g-migrated database blocks. The parameter compatible = 10.2.0 is a prerequisite for the corruption to occur.
Rebuilding the indexes on the 11g primary does something that prevents the corruptions from being introduced on the standby and clears the existing corruptions there.
However: if the primary is at compatible = 10.2.0, the rebuild still leaves some structural issue in the index. After the switchover, the redo apply process starts corrupting the previous standby's indexes. Note that no problems occur on the primary itself; it takes redo apply to introduce the corruption.
Rebuilding the indexes on the new primary stopped the errors.
Not all indexes are affected. It looks like only rather heavily used indexes get corrupted. How heavily? For example, some of our indexes have INITRANS bumped from the default to 20 because of high ITL waits.
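To spot which indexes fit that "heavily used" profile, ITL contention since instance startup can be checked in v$segment_statistics (a sketch; any threshold is arbitrary):

```sql
-- Indexes with ITL waits: candidates for a higher INITRANS and,
-- per our experience, the ones that got corrupted on the standby.
SELECT owner, object_name, value AS itl_waits
  FROM v$segment_statistics
 WHERE statistic_name = 'ITL waits'
   AND object_type = 'INDEX'
   AND value > 0
 ORDER BY value DESC;
```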

Question:
Why shouldn't this problem occur during restore and recovery from a backup? I see no reason why it should not.

Final word:
While compatible = 10.2.0 is a tempting setting that keeps the gates open for a downgrade in case the migration fails, it comes at a cost for Data Guard.

Afterword:
After a switchover with both standby and primary at compatible 11.2.0, dbv detected failures on the standby again. The failure count increases on each dbv run (2,000 and counting in our case). dbv had found no failures on either standby or primary before the switchover. A rebuild of the indexes cleans the errors for good.

Our take:
Whether compatible is 11.2.0 or 10.2, the block ITL corruptions (and ORA-1555) are still introduced into the standby after a switchover. Setting _ktb_debug_flags = 8 is mandatory before activating the standby: failure to do so results in permanent ORA-1555 errors, as described above.

 


2 Responses to 11g ora-1555 on activated standby after migration from 10g

  1. Jakub Wartak says:

    Same story here, but:
    - it happened with compatible 11.2
    - the only corrupted indexes on the standby were those with really "heavy-duty usage" patterns (the busiest ones, actually)
    - non-scientific fact: IMHO a "shutdown abort" on the standby before the switchover (actually srvctl stop database, which ended in an abort) had something to do with it
    - set _ktb_debug_flags=8 saves the world on prod/standby
    - additional fact: after a restore on a testing env with an older PSU version and without _ktb_debug_flags=8 set, you still get the errors :o)

    • laimisnd says:

      Sorry for the late notice.
      Confirmed: it happens with compatible 11.2 too.
      A rebuild of the indexes on the primary clears the errors.
      I'd think a restore of the standby datafiles from the primary should clear the errors too.
      I have an SR open with Oracle regarding the root cause and what we should do with the dbv-reported block failures, but no activity yet.
