Handling memory corruption
- Minaz Amin
- Oct 29, 2015
- 4 min read
I think you will all agree with me that database corruption is one of the nightmares in the life of a DBA. . A database corruption issue not only puts the data at risk, it also threatens to take a toll on businesses and revenues. To detect corruption we uses dbcc checkdb command.
DBCC CHECKDB interprets the page contents (for example, validating row structures, validating column values, checking linkages in indexes), and so should always be able to detect a corrupt page, even if the corruption happened while in memory.
This means that, in addition to using WITH CHECKSUM when performing backups, we need to run regular consistency checks, but not necessarily on the production server. A good alternative is to restore backups on another server and then run consistency checks on them, or to use a virtual restore tool and then run consistency checks on the virtually mounted database.
Page checksums help detect page corruption within the I/O subsystem; if when a page enters memory its checksum value does not match what it was when SQL Server wrote it to disk, then it knows that page corruption occurred within the I/O subsystem.
What happens, however, if there is in-memory corruption of data file pages? Consider the following scenario:
SQL Server reads a data file page into memory and then modifies it, because an update statement changes a column value of a table row that is stored on the data file page. Unfortunately, after the page is modified, but before the next checkpoint operation, a faulty memory chip in the physical server causes a corruption in the 8KB block of memory that is holding the modified data file page. When the checkpoint occurs, it calculates a page checksum over the data file page contents, including the portion corrupted by the faulty memory chip, writes the page checksum into the page header, and the page is written out to disk.
Later, a query causes SQL Server to read this data file page from disk and so it validates the page checksum and, assuming nothing went wrong at the I/O-subsystem level, the checksum values will match, and it will not detect any problem. The page is corrupt, but the page checksum algorithm cannot detect it, as the page contents have not changed since SQL Server first wrote the page to disk.
Similarly, a backup operation that validates page checksums will not detect in-memory corruption; the backup operation simply reads the data pages, calculates the checksum, compares the value to what it was when the page was last written to disk, finds the values match, and writes the in-memory corrupted page into the backup file.
Page checksums, by design, only detect corruption that occurred in the I/O subsystem, and will not catch in-memory page corruptions. As such, we cannot rely on a valid page checksum to show, categorically, that a data file page is free of corruption. The purpose of DBCC CHECKDB is to detect a corrupt page, even if the corruption happened while in memory.
In addition to DBCC CHECKDB, SQL Server actually has two other features that can aid with finding memory corruptions, but which only apply if SQL Server has not yet changed the data file page in memory.
Since SQL Server 2005, the buffer pool has conducted page checksum checks for data file pages that are already in memory. If a page resides in memory, it has a page checksum, and it has not changed since SQL Server read it from disk, then the buffer pool will revalidate the page checksum, occasionally, to make sure that no in-memory corruption has occurred. If a corruption is found, an 832 error is reported.
SQL Server 2012 running on Windows Server 2012 has a new feature that will detect and fix memory corruptions by re-reading data file pages from disk. Again, this only applies to a page that remains unchanged since SQL Server read it in from disk. This can help prevent SQL Server writing corrupted pages to disk. This new feature in SQL Server 2012 is called Memory Error Recovery. This feature allows SQL Server 2012 to repair clean pages in the buffer pool by reading the pages again from disk.
There was a presentation at TechEd 2012, called “The Path to Continuous Availability with Windows Server 2012” that talked about this being a new feature in Windows Server 2012, which implies that you will need to be running SQL Server 2012 on top of Windows Server 2012 to get this functionality.
In Windows Server 2012, the feature is called Application Assisted Memory Error Recovery, and it requires the application (such as SQL Server 2012) to register for notifications of bad memory page events using CreateMemoryResourceNotification(). It also requires SQL Server 2012 to use the API QueryWorkingSetEx() to scan the memory for bad pages.
If you have the hardware support, along with both Windows Server 2012 and SQL Server 2012, you will see a message like this in your SQL Server error log:
Machine supports memory error recovery. SQL memory protection is enabled to recover from memory corruption.
Comments