By Gianni Tedesco <gianni at scaramanga dot co dot uk>
Below are some quick notes and discussion on writing reliable software, specifically dealing with problems concerning file corruption, possibly one of the most serious types of software errors.
Bugs involving file corruption are surprisingly common once you know how to spot them. Even free software (which is usually vastly more reliable than proprietary equivalents) can go horribly awry when a condition as benign as a full hard disk arises.
This guide was thrown together in a few hours, so there are probably omissions. If you can think of any other little tips or tricks that I missed out, just email me and I'll incorporate your changes and will be glad to credit you fully.
Always check the return value of write(2) and of anything that calls it (eg: fwrite(3) and chums). A serious problem arises if a call to write(2) fails: further (successful) writes to that file descriptor will put the data in the wrong position in the output file, which for nearly all applications means that the data will be corrupt. Even if the program only does one write, the problem still occurs, because your program will signal the user that the file was written correctly when it is actually zero length; again, for most programs that means that the data is corrupt.
You might argue that write(2) is unlikely to fail, so why even bother checking? That is a disturbing misconception. Here are some examples of why write(2) will fail:
write(2) can also fail or be interrupted halfway through writing. To indicate this, the call returns a value smaller than the 'len' parameter you passed. This can happen (benignly), for example, when writing over NFS using the soft/interruptible mode in the Linux kernel and your process receives an unmasked signal. You must handle this case by advancing your buffer pointer by ret, decrementing len by ret, and calling write(2) again. Of course, this means you can't depend on the "atomicity" of write(2).
If you use substandard I/O (stdio), remember that fwrite(3) buffers I/O. If fwrite(3) fails, it could mean that data you passed to it much earlier has not been written, so don't depend on earlier data having reached the file unless you fflush(3) after each write or switch the stream to _IONBF with setvbuf(3).
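As a rough sketch of what that means for stdio users, the helper below checks fwrite(3), fflush(3) and fclose(3) in turn. The name save_buffer and its return convention are made up for illustration; they are not from any library.

```c
#include <stdio.h>

/* Hypothetical helper: write len bytes to path through stdio.
 * Returns 0 on success, -1 if any step reported an error. */
int save_buffer(const char *path, const char *data, size_t len)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* fwrite(3) returns the number of items written; a short count
     * means something went wrong. */
    if (fwrite(data, 1, len, fp) != len) {
        fclose(fp);
        return -1;
    }

    /* The data may still be sitting in the stdio buffer, so force it
     * out and check the result before claiming success. */
    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }

    /* fclose(3) can report an error too. */
    if (fclose(fp) != 0)
        return -1;

    return 0;
}
```

On Linux, pointing this at /dev/full is a cheap way to see the point: the fwrite(3) call succeeds because it only fills the buffer, and the error only shows up at fflush(3).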
Now to scare you: write(2) can sometimes completely fail to report underlying I/O errors. This is because most UNIX systems defer writing to a convenient time (to make the I/O scheduler more efficient). Many writes to the same file (even from different processes) can sometimes be coalesced into one write to the underlying hardware (ie: when the same page in the page-cache gets dirtied many times before being flushed), making it very difficult for OS implementors to actually report errors anyway. I know for sure that Linux (even 2.5) does not report these errors, and probably never will. I am quite sure that the situation is the same for all modern kernels. There is nothing you can do... Are you making backups yet? ;)
The comparison of the return code against len is tricky, because len is size_t and ret is ssize_t. If ret is held in a signed type (ssize_t or int) and compared directly with the size_t len parameter, you will have a bug. The compiler will convert ret to the unsigned type and you will end up with a signedness bug (eg: -1 > 1234 instead of -1 < 1234). This makes expressions like if (ret < len) evaluate incorrectly, subsequently hiding the I/O error and corrupting the file.
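A small sketch of the trap follows. Both function names are hypothetical. The cast in the first function is written out explicitly only to show what the compiler does implicitly when ret < len mixes ssize_t and size_t.

```c
#include <stddef.h>
#include <sys/types.h>

/* Buggy check: `ret < len` with mixed signedness is evaluated as if it
 * were written like this, so ret == -1 turns into a huge unsigned value
 * and the error goes unnoticed. */
int is_incomplete_buggy(ssize_t ret, size_t len)
{
    return (size_t)ret < len;
}

/* Safer check: rule out the negative (error) case first, and only then
 * compare the two values as unsigned quantities. */
int is_incomplete(ssize_t ret, size_t len)
{
    return ret < 0 || (size_t)ret < len;
}
```

With ret = -1 and len = 1234 the buggy version reports that nothing is wrong, while the second version flags the failure.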
In summation you MUST check the return code of write(2), pwrite(2), writev(2), fwrite(3), and anything even sounding similar. You must also handle all recoverable errors if you want your code to be robust. The sequence of events when making these calls should be something like the following example (for write(2)):
Below is roughly the correct sequence expressed in C:
    ssize_t ret;
    size_t len;

    if ( len > SSIZE_MAX ) {
        /* This is a very bad idea */
    }

again:
    ret = write(fd, buf, len);
    if ( ret < 0 ) {
        if ( errno == EINTR )
            goto again;
        if ( errno == EAGAIN ) {
            /* Wait with poll(2) for POLLOUT */
            goto again;
        }
        /* A real error - signal the user */
    }else if ( (size_t)ret < len ) {
        buf += (size_t)ret;
        len -= (size_t)ret;
        goto again;
    }
    /* success */
Not checking the return value of close(2) is a VERY common mistake. Some filesystems cannot report errors straight away in the write(2) system call, so they defer sending the error back to userspace until the file descriptor is closed with the close(2) system call. This happens on NFS, coda, and probably all network filesystems, and any filesystem with disk quotas. The close(2) system call may return any of the errors that write(2) can, and they should be dealt with in the same way.
As a note there is a patch floating around for Linux which returns error codes on close when an underlying hardware failure occurs in a previous write, so in future, even writes to regular files on disk-based filesystems without quotas or any funny business may fail on close(2).
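A minimal sketch of checking close(2) is below. The helper name close_checked and its name parameter (used only for the error message) are hypothetical. One caveat that I am adding here: the sketch reports a failed close rather than retrying it, because POSIX leaves the state of the descriptor unspecified after an interrupted close(2).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: close fd and tell the user if that failed.
 * Returns 0 on success, -1 with errno preserved on failure. */
int close_checked(int fd, const char *name)
{
    if (close(fd) < 0) {
        int saved = errno;      /* fprintf may clobber errno */

        fprintf(stderr, "%s: close failed: %s\n", name, strerror(saved));
        errno = saved;
        return -1;              /* treat the file as not safely written */
    }
    return 0;
}
```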
As I mentioned earlier, nearly all systems defer writing to a convenient time. This means that even after closing a file descriptor the data is not necessarily all on the disk; it can sit in the kernel caches for 60 seconds or more before the kernel flushes it.
This presents a problem because after the user is informed that all went well, there could be a power cut or a catastrophic software failure that renders the machine unable to flush the disk buffers safely. In this case the user's data is lost. Although this doesn't sound like a big problem, remember that if your program takes its input from another program (eg: over the network, or the program is to be used in a script) the input data may be deleted as soon as your program completes. If the output is then lost, work will be lost and the user will be very upset.
The solution to this is to call fsync(2) on the file descriptor before closing the file. This ensures that all data and even meta-data (inode information) is stored on the physical medium before the call returns. Remember that if you guarantee fsyncs to your users, you should notify them with an error if the fsync(2) fails.
To increase fsync(2) performance, you can try fdatasync(2), which skips meta-data that isn't needed to read the data back (eg: access time). This is good on filesystems which aren't mounted with 'noatime'. Check your system documentation beforehand and consider whether performance is really so important compared to guaranteeing meta-data consistency.
Remember that hard disks often have a cache of their own, so if there is a power cut or some other grievous hardware failure the data could still be gone, regardless of whether fsync(2) succeeded. On something important like a mail server, consider recommending that your users disable the drive's write cache (on Linux you can use hdparm to achieve this). This is less of a problem with the more expensive SCSI disks/controllers which can use a battery backup to sync the data and park the heads.
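The fsync-before-close step might look something like the sketch below. The helper name sync_and_close is made up, and using perror(3) is just one way of notifying the user. Where fdatasync(2) is available it could be swapped in for the fsync(2) call.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push fd's data to the medium, then close it.
 * Returns 0 only if both fsync(2) and close(2) succeeded. */
int sync_and_close(int fd)
{
    if (fsync(fd) < 0) {
        int saved = errno;

        perror("fsync");        /* the user must hear about this */
        close(fd);              /* already failing; report the fsync error */
        errno = saved;
        return -1;
    }

    if (close(fd) < 0) {
        int saved = errno;

        perror("close");
        errno = saved;
        return -1;
    }

    return 0;
}
```

Only tell the user that the file was saved once this returns 0.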
Let's say your program edits a list of users in a file, for example /etc/passwd. When you want to edit a user's shell, you probably read in the whole file, modify the user's details in RAM, truncate the file, and write it all back out again with the new details.
If you are nodding, you failed the test. Your password database is gone forever. Here's why...
Imagine that the disk is full of web logs from the web server. In between you truncating the password file and starting to write again, the webserver quickly fills up the block with more log messages. You start writing the new user database out again, but write(2) returns an error code indicating that the disk is full. Being a good boy you checked the error code and quite correctly you informed the user of the nasty error and exited.
Too late, the data is already gone. When modifying files in this fashion you must never remove the input data until the output data is safely written to disk. I know it takes more disk space but there is no other way to do it reliably. The way to do it is to write the output file to a temporary file and when finished, rename(2) it over the input file. When using this strategy there are a few things to bear in mind:
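A rough sketch of the temporary-file-and-rename strategy is shown here. The function name replace_file and the tmp_path parameter are hypothetical. The sketch assumes that the temporary file lives on the same filesystem as the target (rename(2) fails with EXDEV otherwise) and it makes no attempt to copy the original file's ownership or permissions.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of path with data, without
 * ever truncating the original. Returns 0 on success, -1 on failure, in
 * which case the original file is left untouched. */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    const char *buf = data;
    int saved;
    /* O_EXCL: refuse to reuse a temporary file that already exists. */
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t ret = write(fd, buf, len);

        if (ret < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        buf += (size_t)ret;
        len -= (size_t)ret;
    }

    /* Make sure the new data is on disk before the old data goes away. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* Atomically swap the new file into place. */
    if (rename(tmp_path, path) < 0)
        goto fail;
    return 0;

fail:
    saved = errno;
    if (fd >= 0)
        close(fd);
    unlink(tmp_path);           /* don't leave half-written debris behind */
    errno = saved;
    return -1;
}
```

If anything goes wrong before the rename(2), the only casualty is the temporary file.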
By now you should realise the importance of backups. If your data is structured or written in such a way that it is hard to make usable (ie: restorable) backups then you are increasing the chances of data loss.
As an example, if your application is a database server you will want to make sure that backups can be made without shutting down the server. A stressed database admin will not be able to shut down the business critical databases each day and she will simply back up the live data, which will probably be inconsistent (unrestorable).
The solution to this will always be dependent on what the program is, so I cannot offer any advice other than to think about it and plan the software around backups. Bear in mind also that users generally don't want to learn complex and convoluted backup procedures and will probably just copy your files with cp(1) or rsync(1).
You can't stop I/O errors from occurring and you cannot write bug-free software. Nothing angers a user more than data loss, and if you don't want to end up with irate geeks planting explosive devices in your car you can follow these five simple rules of thumb:
@(#) $Id: corruption.html 122 2003-11-19 15:13:06Z scara $
Copyright (c) Spanish Inquisition 1478-1834. All rights reversed.