Shortly after the release of Word 2007, I managed to recover a corrupted document by editing the XML outside Word, and allowed myself to be fooled into believing that the new file format was more resilient than the old, not least because of the textual nature of the XML components. Two years on, having seen several corrupted documents, and having managed to repair some, but by no means all, of them, I now realise that I was wrong.
To understand exactly why I now feel the way I do, one must delve into the internals of Word documents. It’s not rocket science, but it is computer science, and, if bits and bytes scare you, it may not be for you. You have been warned!
Word 97‑2003 format (.doc) documents are held as binary data, in a proprietary format. Although documentation of that format has now been released under Microsoft’s Open Specification Promise, it is still very difficult to make sense of the files. If you are interested, you can find more information on the Microsoft web site [link to Microsoft’s File Format documentation at http://msdn.microsoft.com/en-us/library/cc313105.aspx].
The logical structure of a Word 97‑2003 format document is one of a series of elements arranged in a hierarchy, much like a mini file system. As an example, here is the structure of a simple Word 97‑2003 (.doc) format document:
The components are actually called Storages and Streams, but MyDocument.doc can be considered to be a folder, and the other names can be considered as files within that folder. There are two points worthy of note at this stage: firstly, the names here are genuine names of components of Word 97‑2003 format documents, with the exception that the asterisks represent non‑printing characters. Secondly, although this looks like, and, indeed, is, a flat structure, more complex documents do have a hierarchy, potentially many levels deep.
Very loosely, the Word Document element contains, largely, document content, and the Table element contains, largely, supporting data: formats and the like; the two Information elements contain document properties, visible and changeable outside Word, and the CompObj element contains OLE‑related information about the file (which I am not going to explain further here).
The physical structure of the complete file bears little relation to the logical structure; it is, again, of a proprietary design, a compound, or structured storage, file. Briefly, and loosely, the separate logical elements of the file are broken up into blocks; these blocks are treated as individual units, which units are then organised without regard for their logical arrangement, and catalogued, catalogue and organisation detail being held alongside the blocks themselves, to enable recombination into logical components when necessary.
Most of the time, blocks of data that logically belong together are physically held together, but there is no requirement for, or guarantee of, this. This often means that if you look inside a Word 97‑2003 format document in a hex editor, or even a text editor, you can see large parts of your document content. Just to give you a flavour, here are some views of three small parts of such a document, viewed in a hex editor:
Views of a Word 97-2003 format Document
The first view is of the beginning of the file. The first 512 bytes of a structured storage file are a control block that provides information needed to interpret the rest of the file. Unlike the rest of the file, this control block is of a fixed format, and all Word, and other Office, documents look pretty similar. The second view, towards the end of this example file, is part of the catalogue, or directory, of the file. The entries in the directory hold mixed control information, but essentially tell Word where to find the beginning of each logical element within the physical structure.
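For anyone who would like to poke at that control block themselves, here is a minimal sketch in Python. It assumes the field offsets given in the published compound file documentation; the function name read_control_block and the choice of fields to decode are mine, purely for illustration, and it is not how Word itself reads the file.

```python
import struct

# Signature that opens every compound (structured storage) file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_control_block(data: bytes) -> dict:
    """Decode a few fields from the control block at the start of the file."""
    if len(data) < 512:
        raise ValueError("fewer than 512 bytes: control block is incomplete")
    if data[:8] != CFB_SIGNATURE:
        raise ValueError("signature missing: not a compound file, or header damaged")

    # Offsets assumed from the published compound file layout (little-endian).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    (first_directory_sector,) = struct.unpack_from("<I", data, 48)

    sector_size = 1 << sector_shift
    return {
        "version": (major, minor),
        "sector_size": sector_size,
        "first_directory_sector": first_directory_sector,
        # Sector 0 starts immediately after the header's own sector.
        "directory_offset": (first_directory_sector + 1) * sector_size,
    }
```

Feeding it the first 512 bytes of a .doc file should tell you, among other things, where in the file the directory begins, which is the part shown in the second view.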
The third view is of somewhere in the middle of the file, where the document text begins. The actual text you see in this example is part of one of many drafts of this article that I simply cut and pasted into a new document in order to produce these images. Although what you see in the first two views may not make much sense to you, the third view clearly shows the document content, held as plain text. It isn't necessarily quite as obvious as shown in this example (the document content is actually held either as single-byte characters, which makes normal English text easily readable, or as two-byte Unicode characters, which is somewhat less kind to, say, Asian texts), but it is usually, at least partially, readable, and, either manually, or with a relatively easy to write program, anybody can extract any remains of this text if the file is otherwise corrupt. Word, itself, has a “recover text from any file” process that simply opens the file, as though in a text editor, allowing you to retrieve whatever may be there to retrieve.
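To show just how easy such a program can be, here is a rough sketch in Python. It is my own illustration, not a description of what Word's recovery option does. It looks for runs of printable characters stored one byte each and, on the assumption that text may also be stored two bytes per character (UTF-16LE), for runs in that form too. The name extract_text_runs and the eight-character minimum are arbitrary choices, and anything outside plain ASCII simply ends a run.

```python
import re


def extract_text_runs(data: bytes, min_chars: int = 8) -> list[str]:
    """Return readable runs found in raw bytes, in the order they occur."""
    # Printable ASCII stored one byte per character.
    one_byte = re.compile(rb"[\x20-\x7E]{%d,}" % min_chars)
    # The same characters stored two bytes each: the character, then a zero byte.
    two_byte = re.compile(rb"(?:[\x20-\x7E]\x00){%d,}" % min_chars)

    found = [(m.start(), m.group().decode("ascii")) for m in one_byte.finditer(data)]
    found += [
        (m.start(), m.group().decode("utf-16-le")) for m in two_byte.finditer(data)
    ]
    # Sort by position in the file so the text comes out in reading order.
    return [text for _, text in sorted(found)]
```

Run against the bytes of a damaged .doc file, this returns whatever legible fragments survive, along with a fair amount of noise such as font and style names, which you would then need to weed out by hand.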
Word 2007 format (.docx and .docm) documents are held as packages of XML documents, in an open format. Documentation of the standard for the format, registered as ISO standard 29500, can be downloaded from the ISO web site [link to the ISO downloads page at http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html]. Like that for the earlier format, this documentation is not easy to understand.
The logical structure of a Word 2007 format document is one of a series of elements arranged in a hierarchy, much like a mini file system. As an example, here is the structure of a simple Word 2007 (.docx) format document:
Just as with the older format, MyDocument.docx can be considered to be a folder, and the other names can be considered files or folders as indicated by the icons. Again, these are genuine names as created by Word, although, with a couple of exceptions, the actual names are not important.
As briefly as before, the [Content_Types] file and the _rels folders, along with the subordinate files therein, contain information about the logical structure, and the two files in the docProps folder contain much the same as the two Information files in the old format. The document.xml element within the word folder holds the bulk of the document content and the other files within that same folder hold formatting details.
So, you might say, the internal structure of a document has changed a little, so what? There are, however, other changes that make a bigger difference. The first is that, although both logical formats are conceptually similar, they are wrapped up in completely different ways to make a single file. Instead of the proprietary physical structure used for Word 97‑2003 format documents, a fairly standard, and open, Zip Archive format is used for Word 2007 format documents. The second change is that instead of using obscure binary codes, everything in Word 2007 format documents, well almost everything, is held in XML format.
All data held as XML? In a standard Zip Package? It should be much easier to work with, then? Judge for yourself; here are some views of parts of a Word 2007 format document taken from a hex editor:
Views of a Word 2007 format Document
These views are roughly equivalent to those you saw earlier from the Word 97‑2003 format document. The first view is of the beginning of the file. There is no big control block here: it’s almost straight into the data, and what you see is the beginning of the compressed [Content_Types].xml file. The second view, towards the end of this example file, is part of the directory of the file. The entries in the directory hold mixed control information, but essentially tell Word where to find the beginning of each logical element within the physical structure.
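You do not need a hex editor to read that directory. As a small sketch, using Python's standard zipfile module, the following lists each entry along with the position and sizes recorded for it. The function name list_directory is my own invention, and the code takes the whole file as bytes simply to keep the example self-contained.

```python
import io
import zipfile


def list_directory(package: bytes) -> list[tuple[str, int, int, int]]:
    """Return (name, offset of the part, compressed size, full size) per entry."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return [
            (info.filename, info.header_offset, info.compress_size, info.file_size)
            for info in zf.infolist()
        ]
```

Something like list_directory(open("MyDocument.docx", "rb").read()) should show [Content_Types].xml, the _rels and docProps entries and word/document.xml, each with the offset at which it starts within the file.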
The third view is of somewhere in the middle of the file, where the document text begins. Now, you may have noticed that the way I have described the two formats is very similar; this was not entirely deliberate but, at a conceptual level, the two formats are very similar. At the physical level, when you actually look at the bits and bytes in the file, you now come to what I consider one of the more significant differences: the text in the Word 2007 format document file is not readable, because it is compressed. Most people would not be able, either manually, or even with a fairly complex program, to extract any remains of this text if the file were otherwise corrupt. Word’s “recover text from any file” option, which still exists, is no longer any help. So what can you do if you have a corrupt Word 2007 format document? The answer, unfortunately, is, probably: not much.
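If you would like to see the effect of compression for yourself, the following Python sketch zips a made-up, much simplified stand-in for document.xml (it is not a complete or valid Word part) and then searches the raw bytes of the result for a phrase that the text certainly contains. The names PHRASE and build_package are mine, for illustration only.

```python
import io
import zipfile

PHRASE = b"better healing needed"


def build_package(document_xml: bytes) -> bytes:
    """Zip a single part using deflate compression, as a .docx package does."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


# Simplified stand-in for document.xml: twenty short paragraphs.
paragraphs = b"".join(
    b"<w:p><w:r><w:t>Paragraph %d: %s</w:t></w:r></w:p>" % (i, PHRASE)
    for i in range(20)
)
document_xml = b"<w:document><w:body>" + paragraphs + b"</w:body></w:document>"
package = build_package(document_xml)

print(PHRASE in document_xml)           # searched for in the text before zipping
print(PHRASE in package)                # searched for in the raw bytes of the zip
print(b"word/document.xml" in package)  # part names are stored uncompressed
```

The first and third checks should succeed and the second should not: the names of the parts remain legible in the file, but the text inside them does not, which is why a simple scan for readable text no longer finds anything.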
Corruption, almost by definition, is random. The most common cause is saving, or copying, to removable media. In times past, low storage capacities and physically fragile media made floppy discs, for example, particularly insecure; today, the causes of potential problems are rather different but the end results are the same. The result, whatever the cause, is usually a portion of the file, effectively overwritten with arbitrary data, no longer containing a part of the document it needs. Exactly what has been overwritten determines how difficult it is to recover anything of what remains.
In a Word 97‑2003 format file, if some of the control information is lost, the body of the document may well remain intact. Or perhaps some VBA code, or an embedded image, might be lost whilst, again, leaving the body of the document intact. Alternatively some of the body content might be overwritten leaving VBA code or embedded images intact. It is usually possible, though not entirely straightforward, to recover document components that remain intact and, as already mentioned, Word’s “recover text from any file” will aid in recovery of any text remaining in a corrupt component, even if some of the text is no longer available.
In a Word 2007 format file, much of the above is still true: VBA code, for example, or embedded images, might be lost while the body of the document is maintained, or vice‑versa. Unless you are very unlucky, it is a relatively simple process to recover the intact elements, easier than in the earlier binary format. But, despite the fact that the file is split into many more components than before, the document.xml component, which contains the body of the document, also contains information that was held separately in the old format, and it is the biggest, or, at least, the major component of many Word 2007 format files. Consequently it is the most likely part to be corrupted and, when this happens, it cannot usually even be unzipped, so no part of it is recoverable, leading to greater losses and less recoverability.
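As an illustration of recovering the intact elements, here is a minimal Python sketch that tries to unzip each part in turn and keeps whatever comes out cleanly. It assumes the directory at the end of the file has survived; if it has not, zipfile will refuse to open the package at all and this approach gets nowhere. The function name salvage_parts is hypothetical.

```python
import io
import zipfile
import zlib


def salvage_parts(package: bytes) -> tuple[dict[str, bytes], list[str]]:
    """Read every part that still unzips cleanly; list the ones that do not."""
    recovered: dict[str, bytes] = {}
    damaged: list[str] = []
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        for info in zf.infolist():
            try:
                # read() decompresses the part and verifies its CRC-32 checksum.
                recovered[info.filename] = zf.read(info)
            except (zipfile.BadZipFile, zlib.error, EOFError, OSError, ValueError):
                damaged.append(info.filename)
    return recovered, damaged
```

With a typical damaged file you might get back the properties, styles and any embedded images, with word/document.xml sitting alone in the damaged list, which is exactly the situation described above.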
What can sometimes happen, and this was what happened to me two years ago, leading to my erroneous conclusion about resilience, is that a Word 2007 format file can be successfully unzipped but Word is then unable to interpret the XML contents, even though there may be nothing obviously wrong with them. This differs significantly from the physical corruption of the file by forces outside Word, so far discussed, although it can be caused by occasional environmental conditions on your computer. This situation is, most likely, caused by errors within Word itself. The file format is wholly new and, given what Microsoft term the realities of the software development lifecycle, some errors undoubtedly slipped through the net at the time Word 2007 was released. Some of these will have already been fixed, and more will, I’m sure, be fixed in the next release of Word; meanwhile there remains a legacy of minor problems.
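A reasonable first step with a file like this is to check whether the XML is at least well-formed, and, if not, where the parser gives up. The Python sketch below does only that, using the standard library parser; the function name check_part is hypothetical. Bear in mind that XML can pass this test and still be rejected by Word, so a clean result narrows the search rather than ending it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional


def check_part(package: bytes, part: str = "word/document.xml") -> Optional[str]:
    """Return None if the part parses as XML, otherwise a description of the error."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        xml_bytes = zf.read(part)
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        # position is a (line, column) pair pointing at where parsing stopped.
        line, column = err.position
        return f"{part}: not well-formed at line {line}, column {column} ({err})"
    return None
```

If an error is reported, the line and column give you somewhere to start looking when you open the unzipped part in a text editor.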
When I drafted this page in Word, I gave the above paragraph a temporary heading: “better heading needed”, as a prompt to myself to think of a better one. Sometime later I glanced at the page and read: “better healing needed”, a phrase that struck me as so apposite that I decided to use it, and add this brief note so as not to leave my readers completely baffled.