Document Internals

Inside Your Document

Introduction

Readers with long memories may remember that a purported strength of the COBOL computer language was that it was ‘English-like’ and, thus, in some way, understandable by normal mortals: this claim was, of course, nonsense. Jump forward 50 years and you find XML, with a similar purported strength, that of being ‘human-readable’: this claim, too, is patent nonsense. Nonetheless, XML is the flavour of the 21st century, and software developers seem to want to use it at every opportunity, whether or not it is suitable: it is the hammer that makes everything else look like a nail.

So it is with Microsoft Word Documents (and, more generally, with Microsoft Office). The .doc format, in which Word documents were stored prior to Word 2007, was a proprietary binary format, the details of which were not public; trying to make sense of document files outside Word was, shall we say, challenging. Although details of the format have now been published by Microsoft under their Open Specification Promise [link to Microsoft’s site at http://www.microsoft.com/openspecifications], it is still pretty hard to understand.

The .docx format, in which Word documents are now stored, is called Office Open XML, and it has been formally adopted as an industry standard, having been agreed firstly by ECMA [link to The ECMA site at http://www.ecma-international.org], whose full name is rather a mouthful, as ECMA-376 [link to The ECMA-376 standard at http://www.ecma-international.org/publications/standards/Ecma-376.htm] and, latterly, by ISO [link to The ISO site at http://www.iso.org/iso], the International Organization for Standardization, as ISO/IEC 29500. If you will allow me, I would like to be your human guide through the human-readable jungle.

Lest my rhetoric give the wrong impression, let me say, finally, in this introduction, that, although I often cast a cynical eye over things, I actually like XML. For years I have watched large organisations try to beat their hierarchical data into submission and force it into relational structures; at last they have an approved hierarchical storage mechanism. Whether documents could be considered hierarchical is, perhaps, moot.

A First Look

In this article I am going to take a walk through a genuine, but arbitrary, document, and explain some of what is there. I hope that you will learn something along the way; I’m fairly sure I will. Further articles will wander down whatever byways take my fancy. The first walk begins here ...

If a file contains text, the obvious thing to do is to open it in a text editor: Notepad, for example. If you try this you will find, as I’m sure you know, that your document files are not text at all, just what appears to be a meaningless collection of letters and symbols and what look like blank spaces.

A Word Document viewed in a Text Editor

Your document files are, as I suspect you know, compressed; this allows Microsoft to claim that Word 2007‑format files are smaller than Word 2003‑format files. There are two reasons for the compression: the first is that it is a convenient, indeed the only really practical, way to combine all the components of a document into a single file, and the second is that, without it, they would be huge! Before you can see inside your Word document, it must be uncompressed, and, if you need details, the process of uncompressing is described separately here [link to Uncompress.php]. If you uncompress a simple document you will end up with a structure that looks like this:


MyDocument.docx
    _rels
        .rels
    docProps
        app.xml
        core.xml
    word
        _rels
            document.xml.rels
        theme
            theme1.xml
        document.xml
        fontTable.xml
        settings.xml
        styles.xml
        stylesWithEffects.xml
        webSettings.xml
    [Content_Types].xml

The matter of understanding this collection of files will be addressed later. For now, you can assume (and, please, remember, this is just an assumption, and not necessarily true) that the main content in the body of your document will be in the word\document.xml file. If you open the word\document.xml file in Notepad, you will find that it is now text, but not, perhaps, the text you might have been expecting.
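
If you would rather not uncompress anything by hand, the following is a minimal sketch, in Python, of peeking inside the package. It assumes only that the document is an ordinary zip archive, and that the part you want is UTF-8 text; the function names are my own invention, and nothing here is how Word itself does it.

```python
import zipfile


def list_parts(docx):
    """Return the names of the files stored inside a .docx package.

    `docx` may be a file name or an open binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return package.namelist()


def read_part(docx, name="word/document.xml"):
    """Return one part of the package, decoded as text.

    Assumes the part is UTF-8 encoded XML (an assumption, not a guarantee).
    """
    with zipfile.ZipFile(docx) as package:
        return package.read(name).decode("utf-8")
```

Calling list_parts("MyDocument.docx") on a document like mine should give names corresponding to the structure above; note that names inside the archive use forward slashes, as in word/document.xml, whatever Windows may prefer.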

Some of the 'text' in a Word Document

This is, in fact, XML. To even begin to understand it you have to know what it is. On the assumption that many readers will know the basics of XML, and not want to be bothered by an explanation, I have taken the said explanation and put it on a separate page: if you are interested, it is here [link to XMLBasics.php].

Knowing that you have XML is but a small victory in the first of many battles, but a victory nonetheless. As you can see from the picture above, the XML is rather a mess and not, in any real sense, ‘human readable’. It has none of the features you normally take for granted in something that is meant to be readable: it has no redundant white space, and it does not have many line breaks; in short it is not formatted for reading. If you want to read it, you must, first, make it readable; you could edit it in Notepad, or some other basic text editor, it being text after all, but you would probably give up in despair long before you got to anything that resembled the content of your document as you could understand it. To save your sanity, here is what the beginning of the file would look like if you were to format it - I have removed some text to help to make it readable.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<w:document xmlns:wpc  = "http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
            xmlns:mc   = "http://schemas.openxmlformats.org/markup-compatibility/2006" 
            xmlns:o    = "urn:schemas-microsoft-com:office:office" 
            xmlns:r    = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:m    = "http://schemas.openxmlformats.org/officeDocument/2006/math" 
            xmlns:v    = "urn:schemas-microsoft-com:vml" 
            xmlns:wp14 = "http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
            xmlns:wp   = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" 
            xmlns:w10  = "urn:schemas-microsoft-com:office:word" 
            xmlns:w    = "http://schemas.openxmlformats.org/wordprocessingml/2006/main" 
            xmlns:w14  = "http://schemas.microsoft.com/office/word/2010/wordml" 
            xmlns:wpg  = "http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
            xmlns:wpi  = "http://schemas.microsoft.com/office/word/2010/wordprocessingInk" 
            xmlns:wne  = "http://schemas.microsoft.com/office/word/2006/wordml" 
            xmlns:wps  = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape" 
            mc:Ignorable = "w14 wp14">

  <w:body>

    <w:p w:rsidR="000407E4" w:rsidRDefault="000407E4" w:rsidP="000407E4">

      <w:pPr>
        <w:pStyle w:val="Heading2"/>
      </w:pPr>

      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd   w:id="0"/>

      <w:r>
        <w:t>Introduction</w:t>
      </w:r>

    </w:p>

    <w:p w:rsidR="000407E4" w:rsidRDefault="000407E4" w:rsidP="007236E6">

      <w:r>
        <w:t xml:space="preserve">Readers with long ... [text removed] ... find XML, with a </w:t>
      </w:r>

      <w:r w:rsidR="007236E6">
        <w:t>similar purported  ... [text removed] ... flavour of the 21</w:t>
      </w:r>

      <w:r w:rsidR="007236E6" w:rsidRPr="000407E4">

        <w:rPr>
          <w:vertAlign w:val="superscript"/>
        </w:rPr>

        <w:t>st</w:t>

      </w:r>

      <w:r w:rsidR="007236E6">
        <w:t xml:space="preserve"> century ... [text removed] ... like a nail.</w:t>
      </w:r>

    </w:p>
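
I formatted the extract above by hand, but something similar can be had mechanically. What follows is a rough sketch in Python, using the standard library's xml.dom.minidom; it is not a description of how the extract was produced, and the tiny sample string and the function name are made up for the purpose.

```python
from xml.dom import minidom

# A made-up, much reduced, stand-in for the contents of word/document.xml
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:p><w:r>'
    '<w:t>Introduction</w:t></w:r></w:p></w:body></w:document>'
)


def prettify(xml_text, indent="  "):
    """Re-indent XML so that each element starts on its own line."""
    return minidom.parseString(xml_text).toprettyxml(indent=indent)


print(prettify(SAMPLE))
```

The result is for reading only: re-indenting adds white space between elements, so I would not put the formatted text back into a document and expect Word to be happy with it.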

You can probably read this, and you might even be able to take a stab at understanding it, but it would only be a stab – there is nothing there that is any kind of aid to understanding. Perhaps the documentation can help; there is that standards document referenced earlier: 6000 or so, somewhat repetitive, pages of, largely, technical mumbo-jumbo, and Microsoft have even published some details of the extent to which Office conforms to it (see conformance notes on MSDN [link to http://msdn.microsoft.com/en-us/library/ee908652(office.12).aspx] if you are interested). Not much joy so far!

You will have read, in my introduction to XML [link to XMLBasics.php], about namespaces, and you will, of course, be able to see that all the tags shown belong to the “w” namespace, and that “w” is defined as an alias for the “http://schemas.openxmlformats.org/wordprocessingml/2006/main” namespace, but you probably don’t find any of that information useful; I certainly don’t! The name of the namespace is a technical gimmick, and no more: you might as well forget it! You're still on your own.

The file begins with what is called the XML declaration, the line that begins “<?xml”; it looks like a Processing Instruction although, strictly, it is not one. This just tells the XML consumer (Word, in this case) something about the file, and is not anything to do with the document content. This is followed by an opening “<document>” tag; it lists aliases for all the namespaces used in the document, and has one slightly interesting attribute, “mc:Ignorable”, which tells a consumer that it can ignore, if it does not understand them, any parts of the XML that are tagged as being in the namespaces aliased by either “w14” or “wp14”. I will opine more on this at a later stage. Document is actually a rather misleading name for this tag, as, indeed, document.xml is for the file. For the moment, however, you can continue on the basis that this tag, and everything contained within it, is, in effect, your document.
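
To see what a program makes of all this, here is a small sketch using Python's standard xml.etree.ElementTree. The sample is a cut-down root element of my own making, with only four of the namespaces from the extract, and ignorable_prefixes is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# A cut-down root element; the real one declares many more namespaces
SAMPLE = (
    f'<w:document xmlns:w="{W}" xmlns:mc="{MC}" '
    'xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" '
    'xmlns:wp14="http://schemas.microsoft.com/office/word/2010/'
    'wordprocessingDrawing" '
    'mc:Ignorable="w14 wp14"><w:body/></w:document>'
)


def ignorable_prefixes(root):
    """Return the aliases listed in the root element's mc:Ignorable attribute."""
    return root.get(f"{{{MC}}}Ignorable", "").split()


root = ET.fromstring(SAMPLE)
print(root.tag)                  # the alias is replaced by the full namespace name
print(ignorable_prefixes(root))  # the aliases named in mc:Ignorable
```

The parser throws the alias away and keeps the full name, writing the tag as the namespace in braces followed by “document”; the alias really is just a shorthand for the benefit of whoever, or whatever, wrote the file.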

The document tag is followed by an opening “<body>” tag, which, you might guess, represents the body of your document. The body of your document is made up of paragraphs. If you’ve ever coded any web pages, you may recognise the “<p>” tag that represents a paragraph, but what, you may wonder, do all the attributes mean? “rsidR”? “000407E4”? Deep within the bowels of the new, improved, Options dialogs you may be aware of a setting: “Store random numbers to improve Combine accuracy” (this can be found by taking the File tab from the Ribbon and going Backstage, selecting “Options”, then selecting “Trust Center” from the list on the left of the “Word Options” dialog, clicking on the “Trust Center Settings” button on the right to invoke the “Trust Center” dialog, and looking on the “Privacy Options” tab, under “Document-specific settings”). This setting has been around for many a long year; it has no visible effect on the document surface, but it does help Word do what it does behind the scenes. The ‘random numbers’ that are stored are represented by these rather opaque values, and, for the moment, at least, they can be ignored.

Paragraphs have “properties”, represented in this case, and generally, by a “Pr” suffix. The “<pPr>” tag represents paragraph properties. This paragraph has a single property, its Style, represented by the “<pStyle>” tag. The style of this paragraph is “Heading2”, a standard Word built-in Style: at last something that describes part of your document – or my document, anyway – that you can understand. It is worth pointing out that this tag is self-contained, ending in “/>”, whereas the earlier tags have all been simple opening tags of elements of an ever deeper hierarchical structure.

The meaning of the next two tags, “<bookmarkStart>” and “<bookmarkEnd>”, is, I hope, fairly obvious. The two are separate stand-alone tags for good reasons that are not relevant here; the “id” – “0” – is what ties the two together. Every bookmark must have both a start and an end, with matching ids; if not, the document is considered what is called non-conformant. This is not as bad as not being well formed, but is still, strictly, a problem; news of this fact does not seem to have reached Word. The name of this bookmark is, as you can see, “_GoBack”. This is a special hidden bookmark that Word uses to keep track of the previous edit position so that the user can return to it. In this instance it happens to be at the beginning of the document; it isn't part of your document content but, once again, it is something, behind the scenes, that enables Word to do what it does.
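
The pairing rule is easy enough to check for yourself. This is a minimal sketch, in Python, of such a check; the sample is invented, and the “Orphan” bookmark, which has a start but no end, is there purely to give the check something to find.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Invented sample: bookmark 0 is properly paired, bookmark 1 is not
SAMPLE = f"""
<w:document xmlns:w="{W}"><w:body><w:p>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
  <w:bookmarkStart w:id="1" w:name="Orphan"/>
</w:p></w:body></w:document>
""".strip()


def unmatched_bookmarks(root):
    """Return (ids started but never ended, ids ended but never started)."""
    key = f"{{{W}}}id"
    starts = {el.get(key) for el in root.iter(f"{{{W}}}bookmarkStart")}
    ends = {el.get(key) for el in root.iter(f"{{{W}}}bookmarkEnd")}
    return sorted(starts - ends), sorted(ends - starts)


root = ET.fromstring(SAMPLE)
print(unmatched_bookmarks(root))
```

The sketch compares ids only; it makes no attempt to check that each end follows its start, which a proper validator would presumably want to do.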

Following the bookmark is an “<r>” tag. You can now begin to sense some of your document: “<r>” means “run”, and a Run is an arbitrary part of your document. Paragraphs are made up of Runs of text. This run has no special attributes and is immediately followed by a “<t>” tag. “<t>” means “text”, and text is, well, text: part of your document content. The heading paragraph that you are currently seeing contains the single word, “Introduction”, and this is followed, in short order, by a “</t>” end tag, ending the text, a “</r>” end tag, ending the run, and a “</p>” end tag, ending the paragraph. It's taken some time to get here but now, at last, you see the full markup for a single, short, simple, paragraph.

With the knowledge gained so far you can more easily scan the next paragraph, the one subordinate to the heading, containing some meaningful text. I have removed a lot of the actual text to make the whole thing easier to read, but you may recognise what you see as being a Word document containing the beginning of this web page.

The first thing to note is that this second paragraph is split into four runs. The first run is a simple one, like the one in the previous paragraph except for the fact that the text element (an element being that bounded by matching start and end tags) has an attribute of “xml:space="preserve"”. Strictly, an XML parser passes all white space through untouched, but the applications that consume XML are free to treat it much as browsers treat HTML, if you are familiar with that: leading and trailing space, or, more generally, white space, is ignored, and multiple embedded spaces are collapsed to single spaces. There are good reasons for this and it is generally what one would want, but when sentences are split into arbitrary runs, as here, leading or trailing space is likely to be significant. The designers of XML knew this and provided a general mechanism for signalling it, which is what you see here. space="preserve" means, in case you hadn’t guessed, ‘leave all space in this string exactly as it is’, and the space attribute is defined in the “xml” namespace, which comes with XML, and needs no special declarations to be able to be used. In this particular bit of text, it ensures that the single space following “a” at the end is maintained, to stop the word running straight into the word “similar” at the start of the next bit of text.
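
To make the effect concrete, here is a sketch in Python of a consumer that either honours the attribute or does not. Python's parser hands over the text with its white space intact; the trimming is done, or not done, by run_text, which is a made-up stand-in for whatever a real consumer might do, and is not a claim about Word's actual logic.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# The "xml" prefix is built in; parsers expand it to this namespace name
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">find XML, with a </w:t></w:r>'
    '<w:r><w:t>similar purported strength</w:t></w:r>'
    '</w:p>'
)


def run_text(t, honour_preserve=True):
    """Text of a <w:t> element as a (hypothetical) consumer might read it."""
    text = t.text or ""
    if honour_preserve and t.get(XML_SPACE) == "preserve":
        return text          # leave all space exactly as it is
    return text.strip()      # otherwise drop leading and trailing space


texts = list(ET.fromstring(SAMPLE).iter(f"{{{W}}}t"))
print("".join(run_text(t) for t in texts))
print("".join(run_text(t, honour_preserve=False) for t in texts))
```

With the attribute honoured the two runs join up as “with a similar”; with it disregarded they collide as “with asimilar”, which is exactly the accident the attribute is there to prevent.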

The next run is almost entirely straightforward. In contrast to the previous one, it does not specify xml:space="preserve", as there is no space to preserve. The “21” at the end should run straight into the “st” that follows. If you, now, look at the run containing that “st” you can see it contains an “<rPr>” tag introducing an extra element. You will remember, from the first paragraph, that a “Pr” suffix generally introduces properties of the parent element, and so it is here. The property of this run is the slightly specialist “superscript”, specified as a ‘vertical alignment’, a “<vertAlign>”. The “st” ordinal indicator is superscripted immediately after the number, 21.

Finally, the rest of the paragraph is specified as another run. This run has no special properties other than that the leading space, and consequently, all space in the text, is to be preserved as entered.
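
The last three of those runs can be examined programmatically, too. This is a small sketch, in Python, that reports the vertical alignment, if any, of each run; the sample is a shortened version of the XML shown earlier, and vertical_alignment is my own hypothetical helper.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Shortened version of the runs around the superscripted "st"
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>flavour of the 21</w:t></w:r>'
    '<w:r><w:rPr><w:vertAlign w:val="superscript"/></w:rPr><w:t>st</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> century</w:t></w:r>'
    '</w:p>'
)


def vertical_alignment(run):
    """Return the w:val of the run's <w:vertAlign> property, or None."""
    prop = run.find("w:rPr/w:vertAlign", NS)
    return None if prop is None else prop.get(f"{{{W}}}val")


runs = ET.fromstring(SAMPLE).findall("w:r", NS)
for run in runs:
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    print(repr(text), vertical_alignment(run))
```

Only the middle run reports “superscript”; the other two, having no properties at all, report nothing.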

Time to Draw Breath!

Truth is, I haven’t said very much, for all the text I’ve written! When one delves behind the scenes one often discovers a wealth of interesting information and, with nothing to guide one, all that one can really do is go through it step by step. I have chosen an arbitrary document and described exactly what it contains, for better or worse. Now it's time to extract the wheat from the chaff, so to speak, and to summarise the important points, so that you will be able to make some sense of the next document you see.

There is a basic structure underlying the XML that makes up a large part of a Word Document. The <document> contains a <body> which, in turn, contains <p>aragraphs, each made up of <r>uns, which, finally, may contain <t>ext. Many elements may also contain properties, often using tags with names that are the same as the name of the element itself, but with “Pr” appended.
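
That structure translates almost directly into code. Here is a minimal sketch, in Python, that walks from document to body to paragraphs to runs to text; the sample is a heavily abbreviated version of my document, and the paragraphs function is an illustration of the idea rather than a complete reader.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Heavily abbreviated version of the two paragraphs discussed above
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
    '<w:r><w:t>Introduction</w:t></w:r></w:p>'
    '<w:p><w:r><w:t xml:space="preserve">Readers with </w:t></w:r>'
    '<w:r><w:t>long memories</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def paragraphs(root):
    """Yield (style name or None, text) for each paragraph directly in the body."""
    for p in root.findall("w:body/w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        name = None if style is None else style.get(f"{{{W}}}val")
        # Only runs that are direct children of the paragraph are considered
        text = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))
        yield name, text


for style, text in paragraphs(ET.fromstring(SAMPLE)):
    print(style, "|", text)
```

Be warned that this looks only at runs sitting directly inside paragraphs that sit directly inside the body; text tucked away inside other constructs would be missed, so treat it as a first approximation.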

Clearly there is much more to come, but what you need now is a way to look at, and be able to read, the documents yourself. I will not always be around to reformat them for you!

There are no suitable tools built into either Windows or Office, but if you are unable to install anything else, an ordinary browser, Internet Explorer, for example, will do some basic formatting to make XML readable. If you are able to install software, a quick web search shows there to be several free, potentially helpful, editors available. As this is Microsoft data, a Microsoft editor seems like a good bet for a first attempt, and one, called XML Notepad, can be found here [link to download Microsoft’s XML Notepad at http://www.microsoft.com/download/en/details.aspx?&id=7973]. If you download and install this (you may have to also install a version of the .NET Framework, which I am not going to explain here, but you should just be able to follow whatever instructions you are given) you should then be able to use it to open the word\document.xml file that looked so awful in Notepad, and, if you do, you should see something like this:

A first view in XML Notepad

Things, at least, look a bit better now - but only a bit! What you can see is readable, but not wonderfully informative. This just shows the first part of what you have already seen, and none of the content of your document at all. Do not despair! At the bottom of the left hand pane, there is a little ⊞ symbol; click on this, and, as you probably expect, the “w:body” element will be expanded - to something like this:

An expanded view in XML Notepad

This does show, quite clearly, that the document is made up of lots of paragraphs. There could be other elements – tables, for example – but in this simple document there are just paragraphs. To see what's actually in those paragraphs you must expand all the elements. Do this for the first couple of paragraphs and this is what you will see:

An even more expanded view in XML Notepad

This is just the XML I formatted somewhere way up the page, just shown from a slightly different perspective. It is, perhaps, slightly easier to understand than the XML itself, but only slightly, and with the truncated strings of text it is probably not the most useful view. XML Notepad has one more trick up its sleeve; it has a mechanism for translating the XML using what is called an XML StyleSheet. Before going into any further detail, if you click on the “XSL Output” tab near the top of the window you will see a different view of the document:

A different view in XML Notepad

The message in yellow at the top may not mean very much to you, but XML StyleSheets can be used to do many things with XML, and you can write, and use, your own if you wish. In the absence of any further information, XML Notepad does as the message says: it tries to reformat your XML to be readable, and it does as good a job as can be expected. You still see all the metadata, as you must, but the colouring does make the text – the document content – reasonably clear.

I hope that, with what I have told you already, you will be able to, relatively easily, scan this and extract information from it. The eye can quite quickly come to ignore the common tags and attributes and, the more one gets to know, the more one tends to take in information subconsciously. I don't want to make light of this, and documents are, or can be, extremely complex; you will not get familiar with this overnight, but if you are keen to discover what makes Word documents the playgrounds they are, the fluff will quickly blend into the background, although it will remain all too easy to miss something critical!

If you scroll down a page or, depending on your window size, two, to the second paragraph, you will see that, although structurally similar to the first, it contains some tags you haven’t seen before and a whole new construct, that of a hyperlink:

The second paragraph in XML Notepad

You can easily spot the bold black text, split over several runs, but what you haven’t seen before are the properties of the third run: “<w:i/>”, and “<w:iCs/>”. What do these cryptic codes mean? If you scroll back to the beginning of this article, you will see that “.doc” is in italics; even if you don’t know about the “<i>” tag in HTML, it isn't a great leap of faith to think that, perhaps, <w:i> might mean “italics”, and it does. There are some rather complex rules for italic formatting when applied via styles but when applied, as here, as direct formatting to the text, everything is as straightforward as it appears.

The “<w:iCs/>” tag is applying no obvious formatting to the text, and, indeed, is applying no ‘un-obvious’ formatting either! Although it is not impossible to guess, it would be quite inspired guesswork that led you to the discovery that the “Cs” suffix on the tag refers to Complex Scripts. Word has separate styling attributes for what it considers simple (or Latin) scripts, and complex (generally, Asian) scripts, as can be seen in, for example, the “Format Font” Dialog. Although there is no text here in any complex script, and no instruction to format as though the script were complex (which can happen), Word has, nonetheless, and probably correctly, explicitly set italics to be applied to any complex script that may be inserted here. Although there are no examples in my document, bold formatting uses similar tags: “<w:b/>”, and “<w:bCs/>”.
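
These on-or-off properties are simple enough to test for in code. What follows is a sketch in Python that looks only at direct formatting on the run, and so ignores all those complex rules about styles; is_on is a hypothetical helper, and its handling of an explicit value of “0”, “false” or “off” is my reading of the standard rather than anything shown in my document.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Shortened version of the runs around the italic ".doc"
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">The </w:t></w:r>'
    '<w:r><w:rPr><w:i/><w:iCs/></w:rPr><w:t>.doc</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> format</w:t></w:r>'
    '</w:p>'
)


def is_on(run, name):
    """True if the run directly sets the on/off property `name` (e.g. "i")."""
    prop = run.find(f"w:rPr/w:{name}", NS)
    if prop is None:
        return False
    # A bare <w:i/> means "on"; an explicit w:val may switch it off (assumed)
    return prop.get(f"{{{W}}}val", "true") not in ("0", "false", "off")


runs = ET.fromstring(SAMPLE).findall("w:r", NS)
for run in runs:
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    print(repr(text), {name: is_on(run, name) for name in ("i", "iCs", "b", "bCs")})
```

Only the middle run comes back italic, for both simple and complex scripts, and nothing at all comes back bold.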

Move a little further through the paragraph and you will see the last thing I want to look at here: the hyperlink. This is a run of text like all the others, with a property of the “Hyperlink” style, but it is enclosed within a “<w:hyperlink>” element. There is nothing here that indicates where the hyperlink goes to. This information is buried elsewhere in another file within the zip archive and it is time to examine the contents of some of the other files and how all the elements that make up a Word Document hang together.
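
By way of a taster, and no more, here is a sketch in Python of how the two halves might be brought together. It rests on assumptions that I have not yet explained: that the hyperlink element carries an “r:id” attribute, and that the matching target is listed in the word/_rels/document.xml.rels file. Both fragments are invented for illustration, as is the id “rId5”.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

# Invented fragments standing in for word/document.xml and
# word/_rels/document.xml.rels; the id "rId5" is hypothetical
DOCUMENT = (
    f'<w:p xmlns:w="{W}" xmlns:r="{R}"><w:hyperlink r:id="rId5">'
    '<w:r><w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>'
    '<w:t>Open Specification Promise</w:t></w:r></w:hyperlink></w:p>'
)
RELS = (
    f'<Relationships xmlns="{PKG}"><Relationship Id="rId5" '
    f'Type="{R}/hyperlink" '
    'Target="http://www.microsoft.com/openspecifications" '
    'TargetMode="External"/></Relationships>'
)


def hyperlinks(document_root, rels_root):
    """Yield (link text, target or None) for each <w:hyperlink> element."""
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG}}}Relationship")
    }
    for link in document_root.iter(f"{{{W}}}hyperlink"):
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        yield text, targets.get(link.get(f"{{{R}}}id"))


for text, target in hyperlinks(ET.fromstring(DOCUMENT), ET.fromstring(RELS)):
    print(text, "->", target)
```

A link with no matching entry simply comes back with no target; what Word would make of such a thing is another matter.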

That is enough for a first look. I will write more about hyperlinks, as already said, and Styles, later, but, in order to understand them, you need to know something of the basic structure of what are called Packages. I have already mentioned them briefly in my notes about uncompressing Word documents, but will explain more fully in my next outing (which I will link to from here when I have written it!).