Compound Binary Files

Introduction

Whilst writing in a somewhat ad-hoc fashion about the new, OOXML, Documents, I responded to a forum post about how to read the old format documents outside of Office applications. I began my reply with "This should probably be a web page", and so this, the first of several pages on the subject, was born.

This may be of limited interest, but it is not just a historical note. Windows still uses this format of file in some other situations, and VBA projects within OOXML documents are still held in this format, so, if you need to work with VBA code outside Office, read on. Word Documents, themselves, are held in a special format that I hope to expand upon in a further page, later, but they are held in a container, and it is that container that is the subject of this page.

Compound Binary Files

Prior to Office 2007, Word Documents were held in what were called, amongst other things, Compound Binary Files. A Compound Binary File is, essentially, a file system within a file, made up, logically, of Storages (analogous to folders) and Streams (analogous to files).

Viewed as files and folders, this is what an arbitrary Word document might, logically, look like in a compound file:


[folder] MyDocument.doc
    [file] 1Table
    [folder] Macros
        [folder] VBA
            [file] dir
            [file] Module1
            [file] _SRP_0
            [file] _SRP_1
            [file] _SRP_2
            [file] ThisDocument
            [file] _VBA_PROJECT
        [file] PROJECT
        [file] PROJECTwm
    [file] !CompObj
    [file] WordDocument
    [file] !SummaryInformation
    [file] !DocumentSummaryInformation

Here, “Macros” and “VBA” are Storages, and the other elements are Streams. The Storages do not, themselves, hold data, but they provide a structure within which the various Streams exist. The names themselves, which are genuine, probably mean little to you at the moment, although you can surely guess what some of them mean. The exclamation marks represent non-printable characters, but it is not yet time to explain the individual elements.

Physical Sectors

Physically, following a 512‑byte header, Compound Binary Files are made up of sequential Sectors, which can be any size that is a power of two, although, in practice, Word documents almost always have 512‑byte sectors, and I shall ignore other possible sizes here. The physical order of the sectors is arbitrary but they are sequentially numbered, starting from zero, and logically arranged in chains. There is a ‘next sector number’ for each sector, giving the number of the sector that logically follows it.

The next sector numbers, themselves, are stored as a stream, in a chain of sectors in the file. Microsoft, in what appears to be a deliberate attempt to confuse, call this stream a FAT, or File Allocation Table. In a structure where Streams are likened to Files, having a File Allocation Table for Sectors is perverse, and I will call it a Sector Allocation Table, or SAT. The SAT is a fundamental component of the Compound File and, rather obviously, must be located before it can be used; for this reason its sectors cannot be chained through the SAT itself, and an independent chain of pointers is maintained for the sectors of the SAT. This Master SAT begins at a fixed location in the file (offset 76 (0x4c)), and continues to the end of the header. There is an internal chaining mechanism for extending the Master SAT if the file is bigger than can be accommodated using the header alone.

An example may help, so here is the beginning of an arbitrary Word document, as it might be seen in a hex editor:

000000   d0 cf 11 e0  a1 b1 1a e1   00 00 00 00  00 00 00 00 
000010   00 00 00 00  00 00 00 00   3e 00 03 00  fe ff 09 00
000020   06 00 00 00  00 00 00 00   00 00 00 00  04 00 00 00
000030   2e 00 00 00  00 00 00 00   00 10 00 00  30 00 00 00
000040   02 00 00 00  fe ff ff ff   00 00 00 00  2d 00 00 00
000050   6c 00 00 00  e3 00 00 00   80 01 00 00  ff ff ff ff
000060   ff ff ff ff  ff ff ff ff   ff ff ff ff  ff ff ff ff
000070   ff ff ff ff  ff ff ff ff   ff ff ff ff  ff ff ff ff

The sector numbers are doubleword values, 32‑bit numbers. Starting at offset 76 (0x4c), you can see four of these sector numbers (“2d 00 00 00”, “6c 00 00 00”, “e3 00 00 00”, and “80 01 00 00”), followed by a lot of entries full of binary ones, which simply indicate that those entries are not used. This file has four sectors of the SAT, and this value, itself, is also stored in the header: it is the “04 00 00 00” at offset 44 (0x2c).

If you are reading a file like this in code then, just reading four bytes into a 32‑bit numeric variable type will, most likely, give you the correct numeric value (it does depend on exactly how you code, of course). If you are reading the file by eye, as you are here, then you need to know that values are stored in what is called little‑endian format (sorry, I didn’t invent the name, I simply report it). In little‑endian format, the least significant byte of a value is held first, so the individual bytes appear, in a dump, to be held right‑to‑left. If you look at the first entry of the Master SAT, at offset 76 (0x4c) in the dump above, you see “2d 00 00 00”; these represent the numeric value “0x0000002d”, which, here, means sector number (decimal) 45.
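If you would rather let code do the arithmetic, the conversion can be sketched in a few lines of VBA. This is only an illustration: the routine names (HexTextToBytes, LittleEndianLong and DemoLittleEndian) are invented for the example and are not part of the code at the end of this page. The sample bytes are taken from the header dump above: the first Master SAT entry, and the “fe ff ff ff” at offset 0x44. The sum is done in a Double because a VBA Long is signed, and would otherwise overflow on any value with its top bit set.

```vba
' Illustration only: these routine names are invented for this example.

' Convert single-spaced hex text, such as "2d 00 00 00", to a Byte array.
Function HexTextToBytes(ByVal HexText As String) As Byte()
    Dim Parts()     As String
    Dim Result()    As Byte
    Dim ndx         As Long

    Parts = Split(Trim$(HexText), " ")
    ReDim Result(0 To UBound(Parts))
    For ndx = 0 To UBound(Parts)
        Result(ndx) = CByte(Val("&H" & Parts(ndx)))
    Next ndx
    HexTextToBytes = Result
End Function

' Combine four bytes, least significant first, into a signed 32-bit Long.
Function LittleEndianLong(Bytes() As Byte, ByVal Offset As Long) As Long
    Dim Value       As Double

    ' Work in a Double so that a high byte of &H80 or more cannot overflow
    Value = Bytes(Offset) _
          + Bytes(Offset + 1) * 256# _
          + Bytes(Offset + 2) * 65536# _
          + Bytes(Offset + 3) * 16777216#
    If Value > 2147483647# Then Value = Value - 4294967296#
    LittleEndianLong = CLng(Value)
End Function

Sub DemoLittleEndian()
    Dim Bytes()     As Byte

    Bytes = HexTextToBytes("2d 00 00 00 fe ff ff ff")
    Debug.Print LittleEndianLong(Bytes, 0)      ' 45
    Debug.Print LittleEndianLong(Bytes, 4)      ' -2
End Sub
```

Reading straight from the file with Get into a Long variable gives the same results without any of this; the sketch is only meant to show what the bytes mean.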

As described above, sectors, starting from number zero, begin immediately after the file header. Sector number 45, therefore, begins at offset 512 (the length of the header) + (45*512) (the length of sectors zero through 44), which is 23552 (0x5c00). At that offset in this sample file you find:

005c00   01 00 00 00  02 00 00 00   03 00 00 00  04 00 00 00
005c10   05 00 00 00  06 00 00 00   07 00 00 00  08 00 00 00
005c20   8c 01 00 00  0a 00 00 00   0b 00 00 00  0c 00 00 00
005c30   0d 00 00 00  0e 00 00 00   0f 00 00 00  10 00 00 00

The entry at the beginning, “01 00 00 00”, is the number of the sector that logically follows sector 0; the next entry, “02 00 00 00”, is the number of the sector that logically follows sector 1. You can see that, much of the time, sectors are physically ordered in logical order, but not always; if you look at the ninth entry, “8c 01 00 00”, you can see that, for whatever odd reason Word may have had, the sector that logically follows sector 8 is sector number 396 (0x018c).

The rest of the Sector Allocation Table continues in a similar format, all through the chain of sectors that make up the stream that is the SAT, that is, as you saw above, sectors 45 (0x2d), 108 (0x6c), 227 (0xe3), and 384 (0x0180).
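The two calculations described in this section, finding where a sector lives and following a chain from one sector to the next, can be sketched as below. Treat it as an illustration under stated assumptions: the routine names are invented, 512‑byte sectors are assumed throughout, and the little allocation table in the demonstration is made up (it is not taken from my sample file). Any negative entry is treated as ending a chain.

```vba
' Illustration only: names and the tiny allocation table are invented.

' Zero-based file offset of a sector, assuming a 512-byte header and
' 512-byte sectors. (VBA's Seek is one-based, so add 1 when using it.)
Function SectorOffset(ByVal SectorNo As Long) As Long
    SectorOffset = (SectorNo + 1) * 512
End Function

' Follow a chain through an allocation table and return the sector
' numbers, in logical order, as comma-separated text.
Function ChainAsText(Table() As Long, ByVal FirstSector As Long) As String
    Dim SectorNo    As Long
    Dim Steps       As Long
    Dim Result      As String

    SectorNo = FirstSector
    ' The step count guards against a damaged table that loops back on itself
    Do While SectorNo >= 0 And SectorNo <= UBound(Table) And Steps <= UBound(Table)
        If Len(Result) > 0 Then Result = Result & ","
        Result = Result & CStr(SectorNo)
        SectorNo = Table(SectorNo)
        Steps = Steps + 1
    Loop
    ChainAsText = Result
End Function

Sub DemoChain()
    Dim Table(0 To 7) As Long

    ' A made-up table: 0 -> 1 -> 5 -> 2 -> end of chain; the rest unused (-1)
    Table(0) = 1: Table(1) = 5: Table(2) = -2: Table(5) = 2
    Table(3) = -1: Table(4) = -1: Table(6) = -1: Table(7) = -1

    Debug.Print ChainAsText(Table, 0)           ' 0,1,5,2
    Debug.Print SectorOffset(45)                ' 23552
    Debug.Print SectorOffset(46)                ' 24064
End Sub
```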

Logical Streams

You, already, know almost enough to be able (with a little guesswork) to work diligently to rearrange all the sectors of the file into the streams that, collectively, represent the real content of the file. You would, however, not have any real clue what those streams were, and be little closer to unscrambling what you were seeing.

The streams fall, loosely, into two categories: those that are structural, containing information about the file itself, and those that contain file content, which, of course, depends on the type of file that is actually held within the compound file.

Before explaining the individual Streams, you really need to know something of how the logical structure is encoded. There might, for example, be two Streams with the same name, differentiated only by being in different Storages; without knowing the structure, you wouldn't know which Stream to look at. All the information you need is held in the Directory. The Directory is, as you may guess, held as a Stream, but, because it is a slightly special Stream, and it is essential to be able to locate it as a first step in reading the file, it is referenced from a fixed location.

The Directory

The Sector Number of the first sector in the Directory is located at offset 48 (0x30). If you look back at the dump of the beginning of my document, you will see that the value at this location is “2e 00 00 00”, representing the numeric value 46 (0x2e). In this file, the directory begins in sector 46, which can be found at offset 24064 (0x5e00), the location immediately following the header and sectors 0 through 45. At this location you will find:

005e00   52 00 6f 00  6f 00 74 00   20 00 45 00  6e 00 74 00   R.o.o.t. .E.n.t.
005e10   72 00 79 00  00 00 00 00   00 00 00 00  00 00 00 00   r.y.............
005e20   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005e30   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005e40   16 00 05 01  ff ff ff ff   ff ff ff ff  03 00 00 00
005e50   06 09 02 00  00 00 00 00   c0 00 00 00  00 00 00 46
005e60   00 00 00 00  00 00 00 00   00 00 00 00  30 e8 ae 8f
005e70   a9 a1 ca 01  31 00 00 00   40 25 00 00  00 00 00 00

Directory entries are 128 bytes each, and this is the first one. The first 64 bytes are used for the name of the entry; it is held as a null-terminated string, in UCS-2 format, meaning that the maximum length is 31 characters. The number of bytes occupied by the name and the null-terminator together immediately follows the name, at offset 64 within the entry. This is held as a 16-bit unsigned value and, in this example you can see that it has the value 22 (0x0016), which, fortunately, corresponds with the name you can see.

There follow a variety of bit and byte flags and the like, typical of directory entries everywhere, most of which are of little consequence here. One flag you should note is the type of the entry: this is the single byte immediately following the length of the name, at offset 66 within the entry. In this case the type is 0x05, which means that this is Root Storage. The first directory entry, usually, though not necessarily, called Root Entry, is a very special directory entry, for what is both the ultimate parent Storage of the structure, and a very special Stream. The Stream part of it, as you will see shortly, contains several smaller streams, of which I will say nothing more yet – read on, if you wish to find out!

As this is an entry for a stream, albeit a special stream, the two important values you need to take from it are the two 4‑byte entries at offsets 116 (0x74) and 120 (0x78) within the entry. These are the sector number where the stream begins (in this case, sector number 49 (0x31)), and the length of the Stream (in this case, 9536 (0x2540)).
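As a check on the offsets just described, here is a minimal sketch that picks the fields out of the 128 bytes of the Root Entry shown above. The routine names (DumpToBytes, LongAt and DemoDirectoryEntry) are invented for illustration, and the sketch assumes a name length of at least two bytes; it is not the method used by the code at the end of this page, which reads entries directly into a user-defined Type.

```vba
' Illustration only: DumpToBytes, LongAt and DemoDirectoryEntry are invented names.

' Convert single-spaced hex text, such as "52 00 6f 00", to a Byte array
Private Function DumpToBytes(ByVal HexText As String) As Byte()
    Dim Parts()     As String
    Dim Result()    As Byte
    Dim ndx         As Long

    Parts = Split(Trim$(HexText), " ")
    ReDim Result(0 To UBound(Parts))
    For ndx = 0 To UBound(Parts)
        Result(ndx) = CByte(Val("&H" & Parts(ndx)))
    Next ndx
    DumpToBytes = Result
End Function

' Read a little-endian doubleword as a signed Long (via Double to avoid overflow)
Private Function LongAt(Bytes() As Byte, ByVal Offset As Long) As Long
    Dim Value       As Double

    Value = Bytes(Offset) + Bytes(Offset + 1) * 256# _
          + Bytes(Offset + 2) * 65536# + Bytes(Offset + 3) * 16777216#
    If Value > 2147483647# Then Value = Value - 4294967296#
    LongAt = CLng(Value)
End Function

Sub DemoDirectoryEntry()
    Dim Entry()     As Byte
    Dim NameBytes() As Byte
    Dim NameLength  As Long
    Dim EntryName   As String
    Dim ndx         As Long

    Entry = DumpToBytes( _
        "52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00 " & _
        "72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 " & _
        "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " & _
        "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " & _
        "16 00 05 01 ff ff ff ff ff ff ff ff 03 00 00 00 " & _
        "06 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46 " & _
        "00 00 00 00 00 00 00 00 00 00 00 00 30 e8 ae 8f " & _
        "a9 a1 ca 01 31 00 00 00 40 25 00 00 00 00 00 00")

    ' 16-bit name length, in bytes, including the two-byte terminator
    NameLength = Entry(64) + Entry(65) * 256&

    ' Copy the name without its terminator; assigning a Byte array to a
    ' String uses the bytes, unchanged, as the String's two-byte characters
    ReDim NameBytes(0 To NameLength - 3)
    For ndx = 0 To UBound(NameBytes)
        NameBytes(ndx) = Entry(ndx)
    Next ndx
    EntryName = NameBytes

    Debug.Print EntryName                       ' Root Entry
    Debug.Print Entry(66)                       ' 5 (entry type)
    Debug.Print LongAt(Entry, 116)              ' 49 (first sector)
    Debug.Print LongAt(Entry, 120)              ' 9536 (stream length)
End Sub
```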

If you look at the next three entries, which immediately follow this one, and which are shown below, you will see entries for three further streams, the names of which you may recognise from the directory structure shown near the beginning of this page. I leave it to you to study these but, please don't take too long about it, and don't yet try to locate the streams because there is something else you need to know first.

005e80   31 00 54 00  61 00 62 00   6c 00 65 00  00 00 00 00   1.T.a.b.l.e.....
005e90   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005ea0   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005eb0   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005ec0   0e 00 02 00  1b 00 00 00   05 00 00 00  ff ff ff ff
005ed0   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
005ee0   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
005ef0   00 00 00 00  09 00 00 00   cc 7a 00 00  00 00 00 00

005f00   57 00 6f 00  72 00 64 00   44 00 6f 00  63 00 75 00   W.o.r.d.D.o.c.u.
005f10   6d 00 65 00  6e 00 74 00   00 00 00 00  00 00 00 00   m.e.n.t.........
005f20   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005f30   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005f40   1a 00 02 01  01 00 00 00   ff ff ff ff  ff ff ff ff
005f50   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
005f60   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
005f70   00 00 00 00  00 00 00 00   37 64 00 00  00 00 00 00

005f80   05 00 53 00  75 00 6d 00   6d 00 61 00  72 00 79 00   ..S.u.m.m.a.r.y.
005f90   49 00 6e 00  66 00 6f 00   72 00 6d 00  61 00 74 00   I.n.f.o.r.m.a.t.
005fa0   69 00 6f 00  6e 00 00 00   00 00 00 00  00 00 00 00   i.o.n...........
005fb0   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005fc0   28 00 02 01  02 00 00 00   04 00 00 00  ff ff ff ff
005fd0   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
005fe0   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
005ff0   00 00 00 00  1d 00 00 00   00 10 00 00  00 00 00 00

You have now seen four directory entries, each 128 bytes long. If you have been quietly totting up the total length of these (the sort of thing I do, but most people probably don’t!), you will have realised that you have reached the end of a file sector; if you want to see more of the directory, you need to go to the Sector Allocation Table to find the number of the sector that follows sector 46, which, in this file, is sector 47. This is my file, and you can't see it, so knowing the sector numbers is rather pointless; suffice to say there are a few, widely separated in the physical file.

Storages

Storages, being analogous to Folders, do not, themselves, contain data; they are, nonetheless, important, and you do need to know enough to read the structure from them. Each directory entry has a type; there are half a dozen possible types but the only ones you should expect to find are type 5, which you saw earlier, for the first directory entry, type 1 for a storage, type 2 for a stream, and, perhaps, type 0 for an unused directory entry.

Primarily for performance reasons, the directory structure for each Storage is maintained in what is called a red‑black tree. This tree structure does not represent anything of the structure of Storages and Streams within the file, and is simply a way of organising the directory. A reference to what is, in effect, the arbitrary child at the top of this tree, is held in the directory entry for the Storage (at offset 76 (0x4c)), and every child of the Storage (whether, itself, another Storage, or a Stream) can be found by navigating the tree. Navigating the tree is done by following a maximum of two pointers to further children of the parent Storage, called the Left Sibling (at offset 68 (0x44)) and Right Sibling (at offset 72 (0x48)). An example will, I hope, help to explain this: here is the first directory entry for a simple document.

005e00   52 00 6f 00  6f 00 74 00   20 00 45 00  6e 00 74 00   R.o.o.t. .E.n.t.
005e10   72 00 79 00  00 00 00 00   00 00 00 00  00 00 00 00   r.y.............
005e20   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005e30   00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00   ................
005e40   16 00 05 01  ff ff ff ff   ff ff ff ff  03 00 00 00
005e50   06 09 02 00  00 00 00 00   c0 00 00 00  00 00 00 46
005e60   00 00 00 00  00 00 00 00   00 00 00 00  60 56 64 7b
005e70   07 39 cd 01  25 00 00 00   80 24 00 00  00 00 00 00

The values of interest here are the entry type (the byte “05” at offset 66 (0x42)) and the three references (the doublewords at offsets 68, 72, and 76). The first two of the references you will see as being 0xffffffff (equivalent to ‑1), effectively null references: this entry, you will remember, is for the single root of the whole internal structure and has no siblings. The third reference is “3”, a pointer to directory entry number 3; directory entries are given simple sequential numbers for the purposes of these references, with the root storage being entry number 0. The relevant detail from this entry can be summarised thus (with all the numbers being hexadecimal):

Summary of a Directory Entry

Number  Name        Type  Left Sibling  Right Sibling  Root Child
                          Reference     Reference      Reference
00      Root Entry  05    (null)        (null)         03
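Because directory entries are simply numbered in sequence, finding entry number N is a matter of arithmetic: with 512‑byte sectors and 128‑byte entries there are four entries to a sector. A minimal sketch, with an invented routine name, and assuming those two sizes:

```vba
' Illustration only: the routine names are invented for this example.

' For a given directory entry number, work out which sector of the
' directory's chain holds it (counting from zero), and where in that
' sector the entry starts.
Sub LocateDirectoryEntry(ByVal EntryNo As Long, _
                         ByRef ChainIndex As Long, _
                         ByRef OffsetInSector As Long)
    Const EntriesPerSector  As Long = 512 \ 128

    ChainIndex = EntryNo \ EntriesPerSector
    OffsetInSector = (EntryNo Mod EntriesPerSector) * 128
End Sub

Sub DemoLocate()
    Dim ChainIndex      As Long
    Dim OffsetInSector  As Long

    LocateDirectoryEntry 5, ChainIndex, OffsetInSector
    Debug.Print ChainIndex                      ' 1
    Debug.Print OffsetInSector                  ' 128
End Sub
```

The chain index is a position in the directory's chain of sectors, not a sector number; the sector number itself still has to be found by following the chain through the SAT.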

There is no point in stepping through the process of walking the sector allocations to find the whole directory. I have done this chore, and extracted all the relevant information, and in my sample document, which has a minimal VBA project, here it is:

Summary of a Directory

Number  Name                        Type  Left Sibling  Right Sibling  Root Child
                                          Reference     Reference      Reference
00      Root Entry                  5     (null)        (null)         03
01      1Table                      2     (null)        (null)         (null)
02      WordDocument                2     05            (null)         (null)
03      SummaryInformation          2     02            04             (null)
04      DocumentSummaryInformation  2     (null)        (null)         (null)
05      Macros                      1     01            0d             0c
06      VBA                         1     (null)        (null)         07
07      ThisDocument                2     08            09             (null)
08      Module1                     2     0a            (null)         (null)
09      _VBA_PROJECT                2     (null)        (null)         (null)
0a      dir                         2     (null)        (null)         (null)
0b      PROJECTwm                   2     (null)        (null)         (null)
0c      PROJECT                     2     06            0b             (null)
0d      CompObj                     2     (null)        (null)         (null)

You have already seen the Root Entry, with its solitary child reference: this is to entry number 03, which you now see is the SummaryInformation Stream (note that, for purely aesthetic reasons, I have left the markers I used earlier off the names of the streams that begin with odd, non-printable, characters). If you look at this entry, you can see that it is for a stream (type 2) and so has no children, but that it has two siblings, entries numbers 02 and 04. Entry number 02 (WordDocument) has a single (left) sibling of entry number 05, whilst entry number 04 (DocumentSummaryInformation) has none. Entry number 05 (Macros), in its turn, is a Storage (type 1) and has a child; it also has two siblings, entries numbers 01 and 0d, neither of which happens to have further siblings. Listing all the elements of this tree as a tree gives this:

                  03 SummaryInformation
                 /                     \
        02 WordDocument         04 DocumentSummaryInformation
           /
      05 Macros
       /      \
 01 1Table   0d CompObj

The tree, itself, you will recall, is of no particular importance; it simply contains all the storages and streams that are immediately below the root. The information it provides is better represented as a folder structure. Reading the tree from left to right, formatting the result, and adding a few icons gives:


[folder] (Root Storage container)
    [file] 1Table
    [folder] Macros
    [file] !CompObj
    [file] WordDocument
    [file] !SummaryInformation
    [file] !DocumentSummaryInformation

Look back, now, at the directory entry for the Macros storage. You will recall that, as well as the sibling references that you followed, it also had a root child reference. The arbitrary first child of Macros is entry number 0c, the PROJECT stream; this has two siblings, entries numbers 06 and 0b, the VBA storage and the PROJECTwm stream. Neither of these has further siblings, so that completes the list of elements immediately within the Macros structure. Adding these to the folder structure gives this:


[folder] (Root Storage container)
    [file] 1Table
    [folder] Macros
        [folder] VBA
        [file] PROJECT
        [file] PROJECTwm
    [file] !CompObj
    [file] WordDocument
    [file] !SummaryInformation
    [file] !DocumentSummaryInformation

Repeating the procedure again for the VBA storage, which, I’m sure, you can do yourself, completes the navigation of this, small, directory, and provides a structure of this:


[folder] (Root Storage container)
    [file] 1Table
    [folder] Macros
        [folder] VBA
            [file] dir
            [file] Module1
            [file] ThisDocument
            [file] _VBA_PROJECT
        [file] PROJECT
        [file] PROJECTwm
    [file] !CompObj
    [file] WordDocument
    [file] !SummaryInformation
    [file] !DocumentSummaryInformation
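The procedure just followed by hand (left sibling, then the entry itself and, if it is a Storage, its children, then right sibling) lends itself to a short recursive routine. The sketch below is an illustration only: all the names are invented, and the directory is typed in from the summary table above rather than read from a file, with ‑1 standing for a null reference. Paste it into an empty module, as the declarations must come first.

```vba
Option Explicit

' Illustration only: all names here are invented for this example.
Private EntryNames(0 To 13) As String
Private EntryTypes(0 To 13) As Long
Private LeftSib(0 To 13)    As Long
Private RightSib(0 To 13)   As Long
Private RootChild(0 To 13)  As Long

Private Sub AddEntry(ByVal n As Long, ByVal EntryName As String, _
                     ByVal EntryType As Long, ByVal LeftRef As Long, _
                     ByVal RightRef As Long, ByVal ChildRef As Long)
    EntryNames(n) = EntryName: EntryTypes(n) = EntryType
    LeftSib(n) = LeftRef: RightSib(n) = RightRef: RootChild(n) = ChildRef
End Sub

' Visit the left sibling, then this entry (descending into its children
' if it is a storage, type 1), then the right sibling.
Private Sub ListTree(ByVal n As Long, ByVal Depth As Long, ByRef Listing As String)
    If n < 0 Then Exit Sub                      ' null reference
    ListTree LeftSib(n), Depth, Listing
    Listing = Listing & Space$(Depth * 4) & EntryNames(n) & vbCrLf
    If EntryTypes(n) = 1 Then ListTree RootChild(n), Depth + 1, Listing
    ListTree RightSib(n), Depth, Listing
End Sub

Function DirectoryListing() As String
    Dim Listing     As String

    AddEntry 0, "Root Entry", 5, -1, -1, 3
    AddEntry 1, "1Table", 2, -1, -1, -1
    AddEntry 2, "WordDocument", 2, 5, -1, -1
    AddEntry 3, "SummaryInformation", 2, 2, 4, -1
    AddEntry 4, "DocumentSummaryInformation", 2, -1, -1, -1
    AddEntry 5, "Macros", 1, 1, 13, 12
    AddEntry 6, "VBA", 1, -1, -1, 7
    AddEntry 7, "ThisDocument", 2, 8, 9, -1
    AddEntry 8, "Module1", 2, 10, -1, -1
    AddEntry 9, "_VBA_PROJECT", 2, -1, -1, -1
    AddEntry 10, "dir", 2, -1, -1, -1
    AddEntry 11, "PROJECTwm", 2, -1, -1, -1
    AddEntry 12, "PROJECT", 2, 6, 11, -1
    AddEntry 13, "CompObj", 2, -1, -1, -1

    ListTree RootChild(0), 0, Listing           ' start at the root's child
    DirectoryListing = Listing
End Function

Sub DemoDirectoryListing()
    Debug.Print DirectoryListing
End Sub
```

Run against this data, the listing comes out in the same order as the folder structure above, with each level of nesting indented by four spaces.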

Short Streams

The final piece of information you need, to be able to find your way around the compound file that may contain a pre-2007 Word document, is that all streams are not equal. If a stream is less than 4096 bytes long (the actual trigger length is encoded in the file header but it is always 4096 in practice), it is considered to be a Short Stream and not worthy of its own place in the main physical structure. Short streams are held, you won’t be surprised to learn, in Short Sectors.

The Root Storage, that you saw referenced from the Root Entry directory entry above, is almost a little baby compound file all by itself. It is a stream in the main file, but it consists entirely of Short Sectors of 64 bytes each (as with other lengths, the actual length is encoded in the header, but it is always 64). The 64-byte Short Sectors have a Short Sector Allocation Table (SSAT) all to themselves, which Microsoft, in their own inimitable way, call the MiniFAT. This works in exactly the same way as the SAT for the normal sectors, that is it is an array with one entry for each short sector containing the number of the following short sector. Short sectors start at number zero, which is the one at the very beginning (that is, at offset zero) of the Root Storage stream. The SSAT is a normal stream in the file, the number of the first sector of which is held at offset 60 (0x3c) in the main file header; in the case of my sample document, this is sector 48 (0x30).

You may remember, from when you first saw the directory entries, that one of the data stored there was the number of the first sector where Stream data began. I said there was something else you needed to know. That something is that the sector number may be of a normal sector or a short sector, and the only way to tell is by the length of the stream, also held in the directory entry. If the stream is less than 4096 bytes long, the sector number is of a short sector, and, if 4096 bytes or longer, it is of a normal sector.
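The test, and the position of a short sector within the Root Storage stream, can be sketched as follows. The routine names are invented for illustration; the defaults of 4096 and 64 are the usual values described above, and more careful code would take them from the file header instead. The stream lengths in the demonstration are those of the SummaryInformation and 1Table entries dumped earlier, plus one made-up small stream.

```vba
' Illustration only: these routine names are invented for this example.

' True if a stream of this length is held in short sectors
Function UsesShortSectors(ByVal StreamLength As Long, _
                          Optional ByVal Cutoff As Long = 4096) As Boolean
    UsesShortSectors = (StreamLength < Cutoff)
End Function

' Zero-based offset of a short sector within the Root Storage stream
Function ShortSectorOffset(ByVal ShortSectorNo As Long, _
                           Optional ByVal ShortSectorSize As Long = 64) As Long
    ShortSectorOffset = ShortSectorNo * ShortSectorSize
End Function

Sub DemoShortStreams()
    Debug.Print UsesShortSectors(4096)          ' False (SummaryInformation)
    Debug.Print UsesShortSectors(31436)         ' False (1Table)
    Debug.Print UsesShortSectors(500)           ' True  (a made-up small stream)
    Debug.Print ShortSectorOffset(3)            ' 192
End Sub
```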

The Data

Having waded through a couple of screens' worth of description of a relatively complex structure that, I hope, has been understandable, you may be itching to find out a little about the data, the actual content of the file. I'm afraid I'm probably going to disappoint you, because I am not going to give you full details of how the actual content of your document is stored: it is just too complex, and it is fading into history. I do intend, however, to write more about some of the streams. Meanwhile, a brief synopsis of what they are will have to suffice.

Looking at the structure that you extracted from the directory above, which is not in any particular order, you can see streams called WordDocument and 1Table. These represent your document; there may also, or as an alternative to the 1Table stream, be a 0Table stream, and perhaps a Data stream. Although I may write a little more about these streams, I do not intend to provide great detail.

Next, the three streams that have names starting with unprintable characters, which I have represented with exclamation marks. These hold information about the document, in part what you see if you view the properties of the file in Windows Explorer. I will add some details of these shortly.

The remainder of your document is the structure in the Macros storage. I will provide (fairly) full detail of this, as it remains current, but working with it involves several different processes and is too much for one page. The links below will take you to pages with more detail, and will be added to as I complete more pages.

⇒ Extracting the VBA Storage structure [link to VBAStorage.php on this site]

⇒ Stream (de‑)compression [link to StreamCompression.php on this site]

The Code

VBA is what I do; I couldn't leave the page without offering some code! The following code will read a Compound Binary File and extract structural details from it. It, then, looks for and extracts the stream called “dir”. This stream was chosen for the demonstration simply because it was relevant to the forum posting that triggered the writing of this page. Decompression, and interpretation, of the “dir” stream will have to wait a week or two. I may provide some more detail later but, for the moment, the comments in the code are all I offer, along, of course, with the article you have just read, which should be sufficient for an understanding.

If you are returning, having been here before, you may notice that the code has changed. There is no functional difference but I have re-organised it as a precursor for using it as a basis for further code on further pages.

Option Explicit

' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
' These definitions use field names as documented by Microsoft; I make no comment other '
' than the obvious: that they need explanation. Enumerations in VBA are, implicitly,    '
' Long, so the (Microsoft) one shown here is used for comparands but not for definition '
' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
Private Enum Directory_EntryType
    STGTY_Invalid
    STGTY_Storage
    STGTY_Stream
    STGTY_Lockbytes
    STGTY_Property
    STGTY_Root
End Enum

Private Type StructuredStorageHeader        ' Microsoft definition for File Header
    abSig(0 To 7)           As Byte         ' 0xd0 0xcf 0x11 0xe0 0xa1 0xb1 0x1a 0xe1
    clid(0 To 15)           As Byte
    uMinorVersion           As Integer
    uDllVersion             As Integer
    uByteOrder              As Integer      ' 0xFFFE (0xFE 0xFF = Intel 'little-endian')
    uSectorShift            As Integer      ' Log(2) of Sector Size (usually 9, = 512-byte sectors)
    uMiniSectorShift        As Integer      ' Log(2) of Small Sector Size (usually 6, = 64-byte)
    usReserved              As Integer      ' Must be zero
    ulReserved1             As Long         ' Must be zero
    ulReserved2             As Long         ' Must be zero
    csectFAT                As Long         ' Number of sectors used by the SAT
    sectDirStart            As Long         ' First Sector of the Directory
    ulDFsignature           As Long         ' Must be zero
    ulMiniSectorCutoff      As Long         ' Usually 4096
    sectMiniFatStart        As Long         ' First Sector of the SSAT
    csectMiniFat            As Long         ' Number of sectors used by the SSAT
    sectDifStart            As Long         ' First MSAT continuation sector
    csectDif                As Long         ' Number of MSAT continuation sectors
    sectFat(0 To 108)       As Long         ' First 109 entries of the MSAT
End Type

Private Type DirectoryEntry                 ' Microsoft definition for Directory Entry
    ab                      As String * 64  ' Entry Name
    cb                      As Integer      ' Entry Name length (including trailing null)
                                            ' Microsoft explicitly say characters, but it is bytes
    mse                     As Byte         ' Entry Type.
                                            ' Enum above. Can't declare Byte as Enum in VBA
    bflags                  As Byte         ' 0 = Red, 1 = Black
    sidLeftSib              As Long         ' Id of Left Sibling Directory Entry
    sidRightSib             As Long         ' Id of Right Sibling Directory Entry
    sidChild                As Long         ' Id of Root Child Directory Entry
    clsid(0 To 15)          As Byte
    dwUserFlags             As Long
    time(0 To 1, 0 To 7)    As Byte
    sectStart               As Long         ' First Sector (or Short Sector) of Stream
    ulSize                  As Long         ' Length of Stream
    dptPropType             As Long         ' Zero
End Type

' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
' VBA does not have very good facilities for working with free format data, so a mixed  '
' bag of definitions of 512 bytes, each to be used when appropriate, are defined.       '
' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '

Private Type NonSpecificSector
    SectorByte(0 To 511)    As Byte
End Type
Private Type SectorOfDWords
    DWord(0 To 127)         As Long
End Type

Private Type ShortSector
    SSByte(0 To 63)         As Byte
End Type
Private Type SectorOfShortSectors
    ShortSector(0 To 7)     As ShortSector
End Type
Private Type StreamOfShortSectors
    ShortSector()           As ShortSector
End Type

Private Type DirectorySector
    DirectoryEntry(0 To 3)  As DirectoryEntry
End Type


Private Type DirectoryEntryLocal
    EntryName               As String
    EntryType               As Directory_EntryType
    RedOrBlack              As Byte
    LeftSibling             As Long
    RightSibling            As Long
    RootChild               As Long
    FirstSector             As Long
    StreamLength            As Long
End Type

' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
' Some module level declarations make life a little easier.                             '
' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '

Private FileNo              As Long

Private FileHeader          As StructuredStorageHeader
Private SAT()               As Long
Private SSAT()              As Long
Private Directory()         As DirectoryEntryLocal
Private SSStream            As StreamOfShortSectors
 
Sub SimpleTestDriver() Dim FilePath As String Dim FileName As String Dim Stream() As Byte FilePath = "C:\Path\To\Your\" FileName = "Document.doc" ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' There are various ways to read files of binary data, none of which are ' ' entirely satisfactory. Here, the simple built-in mechanism is used. ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' FileNo = FreeFile Open FilePath & FileName For Binary As FileNo ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Positioning the file at the beginning for clarity, read the File Header. ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Seek FileNo, 1 Get FileNo, , FileHeader ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Gather the structural elements of the file, to enable what follows. ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ExtractSAT ExtractSSAT ExtractDirectory ExtractShortSectorStream ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Everything is now in place to extract any Stream we want by following the ' ' appropriate pointers. For this example, the "dir" stream is extracted. ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Stream = ExtractStream("dir") ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' For this example, the file is no longer needed, so close it. ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' Close FileNo ' Do whatever you want with the stream now End Sub
Function ExtractStream(StreamName As String) _ As Byte() Dim UseShortStream As Boolean Dim Stream() As Byte Dim Sectorndx As Long Dim Sector As NonSpecificSector Dim SectorNo As Long Dim SectorLen As Long Dim ndx As Long Dim ndx2 As Long ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' This routine scans the directory for an entry for a requested stream, and, ' ' assuming it's found, then extracts the data for that stream. ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' For ndx = LBound(Directory) To UBound(Directory) If Directory(ndx).EntryType = STGTY_Stream Then If Directory(ndx).EntryName = StreamName Then Exit For End If End If Next If ndx > UBound(Directory) Then Exit Function UseShortStream = (Directory(ndx).StreamLength < 4096) SectorLen = IIf(UseShortStream, Len(SSStream.ShortSector(0)), Len(Sector)) ReDim Stream(0 To SectorLen - 1) Sectorndx = 0 SectorNo = Directory(ndx).FirstSector Do While SectorNo > 0 If UseShortStream Then For ndx2 = 0 To SectorLen - 1 Stream(Sectorndx + ndx2) = _ SSStream.ShortSector(SectorNo).SSByte(ndx2) Next Else Seek FileNo, (SectorNo + 1) * SectorLen + 1 Get FileNo, , Sector For ndx2 = 0 To SectorLen - 1 Stream(Sectorndx + ndx2) = Sector.SectorByte(ndx2) Next End If Sectorndx = Sectorndx + SectorLen ReDim Preserve Stream(LBound(Stream) To UBound(Stream) + SectorLen) If UseShortStream Then SectorNo = SSAT(SectorNo) Else SectorNo = SAT(SectorNo) End If Loop ReDim Preserve Stream(LBound(Stream) To Directory(ndx).StreamLength - 1) ExtractStream = Stream End Function
Sub ExtractSAT()

    Dim MSAT() As Long
    Dim MSATndx As Long
    Dim SATndx As Long
    Dim Sector As SectorOfDWords
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long

    ' First get the Master Sector Allocation Table. Make the array big enough for
    ' the maximum possible entries because the code logic is easier that way.
    ' Read the first 109 entries from the header, then as many complete sectors
    ' as there are, before redefining the array to drop the excess entries.
    ' Note: The construct processed here is not explained in the article.
    ReDim MSAT(0 To FileHeader.csectDif * 127 + 108)
    MSATndx = 0

    For MSATndx = 0 To 108
        MSAT(MSATndx) = FileHeader.sectFat(MSATndx)
    Next MSATndx

    SectorNo = FileHeader.sectDifStart
    Do While SectorNo >= 0
        Seek FileNo, (SectorNo + 1) * 512 + 1
        Get FileNo, , Sector
        For ndx = 0 To 126
            MSAT(MSATndx + ndx) = Sector.DWord(ndx)
        Next ndx
        MSATndx = MSATndx + 127
        SectorNo = Sector.DWord(127)
    Loop

    ReDim Preserve MSAT(0 To FileHeader.csectFAT - 1) ' Right size

    ' Read the sectors in the MSAT to build up the full Sector Allocation Table.
    ' The final sector may be padded with values of -1, but they should never be
    ' referenced. Once the SAT has been built, the MSAT is no longer needed.
    SectorLen = 512
    SectorEntries = 128 ' could calculate
    ReDim SAT(0 To SectorEntries * (UBound(MSAT) - LBound(MSAT) + 1) - 1)
    SATndx = 0

    For MSATndx = LBound(MSAT) To UBound(MSAT)
        Seek FileNo, (MSAT(MSATndx) + 1) * SectorLen + 1
        Get FileNo, , Sector
        For ndx = LBound(Sector.DWord) To UBound(Sector.DWord)
            SAT(SATndx + ndx) = Sector.DWord(ndx)
        Next
        SATndx = SATndx + SectorEntries
    Next MSATndx

    ' The final sector won't be full, but will be padded, as on file.

End Sub
Sub ExtractSSAT()

    Dim SSATndx As Long
    Dim Sector As SectorOfDWords
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long

    ' The first sector of the Short Sector Allocation Table is held in the Header.
    ' The rest are then chained through the SAT, built in the previous step. The
    ' end-of-chain is indicated by a value of -2 (0xFFFFFFFE) in the SAT. As with
    ' the SAT, the final sector may be padded with values of -1, but they should
    ' never be referenced.
    SectorLen = 512
    SectorEntries = 128 ' could calculate
    ReDim SSAT(0 To SectorEntries * FileHeader.csectMiniFat - 1)
    SSATndx = 0

    SectorNo = FileHeader.sectMiniFatStart
    Do While SectorNo >= 0
        Seek FileNo, (SectorNo + 1) * SectorLen + 1
        Get FileNo, , Sector
        For ndx = LBound(Sector.DWord) To UBound(Sector.DWord)
            SSAT(SSATndx + ndx) = Sector.DWord(ndx)
        Next
        SSATndx = SSATndx + SectorEntries
        SectorNo = SAT(SectorNo)
    Loop

    ' The final sector won't be full, but will be padded, as on file.

End Sub
Sub ExtractDirectory()

    Dim Sector As DirectorySector
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long
    Dim var As Variant

    ' The first sector of the Directory is pointed to from the Header. The rest
    ' are then chained through the SAT, built in an earlier step. The
    ' end-of-chain is indicated by a value of -2 (0xFFFFFFFE) in the SAT. Each
    ' sector holds four directory entries; any entry that is not a Root Storage,
    ' a Storage, or a Stream is skipped.
    SectorLen = 512
    SectorEntries = 4 ' could calculate
    Erase Directory

    SectorNo = FileHeader.sectDirStart
    Do While SectorNo >= 0

        Seek FileNo, (SectorNo + 1) * SectorLen + 1
        Get FileNo, , Sector

        For ndx = LBound(Sector.DirectoryEntry) To UBound(Sector.DirectoryEntry)
            With Sector.DirectoryEntry(ndx)
                Select Case .mse
                    Case STGTY_Root, STGTY_Storage, STGTY_Stream
                        ' (Not Directory) = True detects an array that has not
                        ' yet been dimensioned.
                        If (Not Directory) = True Then
                            ReDim Directory(0 To 0)
                        Else
                            ReDim Preserve Directory(LBound(Directory) To UBound(Directory) + 1)
                        End If
                        Directory(UBound(Directory)).EntryName _
                            = StrConv(Left(.ab, .cb - 2), vbFromUnicode)
                        Directory(UBound(Directory)).EntryType = .mse
                        Directory(UBound(Directory)).LeftSibling = .sidLeftSib
                        Directory(UBound(Directory)).RightSibling = .sidRightSib
                        Directory(UBound(Directory)).RedOrBlack = .bflags
                        Directory(UBound(Directory)).RootChild = .sidChild
                        Directory(UBound(Directory)).FirstSector = .sectStart
                        Directory(UBound(Directory)).StreamLength = .ulSize
                End Select
            End With
        Next

        SectorNo = SAT(SectorNo)

    Loop

End Sub
Sub ExtractShortSectorStream()

    Dim SSndx As Long
    Dim Sector As SectorOfShortSectors
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long

    ' The first sector of the Short Sector Stream is pointed to from the Root
    ' Storage directory entry; the rest are chained through the SAT. It seems as
    ' if one extra, unused, sector is allocated at the end of the stream; this is
    ' simple observation, and I cannot confirm it to always be the case.
    SectorLen = 512
    SectorEntries = 8
    ReDim SSStream.ShortSector(0 To SectorEntries - 1)
    SSndx = 0

    SectorNo = Directory(0).FirstSector
    Do While SectorNo >= 0

        Seek FileNo, (SectorNo + 1) * SectorLen + 1
        Get FileNo, , Sector

        With SSStream
            For ndx = LBound(Sector.ShortSector) To UBound(Sector.ShortSector)
                .ShortSector(SSndx + ndx) = Sector.ShortSector(ndx)
            Next
            SSndx = SSndx + SectorEntries
            ReDim Preserve .ShortSector(LBound(.ShortSector) To UBound(.ShortSector) + SectorEntries)
        End With

        SectorNo = SAT(SectorNo)

    Loop

End Sub
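The extraction routines above all do the same thing at heart: starting from a first sector number, they look up each sector's successor in an allocation table until a negative value is reached. None of them guards against a damaged file in which a chain points outside the table or loops back on itself. What follows is a minimal sketch, not part of the module above, of how that walk might be isolated and guarded. The names ChainSectors, ChainSectorsDemo, and END_OF_CHAIN are invented here for illustration, and the table in the demo is made up rather than read from a real file.

```vba
Option Explicit

' Hypothetical constant: the end-of-chain marker, 0xFFFFFFFE read as a signed Long.
Private Const END_OF_CHAIN As Long = -2

' Hypothetical helper: returns, in order, the sector numbers that make up one
' chain, given an allocation table (a SAT or an SSAT) and the first sector number.
Function ChainSectors(Table() As Long, ByVal FirstSector As Long) As Collection

    Dim Result As New Collection
    Dim SectorNo As Long
    Dim Steps As Long
    Dim MaxSteps As Long

    ' A valid chain can never visit more sectors than the table has entries.
    MaxSteps = UBound(Table) - LBound(Table) + 1

    SectorNo = FirstSector
    Do While SectorNo >= 0

        If SectorNo < LBound(Table) Or SectorNo > UBound(Table) Then
            Err.Raise vbObjectError + 513, "ChainSectors", _
                      "Sector number " & SectorNo & " is outside the table"
        End If

        Steps = Steps + 1
        If Steps > MaxSteps Then
            Err.Raise vbObjectError + 514, "ChainSectors", _
                      "The chain is longer than the table, so it must loop"
        End If

        Result.Add SectorNo
        SectorNo = Table(SectorNo)

    Loop

    Set ChainSectors = Result

End Function

Sub ChainSectorsDemo()

    Dim Table(0 To 5) As Long
    Dim Chain As Collection
    Dim Item As Variant

    ' A made-up table: sector 0 leads to 2, which leads to 3, which ends the
    ' chain. Sector 1 is a one-sector chain; sectors 4 and 5 are unused (-1).
    Table(0) = 2
    Table(1) = END_OF_CHAIN
    Table(2) = 3
    Table(3) = END_OF_CHAIN
    Table(4) = -1
    Table(5) = -1

    Set Chain = ChainSectors(Table, 0)
    For Each Item In Chain
        Debug.Print Item      ' prints 0, then 2, then 3
    Next

End Sub
```

Used with the module above, a call such as ChainSectors(SAT, FileHeader.sectDirStart) would list the Directory's sectors before any of them is read, so a corrupt chain raises an error rather than running off the end of an array or looping for ever.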