Compound Binary Files
Old Formats
Whilst writing in a somewhat ad-hoc fashion about the new, OOXML, Documents, I responded to a forum post about how to read the old format documents outside of Office applications. I began my reply with "This should probably be a web page", and so this, the first of several pages on the subject, was born.
This may be of limited interest, but it is not just a historical note. Windows still uses this format of file in some other situations, and VBA projects within OOXML documents are still held in this format, so, if you need to work with VBA code outside Office, read on. Word Documents, themselves, are held in a special format that I hope to expand upon in a further page, later, but they are held in a container, and it is that container that is the subject of this page.
Prior to Office 2007, Word Documents were held in what were called, amongst other things, Compound Binary Files. A Compound Binary File is, essentially, a file system within a file, made up, logically, of Storages (analogous to folders) and Streams (analogous to files).
Viewed as files and folders, this is what an arbitrary Word document might, logically, look like in a compound file:
MyDocument.doc
    1Table
    Macros
        VBA
            dir
            Module1
            _SRP_0
            _SRP_1
            _SRP_2
            ThisDocument
            _VBA_PROJECT
        PROJECT
        PROJECTwm
    !CompObj
    WordDocument
    !SummaryInformation
    !DocumentSummaryInformation
Here, “Macros” and “VBA” are Storages, and the other elements are Streams. The Storages do not, themselves, hold data, but they provide a structure within which the various Streams exist. The names themselves, which are genuine, probably mean little to you at the moment, although you can surely guess what some of them mean. The exclamation marks represent non-printable characters, but it is not yet time to explain the individual elements.
Physically, following a 512‑byte header, Compound Binary Files are made up of sequential Sectors, which can be any size that is a power of two, although, in practice, Word documents almost always have 512‑byte sectors, and I shall ignore other possible sizes here. The physical order of the sectors is arbitrary but they are sequentially numbered, starting from zero, and logically arranged in chains. There is a ‘next sector number’ for each sector, giving the number of the sector that logically follows it.
The next sector numbers, themselves, are stored as a stream, in a chain of sectors in the file. Microsoft, in what appears to be a deliberate attempt to confuse, call this stream a FAT, or File Allocation Table. In a structure where Streams are likened to Files, having a File Allocation Table for Sectors is perverse, and I will call it a Sector Allocation Table, or SAT. The SAT is a fundamental component of the Compound File and, rather obviously, must be located before it can be used; for this reason its sectors cannot be chained through the SAT itself, and an independent chain of pointers is maintained for the sectors of the SAT. This Master SAT begins at a fixed location in the file (offset 76 (0x4c)), and continues to the end of the header. There is an internal chaining mechanism for extending the Master SAT if the file is bigger than can be accommodated using the header alone.
An example may help, so here is the beginning of an arbitrary Word document, as it might be seen in a hex editor:
000000 d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00
000010 00 00 00 00 00 00 00 00 3e 00 03 00 fe ff 09 00
000020 06 00 00 00 00 00 00 00 00 00 00 00 04 00 00 00
000030 2e 00 00 00 00 00 00 00 00 10 00 00 30 00 00 00
000040 02 00 00 00 fe ff ff ff 00 00 00 00 2d 00 00 00
000050 6c 00 00 00 e3 00 00 00 80 01 00 00 ff ff ff ff
000060 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
000070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
The sector numbers are doubleword values, 32‑bit numbers. Starting at offset 0x4c, you can see four of these sector numbers, followed by a lot of entries full of binary ones, which simply indicate that those entries are not used. This file has four sectors of the SAT, and this count is, itself, also stored in the header, at offset 44 (0x2c), where you can see “04 00 00 00”.
If you are reading a file like this in code then simply reading four bytes into a 32‑bit numeric variable will, most likely, give you the correct numeric value (it does depend on exactly how you code, of course). If you are reading the file by eye, as you are here, then you need to know that values are stored in what is called little‑endian format (sorry, I didn’t invent the name, I simply report it). In little‑endian format, the least significant byte of a value is held first, so the bytes appear in the reverse of the order in which you would write the number. If you look at the first of the four entries at offset 0x4c, you see “2d 00 00 00”; these represent the numeric value “0x0000002d”, which, here, means sector number (decimal) 45.
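As a rough illustration of reading these header values in code, here is a minimal sketch in Python (not the VBA used later on this page). The bytes are copied from the dump above; the helper name `read_u32` and the variable names are my own, and the sketch assumes the SAT needs no more sectors than fit in the header.

```python
import struct

# The first 128 bytes of the sample file, copied from the hex dump above.
DUMP = """
d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 3e 00 03 00 fe ff 09 00
06 00 00 00 00 00 00 00 00 00 00 00 04 00 00 00
2e 00 00 00 00 00 00 00 00 10 00 00 30 00 00 00
02 00 00 00 fe ff ff ff 00 00 00 00 2d 00 00 00
6c 00 00 00 e3 00 00 00 80 01 00 00 ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
"""
header = bytes.fromhex("".join(DUMP.split()))


def read_u32(data, offset):
    """Return the unsigned little-endian 32-bit value stored at offset."""
    return struct.unpack_from("<I", data, offset)[0]


sat_sector_count = read_u32(header, 0x2C)   # how many sectors the SAT uses
directory_start = read_u32(header, 0x30)    # first sector of the Directory
# Master SAT entries start at 0x4C; only the used ones are read here.
master_sat = [read_u32(header, 0x4C + 4 * i) for i in range(sat_sector_count)]

print(sat_sector_count, directory_start, master_sat)
# prints: 4 46 [45, 108, 227, 384]
```

The `"<I"` format tells `struct` to treat the four bytes as little-endian, which is the reversal you otherwise have to do by eye.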
As described above, sectors, starting from number zero, begin immediately after the file header. Sector number 45, therefore, begins at offset 512 (the length of the header) + (45*512) (the length of sectors zero through 44), which is 23552 (0x5c00). At that offset in this sample file you find:
005c00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
005c10 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00
005c20 8c 01 00 00 0a 00 00 00 0b 00 00 00 0c 00 00 00
005c30 0d 00 00 00 0e 00 00 00 0f 00 00 00 10 00 00 00
The entry at the beginning, “01 00 00 00”, is the number of the sector that logically follows sector 0; the next entry, “02 00 00 00”, is the number of the sector that logically follows sector 1. You can see that, much of the time, sectors are physically ordered in logical order, but not always; if you look at the ninth entry, which is the entry for sector 8, you can see that, for whatever odd reason Word may have had, the sector that logically follows sector 8 is sector number 396 (0x018c).
The rest of the Sector Allocation Table continues in a similar format, all through the chain of sectors that make up the stream that is the SAT, that is, as you saw above, sectors 45 (0x2d), 108 (0x6c), 227 (0xe3), and 384 (0x0180).
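The offset arithmetic and the chain-following described above can be sketched as follows, again in Python and purely as an illustration. It assumes 512-byte sectors, and it assumes the end-of-chain marker of -2 (0xFFFFFFFE) that is mentioned in the comments of the VBA code at the end of this page; the function names and the toy table are invented for the example.

```python
HEADER_SIZE = 512
SECTOR_SIZE = 512           # assumed; other sizes are ignored on this page
END_OF_CHAIN = 0xFFFFFFFE   # -2 when read as a signed 32-bit value


def sector_offset(sector_number):
    """Return the file offset at which a (normal) sector begins."""
    return HEADER_SIZE + sector_number * SECTOR_SIZE


def follow_chain(sat, first_sector):
    """Return the sector numbers of one chain, in logical order."""
    chain = []
    sector = first_sector
    while sector != END_OF_CHAIN:
        # A reference outside the table, or a chain longer than the table,
        # means the file is damaged (or the table is incomplete).
        if sector >= len(sat) or len(chain) > len(sat):
            raise ValueError("broken sector chain")
        chain.append(sector)
        sector = sat[sector]
    return chain


# A made-up four-entry table: sector 0 is followed by 1, then 3, then 2.
toy_sat = [1, 3, END_OF_CHAIN, 2]

print(hex(sector_offset(45)))      # prints: 0x5c00
print(follow_chain(toy_sat, 0))    # prints: [0, 1, 3, 2]
```

With the complete SAT of a real file in place of `toy_sat`, the same loop gives the sectors of any normal stream, given the number of its first sector.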
You, already, know almost enough to be able (with a little guesswork) to work diligently to rearrange all the sectors of the file into the streams that, collectively, represent the real content of the file. You would, however, not have any real clue what those streams were, and be little closer to unscrambling what you were seeing.
The streams fall, loosely, into two categories: those that are structural, containing information about the file itself, and those that contain file content, which content, of course, depends on the type of file that is actually held within the compound file.
Before explaining the individual Streams, you really need to know something of how the logical structure is encoded. There might, for example, be two Streams with the same name, differentiated only by being in different Storages; without knowing the structure, you wouldn't know which Stream to look at. All the information you need is held in the Directory. The Directory is, as you may guess, held as a Stream, but, because it is a slightly special Stream, and it is essential to be able to locate it as a first step in reading the file, it is referenced from a fixed location.
The Sector Number of the first sector in the Directory is located at offset 48 (0x30). If you look back at the dump of the beginning of my document, you will see that the value at this location is “2e 00 00 00”, representing the numeric value 46 (0x2e). In this file, the directory begins in sector 46, which can be found at offset 24064 (0x5e00), the location immediately following the header and sectors 0 through 45. At this location you will find:
005e00 52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00 R.o.o.t. .E.n.t.
005e10 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 r.y.............
005e20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005e30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005e40 16 00 05 01 ff ff ff ff ff ff ff ff 03 00 00 00
005e50 06 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46
005e60 00 00 00 00 00 00 00 00 00 00 00 00 30 e8 ae 8f
005e70 a9 a1 ca 01 31 00 00 00 40 25 00 00 00 00 00 00
Directory entries are 128 bytes each, and this is the first one. The first 64 bytes are used for the name of the entry; it is held as a null-terminated string, in UCS-2 format, meaning that the maximum length is 31 characters. The number of bytes occupied by the name and the null-terminator together immediately follows the name, at offset 64 within the entry. This is held as a 16-bit unsigned value and, in this example you can see that it has the value 22 (0x0016), which, fortunately, corresponds with the name you can see.
There follow a variety of bit and byte flags and the like, typical of directory entries everywhere, most of which are of little consequence here. One flag you should note is the type of the entry: this is the single byte immediately following the length of the name, at offset 66 within the entry. In this case the type is 0x05, which means that this is Root Storage. The first directory entry, usually, though not necessarily, called Root Entry, is a very special directory entry, for what is both the ultimate parent Storage of the structure, and a very special Stream. The Stream part of it, as you will see shortly, contains several smaller streams, of which I will say nothing more yet – read on, if you wish to find out!
As this is an entry for a stream, albeit a special stream, the two important values you need to take from it are the two 4‑byte entries at offsets 116 (0x74) and 120 (0x78) within the entry. These are the sector number where the stream begins (in this case, sector number 49 (0x31)), and the length of the Stream (in this case, 9536 (0x2540)).
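To make the layout of a directory entry concrete, here is a small Python sketch that picks the fields discussed so far out of the 128 bytes shown above. The function name and the dictionary keys are my own invention; the offsets are those given in the text, plus the three reference fields at offsets 68, 72 and 76 that are explained further down the page.

```python
import struct

# The Root Entry directory entry, copied from the dump above (128 bytes).
ENTRY_DUMP = """
52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00
72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
16 00 05 01 ff ff ff ff ff ff ff ff 03 00 00 00
06 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46
00 00 00 00 00 00 00 00 00 00 00 00 30 e8 ae 8f
a9 a1 ca 01 31 00 00 00 40 25 00 00 00 00 00 00
"""
root_entry = bytes.fromhex("".join(ENTRY_DUMP.split()))


def parse_directory_entry(entry):
    """Extract the fields of interest from one 128-byte directory entry."""
    # Name length (in bytes, including the terminator) and the entry type.
    name_length, entry_type = struct.unpack_from("<HB", entry, 64)
    left, right, child = struct.unpack_from("<III", entry, 68)
    first_sector, stream_length = struct.unpack_from("<II", entry, 116)
    # Drop the two-byte null terminator before decoding the name.
    name = entry[:max(name_length - 2, 0)].decode("utf-16-le")
    return {
        "name": name,
        "type": entry_type,
        "left": left,
        "right": right,
        "child": child,
        "first_sector": first_sector,
        "length": stream_length,
    }


fields = parse_directory_entry(root_entry)
print(fields["name"], fields["type"], fields["first_sector"], fields["length"])
# prints: Root Entry 5 49 9536
```

A reference of 0xffffffff (4294967295 when read as unsigned, as here) means that there is no such sibling or child.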
If you look at the next three entries, which immediately follow this one, and which are shown below, you will see entries for three further streams, the names of which you may recognise from the directory structure shown near the beginning of this page. I leave it to you to study these but, please don't take too long about it, and don't yet try to locate the streams because there is something else you need to know first.
005e80 31 00 54 00 61 00 62 00 6c 00 65 00 00 00 00 00 1.T.a.b.l.e.....
005e90 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005ea0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005eb0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005ec0 0e 00 02 00 1b 00 00 00 05 00 00 00 ff ff ff ff
005ed0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
005ee0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
005ef0 00 00 00 00 09 00 00 00 cc 7a 00 00 00 00 00 00
005f00 57 00 6f 00 72 00 64 00 44 00 6f 00 63 00 75 00 W.o.r.d.D.o.c.u.
005f10 6d 00 65 00 6e 00 74 00 00 00 00 00 00 00 00 00 m.e.n.t.........
005f20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005f30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005f40 1a 00 02 01 01 00 00 00 ff ff ff ff ff ff ff ff
005f50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
005f60 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
005f70 00 00 00 00 00 00 00 00 37 64 00 00 00 00 00 00
005f80 05 00 53 00 75 00 6d 00 6d 00 61 00 72 00 79 00 ..S.u.m.m.a.r.y.
005f90 49 00 6e 00 66 00 6f 00 72 00 6d 00 61 00 74 00 I.n.f.o.r.m.a.t.
005fa0 69 00 6f 00 6e 00 00 00 00 00 00 00 00 00 00 00 i.o.n...........
005fb0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005fc0 28 00 02 01 02 00 00 00 04 00 00 00 ff ff ff ff
005fd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
005fe0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
005ff0 00 00 00 00 1d 00 00 00 00 10 00 00 00 00 00 00
You have now seen four directory entries, each 128 bytes long. If you have been quietly totting up the total length of these (the sort of thing I do, but most people probably don’t!), you will have realised that you have reached the end of a file sector; if you want to see more of the directory, you need to go to the Sector Allocation Table to find the number of the sector that follows sector 46, which in this file, is sector 47. This is my file, and you can't see it, so knowing the sector numbers is rather pointless; suffice to say there are a few, widely separated in the physical file.
Storages, being analogous to Folders, do not, themselves, contain data; they are, nonetheless, important, and you do need to know enough to read the structure from them. Each directory entry has a type; there are half a dozen possible types but the only ones you should expect to find are type 5, which you saw earlier, for the first directory entry, type 1 for a storage, type 2 for a stream, and, perhaps, type 0 for an unused directory entry.
Primarily for performance reasons, the directory structure for each Storage is maintained in what is called a red‑black tree. This tree structure does not represent anything of the structure of Storages and Streams within the file, and is simply a way of organising the directory. A reference to what is, in effect, the arbitrary child at the top of this tree, is held in the directory entry for the Storage (at offset 76 (0x4c)), and every child of the Storage (whether, itself, another Storage, or a Stream) can be found by navigating the tree. Navigating the tree is done by following a maximum of two pointers to further children of the parent Storage, called the Left Sibling (at offset 68 (0x44)) and Right Sibling (at offset 72 (0x48)). An example will, I hope, help to explain this: here is the first directory entry for a simple document.
005e00 52 00 6f 00 6f 00 74 00 20 00 45 00 6e 00 74 00 R.o.o.t. .E.n.t.
005e10 72 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 r.y.............
005e20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005e30 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
005e40 16 00 05 01 ff ff ff ff ff ff ff ff 03 00 00 00
005e50 06 09 02 00 00 00 00 00 c0 00 00 00 00 00 00 46
005e60 00 00 00 00 00 00 00 00 00 00 00 00 60 56 64 7b
005e70 07 39 cd 01 25 00 00 00 80 24 00 00 00 00 00 00
The parts of interest are on the line at 005e40: the entry type (the byte “05” at offset 0x42 within the entry) and the three references (the twelve bytes from offset 0x44). The first two references you will see as being 0xffffffff (equivalent to ‑1), effectively null references: this entry, you will remember, is for the single root of the whole internal structure and has no siblings. The third reference is “3”, a pointer to directory entry number 3; directory entries are given simple sequential numbers for the purposes of these references, with the root storage being entry number 0. The relevant detail from this entry can be summarised thus (with all the numbers being hexadecimal):
Number | Name | Type | Left Sibling Reference | Right Sibling Reference | Root Child Reference |
---|---|---|---|---|---|
00 | Root Entry | 05 | (null) | (null) | 03 |
There is no point in stepping through the process of walking the sector allocations to find the whole directory. I have done this chore, and extracted all the relevant information, and in my sample document, which has a minimal VBA project, here it is:
Number | Name | Type | Left Sibling Reference | Right Sibling Reference | Root Child Reference |
---|---|---|---|---|---|
00 | Root Entry | 5 | (null) | (null) | 03 |
01 | 1Table | 2 | (null) | (null) | (null) |
02 | WordDocument | 2 | 05 | (null) | (null) |
03 | SummaryInformation | 2 | 02 | 04 | (null) |
04 | DocumentSummaryInformation | 2 | (null) | (null) | (null) |
05 | Macros | 1 | 01 | 0d | 0c |
06 | VBA | 1 | (null) | (null) | 07 |
07 | ThisDocument | 2 | 08 | 09 | (null) |
08 | Module1 | 2 | 0a | (null) | (null) |
09 | _VBA_PROJECT | 2 | (null) | (null) | (null) |
0a | dir | 2 | (null) | (null) | (null) |
0b | PROJECTwm | 2 | (null) | (null) | (null) |
0c | PROJECT | 2 | 06 | 0b | (null) |
0d | CompObj | 2 | (null) | (null) | (null) |
You have already seen the Root Entry, with its solitary child reference: this is to entry number 03, which you now see is the SummaryInformation Stream (note that I have left the markers, as I used earlier, off the names of the streams that begin with odd hexadecimal characters - for purely aesthetic reasons). If you look at this entry, you can see that it is for a stream (type 2) and so has no children, but that it has two siblings, entries numbers 02 and 04. Entry number 02 (WordDocument) has a single (left) sibling of entry number 05, whilst entry number 04 (DocumentSummaryInformation) has none. Entry number 05 (Macros), in its turn, is a Storage (type 1) and has a child; it also has two siblings, entries numbers 01 and 0d, neither of which happens to have further siblings. Listing all the elements of this tree as a tree gives this:
03 SummaryInformation
    left: 02 WordDocument
        left: 05 Macros
            left: 01 1Table
            right: 0d CompObj
    right: 04 DocumentSummaryInformation
The tree, itself, you will recall, is of no particular importance; it simply contains all the storages and streams that are immediately below the root. The information it provides is better represented as a folder structure. Reading the tree from left to right, and formatting the result, gives:
(Root Storage container)
    1Table
    Macros
    !CompObj
    WordDocument
    !SummaryInformation
    !DocumentSummaryInformation
Look back, now, at the directory entry for the Macros storage. You will recall that, as well as the sibling references that you followed, it also had a root child reference. The arbitrary first child of Macros is entry number 0c, the PROJECT stream; this has two siblings, entries numbers 06 and 0b, the VBA storage and the PROJECTwm stream. Neither of these has further siblings, so that completes the list of elements immediately within the Macros structure. Adding these to the folder structure gives this:
(Root Storage container)
    1Table
    Macros
        VBA
        PROJECT
        PROJECTwm
    !CompObj
    WordDocument
    !SummaryInformation
    !DocumentSummaryInformation
Repeating the procedure again for the VBA storage, which, I’m sure, you can do yourself, completes the navigation of this, small, directory, and provides a structure of this:
(Root Storage container)
    1Table
    Macros
        VBA
            dir
            Module1
            ThisDocument
            _VBA_PROJECT
        PROJECT
        PROJECTwm
    !CompObj
    WordDocument
    !SummaryInformation
    !DocumentSummaryInformation
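The whole walk can be expressed in a few lines of code. The following Python sketch uses the directory table given earlier, typed in by hand, with `None` standing for a null reference; the function names are my own. It reads each sibling tree from left to right and descends into every Storage it meets.

```python
# (name, type, left sibling, right sibling, root child), indexed by entry number.
DIRECTORY = [
    ("Root Entry", 5, None, None, 0x03),
    ("1Table", 2, None, None, None),
    ("WordDocument", 2, 0x05, None, None),
    ("SummaryInformation", 2, 0x02, 0x04, None),
    ("DocumentSummaryInformation", 2, None, None, None),
    ("Macros", 1, 0x01, 0x0D, 0x0C),
    ("VBA", 1, None, None, 0x07),
    ("ThisDocument", 2, 0x08, 0x09, None),
    ("Module1", 2, 0x0A, None, None),
    ("_VBA_PROJECT", 2, None, None, None),
    ("dir", 2, None, None, None),
    ("PROJECTwm", 2, None, None, None),
    ("PROJECT", 2, 0x06, 0x0B, None),
    ("CompObj", 2, None, None, None),
]
STORAGE = 1


def siblings_in_order(directory, index):
    """Yield the entry numbers of one sibling tree, left subtree first."""
    if index is None:
        return
    _, _, left, right, _ = directory[index]
    yield from siblings_in_order(directory, left)
    yield index
    yield from siblings_in_order(directory, right)


def list_storage(directory, storage_index, depth=0):
    """Return the indented names of everything below one storage."""
    lines = []
    root_child = directory[storage_index][4]
    for index in siblings_in_order(directory, root_child):
        name, entry_type = directory[index][0], directory[index][1]
        lines.append("    " * depth + name)
        if entry_type == STORAGE:
            lines.extend(list_storage(directory, index, depth + 1))
    return lines


print("\n".join(list_storage(DIRECTORY, 0)))
```

Run against this table, it prints the same thirteen names, in the same order and nesting, as the structure shown above (without the exclamation marks, which were left off the table).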
The final piece of information you need, to be able to find your way around the compound file that may contain a pre-2007 Word document, is that not all streams are equal. If a stream is less than 4096 bytes long (the actual trigger length is encoded in the file header, at offset 56 (0x38), but it is always 4096 in practice), it is considered to be a Short Stream and not worthy of its own place in the main physical structure. Short streams are held, you won’t be surprised to learn, in Short Sectors.
The Root Storage, which you saw referenced from the Root Entry directory entry above, is almost a little baby compound file all by itself. It is a stream in the main file, but it consists entirely of Short Sectors of 64 bytes each (as with other lengths, the actual length is encoded in the header, but it is always 64). The 64-byte Short Sectors have a Short Sector Allocation Table (SSAT) all to themselves, which Microsoft, in their own inimitable way, call the MiniFAT. This works in exactly the same way as the SAT for the normal sectors; that is, it is an array with one entry for each short sector, containing the number of the following short sector. Short sectors start at number zero, which is the one at the very beginning (that is, at offset zero) of the Root Storage stream. The SSAT is a normal stream in the file, the number of the first sector of which is held at offset 60 (0x3c) in the main file header; in the case of my sample document, this is sector 48 (0x30).
You may remember, from when you first saw the directory entries, that one of the data stored there was the number of the first sector where Stream data began. I said there was something else you needed to know. That something is that the sector number may be of a normal sector or a short sector, and the only way to tell is by the length of the stream, also held in the directory entry. If the stream is less than 4096 bytes long, the sector number is of a short sector, and, if 4096 bytes or longer, it is of a normal sector.
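The decision, and the arithmetic for finding a short sector in the file, might be sketched like this in Python. It assumes the usual sizes (512-byte sectors, 64-byte short sectors, a 4096-byte cutoff); the function names are mine, and the chain `[49, 50, 52]` is made up for the example rather than taken from the sample document.

```python
HEADER_SIZE = 512
SECTOR_SIZE = 512            # assumed normal sector size
SHORT_SECTOR_SIZE = 64       # assumed short sector size
SHORT_STREAM_CUTOFF = 4096   # assumed; the real value is in the file header


def uses_short_sectors(stream_length):
    """True if a stream of this length is held in short sectors."""
    return stream_length < SHORT_STREAM_CUTOFF


def short_sector_offset(short_sector_number, root_stream_chain):
    """Return the file offset of a short sector.

    root_stream_chain lists, in logical order, the normal sectors that
    hold the Root Storage stream, found by following the SAT from the
    first sector given in the Root Entry directory entry.
    """
    position = short_sector_number * SHORT_SECTOR_SIZE
    holder = root_stream_chain[position // SECTOR_SIZE]
    return HEADER_SIZE + holder * SECTOR_SIZE + position % SECTOR_SIZE


# Hypothetical chain: the Root Storage stream occupies sectors 49, 50 and 52.
chain = [49, 50, 52]
print(uses_short_sectors(4095), uses_short_sectors(4096))  # prints: True False
print(short_sector_offset(9, chain))                       # prints: 26176
```

Short sector 9 starts 576 bytes into the Root Storage stream, which is 64 bytes into the second sector of the chain, hence 512 + 50*512 + 64.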
Having waded through a couple of screens’ worth of description of a relatively complex structure that, I hope, has been understandable, you may be itching to find out a little about the data, the actual content of the file. I'm afraid I'm probably going to disappoint you, because I am not going to give you full details of how the actual content of your document is stored: it is just too complex, and it is fading into history. I do intend, however, to write more about some of the streams. Meanwhile, a brief synopsis of what they are will have to suffice.
Looking at the structure that you extracted from the directory above, which is not in any particular order, you can see streams called WordDocument and 1Table. These represent your document; there may also, or as an alternative to the 1Table stream, be a 0Table stream, and perhaps a Data stream. Although I may write a little more about these streams, I do not intend to provide great detail.
Next, the three streams that have names starting with unprintable characters, which I have represented with exclamation marks. These hold information about the document, in part what you see if you view the properties of the file in Windows Explorer. I will add some details of these shortly.
The remainder of your document is the structure in the Macros storage. I will provide (fairly) full detail of this, as it remains current, but working with it involves several different processes and is too much for one page. The links below will take you to pages with more detail, and will be added to as I complete more pages.
⇒ Extracting the VBA Storage structure [link to VBAStorage.php on this site]
⇒ Stream (de‑)compression [link to StreamCompression.php on this site]
VBA is what I do; I couldn't leave the page without offering some code! The following code will read a Compound Binary File and extract structural details from it. It, then, looks for and extracts the stream called “dir”. This stream was chosen for the demonstration simply because it was relevant to the forum posting that triggered the writing of this page. Decompression, and interpretation, of the “dir” stream will have to wait a week or two. I may provide some more detail later but, for the moment, the comments in the code are all I offer, along, of course, with the article you have just read, which should be sufficient for an understanding.
If you are returning, having been here before, you may notice that the code has changed. There is no functional difference but I have re-organised it as a precursor for using it as a basis for further code on further pages.
Option Explicit

' These definitions use field names as documented by Microsoft; I make no comment other
' than the obvious: that they need explanation. Enumerations in VBA are, implicitly,
' Long, so the (Microsoft) one shown here is used for comparands but not for definition

Private Enum Directory_EntryType
    STGTY_Invalid
    STGTY_Storage
    STGTY_Stream
    STGTY_Lockbytes
    STGTY_Property
    STGTY_Root
End Enum

Private Type StructuredStorageHeader ' Microsoft definition for File Header
    abSig(0 To 7) As Byte            ' 0xd0 0xcf 0x11 0xe0 0xa1 0xb1 0x1a 0xe1
    clid(0 To 15) As Byte
    uMinorVersion As Integer
    uDllVersion As Integer
    uByteOrder As Integer            ' 0xFFFE (0xFE 0xFF = Intel 'little-endian')
    uSectorShift As Integer          ' Log(2) of Sector Size (usually 9, = 512-byte sectors)
    uMiniSectorShift As Integer      ' Log(2) of Small Sector Size (usually 6, = 64-byte)
    usReserved As Integer            ' Must be zero
    ulReserved1 As Long              ' Must be zero
    ulReserved2 As Long              ' Must be zero
    csectFAT As Long                 ' Number of sectors used by the SAT
    sectDirStart As Long             ' First Sector of the Directory
    ulDFsignature As Long            ' Must be zero
    ulMiniSectorCutoff As Long       ' Usually 4096
    sectMiniFatStart As Long         ' First Sector of the SSAT
    csectMiniFat As Long             ' Number of sectors used by the SSAT
    sectDifStart As Long             ' First MSAT continuation sector
    csectDif As Long                 ' Number of MSAT continuation sectors
    sectFat(0 To 108) As Long        ' First 109 entries of the MSAT
End Type

Private Type DirectoryEntry          ' Microsoft definition for Directory Entry
    ab As String * 64                ' Entry Name
    cb As Integer                    ' Entry Name length (including trailing null)
                                     ' Microsoft explicitly say characters, but it is bytes
    mse As Byte                      ' Entry Type.
                                     ' Enum above. Can't declare Byte as Enum in VBA
    bflags As Byte                   ' 0 = Red, 1 = Black
    sidLeftSib As Long               ' Id of Left Sibling Directory Entry
    sidRightSib As Long              ' Id of Right Sibling Directory Entry
    sidChild As Long                 ' Id of Root Child Directory Entry
    clsid(0 To 15) As Byte
    dwUserFlags As Long
    time(0 To 1, 0 To 7) As Byte
    sectStart As Long                ' First Sector (or Short Sector) of Stream
    ulSize As Long                   ' Length of Stream
    dptPropType As Long              ' Zero
End Type

' VBA does not have very good facilities for working with free format data, so a mixed
' bag of definitions of 512 bytes, each to be used when appropriate, are defined.

Private Type NonSpecificSector
    SectorByte(0 To 511) As Byte
End Type

Private Type SectorOfDWords
    DWord(0 To 127) As Long
End Type

Private Type ShortSector
    SSByte(0 To 63) As Byte
End Type

Private Type SectorOfShortSectors
    ShortSector(0 To 7) As ShortSector
End Type

Private Type StreamOfShortSectors
    ShortSector() As ShortSector
End Type

Private Type DirectorySector
    DirectoryEntry(0 To 3) As DirectoryEntry
End Type

Private Type DirectoryEntryLocal
    EntryName As String
    EntryType As Directory_EntryType
    RedOrBlack As Byte
    LeftSibling As Long
    RightSibling As Long
    RootChild As Long
    FirstSector As Long
    StreamLength As Long
End Type

' Some module level declarations make life a little easier.

Private FileNo As Long
Private FileHeader As StructuredStorageHeader
Private SAT() As Long
Private SSAT() As Long
Private Directory() As DirectoryEntryLocal
Private SSStream As StreamOfShortSectors

Sub SimpleTestDriver()

    Dim FilePath As String
    Dim FileName As String
    Dim Stream() As Byte

    FilePath = "C:\Path\To\Your\"
    FileName = "Document.doc"

    ' There are various ways to read files of binary data, none of which are
    ' entirely satisfactory. Here, the simple built-in mechanism is used.

    FileNo = FreeFile
    Open FilePath & FileName For Binary As FileNo

    ' Positioning the file at the beginning for clarity, read the File Header.

    Seek FileNo, 1
    Get FileNo, , FileHeader

    ' Gather the structural elements of the file, to enable what follows.

    ExtractSAT
    ExtractSSAT
    ExtractDirectory
    ExtractShortSectorStream

    ' Everything is now in place to extract any Stream we want by following the
    ' appropriate pointers. For this example, the "dir" stream is extracted.

    Stream = ExtractStream("dir")

    ' For this example, the file is no longer needed, so close it.

    Close FileNo

    ' Do whatever you want with the stream now

End Sub

Function ExtractStream(StreamName As String) _
                       As Byte()

    Dim UseShortStream As Boolean
    Dim Stream() As Byte
    Dim Sectorndx As Long
    Dim Sector As NonSpecificSector
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim ndx As Long
    Dim ndx2 As Long

    ' This routine scans the directory for an entry for a requested stream, and,
    ' assuming it's found, then extracts the data for that stream.

    For ndx = LBound(Directory) To UBound(Directory)
        If Directory(ndx).EntryType = STGTY_Stream Then
            If Directory(ndx).EntryName = StreamName Then
                Exit For
            End If
        End If
    Next
    If ndx > UBound(Directory) Then Exit Function

    UseShortStream = (Directory(ndx).StreamLength < 4096)
    SectorLen = IIf(UseShortStream, Len(SSStream.ShortSector(0)), Len(Sector))

    ReDim Stream(0 To SectorLen - 1)
    Sectorndx = 0

    ' Sector (and short sector) numbers start at zero, so zero is a valid
    ' first sector; only the negative marker values end the chain.

    SectorNo = Directory(ndx).FirstSector
    Do While SectorNo >= 0
        If UseShortStream Then
            For ndx2 = 0 To SectorLen - 1
                Stream(Sectorndx + ndx2) = _
                    SSStream.ShortSector(SectorNo).SSByte(ndx2)
            Next
        Else
            Seek FileNo, (SectorNo + 1) * SectorLen + 1
            Get FileNo, , Sector
            For ndx2 = 0 To SectorLen - 1
                Stream(Sectorndx + ndx2) = Sector.SectorByte(ndx2)
            Next
        End If
        Sectorndx = Sectorndx + SectorLen
        ReDim Preserve Stream(LBound(Stream) To UBound(Stream) + SectorLen)
        If UseShortStream Then
            SectorNo = SSAT(SectorNo)
        Else
            SectorNo = SAT(SectorNo)
        End If
    Loop

    ReDim Preserve Stream(LBound(Stream) To Directory(ndx).StreamLength - 1)
    ExtractStream = Stream

End Function

Sub ExtractSAT()

    Dim MSAT() As Long
    Dim MSATndx As Long
    Dim SATndx As Long
    Dim Sector As SectorOfDWords
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long

    ' First get the Master Sector Allocation Table. Make the array big enough for
    ' the maximum possible entries because the code logic is easier that way.
    ' Read the first 109 entries from the header, then as many complete sectors
    ' as there are, before redefining the array to drop the excess entries.
    ' Note: The construct processed here is not explained in the article.

    ReDim MSAT(0 To FileHeader.csectDif * 127 + 108)
    MSATndx = 0

    For MSATndx = 0 To 108
        MSAT(MSATndx) = FileHeader.sectFat(MSATndx)
    Next MSATndx

    SectorNo = FileHeader.sectDifStart
    Do While SectorNo >= 0
        Seek FileNo, (SectorNo + 1) * 512 + 1
        Get FileNo, , Sector
        For ndx = 0 To 126
            MSAT(MSATndx + ndx) = Sector.DWord(ndx)
        Next ndx
        MSATndx = MSATndx + 127
        SectorNo = Sector.DWord(127)
    Loop

    ReDim Preserve MSAT(0 To FileHeader.csectFAT - 1) ' Right size

    ' Read the sectors in the MSAT to build up the full Sector Allocation Table.
    ' The final sector may be padded with values of -1, but they should never be
    ' referenced. Once the SAT has been built, the MSAT is no longer needed.

    SectorLen = 512
    SectorEntries = 128 ' could calculate

    ReDim SAT(0 To SectorEntries * (UBound(MSAT) - LBound(MSAT) + 1) - 1)
    SATndx = 0

    For MSATndx = LBound(MSAT) To UBound(MSAT)
        Seek FileNo, (MSAT(MSATndx) + 1) * SectorLen + 1
        Get FileNo, , Sector
        For ndx = LBound(Sector.DWord) To UBound(Sector.DWord)
            SAT(SATndx + ndx) = Sector.DWord(ndx)
        Next
        SATndx = SATndx + SectorEntries
    Next MSATndx

    ' The final sector won't be full, but will be padded, as on file.

End Sub

Sub ExtractSSAT()

    Dim SSATndx As Long
    Dim Sector As SectorOfDWords
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long

    ' The first sector of the Short Sector Allocation Table is held in the Header.
    ' The rest are then chained through the SAT, built in the previous step. The
    ' end-of-chain is indicated by a value of -2 (0xFFFFFFFE) in the SAT. As with
    ' the SAT, the final sector may be padded with values of -1, but they should
    ' never be referenced.

    SectorLen = 512
    SectorEntries = 128 ' could calculate

    ReDim SSAT(0 To SectorEntries * FileHeader.csectMiniFat - 1)
    SSATndx = 0

    SectorNo = FileHeader.sectMiniFatStart
    Do While SectorNo >= 0
        Seek FileNo, (SectorNo + 1) * SectorLen + 1
        Get FileNo, , Sector
        For ndx = LBound(Sector.DWord) To UBound(Sector.DWord)
            SSAT(SSATndx + ndx) = Sector.DWord(ndx)
        Next
        SSATndx = SSATndx + SectorEntries
        SectorNo = SAT(SectorNo)
    Loop

    ' The final sector won't be full, but will be padded, as on file.

End Sub

Sub ExtractDirectory()

    Dim Sector As DirectorySector
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long
    Dim var As Variant

    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' The first sector of the Directory is held in the Header. The rest are then
    ' chained through the SAT, built earlier. The end-of-chain is indicated by a
    ' value of -2 (0xFFFFFFFE) in the SAT. Only Root, Storage, and Stream entries
    ' are kept; entries of any other type, such as unused ones padding the final
    ' sector, are skipped.
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '

    SectorLen = 512
    SectorEntries = 4 ' could calculate

    Erase Directory

    SectorNo = FileHeader.sectDirStart
    Do While SectorNo >= 0

        Seek FileNo, (SectorNo + 1) * SectorLen + 1
        Get FileNo, , Sector

        For ndx = LBound(Sector.DirectoryEntry) To UBound(Sector.DirectoryEntry)
            With Sector.DirectoryEntry(ndx)
                Select Case .mse
                    Case STGTY_Root, STGTY_Storage, STGTY_Stream

                        ' (Not Directory) = True while the array is unallocated
                        If (Not Directory) = True Then
                            ReDim Directory(0 To 0)
                        Else
                            ReDim Preserve Directory(LBound(Directory) To UBound(Directory) + 1)
                        End If

                        Directory(UBound(Directory)).EntryName _
                            = StrConv(Left(.ab, .cb - 2), vbFromUnicode)
                        Directory(UBound(Directory)).EntryType = .mse
                        Directory(UBound(Directory)).LeftSibling = .sidLeftSib
                        Directory(UBound(Directory)).RightSibling = .sidRightSib
                        Directory(UBound(Directory)).RedOrBlack = .bflags
                        Directory(UBound(Directory)).RootChild = .sidChild
                        Directory(UBound(Directory)).FirstSector = .sectStart
                        Directory(UBound(Directory)).StreamLength = .ulSize

                End Select
            End With
        Next

        SectorNo = SAT(SectorNo)
    Loop

End Sub
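The sibling and child numbers saved above are what turn the flat list of directory entries back into folders and files. As a rough sketch of how they might be used, the code below lists the members of one Storage. It assumes, as the compound file format defines, that the members of a Storage form a binary tree entered through the Storage's child number, and that -1 means "no entry". The DemoEntry type and both routine names are hypothetical, cut down for illustration; they are not the types used by the code above, and the sketch belongs in a standard module.

```vba
' Hypothetical, cut-down directory entry, for illustration only.
Type DemoEntry
    EntryName As String
    LeftSibling As Long
    RightSibling As Long
    RootChild As Long
End Type

' Adds to Names the names of all entries in the sibling tree rooted at
' entry number ndx, visiting left subtree, entry, then right subtree.
Sub ListSiblings(Entries() As DemoEntry, ByVal ndx As Long, Names As Collection)
    If ndx < 0 Then Exit Sub            ' -1 means there is no entry here
    ListSiblings Entries, Entries(ndx).LeftSibling, Names
    Names.Add Entries(ndx).EntryName
    ListSiblings Entries, Entries(ndx).RightSibling, Names
End Sub

' Returns the names of the Storages and Streams directly inside the
' Storage (or Root) at entry number StorageNdx.
Function ChildNames(Entries() As DemoEntry, ByVal StorageNdx As Long) As Collection
    Dim Names As New Collection
    ListSiblings Entries, Entries(StorageNdx).RootChild, Names
    Set ChildNames = Names
End Function
```

Calling ChildNames with entry 0, the Root, gives the top level of the document. The sketch trusts the file: a damaged directory whose sibling numbers loop would recurse until VBA runs out of stack space.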
Sub ExtractShortSectorStream()

    Dim SSndx As Long
    Dim Sector As SectorOfShortSectors
    Dim SectorNo As Long
    Dim SectorLen As Long
    Dim SectorEntries As Long
    Dim ndx As Long

    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' The first sector of the Short Sector Stream is pointed to from the Root
    ' Storage directory entry; the rest are chained through the SAT. It seems as
    ' if one extra, unused, sector is allocated at the end of the stream; this is
    ' simple observation, and I cannot confirm it to always be the case.
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '

    SectorLen = 512
    SectorEntries = 8 ' short sectors per sector (512 \ 64)

    ReDim SSStream.ShortSector(0 To SectorEntries - 1)

    SSndx = 0

    SectorNo = Directory(0).FirstSector
    Do While SectorNo >= 0

        Seek FileNo, (SectorNo + 1) * SectorLen + 1
        Get FileNo, , Sector

        With SSStream

            For ndx = LBound(Sector.ShortSector) To UBound(Sector.ShortSector)
                .ShortSector(SSndx + ndx) = Sector.ShortSector(ndx)
            Next

            SSndx = SSndx + SectorEntries

            ' Grow the array ready for the next sector; this leaves one
            ' sector's worth of unused elements at the end.
            ReDim Preserve .ShortSector(LBound(.ShortSector) To UBound(.ShortSector) + SectorEntries)

        End With

        SectorNo = SAT(SectorNo)
    Loop

End Sub
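Because eight short sectors fit in each ordinary sector, a short sector number maps directly onto a place in the file once the chain of sectors holding the Short Sector Stream is known. The following is a sketch only, assuming 512-byte sectors and 64-byte short sectors; the function name and the ChainSectors array, which is taken to hold the stream's sector numbers in logical order, are hypothetical and not part of the code above.

```vba
' Hypothetical helper (not part of the original module): returns the 1-based
' file position, as used by Seek and Get, of a given short sector.
' ChainSectors holds, in logical order, the numbers of the sectors that
' make up the Short Sector Stream.
Function ShortSectorSeekPosition(ChainSectors() As Long, _
                                 ByVal ShortSectorNo As Long) As Long

    Const SectorLen As Long = 512
    Const ShortSectorLen As Long = 64
    Const PerSector As Long = SectorLen \ ShortSectorLen    ' 8

    Dim HostSector As Long

    ' Which sector of the chain holds this short sector
    HostSector = ChainSectors(LBound(ChainSectors) + ShortSectorNo \ PerSector)

    ' Start of that sector, plus the offset of the short sector within it
    ShortSectorSeekPosition = (HostSector + 1) * SectorLen _
                            + (ShortSectorNo Mod PerSector) * ShortSectorLen + 1

End Function
```

If the stream occupies sectors 5 and 9, in that order, short sector 10 is the third short sector of sector 9, and so starts at position (9 + 1) * 512 + 2 * 64 + 1 = 5249.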