VBA Project Compression

Introduction

As described in the first article [link to StructuredStorage.php], prior to Word 2007, Word Documents (and Excel Workbooks, amongst others) were held in a containing structure that mimics a mini file system. Despite creating an entirely new XML‑based format for documents themselves, Microsoft still use the old format for storing VBA Projects within the new structure. I have written about the VBA structure [link to VBAStorage.php], and now turn to the compression used, supposedly to save space.

It is not compulsory for you to follow along, but I will describe what I do in a way that you can if you wish. I will say now, and may say again, that VBA is not really the language of choice for this but I like to use VBA to demonstrate because it is available to all users of Word: you don’t need special software to copy what I do.

A VBA Project

As an arbitrary starting point, create a new document, and insert a new module, “Module1”; copy the VBA code from the first page of this series [link to StructuredStorage.php#TheCode] into the new module and save the document, in Word 2007‑format as a macro‑enabled document, calling it, say, “Arbitrary.docm”. I was writing an article on the 2007‑format files before I allowed myself to be sidetracked into writing this one, and do not propose to provide any specific details here, other than those necessary for the matter at hand. If you rename the document file as “Arbitrary.docm.zip”, open the resultant zip folder, and navigate to the “word” directory inside it, the VBA project will be, by default, in the file “vbaProject.bin”; if you extract this file you will have something to work with. When you have extracted the file, you can close the folder and rename the zip file back to “Arbitrary.docm”. You don’t actually have to do this yourself: I have done it for you and you can download a zip folder containing the two files, by clicking here [link to the file on this site at files/ArbitrarySample.zip].

I should, perhaps, say that the above description and download are exactly the same as presented on the page dedicated to the structure of the VBA Project [link to VBAStorage.php]. Whilst the two processes could, clearly, be combined, there is no dependency and I have chosen not to combine them for the purposes of the demonstration I offer here.

The sample code that I posted is just some scaffolding on which to build; it shows how to navigate the physical file, but it doesn’t really offer much help when you want to work with the logical contents of the file that are held inside the physical wrapper. The navigation presented ends with the extraction of what is called the dir Stream. The dir stream contains essential information about the Project, and must be read, and interpreted, before any other stream. On this page, I will explain how to decompress the stream, and on the next (NOT WRITTEN YET!), how to understand the decompressed contents.

A Brief Note on Numbers, for the non‑Technical

This page is technical and most of what marketers would call its target audience are expected to be able to understand it; if you are a programmer, you can almost certainly skip this section. I do, however, try to make my writings accessible to all, and I feel an explanation of the numbers used here is in order.

Numbers (in modern Western systems) are written, to base 10, using a positional notation; each individual digit has a particular meaning that depends on its position in the context of surrounding digits. As I’m quite sure you know, the “9” in “95” means 90, that is 9 × 10 (9 times the base), whilst the “9” in “3917” means 900: 9 × 10 × 10 (9 times the base times the base). This, however, is largely meaningless to computers, which work with binary numbers, numbers to the base 2.

Binary numbers are, similarly, written using a positional notation but, in the binary system, there are only two digits: zero and one. The “1” in the binary number “10” means 2, that is 1 × 2 (1 times the base), and the “1” in “1000” means 8: 1 × 2 × 2 × 2 (1 times the base times the base times the base). For clarity, whenever binary numbers are used here they are written with a prefix of “0b”, so the decimal number 25 would be written as 0b11001.

Binary numbers can get quite long, and are easy to misread. For example, the binary equivalent of 4000 would be 0b111110100000. To make life easier for people (computers have no difficulties), number systems based on higher numbers are better. A system based on 8, called octal, is sometimes seen, but a system based on 16, called hexadecimal, is more normally used.

In hexadecimal, there are 16 different digits, and the letters A through F are used for the digits 10 through 15. Hexadecimal is, like the other systems, written positionally, and the “B” in “B0” means 176, that is B × 16 (B, or 11, times the base), whilst the “B” in “1B0F” means 2816, that is B × 16 × 16 (B times the base times the base). Hexadecimal (or, more usually, just hex) numbers used here are written with a prefix of “0x”, so the decimal number 4000, 0b111110100000 in binary, you’ll remember, would be written as 0xFA0. When presenting data in ‘dump’ format, however, the numbers are always hexadecimal and are just presented as numbers; to give each of them an “0x” prefix would make them unreadable.
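If you would like to check the conversions above for yourself, here is a minimal sketch. VBA has a built‑in Hex$ function, and understands the “&H” prefix for hex literals, but it has nothing for binary, so the ToBinary function below is a hypothetical helper of my own invention, written on the assumption that it will only ever be given non‑negative numbers.

```vba
' Hypothetical helper (not part of the sample code): returns the binary digits
' of a non-negative Long, without the "0b" prefix used in the text.
Function ToBinary(ByVal Value As Long) As String
    If Value = 0 Then
        ToBinary = "0"
        Exit Function
    End If
    Do While Value > 0
        ' Peel off the low order bit and put it at the front of the result
        ToBinary = CStr(Value Mod 2) & ToBinary
        Value = Value \ 2
    Loop
End Function

Sub ShowNumberBases()
    Debug.Print "0b" & ToBinary(25)      ' decimal 25 in binary
    Debug.Print "0b" & ToBinary(4000)    ' decimal 4000 in binary
    Debug.Print "0x" & Hex$(4000)        ' decimal 4000 in hex
    Debug.Print &HB0&                    ' hex B0 in decimal
End Sub
```

Run ShowNumberBases with the Immediate Window open to see the results.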

The “dir” Stream

Before you can run the code in the document, assuming you downloaded, or otherwise created, it, you need to make one small change to make it read the right file. If you have the document and the file in the same folder, as the download presents it, then you don't need to know, or hard code, exactly where it is. Just changing these two lines:

    FilePath = "C:\Path\To\Your\"
    FileName = "Document.doc"

.. to this:

    FilePath = MacroContainer.Path & Application.PathSeparator
    FileName = "vbaProject.bin"

.. will suffice. If, having done that, you run the code and put a breakpoint on the Close FileNo statement, you will be able to look at the contents of the stream after it has been extracted, but it isn’t easy looking at a byte array like that, so here is a hex version of the complete stream.

000000   01 31 B2 80  01 00 04 00   00 00 03 00  30 2A 02 02    .1.........0*..
000010   90 09 00 70  14 06 48 03   00 82 02 00  64 E4 04 04    ...p..H.....d..
000020   00 07 00 1C  00 50 72 6F   6A 65 63 74  05 51 00 28    .....Project.Q.(
000030   00 00 40 02  14 06 02 14   3D AD 02 0A  07 02 6C 01    ..@.....=....l.
000040   14 08 06 12  09 02 12 80   35 CC 87 51  10 00 0C 02    ........5.Q....
000050   4A 12 3C 02  0A 16 00 01   72 73 74 64  10 6F 6C 65    J.<.....rstd.ole
000060   3E 02 19 73  00 74 00 00   64 00 6F 00  6C 00 65 50    >..s.t..d.o.l.eP
000070   00 0D 00 68  00 25 5E 00   03 2A 00 5C  47 7B 30 30    ...h.%^..*.\G{00
000080   30 32 30 B0  34 33 30 2D   00 08 04 04  43 00 0A 03    020430-....C...
000090   02 0E 01 12  30 30 34 36   7D 23 00 32  2E 30 23 30    ....0046}#.2.0#0
0000A0   23 43 3A 00  5C 57 69 6E   64 6F 77 73  00 5C 53 79    #C:.\Windows.\Sy
0000B0   73 74 65 6D  33 04 32 5C   03 65 32 2E  74 6C 62 00    stem3.2\.e2.tlb.
0000C0   23 4F 4C 45  20 41 75 74   80 6F 6D 61  74 69 6F 6E    #OLE Aut.omation
0000D0   00 60 03 00  02 83 45 4E   6F 72 6D 61  6C 05 83 45    .`....ENormal..E
0000E0   4E 80 43 72  00 6D 00 61   51 80 46 0E  00 20 80 11    N.Cr.m.aQ.F.. ..
0000F0   09 80 01 2A  0C 5C 43 03   12 0A 06 B8  7D 3F 51 04    ...*.\C....}?Q.

000100   04 00 83 21  4F 66 66 69   63 11 84 67  4F 00 66 80    ...!Offic..gO.f.
000110   00 69 00 63  15 82 67 9E   80 1F 94 82  21 47 7B 32    .i.c..g.....!G{2
000120   00 44 46 38  44 30 34 43   2D 00 35 42  46 41 2D 31    .DF8D04C-.5BFA-1
000130   30 31 40 42  2D 42 44 45   35 80 67 41  6A 41 80 65    01@B-BDE5.gAjA.e
000140   34 80 05 32  88 67 80 BA   67 00 72 61  6D 20 46 69    4..2.g.g.ram Fi
000150   6C 65 00 73  5C 43 6F 6D   6D 6F 6E 01  04 06 4D 69    le.s\Common...Mi
000160   63 72 6F 73  6F 00 66 74   20 53 68 61  72 65 00 64    croso.ft Share.d
000170   5C 4F 46 46  49 43 45 00   31 35 5C 4D  53 4F 2E 44    \OFFICE.15\MSO.D

000180   18 4C 4C 23  87 10 83 4D   20 31 35 20  2E 30 20 4F    .LL#...M 15 .0 O
000190   62 81 E3 20  4C C0 69 62   72 61 72 79  80 25 80 00    b. Library.%..
0001A0   22 0F 82 7A  02 00 13 C2   01 15 80 02  19 42 65 54    "..z........BeT
0001B0   68 69 73 44  6F 00 63 75   6D 65 6E 74  47 00 0A 18    hisDo.cumentG...
0001C0   C0 09 54 C0  66 69 00 73   00 22 44 C0  48 63 00 75    .Tfi.s."DHc.u
0001D0   40 49 65 00  AA 6E C0 6E   1A CE 0B 32  DA 0B 1C C0    @Ie.nn..2..
0001E0   12 A8 00 00  48 42 01 31   42 89 0D 40  A1 16 1E 42    ...HB.1B..@..B
0001F0   02 01 05 2C  C2 21 11 1D   22 15 42 08  2B 42 01 19    ...,!..".B.+B..

000200   82 A1 4D 6F  64 80 75 6C   65 31 47 00  0E 00 05 1A    .Mod.ule1G.....
000210   4D 80 21 64  80 21 81 8D   31 00 1A 0D  09 08 32 10    M.!d.!..1.....2.
000220   08 4F 1D E1  57 00 00 B1   4D 1D A8 D2  21 C2 1B 43    .O.W..M.!.C
000230   1D 10 C2 02  00                                        ....

You can see bits of recognisable text there but, overall, it isn't readable; this is because it is compressed. From what I have seen, I doubt whether the compression saves as much space as is wasted by the unnecessary inclusion of extra data, but what I think is of little consequence; the data is compressed and must be decompressed before going any further.
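Should you want to produce a similar dump of a byte array of your own, something along the following lines ought to do. It is only a sketch: DumpRow and DumpBytes are hypothetical names, not part of the sample code, and the layout is simplified (no extra gaps between groups of four bytes, and a full stop shown for anything outside the printable ASCII range).

```vba
' Hypothetical helper: formats up to 16 bytes of Data, beginning at Start, as
' a six-digit hex offset, the bytes in hex, and the bytes as text.
Function DumpRow(ByRef Data() As Byte, ByVal Start As Long) As String
    Dim ndx         As Long
    Dim HexPart     As String
    Dim TextPart    As String

    For ndx = Start To Start + 15
        If ndx > UBound(Data) Then Exit For
        HexPart = HexPart & Right$("0" & Hex$(Data(ndx)), 2) & " "
        If Data(ndx) >= 32 And Data(ndx) < 127 Then
            TextPart = TextPart & Chr$(Data(ndx))
        Else
            TextPart = TextPart & "."
        End If
    Next ndx

    ' Pad the hex part to a fixed width so that the text column lines up
    DumpRow = Right$("00000" & Hex$(Start), 6) & "   " & _
              Left$(HexPart & Space$(48), 48) & "  " & TextPart
End Function

' Hypothetical helper: writes the whole array to the Immediate Window.
Sub DumpBytes(ByRef Data() As Byte)
    Dim Start As Long
    For Start = LBound(Data) To UBound(Data) Step 16
        Debug.Print DumpRow(Data, Start)
    Next Start
End Sub
```

With a breakpoint on the Close FileNo statement, typing DumpBytes followed by the name of the array holding the stream, in the Immediate Window, should list it.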

Decompression

A fairly consistent feature of the documentation that Microsoft has released is that it makes everything appear more complicated than it really is; so it is with the compression that is used here. What Microsoft call the Compressed Container contains a signature byte (0x01) followed by a series of “Chunks”. The term, Chunk, is one I despise (should anybody who worked with me in the late nineties happen to be reading, you know what I mean!) and, although I know it can be difficult to come up with distinct terminology, I do think that the use of such terms smacks of desperation.

Each chunk, with the exception of the last one in a Container, which will usually be smaller, holds 4096 bytes of (decompressed) data, and will be compressed if space can be saved by so doing. Each chunk begins with a 16‑bit ‘header’, the low order 12 bits of which have a value three less than the length of the chunk, header included; the reason for this is that the maximum possible size of a chunk (the 2‑byte header followed by 4096 bytes of data) is 4098, three greater than the maximum value that can be held in 12 bits, which is 4095. The high order bit of the header is a flag indicating whether or not the chunk is compressed, and the other three bits have a fixed value of 0b011.

The compressed data consists of a series of what are called Token Sequences. Each Token Sequence consists of a byte, to be viewed as eight separate flag bits, followed by eight Tokens, the type of each token being indicated by the corresponding flag bit. If a flag bit is 0, the corresponding token is a single byte, a Literal Token, to be taken ‘as is’; if the flag bit is 1, the token is a two byte code, a Copy Token, which, after being unscrambled, gives the position and length of a sequence earlier in the decompressed chunk, which must be copied.
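To make the flag byte a little more concrete, here is a minimal sketch that turns one into a string of eight letters, “L” for a Literal Token and “C” for a Copy Token, in the order in which the tokens follow the flag byte. TokenTypes is a hypothetical name of my own, and the sketch assumes that the flags are read from the low order bit upwards.

```vba
' Hypothetical helper: describes the eight tokens that follow a flag byte.
' The low order bit describes the first token, the high order bit the last.
Function TokenTypes(ByVal FlagByte As Byte) As String
    Dim Mask    As Long
    Dim ndx     As Long

    Mask = 1
    For ndx = 0 To 7
        If (FlagByte And Mask) = 0 Then
            TokenTypes = TokenTypes & "L"   ' single byte, taken 'as is'
        Else
            TokenTypes = TokenTypes & "C"   ' two bytes, to be unscrambled
        End If
        Mask = Mask * 2                     ' move on to the next bit up
    Next ndx
End Function
```

A flag byte of 0x80, for example, gives seven literals followed by a single copy token.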

The Copy Token is made up of two parts, an offset code and a length code. The offset code is one less than the number of bytes to the left of the current position in the decompressed chunk from where to start copying, and the length code is three less than the number of bytes to copy. There is no good reason for these small increments but they are used and you need to know about them. The number of bits (of the 16 in the copy token) used for the offset (and, thus, those remaining that are used for the length) is calculated as being the smallest integer that is equal to or greater than the logarithm to base 2 of the length, so far, of the decompressed chunk, subject to it never being less than four or greater than 12.

If you are one of those unfortunate people who shiver at the mere mention of numbers, do not fear. None of this is difficult and I am here to explain it, and to give you some code so that you don’t even need to understand it. The logarithm (or log) of a number is just a way of saying how many copies of another number (the base of the log) have to be multiplied together to get the (first) number. For example, the logarithm (to base 2) of 8 is 3, because three copies of 2 are needed to make 8: 8 = 2 × 2 × 2, and the log (to base 2) of 16 is 4, because four copies of 2 are needed to make 16: 16 = 2 × 2 × 2 × 2. So what, you may ask, about 12, or 14? Well, the answer is three and a bit; those numbers are more than 8, and less than 16, so the logs of those numbers are more than 3 and less than 4, and that is all you need to know for the purposes of this decompression.

Logs are most usually presented as logs to base 10, and the unqualified term, “Log”, usually means log to base 10, but mathematicians are more likely to use natural logarithms, logs to the base e. e is a special number in mathematics, equal, if you are interested, to approximately 2.71828, and formally defined as the limit of (1 + 1/n)^n as n tends to infinity, which means that the bigger the value of n, the closer to e, the result is. The term, “Ln”, is generally used for natural logarithms. Logs to base 2 are sometimes useful in computing, where everything is in the binary system, and based on 2, but there is no special term for such logs.

VBA is not the best language for working with logs; it does have what it calls a Log function, but it should really be named Ln, as it returns natural logarithms, not logs to base 10. There is no function to return a log to base 2, but there is a way to convert logs from one base to another: you do this by dividing the log of your number to the first base by the log of the second base to the first base. If you want to use VBA to get the log to base 2 of 7, you do this:

LogBaseTwoOfSeven = Log(7) / Log(2)

I must just say, before moving on, that I have been plagued by “Expression too complex” errors when I use this code in anger, and, indeed, with other code involving mathematical calculations, and have used a different mechanism, one that VBA does seem able to cope with, for the code you will see later on this page.

An Example

Working, manually, through the beginning of the dir stream here, will, I hope, make everything clear. As seen in the dump above, the stream begins with 0x01: the expected signature byte. This is followed, immediately, by the first compressed chunk, the first two bytes of which, the ‘header’, are 0x31 and 0xB2. I’m sure I explained the ‘little‑endian’ format on the previous page, so you should know that these two bytes represent the 16‑bit value 0xB231, the low order 12 bits of which are 0x231. In this case, with a single chunk, it is relatively easy to verify this from the view above. The stream length is 0x235, one more than the chunk length of 0x234 (0x231 plus 3). The high order four bits of the header are 0xB, or 0b1011: the high order “1” signifies that this chunk is compressed and the low order 0b011 is the fixed value it should be.
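The header arithmetic just described can be checked with a few lines of VBA. This is a sketch only: DecodeChunkHeader is a hypothetical routine, not part of the sample code, and it takes the two header bytes in the order in which they appear in the stream.

```vba
' Hypothetical helper: splits a two-byte chunk header into its three parts.
Sub DecodeChunkHeader(ByVal LowByte As Byte, ByVal HighByte As Byte, _
                      ByRef ChunkLength As Long, _
                      ByRef Signature As Long, _
                      ByRef IsCompressed As Boolean)

    Dim Header As Long

    ' Little-endian: the first byte in the stream is the low order one
    Header = LowByte + 256& * HighByte

    ' Low order 12 bits hold three less than the chunk length, header included
    ChunkLength = (Header And &HFFF&) + 3

    ' The next three bits should have the fixed value 0b011 (decimal 3)
    Signature = (Header And &H7000&) \ &H1000&

    ' The high order bit is the 'compressed' flag
    IsCompressed = ((Header And &H8000&) <> 0)

End Sub

Sub ShowExampleHeader()
    Dim ChunkLength As Long, Signature As Long, IsCompressed As Boolean
    DecodeChunkHeader &H31, &HB2, ChunkLength, Signature, IsCompressed
    Debug.Print Hex$(ChunkLength), Signature, IsCompressed
End Sub
```

Note the trailing “&” on the hex literals; without it, VBA treats &H8000 as a 16‑bit Integer with a negative value, which is easy to trip over.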

After the chunk header comes the first Token Sequence. The first byte of the token sequence is 0x80, which, in binary, is 0b10000000. When reading bits from a byte it is usual to work from low order to high order, that is, right to left, so the eight flags represented by this byte are 0, 0, 0, 0, 0, 0, 0, and 1. This means that the first seven tokens of this sequence are literal tokens, single bytes to be copied (the actual bytes, here, are 0x01, 0x00, 0x04, 0x00, 0x00, 0x00, and 0x03), and the eighth token is a copy token, two bytes (0x00 and 0x30) to be interpreted. The interpretation depends on the length of the decompressed chunk so far, and, so far, it is:

000000   01 00 04 00  00 00 03

Just seven bytes. The logarithm to base 2 of 7 is a bit less than 3. You saw, above, how the log, to base 2, of 8 is exactly 3, and as 7 is a little less than 8, so the log of 7 is a little less than the log of 8. If you remember from my description, all you want is the smallest possible integer (whole number) that is at least as large as the log; given a log of 2 and a bit, that whole number is, I hope you can see, 3. Again, if you remember, the number you want is subject to a constraint of not being less than 4, so the actual number of bits of the Copy Token used for the offset code in this case, is 4.
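If you would rather let VBA do the counting, here is a minimal sketch of the rule, with the limits of 4 and 12 built in. It avoids logs altogether by doubling a limit until it is no longer less than the length; OffsetBitCount is a hypothetical name, not something from the sample code.

```vba
' Hypothetical helper: the number of bits of a Copy Token used for the offset
' code, given the length, so far, of the decompressed chunk.
Function OffsetBitCount(ByVal DecompLen As Long) As Long
    Dim Limit As Long

    OffsetBitCount = 4          ' never less than 4 bits ...
    Limit = 16                  ' ... which cope with lengths up to 2 ^ 4

    Do While Limit < DecompLen And OffsetBitCount < 12
        Limit = Limit * 2       ' one more bit doubles the length covered
        OffsetBitCount = OffsetBitCount + 1
    Loop
End Function
```

For a length of 7 this gives 4 and, for a length of 17, it gives 5, because 17 is more than 16 (2 to the power 4) but not more than 32 (2 to the power 5).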

The two bytes of the Copy Token were 0x00 and 0x30, a 16‑bit value of 0x3000. You now know that 4 bits of this are used for the offset, and the remaining 12 for the length code. Despite what I said earlier about usually reading bits from right to left, it is the leftmost (high order) bits of the copy token that are taken as the offset code, and the rightmost, or low order, ones that make up the length code. The first 4 bits are 0b0011 (a value of 3, representing an offset, one greater, of 4 bytes), and the remaining 12 bits are 0b000000000000 (a value of 0, representing a length 3 bytes greater, of 3). The token tells you to go back four bytes and copy three bytes from there. Doing this gives a decompressed chunk that now looks like this, the last three bytes being the copied ones:

000000   01 00 04 00  00 00 03 00   00 00
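The unscrambling of a Copy Token can be sketched in a few lines. SplitCopyToken is a hypothetical routine of my own; it assumes that you already know how many bits are used for the offset code, and it uses division and remainder in place of bit masks, which comes to the same thing for a positive 16‑bit value.

```vba
' Hypothetical helper: splits a 16-bit Copy Token into an offset and a length,
' given the number of (high order) bits used for the offset code.
Sub SplitCopyToken(ByVal Token As Long, ByVal OffsetBits As Long, _
                   ByRef CopyOffset As Long, ByRef CopyLength As Long)

    Dim LengthRange As Long     ' 2 to the power of the number of length bits
    Dim ndx         As Long

    LengthRange = 1
    For ndx = 1 To 16 - OffsetBits
        LengthRange = LengthRange * 2
    Next ndx

    ' High order bits: the offset code, one less than the offset
    CopyOffset = Token \ LengthRange + 1

    ' Low order bits: the length code, three less than the length
    CopyLength = (Token Mod LengthRange) + 3

End Sub

Sub ShowFirstCopyToken()
    Dim CopyOffset As Long, CopyLength As Long
    ' The trailing & keeps the literal a (positive) Long
    SplitCopyToken &H3000&, 4, CopyOffset, CopyLength
    Debug.Print CopyOffset, CopyLength
End Sub
```

For the token 0x3000, with 4 offset bits, this gives an offset of 4 and a length of 3, as worked out by hand above.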

Phew! I have tried to explain everything in detail; I hope I have succeeded. From here on it should be plain sailing. Going back to the compressed chunk, the next token sequence begins with a flag byte of 0x2A, equal, in binary, to 0b00101010. First a simple literal token: copy the next byte (0x02) and the decompressed stream becomes:

000000   01 00 04 00  00 00 03 00   00 00 02

Next a copy token, 0x9002. The decompressed chunk is now 11 bytes long but the number of bits for the offset code is still subject to the minimum of 4. The offset code is 9, so the offset is 10 bytes, and the length code is 2, giving a number of bytes to copy of 5. Here is the result, the last five bytes being the copied ones:

000000   01 00 04 00  00 00 03 00   00 00 02 00  04 00 00 00

Another literal token (0x09), is followed by another copy token of 0x7000. The length of the decompressed chunk is now 17 bytes, so the number of bits dedicated to the offset code is 5, because log base 2 of 17 is 4 and a bit. The first five bits of the copy token are 0b01110, a decimal value of 14, indicating an offset of 15; the remaining bits are all zero, indicating a length of 3. Adding the literal, and then copying the appropriate bytes, extends the result, so far, to:

000000   01 00 04 00  00 00 03 00   00 00 02 00  04 00 00 00
000010   09 04 00 00

The next literal token is 0x14, and the copy token that follows is of 0x4806. The length of the decompressed chunk is now 21 bytes, but the number of bits dedicated to the offset code is still 5; the first five bits of the copy token are 0b01001, a decimal value of 9, indicating an offset of 10, and the remaining bits (0b00000000110) have a value of 6, indicating a length of 9. Adding this literal, and then copying the nine bytes, gives:

000000   01 00 04 00  00 00 03 00   00 00 02 00  04 00 00 00
000010   09 04 00 00  14 00 04 00   00 00 09 04  00 00
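The steps so far can be replayed with a short sketch, using the offsets and lengths already worked out by hand. AppendLiteral, AppendCopy and ReplayExample are hypothetical names, not part of the sample code. The copy is done one byte at a time, on the assumption (true of this kind of compression in general) that a copy may run on into bytes that it has, itself, only just written.

```vba
' Hypothetical helper: adds one literal byte to the end of the buffer.
Sub AppendLiteral(ByRef Buffer() As Byte, ByRef UsedLen As Long, _
                  ByVal Value As Byte)
    Buffer(UsedLen) = Value
    UsedLen = UsedLen + 1
End Sub

' Hypothetical helper: copies CopyLength bytes, starting CopyOffset bytes back
' from the current end of the buffer, a byte at a time.
Sub AppendCopy(ByRef Buffer() As Byte, ByRef UsedLen As Long, _
               ByVal CopyOffset As Long, ByVal CopyLength As Long)
    Dim ndx As Long
    For ndx = 1 To CopyLength
        Buffer(UsedLen) = Buffer(UsedLen - CopyOffset)
        UsedLen = UsedLen + 1
    Next ndx
End Sub

' Replays the worked example and returns the result as a string of hex bytes.
Function ReplayExample() As String
    Dim Buffer(0 To 31)     As Byte
    Dim UsedLen             As Long
    Dim Literals            As Variant
    Dim ndx                 As Long

    ' First token sequence: seven literals and a copy token
    Literals = Array(&H1, &H0, &H4, &H0, &H0, &H0, &H3)
    For ndx = 0 To 6
        AppendLiteral Buffer, UsedLen, Literals(ndx)
    Next ndx
    AppendCopy Buffer, UsedLen, 4, 3

    ' Second token sequence, as far as the third copy token
    AppendLiteral Buffer, UsedLen, &H2
    AppendCopy Buffer, UsedLen, 10, 5
    AppendLiteral Buffer, UsedLen, &H9
    AppendCopy Buffer, UsedLen, 15, 3
    AppendLiteral Buffer, UsedLen, &H14
    AppendCopy Buffer, UsedLen, 10, 9

    For ndx = 0 To UsedLen - 1
        ReplayExample = ReplayExample & Right$("0" & Hex$(Buffer(ndx)), 2) & " "
    Next ndx
    ReplayExample = Trim$(ReplayExample)
End Function
```

Typing ? ReplayExample in the Immediate Window should list the same thirty bytes as shown above.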

To finish this token sequence there are two literal tokens, 0x03 and 0x00, to be added to the decompressed chunk. If you really felt inspired, you could continue like this all the way to the end, but I rather suspect you are more interested in the end result than the laborious process, and, so, I have done it for you, and this is that end result:

000000   01 00 04 00  00 00 03 00   00 00 02 00  04 00 00 00   ................
000010   09 04 00 00  14 00 04 00   00 00 09 04  00 00 03 00   ................
000020   02 00 00 00  E4 04 04 00   07 00 00 00  50 72 6F 6A   ...........Proj
000030   65 63 74 05  00 00 00 00   00 40 00 00  00 00 00 06   ect......@......
000040   00 00 00 00  00 3D 00 00   00 00 00 07  00 04 00 00   .....=..........
000050   00 00 00 00  00 08 00 04   00 00 00 00  00 00 00 09   ................
000060   00 04 00 00  00 35 CC 87   51 10 00 0C  00 00 00 00   .....5.Q.......
000070   00 3C 00 00  00 00 00 16   00 06 00 00  00 73 74 64   .<...........std
000080   6F 6C 65 3E  00 0C 00 00   00 73 00 74  00 64 00 6F   ole>.....s.t.d.o
000090   00 6C 00 65  00 0D 00 68   00 00 00 5E  00 00 00 2A   .l.e...h...^...*
0000A0   5C 47 7B 30  30 30 32 30   34 33 30 2D  30 30 30 30   \G{00020430-0000
0000B0   2D 30 30 30  30 2D 43 30   30 30 2D 30  30 30 30 30   -0000-C000-00000
0000C0   30 30 30 30  30 34 36 7D   23 32 2E 30  23 30 23 43   0000046}#2.0#0#C
0000D0   3A 5C 57 69  6E 64 6F 77   73 5C 53 79  73 74 65 6D   :\Windows\System
0000E0   33 32 5C 73  74 64 6F 6C   65 32 2E 74  6C 62 23 4F   32\stdole2.tlb#O
0000F0   4C 45 20 41  75 74 6F 6D   61 74 69 6F  6E 00 00 00   LE Automation...

000100   00 00 00 16  00 06 00 00   00 4E 6F 72  6D 61 6C 3E   .........Normal>
000110   00 0C 00 00  00 4E 00 6F   00 72 00 6D  00 61 00 6C   .....N.o.r.m.a.l
000120   00 0E 00 20  00 00 00 09   00 00 00 2A  5C 43 4E 6F   ... .......*\CNo
000130   72 6D 61 6C  09 00 00 00   2A 5C 43 4E  6F 72 6D 61   rmal....*\CNorma
000140   6C B8 7D 3F  51 04 00 16   00 06 00 00  00 4F 66 66   l}?Q........Off
000150   69 63 65 3E  00 0C 00 00   00 4F 00 66  00 66 00 69   ice>.....O.f.f.i
000160   00 63 00 65  00 0D 00 9E   00 00 00 94  00 00 00 2A   .c.e...........*
000170   5C 47 7B 32  44 46 38 44   30 34 43 2D  35 42 46 41   \G{2DF8D04C-5BFA

000180   2D 31 30 31  42 2D 42 44   45 35 2D 30  30 41 41 30   -101B-BDE5-00AA0
000190   30 34 34 44  45 35 32 7D   23 32 2E 30  23 30 23 43   044DE52}#2.0#0#C
0001A0   3A 5C 50 72  6F 67 72 61   6D 20 46 69  6C 65 73 5C   :\Program Files\
0001B0   43 6F 6D 6D  6F 6E 20 46   69 6C 65 73  5C 4D 69 63   Common Files\Mic
0001C0   72 6F 73 6F  66 74 20 53   68 61 72 65  64 5C 4F 46   rosoft Shared\OF
0001D0   46 49 43 45  31 35 5C 4D   53 4F 2E 44  4C 4C 23 4D   FICE15\MSO.DLL#M
0001E0   69 63 72 6F  73 6F 66 74   20 4F 66 66  69 63 65 20   icrosoft Office 
0001F0   31 35 2E 30  20 4F 62 6A   65 63 74 20  4C 69 62 72   15.0 Object Libr

000200   61 72 79 00  00 00 00 00   00 0F 00 02  00 00 00 02   ary.............
000210   00 13 00 02  00 00 00 15   80 19 00 0C  00 00 00 54   ...............T
000220   68 69 73 44  6F 63 75 6D   65 6E 74 47  00 18 00 00   hisDocumentG....
000230   00 54 00 68  00 69 00 73   00 44 00 6F  00 63 00 75   .T.h.i.s.D.o.c.u
000240   00 6D 00 65  00 6E 00 74   00 1A 00 0C  00 00 00 54   .m.e.n.t.......T
000250   68 69 73 44  6F 63 75 6D   65 6E 74 32  00 18 00 00   hisDocument2....
000260   00 54 00 68  00 69 00 73   00 44 00 6F  00 63 00 75   .T.h.i.s.D.o.c.u
000270   00 6D 00 65  00 6E 00 74   00 1C 00 00  00 00 00 48   .m.e.n.t.......H

000280   00 00 00 00  00 31 00 04   00 00 00 0D  03 00 00 1E   .....1..........
000290   00 04 00 00  00 00 00 00   00 2C 00 02  00 00 00 11   .........,......
0002A0   1D 22 00 00  00 00 00 2B   00 00 00 00  00 19 00 07   .".....+........
0002B0   00 00 00 4D  6F 64 75 6C   65 31 47 00  0E 00 00 00   ...Module1G.....
0002C0   4D 00 6F 00  64 00 75 00   6C 00 65 00  31 00 1A 00   M.o.d.u.l.e.1...
0002D0   07 00 00 00  4D 6F 64 75   6C 65 31 32  00 0E 00 00   ....Module12....
0002E0   00 4D 00 6F  00 64 00 75   00 6C 00 65  00 31 00 1C   .M.o.d.u.l.e.1..
0002F0   00 00 00 00  00 48 00 00   00 00 00 31  00 04 00 00   .....H.....1....

000300   00 E1 57 00  00 1E 00 04   00 00 00 00  00 00 00 2C   .W............,
000310   00 02 00 00  00 A8 D2 21   00 00 00 00  00 2B 00 00   .....!.....+..
000320   00 00 00 10  00 00 00 00   00                         .........

You probably still can’t make much sense of this, but it is easier to read than the compressed version. I will explain all the contents in due course: just be patient! You have now seen an explanation, and an example. As I’m quite sure you realise, mad as I may be, I did not decompress that whole stream by hand. VBA may not be the best language for the job, but it can do it, and you have it at your fingertips, so now it’s time to find out how to use VBA for this task.

Some VBA Code

Here is a routine based on the notes you have just read. There are some comments in it, but they, largely, just repeat what you already know. Place it somewhere in the module - at the end is as good as anywhere.

Sub DecompressContainer(ByRef CompressedContainer() As Byte, _
                        ByRef Compndx As Long, _
                        ByRef DeCompressedData() As Byte)

    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' This routine receives a Stream as a byte array, and an index into it, which '
    ' points to the start of a Compressed Container. There is nothing to indicate '
    ' where the container ends, so the only possible assumption, that it runs all '
    ' the way to the end of the Stream, is taken. The routine must also be passed '
    ' an empty byte array, which it will resize and fill with decompressed data.  '
    ' It is done this way to avoid the necessity of copying afterwards.           '
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    
    Dim Decompndx               As Long
    Dim DecompLen               As Long
    
    Dim ChunkHeader             As Long
    Dim ChunkSignature          As Long
    Dim ChunkFlag               As Long
    Dim ChunkSize               As Long
    Dim ChunkEnd                As Long

    Dim BitFlags                As Byte
    
    Dim Token                   As Long
    Dim BitCount                As Long
    Dim BitMask                 As Long
    Dim CopyLength              As Long
    Dim CopyOffset              As Long
    
    Dim ndx                     As Long
    Dim ndx2                    As Long
    
    Dim PowerOf2(0 To 16)       As Long
    
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' A d****d irritating bit of initialisation. I have been having no end of     '
    ' trouble with "Expression too complex" errors, always seeming to be when I   '
    ' use exponentiation. To avoid them I pre-calculate the values and index into '
    ' the resulting array. Perchance this is actually an unintended optimisation, '
    ' although it would be better done, once, rather than every time, here.       '
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    
    PowerOf2(0) = 1
    For ndx = 1 To UBound(PowerOf2)
        PowerOf2(ndx) = PowerOf2(ndx - 1) * 2
    Next
    
    Do  ' Once per chunk

        If (Not DeCompressedData) = True Then
            ReDim DeCompressedData(0 To 4095)
            Decompndx = 0
        Else
            ReDim Preserve DeCompressedData(UBound(DeCompressedData) + 4096)
        End If

        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
        ' The 16-bit chunk header contains the length of the chunk, and four flag     '
        ' bits. The high order bit is a flag (0 = uncompressed, 1 = compressed), and  '
        ' the next three bits must be 0b011.                                          '
        '                                                                             '
        ' VBA really isn't the language for bit twiddling and I am not going to fully '
        ' explain the code; you'll have to trust me when I say that these statements  '
        ' grab the desired bits and right align them!                                 '
        '                                                                             '
        ' If the Chunk Signature does not have a value of 3 (0b011), the chunk is     '
        ' invalid; the possibility of this is not considered in this routine.         '
        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
        
        ChunkHeader = CompressedContainer(Compndx) + _
                      256& * CompressedContainer(Compndx + 1)
        Compndx = Compndx + 2
    
        ChunkSize = (ChunkHeader And &HFFF)
        ChunkEnd = Compndx + ChunkSize
        ChunkSignature = (ChunkHeader And &H7000) \ &H1000&
        ChunkFlag = (ChunkHeader And &H8000&) \ &H8000&

        If ChunkFlag = 0 Then
        
            ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
            ' This just copies 4096 bytes from input to output. I would normally use the  '
            ' RtlMoveMemory API, but prefer not to use it in demonstration code, so this  '
            ' is a simple loop that copies a byte at a time. I have never seen a chunk    '
            ' that is not compressed, so am not unduly concerned about the inefficiency.  '
            ' I am - a little - concerned about what might happen when there are less     '
            ' than 4096 bytes but the compression routine decides not to compress; the    '
            ' documentation is silent on the issue so, maybe, it can't happen.            '
            ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '

            For ndx2 = 0 To 4095
                DeCompressedData(Decompndx + ndx2) = CompressedContainer(Compndx + ndx2)
            Next ndx2
            Compndx = Compndx + 4096
            Decompndx = Decompndx + 4096
    
        Else
        
            Do
            
                ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                ' The data in a chunk is a series of what are called Token Sequences.     '
                ' Each Token Sequence consists of a byte to be viewed as eight separate   '
                ' flag bits, followed by eight elements, the type of each being indicated '
                ' by the individual flag bits. If a flag bit is 0, the corresponding      '
                ' element is a single byte to be taken 'as is'; if the flag bit is 1, the '
                ' element is a two byte code, which, after being unscrambled, gives the   '
                ' position and length of a sequence earlier in the (decompressed) stream, '
                ' which must be copied.                                                   '
                ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
            
                BitFlags = CompressedContainer(Compndx)
                Compndx = Compndx + 1
        
                For ndx = 0 To 7
            
                    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                    ' The final token sequence is not padded, and the chunk could end at  '
                    ' any point. Loop control, therefore, is here, rather than at the end.'
                    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    
                    If Compndx > ChunkEnd Then Exit Do
                
                    If (BitFlags And PowerOf2(ndx)) = 0 Then

                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' A Literal Token: just copy the single-byte literal.             '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    
                        DeCompressedData(Decompndx) = CompressedContainer(Compndx)
                        Compndx = Compndx + 1
                        Decompndx = Decompndx + 1

                    Else
                    
                        Token = CompressedContainer(Compndx) + _
                                CompressedContainer(Compndx + 1) * 256&
                        Compndx = Compndx + 2
                        
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' A 16-bit Token consists of an offset (to the left, from the current '
                        ' position in the decompressed data), and a length (the number of     '
                        ' bytes to copy). The number of bits used for the offset (and, thus,  '
                        ' those used for the length) is the smallest integer that is not less '
                        ' than the logarithm to base 2 of the length, so far, of the current  '
                        ' decompressed chunk subject to it never being less than 4 or greater '
                        ' than 12. Rather than use logs, this little loop has the constraints '
                        ' built in and stops at the appropriate point. As each chunk (bar the '
                        ' last) decompresses to exactly 4096 bytes, the length so far of the  '
                        ' current decompressed chunk is as shown.                             '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                    
                        DecompLen = Decompndx Mod 4096
        
                        For BitCount = 4 To 11
                            If DecompLen <= PowerOf2(BitCount) Then Exit For
                        Next BitCount

                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' Having determined the number of bits dedicated to each component of '
                        ' the token, some bit twiddling is needed to extract the numbers. The '
                        ' offset first, then the length. No further explanation; work it out! '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '

                        BitMask = PowerOf2(16) - PowerOf2(16 - BitCount)
                        CopyOffset = (Token And BitMask) \ PowerOf2(16 - BitCount) + 1

                        BitMask = PowerOf2(16 - BitCount) - 1
                        CopyLength = (Token And BitMask) + 3
                        
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
                        ' Given the offset and the length, the copy can be done.              '
                        ' Note that the source and target may overlap.                        '
                        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    
                        For ndx2 = 0 To CopyLength - 1
                            DeCompressedData(Decompndx + ndx2) _
                                    = DeCompressedData(Decompndx - CopyOffset + ndx2)
                        Next ndx2
                        Decompndx = Decompndx + CopyLength
                    
                    End If  ' Literal Token or Copy Token
                    
                Next    ' Token
            
            Loop    ' For next Token Sequence
        
        End If  ' Was chunk compressed?
        
        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
        ' If not yet at the end of the Stream, the assumption is that there is        '
        ' another chunk: there is no possible information to the contrary.            '
        ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '

        If Compndx > UBound(CompressedContainer) Then Exit Do
    
    Loop
    
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    ' Only after having finished decompressing the final chunk is the final size  '
    ' known. Now the output array can be correctly sized.                         '
    ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' ' '
    
    ReDim Preserve DeCompressedData(0 To Decompndx - 1)

End Sub
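
For anyone who would rather see the bit twiddling worked through than work it out, here is a minimal sketch that decodes a single copy token in isolation and then applies it to a tiny buffer. It is an illustration only, not part of the routine above: the names DecodeCopyToken, DemoCopyToken and Sample are made up for the purpose, and 2 ^ n is used in place of PowerOf2 so that the sketch stands alone in an empty module.

' Hypothetical helper, not part of the article's code: decodes one copy
' token, given the number of bytes decompressed so far in the chunk.
Sub DecodeCopyToken(ByVal Token As Long, ByVal DecompLen As Long, _
                    ByRef CopyOffset As Long, ByRef CopyLength As Long)

    Dim BitCount    As Long
    Dim LengthBits  As Long

    ' Same rule as the main routine: 4 to 12 bits for the offset.
    For BitCount = 4 To 11
        If DecompLen <= 2 ^ BitCount Then Exit For
    Next BitCount

    LengthBits = 16 - BitCount

    ' The high bits hold (offset - 1); the low bits hold (length - 3).
    CopyOffset = (Token \ 2 ^ LengthBits) + 1
    CopyLength = (Token And (2 ^ LengthBits - 1)) + 3

End Sub

Sub DemoCopyToken()

    Dim Sample(0 To 7)  As Byte
    Dim CopyOffset      As Long
    Dim CopyLength      As Long
    Dim ndx             As Long

    ' Two literals, "ab", are assumed to be in the output already.
    Sample(0) = Asc("a")
    Sample(1) = Asc("b")

    ' With 2 bytes so far, 4 bits hold the offset: offset 2, length 6.
    Call DecodeCopyToken(&H1003&, 2, CopyOffset, CopyLength)
    Debug.Print CopyOffset, CopyLength

    ' Byte-by-byte copy: source and target overlap, so "ab" repeats.
    For ndx = 0 To CopyLength - 1
        Sample(2 + ndx) = Sample(2 - CopyOffset + ndx)
    Next ndx
    Debug.Print StrConv(Sample, vbUnicode)      ' abababab

    ' The same 16 bits, 3000 bytes into a chunk: offset 257, length 6.
    Call DecodeCopyToken(&H1003&, 3000, CopyOffset, CopyLength)
    Debug.Print CopyOffset, CopyLength

End Sub

Running DemoCopyToken and looking in the Immediate window shows two things: the same two bytes mean different things depending on how far into the chunk they occur, and a copy whose length exceeds its offset simply repeats the bytes it has just written, which is why the copy is done one byte at a time.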

To run the code, there are a couple of minor changes required to the driver module. Firstly, two new variables must be declared, one an index into the stream, the other a container for the decompressed data. Although the declarations can go anywhere, I prefer to follow convention and place them at the start of their procedure, so change:

    Dim Stream()            As Byte

... to this:

    Dim Stream()            As Byte
    Dim DeCompressedData()  As Byte

    Dim Compndx             As Long

Declarations in place, you just need two more lines to actually run the new routine. After the Stream = ExtractStream("dir") line, add a line to set the index variable to 1 (the position after the Signature 0x01 byte), and a call to the new routine:

    Compndx = 1
    Call DecompressContainer(Stream, Compndx, DeCompressedData)

If you do this, you won’t see anything dramatic, but you will have decompressed the stream; you will need to read my next page to understand what it contains. The next page (when it has been written!) will build on this code. To make things as easy as possible for you, I have taken the “Arbitrary.docm” document, available for download at the start of this article, added the extra code detailed here, and saved it as “Decompress.docm”. I have zipped this up with the same “vbaProject.bin” file as before, and you can download it from here: [link to the file on this site at files/DecompressSample.zip]
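
Although the decompression itself produces nothing visible, a quick sanity check is easy to add. The following is a minimal sketch, assuming the decompressed array is zero-based (as the final ReDim in DecompressContainer makes it); ShowDecompressed is a hypothetical name, not something from the sample files.

' Hypothetical helper, not part of the article's code: prints the size
' of the decompressed data and up to its first 16 bytes in hex.
Sub ShowDecompressed(ByRef Data() As Byte)

    Dim ndx     As Long
    Dim HexDump As String

    Debug.Print "Decompressed size: " & (UBound(Data) + 1) & " bytes"

    For ndx = 0 To 15
        If ndx > UBound(Data) Then Exit For
        HexDump = HexDump & Right$("0" & Hex$(Data(ndx)), 2) & " "
    Next ndx

    Debug.Print HexDump

End Sub

To use it, add one line to the driver, immediately after the call to DecompressContainer, and look in the Immediate window after running:

    Call ShowDecompressed(DeCompressedData)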