Microsoft Office File Format Internals: An MS Office document is organized internally using OLE Structured Storage, a systematic arrangement of the components that make up the document. Each document has a root component which contains storage and stream components. OLE Structured Storage is analogous to a file system: 'storage' components are equivalent to directories and 'stream' components are equivalent to files. A storage component may exist as a standalone component. Each storage component may contain one or more sub-storage components and stream components, and the root component may also hold stream components directly.
The actual implementation details are defined in The Windows Compound Binary File Format specification.
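To make the on-disk layout a little more concrete, here is a minimal sketch of decoding a few fixed-offset fields from the 512-byte compound file header using only the Ruby standard library. The function names, the hash keys and the synthetic header are made up for illustration and are not part of any library. The offsets used are the signature at 0, the versions, byte order and sector shifts at 24 to 33, and the sector counts from 40, as I understand them from the compound file specification.

```ruby
# Sketch: decode a few fixed-offset fields of a compound file header.
# Names and hash keys are illustrative, not taken from any library.
CFB_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

def parse_cfb_header(data)
  header = data.b.byteslice(0, 512)
  raise ArgumentError, "header too short" if header.nil? || header.bytesize < 512
  raise ArgumentError, "not a compound file" unless header.byteslice(0, 8) == CFB_SIGNATURE

  # Offsets 24..33: minor version, major version, byte order mark,
  # sector shift, mini sector shift (16-bit little-endian each).
  minor, major, byte_order, sector_shift, mini_shift =
    header.byteslice(24, 10).unpack("v5")
  # Offsets 40..51: directory sector count, FAT sector count,
  # first directory sector (32-bit little-endian each).
  _dir_sectors, fat_sectors, first_dir_sector =
    header.byteslice(40, 12).unpack("V3")

  {
    major_version: major,
    minor_version: minor,
    byte_order: byte_order,
    sector_size: 1 << sector_shift,
    mini_sector_size: 1 << mini_shift,
    fat_sectors: fat_sectors,
    first_dir_sector: first_dir_sector
  }
end

# Build a synthetic version 3 header for demonstration (not a real document).
def sample_cfb_header
  header = CFB_SIGNATURE.dup
  header << ("\x00".b * 16)                       # CLSID, unused
  header << [0x003E, 3, 0xFFFE, 9, 6].pack("v5")  # versions, byte order, shifts
  header << ("\x00".b * 6)                        # reserved
  header << [0, 1, 1].pack("V3")                  # dir sectors, FAT sectors, first dir sector
  header << ("\x00".b * (512 - header.bytesize))  # remaining fields zeroed
  header
end

info = parse_cfb_header(sample_cfb_header)
puts "sector size: #{info[:sector_size]}, major version: #{info[:major_version]}"
```

For a real file, the same function could be fed the first 512 bytes read in binary mode, for example `File.binread("sample2.doc", 512)`.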
Most of my research on the MS Office file format was conducted using the Ruby OLE library, which allows easy, abstract read-write access to the various streams and storages packed in the internal OLE structures. Install the ruby-ole gem before trying out any of the examples below.
Examples:
Dumping the OLE structure of a given Word document:
user@sigsegv$ oletool --tree sample2.doc
- #<Dirent:"Root Entry">
|- #<Dirent:"1Table" size=34907 data="^\004\032\000\022...">
|- #<Dirent:"\001CompObj" size=121 data="\001\000\376\377\003...">
|- #<Dirent:"MsoDataStore">
| \- #<Dirent:"F\303\223\303\216\303\226U\303\2261\303\2305U4\303\217\303\2201BEKP\303\235N\303\203\303\200==">
| |- #<Dirent:"Item" size=216 data="<b:So...">
| \- #<Dirent:"Properties" size=341 data="<?xml...">
|- #<Dirent:"WordDocument" size=15429 data="\354\245\301\000}...">
|- #<Dirent:"\005SummaryInformation" size=4096 data="\376\377\000\000\005...">
\- #<Dirent:"\005DocumentSummaryInformation" size=4096 data="\376\377\000\000\005...">
user@sigsegv$
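The listing above maps directly onto the directory/file analogy. As a toy illustration only (this does not parse a real file), the same tree can be modelled as nested Ruby hashes, where a Hash stands for a storage and an Integer for a stream's size in bytes. The names and sizes are copied from the dump, except for the unreadable sub-storage name under MsoDataStore, which is replaced by the placeholder "item0". The function name is made up.

```ruby
# Toy model of the dump: Hash = storage (directory), Integer = stream size (file).
# "item0" is a placeholder for the unreadable sub-storage name in the listing.
SAMPLE_TREE = {
  "1Table" => 34_907,
  "\x01CompObj" => 121,
  "MsoDataStore" => {
    "item0" => { "Item" => 216, "Properties" => 341 }
  },
  "WordDocument" => 15_429,
  "\x05SummaryInformation" => 4096,
  "\x05DocumentSummaryInformation" => 4096
}.freeze

# Return [path, size] pairs for every stream, descending into storages
# the same way a recursive directory listing would.
def stream_paths(storage, prefix = "")
  storage.flat_map do |name, child|
    path = "#{prefix}/#{name}"
    child.is_a?(Hash) ? stream_paths(child, path) : [[path, child]]
  end
end

stream_paths(SAMPLE_TREE).each do |path, size|
  puts "#{size}\t#{path.inspect}"
end
```

Paths built this way have the same shape as the "/WordDocument" style paths accepted by the Ruby OLE library's file interface.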
Sample code to display the size of the WordDocument stream inside a doc file:
#!/usr/bin/ruby
require 'rubygems'
require 'ole/storage'

# Open the compound document and read the whole WordDocument stream into memory.
ole = Ole::Storage.new("sample2.doc")
buf = ole.file.read("/WordDocument")
ole.close

puts "WordDocument stream size: #{buf.size}"
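The buffer read above starts with the File Information Block (FIB). As a rough sketch, assuming the Word 97-2003 binary layout (wIdent at offset 0, nFib at offset 2, fcMin and fcMac at offsets 24 and 28), the leading fields can be decoded with String#unpack and no extra libraries. The function names, hash keys and the synthetic header are made up for illustration.

```ruby
# Sketch: decode leading FIB fields from the start of a WordDocument stream.
# Offsets assume the Word 97-2003 binary layout; names are illustrative.
WORD_MAGIC = 0xA5EC

def parse_fib_header(buf)
  buf = buf.b
  raise ArgumentError, "buffer too short for FIB header" if buf.bytesize < 32

  w_ident, n_fib = buf.byteslice(0, 4).unpack("v2")
  unless w_ident == WORD_MAGIC
    raise ArgumentError, format("unexpected wIdent 0x%04X", w_ident)
  end

  # fcMin / fcMac: start and end offsets of the text within the stream.
  fc_min, fc_mac = buf.byteslice(24, 8).unpack("V2")
  { w_ident: w_ident, n_fib: n_fib, fc_min: fc_min, fc_mac: fc_mac }
end

# Build a synthetic 32-byte header (not taken from a real document).
def sample_fib_header(fc_min, fc_mac)
  [WORD_MAGIC, 0x00C1].pack("v2") + ("\x00".b * 20) + [fc_min, fc_mac].pack("V2")
end

fib = parse_fib_header(sample_fib_header(0x600, 0x6A0))
puts format("nFib=0x%04X fcMin=%d fcMac=%d", fib[:n_fib], fib[:fc_min], fib[:fc_mac])
```

The first four bytes shown for the WordDocument stream in the dump (\354\245\301\000) are this same magic value followed by nFib 0x00C1, both little-endian.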
Sample code to display only the text part of a doc file:
require 'rubygems'
require 'ole/storage'
require 'lib/fib'   # local helper that defines Word::FIB

if __FILE__ == $0
  if ARGV.size != 1
    warn "Usage: #{$0} <file.doc>"
    exit 1
  end

  ole = Ole::Storage.new(ARGV[0])
  docbuf = ole.file.read("/WordDocument")
  fib = Word::FIB.load(ole)

  # fcMin and fcMac bound the text inside the WordDocument stream.
  off_start = fib.fcMin
  off_end = fib.fcMac
  puts "Text offset start: #{off_start}"
  puts "Text offset end: #{off_end}"

  text = docbuf[off_start, off_end - off_start]
  puts text.inspect
end
Reverse Engineering a Microsoft Office Patch: Patches for the Microsoft Office suite, as distributed by Microsoft, usually consist of self-extracting MSP or MSI packages, and extracting them is not exactly the same as extracting other patches.
Step1:
After fetching the patch installer executable, the first step is to have the installer extract the MSI/MSP installer files:
officexp-KB-XXX.exe /C /T:e:\ms08-042-extracted\
The above command extracts the actual patch installer files to the e:\ms08-042-extracted\ directory. Among the extracted files there will be an MSI or MSP file, which is the main patch installer.
Step2:
MSI/MSP files are special OLE-structured installer packages. Details can be found here and here. There is also a utility for extracting MSI/MSP files here.
msix.exe WINWORD.msp /out:e:\ms08-042-extracted\ /ext
This should extract all the table data and other relevant information, along with a CAB file containing the actual patch binaries we are interested in. Find the CAB file among the extracted files and extract it normally using WinZip, WinRAR, etc., and BANG!
Bug Hunting: A good number of bugs, including theoretical security vulnerabilities, were discovered using very trivial bit/byte alteration fuzzing of various structures, including the File Information Block (FIB) in Word documents, random structures in the TableStream, etc. A number of structures in the file formats, particularly the Word file format, have sizes that are themselves read from the document. These areas can be good vectors for fuzzing, particularly where multiple structures are loaded from the file with size values read from the file itself.
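The two ideas above, random bit/byte alteration inside a chosen structure and tampering with size fields stored in the file, can be sketched roughly as follows. This is a minimal illustration rather than the fuzzer actually used. The function names, the list of boundary values and the assumption that the size field is a 32-bit little-endian integer are all my own choices for the example.

```ruby
# Sketch of trivial mutation fuzzing over a region of a document buffer.
# Names, parameters and boundary values are illustrative assumptions.

# Return a copy of data with `count` single-bit flips at random offsets
# inside `range` (for example the byte range occupied by the FIB).
def flip_bits(data, range, count, rng = Random.new)
  out = data.b # binary copy; the input buffer is left untouched
  offsets = range.to_a
  count.times do
    pos = offsets[rng.rand(offsets.size)]
    out.setbyte(pos, out.getbyte(pos) ^ (1 << rng.rand(8)))
  end
  out
end

# Boundary values that tend to upset size handling in parsers.
SIZE_BOUNDARIES = [0, 1, 0x7FFF_FFFF, 0x8000_0000, 0xFFFF_FFFF].freeze

# Return one copy of data per boundary value, with the 32-bit
# little-endian field at `offset` overwritten by that value.
def size_field_variants(data, offset)
  SIZE_BOUNDARIES.map do |value|
    out = data.b
    out[offset, 4] = [value].pack("V")
    out
  end
end

seed = "\x00".b * 64
mutated = flip_bits(seed, 24...32, 3, Random.new(1234))
changed = (0...seed.bytesize).select { |i| seed.getbyte(i) != mutated.getbyte(i) }
puts "changed offsets: #{changed.inspect}"
puts "size variants generated: #{size_field_variants(seed, 28).size}"
```

In practice each mutated buffer would be written back into the relevant stream of a copy of the seed document and opened in the target application. Passing a seeded Random makes a crashing case reproducible.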