Small Windows Executables
Chris Dragan
It looks like we have a chain reaction, since this article is a reaction
to SunmaN's article from Hugi#20 about 4K intros for Win32, which also
was a reaction to a reaction. *grin*
Usually we face the fact that our Win32 programs are enormous. Today
it is not a problem to write a small 4096-byte "Hello, world!" program
for Win32, which may surprise many Visual Basic programmers. However,
4096 bytes for a program that doesn't do anything is TOO MUCH!
Tools won't help, that available.
There's no linker that is able
to make an executable
with a size that's reasonable.
This little "poem" illustrates one thing, known from the very beginning:
when you want something done well - do it yourself.
To create a small Windows executable, we first have to learn the Windows
executable format, which is known as Portable Executable (PE). Naturally,
the portability of this format is questionable. We don't have too many
resources:
- The description of the PE format, released before the first Win32,
contains some useful information, though it is not sufficient.
You can find it at wotsit.org.
- A very well known file named winnt.h, distributed with every C(++)
compiler for Win32, contains the structures found in the PE headers,
with a minimal description. Just use your favourite editor's search
command and look for text "image format".
- Issue #2 of Assembly Journal (asmjournal.freeservers.com) also
contains an interesting article about a tiny PE executable.
Having these three resources we are able to create our own executable
that will be very small. The article in asmjournal shows a Win32 console
program that prints its command line. The program has only 192 bytes!!!!
Unfortunately this is only possible under WinNT4. The app loads under
Win2K, but since it calls some kernel routines by their fixed
WinNT4-specific addresses, it crashes.
So here we notice once again that WinNT is a very different OS from the
so-called consumer Windows. WinNT is much less restrictive than Win9x
when it comes to executables. Nevertheless we want to create small
executables that will work with Win9x, too!
Choosing an assembler
At this point we choose an assembler - it will be NASM (the Netwide Assembler).
This assembler allows us to hand-code a PE header with a minimal effort,
using only one directive and no linker.
If you haven't used NASM, it is worth knowing that it has
two (and ONLY two) uncomfortable limitations. The following 32-bit code:
        jmp     here
  here: add     ecx, 5
...assembles to these bytes:
  E9 00 00 00 00      jmp +00000000h
  81 C1 05 00 00 00   add ecx, 00000005h
To make NASM behave kindly, we have to code:
        jmp     short here
  here: add     ecx, byte 5
...and then we get:
  EB 00      jmp +00h
  83 C1 05   add ecx, +05h
As a side note, NASM has multiple advantages:
- it is free,
- it is simple as it has very, very few directives,
- it has a powerful preprocessor, not found in any other assembler,
- it provides you with total control over your code,
- it is extremely portable - you can use your code on any x86 OS.
But we aren't here to talk about NASM pros and cons... *grin*
First things first
How does a PE executable work? A PE executable, further referred to as PE,
consists of headers and sections. Headers contain data for the executable
loader, and sections contain the actual program.
In the file, the sections of a PE are aligned. This means that each
section starts at an address divisible by a number which is the base
of the alignment, and also each section has a size divisible by that
number. If it happens that a section is too short, i.e. its size is not
aligned, it is simply padded with zeroes to the alignment boundary.
For example if the required alignment is 32, and our section has only 24
bytes, additional 8 zeroes are added to the end of the section. The same
goes for the headers: if the headers in our PE are too short, they are
zero padded to the alignment boundary, so that the first section that
follows them is aligned.
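The padding rule above can be sketched in a few lines of C. This is only an illustration: the helper names align_up and padding_for are made up for this example, and the sketch assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next multiple of alignment.
   Hypothetical helper; alignment must be a nonzero power of two. */
uint32_t align_up(uint32_t size, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Number of zero bytes appended to reach the alignment boundary. */
uint32_t padding_for(uint32_t size, uint32_t alignment)
{
    return align_up(size, alignment) - size;
}
```

With the numbers from the text, padding_for(24, 32) gives the 8 zeroes mentioned above.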
The docs say that all sections have to be aligned, but practice shows that
the last section in a PE doesn't need an aligned size - therefore the
file size doesn't have to be aligned either.
Win9x requires that file alignment be 512 bytes (200h). This means that
all the headers together occupy no less than 512 bytes, or a multiple of
512, and also every section, except the last one, occupies a multiple of 512
bytes. And here we have the first difference between WinNT and Win9x: WinNT
doesn't have this restriction.
PE sections contain miscellaneous things. They may contain code, data,
imports (references to functions imported from other DLLs), exports,
resources, symbol tables, etc. No matter what the docs say and how the
available linkers behave, it is not required that these entities reside
in separate sections. For example, one can put code and data into one
section. One can even put everything into one section - and this is what
we want to do. As you may easily guess, doing this we gain a few bytes
otherwise lost on file alignment.
Somewhere in the PE headers (we will see exactly where later) there is a
number that tells where the PE will be loaded. A PE can be either a program
or a DLL (a DLL can be a library with routines, fonts, drivers, etc.).
For DLLs this load address is only a proposed location - the system
can load a DLL at a different location. But executables are (almost) always
loaded at this address. And for executables this address is usually 400000h
(4MB); using a different address than 400000h we risk our program being
loaded not at the address we want. This fixed nature of a program's load
address shows another difference between programs and DLLs - DLLs need
additional relocation tables that enable the system to relocate DLL code to
a different location than the one specified, while executables usually don't
have relocations.
I am not going to explain the mysteries of 32-bit flat addressing mode
here, but it is enough to know that each program has its own address space
in this mode; as you probably know, Win32 uses flat mode. A program calls
multiple routines that reside in DLLs. These DLLs are mapped into the
program's address space, and each DLL has a unique location - hence the need
for relocating DLLs, while the program can have a fixed load address. Each
module has a unique module handle, which sometimes needs to be passed
to system functions, like CreateWindowEx() for example. As a matter of
fact, this module handle is simply the module's load address; programs
needlessly call the GetModuleHandle() routine, which just returns their
load address, usually 400000h. Hence another optimisation for us: we won't
ever need to import and call the GetModuleHandle() function.
PEs aren't loaded linearly. The headers are loaded exactly at the load
address, but sections are further relocated. When being loaded, sections
are expanded and aligned to a greater value than file alignment. This
is called section alignment. The following table presents a set of
entities residing in an example PE, and how they get relocated.
What      | In File          | In Memory
          | Position | Size  | Position (RVA) | Size
Headers   | 0        | 200h  | 0              | 1000h
Section 1 | 200h     | 600h  | 1000h          | 1000h
Section 2 | 800h     | 0     | 2000h          | 7000h
Section 3 | 800h     | 121h  | 9000h          | 1000h
Again, the section size in the file is aligned to File Alignment, while
its size in memory after being loaded is aligned to Section Alignment.
In the above example the PE contains 3 sections: the first has some
stuff and is padded to 600h bytes (the actual section size can be 500h for
instance), the second is empty, at least in the file, and the third has
121h bytes - alignment not required. Assuming a default section alignment
of 1000h, the first section is expanded to 1000h bytes - 0A00h zeroes are
added to its end to fulfill the alignment requirements, the second
section is expanded to 7000h bytes and the last section to 1000h. Now we
notice that section sizes in memory can be bigger than section sizes in
the file, and this is a nice way of allocating zero-filled memory that
stays available for the whole run of the program without costing any
bytes in the file.
Win9x requires that the section alignment be no less than 1000h. Of course
both file alignment and section alignment have to be powers of two. Almost
all programs use the default alignment values - 200h and 1000h, respectively.
I wouldn't recommend using any other values than these; who knows what
Microsofters will devise in the future?
Sections must be given consecutive, ascending addresses, i.e. they must be
laid out in memory in the same order as in the file. Not sticking to this
rule may produce undesirable results.
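To see how the two alignments interact, here is a rough C sketch that lays sections out one after another and reproduces the example table. The struct, the function names and the input sizes (500h, 0 and 121h bytes in the file; 7000h requested in memory for the second section) are assumptions made for this illustration, not taken from any real loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILE_ALIGNMENT    0x200u   /* values used throughout this article */
#define SECTION_ALIGNMENT 0x1000u

struct section_layout {
    uint32_t raw_size;      /* input: bytes really present in the file */
    uint32_t virtual_size;  /* input: bytes wanted in memory */
    uint32_t file_offset;   /* output */
    uint32_t file_size;     /* output */
    uint32_t rva;           /* output */
    uint32_t mem_size;      /* output */
};

static uint32_t align_up(uint32_t size, uint32_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Places sections consecutively, in file order; returns the image size. */
uint32_t layout_sections(struct section_layout *s, size_t count,
                         uint32_t headers_size)
{
    uint32_t file_pos = align_up(headers_size, FILE_ALIGNMENT);
    uint32_t rva = align_up(headers_size, SECTION_ALIGNMENT);

    assert(count > 0);
    for (size_t i = 0; i < count; i++) {
        uint32_t wanted = s[i].virtual_size > s[i].raw_size
                        ? s[i].virtual_size : s[i].raw_size;
        s[i].file_offset = file_pos;
        /* the last section may keep an unaligned size in the file */
        s[i].file_size = (i + 1 == count)
                       ? s[i].raw_size
                       : align_up(s[i].raw_size, FILE_ALIGNMENT);
        s[i].rva = rva;
        s[i].mem_size = align_up(wanted, SECTION_ALIGNMENT);
        file_pos += s[i].file_size;
        rva += s[i].mem_size;
    }
    return rva;
}
```

Fed with the three sections of the example and any headers size up to 200h, the sketch yields file offsets 200h, 800h, 800h and RVAs 1000h, 2000h, 9000h, with an image size of 0A000h.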
You are probably wondering what the RVA found in our small example means,
and why the headers are loaded at RVA=0. RVA means Relative
Virtual Address, and it is an offset relative to Image Base - the address
at which our file is loaded. So if Image Base is 400000h, the headers
are loaded at 400000h and the first section from our example at 401000h.
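In code the relation is a single addition. A tiny sketch, with helper names invented for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: convert between RVAs and absolute addresses. */
uint32_t rva_to_va(uint32_t image_base, uint32_t rva)
{
    return image_base + rva;
}

uint32_t va_to_rva(uint32_t image_base, uint32_t va)
{
    assert(va >= image_base);   /* the address must lie inside the image */
    return va - image_base;
}
```

For instance rva_to_va(0x400000, 0x1000) is 401000h, the address of the first section in our example.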
What makes us headerache
The time has come to reveal the structure of PE headers - the aim of
this article. Now that we hopefully understand how PE files are loaded and
what they consist of, we should note that all addresses in the headers and
in the load-time portions of sections (e.g. import tables) are relative
to image base, i.e. they are RVAs. All addresses in code and data of a PE
are non-relative, i.e. fixed, unless they are relocated - provided that
the PE contains relocation tables.
Each PE has the following headers, in exactly that order:
- DOS stub,
- PE header,
- optional header,
- section headers.
If there aren't any weird things in a PE, it has some padding after the
headers, and then the sections come.
The DOS stub is a small DOS executable that usually displays some
annoying message when the user tries to run the program in DOS. This stub
is not required to exist, only the MZ header must be there. The MZ
header must consist of two bytes 'M' and 'Z' at offset 0, and a 32-bit
number at offset 3Ch, so our entire MZ header has 64 (40h) bytes. The other
bytes within the MZ header are not important - they can have any value.
The 32-bit number that ends the MZ header is a file-relative offset to
the PE header.
The PE header should be located in file at an address divisible by 8 (must
be 8-byte aligned). It can actually begin within the DOS header, using
up the unused bytes, but it is better to place it after the MZ header,
i.e. at offset 40h. We will take advantage of the extra spare bytes in
the MZ header at a later time.
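A minimal sketch of such a stripped-down MZ header, written in C only to make the layout explicit (the real thing is hand-coded in NASM). The function and macro names are made up for this example, and the sketch stores the offset in little-endian order as x86 does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MZ_HEADER_SIZE  0x40u   /* 64 bytes */
#define MZ_PE_OFFSET_AT 0x3Cu   /* where the offset of the PE header lives */

/* Build a minimal MZ header: 'M', 'Z', and the PE header offset at 3Ch.
   All other bytes are free for other uses; here they are simply zeroed. */
void build_mz_header(uint8_t header[MZ_HEADER_SIZE], uint32_t pe_offset)
{
    assert(pe_offset % 8 == 0);            /* PE header must be 8-byte aligned */
    memset(header, 0, MZ_HEADER_SIZE);
    header[0] = 'M';
    header[1] = 'Z';
    header[MZ_PE_OFFSET_AT + 0] = (uint8_t)(pe_offset);
    header[MZ_PE_OFFSET_AT + 1] = (uint8_t)(pe_offset >> 8);
    header[MZ_PE_OFFSET_AT + 2] = (uint8_t)(pe_offset >> 16);
    header[MZ_PE_OFFSET_AT + 3] = (uint8_t)(pe_offset >> 24);
}

/* Return the file offset of the PE header, or 0 if this is not an MZ file. */
uint32_t find_pe_header(const uint8_t *file, size_t size)
{
    if (size < MZ_HEADER_SIZE || file[0] != 'M' || file[1] != 'Z')
        return 0;
    return (uint32_t)file[MZ_PE_OFFSET_AT]
         | (uint32_t)file[MZ_PE_OFFSET_AT + 1] << 8
         | (uint32_t)file[MZ_PE_OFFSET_AT + 2] << 16
         | (uint32_t)file[MZ_PE_OFFSET_AT + 3] << 24;
}
```

Calling build_mz_header(buf, 0x40) gives the layout described above, with the PE header starting right after the MZ header.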
The PE header contains a bunch of numbers:
Size  | Value | Description
dword | 'PE'  | PE magic number identifying the PE header
word  | 14Ch  | Machine for which this executable is built (14Ch is 386)
word  | 1     | Number of sections - in our case it will be only one
dword | ?     | Time stamp - this can be any value
dword | ?     | Pointer to symbol table - we won't use any symbol tables
dword | 0     | Number of symbols - zero in our case
word  | X     | Size of optional header
word  | 10Fh  | Characteristics - bitflags (10Fh is 32-bit executable)
The ? values are unimportant - let's set them to 0. The size of the optional
header that comes right after the PE header has been marked as X - we
will put an appropriate expression there in the source file, so the size of
the optional header will be figured out at assembly time. We will also
do similar things later on.
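For readers who prefer to see the table as a data structure, here is a C sketch of it. The field names are invented for this example (winnt.h uses its own names), and make_pe_header is a hypothetical helper that fills in the values from the table.

```c
#include <assert.h>
#include <stdint.h>

/* The PE header from the table above; field names are illustrative only. */
struct pe_header {
    uint32_t signature;            /* 'P','E',0,0 */
    uint16_t machine;              /* 14Ch = 386 */
    uint16_t number_of_sections;
    uint32_t time_stamp;
    uint32_t symbol_table_ptr;
    uint32_t number_of_symbols;
    uint16_t optional_header_size;
    uint16_t characteristics;
};

/* All fields are naturally aligned, so no padding is expected. */
static_assert(sizeof(struct pe_header) == 24, "PE header must be 24 bytes");

struct pe_header make_pe_header(uint16_t optional_header_size)
{
    struct pe_header h = {0};          /* the "?" fields stay 0 */
    h.signature = 0x00004550u;         /* "PE\0\0" read as a little-endian dword */
    h.machine = 0x14C;
    h.number_of_sections = 1;
    h.optional_header_size = optional_header_size;
    h.characteristics = 0x10F;
    return h;
}
```

The whole header is 24 bytes, so with the MZ header ending at 40h the optional header begins at file offset 58h.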
The optional header, which is in fact NOT optional, contains more information,
specific to our executable:
Size      | Value   | Description
word      | 10Bh    | Optional header magic number
word      | ?       | Linker version - we don't care
dword     | ?       | Size of code - we could give it some real value
dword     | ?       | Size of data
dword     | ?       | Size of uninitialized data - this is usually 0
dword     | X       | Program entry point
dword     | X       | Base of code
dword     | X       | Base of data
dword     | 400000h | Image base - this is where our PE is loaded
dword     | 1000h   | Section alignment - we agree on 1000h
dword     | 200h    | File alignment - phew!
dword     | 4       | OS version - better leave it 4.00
dword     | 0       | Image version - huh?
dword     | 4       | Subsystem version
dword     | ?       | Win32 version
dword     | X       | Image size IN MEMORY
dword     | X       | Size of all headers - file offset of first section
dword     | ?       | Checksum
word      | 2       | Subsystem (2 is Win32 GUI)
word      | ?       | DLL characteristics - we have an executable, not a DLL
dword     | 100000h | Stack size
dword     | 1000h   | Stack commit
dword     | 100000h | Heap size
dword     | 1000h   | Heap commit
dword     | 0       | Loader flags
dword     | 16      | Number of directories
32 dwords |         | Directories follow
A lot of stuff! All meaningful addresses are of course RVAs. The entry point
is the RVA at which our program will start. Base of code and base of
data aren't too important, but we can set them to some valid values.
The stack and heap sizes have to be set to some useful values. Stacks are
always thread-specific, and in our case each stack will have an initial
committed size of 1000h bytes and a limit of 100000h bytes. Heaps usually
aren't used, but we can (or must) sacrifice 4KB of memory.
The table of directories found in the executable is a real pain in the a.
Each entry of this table consists of two dwords - a pointer (RVA) to
a directory, residing somewhere in some section, and the size of that directory.
The second directory is the imports directory, and this is what we want to
have. We aren't interested in any other directories, so we set all of their
corresponding entries in the table to 0s - if you want to learn more about
them, refer to the PE document from wotsit.org, or to winnt.h. In our case
there could be only two table entries, as the second of them points to our
beloved import directory. For WinNT it is OK to have only two entries, but
Win9x requires 16, and that's why we don't like it.
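The directory table and its cost can be sketched in C as follows. The names are made up for this example; the constant 96 is simply the sum of the fixed fields listed in the optional header table above.

```c
#include <assert.h>
#include <stdint.h>

/* One entry of the table of directories; names are illustrative only. */
struct data_directory {
    uint32_t rva;    /* where the directory lives, relative to image base */
    uint32_t size;   /* size of the directory in bytes */
};

#define OPTIONAL_HEADER_FIXED_SIZE 96u  /* all fields before the table */
#define IMPORT_DIRECTORY_INDEX      1u  /* the second entry, counting from 0 */

/* Size of the optional header for a given number of directory entries. */
uint32_t optional_header_size(uint32_t directory_count)
{
    return OPTIONAL_HEADER_FIXED_SIZE
         + directory_count * (uint32_t)sizeof(struct data_directory);
}

/* Zero the whole table and fill in only the import directory. */
void set_import_directory(struct data_directory *dirs, uint32_t count,
                          uint32_t rva, uint32_t size)
{
    assert(count > IMPORT_DIRECTORY_INDEX);
    for (uint32_t i = 0; i < count; i++) {
        dirs[i].rva = 0;
        dirs[i].size = 0;
    }
    dirs[IMPORT_DIRECTORY_INDEX].rva = rva;
    dirs[IMPORT_DIRECTORY_INDEX].size = size;
}
```

With 16 entries the optional header takes 224 (0E0h) bytes; the two entries that WinNT would accept need only 112.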
With the end of the table of directories the optional header ends. After it
we find the section headers. The number of section headers is given in the PE
header that precedes the optional header. Each section header has the
following structure:
Size  | Value      | Description
qword | ?          | ASCII section name - can be anything you want
dword | X          | Size in memory
dword | X          | RVA in memory (in our case it will be 1000h)
dword | X          | Size in file
dword | X          | Offset in file
dword | ?          | Pointer to relocations - we won't have any
dword | ?          | Pointer to line numbers - debug info, anyone?
word  | 0          | Number of relocations
word  | 0          | Number of line number entries
dword | 0E0000060h | Flags
There are many available flag values (from winnt.h):
Code      | Description (section contains...)
00000020h | Code
00000040h | Data
00000080h | Uninitialized data - this is fiction!!!
01000000h | Extended relocations
02000000h | Section can be discarded
04000000h | Section is not cacheable
08000000h | Section is not pageable
10000000h | Section is shareable
20000000h | Section is executable
40000000h | Section is readable
80000000h | Section is writable
There are even more flags, but many of those presented here, like most of
the others, are unimportant. For example if we choose our section to be
readable but not writable, we will still be able to write to it, even under
WinNT.
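The flags value 0E0000060h used in our section header is just a combination of five entries from the table. A short C sketch, with macro and function names chosen for this example only:

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from the table above; the names are illustrative. */
#define SEC_CODE       0x00000020u
#define SEC_DATA       0x00000040u
#define SEC_EXECUTABLE 0x20000000u
#define SEC_READABLE   0x40000000u
#define SEC_WRITABLE   0x80000000u

static_assert((SEC_CODE | SEC_DATA | SEC_EXECUTABLE |
               SEC_READABLE | SEC_WRITABLE) == 0xE0000060u,
              "flags of the single all-in-one section");

/* Flags for one section that holds code and data together. */
uint32_t single_section_flags(void)
{
    return SEC_CODE | SEC_DATA | SEC_EXECUTABLE | SEC_READABLE | SEC_WRITABLE;
}

/* Nonzero if every bit of mask is set in flags. */
int section_has(uint32_t flags, uint32_t mask)
{
    return (flags & mask) == mask;
}
```

So the single section is marked as containing both code and data, and as executable, readable and writable.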
Before we seriously go into coding, we still have to learn what an import
table looks like. An import table consists of a set of entries, each of which
corresponds to some DLL from which we import functions. Each import table
entry has the following structure:
Size  | Value | Description
dword | X     | RVA of original thunk
dword | ?     | Time stamp
dword | ?     | Forwarder chain (what???)
dword | X     | RVA of ASCIIZ DLL name
dword | X     | RVA of replaced thunk
The last entry in an import table is zero-filled, indicating the end of
the import table. The so-called thunk is a zero-terminated array of
dword pointers to imported function names. The replaced thunk is filled
by the PE loader with actual pointers to imported routines. The original
thunk remains untouched, but unlike most linkers, we can supply the same
RVA for both the replaced and the original thunk, thus including only one
thunk per imported DLL. The entries of each thunk point to ASCIIZ function
names preceded by a word value called "hint". This was originally meant to
serve as an alternate method of importing functions by indices instead of
names, but it doesn't work anyway, so we can set the hints to 0, and use them
as ASCIIZ function name terminators. Note that all function names should
be word aligned. Also, it is obvious but worth mentioning once again that
the entries of a thunk are RVAs of hints, and those hints are followed
by the actual imported function names.
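The trick with zero hints doubling as name terminators may be easier to follow in code. The sketch below is one possible reading of it, not the layout of any particular linker: the helper names are invented, and the returned offsets would still have to be turned into RVAs by adding the RVA of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one hint+name entry at *pos and return the offset of its hint.
   The zero hint of the NEXT entry terminates an even-length name, so no
   separate terminator byte is spent on it. */
size_t append_import_name(uint8_t *buf, size_t cap, size_t *pos,
                          const char *name)
{
    size_t len = strlen(name);
    size_t start = *pos;
    size_t end;

    assert(start % 2 == 0);               /* hints must be word aligned */
    assert(start + 2 + len + 1 <= cap);   /* hint, name, possible pad byte */

    buf[start] = 0;                       /* hint = 0 */
    buf[start + 1] = 0;
    memcpy(buf + start + 2, name, len);
    end = start + 2 + len;
    if (end % 2 != 0)
        buf[end++] = 0;   /* odd-length name: one zero terminates and realigns */
    *pos = end;
    return start;
}

/* Terminate the last name if nothing has terminated it yet. */
void finish_import_names(uint8_t *buf, size_t cap, size_t *pos)
{
    if (*pos > 0 && buf[*pos - 1] != 0) {
        assert(*pos < cap);
        buf[(*pos)++] = 0;
    }
}
```

Packing "RegisterClassA" and then "ShowWindow" this way puts their hints at offsets 0 and 16 and spends no terminator byte between the two names.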
Tips and tricks
As indicated above, we want to create only one single section in our PE,
so we won't lose any bytes on file alignment (i.e. alignment of sections
within the file). Needless to say, we don't care about any export tables,
resources, symbol tables and other weird things that can reside in a PE.
Unfortunately we meet two serious Win9x limitations: we must use file
alignment 200h, which makes our headers occupy 512 bytes, and what's more
we have to include 16 directories, 14 of which are unused - an unnecessary
loss of 14*8=112 bytes. As the headers are loaded into memory at the image
base, we can fill their unused parts with useful data, such as imported
function names, for example. The spare places we get, after getting rid
of the DOS stub and leaving only the 64-byte MZ header, are the unused parts
of this MZ header (58 bytes) plus the padding bytes after the section
headers (160 bytes).
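The 58 and 160 bytes follow directly from the header sizes given earlier. A small C sketch of the arithmetic, with names made up for this example:

```c
#include <assert.h>

enum {
    FILE_ALIGNMENT       = 0x200,
    MZ_HEADER_SIZE       = 64,
    PE_HEADER_SIZE       = 24,          /* magic + 20 bytes of fields */
    OPTIONAL_HEADER_SIZE = 96 + 16 * 8, /* fixed part + 16 directories */
    SECTION_HEADER_SIZE  = 40
};

/* Bytes really occupied by all headers. */
int headers_used(int section_count)
{
    assert(section_count >= 1);
    return MZ_HEADER_SIZE + PE_HEADER_SIZE + OPTIONAL_HEADER_SIZE
         + section_count * SECTION_HEADER_SIZE;
}

/* Padding between the last section header and the first section. */
int header_padding(int section_count)
{
    int used = headers_used(section_count);
    int aligned = (used + FILE_ALIGNMENT - 1) / FILE_ALIGNMENT * FILE_ALIGNMENT;
    return aligned - used;
}

/* MZ header bytes other than 'MZ' and the dword at 3Ch. */
int mz_spare_bytes(void)
{
    return MZ_HEADER_SIZE - 2 - 4;
}
```

With one section the headers use 352 bytes, which leaves 512 - 352 = 160 bytes of padding, plus 64 - 2 - 4 = 58 spare bytes inside the MZ header.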
The code
The program we want to create will be a skeleton for a 4K intro. Of course
it would be even better to make a simple compressor for our code, but this
is rather a topic for Dario Phong.
In C, the program would look as follows:
    WNDCLASS WindowClass; // = { ... };
    void main() // NOT WinMain()
    {
        RegisterClass( &WindowClass );
        HWND hWnd = CreateWindowEx( 0, &ClassName, &ClassName,
                        WS_OVERLAPPEDWINDOW, CW_USEDEFAULT, 0,
                        CW_USEDEFAULT, 0, 0, 0, (HINSTANCE)0x400000, 0 );
        ShowWindow( hWnd, SW_SHOW );
        UpdateWindow( hWnd );
        LPDIRECTDRAW lpDD;
        DirectDrawCreate( 0, &lpDD, 0 );
        lpDD->lpVtbl->SetCooperativeLevel( lpDD, hWnd,
                        DDSCL_EXCLUSIVE | DDSCL_FULLSCREEN );
        // SetDisplayMode returns DD_OK (0) on success
        if ( lpDD->lpVtbl->SetDisplayMode( lpDD, 640, 480, 32 ) == DD_OK )
            for (;;) {
                WaitMessage(); // Replace this with your frame rendering
                MSG msg;
                if ( ! PeekMessage( &msg, 0, 0, 0, PM_REMOVE ) )
                    continue;
                if ( msg.message == WM_QUIT )
                    break;
                DefWindowProc( msg.hwnd, msg.message,
                               msg.wParam, msg.lParam );
            }
        lpDD->lpVtbl->Release( lpDD );
    }
    LRESULT CALLBACK WndProc ( HWND hwnd, UINT uMsg,
                               WPARAM wParam, LPARAM lParam )
    {
        if ( uMsg != WM_DESTROY ) // in assembly: a plain jump to DefWindowProc
            return DefWindowProc( hwnd, uMsg, wParam, lParam );
        PostQuitMessage( 0 );
        return 0;
    }
This is a minimal program that switches into 640x480x32bpp mode and
remains in it until the user presses Alt+F4. Note that I didn't include
any surface-creation code here; creating a primary DirectDraw surface is a
must if one wants to display anything. We could also add some code for
handling the Esc key to the main() function, like:
    if ( msg.message == WM_KEYDOWN &&
         msg.wParam == VK_ESCAPE )
        DestroyWindow( msg.hwnd ); // CloseWindow() would only minimize it
We actually can afford importing one more function, since we put the
import strings in the PE header's padding areas. You may take a different
approach on your own, but consider that the import function names are
needed only at load time, and we actually don't know what happens with
the headers while the program runs.
Implementation
Whether you know it or not, the Win32 logic assumes that all significant
calls through the system leave registers ebx, esi, edi and ebp untouched.
This concerns not only imported routines we call, but also callbacks
supplied by us, such as WndProc(). The result is always returned in eax
or in the edx:eax pair, and eax, ecx and edx may be destroyed. The standard
Win32 calling convention is stdcall (arguments pushed in reverse order and
removed by the callee); an exception is the function wsprintf, which has the
C calling convention, since it takes a variable number of arguments.
Many standard Win32 functions come in two versions: ANSI and UNICODE.
This concerns mainly routines that take string arguments, for example
CreateWindowEx. The actual names of this function are CreateWindowExA
for ANSI and CreateWindowExW for UNICODE. The UNICODE versions are fully
implemented only on WinNT, so CreateWindowExA is the name we will use.
(Note that there actually is no CreateWindow function, as it is re-defined
to CreateWindowExA/W in winuser.h.)
DirectX and other COM-style calls are also simple. It is enough to
notice the difference between calling them from C++ and C:
    lpDD->SetDisplayMode( 640, 480, 32 );                 // In C++
    lpDD->lpVtbl->SetDisplayMode( lpDD, 640, 480, 32 );   // In C
Since lpVtbl is located at offset 0 of the object, lpDD effectively points
to a pointer to the vtable, so conceptually the latter is:
    (*lpDD)->SetDisplayMode( lpDD, 640, 480, 32 );
And this is exactly what we are doing in assembly.
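To show what "the vtable pointer sits at offset 0" means without dragging in the DirectX headers, here is a self-contained C mock. Nothing in it is real DirectDraw: struct display, its vtable and the mock_ functions are stand-ins invented for this sketch, and only the calling pattern mirrors the COM one.

```c
#include <assert.h>

struct display;   /* hypothetical stand-in for a COM object */

struct display_vtbl {
    int (*SetDisplayMode)(struct display *self, int width, int height, int bpp);
    int (*Release)(struct display *self);
};

struct display {
    const struct display_vtbl *lpVtbl;   /* must be the first member */
    int width, height, bpp;
};

static int mock_set_display_mode(struct display *self, int w, int h, int bpp)
{
    self->width = w;
    self->height = h;
    self->bpp = bpp;
    return 0;                            /* 0 stands for success here */
}

static int mock_release(struct display *self)
{
    (void)self;
    return 0;
}

const struct display_vtbl mock_vtbl = { mock_set_display_mode, mock_release };

/* The usual C way: go through the named vtable member. */
int call_via_member(struct display *obj)
{
    return obj->lpVtbl->SetDisplayMode(obj, 640, 480, 32);
}

/* The "assembly" way: treat the object as a pointer to the vtable pointer. */
int call_via_offset0(struct display *obj)
{
    const struct display_vtbl *const *vtbl_at_0 =
        (const struct display_vtbl *const *)obj;
    assert((const void *)&obj->lpVtbl == (const void *)obj);
    return (*vtbl_at_0)->SetDisplayMode(obj, 640, 480, 32);
}
```

Both functions end up calling the same routine. The second one corresponds to what the assembly does: load the dword at the object's address, then call through the table it points to, passing the object itself as the first argument.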
For the purpose of our example skeleton, we use registers for storing
common values, like 0 or hWnd. We also keep a pointer to MSG structure
in a register. Because the thunks will lie near this structure, we will
also use addresses of imported function pointers relative to it, gaining
three bytes on each call to an imported routine.
In case you want to create a hardware accelerated 4KB intro: it is possible
to do this with OpenGL, but you probably couldn't fit many effects, since
the import tables would take many precious bytes. A better approach is to
use Direct3D - you do not need any imports beyond the ones used in
our example, and all calls to Direct3D are done through COM.
Final words
There isn't much to say. You should find the source of the example in
the bonus pack. Have fun coding small proggies, and I hope this article
helped you in that matter.
Chris Dragan