Assembly Transformations for Binary Compression
Introduction
Since the introduction of memory-managed programming languages, the tradeoff between speed and safety (or rather the conveniences of a high-level language) has often been the center of attention when deciding the “correct” language for the next piece of software. Some modern languages such as C++, and especially Rust, have worked towards closing that gap with their “zero-cost abstractions.” However, one of the less noted tradeoffs with these languages is the size of the executable image they produce. Languages such as Java, Python, or any .NET language obviously depend on some runtime environment that must be installed to run their programs. But the same happens with any language that has a dependency on libc, for example. With Rust this has become particularly noticeable, with a Stack Overflow user reporting their debug build of “Hello, World!” being 3MB (3 million bytes!). There are many reasons why their debug build is so large, and there’s an entire GitHub page dedicated to mitigating the size. But how small can we go?
This isn’t exactly a new question, especially to the demoscene. For many years, authors have been showing off visual/audio demos, often restricted to 64K or 4K bytes in size, for various platforms. Often this is done with the help of an executable packing tool. Packers compress the original program and attach a tiny “stub,” a small program (often written in assembly) that decompresses and executes the original image with no difference to the end user. UPX is the most famous example of an executable packer, boasting a 50%-70% reduction in size.
This article isn’t going to focus on how packers or stubs work, but rather on some of the obscure methods I have personally come across to increase the compression ratio of executable images before they are compressed by a context-modeling or LZ-based algorithm, along with my take on a modern implementation of them in Rust.
E8E9 Filter
Originally introduced in Windows Cabinet files, the BCJ algorithm, also known as the E8E9 transform/filter, finds jump and call instructions that are encoded with relative addresses (a “displacement” from the instruction pointer) and converts the target address to its absolute counterpart.
Consider the following snippet of a FizzBuzz program in x86:
div_by_3_5: ; print fizz buzz
mov eax, 4 ; syscall number
mov ebx, 1 ; stdout
mov ecx, fizzbuzz ; ptr to buffer
mov edx, 11 ; buffer size
int 0x80
jmp dword cont
div_by_3: ; print fizz
mov eax, 4 ; syscall number
mov ebx, 1 ; stdout
mov ecx, fizz ; ptr to buffer
mov edx, 6 ; buffer size
int 0x80
jmp dword cont
div_by_5: ; print buzz
mov eax, 4 ; syscall number
mov ebx, 1 ; stdout
mov ecx, buzz ; ptr to buffer
mov edx, 6 ; buffer size
int 0x80
jmp dword cont
cont:
pop ecx ; restore loop counter
cmp ecx, 100
jz exit ; end of loop
inc ecx ; increment loop counter
jmp loop_main
; Here I am using `jmp dword`, to force the instruction to be encoded as a "near
; jump" (E9 cd, JMP rel32), rather than a short jump (EB cb JMP rel8).
section .data
fizz db "Fizz", 0Ah, 0Dh
buzz db "Buzz", 0Ah, 0Dh
fizzbuzz db "Fizz Buzz", 0Ah, 0Dh
After assembling and linking, we can use objdump to see what all of the jmp instructions look like once encoded:
nasm -f elf32 -o fizzbuzz.o fizzbuzz.nasm
ld -m elf_i386 -o fizzbuzz fizzbuzz.o
objdump -d fizzbuzz | grep "jmp.*cont"
...
8049046: e9 51 00 00 00 jmp 804909c <cont>
8049061: e9 36 00 00 00 jmp 804909c <cont>
804907c: e9 1b 00 00 00 jmp 804909c <cont>
8049097: e9 00 00 00 00 jmp 804909c <cont>
Even though we are jumping to the same location, because the jumps are encoded as “Jump near, relative, RIP = RIP + 32-bit displacement sign extended to 64-bits”, the resulting bytes are slightly different for every jump (e.g. the first displacement is 0x804909c - (0x8049046 + 5) = 0x51). We can increase the redundancy in the file by converting each jump from its displacement (an offset from the instruction pointer) to an absolute form (an offset from the beginning of the file), so that the binary ends up looking a little more like this:
8049046: e9 80 49 09 c0 jmp 804909c <cont>
8049061: e9 80 49 09 c0 jmp 804909c <cont>
804907c: e9 80 49 09 c0 jmp 804909c <cont>
8049097: e9 80 49 09 c0 jmp 804909c <cont>
Matt Mahoney’s ZPAQ contains a good reference implementation of this algorithm in C:
It scans the input block backward for 5 byte strings of the form {E8|E9 xx xx xx 00|FF} and adds the offset from the start of the block to the middle 3 bytes interpreted as a little-endian (LSB first) number (mod 2^24). E8 and E9 are the CALL and JMP instructions, followed by a 32 bit relative offset.[1]
// E8E9 transform of buf[0..n-1] to improve compression of .exe and .dll.
// Patterns (E8 | E9 xx xx xx 00 | FF) at offset i replace the 3 middle
// bytes with x+i mod 2^24, LSB first, reading backward.
void e8e9(unsigned char* buf, int n) {
for (int i = n-5; i >= 0; --i) {
if (((buf[i] & 254) == 0xe8) && ((buf[i + 4] + 1) & 254) == 0) {
unsigned a = (buf[i + 1] | buf[i + 2] << 8 | buf [i + 3] << 16) + i;
buf[i + 1] = a;
buf[i + 2] = a >> 8;
buf[i + 3] = a >> 16;
}
}
}
The implementation is pretty straightforward, with a few notable optimizations. The first is the check (buf[i] & 254) == 0xe8: since 0xE8 and 0xE9 differ only in their lowest bit, masking that bit off lets a single comparison match both opcodes, for example:
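0xE8 = 1110 1000 (CALL rel32)
0xE9 = 1110 1001 (JMP rel32)
Only the lowest bit differs, so (buf[i] & 254) == 0xE8 accepts exactly these two opcodes.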
This is trivial for modern compilers to do for you, and I opted to omit it for clarity in my implementation.
The second check, for 0x00 or 0xFF, does in fact produce two cmp instructions when implemented as buf[i + 4] == 0xFF || buf[i + 4] == 0x00 rather than relying on the wrapping add, so I decided to keep that trick in my implementation to reduce the final assembly size.
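To illustrate the equivalence (a small sketch of my own, not part of the filter), both of the following accept exactly the bytes 0x00 and 0xFF:

fn is_00_or_ff_naive(e: u8) -> bool {
    e == 0x00 || e == 0xFF
}

// wrapping_add(1) maps 0xFF to 0x00 and 0x00 to 0x01, so one masked compare suffices.
fn is_00_or_ff_wrapping(e: u8) -> bool {
    e.wrapping_add(1) & 254 == 0
}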
Finally, here is my take on the implementation in Rust:
// The lending iterator trait and WindowsMut structure are just a fancy way of
// being able to iterate over the input buffer without manually indexing it
// C-style. I wouldn't include all of this overly complex boilerplate in
// production software, where simplicity has a much higher value. It is,
// however, an example of a (close to) "zero-cost abstraction" that I felt
// would be interesting to include.
trait LendingIterator {
type Item<'a>
where
Self: 'a;
fn next<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)>;
fn prev<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)>;
}
pub struct WindowsMut<'window, const WINDOW_LEN: usize, T> {
slice: &'window mut [T],
index: usize,
}
impl<'window, const WINDOW_LEN: usize, T> WindowsMut<'window, WINDOW_LEN, T> {
fn new(slice: &'window mut [T]) -> Self {
Self { slice, index: 0 }
}
fn reverse(self) -> Self {
    // Start one past the last full window so that `prev` can visit every
    // window, down to and including the one at index 0.
    let index = self.slice.len().saturating_sub(WINDOW_LEN - 1);
    Self {
        slice: self.slice,
        index,
    }
}
}
impl<'window, const WINDOW_LEN: usize, T> LendingIterator for WindowsMut<'window, WINDOW_LEN, T> {
type Item<'a> = &'a mut [T; WINDOW_LEN] where Self: 'a;
fn next<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)> {
let Self { slice, index } = self;
let ans = slice[*index..].first_chunk_mut();
let res = ans.map(|chunk| (*index, chunk));
*index += 1;
res
}
fn prev<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)> {
    let Self { slice, index } = self;
    // `index` points one past the next window to hand out, so the window
    // at index 0 is still visited on the final call.
    if *index == 0 {
        return None;
    }
    *index -= 1;
    slice[*index..]
        .first_chunk_mut()
        .map(|chunk| (*index, chunk))
}
}
// This modifies the buffer in place, rather than requiring the user to make a
// separate allocation.
pub fn e8e9(bytes: &mut [u8]) {
let mut iter = WindowsMut::new(bytes);
while let Some((i, &mut [a, ref mut b, ref mut c, ref mut d, e])) = iter.next() {
// We must use e.wrapping_add(1) here rather than e + 1, because overflow
// panics in Debug builds, and we are relying on the value wrapping around.
if (a == 0xE8 || a == 0xE9) && (e.wrapping_add(1) & 254 == 0) {
// u32::from_le_bytes produces a better assembly output over '*b
// as u32 | ((*c as u32) << 8) | ((*d as u32) << 16)', and is nicer
// to read.
let displacement = u32::from_le_bytes([*b, *c, *d, 0]);
let offset = displacement.wrapping_add(i as u32);
*b = (offset >> 16) as u8;
*c = (offset >> 8) as u8;
*d = offset as u8;
}
}
}
And inversely:
pub fn e8e9_inverse(bytes: &mut [u8]) {
let mut iter = WindowsMut::new(bytes).reverse();
while let Some((i, &mut [a, ref mut b, ref mut c, ref mut d, e])) = iter.prev() {
// We must use e.wrapping_add(1) here rather than e + 1, because overflow
// panics in Debug builds, and we are relying on the value wrapping around.
if (a == 0xE8 || a == 0xE9) && (e.wrapping_add(1) & 254 == 0) {
// u32::from_le_bytes produces a better assembly output over '*d
// as u32 | ((*c as u32) << 8) | ((*b as u32) << 16)', and is nicer
// to read.
let displacement = u32::from_le_bytes([*d, *c, *b, 0]);
let offset = displacement.wrapping_sub(i as u32);
*b = offset as u8;
*c = (offset >> 8) as u8;
*d = (offset >> 16) as u8;
}
}
}
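A quick round-trip check (a test of my own, using a hand-written E9 rel32 jump followed by NOP padding) confirms the two functions undo each other:

#[test]
fn e8e9_round_trip() {
    let original = vec![0xE9, 0x51, 0x00, 0x00, 0x00, 0x90, 0x90, 0x90, 0x90, 0x90];
    let mut buf = original.clone();
    e8e9(&mut buf);
    assert_ne!(buf, original); // the displacement bytes were rewritten
    e8e9_inverse(&mut buf);
    assert_eq!(buf, original); // and restored exactly
}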
Despite using the lending-iterator pattern to build a mutable sliding window, and the safety guarantees of the Rust version, the compiled output is only marginally bigger than its C counterpart:
; rustc 1.82.0 with -Copt-level=3
example::e8e9::h2b129b97de9d3e96:
push rbx
mov eax, 1
mov rcx, rsi
jmp .LBB0_1
.LBB0_4:
lea rax, [rdx + 1]
dec rdx
dec rcx
cmp rdx, rsi
jae .LBB0_5
.LBB0_1:
cmp rcx, 5
jb .LBB0_7
mov rdx, rax
movzx eax, byte ptr [rdi + rax - 1]
mov r8d, eax
and r8b, -2
cmp r8b, -24
jne .LBB0_4
movzx r8d, byte ptr [rdi + rdx + 3]
inc r8b
cmp r8b, 2
jae .LBB0_4
movzx ebx, word ptr [rdi + rdx]
shl ebx, 8
or ebx, eax
add ebx, edx
add eax, edx
mov byte ptr [rdi + rdx], al
mov byte ptr [rdi + rdx + 1], bh
shr ebx, 16
mov byte ptr [rdi + rdx + 2], bl
jmp .LBB0_4
.LBB0_7:
pop rbx
ret
; C Version
e8e9:
add esi, -5
.LBB0_1:
test esi, esi
js .LBB0_6
mov eax, esi
mov cl, byte ptr [rdi + rax]
and cl, -2
cmp cl, -24
jne .LBB0_5
mov cl, byte ptr [rdi + rax + 4]
inc cl
cmp cl, 1
ja .LBB0_5
movzx edx, word ptr [rdi + rax + 1]
movzx ecx, byte ptr [rdi + rax + 3]
shl ecx, 16
add edx, esi
add ecx, edx
mov byte ptr [rdi + rax + 1], dl
mov byte ptr [rdi + rax + 2], ch
shr ecx, 16
mov byte ptr [rdi + rax + 3], cl
.LBB0_5:
dec esi
jmp .LBB0_1
.LBB0_6:
ret
The Results
To test the transform, we can create a new Rust program with cargo new e8e9-test, add a dependency on lz4_flex with cargo add lz4_flex while in the e8e9-test directory, and then add the following code to main.rs:
fn main() {
// Get the input argument from the User
let args: Vec<_> = std::env::args().collect();
assert!(args.len() == 2, "Bad usage");
// Read the input binary into a vector
let mut bytes = std::fs::read(&args[1]).expect("Unable to read file");
// Compress the image using LZ4 without the transform
let without_e8e9 = lz4_flex::compress(&bytes);
println!(
"Without e8e9:\t{} -> {} ({:.2}%)",
bytes.len(),
without_e8e9.len(),
(without_e8e9.len() as f32 / bytes.len() as f32) * 100.0,
);
// Apply the transform
e8e9(&mut bytes);
// Try the compression a second time
let with_e8e9 = lz4_flex::compress(&bytes);
println!(
"With e8e9:\t{} -> {} ({:.2}%)",
bytes.len(),
with_e8e9.len(),
(with_e8e9.len() as f32 / bytes.len() as f32) * 100.0,
);
// Report the compression ratios
println!(
"\t\t{} -> {} = {} ({:.2}%) smaller",
without_e8e9.len(),
with_e8e9.len(),
without_e8e9.len() - with_e8e9.len(),
100.0 - ((with_e8e9.len() as f32 / without_e8e9.len() as f32) * 100.0),
);
}
As you can see, we don’t need to parse the executable image for its text section (where the code is), due to the improbability of the E8E9 pattern appearing in random data. If the probability of any particular byte value appearing is 1/256, the probability of the pattern matching at a given offset in random data is (2/256) × (2/256) = 1/16384, since only the first byte (0xE8 or 0xE9) and the fifth byte (0x00 or 0xFF) are constrained. Furthermore, Matt Mahoney points out that 0xE8 and 0xE9 are out of range for ASCII-encoded data, and 0xFF is invalid for UTF-8 encoded data.[2]
Finally, place any executable image (PE, ELF, etc.) in the directory with your Cargo.toml and run cargo run -- filename. I chose nvidia-pcc.exe just due to its sheer size.
cargo run -- nvidia-pcc.exe
Without e8e9: 25312776 -> 13390424 (52.90%)
With e8e9: 25312776 -> 12674656 (50.07%)
13390424 -> 12674656 = 715768 bytes (5.35%) smaller
We can see that this transformation saved another 715,768 bytes in our final compressed buffer, which is not bad for such a simple and fast algorithm. Personally, this is enough for me - shaving off any more bytes isn’t worth it … something something diminishing returns. Maybe I only say this because I spent the last couple of days debugging the next one. Either way, it doesn’t get any simpler.
Split-Stream Encoding
Fabian “ryg” Giesen’s executable packer, kkrunchy, uses a slightly more involved approach to compressing x86. Originally based on a Microsoft Research technical report titled PPMexe, split-stream encoding separates the components of each instruction and groups them into “streams” based on their function. In the following example, each instruction’s bytes are split among the streams named in the legend beneath the listing:
e8 6a 13 02 00 call CopyMem4
8b 4b 1c mov ecx, dword ptr [ebx+1Ch]
83 c4 0c add esp, 0Ch
8b c5 mov eax, ebp
d1 e8 shr eax, 1
8b f5 mov esi, ebp
8d 14 40 lea edx, dword ptr [eax+eax*2]
8b 44 24 1c mov eax, dword ptr [esp+1Ch]
83 e6 01 and esi, 1
8d 3c 96 lea edi, dword ptr [esi+edx*4]
39 44 b9 1c cmp dword ptr [ecx+edi*4+1ch], eax
75 77 jne short L15176
Opcode + ModRM — Jump Offset — Displacement — SIB — Immediate
(Source: https://www.farbrausch.de/~fg/seminars/workcompression_download.pdf)
The idea is that the components of instructions are more alike (read: repetitive) when grouped this way than when they are laid out in the sequence of a regular instruction. Notice how the Displacement stream now contains three repeated 0x1c bytes, in contrast to the original byte sequence in the left-hand column, where they sit several instructions apart from each other.
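Splitting the listing above by hand (my own grouping; kkrunchy’s real filter further splits byte displacements per base register and gives call targets their own streams) yields roughly:

Opcode + ModRM: e8  8b 4b  83 c4  8b c5  d1 e8  8b f5  8d 14  8b 44  83 e6  8d 3c  39 44  75
SIB:            40  24  96  b9
Displacement:   1c  1c  1c
Immediate:      0c  01
Jump offset:    6a 13 02 00  77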
This does require a disassembler implementation (a fairly complex problem for CISC instruction sets) in order to separate the instructions into their respective streams. Fortunately, the disassembler doesn’t actually have to be very accurate. When x86 is incorrectly disassembled, it tends to resynchronize itself shortly after. This can be observed in disassemblers such as IDA Pro or Ghidra when bytes of data (rather than code) end up in the .text section of an executable image. Dougall J has a more extensive article that goes into deeper detail on the subject.
This transform does come at an implementation cost, in terms of both complexity and the actual size of the compiled decompression routine, but it can increase the compression of an executable by up to ~20%, approximately 10% higher than the E8E9 transform.[3]
Although I can’t fit the entire implementation in the article, I tried to document some of the main ideas in the comments. The following is based on the disfilter source, which is described in more depth here. Aside from just splitting the instructions into separate streams, it uses a small cache of recent call targets (a move-to-front table). Whenever a call is encountered, its target is added to the cache; whenever the target already exists in the cache, we can just write a 1-byte index into the cache rather than the full 4-byte displacement.
Without further delay, here is my main transform routine:
// Function Table implementation omitted for brevity...
const FN_TABLE_LEN: usize = 256;
#[repr(transparent)]
pub struct FunctionCache {
table: [u32; FN_TABLE_LEN],
}
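// A minimal sketch (my own illustration; the article omits the real table
// code) of the cache operations used below. The important property is that
// the encoder and decoder update the table identically, so the 1-byte
// indices they emit and consume stay in sync.
impl FunctionCache {
    pub fn new() -> Self {
        Self { table: [0; FN_TABLE_LEN] }
    }
    // Look up `target`: on a hit, move it to the front and return its old
    // slot; on a miss, insert it at the front and return None.
    pub fn find(&mut self, target: u32) -> Option<u8> {
        if let Some(i) = self.table.iter().position(|&t| t == target) {
            self.move_to_front(i as u8, target);
            Some(i as u8)
        } else {
            self.add(target);
            None
        }
    }
    // Insert `target` at the front, pushing every other entry back one slot.
    pub fn add(&mut self, target: u32) {
        self.table.copy_within(0..FN_TABLE_LEN - 1, 1);
        self.table[0] = target;
    }
    // Move the entry in slot `index` back to the front and return its value.
    pub fn move_to_front(&mut self, index: u8, target: u32) -> u32 {
        self.table.copy_within(0..index as usize, 1);
        self.table[0] = target;
        target
    }
}
impl core::ops::Index<u8> for FunctionCache {
    type Output = u32;
    fn index(&self, i: u8) -> &u32 {
        &self.table[i as usize]
    }
}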
struct SplitStreamCtx {
buffer: InstructionStream,
fn_table: FunctionCache,
next_is_func: bool,
code_start: usize,
code_end: usize,
}
impl SplitStreamCtx {
pub fn process_instr(&mut self, instr: &[u8], memory: usize) -> (bool, usize) {
let mut instr = instr;
// If there is a jump with N entries
if let Some(entries) = self.detect_jmp_table(instr, memory) {
// For every entry in the jump table
let mut remaining = entries;
while remaining > 0 {
// Jump table counts are encoded as u8's, so they are 256
// entries long per Jump Table code.
let count = if remaining < 256 { remaining } else { 256 };
// Emit the jump table opcode, followed the count.
self.buffer.put::<u8>(Stream::Op, Op::JUMPTAB);
self.buffer
.put::<u8>(Stream::ST_JUMPTBL_COUNT, (count - 1) as u8);
for _ in 0..count {
// Retrieve the target function that the entry leads to
let target = Self::fetch::<u32>(&mut instr);
// If our function exists in the table
if let Some(index) = self.fn_table.find(target) {
self.buffer.put(Stream::CallIdx, (index + 1) as u8);
} else {
// The function was added to the start of the table
self.buffer.put::<u8>(Stream::CallIdx, 0);
self.buffer.put::<u32>(Stream::Call32, target);
}
}
remaining -= count;
}
// The next instruction is a function, and each entry is
// 4 bytes long.
return (true, entries as usize * 4);
}
// Save where we start parsing the instruction to calculate how many
// bytes we have consumed by the end of the routine. Then save the
// original OpCode for emitting, in case where we are unable to decode
// the full instruction.
let start = instr.as_ptr() as usize;
let start_op = Self::fetch::<u8>(&mut instr);
// Fetch the instruction's OpCode.
let mut code = start_op;
// MSVC pads the space between functions with int3.
if self.next_is_func && code != Op::OP_INT3 {
self.fn_table.add(memory as u32);
self.next_is_func = false;
}
let o16 = if code == Op::OP_OSIZE {
code = Self::fetch::<u8>(&mut instr);
true
} else {
false
};
let (code2, mut flags) = if code == Op::OP_2BYTE {
let code2 = Self::fetch::<u8>(&mut instr);
(code2 as i32, TABLE_2[code2 as usize] as i32)
} else {
(0 as i32, TABLE_1[code as usize] as i32)
};
if matches!(code, Op::OP_RETNI | Op::OP_RETN | Op::OP_INT3) {
// Return, function is going to start next.
self.next_is_func = true;
}
if flags as u8 == F::MEXTRA {
flags = TABLE_X
[(((instr[0] >> 3) & 7) | ((code & 0x01) << 3) | ((code & 0x08) << 1)) as usize]
as _;
}
if flags as u8 != F::ERR {
if o16 {
self.buffer.put::<u8>(Stream::Op, Op::OP_OSIZE);
}
self.buffer.put::<u8>(Stream::Op, code);
if code == Op::OP_2BYTE {
self.buffer.put::<u8>(Stream::Op, code2 as _);
}
if matches!(code, Op::OP_CALLF | Op::OP_JMPF | Op::OP_ENTER) {
// far call/jump have a *48-bit* immediate address. we deal with
// it here by copying the segment index manually and encoding
// the rest as a normal 32-bit direct address. similarly, enter
// has a word operand and a byte operand. again, we code the
// word here, and deal with the byte later during the normal
// flow.
self.copy::<u16>(Stream::Imm16, &mut instr);
}
if flags as u8 & F::MODE == F::MR {
let modrm = self.copy::<u8>(Stream::MODRM, &mut instr);
let sib = if (modrm & 0x07 == 4) && (modrm < 0xc0) {
self.copy::<u8>(Stream::Sib, &mut instr)
} else {
0
};
// Register + Byte Displacement
if modrm & 0xc0 == 0x40 {
let stream = Stream::ST_DISP8_R0 as u8 + (modrm & 0x07);
self.copy::<u8>(stream.try_into().unwrap(), &mut instr);
}
// Register + Dword Displacement
if (modrm & 0xc0 == 0x80)
|| (modrm & 0xc7 == 0x05)
|| ((modrm < 0x40) && (sib & 0x07 == 5))
{
let stream = if modrm & 0xc7 == 0x05 {
Stream::Addr32
} else {
Stream::Disp32
};
self.copy::<u32>(stream, &mut instr);
}
}
if flags as u8 & F::MODE == F::AM {
match flags as u8 & F::TYPE {
F::AD => {
self.copy::<u32>(Stream::Addr32, &mut instr);
}
F::DA => {
self.copy::<u32>(Stream::AJump32, &mut instr);
}
F::BR => {
self.copy::<u8>(Stream::Jump8, &mut instr);
}
F::DR => {
let mut target = Self::fetch::<u32>(&mut instr);
target = target.wrapping_add(
((instr.as_ptr() as usize - start) as u32) + memory as u32,
);
if code != Op::OP_CALLN {
self.buffer.put::<u32>(Stream::Jump32, target);
} else {
// If our function exists in the table
if let Some(index) = self.fn_table.find(target) {
self.buffer.put::<u8>(Stream::CallIdx, index + 1);
} else {
// The function was added to the start of the table
self.buffer.put::<u8>(Stream::CallIdx, 0);
self.buffer.put::<u32>(Stream::Call32, target);
}
}
}
_ => {}
}
} else {
match flags as u8 & F::TYPE {
F::BI => {
self.copy::<u8>(Stream::Imm8, &mut instr);
}
F::WI => {
self.copy::<u16>(Stream::Imm16, &mut instr);
}
F::DI => {
if !o16 {
self.copy::<u32>(Stream::Imm32, &mut instr);
} else {
self.copy::<u16>(Stream::Imm16, &mut instr);
}
}
_ => {}
}
}
let bytes = instr.as_ptr() as usize - start;
return (false, bytes);
}
// We were unable to decode the instruction.
else {
// Emit the escape code, followed by the unrecognized OpCode.
self.buffer.put::<u8>(Stream::Op, Op::ESCAPE);
self.buffer.put::<u8>(Stream::Op, start_op);
return (false, 1);
}
}
fn detect_jmp_table(&self, instr: &[u8], addr: usize) -> Option<u32> {
assert!(addr < self.code_end);
let max = (self.code_end - addr) / 4;
let mut count = 0;
while count < max {
let coded_addr = Self::load::<u32>(&instr[(count * 4)..]);
if coded_addr >= self.code_start as _ && coded_addr < (self.code_end as _) {
count += 1;
} else {
break;
}
}
// If there are less than 3 entries, it's probably not a jump table.
(count >= 3).then_some(count as u32)
}
}
pub fn split_stream(src: &[u8], origin: usize) -> Vec<u8> {
let mut ctx = SplitStreamCtx::new(origin, origin + src.len());
// Process all instructions, until there are less than 15 bytes (the
// maximum size of an x86 instruction).
let mut pos = 0;
while pos < src.len() - MAX_INSTR_LEN {
let (was_jump, bytes) = ctx.process_instr(&src[pos..], origin + pos);
assert!(was_jump || bytes <= MAX_INSTR_LEN);
pos += bytes;
}
// for the last few bytes, be very careful not to read past the end
// of the input instruction stream. create a check point on every
// instruction; if process_instr would've read past the end of the input
// stream, we undo the last step.
while pos < src.len() {
// Copy remaining bytes into buffer
let mut instr_buf = [0u8; MAX_INSTR_LEN];
instr_buf[..src.len() - pos].copy_from_slice(&src[pos..]);
// Save current output size for all streams
let check_point: [usize; Stream::ST_MAX] = ctx
.buffer
.iter()
.map(|buf| buf.len())
.collect::<Vec<usize>>()
.try_into()
.unwrap();
// Process the instruction
let (was_jmp, bytes) = ctx.process_instr(&instr_buf, origin + pos);
assert!(was_jmp || bytes <= MAX_INSTR_LEN);
// If it was a valid instruction
if pos + bytes <= src.len() {
pos += bytes;
} else {
// We read past the end, restore the checkpoint
ctx.buffer
.iter_mut()
.enumerate()
.for_each(|(i, buf)| buf.truncate(check_point[i]));
break;
}
}
// If there are still bytes left, encode them as escapes
while pos < src.len() {
ctx.buffer.put::<u8>(Stream::Op, Op::ESCAPE);
ctx.buffer.put::<u8>(Stream::Op, src[pos]);
pos += 1;
}
// Generate the final output
ctx.flush()
}
And inversely:
#[repr(transparent)]
pub struct BorrowedInstructionStream<'a> {
// Stream::ST_MAX is the number of separate streams the instructions are
// grouped into (19).
buffer: [&'a [u8]; Stream::ST_MAX],
}
// BorrowedInstructionStream implementation omitted for brevity ...
pub fn inverse_split_stream(src: &[u8], dst: &mut [u8], mem_start: u32) -> bool {
// Create a binding that will be "consumed" or incremented, as we write to the destination.
let mut dst = dst;
// Initialize the function cache.
let mut fn_table = FunctionCache::new();
// Create a table of streams from the contiguous stream lengths present in
// the header.
let mut stream = {
use core::mem::{transmute, MaybeUninit};
// Do not initialize the table, we need to calculate each stream from
// the length bytes.
let mut stream: [MaybeUninit<&[u8]>; Stream::ST_MAX] =
[MaybeUninit::uninit(); Stream::ST_MAX];
let mut header = src;
let mut offset = Stream::ST_MAX * size_of::<u32>();
for slice in &mut stream {
// Retrieve the length of the slice, then initialize it.
let len = fetch::<u32>(&mut header) as usize;
slice.write(&src[offset..offset + len]);
offset += len;
}
// SAFETY: Each stream has been fully initialized and the ABI of
// MaybeUninit<&[u8]> and &[u8] is the same. Furthermore, the ABI of
// [&[u8]; Stream::ST_MAX] and that of BorrowedInstructionStream is
// the same, as it is marked as #[repr(transparent)].
transmute::<_, BorrowedInstructionStream>(stream)
};
// Assume that the code section starts at a function (it should be the entry
// point of the executable image).
let mut next_is_func = true;
// Keep track of the start of the destination slice, remember we will be
// "consuming" it as we decode the source and write to the destination.
// So we will no longer be able to access the beginning of the destination
// buffer through `dst`.
let dst_start = dst.as_ptr() as usize;
// While there are OpCodes still left in the stream.
while !stream[Stream::Op].is_empty() {
let start = dst.as_ptr() as usize;
let memory = mem_start + (dst.as_ptr() as u32 - dst_start as u32);
// Retrieve the next OpCode
let mut code = stream.fetch::<u8>(Stream::Op);
// If there is an OP_JUMPTAB (0xce), decode it.
if code == Op::JUMPTAB {
// The next byte contains the length of the jump table - 1.
let count = fetch::<u8>(&mut stream[Stream::ST_JUMPTBL_COUNT]) as u32 + 1;
// For every entry in the Jump Table
for _ in 0..count {
let index = stream.fetch::<u8>(Stream::CallIdx);
let target = if index != 0 {
fn_table.move_to_front(index - 1, fn_table[index - 1])
} else {
let target = stream.fetch_be::<u32>(Stream::Call32);
fn_table.add(target);
target
};
// Output the original function target.
write::<u32>(&mut dst, target);
}
// Immediately fetch the next opcode.
continue;
}
// If the next bytes are supposed to contain an OpCode, and the bytes
// are not function padding (emitted by MSVC).
if next_is_func && code != Op::OP_INT3 {
fn_table.add(memory);
next_is_func = false;
}
// If the op is an escape code, emit the following byte
if code == Op::ESCAPE {
stream.copy::<u8>(Stream::Op, &mut dst);
} else {
write::<u8>(&mut dst, code);
let mut flags;
let mut o16 = false;
// Handle operand size prefix.
if code == Op::OP_OSIZE {
o16 = true;
code = stream.copy::<u8>(Stream::Op, &mut dst);
}
// If the code is a return or MSVC function padding
if matches!(code, Op::OP_RETNI | Op::OP_RETN | Op::OP_INT3) {
next_is_func = true;
}
// Two-byte OpCode, additional OpCode byte following.
if code == Op::OP_2BYTE {
flags = TABLE_2[stream.copy::<u8>(Stream::Op, &mut dst) as usize];
} else {
flags = TABLE_1[code as usize];
}
assert!(flags != F::ERR);
if matches!(code, Op::OP_CALLF | Op::OP_JMPF | Op::OP_ENTER) {
// far call/jump have a *48-bit* immediate address. we deal with it here by copying the segment
// index manually and encoding the rest as a normal 32-bit direct address.
// similarly, enter has a word operand and a byte operand. again, we code the word here, and
// deal with the byte later during the normal flow.
stream.copy::<u16>(Stream::Imm16, &mut dst);
}
if flags & F::MR == F::MR {
let modrm = stream.copy::<u8>(Stream::MODRM, &mut dst);
let mut sib = 0;
if flags == F::MEXTRA {
flags = TABLE_X[(((modrm >> 3) & 7)
| ((code & 0x01) << 3)
| ((code & 0x08) << 1)) as usize] as _;
}
if (modrm & 0x07 == 4) && (modrm < 0xc0) {
sib = stream.copy::<u8>(Stream::Sib, &mut dst);
}
// Register + Byte Displacement
if modrm & 0xc0 == 0x40 {
let stream_index = (modrm & 0x07) + Stream::ST_DISP8_R0 as u8;
copy::<u8>(&mut stream[stream_index], &mut dst);
}
if (modrm & 0xc0) == 0x80
|| (modrm & 0xc7) == 0x05
|| (modrm < 0x40 && (sib & 0x07) == 0x05)
{
let stream_index = if modrm & 0xc7 == 5 {
Stream::Addr32
} else {
Stream::Disp32
};
stream.copy::<u32>(stream_index, &mut dst);
}
}
if flags & F::MODE == F::AM {
match flags & F::TYPE {
F::AD => {
stream.copy::<u32>(Stream::Addr32, &mut dst);
}
F::DA => {
stream.copy::<u32>(Stream::AJump32, &mut dst);
}
F::BR => {
stream.copy::<u8>(Stream::Jump8, &mut dst);
}
F::DR => {
let mut target;
if code == Op::OP_CALLN {
let index = stream.fetch::<u8>(Stream::CallIdx);
if index != 0 {
target = fn_table.move_to_front(index - 1, fn_table[index - 1]);
} else {
target = stream.fetch_be::<u32>(Stream::Call32);
fn_table.add(target);
}
} else {
target = stream.fetch_be::<u32>(Stream::Jump32);
}
target = target
.wrapping_sub((dst.as_ptr() as usize - start) as u32 + 4 + memory);
write::<u32>(&mut dst, target);
}
_ => {}
}
} else {
match flags & F::TYPE {
F::BI => {
stream.copy::<u8>(Stream::Imm8, &mut dst);
}
F::WI => {
stream.copy::<u16>(Stream::Imm16, &mut dst);
}
F::DI => {
if !o16 {
stream.copy::<u32>(Stream::Imm32, &mut dst);
} else {
stream.copy::<u16>(Stream::Imm16, &mut dst);
}
}
_ => {}
}
}
}
}
true
}
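Assuming flush() prepends the per-stream lengths that the decoder reads back out of the header (which is what the code above implies), a round trip through the pair would look roughly like this:

// Round-trip sketch: `code` is the raw .text bytes, `origin` its virtual address.
fn round_trip(code: &[u8], origin: usize) -> bool {
    let encoded = split_stream(code, origin);
    let mut decoded = vec![0u8; code.len()];
    inverse_split_stream(&encoded, &mut decoded, origin as u32);
    decoded == code
}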
The Results
Unlike the E8E9 transform, it is important that this is only run on the code section of an image, which means parsing the image headers (PE, ELF, etc.). It also assumes only 32-bit x86 code, rather than x86_64/AMD64. I’m just going to hardcode the addresses; maybe in the future I’ll dive into properly parsing and loading PE files, the corkami corpus isn’t going to disappearₚₗₑₐₛₑ any time soon.
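For reference, the values hardcoded below all correspond to fields in the PE headers. Here is a rough sketch of where they could come from, assuming a well-formed PE32 image (the function name and error handling are my own):

fn pe32_code_section(image: &[u8]) -> Option<(u32, u32, u32, u32)> {
    // Little-endian field readers over the raw image bytes.
    let read_u16 = |off: usize| -> Option<u16> {
        Some(u16::from_le_bytes(image.get(off..off + 2)?.try_into().ok()?))
    };
    let read_u32 = |off: usize| -> Option<u32> {
        Some(u32::from_le_bytes(image.get(off..off + 4)?.try_into().ok()?))
    };
    // e_lfanew (offset 0x3C in the DOS header) points at the "PE\0\0" signature.
    let pe = read_u32(0x3C)? as usize;
    if image.get(pe..pe + 4)? != b"PE\0\0" {
        return None;
    }
    // COFF header: NumberOfSections at +6, SizeOfOptionalHeader at +20.
    let num_sections = read_u16(pe + 6)? as usize;
    let opt_size = read_u16(pe + 20)? as usize;
    // The optional header starts after the 4-byte signature and 20-byte COFF header.
    let opt = pe + 24;
    if read_u16(opt)? != 0x10B {
        return None; // not a PE32 image (0x20B would be PE32+)
    }
    let code_size = read_u32(opt + 4)?; // SizeOfCode
    let code_start = read_u32(opt + 20)?; // BaseOfCode (an RVA)
    let image_base = read_u32(opt + 28)?; // ImageBase (PE32 layout)
    // Section headers (40 bytes each) follow the optional header; the section
    // whose RVA matches BaseOfCode tells us the raw file offset of the code.
    for i in 0..num_sections {
        let section = opt + opt_size + i * 40;
        if read_u32(section + 12)? == code_start {
            // (image_base, code_start, code_size, file_offset)
            return Some((image_base, code_start, code_size, read_u32(section + 20)?));
        }
    }
    None
}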
Same deal as last time, just slightly lazier:
fn main() {
// We needed a 32-bit PE image, found this hiding in C:\Windows\SysWOW64
let input_buf = std::fs::read("explorer.exe").expect("Unable to read file");
let encoded_buf = {
// Just hardcode these for now
let code_size = 0x356a00;
let image_base = 0x400000;
let code_start = 0x1000;
let file_offset = 0x400;
let origin = image_base + code_start;
// This is the text section of the image (where the code resides)
let target_buf = &input_buf[file_offset..file_offset + code_size];
// Actually apply the transformation. The code section shrinks even
// before compression, remember the nifty function cache?
let output = split_stream(target_buf, origin);
println!("Filtered code: {} -> {}\n", target_buf.len(), output.len());
// Create a header for the file, we need to know the length of each
// stream for decoding. We prepend this to the output.
let hdr = FileHeader {
size_before: file_offset as _,
size_after: (input_buf.len() - (file_offset + code_size)) as _,
size_transformed: output.len() as _,
size_original: code_size as _,
origin: origin as _,
};
let mut with_header = Vec::<u8>::new();
with_header.extend_from_slice(&hdr.as_vec());
with_header.extend_from_slice(&output);
with_header.extend_from_slice(&input_buf[..hdr.size_before as _]);
let off = file_offset + code_size;
with_header.extend_from_slice(&input_buf[off..off + hdr.size_after as usize]);
with_header
};
// Compress without the transform
let without_split_stream = lz4_flex::compress(&input_buf);
println!(
"Without split stream:\t{} -> {} ({:.2}%)",
input_buf.len(),
without_split_stream.len(),
(without_split_stream.len() as f32 / input_buf.len() as f32) * 100.0,
);
// Try the compression a second time
let with_split_stream = lz4_flex::compress(&encoded_buf);
println!(
"With split stream:\t{} -> {} ({:.2}%)",
input_buf.len(),
with_split_stream.len(),
(with_split_stream.len() as f32 / input_buf.len() as f32) * 100.0,
);
// Report the compression ratios
println!(
"\t\t\t{} -> {} = {} ({:.2}%) smaller",
without_split_stream.len(),
with_split_stream.len(),
without_split_stream.len() - with_split_stream.len(),
100.0 - ((with_split_stream.len() as f32 / without_split_stream.len() as f32) * 100.0),
);
}
Drum roll…
Filtered code: 3500544 -> 3373453 (96.37%)
Without split stream: 4990552 -> 2848730 (57.08%)
With split stream: 4990552 -> 2541620 (50.93%)
2848730 -> 2541620 = 307110 (10.78%) smaller
Not bad, it just about doubles the saving reported for E8E9 above. I am only compressing with LZ4 because of the simplicity of the lz4_flex crate; LZMA would provide much better ratios.
[1] Mahoney, M. (2016). ZPAQ source code (version 7.15) [Source code]. https://mattmahoney.net/dc/zpaq715.zip
[2] Mahoney, M. (2012, November 28). pcompress, a deduplication/compression utility. Encode. https://encode.su/threads/1639-pcompress-a-deduplication-compression-utility?p=31545#post31545
[3] Giesen, F. (2011, January 24). x86 code compression in kkrunchy. The ryg blog. https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/