Assembly Transformations for Binary Compression
Introduction
Since the introduction of memory-managed programming languages, the tradeoff between speed and safety (or rather the conveniences of a high-level language) has often been the center of attention when deciding the “correct” language for the next piece of software. Some modern languages such as C++, and especially Rust, have worked towards closing that gap with their “zero-cost abstractions.” However, one of the less noted tradeoffs with these languages is the size of the executable image they produce. Languages such as Java, Python, or any .NET language obviously depend on some runtime environment that must be installed to run their programs. But the same happens with any language that has a dependency on libc, for example. With Rust this has become particularly noticeable, with a Stack Overflow user reporting their debug build of “Hello, World!” being 3MB (3 million bytes!). There are many reasons why their debug build is so large, and there’s an entire GitHub page dedicated to mitigating the size. But how small can we go?
This isn’t exactly a new question, especially to the demoscene. For many years, authors have been showing off visual/audio demos, often restricted to 64K or 4K bytes in size, for various platforms. Often this is done with the help of an executable packing tool. Packers compress the original program and attach a tiny “stub,” a small program (often written in assembly) that decompresses and executes the original image with no difference to the end user. UPX is the most famous example of an executable packer, boasting a 50%-70% reduction in size.
This article isn’t going to focus on how packers or stubs work, but rather on some of the obscure methods I have personally come across to increase the compression ratio of executable images before they are compressed by a context-modeling or LZ-based algorithm, along with my take on a modern implementation of them in Rust.
E8E9 Filter
Originally introduced in Windows Cabinet files, the BCJ algorithm, also known as the E8E9 transform/filter, finds jump and call instructions that are encoded with relative addresses (a “displacement” from the instruction pointer) and converts the target address to its absolute counterpart.
Consider the following snippet of a FizzBuzz program in x86:
div_by_3_5: ; print fizz buzz
mov eax, 4 ; syscall number
mov ebx, 1 ; stdout
mov ecx, fizzbuzz ; ptr to buffer
mov edx, 11 ; buffer size
int 0x80
jmp dword cont
div_by_3: ; print fizz
mov eax, 4 ; syscall number
mov ebx, 1 ; stdout
mov ecx, fizz ; ptr to buffer
mov edx, 6 ; buffer size
int 0x80
jmp dword cont
div_by_5: ; print buzz
mov eax, 4 ; syscall number
mov ebx, 1 ; stdout
mov ecx, buzz ; ptr to buffer
mov edx, 6 ; buffer size
int 0x80
jmp dword cont
cont:
pop ecx ; restore loop counter
cmp ecx, 100
jz exit ; end of loop
inc ecx ; increment loop counter
jmp loop_main
; Here I am using `jmp dword`, to force the instruction to be encoded as a "near
; jump" (E9 cd, JMP rel32), rather than a short jump (EB cb JMP rel8).
section .data
fizz db "Fizz", 0Ah, 0Dh
buzz db "Buzz", 0Ah, 0Dh
fizzbuzz db "Fizz Buzz", 0Ah, 0Dh
After assembling and linking, we can use objdump to see what all of the jmp instructions look like once encoded:
nasm -f elf32 -o fizzbuzz.o fizzbuzz.nasm
ld -m elf_i386 -o fizzbuzz fizzbuzz.o
objdump -d fizzbuzz | grep "jmp.*cont"
...
8049046: e9 51 00 00 00 jmp 804909c <cont>
8049061: e9 36 00 00 00 jmp 804909c <cont>
804907c: e9 1b 00 00 00 jmp 804909c <cont>
8049097: e9 00 00 00 00 jmp 804909c <cont>
Even though we are jumping to the same location, because the jumps are encoded as “Jump near, relative, RIP = RIP + 32-bit displacement sign extended to 64-bits”, the resulting bytes are slightly different for every jump (e.g. the first displacement is 0x804909c - (0x8049046 + 5) = 0x51). We can increase the redundancy in the file by converting each jump from its displacement (an offset from the instruction pointer) to an absolute form (an offset from the beginning of the file), so that the binary ends up looking a little more like this:
8049046: e9 80 49 09 c0 jmp 804909c <cont>
8049061: e9 80 49 09 c0 jmp 804909c <cont>
804907c: e9 80 49 09 c0 jmp 804909c <cont>
8049097: e9 80 49 09 c0 jmp 804909c <cont>
Matt Mahoney’s ZPAQ contains a good reference implementation of this algorithm in C:
It scans the input block backward for 5 byte strings of the form {E8|E9 xx xx xx 00|FF} and adds the offset from the start of the block to the middle 3 bytes interpreted as a little-endian (LSB first) number (mod 2^24). E8 and E9 are the CALL and JMP instructions, followed by a 32 bit relative offset.[1]
// E8E9 transform of buf[0..n-1] to improve compression of .exe and .dll.
// Patterns (E8 | E9 xx xx xx 00 | FF) at offset i replace the 3 middle
// bytes with x+i mod 2^24, LSB first, reading backward.
void e8e9(unsigned char* buf, int n) {
for (int i = n-5; i >= 0; --i) {
if (((buf[i] & 254) == 0xe8) && ((buf[i + 4] + 1) & 254) == 0) {
unsigned a = (buf[i + 1] | buf[i + 2] << 8 | buf [i + 3] << 16) + i;
buf[i + 1] = a;
buf[i + 2] = a >> 8;
buf[i + 3] = a >> 16;
}
}
}
The implementation is pretty straightforward, with a few notable optimizations. The first is the check (buf[i] & 254) == 0xe8: since 0xE8 and 0xE9 differ only in their lowest bit, masking that bit off lets a single comparison match both opcodes, for example:
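0xE8 = 1110 1000 (CALL rel32)
0xE9 = 1110 1001 (JMP rel32)
Only the lowest bit differs, so (buf[i] & 254) == 0xE8 accepts exactly these two opcodes.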
This is trivial for modern compilers to do for you, and I opted to omit it for clarity in my implementation.
The second check, for 0x00 or 0xFF, does in fact produce two cmp instructions when implemented as buf[i + 4] == 0xFF || buf[i + 4] == 0x00 rather than relying on the wrapping add, so I decided to keep that trick in my implementation to reduce the final assembly size.
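To illustrate the equivalence (a small sketch of my own, not part of the filter), both of the following accept exactly the bytes 0x00 and 0xFF:

fn is_00_or_ff_naive(e: u8) -> bool {
    e == 0x00 || e == 0xFF
}

// wrapping_add(1) maps 0xFF to 0x00 and 0x00 to 0x01, so one masked compare suffices.
fn is_00_or_ff_wrapping(e: u8) -> bool {
    e.wrapping_add(1) & 254 == 0
}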
Finally, here is my take on the implementation in Rust:
// The lending iterator trait and WindowsMut structure are just a fancy way of
// being able to iterate over the input buffer without manually indexing it
// C-style. I wouldn't include all of this overly complex boilerplate in
// production software, where simplicity has a much higher value. It is,
// however, an example of a (close to) "zero-cost abstraction" that I felt
// would be interesting to include.
trait LendingIterator {
type Item<'a>
where
Self: 'a;
fn next<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)>;
fn prev<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)>;
}
pub struct WindowsMut<'window, const WINDOW_LEN: usize, T> {
slice: &'window mut [T],
index: usize,
}
impl<'window, const WINDOW_LEN: usize, T> WindowsMut<'window, WINDOW_LEN, T> {
fn new(slice: &'window mut [T]) -> Self {
Self { slice, index: 0 }
}
fn reverse(self) -> Self {
    // Start one past the last full window so that `prev` can visit every
    // window, down to and including the one at index 0.
    let index = self.slice.len().saturating_sub(WINDOW_LEN - 1);
    Self {
        slice: self.slice,
        index,
    }
}
}
impl<'window, const WINDOW_LEN: usize, T> LendingIterator for WindowsMut<'window, WINDOW_LEN, T> {
type Item<'a> = &'a mut [T; WINDOW_LEN] where Self: 'a;
fn next<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)> {
let Self { slice, index } = self;
let ans = slice[*index..].first_chunk_mut();
let res = ans.map(|chunk| (*index, chunk));
*index += 1;
res
}
fn prev<'a>(&'a mut self) -> Option<(usize, Self::Item<'a>)> {
    let Self { slice, index } = self;
    // `index` points one past the next window to hand out, so the window
    // at index 0 is still visited on the final call.
    if *index == 0 {
        return None;
    }
    *index -= 1;
    slice[*index..]
        .first_chunk_mut()
        .map(|chunk| (*index, chunk))
}
}
// This modifies the buffer in place, rather than requiring the user to make a
// separate allocation.
pub fn e8e9(bytes: &mut [u8]) {
let mut iter = WindowsMut::new(bytes);
while let Some((i, &mut [a, ref mut b, ref mut c, ref mut d, e])) = iter.next() {
// We must use e.wrapping_add(1) here rather than e + 1, because overflow
// panics in Debug builds, and we are relying on the value wrapping around.
if (a == 0xE8 || a == 0xE9) && (e.wrapping_add(1) & 254 == 0) {
// u32::from_le_bytes produces a better assembly output over '*b
// as u32 | ((*c as u32) << 8) | ((*d as u32) << 16)', and is nicer
// to read.
let displacement = u32::from_le_bytes([*b, *c, *d, 0]);
let offset = displacement.wrapping_add(i as u32);
*b = (offset >> 16) as u8;
*c = (offset >> 8) as u8;
*d = offset as u8;
}
}
}
And inversely:
pub fn e8e9_inverse(bytes: &mut [u8]) {
let mut iter = WindowsMut::new(bytes).reverse();
while let Some((i, &mut [a, ref mut b, ref mut c, ref mut d, e])) = iter.prev() {
// We must use e.wrapping_add(1) here rather than e + 1, because overflow
// panics in Debug builds, and we are relying on the value wrapping around.
if (a == 0xE8 || a == 0xE9) && (e.wrapping_add(1) & 254 == 0) {
// u32::from_le_bytes produces a better assembly output over '*d
// as u32 | ((*c as u32) << 8) | ((*b as u32) << 16)', and is nicer
// to read.
let displacement = u32::from_le_bytes([*d, *c, *b, 0]);
let offset = displacement.wrapping_sub(i as u32);
*b = offset as u8;
*c = (offset >> 8) as u8;
*d = (offset >> 16) as u8;
}
}
}
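A quick round-trip check (a test of my own, using a hand-written E9 rel32 jump followed by NOP padding) confirms the two functions undo each other:

#[test]
fn e8e9_round_trip() {
    let original = vec![0xE9, 0x51, 0x00, 0x00, 0x00, 0x90, 0x90, 0x90, 0x90, 0x90];
    let mut buf = original.clone();
    e8e9(&mut buf);
    assert_ne!(buf, original); // the displacement bytes were rewritten
    e8e9_inverse(&mut buf);
    assert_eq!(buf, original); // and restored exactly
}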
Despite using the lending-iterator pattern to build a mutable sliding window, and the safety guarantees of the Rust version, the compiled output is only marginally bigger than its C counterpart:
; rustc 1.82.0 with -Copt-level=3
example::e8e9::h2b129b97de9d3e96:
push rbx
mov eax, 1
mov rcx, rsi
jmp .LBB0_1
.LBB0_4:
lea rax, [rdx + 1]
dec rdx
dec rcx
cmp rdx, rsi
jae .LBB0_5
.LBB0_1:
cmp rcx, 5
jb .LBB0_7
mov rdx, rax
movzx eax, byte ptr [rdi + rax - 1]
mov r8d, eax
and r8b, -2
cmp r8b, -24
jne .LBB0_4
movzx r8d, byte ptr [rdi + rdx + 3]
inc r8b
cmp r8b, 2
jae .LBB0_4
movzx ebx, word ptr [rdi + rdx]
shl ebx, 8
or ebx, eax
add ebx, edx
add eax, edx
mov byte ptr [rdi + rdx], al
mov byte ptr [rdi + rdx + 1], bh
shr ebx, 16
mov byte ptr [rdi + rdx + 2], bl
jmp .LBB0_4
.LBB0_7:
pop rbx
ret
; C Version
e8e9:
add esi, -5
.LBB0_1:
test esi, esi
js .LBB0_6
mov eax, esi
mov cl, byte ptr [rdi + rax]
and cl, -2
cmp cl, -24
jne .LBB0_5
mov cl, byte ptr [rdi + rax + 4]
inc cl
cmp cl, 1
ja .LBB0_5
movzx edx, word ptr [rdi + rax + 1]
movzx ecx, byte ptr [rdi + rax + 3]
shl ecx, 16
add edx, esi
add ecx, edx
mov byte ptr [rdi + rax + 1], dl
mov byte ptr [rdi + rax + 2], ch
shr ecx, 16
mov byte ptr [rdi + rax + 3], cl
.LBB0_5:
dec esi
jmp .LBB0_1
.LBB0_6:
ret
The Results
To test the transform, we can create a new Rust program with cargo new e8e9-test, add a dependency on lz4_flex with cargo add lz4_flex while in the e8e9-test directory, and then add the following code to main.rs:
fn main() {
// Get the input argument from the User
let args: Vec<_> = std::env::args().collect();
assert!(args.len() == 2, "Bad usage");
// Read the input binary into a vector
let mut bytes = std::fs::read(&args[1]).expect("Unable to read file");
// Compress the image using LZ4 without the transform
let without_e8e9 = lz4_flex::compress(&bytes);
println!(
"Without e8e9:\t{} -> {} ({:.2}%)",
bytes.len(),
without_e8e9.len(),
(without_e8e9.len() as f32 / bytes.len() as f32) * 100.0,
);
// Apply the transform
e8e9(&mut bytes);
// Try the compression a second time
let with_e8e9 = lz4_flex::compress(&bytes);
println!(
"With e8e9:\t{} -> {} ({:.2}%)",
bytes.len(),
with_e8e9.len(),
(with_e8e9.len() as f32 / bytes.len() as f32) * 100.0,
);
// Report the compression ratios
println!(
"\t\t{} -> {} = {} ({:.2}%) smaller",
without_e8e9.len(),
with_e8e9.len(),
without_e8e9.len() - with_e8e9.len(),
100.0 - ((with_e8e9.len() as f32 / without_e8e9.len() as f32) * 100.0),
);
}
As you can see, we don’t need to parse the executable image for its text section (where the code is), due to the improbability of the E8E9 pattern appearing in random data. If the probability of any particular byte value appearing is 1/256, the probability of the pattern matching at a given offset in random data is (2/256) × (2/256) = 1/16384, since only the first byte (0xE8 or 0xE9) and the fifth byte (0x00 or 0xFF) are constrained. Furthermore, Matt Mahoney points out that 0xE8 and 0xE9 are out of range for ASCII-encoded data, and 0xFF is invalid for UTF-8 encoded data.[2]
Finally, place any executable image (PE, ELF, etc.) in the directory with your Cargo.toml and run cargo run -- filename. I chose nvidia-pcc.exe just due to its sheer size.
cargo run -- nvidia-pcc.exe
Without e8e9: 25312776 -> 13390424 (52.90%)
With e8e9: 25312776 -> 12674656 (50.07%)
13390424 -> 12674656 = 715768 bytes (5.35%) smaller
We can see that this transformation saved another 715,768 bytes in our final compressed buffer, which is not bad for such a simple and fast algorithm. Personally, this is enough for me - shaving off any more bytes isn’t worth it … something something diminishing returns. Maybe I only say this because I spent the last couple of days debugging the next one. Either way, it doesn’t get any simpler.
Split-Stream Encoding
Fabian “ryg” Giesen’s executable packer, kkrunchy, uses a slightly more involved approach to compressing x86. Originally based on a Microsoft Research technical report titled PPMexe, split-stream encoding separates the components of each instruction and groups them into “streams” based on their function. In the following example, each instruction’s bytes are split among the streams named in the legend beneath the listing:
e8 6a 13 02 00 call CopyMem4
8b 4b 1c mov ecx, dword ptr [ebx+1Ch]
83 c4 0c add esp, 0Ch
8b c5 mov eax, ebp
d1 e8 shr eax, 1
8b f5 mov esi, ebp
8d 14 40 lea edx, dword ptr [eax+eax*2]
8b 44 24 1c mov eax, dword ptr [esp+1Ch]
83 e6 01 and esi, 1
8d 3c 96 lea edi, dword ptr [esi+edx*4]
39 44 b9 1c cmp dword ptr [ecx+edi*4+1ch], eax
75 77 jne short L15176
Opcode + ModRM — Jump Offset — Displacement — SIB — Immediate
(Source: https://www.farbrausch.de/~fg/seminars/workcompression_download.pdf)
The idea is that the components of instructions are more alike (read: repetitive) when grouped this way than when they are laid out in the sequence of a regular instruction. Notice how the Displacement stream now contains three repeated 0x1c bytes, in contrast to the original byte sequence in the left-hand column, where they sit several instructions apart from each other.
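Splitting the listing above by hand (my own grouping; kkrunchy’s real filter further splits byte displacements per base register and gives call targets their own streams) yields roughly:

Opcode + ModRM: e8  8b 4b  83 c4  8b c5  d1 e8  8b f5  8d 14  8b 44  83 e6  8d 3c  39 44  75
SIB:            40  24  96  b9
Displacement:   1c  1c  1c
Immediate:      0c  01
Jump offset:    6a 13 02 00  77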
This does require a disassembler implementation (a fairly complex problem for CISC instruction sets) in order to separate the instructions into their respective streams. Fortunately, the disassembler doesn’t actually have to be very accurate. When x86 is incorrectly disassembled, it tends to resynchronize itself shortly after. This can be observed in disassemblers such as IDA Pro or Ghidra when bytes of data (rather than code) end up in the .text section of an executable image. Dougall J has a more extensive article that goes into deeper detail on the subject.
This transform does come at an implementation cost, in terms of both complexity and the actual size of the compiled decompression routine, but it can increase the compression of an executable by up to ~20%, approximately 10% higher than the E8E9 transform.[3]
Although I can’t fit the entire implementation in the article, I tried to document some of the main ideas in the comments. The following is based on the disfilter source, which is described in more depth here. Aside from just splitting the instructions into separate streams, it uses a small cache of recent call targets (a move-to-front table). Whenever a call is encountered, its target is added to the cache; whenever the target already exists in the cache, we can just write a 1-byte index into the cache rather than the full 4-byte displacement.
Without further delay, here is my main transform routine:
// Function Table implementation omitted for brevity...
const FN_TABLE_LEN: usize = 256;
#[repr(transparent)]
pub struct FunctionCache {
table: [u32; FN_TABLE_LEN],
}
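// A minimal sketch (my own illustration; the article omits the real table
// code) of the cache operations used below. The important property is that
// the encoder and decoder update the table identically, so the 1-byte
// indices they emit and consume stay in sync.
impl FunctionCache {
    pub fn new() -> Self {
        Self { table: [0; FN_TABLE_LEN] }
    }
    // Look up `target`: on a hit, move it to the front and return its old
    // slot; on a miss, insert it at the front and return None.
    pub fn find(&mut self, target: u32) -> Option<u8> {
        if let Some(i) = self.table.iter().position(|&t| t == target) {
            self.move_to_front(i as u8, target);
            Some(i as u8)
        } else {
            self.add(target);
            None
        }
    }
    // Insert `target` at the front, pushing every other entry back one slot.
    pub fn add(&mut self, target: u32) {
        self.table.copy_within(0..FN_TABLE_LEN - 1, 1);
        self.table[0] = target;
    }
    // Move the entry in slot `index` back to the front and return its value.
    pub fn move_to_front(&mut self, index: u8, target: u32) -> u32 {
        self.table.copy_within(0..index as usize, 1);
        self.table[0] = target;
        target
    }
}
impl core::ops::Index<u8> for FunctionCache {
    type Output = u32;
    fn index(&self, i: u8) -> &u32 {
        &self.table[i as usize]
    }
}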
struct SplitStreamCtx {
buffer: InstructionStream,
fn_table: FunctionCache,
next_is_func: bool,
code_start: usize,
code_end: usize,
}
impl SplitStreamCtx {
pub fn process_instr(&mut self, instr: &[u8], memory: usize) -> (bool, usize) {
let mut instr = instr;
// If there is a jump with N entries
if let Some(entries) = self.detect_jmp_table(instr, memory) {
// For every entry in the jump table
let mut remaining = entries;
while remaining > 0 {
// Jump table counts are encoded as u8's, so they are 256
// entries long per Jump Table code.
let count = if remaining < 256 { remaining } else { 256 };
// Emit the jump table opcode, followed the count.
self.buffer.put::<u8>(Stream::Op, Op::JUMPTAB);
self.buffer
.put::<u8>(Stream::ST_JUMPTBL_COUNT, (count - 1) as u8);
for _ in 0..count {
// Retrieve the target function that the entry leads to
let target = Self::fetch::<u32>(&mut instr);
// If our function exists in the table
if let Some(index) = self.fn_table.find(target) {
self.buffer.put(Stream::CallIdx, (index + 1) as u8);
} else {
// The function was added to the start of the table
self.buffer.put::<u8>(Stream::CallIdx, 0);
self.buffer.put::<u32>(Stream::Call32, target);
}
}
remaining -= count;
}
// The next instruction is a function, and each entry is
// 4 bytes long.
return (true, entries as usize * 4);
}
// Save where we start parsing the instruction to calculate how many
// bytes we have consumed by the end of the routine. Then save the
// original OpCode for emitting, in case where we are unable to decode
// the full instruction.
let start = instr.as_ptr() as usize;
let start_op = Self::fetch::<u8>(&mut instr);
// Fetch the instruction's OpCode.
let mut code = start_op;
// MSVC pads the space between functions with int3.
if self.next_is_func && code != Op::OP_INT3 {
self.fn_table.add(memory as u32);
self.next_is_func = false;
}
let o16 = if code == Op::OP_OSIZE {
code = Self::fetch::<u8>(&mut instr);
true
} else {
false
};
let (code2, mut flags) = if code == Op::OP_2BYTE {
let code2 = Self::fetch::<u8>(&mut instr);
(code2 as i32, TABLE_2[code2 as usize] as i32)
} else {
(0 as i32, TABLE_1[code as usize] as i32)
};
if matches!(code, Op::OP_RETNI | Op::OP_RETN | Op::OP_INT3) {
// Return, function is going to start next.
self.next_is_func = true;
}
if flags as u8 == F::MEXTRA {
flags = TABLE_X
[(((instr[0] >> 3) & 7) | ((code & 0x01) << 3) | ((code & 0x08) << 1)) as usize]
as _;
}
if flags as u8 != F::ERR {
if o16 {
self.buffer.put::<u8>(Stream::Op, Op::OP_OSIZE);
}
self.buffer.put::<u8>(Stream::Op, code);
if code == Op::OP_2BYTE {
self.buffer.put::<u8>(Stream::Op, code2 as _);
}
if matches!(code, Op::OP_CALLF | Op::OP_JMPF | Op::OP_ENTER) {
// far call/jump have a *48-bit* immediate address. we deal with
// it here by copying the segment index manually and encoding
// the rest as a normal 32-bit direct address. similarly, enter
// has a word operand and a byte operand. again, we code the
// word here, and deal with the byte later during the normal
// flow.
self.copy::<u16>(Stream::Imm16, &mut instr);
}
if flags as u8 & F::MODE == F::MR {
let modrm = self.copy::<u8>(Stream::MODRM, &mut instr);
let sib = if (modrm & 0x07 == 4) && (modrm < 0xc0) {
self.copy::<u8>(Stream::Sib, &mut instr)
} else {
0
};
// Register + Byte Displacement
if modrm & 0xc0 == 0x40 {
let stream = Stream::ST_DISP8_R0 as u8 + (modrm & 0x07);
self.copy::<u8>(stream.try_into().unwrap(), &mut instr);
}
// Register + Dword Displacement
if (modrm & 0xc0 == 0x80)
|| (modrm & 0xc7 == 0x05)
|| ((modrm < 0x40) && (sib & 0x07 == 5))
{
let stream = if modrm & 0xc7 == 0x05 {
Stream::Addr32
} else {
Stream::Disp32
};
self.copy::<u32>(stream, &mut instr);
}
}
if flags as u8 & F::MODE == F::AM {
match flags as u8 & F::TYPE {
F::AD => {
self.copy::<u32>(Stream::Addr32, &mut instr);
}
F::DA => {
self.copy::<u32>(Stream::AJump32, &mut instr);
}
F::BR => {
self.copy::<u8>(Stream::Jump8, &mut instr);
}
F::DR => {
let mut target = Self::fetch::<u32>(&mut instr);
target = target.wrapping_add(
((instr.as_ptr() as usize - start) as u32) + memory as u32,
);
if code != Op::OP_CALLN {
self.buffer.put::<u32>(Stream::Jump32, target);
} else {
// If our function exists in the table
if let Some(index) = self.fn_table.find(target) {
self.buffer.put::<u8>(Stream::CallIdx, index + 1);
} else {
// The function was added to the start of the table
self.buffer.put::<u8>(Stream::CallIdx, 0);
self.buffer.put::<u32>(Stream::Call32, target);
}
}
}
_ => {}
}
} else {
match flags as u8 & F::TYPE {
F::BI => {
self.copy::<u8>(Stream::Imm8, &mut instr);
}
F::WI => {
self.copy::<u16>(Stream::Imm16, &mut instr);
}
F::DI => {
if !o16 {
self.copy::<u32>(Stream::Imm32, &mut instr);
} else {
self.copy::<u16>(Stream::Imm16, &mut instr);
}
}
_ => {}
}
}
let bytes = instr.as_ptr() as usize - start;
return (false, bytes);
}
// We were unable to decode the instruction.
else {
// Emit the escape code, followed by the unrecognized OpCode.
self.buffer.put::<u8>(Stream::Op, Op::ESCAPE);
self.buffer.put::<u8>(Stream::Op, start_op);
return (false, 1);
}
}
fn detect_jmp_table(&self, instr: &[u8], addr: usize) -> Option<u32> {
assert!(addr < self.code_end);
let max = (self.code_end - addr) / 4;
let mut count = 0;
while count < max {
let coded_addr = Self::load::<u32>(&instr[(count * 4)..]);
if coded_addr >= self.code_start as _ && coded_addr < (self.code_end as _) {
count += 1;
} else {
break;
}
}
// If there are less than 3 entries, it's probably not a jump table.
(count >= 3).then_some(count as u32)
}
}
pub fn split_stream(src: &[u8], origin: usize) -> Vec<u8> {
let mut ctx = SplitStreamCtx::new(origin, origin + src.len());
// Process all instructions, until there are less than 15 bytes (the
// maximum size of an x86 instruction).
let mut pos = 0;
while pos < src.len() - MAX_INSTR_LEN {
let (was_jump, bytes) = ctx.process_instr(&src[pos..], origin + pos);
assert!(was_jump || bytes <= MAX_INSTR_LEN);
pos += bytes;
}
// for the last few bytes, be very careful not to read past the end
// of the input instruction stream. create a check point on every
// instruction; if process_instr would've read past the end of the input
// stream, we undo the last step.
while pos < src.len() {
// Copy remaining bytes into buffer
let mut instr_buf = [0u8; MAX_INSTR_LEN];
instr_buf[..src.len() - pos].copy_from_slice(&src[pos..]);
// Save current output size for all streams
let check_point: [usize; Stream::ST_MAX] = ctx
.buffer
.iter()
.map(|buf| buf.len())
.collect::<Vec<usize>>()
.try_into()
.unwrap();
// Process the instruction
let (was_jmp, bytes) = ctx.process_instr(&instr_buf, origin + pos);
assert!(was_jmp || bytes <= MAX_INSTR_LEN);
// If it was a valid instruction
if pos + bytes <= src.len() {
pos += bytes;
} else {
// We read past the end, restore the checkpoint
ctx.buffer
.iter_mut()
.enumerate()
.for_each(|(i, buf)| buf.truncate(check_point[i]));
break;
}
}
// If there are still bytes left, encode them as escapes
while pos < src.len() {
ctx.buffer.put::<u8>(Stream::Op, Op::ESCAPE);
ctx.buffer.put::<u8>(Stream::Op, src[pos]);
pos += 1;
}
// Generate the final output
ctx.flush()
}
And inversely:
#[repr(transparent)]
pub struct BorrowedInstructionStream<'a> {
// Stream::ST_MAX is the number of separate streams the instructions are
// grouped into (19).
buffer: [&'a [u8]; Stream::ST_MAX],
}
// BorrowedInstructionStream implementation omitted for brevity ...
pub fn inverse_split_stream(src: &[u8], dst: &mut [u8], mem_start: u32) -> bool {
// Create a binding that will be "consumed" or incremented, as we write to the destination.
let mut dst = dst;
// Initialize the function cache.
let mut fn_table = FunctionCache::new();
// Create a table of streams from the contiguous stream lengths present in
// the header.
let mut stream = {
use core::mem::{transmute, MaybeUninit};
// Do not initialize the table, we need to calculate each stream from
// the length bytes.
let mut stream: [MaybeUninit<&[u8]>; Stream::ST_MAX] =
[MaybeUninit::uninit(); Stream::ST_MAX];
let mut header = src;
let mut offset = Stream::ST_MAX * size_of::<u32>();
for slice in &mut stream {
// Retrieve the length of the slice, then initialize it.
let len = fetch::<u32>(&mut header) as usize;
slice.write(&src[offset..offset + len]);
offset += len;
}
// SAFETY: Each stream has been fully initialized and the ABI of
// MaybeUninit<&[u8]> and &[u8] is the same. Furthermore, the ABI of
// [&[u8]; Stream::ST_MAX] and that of BorrowedInstructionStream is
// the same, as it is marked as #[repr(transparent)].
transmute::<_, BorrowedInstructionStream>(stream)
};
// Assume that the code section starts at a function (it should be the entry
// point of the executable image).
let mut next_is_func = true;
// Keep track of the start of the destination slice, remember we will be
// "consuming" it as we decode the source and write to the destination.
// So we will no longer be able to access the beginning of the destination
// buffer through `dst`.
let dst_start = dst.as_ptr() as usize;
// While there are OpCodes still left in the stream.
while !stream[Stream::Op].is_empty() {
let start = dst.as_ptr() as usize;
let memory = mem_start + (dst.as_ptr() as u32 - dst_start as u32);
// Retrieve the next OpCode
let mut code = stream.fetch::<u8>(Stream::Op);
// If there is an OP_JUMPTAB (0xce), decode it.
if code == Op::JUMPTAB {
// The next byte contains the length of the jump table - 1.
let count = fetch::<u8>(&mut stream[Stream::ST_JUMPTBL_COUNT]) as u32 + 1;
// For every entry in the Jump Table
for _ in 0..count {
let index = stream.fetch::<u8>(Stream::CallIdx);
let target = if index != 0 {
fn_table.move_to_front(index - 1, fn_table[index - 1])
} else {
let target = stream.fetch_be::<u32>(Stream::Call32);
fn_table.add(target);
target
};
// Output the original function target.
write::<u32>(&mut dst, target);
}
// Immediately fetch the next opcode.
continue;
}
// If the next bytes are supposed to contain an OpCode, and the bytes
// are not function padding (emitted by MSVC).
if next_is_func && code != Op::OP_INT3 {
fn_table.add(memory);
next_is_func = false;
}
// If the op is an escape code, emit the following byte
if code == Op::ESCAPE {
stream.copy::<u8>(Stream::Op, &mut dst);
} else {
write::<u8>(&mut dst, code);
let mut flags;
let mut o16 = false;
// Handle operand size prefix.
if code == Op::OP_OSIZE {
o16 = true;
code = stream.copy::<u8>(Stream::Op, &mut dst);
}
// If the code is a return or MSVC function padding
if matches!(code, Op::OP_RETNI | Op::OP_RETN | Op::OP_INT3) {
next_is_func = true;
}
// Two-byte OpCode, additional OpCode byte following.
if code == Op::OP_2BYTE {
flags = TABLE_2[stream.copy::<u8>(Stream::Op, &mut dst) as usize];
} else {
flags = TABLE_1[code as usize];
}
assert!(flags != F::ERR);
if matches!(code, Op::OP_CALLF | Op::OP_JMPF | Op::OP_ENTER) {
// far call/jump have a *48-bit* immediate address. we deal with it here by copying the segment
// index manually and encoding the rest as a normal 32-bit direct address.
// similarly, enter has a word operand and a byte operand. again, we code the word here, and
// deal with the byte later during the normal flow.
stream.copy::<u16>(Stream::Imm16, &mut dst);
}
if flags & F::MR == F::MR {
let modrm = stream.copy::<u8>(Stream::MODRM, &mut dst);
let mut sib = 0;
if flags == F::MEXTRA {
flags = TABLE_X[(((modrm >> 3) & 7)
| ((code & 0x01) << 3)
| ((code & 0x08) << 1)) as usize] as _;
}
if (modrm & 0x07 == 4) && (modrm < 0xc0) {
sib = stream.copy::<u8>(Stream::Sib, &mut dst);
}
// Register + Byte Displacement
if modrm & 0xc0 == 0x40 {
let stream_index = (modrm & 0x07) + Stream::ST_DISP8_R0 as u8;
copy::<u8>(&mut stream[stream_index], &mut dst);
}
if (modrm & 0xc0) == 0x80
|| (modrm & 0xc7) == 0x05
|| (modrm < 0x40 && (sib & 0x07) == 0x05)
{
let stream_index = if modrm & 0xc7 == 5 {
Stream::Addr32
} else {
Stream::Disp32
};
stream.copy::<u32>(stream_index, &mut dst);
}
}
if flags & F::MODE == F::AM {
match flags & F::TYPE {
F::AD => {
stream.copy::<u32>(Stream::Addr32, &mut dst);
}
F::DA => {
stream.copy::<u32>(Stream::AJump32, &mut dst);
}
F::BR => {
stream.copy::<u8>(Stream::Jump8, &mut dst);
}
F::DR => {
let mut target;
if code == Op::OP_CALLN {
let index = stream.fetch::<u8>(Stream::CallIdx);
if index != 0 {
target = fn_table.move_to_front(index - 1, fn_table[index - 1]);
} else {
target = stream.fetch_be::<u32>(Stream::Call32);
fn_table.add(target);
}
} else {
target = stream.fetch_be::<u32>(Stream::Jump32);
}
target = target
.wrapping_sub((dst.as_ptr() as usize - start) as u32 + 4 + memory);
write::<u32>(&mut dst, target);
}
_ => {}
}
} else {
match flags & F::TYPE {
F::BI => {
stream.copy::<u8>(Stream::Imm8, &mut dst);
}
F::WI => {
stream.copy::<u16>(Stream::Imm16, &mut dst);
}
F::DI => {
if !o16 {
stream.copy::<u32>(Stream::Imm32, &mut dst);
} else {
stream.copy::<u16>(Stream::Imm16, &mut dst);
}
}
_ => {}
}
}
}
}
true
}
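Assuming flush() prepends the per-stream lengths that the decoder reads back out of the header (which is what the code above implies), a round trip through the pair would look roughly like this:

// Round-trip sketch: `code` is the raw .text bytes, `origin` its virtual address.
fn round_trip(code: &[u8], origin: usize) -> bool {
    let encoded = split_stream(code, origin);
    let mut decoded = vec![0u8; code.len()];
    inverse_split_stream(&encoded, &mut decoded, origin as u32);
    decoded == code
}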
The Results
Unlike the E8E9 transform, it is important that this is only run on the code section of an image, which means parsing the image headers (PE, ELF, etc.). It also assumes only 32-bit x86 code, rather than x86_64/AMD64. I’m just going to hardcode the addresses; maybe in the future I’ll dive into properly parsing and loading PE files, the corkami corpus isn’t going to disappearₚₗₑₐₛₑ any time soon.
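For reference, the values hardcoded below all correspond to fields in the PE headers. Here is a rough sketch of where they could come from, assuming a well-formed PE32 image (the function name and error handling are my own):

fn pe32_code_section(image: &[u8]) -> Option<(u32, u32, u32, u32)> {
    // Little-endian field readers over the raw image bytes.
    let read_u16 = |off: usize| -> Option<u16> {
        Some(u16::from_le_bytes(image.get(off..off + 2)?.try_into().ok()?))
    };
    let read_u32 = |off: usize| -> Option<u32> {
        Some(u32::from_le_bytes(image.get(off..off + 4)?.try_into().ok()?))
    };
    // e_lfanew (offset 0x3C in the DOS header) points at the "PE\0\0" signature.
    let pe = read_u32(0x3C)? as usize;
    if image.get(pe..pe + 4)? != b"PE\0\0" {
        return None;
    }
    // COFF header: NumberOfSections at +6, SizeOfOptionalHeader at +20.
    let num_sections = read_u16(pe + 6)? as usize;
    let opt_size = read_u16(pe + 20)? as usize;
    // The optional header starts after the 4-byte signature and 20-byte COFF header.
    let opt = pe + 24;
    if read_u16(opt)? != 0x10B {
        return None; // not a PE32 image (0x20B would be PE32+)
    }
    let code_size = read_u32(opt + 4)?; // SizeOfCode
    let code_start = read_u32(opt + 20)?; // BaseOfCode (an RVA)
    let image_base = read_u32(opt + 28)?; // ImageBase (PE32 layout)
    // Section headers (40 bytes each) follow the optional header; the section
    // whose RVA matches BaseOfCode tells us the raw file offset of the code.
    for i in 0..num_sections {
        let section = opt + opt_size + i * 40;
        if read_u32(section + 12)? == code_start {
            // (image_base, code_start, code_size, file_offset)
            return Some((image_base, code_start, code_size, read_u32(section + 20)?));
        }
    }
    None
}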
Same deal as last time, just slightly lazier:
fn main() {
// We needed a 32-bit PE image, found this hiding in C:\Windows\SysWOW64
let input_buf = std::fs::read("explorer.exe").expect("Unable to read file");
let encoded_buf = {
// Just hardcode these for now
let code_size = 0x356a00;
let image_base = 0x400000;
let code_start = 0x1000;
let file_offset = 0x400;
let origin = image_base + code_start;
// This is the text section of the image (where the code resides)
let target_buf = &input_buf[file_offset..file_offset + code_size];
// Actually apply the transformation. The code section shrinks even
// before compression, remember the nifty function cache?
let output = split_stream(target_buf, origin);
println!("Filtered code: {} -> {}\n", target_buf.len(), output.len());
// Create a header for the file, we need to know the length of each
// stream for decoding. We prepend this to the output.
let hdr = FileHeader {
size_before: file_offset as _,
size_after: (input_buf.len() - (file_offset + code_size)) as _,
size_transformed: output.len() as _,
size_original: code_size as _,
origin: origin as _,
};
let mut with_header = Vec::<u8>::new();
with_header.extend_from_slice(&hdr.as_vec());
with_header.extend_from_slice(&output);
with_header.extend_from_slice(&input_buf[..hdr.size_before as _]);
let off = file_offset + code_size;
with_header.extend_from_slice(&input_buf[off..off + hdr.size_after as usize]);
with_header
};
// Compress without the transform
let without_split_stream = lz4_flex::compress(&input_buf);
println!(
"Without split stream:\t{} -> {} ({:.2}%)",
input_buf.len(),
without_split_stream.len(),
(without_split_stream.len() as f32 / input_buf.len() as f32) * 100.0,
);
// Try the compression a second time
let with_split_stream = lz4_flex::compress(&encoded_buf);
println!(
"With split stream:\t{} -> {} ({:.2}%)",
input_buf.len(),
with_split_stream.len(),
(with_split_stream.len() as f32 / input_buf.len() as f32) * 100.0,
);
// Report the compression ratios
println!(
"\t\t\t{} -> {} = {} ({:.2}%) smaller",
without_split_stream.len(),
with_split_stream.len(),
without_split_stream.len() - with_split_stream.len(),
100.0 - ((with_split_stream.len() as f32 / without_split_stream.len() as f32) * 100.0),
);
}
Drum roll…
Filtered code: 3500544 -> 3373453 (96.37%)
Without split stream: 4990552 -> 2848730 (57.08%)
With split stream: 4990552 -> 2541620 (50.93%)
2848730 -> 2541620 = 307110 (10.78%) smaller
Not bad, it just about doubles the saving reported for E8E9 above. I am only compressing with LZ4 because of the simplicity of the lz4_flex crate; LZMA would provide much better ratios.
[1] Mahoney, M. (2016). ZPAQ source code (version 7.15) [Source code]. https://mattmahoney.net/dc/zpaq715.zip
[2] Mahoney, M. (2012, November 28). pcompress, a deduplication/compression utility. Encode. https://encode.su/threads/1639-pcompress-a-deduplication-compression-utility?p=31545#post31545
[3] Giesen, F. (2011, January 24). x86 code compression in kkrunchy. The ryg blog. https://fgiesen.wordpress.com/2011/01/24/x86-code-compression-in-kkrunchy/