|
Volume 0x0b, Issue 0x3f, Phile #0x0d of 0x14 |=------=[ cryptexec: Next-generation runtime binary encryption ]=-------=| |=------=[ using on-demand function extraction ]=-------=| |=-----------------------------------------------------------------------=| |=----------------=[ Zeljko Vrba <zvrba@globalnet.hr> ]=-----------------=| |=-----------------------------------------------------------------------=| ABSTRACT Please excuse my awkward English, it is not my native language. What is binary encryption and why encrypt at all? For the answer to this question the reader is referred to the Phrack#58 [1] and article therein titled "Runtime binary encryption". This article describes a method to control the target program that doesn't does not rely on any assistance from the OS kernel or processor hardware. The method is implemented in x86-32 GNU AS (AT&T syntax). Once the controlling method is devised, it is relatively trivial to include on-the-fly code decryption. 1 Introduction 2 OS- and hardware-assisted tracing 3 Userland tracing 3.1 Provided API 3.2 High-level description 3.3 Actual usage example 3.4 XDE bug 3.5 Limitations 3.6 Porting considerations 4 Further ideas 5 Related work 5.1 ELFsh 5.2 Shiva 5.3 Burneye 5.4 Conclusion 6 References 7 Credits A Appendix: source code A.1 crypt_exec.S A.2 cryptfile.c A.3 test2.c Note: Footnotes are marked by # and followed by the number. They are listed at the end of each section. --[ 1.0 - Introduction First let me introduce some terminology used in this article so that the reader is not confused. o The attributes "target", "child" and "traced" are used interchangeably (depending on the context) to refer to the program being under the control of another program. o The attributes "controlling" and "tracing" are used interchangeably to refer to the program that controls the target (debugger, strace, etc.) --[ 2.0 - OS- and hardware-assisted tracing Current debuggers (both under Windows and UNIX) use x86 hardware features for debugging. The two most commonly used features are the trace flag (TF) and INT3 instruction, which has a convenient 1-byte encoding of 0xCC. TF resides in bit 8 of the EFLAGS register and when set to 1 the pro- cessor generates exception 1 (debug exception) after each instruction is executed. When INT3 is executed, the processor generates exception 3 (breakpoint). The traditional way to trace a program under UNIX is the ptrace(2) syscall. The program doing the trace usually does the following (shown in pseudocode): fork() child: ptrace(PT_TRACE_ME) execve("the program to trace") parent: controls the traced program with other ptrace() calls Another way is to do ptrace(PT_ATTACH) on an already existing process. Other operations that ptrace() interface offers are reading/writing target instruction/data memory, reading/writing registers or continuing the execution (continually or up to the next system call - this capability is used by the well-known strace(1) program). Each time the traced program receives a signal, the controlling program's ptrace() function returns. When the TF is turned on, the traced program receives a SIGTRAP after each instruction. The TF is usually not turned on by the traced program#1, but from the ptrace(PT_STEP). Unlike TF, the controlling program places 0xCC opcode at strategic#2 places in the code. The first byte of the instruction is replaced with 0xCC and the controlling program stores both the address and the original opcode. When execution comes to that address, SIGTRAP is delivered and the controlling program regains control. Then it replaces (again using ptrace()) 0xCC with original opcode and single-steps the original instruction. After that the original opcode is usually again replaced with 0xCC. Although powerful, ptrace() has several disadvantages: 1. The traced program can be ptrace()d only by one controlling program. 2. The controlling and traced program live in separate address spaces, which makes changing traced memory awkward. 3. ptrace() is a system call: it is slow if used for full-blown tracing of larger chunks of code. I won't go deeper in the mechanics of ptrace(), there are available tutorials [2] and the man page is pretty self-explanatory. __ #1 Although nothing prevents it to do so - it is in the user-modifiable portion of EFLAGS. #2 Usually the person doing the debugging decides what is strategic. --[ 3.0 - Userland tracing The tracing can be done solely from the user-mode: the instructions are executed natively, except control-transfer instructions (CALL, JMP, Jcc, RET, LOOP, JCXZ). The background of this idea is explained nicely in [3] on the primitive 1960's MIX computer designed by Knuth. Features of the method I'm about to describe: o It allows that only portions of the executable file are encrypted. o Different portions of the executable can be encrypted with different keys provided there is no cross-calling between them. o It allows encrypted code to freely call non-encrypted code. In this case the non-encrypted code is also executed instruction by instruction. When called outside of encrypted code, it still executes without tracing. o There is never more than 24 bytes of encrypted code held in memory in plaintext. o OS- and language-independent. The rest of this section explains the provided API, gives a high-level description of the implementation, shows a usage example and discusses Here are the details of my own implementation. ----[ 3.1 - Provided API No "official" header file is provided. Because of the sloppy and convenient C parameter passing and implicit function declarations, you can get away with no declarations whatsoever. The decryption API consists of one typedef and one function. typedef (*decrypt_fn_ptr)(void *key, unsigned char *dst, const unsigned char *src); This is the generic prototype that your decryption routine must fit. It is called from the main decryption routine with the following arguments: o key: pointer to decryption key data. Note that in most cases this is NOT the raw key but pointer to some kind of "decryption context". o dst: pointer to destination buffer o src: pointer to source buffer Note that there is no size argument: the block size is fixed to 8 bytes. The routine should not read more than 8 bytes from the src and NEVER output more than 8 bytes to dst. Another unusual constraint is that the decryption function MUST NOT modify its arguments on the stack. If you need to do this, copy the stack arguments into local variables. This is a consequence of how the routine is called from within the decryption engine - see the code for details. There are no constraints whatsoever on the kind of encryption which can be used. ANY bijective function which maps 8 bytes to 8 bytes is suitable. Encrypt the code with the function, and use its inverse for the decryption. If you use the identity function, then decryption becomes simple single-stepping with no hardware support -- see section 4 for related work. The entry point to the decryption engine is the following function: int crypt_exec(decrypt_fn_ptr dfn, const void *key, const void *lo_addr, const void *hi_addr, const void *F, ...); The decryption function has the capability to switch between executing both encrypted and plain-text code. The encrypted code can call the plain-text code and plain-text code can return into the encrypted code. But for that to be possible, it needs to know the address bounds of the encrypted code. Note that this function is not reentrant! It is not allowed for ANY kind of code (either plain-text or encrypted) running under the crypt_exec routine to call crypt_exec again. Things will break BADLY because the internal state of previous invocation is statically allocated and will get overwritten. The arguments are as follows: o dfn: Pointer to decryption function. The function is called with the key argument provided to crypt_exec and the addresses of destination and source buffers. o key: This are usually NOT the raw key bytes, but the initialized decryption context. See the example code for the test2 program: first the user-provided raw key is loaded into the decryption context and the address of the _context_ is given to the crypt_exec function. o lo_addr, hi_addr: These are low and high addresses that are encrypted under the same key. This is to facilitate calling non-encrypted code from within encrypted code. o F: pointer to the code which should be executed under the decryption engine. It can be an ordinary C function pointer. Since the tracing routine was written with 8-byte block ciphers in mind, the F function must be at least 8-byte aligned and its length must be a multiple of 8. This is easier to achieve (even with standard C) than it sounds. See the example below. o ... become arguments to the called function. crypt_exec arranges to function F to be called with the arguments provided in the varargs list. When crypt_exec returns, its return value is what the F returned. In short, the call x = crypt_exec(dfn, key, lo_addr, hi_addr, F, ...); has exactly the same semantics as x = F(...); would have, were F plain-text. Currently, the code is tailored to use the XDE disassembler. Other disassemblers can be used, but the code which accesses results must be changed in few places (all references to the disbuf variable). The crypt_exec routine provides a private stack of 4kB. If you use your own decryption routine and/or disassembler, take care not to consume too much stack space. If you want to enlarge the local stack, look for the local_stk label in the code. __ #3 In the rest of this article I will call this interchangeably tracing or decryption routine. In fact, this is a tracing routine with added decryption. ----[ 3.2 - High-level description The tracing routine maintains two contexts: the traced context and its own context. The context consists of 8 32-bit general-purpose registers and flags. Other registers are not modified by the routine. Both contexts are held on the private stack (that is also used for calling C). The idea is to fetch, one at a time, instructions from the traced program and execute them natively. Intel instruction set has rather irregular encoding, so the XDE [5] disassembler engine is used to find both the real opcode and total instruction length. During experiments on FreeBSD (which uses LOCK- prefixed MOV instruction in its dynamic loader) I discovered a bug in XDE which is described and fixed below. We maintain our own EIP in traced_eip, round it down to the next lower 8-byte boundary and then decrypt#4 24 bytes#5 into our own buffer. Then the disassembly takes place and the control is transferred to emulation routines via the opcode control table. All instructions, except control transfer, are executed natively (in traced context which is restored at appropriate time). After single instruction execution, the control is returned to our tracing routine. In order to prevent losing control, the control transfer instructions#6 are emulated. The big problem was (until I solved it) emulating indirect JMP and CALL instructions (which can appear with any kind of complex EA that i386 supports). The problem is solved by replacing the CALL/JMP instruction with MOV to register opcode, and modifying bits 3-5 (reg field) of modR/M byte to set the target register (this field holds the part of opcode in the CALL/JMP case). Then we let the processor to calculate the EA for us. Of course, a means are needed to stop the encrypted execution and to enable encrypted code to call plaintext code: 1. On entering, the tracing engine pops the return address and its private arguments and then pushes the return address back to the traced stack. At that moment: o The stack frame is good for executing a regular C function (F). o The top of stack pointer (esp) is stored into end_esp. 2. When the tracing routine encounters a RET instruction it first checks the traced_esp. If it equals end_esp, it is a point where the F function would have ended. Therefore, we restore the traced context and do not emulate RET, but let it execute natively. This way the tracing routine loses control and normal instruction execution continues. In order to allow encrypted code to call plaintext code, there are lo_addr and hi_addr parameters. These parameters determine the low and high boundary of encrypted code in memory. If the traced_eip falls out of [lo_addr, hi_addr) range, the decryption routine pointer is swapped with the pointer to a no-op "decryption" that just copies 8 bytes from source to destination. When the traced_eip again falls into that interval, the pointers are again swapped. __ #4 The decryption routine is called indirectly for reasons described later. #5 The number comes from worst-case considerations: if an instruction begins at a boundary that is 7 (mod 8), given maximum instruction length of 15 bytes, yields a total of 22 bytes = 3 blocks. The buffer has 32 bytes in order to accommodate an additional JMP indirect instruction after the traced instruction. The JMP jumps indirectly to place in the tracing routine where execution should continue. #6 INT instructions are not considered as control transfer. After (if) the OS returns from the invoked trap, the program execution continues sequentially, the instruction right after INT. So there are no special measures that should be taken. ----[ 3.3 - Actual usage example Given encrypted execution engine, how do we test it? For this purpose I have written a small utility named cryptfile that encrypts a portion of the executable file ($ is UNIX prompt): $ gcc -c cast5.c $ gcc cryptfile.c cast5.o -o cryptfile $ ./cryptfile USAGE: ./cryptfile <-e_-d> FILE KEY STARTOFF ENDOFF KEY MUST be 32 hex digits (128 bits). The parameters are as follows: o -e,-d: one of these is MANDATORY and stands for encryption or decryption. o FILE: the executable file to be encrypted. o KEY: the encryption key. It must be given as 32 hex digits. o STARTOFF, ENDOFF: the starting and ending offset in the file that should be encrypted. They must be a multiple of block size (8 bytes). If not, the file will be correctly encrypted, but the encrypted execution will not work correctly. The whole package is tested on a simple program, test2.c. This program demonstrates that encrypted functions can call both encrypted and plaintext functions as well as return results. It also demonstrates that the engine works even when calling functions in shared libraries. Now we build the encrypted execution engine: $ gcc -c crypt_exec.S $ cd xde101 $ gcc -c xde.c $ cd .. $ ld -r cast5.o crypt_exec.o xde101/xde.o -o crypt_monitor.o I'm using patched XDE. The last step is to combine several relocatable object files in a single relocatable file for easier linking with other programs. Then we proceed to build the test program. We must ensure that functions that we want to encrypt are aligned to 8 bytes. I'm specifying 16 , just in case. Therefore: $ gcc -falign-functions=16 -g test2.c crypt_monitor.o -o test2 We want to encrypt functions f1 and f2. How do wemap from function names to offsets in the executable file? Fortunately, this can be simply done for ELF with the readelf utility (that's why I chose such an awkward way - I didn't want to bother with yet another ELF 'parser'). $ readelf -s test2 Symbol table '.dynsym' contains 23 entries: Num: Value Size Type Bind Vis Ndx Name 0: 00000000 0 NOTYPE LOCAL DEFAULT UND 1: 08048484 57 FUNC GLOBAL DEFAULT UND printf 2: 08050aa4 0 OBJECT GLOBAL DEFAULT ABS _DYNAMIC 3: 08048494 0 FUNC GLOBAL DEFAULT UND memcpy 4: 08050b98 4 OBJECT GLOBAL DEFAULT 20 __stderrp 5: 08048468 0 FUNC GLOBAL DEFAULT 8 _init 6: 08051c74 4 OBJECT GLOBAL DEFAULT 20 environ 7: 080484a4 52 FUNC GLOBAL DEFAULT UND fprintf 8: 00000000 0 NOTYPE WEAK DEFAULT UND __deregister_frame.. 9: 0804fc00 4 OBJECT GLOBAL DEFAULT 13 __progname 10: 080484b4 172 FUNC GLOBAL DEFAULT UND sscanf 11: 08050b98 0 NOTYPE GLOBAL DEFAULT ABS __bss_start 12: 080484c4 0 FUNC GLOBAL DEFAULT UND memset 13: 0804ca64 0 FUNC GLOBAL DEFAULT 11 _fini 14: 080484d4 337 FUNC GLOBAL DEFAULT UND atexit 15: 080484e4 121 FUNC GLOBAL DEFAULT UND scanf 16: 08050b98 0 NOTYPE GLOBAL DEFAULT ABS _edata 17: 08050b68 0 OBJECT GLOBAL DEFAULT ABS _GLOBAL_OFFSET_TABLE_ 18: 08051c78 0 NOTYPE GLOBAL DEFAULT ABS _end 19: 080484f4 101 FUNC GLOBAL DEFAULT UND exit 20: 08048504 0 FUNC GLOBAL DEFAULT UND strlen 21: 00000000 0 NOTYPE WEAK DEFAULT UND _Jv_RegisterClasses 22: 00000000 0 NOTYPE WEAK DEFAULT UND __register_frame_info Symbol table '.symtab' contains 145 entries: Num: Value Size Type Bind Vis Ndx Name 0: 00000000 0 NOTYPE LOCAL DEFAULT UND 1: 080480f4 0 SECTION LOCAL DEFAULT 1 2: 08048110 0 SECTION LOCAL DEFAULT 2 3: 08048128 0 SECTION LOCAL DEFAULT 3 4: 080481d0 0 SECTION LOCAL DEFAULT 4 5: 08048340 0 SECTION LOCAL DEFAULT 5 6: 08048418 0 SECTION LOCAL DEFAULT 6 7: 08048420 0 SECTION LOCAL DEFAULT 7 8: 08048468 0 SECTION LOCAL DEFAULT 8 9: 08048474 0 SECTION LOCAL DEFAULT 9 10: 08048520 0 SECTION LOCAL DEFAULT 10 11: 0804ca64 0 SECTION LOCAL DEFAULT 11 12: 0804ca80 0 SECTION LOCAL DEFAULT 12 13: 0804fc00 0 SECTION LOCAL DEFAULT 13 14: 08050aa0 0 SECTION LOCAL DEFAULT 14 15: 08050aa4 0 SECTION LOCAL DEFAULT 15 16: 08050b54 0 SECTION LOCAL DEFAULT 16 17: 08050b5c 0 SECTION LOCAL DEFAULT 17 18: 08050b64 0 SECTION LOCAL DEFAULT 18 19: 08050b68 0 SECTION LOCAL DEFAULT 19 20: 08050b98 0 SECTION LOCAL DEFAULT 20 21: 00000000 0 SECTION LOCAL DEFAULT 21 22: 00000000 0 SECTION LOCAL DEFAULT 22 23: 00000000 0 SECTION LOCAL DEFAULT 23 24: 00000000 0 SECTION LOCAL DEFAULT 24 25: 00000000 0 SECTION LOCAL DEFAULT 25 26: 00000000 0 SECTION LOCAL DEFAULT 26 27: 00000000 0 SECTION LOCAL DEFAULT 27 28: 00000000 0 SECTION LOCAL DEFAULT 28 29: 00000000 0 SECTION LOCAL DEFAULT 29 30: 00000000 0 SECTION LOCAL DEFAULT 30 31: 00000000 0 SECTION LOCAL DEFAULT 31 32: 00000000 0 FILE LOCAL DEFAULT ABS crtstuff.c 33: 08050b54 0 OBJECT LOCAL DEFAULT 16 __CTOR_LIST__ 34: 08050b5c 0 OBJECT LOCAL DEFAULT 17 __DTOR_LIST__ 35: 08050aa0 0 OBJECT LOCAL DEFAULT 14 __EH_FRAME_BEGIN__ 36: 08050b64 0 OBJECT LOCAL DEFAULT 18 __JCR_LIST__ 37: 0804fc08 0 OBJECT LOCAL DEFAULT 13 p.0 38: 08050b9c 1 OBJECT LOCAL DEFAULT 20 completed.1 39: 080485b0 0 FUNC LOCAL DEFAULT 10 __do_global_dtors_aux 40: 08050ba0 24 OBJECT LOCAL DEFAULT 20 object.2 41: 08048610 0 FUNC LOCAL DEFAULT 10 frame_dummy 42: 00000000 0 FILE LOCAL DEFAULT ABS crtstuff.c 43: 08050b58 0 OBJECT LOCAL DEFAULT 16 __CTOR_END__ 44: 08050b60 0 OBJECT LOCAL DEFAULT 17 __DTOR_END__ 45: 08050aa0 0 OBJECT LOCAL DEFAULT 14 __FRAME_END__ 46: 08050b64 0 OBJECT LOCAL DEFAULT 18 __JCR_END__ 47: 0804ca30 0 FUNC LOCAL DEFAULT 10 __do_global_ctors_aux 48: 00000000 0 FILE LOCAL DEFAULT ABS test2.c 49: 08048660 75 FUNC LOCAL DEFAULT 10 f1 50: 080486b0 58 FUNC LOCAL DEFAULT 10 f2 51: 08050bb8 16 OBJECT LOCAL DEFAULT 20 key.0 52: 080486f0 197 FUNC LOCAL DEFAULT 10 decode_hex_key 53: 00000000 0 FILE LOCAL DEFAULT ABS cast5.c 54: 0804cba0 1024 OBJECT LOCAL DEFAULT 12 s1 55: 0804cfa0 1024 OBJECT LOCAL DEFAULT 12 s2 56: 0804d3a0 1024 OBJECT LOCAL DEFAULT 12 s3 57: 0804d7a0 1024 OBJECT LOCAL DEFAULT 12 s4 58: 0804dba0 1024 OBJECT LOCAL DEFAULT 12 s5 59: 0804dfa0 1024 OBJECT LOCALDEFAULT 12 s6 60: 0804e3a0 1024 OBJECT LOCAL DEFAULT 12 s7 61: 0804e7a0 1024 OBJECT LOCAL DEFAULT 12 sb8 62: 0804a3c0 3734 FUNC LOCAL DEFAULT 10 key_schedule 63: 0804b408 0 NOTYPE LOCAL DEFAULT 10 identity_decrypt 64: 08051bf0 0 NOTYPE LOCAL DEFAULT 20 r_decrypt 65: 08051be8 0 NOTYPE LOCAL DEFAULT 20 key 66: 08050bd4 0 NOTYPE LOCAL DEFAULT 20 lo_addr 67: 08050bd8 0 NOTYPE LOCAL DEFAULT 20 hi_addr 68: 08050bcc 0 NOTYPE LOCAL DEFAULT 20 traced_eip 69: 08050be0 0 NOTYPE LOCAL DEFAULT 20 end_esp 70: 08050bd0 0 NOTYPE LOCAL DEFAULT 20 traced_ctr 71: 0804b449 0 NOTYPE LOCAL DEFAULT 10 decryptloop 72: 08050bc8 0 NOTYPE LOCAL DEFAULT 20 traced_esp 73: 08051be4 0 NOTYPE LOCAL DEFAULT 20 stk_end 74: 0804b456 0 NOTYPE LOCAL DEFAULT 10 decryptloop_nocontext 75: 0804b476 0 NOTYPE LOCAL DEFAULT 10 .store_decrypt_ptr 76: 08051bec 0 NOTYPE LOCAL DEFAULT 20 decrypt 77: 0804fc35 0 NOTYPE LOCAL DEFAULT 13 insn 78: 08051bf4 0 NOTYPE LOCAL DEFAULT 20 disbuf 79: 08051be4 0 NOTYPE LOCAL DEFAULT 20 ilen 80: 080501f0 0 NOTYPE LOCAL DEFAULT 13 continue 81: 0804fdf0 0 NOTYPE LOCAL DEFAULT 13 control_table 82: 0804fc20 0 NOTYPE LOCAL DEFAULT 13 _unhandled 83: 0804fc21 0 NOTYPE LOCAL DEFAULT 13 _nonjump 84: 0804fc33 0 NOTYPE LOCAL DEFAULT 13 .execute 85: 0804fc55 0 NOTYPE LOCAL DEFAULT 13 _jcc_rel8 86: 0804fc5e 0 NOTYPE LOCAL DEFAULT 13 _jcc_rel32 87: 0804fc65 0 NOTYPE LOCAL DEFAULT 13 ._jcc_rel32_insn 88: 0804fc71 0 NOTYPE LOCAL DEFAULT 13 ._jcc_rel32_true 89: 0804fc6b 0 NOTYPE LOCAL DEFAULT 13 ._jcc_rel32_false 90: 0804fc72 0 NOTYPE LOCAL DEFAULT 13 rel_offset_fixup 91: 0804fc7d 0 NOTYPE LOCAL DEFAULT 13 _retn 92: 0804fca6 0 NOTYPE LOCAL DEFAULT 13 ._endtrace 93: 0804fcbe 0 NOTYPE LOCAL DEFAULT 13 _loopne 94: 0804fce0 0 NOTYPE LOCAL DEFAULT 13 ._loop_insn 95: 0804fcd7 0 NOTYPE LOCAL DEFAULT 13 ._doloop 96: 0804fcc7 0 NOTYPE LOCAL DEFAULT 13 _loope 97: 0804fcd0 0 NOTYPE LOCAL DEFAULT 13 _loop 98: 0804fcec 0 NOTYPE LOCAL DEFAULT 13 ._loop_insn_true 99: 0804fce2 0 NOTYPE LOCAL DEFAULT 13 ._loop_insn_false 100: 0804fcf6 0 NOTYPE LOCAL DEFAULT 13 _jcxz 101: 0804fd0a 0 NOTYPE LOCAL DEFAULT 13 _callrel 102: 0804fd0f 0 NOTYPE LOCAL DEFAULT 13 _call 103: 0804fd38 0 NOTYPE LOCAL DEFAULT 13 _jmp_rel8 104: 0804fd41 0 NOTYPE LOCAL DEFAULT 13 _jmp_rel32 105: 0804fd49 0 NOTYPE LOCAL DEFAULT 13 _grp5 106: 0804fda4 0 NOTYPE LOCAL DEFAULT 13 ._grp5_continue 107: 08050bdc 0 NOTYPE LOCAL DEFAULT 20 our_esp 108: 0804fdc9 0 NOTYPE LOCAL DEFAULT 13 ._grp5_call 109: 0804fdd0 0 NOTYPE LOCAL DEFAULT 13 _0xf 110: 08050be4 0 NOTYPE LOCAL DEFAULT 20 local_stk 111: 00000000 0 FILE LOCAL DEFAULT ABS xde.c 112: 0804b419 0 NOTYPE GLOBAL DEFAULT 10 crypt_exec 113: 08048484 57 FUNC GLOBAL DEFAULT UND printf 114: 08050aa4 0 OBJECT GLOBAL DEFAULT ABS _DYNAMIC 115: 08048494 0 FUNC GLOBAL DEFAULT UND memcpy 116: 0804b684 4662 FUNC GLOBAL DEFAULT 10 xde_disasm 117: 08050b98 4 OBJECT GLOBAL DEFAULT 20 __stderrp 118: 0804fc04 0 OBJECT GLOBAL HIDDEN 13 __dso_handle 119: 0804b504 384 FUNC GLOBAL DEFAULT 10 reg2xset 120: 08048468 0 FUNC GLOBAL DEFAULT 8 _init 121: 0804c8bc 364 FUNC GLOBAL DEFAULT 10 xde_asm 122: 08051c74 4 OBJECT GLOBAL DEFAULT 20 environ 123: 080484a4 52 FUNC GLOBAL DEFAULT UND fprintf 124: 00000000 0 NOTYPE WEAK DEFAULT UND __deregister_frame.. 125: 0804fc00 4 OBJECT GLOBAL DEFAULT 13 __progname 126: 08048520 141 FUNC GLOBAL DEFAULT 10 _start 127: 0804b258 431 FUNC GLOBAL DEFAULT 10 cast5_setkey 128: 080484b4 172 FUNC GLOBAL DEFAULT UND sscanf 129: 08050b98 0 NOTYPE GLOBAL DEFAULT ABS __bss_start 130: 080484c4 0 FUNC GLOBAL DEFAULT UND memset 131: 080487c0 318 FUNC GLOBAL DEFAULT 10 main 132: 0804ca64 0 FUNC GLOBAL DEFAULT 11 _fini 133: 080484d4 337 FUNC GLOBAL DEFAULT UND atexit 134: 080484e4 121 FUNC GLOBAL DEFAULT UND scanf 135: 08050200 2208 OBJECT GLOBAL DEFAULT 13 xde_table 136: 08050b98 0 NOTYPE GLOBAL DEFAULT ABS _edata 137: 08050b68 0 OBJECT GLOBAL DEFAULT ABS _GLOBAL_OFFSET_TABLE_ 138: 08051c78 0 NOTYPE GLOBAL DEFAULT ABS _end 139: 08049660 3421 FUNC GLOBAL DEFAULT 10 cast5_decrypt 140: 080484f4 101 FUNC GLOBAL DEFAULT UND exit 141: 08048900 3421 FUNC GLOBAL DEFAULT 10 cast5_encrypt 142: 08048504 0 FUNC GLOBAL DEFAULT UND strlen 143: 00000000 0 NOTYPE WEAK DEFAULT UND _Jv_RegisterClasses 144: 00000000 0 NOTYPE WEAK DEFAULT UND __register_frame_info We see that function f1 has address 0x8048660 and size 75 = 0x4B. Function f2 has address 0x80486B0 and size 58 = 3A. Simple calculation shows that they are in fact consecutive in memory so we don't have to encrypt them separately but in a single block ranging from 0x8048660 to 0x80486F0. $ readelf -l test2 Elf file type is EXEC (Executable file) Entry point 0x8048520 There are 6 program headers, starting at offset 52 Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align PHDR 0x000034 0x08048034 0x08048034 0x000c0 0x000c0 R E 0x4 INTERP 0x0000f4 0x080480f4 0x080480f4 0x00019 0x00019 R 0x1 [Requesting program interpreter: /usr/libexec/ld-elf.so.1] LOAD 0x000000 0x08048000 0x08048000 0x06bed 0x06bed R E 0x1000 LOAD 0x006c00 0x0804fc00 0x0804fc00 0x00f98 0x02078 RW 0x1000 DYNAMIC 0x007aa4 0x08050aa4 0x08050aa4 0x000b0 0x000b0 RW 0x4 NOTE 0x000110 0x08048110 0x08048110 0x00018 0x00018 R 0x4 Section to Segment mapping: Segment Sections... 00 01 .interp 02 .interp .note.ABI-tag .hash .dynsym .dynstr .rel.dyn .rel.plt .init .plt .text .fini .rodata 03 .data .eh_frame .dynamic .ctors .dtors .jcr .got .bss 04 .dynamic 05 .note.ABI-tag >From this we see that both addresses (0x8048660 and 0x80486F0) fall into the first LOAD segment which is loaded at VirtAddr 0x804800 and is placed at offset 0 in the file. Therefore, to map virtual address to file offset we simply subtract 0x8048000 from each address giving 0x660 = 1632 and 0x6F0 = 1776. If you obtain ELFsh [7] then you can make your life much easier. The following transcript shows how ELFsh can be used to obtain the same information: $ elfsh Welcome to The ELF shell 0.51b3 .::. .::. This software is under the General Public License .::. Please visit http://www.gnu.org to know about Free Software [ELFsh-0.51b3]$ load test2 [*] New object test2 loaded on Mon Jun 13 20:45:33 2005 [ELFsh-0.51b3]$ sym f1 [SYMBOL TABLE] [Object test2] [059] 0x8048680 FUNCTION f1 size:0000000075 foffset:001632 scope:Local sctndx:10 => .text + 304 [ELFsh-0.51b3]$ sym f2 [SYMBOL TABLE] [Object test2] [060] 0x80486d0 FUNCTION f2 size:0000000058 foffset:001776 scope:Local sctndx:10 => .text + 384 [ELFsh-0.51b3]$ exit [*] Unloading object 1 (test2) * Good bye ! .::. The ELF shell 0.51b3 The field foffset gives the symbol offset within the executable, while size is its size. Here all the numbers are decimal. Now we are ready to encrypt a part of the executable with a very 'imaginative' password and then test the program: $ echo -n "password" | openssl md5 5f4dcc3b5aa765d61d8327deb882cf99 $ ./cryptfile -e test2 5f4dcc3b5aa765d61d8327deb882cf99 1632 1776 $ chmod +x test2.crypt $ ./test2.crypt At the prompt enter the same hex string and then enter numbers 12 and 34 for a and b. The result must be 1662, and esp before and after must be the same. Once you are sure that the program works correctly, you can strip(1) symbols from it. ----[ 3.4 - XDE bug During the development, a I have found a bug in the XDE disassembler engine: it didn't correctly handle the LOCK (0xF0) prefix. Because of the bug XDE claimed that 0xF0 is a single-byte instruction. This is the needed patch to correct the disassembler: --- xde.c Sun Apr 11 02:52:30 2004 +++ xde_new.c Mon Aug 23 08:49:00 2004 @@ -101,6 +101,8 @@ if (c == 0xF0) { if (diza->p_lock != 0) flag |= C_BAD; /* twice */ + diza->p_lock = c; + continue; } break; I also needed to remove __cdecl on functions, a 'feature' of Win32 C compilers not needed on UNIX platforms. ----[ 3.5 - Limitations o XDE engine (probably) can't handle new instructions (SSE, MMX, etc.). For certain it can't handle 3dNow! because they begin with 0x0F 0x0F, a byte sequence for which the XDE claims is an invalid instruction encoding. o The tracer shares the same memory with the traced program. If the traced program is so badly broken that it writes to (random) memory it doesn't own, it can stumble upon and overwrite portions of the tracing routine. o Each form of tracing has its own speed impacts. I didn't measure how much this method slows down program execution (especially compared to ptrace()). o Doesn't handle even all 386 instructions (most notably far calls/jumps and RET imm16). In this case the tracer stops with HLT which should cause GPF under any OS that runs user processes in rings other than 0. o The block size of 8 bytes is hardcoded in many places in the program. The source (both C and ASM) should be parametrized by some kind of BLOCKSIZE #define. o The tracing routine is not reentrant! Meaning, any code being executed by crypt_exec can't call again crypt_exec because it will overwrite its own context! o The code itself isn't optimal: - identity_decrypt could use 4-byte moves. - More registers could be used to minimize memory references. ----[ 3.6 - Porting considerations This is as heavy as it gets - there isn't a single piece of machine- independent code in the main routine that could be used on an another processor architecture. I believe that porting shouldn't be too difficult, mostly rewriting the mechanics of the current program. Some points to watch out for include: o Be sure to handle all control flow instructions. o Move instructions could affect processor flags. o Write a disassembly routine. Most RISC architectures have regular instruction set and should be far easier to disassemble than x86 code. o This is self-modifying code: flushing the instruction prefetch queue might be needed. o Handle delayed jumps and loads if the architecture provides them. This could be tricky. o You might need to get around page protections before calling the decryptor (non-executable data segments). Due to unavailability of non-x86 hardware I wasn't able to implement the decryptor on another processor. --[ 4 - Further ideas o Better encryption scheme. ECB mode is bad, especially with small block size of 8 bytes. Possible alternative is the following: 1. Round the traced_eip down to a multiple of 8 bytes. 2. Encrypt the result with the key. 3. Xor the result with the instruction bytes. That way the encryption depends on the location in memory. Decryption works the same way. However, it would complicate cryptfile.c program. o Encrypted data. Devise a transparent (for the C programmer) way to access the encrypted data. At least two approaches come to mind: 1) playing with page mappings and handling read/write faults, or 2) use XDE to decode all accesses to memory and perform encryption or decryption, depending on the type of access (read or write). The first approach seems too slow (many context switches per data read) to be practical. o New instruction sets and architectures. Expand XDE to handle new x86 instructions. Port the routine to architectures other than i386 (first comes to mind AMD64, then ARM, SPARC...). o Perform decryption on the smart card. This is slow, but there is no danger of key compromise. o Polymorphic decryption engine. ----[ 5 - Related Work This section gives a brief overview of existing work, either because of similarity in coding techniques (ELFsh and tracing without ptrace) or because of the code protection aspect. 5.1 ELFsh --------- The ELFsh crew's article on elfsh and e2dbg [7], also in this Phrack issue. A common point in our work is the approach to program tracing without using ptrace(2). Their latest work is a scriptable embedded ELF debugger, e2dbg. They are also getting around PaX protections, an issue I didn't even take into account. 5.2 Shiva --------- The Shiva binary encryptor [8], released in binary-only form. It tries really hard to prevent reverse engineering by including features such as trap flag detection, ptrace() defense, demand-mapped blocks (so that fully decrpyted image can't be dumped via /proc), using int3 to emulate some instructions, and by encryption in layers. The 2nd, password protected layer, is optional and encrypted using 128-bit AES. Layer 3 encryption uses TEA, the tiny encryption algorithm. According to the analysis in [9], "for sufficiently large programs, no more than 1/3 of the program will be decrypted at any given time". This is MUCH larger amount of decrypted program text than in my case: 24 bytes, independent of any external factors. Also, Shiva is heavily tied to the ELF format, while my method is not tied to any operating system or executable format (although the current code IS limited to the 32-bit x86 architecture). 5.3 Burneye ----------- There are actually two tools released by team-teso: burneye and burneye2 (objobf) [10]. Burneye is a powerful binary encryption tool. Similarly to Shiva, it has three layers: 1) obfuscation, 2) password-based encryption using RC4 and SHA1 (for generating the key from passphrase), and 3) the fingerprinting layer. The fingerprinting layer is the most interesting one: the data about the target system is collected (e.g. amount of memory, etc..) and made into a 'fingeprint'. The executable is encrypted taking the fingerprint into account so that the resulting binary can be run only on the host with the given fingerprint. There are two fingerprinting options: o Fingeprint tolerance can be specified so that Small deviations are allowed. That way, for example, the memory can be upgraded on the target system and the executable will still work. If the number of differences in the fingeprint is too large, the program won't work. o Seal: the program produced with this option will run on any system. However, the first time it is run, it creats a fingerprint of the host and 'seals' itself to that host. The original seal binary is securely deleted afterwards. The encrypted binary can also be made to delete itself when a certain environment variable is set during the program execution. objobf is just relocatable object obfuscator. There is no encryption layer. The input is an ordinary relocatable object and the output is transformed, obfuscated, and functionally equivalent code. Code transformations include: inserting junk instructions, randomizing the order of basic blocks, and splitting basic blocks at random points. 5.4 Conclusion -------------- Highlights of the distinguishing features of the code encryption technique presented here: o Very small amount of plaintext code in memory at any time - only 24 bytes. Other tools leave much more plain-text code in memory. o No special loaders or executable format manipulations are needed. There is one simple utility that encrypts the existing code in-place. It is executable format-independent since its arguments are function offsets within the executable (which map to function addresses in runtime). o The code is tied to the 32-bit x86 architecture, however it should be portable without changes to any operating system running on x86-32. Special arrangements for setting up page protections may be necessary if PaX or NX is in effect. On the downside, the current version of the engine is very vulnerable with respect to reverse-engineering. It can be easily recognized by scanning for fixed sequences of instructions (the decryption routine). Once the decryptor is located, it is easy to monitor a few fixed memory addresses to obtain both the EIP and the original instruction residing at that EIP. The key material data is easy to obtain, but this is the case in ANY approach using in-memory keys. However, the decryptor in its current form has one advantage: since it is ordinary code that does no special tricks, it should be easy to combine it with a tool that is more resilient to reverse-engineering, like Shiva or Burneye. ----[ 6 - References 1. Phrack magazine. http://www.phrack.org 2. ptrace tutorials: http://linuxgazette.net/issue81/sandeep.html http://linuxgazette.net/issue83/sandeep.html http://linuxgazette.net/issue85/sandeep.html 3. D. E. Knuth: The Art of Computer Programming, vol.1: Fundamental Algorithms. 4. Fenris. http://lcamtuf.coredump.cx/fenris/whatis.shtml 5. XDE. http://z0mbie.host.sk 6. Source code for described programs. The source I have written is released under MIT license. Other files have different licenses. The archive also contains a patched version of XDE. http://www.core-dump.com.hr/software/cryptexec.tar.gz 7. ELFsh, the ELF shell. A powerful program for manipulating ELF files. http://elfsh.devhell.org 8. Shiva binary encryptor. http://www.securereality.com.au 9. Reverse Engineering Shiva. http://blackhat.com/presentations/bh-federal-03/bh-federal-03-eagle/ bh-fed-03-eagle.pdf 10. Burneye and Burneye2 (objobf). http://packetstormsecurity.org/groups/teso/indexsize.html ----[ 7 - Credits Thanks go to mayhem who has reviewed this article. His suggestions were very helpful, making the text much more mature than the original. --[ A - Appendix: Source code Here I'm providing only my own source code. The complete source package can be obtained from [6]. It includes: o All source listed here, o the patched XDE disassembler, and o the source of the CAST5 cryptographic algorithm. ----[ A.1 - The tracer source: crypt_exec.S /* Copyright (c) 2004 Zeljko Vrba Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ .text /************************************************************************ * void *crypt_exec( * decrypt_fn_ptr dfn, const void *key, * const void *lo_addr, const void *hi_addr, * const void *addr, ...) * typedef (*decrypt_fn_ptr)( * void *key, unsigned char *dst, const unsigned char *src); * * - dfn is pointer to deccryption function * - key is pointer to crypt routine key data * - addr is the addres where execution should begin. due to the way the * code is decrypted and executed, it MUST be aligned to 8 (BLOCKSIZE) * bytes!! * - the rest are arguments to called function * * The crypt_exec stops when the stack pointer becomes equal to what it * was on entry, and executing 'ret' would cause the called function to * exit. This works assuming normal C compiled code. * * Returns the value the function would normally return. * * This code calls: * int xde_disasm(unsigned char *ip, struct xde_instr *outbuf); * XDE disassembler engine is compiled and used with PACKED structure! * * It is assumed that the encryption algorithm uses 64-bit block size. * Very good protection could be done if decryption is executed on the * SMART CARD. * * Some terminology: * 'Traced' refers to the original program being executed instruction by * instruction. The technique used resembles Knuth's tracing routine (and * indeed, we get true tracing when decryption is dropped). * * 'Our' refers to our data stack, etc. * * TODOs and limitations: * - some instructions are not emulated (FAR CALL/JMP/RET, RET NEAR imm16) * - LOOP* and JCXZ opcodes haven't been tested * - _jcc_rel32 has been tested only indirectly by _jcc_rel8 ***********************************************************************/ /* Offsets into xde_instr struct. */ #define OPCODE 23 #define OPCODE2 24 #define MODRM 25 /* Set up our stack and save traced context. The context is saved at the end of our stack. */ #define SAVE_TRACED_CONTEXT \ movl %esp, traced_esp ;\ movl $stk_end, %esp ;\ pusha ;\ pushf /* Restore traced context from the current top of stack. After that restores traced stack pointer. */ #define RESTORE_TRACED_CONTEXT \ popf ;\ popa ;\ movl traced_esp, %esp /* Identity decryption routine. This just copies 8 bytes (BLOCKSIZE) from source to destination. Has normal C calling convention. Is not global. */ identity_decrypt: movl 8(%esp), %edi /* destination address */ movl 12(%esp), %esi /* source address */ movl $8, %ecx /* 8 bytes */ cld rep movsb ret crypt_exec: .globl crypt_exec .extern disasm /* Fetch all arguments. We are called from C and not expected to save registers. This is the stack on entry: [ ret_addr dfn key lo_addr hi_addr addr ...args ] */ popl %eax /* return address */ popl r_decrypt /* real decryption function pointer */ popl key /* encryption key */ popl lo_addr /* low traced eip */ popl hi_addr /* high traced eip */ popl traced_eip /* eip to start tracing */ pushl %eax /* put return addr to stack again */ /* now the stack frame resembles as if inner function (starting at traced_eip) were called by normal C calling convention (after return address, the vararg arguments folow) */ movl %esp, end_esp /* this is used to stop tracing. */ movl $0, traced_ctr /* reset counter of insns to 0 */ decryptloop: /* This loop traces a single instruction. The CONTEXT at the start of each iteration: traced_eip: points to the next instruction in traced program First what we ever do is switch to our own stack and store the traced program's registers including eflags. Instructions are encrypted in ECB mode in blocks of 8 bytes. Therefore, we always must start decryption at the lower 8-byte boundary. The total of three blocks (24) bytes are decrypted for one instruction. This is due to alignment and maximum instruction length constraints: if the instruction begins at addres that is congruent to 7 mod 8 + 16 bytes maximum length (given some slack) gives instruction span of three blocks. Yeah, I know ECB sucks, but this is currently just a proof-of concept. Design something better for yourself if you need it. */ SAVE_TRACED_CONTEXT decryptloop_nocontext: /* This loop entry point does not save traced context. It is used from control transfer instruction emulation where we doall work ourselves and don't use traced context. The CONTEXT upon entry is the same as for decryptloop. First decide whether to decrypt or just trace the plaintext code. */ movl traced_eip, %eax movl $identity_decrypt, %ebx /* assume no decryption */ cmpl lo_addr, %eax jb .store_decrypt_ptr /* traced_eip < lo_addr */ cmpl hi_addr, %eax ja .store_decrypt_ptr /* traced_eip > hi_addr */ movl r_decrypt, %ebx /* in bounds, do decryption */ .store_decrypt_ptr: movl %ebx, decrypt /* Decrypt three blocks starting at eax, reusing arguments on the stack for the total of 3 calls. WARNING! For this to work properly, the decryption function MUST NOT modify its arguments! */ andl $-8, %eax /* round down traced_eip to 8 bytes */ pushl %eax /* src buffer */ pushl $insn /* dst buffer */ pushl key /* key data pointer */ call *decrypt /* 1st block */ addl $8, 4(%esp) /* advance dst */ addl $8, 8(%esp) /* advance src */ call *decrypt /* 2nd block */ addl $8, 4(%esp) /* advance dst */ addl $8, 8(%esp) /* advance src */ call *decrypt /* 3rd block */ addl $12, %esp /* clear args from stack */ /* Obtain the real start of instruction in the decrypted buffer. The traced eip is taken modulo blocksize (8) and added to the start address of decrypted buffer. Then XDE is called (standard C calling convention) to get necessary information about the instruction. */ movl traced_eip, %eax andl $7, %eax /* traced_eip mod 8 */ addl $insn, %eax /* offset within decrypted buffer */ pushl $disbuf /* address to disassemble into */ pushl %eax /* insn offset to disassemble */ call xde_disasm /* disassemble and return len */ movl %eax, ilen /* store instruction length */ popl %eax /* decrypted insn start */ popl %ebx /* clear remaining arg from stack */ /* Calculate the offset in control table of the instruction handling routine. Non-control transfer instructions are just executed in traced context, other instructions are emulated. Before executing the instruction, the traced eip is advanced by instruction length, and the number of executed instructions is incremented. We also append indirect 'jmp *continue' after the instruction, to continue execution at appropriate place in our tracing. The JMP indirect opcodes are 0xFF 0x25. */ movl ilen, %ebx addl %ebx, traced_eip /* advance traced eip */ incl traced_ctr /* increment counter */ movw $0x25FF, (%eax, %ebx) /* JMP indirect; little-endian! */ movl $continue, 2(%eax, %ebx) /* store address */ movzbl OPCODE+disbuf, %esi /* load instruction byte */ jmp *control_table(,%esi,4) /* execute by appropirate handler */ .data /* Emulation routines start here. They are in data segment because code segment isn't writable and we are modifying our own code. We don't want yet to mess around with mprotect(). One day (non-exec page table support on x86-64) it will have to be done anyway.. The CONTEXT upon entry on each emulation routine: eax : start of decrypted (CURRENT) insn addr to execute ilen : instruction length in bytes stack top -> [traced: eflags edi esi ebp esp ebx edx ecx eax] traced_esp : original program's esp traced_eip : eip of next insn to execute (NOT of CURRENT insn!) */ _unhandled: /* Unhandled opcodes not normally generated by compiler. Once proper emulation routine is written, they become handled :) Executing privileged instruction, such as HLT, is the easiest way to terminate the program. %eax holds the address of the instruction we were trying to trace so it can be observed from debugger. */ hlt _nonjump: /* Common emulation for all non-control transfer instructions. Instruction buffer (insn) is already filled with decrypted blocks. Decrypted instruction can begin in the middle of insn buffer, so the relative jmp instruction is adjusted to jump to the traced insn, skipping 'junk' at the beginning of insn. When the instruction is executed, our execution continues at location where 'continue' points to. Normally, this is decryptloop, but occasionaly it is temporarily changed (e.g. in _grp5). */ subl $insn, %eax /* insn begin within insn buffer */ movb %al, .execute+1 /* update jmp instruction */ RESTORE_TRACED_CONTEXT .execute: jmp insn /* relative, only offset adjusted */ insn: .fill 32, 1, 0x90 _jcc_rel8: /* Relative 8-bit displacement conditional jump. It is handled by relative 32-bit displacement jump, once offset is adjusted. Opcode must also be adjusted: short jumps are 0x70-0x7F, long jumps are 0x0F 0x80-0x8F. (conditions correspond directly). Converting short to long jump needs adding 0x10 to 2nd opcode. */ movsbl 1(%eax), %ebx /* load sign-extended offset */ movb (%eax), %cl /* load instruction */ addb $0x10, %cl /* adjust opcode to long form */ /* drop processing to _jcc_rel32 as 32-bit displacement */ _jcc_rel32: /* Emulate 32-bit conditional relative jump. We pop the traced flags, let the Jcc instruction execute natively, and then adjust traced eip ourselves, depending whether Jcc was taken or not. CONTEXT: ebx: jump offset, sign-extended to 32 bits cl : real 2nd opcode of the instruction (1st is 0x0F escape) */ movb %cl, ._jcc_rel32_insn+1 /* store opcode to instruction */ popf /* restore traced flags */ ._jcc_rel32_insn: /* Explicit coding of 32-bit relative conditional jump. It is executed with the traced flags. Also the jump offset (32 bit) is supplied. */ .byte 0x0F, 0x80 .long ._jcc_rel32_true - ._jcc_rel32_false ._jcc_rel32_false: /* The Jcc condition was false. Just save traced flags and continue to next instruction. */ pushf jmp decryptloop_nocontext ._jcc_rel32_true: /* The Jcc condition was true. Traced flags are saved, and then the execution falls through to the common eip offset-adjusting routine. */ pushf rel_offset_fixup: /* Common entry point to fix up traced eip for relative control-flow instructions. CONTEXT: traced_eip: already advanced to the would-be next instruction. this is done in decrypt_loop before transferring control to any insn-handler. ebx : sign-extended 32-bit offset to add to eip */ addl %ebx, traced_eip jmp decryptloop_nocontext _retn: /* Near return (without imm16). This is the place where the end-of trace condition is checked. If, at this point, esp equals end_esp, this means that the crypt_exec would return to its caller. */ movl traced_esp, %ebp /* compare curr traced esp to esp */ cmpl %ebp, end_esp /* when crypt_exec caller's return */ je ._endtrace /* address was on top of the stack */ /* Not equal, emulate ret. */ movl %esp, %ebp /* save our current stack */ movl traced_esp, %esp /* get traced stack */ popl traced_eip /* pop return address */ movl %esp, traced_esp /* write back traced stack */ movl %ebp, %esp /* restore our current stack */ jmp decryptloop_nocontext ._endtrace: /* Here the traced context is completely restored and RET is executed natively. Our tracing routine is no longer in control after RET. Regarding C calling convention, the caller of crypt_exec will get the return value of traced function. One detail we must watch for: the stack now looks like this: stack top -> [ ret_addr ...args ] but we have been called like this: stack top -> [ ret_addr dfn key lo_addr hi_addr addr ...args ] and this is what compiler expects when popping arg list. So we must fix the stack. The stack pointer can be just adjusted by -20 instead of reconstructing the previous state because C functions are free to modify their arguments. CONTEXT: ebp: current traced esp */ movl (%ebp), %ebx /* return address */ subl $20, %ebp /* fake 5 extra args */ movl %ebx, (%ebp) /* put ret addr on top of stack */ movl %ebp, traced_esp /* store adjusted stack */ RESTORE_TRACED_CONTEXT ret /* return without regaining control */ /* LOOPNE, LOOPE and LOOP instructions are executed from the common handler (_doloop). Only the instruction opcode is written from separate handlers. 28 is the offset of traced ecx register that is saved on our stack. */ _loopne: movb $0xE0, ._loop_insn /* loopne opcode */ jmp ._doloop _loope: movb $0xE1, ._loop_insn /* loope opcode */ jmp ._doloop _loop: movb $0xE2, ._loop_insn /* loop opcode */ ._doloop: /* * Get traced context that is relevant for LOOP* execution: signed * offset, traced ecx and traced flags. */ movsbl 1(%eax), %ebx movl 28(%esp), %ecx popf ._loop_insn: /* Explicit coding of loop instruction and offset. */ .byte 0xE0 /* LOOP* opcodes: E0, E1, E2 */ .byte ._loop_insn_true - ._loop_insn_false ._loop_insn_false: /* LOOP* condition false. Save only modified context (flags and ecx) and continue tracing. */ pushf movl %ecx, 28(%esp) jmp decryptloop_nocontext ._loop_insn_true: /* LOOP* condition true. Save only modified context, and jump to the rel_offset_fixup to fix up traced eip. */ pushf movl %ecx, 28(%esp) jmp rel_offset_fixup _jcxz: /* JCXZ. This is easier to simulate than to natively execute. */ movsbl 1(%eax), %ebx /* get signed offset */ cmpl $0, 28(%esp) /* test traced ecx for 0 */ jz rel_offset_fixup /* if so, fix up traced EIP */ jmp decryptloop_nocontext _callrel: /* Relative CALL. */ movb $1, %cl /* 1 to indicates relative call */ movl 1(%eax), %ebx /* get offset */ _call: /* CALL emulation. CONTEXT: cl : relative/absolute indicator. ebx: absolute address (cl==0) or relative offset (cl!=0). */ movl %esp, %ebp /* save our stack */ movl traced_esp, %esp /* push traced eip onto */ pushl traced_eip /* traced stack */ movl %esp, traced_esp /* write back traced stack */ movl %ebp, %esp /* restore our stack */ testb %cl, %cl /* if not zero, then it is a */ jnz rel_offset_fixup /* relative call */ movl %ebx, traced_eip /* store dst eip */ jmp decryptloop_nocontext /* continue execution */ _jmp_rel8: /* Relative 8-bit displacement JMP. */ movsbl 1(%eax), %ebx /* get signed offset */ jmp rel_offset_fixup _jmp_rel32: /* Relative 32-bit displacement JMP. */ movl 1(%eax), %ebx /* get offset */ jmp rel_offset_fixup _grp5: /* This is the case for 0xFF opcode which escapes to GRP5: the real instruction opcode is hidden in bits 5, 4, and 3 of the modR/M byte. */ movb MODRM+disbuf, %bl /* get modRM byte */ shr $3, %bl /* shift bits 3-5 to 0-2 */ andb $7, %bl /* and test only bits 0-2 */ cmpb $2, %bl /* < 2, not control transfer */ jb _nonjump cmpb $5, %bl /* > 5, not control transfer */ ja _nonjump cmpb $3, %bl /* CALL FAR */ je _unhandled cmpb $5, %bl /* JMP FAR */ je _unhandled movb %bl, %dl /* for future reference */ /* modR/M equals 2 or 4 (near CALL or JMP). In this case the reg field of modR/M (bits 3-5) is the part of instruction opcode. Replace instruction byte 0xFF with 0x8B (MOV r/m32 to reg32 opcode). Replace reg field with 3 (ebx register index). */ movb $0x8B, (%eax) /* replace with MOV_to_reg32 opcode */ movb 1(%eax), %bl /* get modR/M byte */ andb $0xC7, %bl /* mask bits 3-5 */ orb $0x18, %bl /* set them to 011=3: ebx reg index */ movb %bl, 1(%eax) /* set MOV target to ebx */ /* We temporarily update continue location to continue execution in this code instead of jumping to decryptloop. We execute MOV in TRACED context because it must use traced registers for address calculation. Before that we save OUR esp so that original TRACED context isn't lost (MOV updates ebx, traced CALL wouldn't mess with any registers). First we save OUR context, but after that we must restore TRACED ctx. In order to do that, we must adjust esp to point to traced context before restoration. */ movl $._grp5_continue, continue movl %esp, %ebp /* save traced context pointer into ebp */ pusha /* store our context; eflags irrelevant */ movl %esp, our_esp /* our context pointer */ movl %ebp, %esp /* adjust traced context pointer */ jmp _nonjump ._grp5_continue: /* This is where execution continues after MOV calculates effective address for us. CONTEXT upon entry: ebx: target address where traced execution should continue dl : opcode part (bits 3-5) of modR/M, shifted to bits 0-2 */ movl $decryptloop, continue /* restore continue location */ movl our_esp, %esp /* restore our esp */ movl %ebx, 16(%esp) /* so that ebx is restored anew */ popa /* our context along with new ebx */ cmpb $2, %dl /* CALL near indirect */ je ._grp5_call movl %ebx, traced_eip /* JMP near indirect */ jmp decryptloop_nocontext ._grp5_call: xorb %cl, %cl /* mark: addr in ebx is absolute */ jmp _call _0xf: /* 0x0F opcode esacpe for two-byte opcodes. Only 0F 0x80-0x8F range are Jcc rel32 instructions. Others are normal instructions. */ movb OPCODE2+disbuf, %cl /* extended opcode */ cmpb $0x80, %cl jb _nonjump /* < 0x80, not Jcc */ cmpb $0x8F, %cl ja _nonjump /* > 0x8F, not Jcc */ movl 2(%eax), %ebx /* load 32-bit offset */ jmp _jcc_rel32 control_table: /* This is the jump table for instruction execution dispatch. When the real opcode of the instruction is found, the tracer jumps indirectly to execution routine based on this table. */ .rept 0x0F /* 0x00 - 0x0E */ .long _nonjump /* normal opcodes */ .endr .long _0xf /* 0x0F two-byte escape */ .rept 0x60 /* 0x10 - 0x6F */ .long _nonjump /* normal opcodes */ .endr .rept 0x10 /* 0x70 - 0x7F */ .long _jcc_rel8 /* relative 8-bit displacement */ .endr .rept 0x10 /* 0x80 - 0x8F */ .long _nonjump /* long displ jump handled from */ .endr /* _0xf opcode escape */ .rept 0x0A /* 0x90 - 0x99 */ .long _nonjump .endr .long _unhandled /* 0x9A: far call to full pointer */ .rept 0x05 /* 0x9B - 0x9F */ .long _nonjump .endr .rept 0x20 /* 0xA0 - 0xBF */ .long _nonjump .endr .long _nonjump, _nonjump /* 0xC0, 0xC1 */ .long _unhandled /* 0xC2: retn imm16 */ .long _retn /* 0xC3: retn */ .rept 0x06 /* 0xC4 - 0xC9 */ .long _nonjump .endr .long _unhandled, _unhandled /* 0xCA, 0xCB : far ret */ .rept 0x04 .long _nonjump .endr .rept 0x10 /* 0xD0 - 0xDF */ .long _nonjump .endr .long _loopne, _loope /* 0xE0, 0xE1 */ .long _loop, _jcxz /* 0xE2, 0xE3 */ .rept 0x04 /* 0xE4 - 0xE7 */ .long _nonjump .endr .long _callrel /* 0xE8 */ .long _jmp_rel32 /* 0xE9 */ .long _unhandled /* far jump to full pointer */ .long _jmp_rel8 /* 0xEB */ .rept 0x04 /* 0xEC - 0xEF */ .long _nonjump .endr .rept 0x0F /* 0xF0 - 0xFE */ .long _nonjump .endr .long _grp5 /* 0xFF: group 5 instructions */ .data continue: .long decryptloop /* where to continue after 1 insn */ .bss .align 4 traced_esp: .long 0 /* traced esp */ traced_eip: .long 0 /* traced eip */ traced_ctr: .long 0 /* incremented by 1 for each insn */ lo_addr: .long 0 /* low encrypted eip */ hi_addr: .long 0 /* high encrypted eip */ our_esp: .long 0 /* our esp... */ end_esp: .long 0 /* esp when we should stop tracing */ local_stk: .fill 1024, 4, 0 /* local stack space (to call C) */ stk_end = . /* we need this.. */ ilen: .long 0 /* instruction length */ key: .long 0 /* pointer to key data */ decrypt: .long 0 /* USED decryption function */ r_decrypt: .long 0 /* REAL decryption function */ disbuf: .fill 128, 1, 0 /* xde disassembly buffer */ ----[ A.2 - The file encryption utility source: cryptfile.c /* Copyright (c) 2004 Zeljko Vrba Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ /* * This program encrypts a portion of the file, writing new file with * .crypt appended. The permissions (execute, et al) are NOT preserved! * The blocksize of 8 bytes is hardcoded. */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include "cast5.h" #define BLOCKSIZE 8 #define KEYSIZE 16 typedef void (*cryptblock_f)(void*, u8*, const u8*); static unsigned char *decode_hex_key(char *hex) { static unsigned char key[KEYSIZE]; int i; if(strlen(hex) != KEYSIZE << 1) { fprintf(stderr, "KEY must have EXACTLY %d hex digits.\n", KEYSIZE << 1); exit(1); } for(i = 0; i < KEYSIZE; i++, hex += 2) { unsigned int x; char old = hex[2]; hex[2] = 0; if(sscanf(hex, "%02x", &x) != 1) { fprintf(stderr, "non-hex digit in KEY.\n"); exit(1); } hex[2] = old; key[i] = x; } return key; } static void *docrypt( FILE *in, FILE *out, long startoff, long endoff, cryptblock_f crypt, void *ctx) { char buf[BLOCKSIZE], enc[BLOCKSIZE]; long curroff = 0; size_t nread = 0; while((nread = fread(buf, 1, BLOCKSIZE, in)) > 0) { long diff = startoff - curroff; if((diff < BLOCKSIZE) && (diff > 0)) { /* this handles the following mis-alignment (each . is 1 byte) ...[..|......].... ^ ^ ^ curoff+BLOCKSIZE | startoff curroff */ if(fwrite(buf, 1, diff, out) < diff) { perror("fwrite"); exit(1); } memmove(buf, buf + diff, BLOCKSIZE - diff); fread(buf + BLOCKSIZE - diff, 1, diff, in); curroff = startoff; } if((curroff >= startoff) && (curroff < endoff)) { crypt(ctx, enc, buf); } else { memcpy(enc, buf, BLOCKSIZE); } if(fwrite(enc, 1, nread, out) < nread) { perror("fwrite"); exit(1); } curroff += nread; } } int main(int argc, char **argv) { FILE *in, *out; long startoff, endoff; char outfname[256]; unsigned char *key; struct cast5_ctx ctx; cryptblock_f mode; if(argc != 6) { fprintf(stderr, "USAGE: %s <-e|-d> FILE KEY STARTOFF ENDOFF\n", argv[0]); fprintf(stderr, "KEY MUST be 32 hex digits (128 bits).\n"); return 1; } if(!strcmp(argv[1], "-e")) { mode = cast5_encrypt; } else if(!strcmp(argv[1], "-d")) { mode = cast5_decrypt; } else { fprintf(stderr, "invalid mode (must be either -e od -d)\n"); return 1; } startoff = atol(argv[4]); endoff = atol(argv[5]); key = decode_hex_key(argv[3]); if(cast5_setkey(&ctx, key, KEYSIZE) < 0) { fprintf(stderr, "error setting key (maybe invalid length)\n"); return 1; } if((endoff - startoff) & (BLOCKSIZE-1)) { fprintf(stderr, "STARTOFF and ENDOFF must span an exact multiple" " of %d bytes\n", BLOCKSIZE); return 1; } if((endoff - startoff) < BLOCKSIZE) { fprintf(stderr, "STARTOFF and ENDOFF must span at least" " %d bytes\n", BLOCKSIZE); return 1; } sprintf(outfname, "%s.crypt", argv[2]); if(!(in = fopen(argv[2], "r"))) { fprintf(stderr, "fopen(%s): %s\n", argv[2], strerror(errno)); return 1; } if(!(out = fopen(outfname, "w"))) { fprintf(stderr, "fopen(%s): %s\n", outfname, strerror(errno)); return 1; } docrypt(in, out, startoff, endoff, mode, &ctx); fclose(in); fclose(out); return 0; } ----[ A.3 - The test program: test2.c /* Copyright (c) 2004 Zeljko Vrba Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include "cast5.h" #define BLOCKSIZE 8 #define KEYSIZE 16 /* * f1 and f2 are encrypted with the following 128-bit key: * 5f4dcc3b5aa765d61d8327deb882cf99 (MD5 of the string 'password') */ static int f1(int a) { int i, s = 0; for(i = 0; i < a; i++) { s += i*i; } printf("called plaintext code: f1 = %d\n", a); return s; } static int f2(int a, int b) { int i; a = f1(a); for(i = 0; i < b; i++) { a += b; } return a; } static unsigned char *decode_hex_key(char *hex) { static unsigned char key[KEYSIZE]; int i; if(strlen(hex) != KEYSIZE << 1) { fprintf(stderr, "KEY must have EXACTLY %d hex digits.\n", KEYSIZE << 1); exit(1); } for(i = 0; i < KEYSIZE; i++, hex += 2) { unsigned int x; char old = hex[2]; hex[2] = 0; if(sscanf(hex, "%02x", &x) != 1) { fprintf(stderr, "non-hex digit in KEY.\n"); exit(1); } hex[2] = old; key[i] = x; } return key; } int main(int argc, char **argv) { int a, b, result; char op[16], hex[256]; void *esp; struct cast5_ctx ctx; printf("enter decryption key: "); scanf("%255s", hex); if(cast5_setkey(&ctx, decode_hex_key(hex), KEYSIZE) < 0) { fprintf(stderr, "error setting key.\n"); return 1; } printf("a b = "); scanf("%d %d", &a, &b); asm("movl %%esp, %0" : "=m" (esp)); printf("esp=%p\n", esp); result = crypt_exec(cast5_decrypt, &ctx, f1, decode_hex_key, f2, a, b); asm("movl %%esp, %0" : "=m" (esp)); printf("esp=%p\n", esp); printf("result = %d\n", result); return 0; }