Heap overflow with stack-pivoting, format string mem leaking and first-stage ROP-ing to shellcode after making it executable on the heap - on a statically linked binary (MBE LAB7A)

This is was one of the most painstaking ones (which is reflected in the length of this write up). While finding the vulnerability was trivial, building a working exploit was quite of a challenge here.

The target app

The target app https://github.com/RPISEC/MBE/blob/master/src/lab07/lab7A.c is, on the face of it, quite similar to the previous one LAB7C.c (https://hackingiscool.pl/exploiting-the-same-user-after-free-twice-to-leak-the-mem-layout-and-execute-code-mbe-lab7c-walkthrough/). But instead of Use after Free, it's vulnerable to a multi-stage heap-based overflow.

There is a structure with buffers, length field and a pointer to overwrite:

And there are standard CRUD (Create/Read/Update/Delete) options available from the user interface, which - when chosen - call relevant functions:

The vulnerability

The initial (and very short) heap overflow resides in the create_message() function and was quite trivial to find:

It gets interesting when the program asks the user to provide the length for the new message by calling get_unum() (which is defined in https://github.com/RPISEC/MBE/blob/master/include/utils.h, for the record):

So, the vulnerability resides on lines 109-110:

While MAX_BLOCKS is 32, BLOCK_SIZE is 4.

msg->message[] buffer size is 128 (MAX_BLOCKS*sizeof(int)).

Results of arithmetic division operations on integers give integer results. So, 128/4 == 32, but also 129/4 == 32, 130/4 == 32 and 131/4 == 32.

129,130 and 131 are possible length values we can sneak in without hitting the (new_msg->msg_len / BLOCK_SIZE) > MAX_BLOCKS condition and having our input length overwritten with the safe value of MAX_BLOCKS*BLOCK_SIZE.

And then, right away after smuggling the slightly bigger new_msg->msg_len, in the same create_message() call, we have this:

So, read() (to which there are no bad characters, by the way! :D) writes new_msg->msglen bytes from standard input to the new_msg->message buffer. So we can overflow the new_msg->message buffer by up to three bytes. And what will we overwrite this way? Yup, the msg_len field! This way we can achieve having created a message with nearly any size in msg_len, as we can control 3 out of its 4 bytes.

This can be taken advantage of in the edit_message() function:

Here is the second heap overflow, directly resulting from the first one, making it possible to overwrite the heap contents far beyond currently edited message body and its length field (so we will overwrite the print_msg pointer of the next message we create on the heap, getting a foothold into execution control).

Also, note the numbuf[32] buffer used to store user input before converting it to an integer used for message index (with 10 being maximum expected number of messages, which are indexed from 0, hence one digit is actually enough to store the index). We are going to use it later.

More insight on the target app

First of all, LAB7A.c comes with the following readme:

This is the way it is being run (output from ps aux from the gameadmin/root user):

lab7end 12237 0.0 0.2 5076 2116 ? S 18:39 0:00 socat TCP-LISTEN:7741,reuseaddr,fork,su=lab7end EXEC:timeout 60 /levels/lab07/lab7A

So, it is running as the lab7end user (so we would have to become root to debug it) and has ugly timeout of 60 seconds, making debugging additional hassle.

So, we should work locally, on /levels/lab07/lab7A, or on its copy in /tmp.

This itself will not let us get rid of the timeout... because it is also implemented with the following macro call on line 10:

The macro boils down to calling alarm(60) and setting the alarm handler to a function doing exit().

I initially tried to work on a version I manually recompiled (with the only difference being the timeout parameter changed from 60 to 0 to effectively disable it), using the same flags... and I thought it was OK... but then later when searching for ROP gadgets I noticed slight differences in offsets, so I decided to work on the original copy instead and just add an automated call alarm(0) to my gdb script (-x commands.txt).

Second, we will have to write a custom shellcode here this time, as the binary is compiled statically (libc will no longer be dynamically linked, instead only the required functions are statically linked, which means they are built in the executable):

/* compiled with: gcc -static -z relro -z now -fstack-protector-all -o lab7A lab7A.c */

Simply returning to libc's system() won't be an option as we won't have the entire libc dynamically linked. Instead, only the libc functions actually used by the program would be linked in, by putting their code into the same code segment as program's own code. This will definitely make exploitation more difficult.

Another specific property is the lack of -fPIE -pie flags, the result of which was the code segment not being ASLR-ed, so all the functions and ROP gadgets are at fixed addresses (which will, in turn, make the exploitation easier). Still, other segments get ASLR-ed (except for the heap, which turned out kind of tricky - although its range in the target app was the same every time I checked, the first address returned by malloc() varied between instances, making the need for leaking).

How data is aligned in memory - getting our first crash at an EIP we control

Ok, let's run this.

A look at the vmmap output:

An interesting thing to notice, the heap is already mapped as well, even though we have not issued a single malloc() yet (as opposed to https://hackingiscool.pl/exploiting-the-same-user-after-free-twice-to-leak-the-mem-layout-and-execute-code-mbe-lab7c-walkthrough/), this is most likely due to the lack of -fPIE -pie flags.

Creating and viewing the first message:

OK, let's see it in memory:

Oh, this was unexpected. Where is it then? Let's find it by searching for any part (first four bytes) of the XOR pad/encrypted output we just displayed above. Search for the literal did not work due to endianness, search for the bytes in reversed order did the trick:

OK, let's see it (we need to aim a bit wider, as 0x80f19d4 is just the beginning of the XOR pad, while we want to see the entire structure and the preceding malloc metadata):

OK, now we create another message and have a look again:

Now, this is the layout with the messages sitting on the heap:

This should give us clear picture of how to start make the program call an arbitrary address. After we create message #0 with an arbitrary length value bigger than 140 (128 for the message body + 4 for the overwritten once again length value + 8 bytes for the malloc meta fields = 140), we will start overwriting message #1's print_msg() pointer, then the message #1's XOR pad, then the message #1's body itself.

Afterwards we ask the program to print message #1, making it call our overwritten #1's print_msg pointer.

First crash

So we create a new message, with arbitrary length of 131 (max we can sneak through the faulty boundary check) and we use those additional three bytes to smuggle three 'C's:

Step 1

Now, those three 'C's overwrote the three least significant bytes of the original msg.len field, turning it from 131 to 1128481536:

OK, cool. Let's proceed to a careful overwrite of a pointer. Luckily, we don't have to be very careful when it comes to the arbitrary value of the message length field we put here. CCC (0x0043434) is good enough, because we don't have to fill the full length of 1128481536, read() will stop when no more data is available from the standard input, at first we'll just write 144 bytes, with the last 4 being the new pointer for the next message's print_msg() function.

So:

  1. We create a message with declared length 131 and following content (this will be index #0):
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCC (already done).

  2. We create a second message with any length and content (irrelevant now).

  3. We edit the first message, filling it with this payload (ZZZZ will be the hijacked EIP):
    $ python -c 'print "A"*140+"Z"*4'
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAZZZZ

  4. We ask the program to print the second message (index #1) and watch it crash on ZZZZ.

Step 2
Step 3 - this time we put 144 bytes as the message body
Step 4

We control EIP, now what

Controlling only the EIP is not sufficient to make the program do exactly what we want. As long as we cannot simply inject a shellcode to an executable memory range and jump to it (and we can't - at least not yet :) - as we are dealing with DEP/NX), we need to go with ROP approach. Even if we were not doing an actual ROP chain, but a simple overwrite of the controlled pointer to system()'s address (which we can't here, as mentioned before), we still need to control the argument that function call is expecting to have on the stack when called.

Let's be entirely clear, this is what we hijack:

messages[i]->print_msg(messages[i]) call inside the print_index() call.

So, at the time of execution of our arbitrary EIP, the argument on the stack will be a pointer to messages[i] structure on the heap (with its first field being the print_msg pointer, by the way). Even if we could make EIP point at system(), we would still want the stack to contain a pointer pointing to a buffer we control - as opposed to the XOR pad, which is filled with random data.

This is what the stack looks like at the time function print_msg() (or ZZZZ) is entered:

The saved RET points to the next instruction in print_index(), messages[i] points to the beginning of the message structure on the heap.

Then I noticed that mprotect() is linked into the program:

So I instantly recalled XPN's ROP Primer writeup (https://blog.xpnsec.com/rop-primer-level-0/) and thought 'fuck yeah, I am gonna set EIP to mprotect() and make the heap executable, the buffer address is the first argument... but what about the next two?'

So the next two variables on the stack being 0x00000000 and 0x0000000a were ALMOST what I thought would suffice. 0x00000000 is where the buffer length should be (0 is not great for length), while 0x0000000a would serve as the flags (0x7 is RWX, so oxa includes 0xb, we would be fine as long as this flags value is not invalid). Obviously, this did not work (at least because of the 0 as length, but probably there's more than there is to it - we'll be back to this later).

Anyway, I started wondering whether I can control that 0x00000000 and 0x0000000a on the stack, only to figure out where they come from: they are a survival from the strtoul(numbuf, NULL, 10) call in print_index() right before our message is printed (it's the NULL and 10, last two arguments to strtoul()):

So I looked at the registers state and the stack at the time of the call/crash, looking for anything to hook on, any candidate for the next step in execution control that would allow us make the memory and registers alignment more favorable - and put my attention to EDX, as it was pointing to the buffer we control (remember, we can write past the ZZZZ any number of bytes we want!):

So I started wondering, what if I could find a ROP gadget that would somehow mv EDX to ESP, tricking the program to start using the heap as the stack? Then we could place the rest of the ROP chain on the heap, as we are having a hard time trying to control the stack right now.

By the way, ropeme is a nice tool for this (the lab7A.ggt was previously created by the same script, by calling generate on the target binary):

Well, let's just say I could not find the proper gadgets (looking for stuff like mov edx, esp; ret; or push edx; pop esp; ret). The two pop esps were not preceded with what I needed once I checked in gdb with x/5i address-offset. And yup, I tried x/5i address-offset for different offsets, as the instructions will vary depending on what offset of the opcodes we hit - which sometimes can work in our favor as we can find a gadget that does not exist under any offset normally operated by the program when it executes as intended - I accidently came across this being mentioned here https://securingtomorrow.mcafee.com/other-blogs/mcafee-labs/emerging-stack-pivoting-exploits-bypass-common-security. Plus, sometimes this phenomenon is abused as an anti-disassembly technique - which I learned from this workshop whitepaper which I recommend: https://github.com/theevilbit/workshops/blob/master/Anti Disassembly Workshop/Hacktivity 2015 - Fitzl Csaba - Hello Anti Disassembly.pdf.

But back on the subject!

As after about two hours I felt stuck, I decided to peek into Corb3nik's solution  https://github.com/Corb3nik/MBE-Solutions/tree/master/lab7a for clues.

Corb3nik's solution - tricky format-string-based leaking of the heap pointer after sneaky stack-grooming

Reading Corb3nik's exploit made me realize that leaking a heap-stored message pointer would in fact be needed for a reliable exploit to work - if we want to use the heap for our payload (as we could alternatively use the stack - but then we would need to leak the stack address, so it does not really make much difference at this point). So leaking has to be done before proceeding to attaining arbitrary code execution. And it turned out to be tricky (no comfy gadget this time, like with https://hackingiscool.pl/mbe-is-fun-lab6a-walkthrough/).

So, if we want to leak something, we can overwrite the EIP with printf()'s address, which as we know is fixed in this case. But what about the argument? When doing heap overflow, we overwrite the next message print_msg pointer and if we keep writing, we overwrite it's XOR pad. Then if we ask the program to print the message, it will call printf() with just one argument, being the pointer to the message[i] structure:

messages[i] (0x080f09d0 at the moment) looks like this:

Now here's the trick. If printf() treats the entire buffer pointed by messages[i] as a string, the XOR pad we overwrite past the print_msg pointer (which itself we overwrite with printf()'s address) can contain a FORMAT STRING expression.

This means that if we overwrite the XOR pad (pointed by messages[i]) with a string like %1$p, it will print the value of the next dword on the stack, right to the heap pointer itself - exactly where the next argument to printf() would be, if it was called with that format string properly (like printf("%1$p",some_pointer_to_be_printed_out);):

This way, we can pick an arbitrary value from the stack we want printed plain and clean in proper pointer format (thus the %p, whereas number$ index selects the number of the value from the stack). So, even the nullbytes on the stack are not an issue.

As we can see on the screenshot above, we could already leak the stack (the last value to the right at the bottom, 0xbffff728 in that case) if we wanted to use it for storing the final payload. Something I realized only while writing this up and have not tried pursuing.

In Corb3nik's solution - which I followed - heap was used to store the payload (a ROP chain execve(/bin/sh) shellcode), by using a pop esp; ret + payload_addr_on_the_heap first-stage chain to trick the program to start using the heap as the stack - something I originally wanted to do when I got stuck.

So,  the problem with this approach is that we do not actually have a pointer pointing to the heap anywhere on the stack while print_msg()/printf() is called - except for messages[i] - but messages[i] is literally the first argument from printf() call's perspective and we cannot reach it with %0$p format string (I tried)... we can reach everything past it, but not it itself. And it's not held anywhere on the heap itself either, so just leaking the heap without a format string at all wouldn't help.

This is where Corb3nik used a recurrent call of print_index() from within print_index(). In order to achieve this, four messages had to be created, because two overwritten pointers were needed:

  1. By overflowing message #0, message #1's pointer was overwritten to print_index() (so this is why we already had to create two messages).
  2. By overflowing message #2, message #3's pointer was overwritten to printf(), with its XOR pad being overwritten with the format string (thus, two more mesages).
  3. Then, by asking the program to print message #1, we have it call print_index() from the menu and asks for the number of the message to print. Once the number is provided, print_index() calls print_msg() - or whatever pointer we put there. Since for message  #1 we overwrote this pointer with print_index(), by asking the program to print message #1 we manage to have a recurrent print_index()->print_index() call. This way another set print_index()'s local variables and arguments is put on the stack, along with messages[i]. When the second print_index() call asks for the message number to print, we chose #3, because its print_msg() pointer was overwritten with printf()'s address and its XOR pad with our format string.
  4. This way we achieve the print_index()->print_index()->printf("%20$p"), creating and exploiting a format string condition.

Below is the stack of print_index():

Below is the stack of print_index()->print_index():

So the stack's properly groomed for leaking the address of messages[i]. From this point we can calculate the offset to the XOR pad we decide to put our payload in. We'll get back to this, now let's find out how to take control of the stack and start our ROP.

Corb3nik's solution - ROP-based stack pivoting

By analyzing Corb3nik's solution, after comprehending the convoluted heap address leaking, I realized that he started his ROP chain with a pointer pointing at a mov ecx, esp; ret; gadget (changing the stack pointer to the value of ECX). We saw a bunch of those earlier in ropeme output, when we were searching for stack-pivoting gadgets. I felt puzzled, as back then when looking at the state of the registers the only register that appeared to have a useful value was EDX.

I understood the use of ECX after I analyzed the way he provided arguments to the final print_index() call in his exploit, made me understand the trick:

So let's not focus on the ROP chain itself now (it makes ESP point to the buffer on the heap and then ret to it, as mentioned before, turning the heap into program's stack, because why not :D - I decided to go a different route, later on that).

Let's focus on the way the chain is delivered - along with the message index!

Again, print_index() source code, focusing on a part we did not pay much attention to before:

So again, this is the function that calls our overwritten pointer. It will trigger the exploitation by calling print_index()->mov ecx, esp; ret;.

Its local variables are held on the stack. Then it makes the call to print_msg() - or whatever we overwrite it with, putting more variables (call arguments, stack frame, saved RET and any local variables if needed) to the stack. This means that the numbuf[32] is there on the stack - and that's where ECX happens to point when messages[i]->print_msg(messages[i]); occurs.

While fgets() allows us to stuff 31 bytes into numbuf[32], only the first one needs to be a digit corresponding to the chosen message index, the rest can be anything non-null we want to place on the stack, as the later strtoul() conversion will simply ignore it - and ECX points there:

So, after we have any message structure on the heap with its print_msg pointer  overwritten with one of the mov ecx, esp; ret; gadgets (e.g. 0x80bd536), we can simply ask the program to print a message and provide it's index along with up to 30 bytes of our ROP chain.

So now we control EIP, we control the stack and we can overwrite the heap pretty much anyway we want. Now we can talk!

My last-stage ugly alternative - shellcode to heap, mprotect() heap RWX and ret there

Corb3nik's ROP chain delivered to the print_index()->numbuff[32] buffer via  print_index()->fgets()'s input made the program start using the heap as the stack (pop esp, ret;).With the rest of his ROP chain stored on the heap via the initial heap overflow:

As you might remember from the beginning of this way too long write up, I wanted to use mprotect() really bad, to make the heap executable and just fucking jump to it (thus ugly), without using any more ROP.

I did as well start off with the messages[i]->print_msg() pointer overwritten with the address of a  mov ecx, esp; ret; gadget.

But my ROP chain delivered along with the message number to the print_index() stack looked like this:

So obviously it started with the address of mprotect().

Then there was an address of a pop3ret instruction - the next instruction the mprotect() would return to - this is where it would expect to have its parent's saved RET stored. Before we return to the next address, we have to jump over/clean up the mprotect() arguments still lying on the stack, hence we have to use popNret as the next addr to return from our function whenever that function takes N arguments. This is the basic principle of building ROP chains.

OK, then the arguments.

First, the start address of the memory area we want to change memory protection flags of. I initially used the address of my shellcode  (warning, this did NOT work and required a fix, but read on!). The shellcode was already delivered to message #1's buffer by overwriting its XOR pad with the format string for printf() AND the shellcode itself:

At the time of writing the exploit it was still a NOP-holder, I left shellcode writing till the end. The point is that the address of the shellcode on the heap was already known thanks to its fixed offset from the leaked pointer.

Then the length  (I wanted to use 0x64 just to be on the safe side and have enough space executable). Then the flags (0x7 = READ + WRITE + EXEC).

And then, lastly, again the address of the shellcode on the heap. This is where the last ret from this short ROP chain will return. Then it will be just normal (a sequence of opcodes) shellcode executed on the heap.

For some reason mprotect() kept returning an error (0xffffffff in EAX), so the range must have been incorrect. I peeked into XPN's ROP Primer https://blog.xpnsec.com/rop-primer-level-0/ write up again and did as he did there with the stack - used the entire fixed heap start address + length as arguments (as I mentioned, they stayed the same between instances, hence could be fixed). Not the prettiest solution, but it was late and I just wanted to write the shellcode, run it from the heap and call it a day.

Shellcode

OK, the shellcode now. As far as I remember it should look like:

I built this using shellnoob (a tool I recommend for asm->opcode and opcode->asm conversions):

So, the opcodes are:

682f736800
682f62696e
54
5b
6a0b
58
31c9
31d2
cd80

So yeah, it worked:

The full code of my heavily uglified version of Corb3nik's exploit can be found here:

https://github.com/ewilded/MBE-snippets/blob/master/LAB7A/exploit_rwx_heap.py

No one really gives a shit about cookies and neither do I