Tuesday, October 23, 2007

Reliable staging without a stager receive loop

Staged payloads are used by Metasploit to help reduce the size of the initial payload blob that needs to be transmitted as part of an exploitation attempt. These stagers typically connect to a Metasploit client and bootstrap (read in) a second stage payload blob which is subsequently executed. There's a problem with this approach, however, and it has to do with partial reads.

It just so happens that Metasploit has a stage (the DLL injection stage) that is nearly 3000 bytes in size. On all modern operating systems, TCP is designed to avoid IP fragmentation through the use of the TCP MSS (maximum segment size). The MSS keeps each packet no larger than the MTU of the outgoing interface by restricting the amount of data that can be sent in any individual segment to the MTU minus the overhead added by the IP and TCP headers. While this definitely improves overall network performance, it also means that a target machine that is being sent data over TCP may receive only part of a "message" that the sender transmitted as a single buffer. This is typically called a partial read. Since TCP is a streaming protocol, a partial read is perfectly fine. The application is responsible for any internal buffering of underlying messages. For a payload stager, however, this can be disastrous.

It's pretty easy to illustrate this point. Write a server in C that binds to a port, accepts a connection, and then receives a fixed amount of data (say, 3000 bytes) from the wire. The server might look something like this:


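#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char **argv)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0), c;
    struct sockaddr_in s = {0};
    char buf[3000];
    int len;

    /* Listen on TCP port 4444 on every local address */
    s.sin_family = AF_INET;
    s.sin_port = htons(4444);
    s.sin_addr.s_addr = htonl(INADDR_ANY);

    bind(fd, (struct sockaddr *)&s, sizeof(s));
    listen(fd, 1);
    while (1)
    {
        c = accept(fd, NULL, NULL);
        /* A single recv() returns whatever has arrived so far, up to sizeof(buf) */
        len = (int)recv(c, buf, sizeof(buf), 0);
        printf("Got %d bytes\n", len);
        close(c);
    }
}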


When connecting to this server and sending it some data piped through nc, the output is the following:


On the client:

$ echo -ne "abcd" | nc server 4444

On the server:

Got 4 bytes


This looks just like you'd expect. We sent four bytes and the server read four bytes. But what happens if we send 3000 bytes?


On the client:

$ perl -e 'print "A" x 3000;' | nc server 4444

On the server:

Got 1448 bytes


This might seem a bit strange. We obviously sent 3000 bytes at once from our perspective as the client, but the server only read 1448 bytes. What's going on here? To answer this, we need to look at the packet capture:


03:47:20.324613 IP x.x.x.x.38720 > y.y.y.y.4444:
S 982288742:982288742(0) win 5840
03:47:20.324787 IP y.y.y.y.4444 > x.x.x.x.38720:
S 3678693369:3678693369(0) ack 982288743 win 5792
03:47:20.391740 IP x.x.x.x.38720 > y.y.y.y.4444:
. ack 1 win 92
03:47:20.398728 IP x.x.x.x.38720 > y.y.y.y.4444:
. 1:1449(1448) ack 1 win 92
03:47:20.398805 IP y.y.y.y.4444 > x.x.x.x.38720:
. ack 1449 win 543
03:47:20.398838 IP y.y.y.y.4444 > x.x.x.x.38720:
F 1:1(0) ack 1449 win 543
03:47:20.405973 IP x.x.x.x.38720 > y.y.y.y.4444:
. 1449:2897(1448) ack 1 win 92


In the above, we can see two TCP data segments in the capture. The first is 1448 bytes in size, as is the second (a third follows which is not shown). Notice that the remote server, y.y.y.y, ACKs the first 1448 bytes before receiving the second 1448. It also sends a FIN indicating that it has closed its half of the connection. If the 3000 bytes we were sending had been a payload blob, only about half of the blob would have arrived before being executed. This would certainly lead to an unexpected crash.
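The 1448 byte figure is consistent with a 1500 byte Ethernet MTU, although the trimmed capture above does not show the header options, so treat the sizes below as assumptions: 20 bytes of IPv4 header, 20 bytes of TCP header, and 12 bytes for the TCP timestamp option with its padding. A rough sketch of the arithmetic, using the hypothetical helpers segment_payload and segment_count:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed header sizes; the capture does not show them. */
enum {
    IP_HEADER_LEN     = 20, /* IPv4 header without options */
    TCP_HEADER_LEN    = 20, /* TCP header without options */
    TCP_TIMESTAMP_LEN = 12  /* timestamp option (10 bytes) plus 2 bytes of padding */
};

/* Payload bytes that fit in a single segment for a given MTU. */
size_t segment_payload(size_t mtu, int timestamps)
{
    size_t overhead = IP_HEADER_LEN + TCP_HEADER_LEN;

    if (timestamps)
        overhead += TCP_TIMESTAMP_LEN;
    return mtu > overhead ? mtu - overhead : 0;
}

/* Number of segments needed to carry `total` bytes.
 * The size of the final segment is stored in *last. */
size_t segment_count(size_t total, size_t per_segment, size_t *last)
{
    size_t n;

    assert(per_segment > 0 && last != NULL);
    if (total == 0) {
        *last = 0;
        return 0;
    }
    n = (total + per_segment - 1) / per_segment; /* round up */
    *last = total - (n - 1) * per_segment;
    return n;
}
```

Under those assumptions, segment_payload(1500, 1) yields 1448, and segment_count(3000, 1448, &last) yields three segments with 104 bytes left over for the last one, which lines up with the two full segments shown and the third that was cut from the capture.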

This problem has been known about for some time. The most common solutions typically involve implementing a receive loop that reads an expected number of bytes before executing the stage. For instance, an attacking machine might transmit a four byte value describing the length of the stage. The target machine's stager can then loop calling receive until the expected number of bytes have been read. While this works perfectly fine, it adds to the size of the stager (which is meant to be small) and also adds a potentially signaturable network effect.
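To make the receive loop concrete, here is a minimal sketch in C rather than in shellcode. The function names recv_all and recv_stage are made up for illustration, and the choice of network byte order for the four byte length is an assumption; a real stager would be hand-written assembly and may use a different encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Keep calling recv() until exactly len bytes have arrived.
 * Returns 0 on success, -1 on error or if the peer closes early. */
int recv_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0)
            return -1;      /* error, or connection closed mid-stage */
        p += n;             /* a partial read just advances the cursor */
        len -= (size_t)n;
    }
    return 0;
}

/* Read a four byte length (assumed big-endian), then that many bytes of stage.
 * Returns a malloc'd buffer the caller must free, or NULL on failure. */
unsigned char *recv_stage(int fd, uint32_t *out_len)
{
    uint32_t wire_len;
    unsigned char *stage;

    assert(out_len != NULL);
    if (recv_all(fd, &wire_len, sizeof(wire_len)) != 0)
        return NULL;
    *out_len = ntohl(wire_len);

    stage = malloc(*out_len ? *out_len : 1);
    if (stage == NULL)
        return NULL;
    if (recv_all(fd, stage, *out_len) != 0) {
        free(stage);
        return NULL;
    }
    return stage;
}
```

With this in place, it no longer matters whether the 3000 bytes show up as one read or three. The cost is the extra loop and length handling, which is exactly the size overhead described above.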

The obvious question at this point is whether or not we can do something better. It's important to think about the behavior of TCP segments and the way that receive queues are managed in modern operating systems. TCP is designed to be a reliable transport that is capable of tolerating intermittent packet loss. It supports this by describing communication in terms of a receive window that may consist of multiple segments. Due to the way that packets are routed on the internet, it may be possible for TCP segments to arrive out of order. Since TCP is a streaming protocol, the order of transmitted data must be preserved. In the case of mainstream operating systems, it seems to be common practice that out of order segments are queued rather than discarded in an effort to reduce the number of retransmissions. The implications of this are the key to solving our problem.

To solve the partial read issue, we must have a mechanism to ensure that the entire stage shows up in the target socket's receive buffer at the same time. If we assume that the target's operating system will retain out of order segments, then we may be able to make our example stager reliable, even across the internet. We could even detect this through normal communication with the target host during exploitation, assuming TCP is involved.

There are two simple ways that this could be done. The first way would involve reversing the order of TCP segments sent by the attacker with a moderate delay added between each transmit in order to reduce the chances of a transient routing condition. The nice thing about this is that it would be challenging to discern this behavior from that of potentially real internet traffic. An alternative approach that is arguably easier to implement though perhaps easier to identify would involve transmitting all TCP segments except for the one accounting for the first byte of the data being sent. After all of the other segments have been transmitted, the first byte can be sent causing the target machine to completely reassemble and place the result in the receive buffer for the associated connection. A quick proof of concept test seems to show that this is feasible when sending an 861 byte packet (at least against a Linux target). It's thought that other platforms will share this behavior though this has not been determined.
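The second approach can be modeled without touching the network. The sketch below is only a toy: plan_first_byte_last and readable_after are hypothetical names, offsets are zero-based positions in the stage, and the receiver model simply assumes that out of order segments are queued until the gap in front of them is filled. Actually putting such segments on the wire would require crafting TCP packets by hand, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

struct segment {
    size_t off; /* zero-based offset of the segment within the stage */
    size_t len; /* number of payload bytes in the segment */
};

/* Fill `out` with the send order: everything except the first byte in
 * MSS-sized pieces, then the first byte last.
 * Returns the number of segments, or 0 if `out` is too small. */
size_t plan_first_byte_last(size_t total, size_t mss,
                            struct segment *out, size_t cap)
{
    size_t n = 0, off;

    assert(mss > 0 && out != NULL);
    if (total == 0)
        return 0;
    for (off = 1; off < total; off += mss) {
        if (n == cap)
            return 0;
        out[n].off = off;
        out[n].len = (total - off < mss) ? total - off : mss;
        n++;
    }
    if (n == cap)
        return 0;
    out[n].off = 0; /* the segment that unblocks reassembly */
    out[n].len = 1;
    return n + 1;
}

/* Toy receiver: how many bytes are contiguous from offset 0, and so
 * visible to a read, once the first `arrived` segments are queued. */
size_t readable_after(const struct segment *segs, size_t arrived)
{
    size_t readable = 0, i;
    int progressed = 1;

    while (progressed) {
        progressed = 0;
        for (i = 0; i < arrived; i++) {
            if (segs[i].off <= readable &&
                segs[i].off + segs[i].len > readable) {
                readable = segs[i].off + segs[i].len;
                progressed = 1;
            }
        }
    }
    return readable;
}
```

For an 861 byte stage, the plan is one 860 byte segment starting at offset 1 followed by the single first byte. The model reports nothing readable after the first segment and all 861 bytes readable after the second, which is the behavior the technique depends on.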


On the server:

Got 861 bytes

With the associated packet capture:

03:29:05.914211 IP x.x.x.x.43807 > y.y.y.y.4444:
S 3870094441:3870094441(0) win 5840
03:29:05.914270 IP y.y.y.y.4444 > x.x.x.x.43807:
S 130543367:130543367(0) ack 3870094442 win 5792
03:29:06.094311 IP x.x.x.x.43807 > y.y.y.y.4444:
. ack 1 win 1460
03:29:09.353861 IP x.x.x.x.43807 > y.y.y.y.4444:
P 2:862(860) ack 1 win 65535
03:29:09.354071 IP y.y.y.y.4444 > x.x.x.x.43807:
. ack 1 win 362
03:29:09.357354 IP x.x.x.x.43807 > y.y.y.y.4444:
P 1:2(1) ack 1 win 1460
03:29:09.357499 IP y.y.y.y.4444 > x.x.x.x.43807:
. ack 862 win 362


Note that the segment describing bytes 2-862 arrives before the segment describing the first byte. This example should be applicable to larger packet sizes though the tool that was used to test is currently not capable of trying this out. Thanks to anonymous for helping to test this :-).

Even though this appears to work in testing, there are definitely some real problems with it. There's a chance that a stateful firewall or IPS device may be in between the attacker and the target machine. If this is the case, then it cannot be safely assumed that the target machine will receive the packets out of order. This is due to the fact that the transparent device may perform stream reassembly and then not preserve the out of order characteristic when transmitting data out the other side. While this may be the case, it could be argued that such a device would often be in close proximity to the target machine, thus increasing the likelihood of the entire stage being present in the receive buffer even if sent in order due to the decreased latency. Another problem has to do with the maximum size of the stage that can be sent. The stage size is constrained by both the window size and the size of the receive buffer associated with the socket on the server doing the receiving. There might be some other scenarios and/or platforms that make this approach impossible to use (please post a comment if you're aware of any).
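The size constraint can be sketched as a simple bound. On the receiving side, the socket's buffer size can be queried with getsockopt and SO_RCVBUF. The helper names receive_buffer_size and stage_fits are hypothetical, and the check is only a rough upper limit, since not all of the reported buffer is necessarily usable for payload data.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns the receive buffer size the OS reports for fd, or -1 on error. */
int receive_buffer_size(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) != 0)
        return -1;
    return size;
}

/* Rough check: a stage can only be queued in full if it is no larger than
 * both the advertised window and the receive buffer. Returns 1 if so. */
int stage_fits(size_t stage_len, size_t advertised_window, int rcvbuf)
{
    size_t limit;

    assert(rcvbuf >= 0);
    limit = advertised_window < (size_t)rcvbuf ? advertised_window
                                                : (size_t)rcvbuf;
    return stage_len <= limit;
}
```

As an illustration, the target in the captures above advertises a window of 5792 bytes in its SYN/ACK. With that window and an arbitrary example buffer of 87380 bytes, a 3000 byte stage passes the check while a 6000 byte stage does not.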

The idea of intentionally sending out of order segments is definitely not new. Fragroute has supported this for quite some time. Still, the application of out of order segments to payload staging may not have been as obvious. It's unlikely that Metasploit will implement this in the immediate future. We currently use an 89 byte intermediate stager when necessary. This solves this problem without having to alter our existing stagers. Still, it seems like a fun possibility if payload size restrictions happen to be exceedingly tight :-).

9 Comments:

Anonymous Anonymous said...

Nice article Matt!

A couple of comments and then some ideas to possibly look into.

First, I'm surprised at how few out-of-order packets one really sees at their perimeter. Note that if you trace across multiple peered ASes then you will definitely see asymmetric routing (as you would expect)... but still seldom do you actually get packets out of order even in these cases. One might ascribe this to the over provisioning that has occurred over the last several years. Of course one will see even fewer out of order packets transmitted internally on one's networks.

So my comment is that while the non-loop read is a known issue and as we are concerned about any increase in payload size, it would seem this solution would not only fail in the cases you mention (stateful devices reassembling and reordering packets within streams), it would seem to be overly noisy to passive monitoring tactics (e.g. IDS and other monitors).

Have you thought about placing some of the data within the initial SYN packet (this is not disallowed in rfc 793) and it seems some stacks handle this just fine. It should not make it to the recv() function until the three way handshake has been completed at which point the data from the mbufs will have been handed over (including the data put in the initial SYN). In essence you would be using the kernel to buffer this first amount of data for you and combine it with the data in the first PSH packet after the 3-way. Thus getting the data portion of two packets in the first userland read.

Of course, there's the chance that after moving to connected state that recv is handed the initial data and your 1 shot read returns early after the SYN w/ data and prior to the first normal data packet, but it's worth a shot checking out which stacks handle this in which ways.

Then there's T/TCP and rfc 1644 (which is normally enabled or disabled through setting a sysctl kernel state). SYN with data is nothing new here and perhaps enabling net.inet.tcp.rfc1644 would be enough to have the first two packets (syn w/ data being pkt #1 and ack|psh w/ data #2) be buffered into your single read.

These other options will appear nonstandard to monitoring devices as well, but they might make it through stateful filtering ;)

Hey, you asked for ideas so I decided to ramble a bit. Keep up the great work!

cheers,

.mudge

10:55:00 AM  
Blogger skape said...

Yeah, I'd definitely agree that out of order packets are likely rare, though I don't have strong figures to prove this. An aggressive IDS could definitely signature on this behavior as potentially bad. Transmitting data as part of the initial SYN falls into a similar category -- it's valid, but uncommon. Still, it's definitely an option :)

11:59:00 AM  
Anonymous Anonymous said...

Did you think about using the recv(2) flag MSG_WAITALL? Seems to be the most obvious way to me.

10:41:00 AM  
Anonymous Anonymous said...

Did you think about using the MSG_WAITALL flag for recv(2)? Seems to be the most obvious way to me.

10:52:00 AM  
Blogger skape said...

MSG_WAITALL is definitely an option on the POSIX side of the house. Unfortunately, Windows doesn't support MSG_WAITALL (last I checked).

12:05:00 PM  
Anonymous Anonymous said...

Very interesting research!
Good job

/JA

12:54:00 AM  
Anonymous Anonymous said...

If you want to receive n bytes, then what's the problem of using recv() until you've received the n bytes...

8:12:00 AM  
Anonymous Anonymous said...

It's not that much of an increase in size... ;(

8:16:00 AM  
Blogger skape said...

Using a receive loop is definitely an option if you can manage the extra size overhead (which I agree is fairly small). The out-of-order packets was meant to be more of an illustration of another way of going about it that wouldn't impact size :) Metasploit's current solution fixes the problem without adding any size overhead or the need for a receive loop (the intermediate stager).

12:21:00 PM  
