Reverse Engineering using Memory Analysis
During the past three years I've been developing tools for research and implementation of a new type of software analysis, which I will introduce in this paper. This new type of reverse engineering allows recovering internal implementation details using only passive memory analysis, without requiring any disassembly. I will also discuss how to cope with the challenge that applications (including DBs) are always in a state of flux - new versions, security updates, etc., keep changing the memory structure. I will answer the question of how to support a new version of the target application without ever seeing it.
I will discuss the added value of this new method of internals' recovery over the more common method of disassembling and decompiling. I will also share my stockpile of common memory patterns, written in Python, and explain the vast information that can be uncovered simply by roaming about in memory land.
This paper is a follow-up to the presentation of the same name, in which I gave a demonstration that included a description of a security problem I found in Microsoft SQL Server (published during 2009) as a result of applying this methodology. I demonstrated how it is possible to recover the internal structures of a program as complex as a DBMS, and how one can find the important core internals that should be protected.
One major application of this technique is discussed, which is to gain the deep knowledge and understanding of the internal building blocks and design of the target application, required to implement monitoring. As far as I know this method of memory monitoring has never before been used for security purposes. This method allows us to achieve a good view of the application’s activity, while also minimizing the performance impact (in contrast to methods that require extensive application logging, for example). It depends on the existence of caching, pipelining and buffering of data to create a real time view of the application’s activity. When applied efficiently it can be used to protect applications from various exploits and thus can be adopted as an alternative to applying security patches to products, especially when applying the patches comes at a very high cost (e.g. extensive testing of applications, shutting down mission-critical applications, etc.).
Reverse-engineers may consider recovering internal implementations and data structures by studying memory dumps difficult or not worth the hassle. In this paper you will see that not only is this job not as complex as one may think, but it could also be more effective than traditional SRE. I will show the benefits of this work in many real-world examples. I will divide this problem into smaller subjects as follows:
1. Examine the tools one needs for the task
2. Analyze all of the different primitives we ought to find in memory
3. Discuss a simple way to define at a high level the structures and patterns to search for in memory

Tools
A lot has been said about the subject of SRE tools, and almost any debugger would be sufficient for our needs. I find the Python interactive interpreter to be the most efficient environment for carrying out research of this kind. As I perform my research, the current status of the interpreter holds my current knowledge of the inspected target. Any piece of information can be easily accessed because it is all stored in global variables. Thanks to these benefits and many more, one can “play” with the data and try to make some sense of it. On Win32 there is the PyDBG module that enables a researcher to debug a process from a Python environment. An alternative to PyDBG would be a tool that I wrote for the task called pyMint, which is freely available online.
The functionalities one would be looking for in the debugger are:
• Displaying memory in various ways.
• Searching memory in various ways.
• Gathering as much information about the memory as possible (e.g. page attributes, memory regions, heap structures and so on).

Displaying memory dumps could be done in binary form, DWord form, ASCII, Unicode, graphical and more. It is better when all modes are accessible from one integrated environment. A simple modification of the way the memory is shown can make the difference between random-looking bits and bytes and a data structure with an apparent purpose. For example, here are two dumps of the same memory:
Figure 1
Figure 2
The first dump looks like a bunch of bytes that make no sense, while the second looks like a table in which every entry starts with two pointers followed by three numbers. A good (and correct, in this case) guess would be that this is an open hash table, where the first two DWords are the next / prev pointers of the linked list and the following number is the number of items in the bucket.
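To illustrate how re-displaying the same bytes exposes structure, the table-like view can be produced with a few lines of Python. `dump_dwords` is a hypothetical helper written for this paper, not part of any tool mentioned here, and the bucket values below are fabricated:

```python
import struct

def dump_dwords(buf, base=0):
    """Reinterpret a raw buffer as rows of five little-endian DWords,
    similar to the table-like view of Figure 2."""
    rows = []
    for off in range(0, len(buf) - len(buf) % 20, 20):
        dwords = struct.unpack_from("<5I", buf, off)
        rows.append("%08x: " % (base + off) +
                    " ".join("%08x" % d for d in dwords))
    return "\n".join(rows)

# A fabricated hash bucket entry: next ptr, prev ptr, item count, two values
entry = struct.pack("<5I", 0x00A71F30, 0x00A71E10, 3, 0x1337, 0)
print(dump_dwords(entry * 2, base=0x00A71F00))
```

Viewed this way, the repeating 20-byte stride and the leading pointer pair stand out immediately, which is exactly what suggests the linked-list interpretation.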
Another interesting way to inspect memory is graphical, and it was used in a tool called Kartograph. This tool was created by Elie Bursztein to produce map hacks for strategy games.
Figure 3
My tool currently carries a much simpler version of the Kartograph tool, which generates colourful memory dumps as shown in Figure 3.
The idea behind this is that every byte is represented by a single pixel, and every byte value has a corresponding colour. This simple tool lets the user play with three aspects of the display in order to find the correct way to view the memory.
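As a rough sketch of the byte-to-pixel idea (assuming a simple greyscale palette and a user-chosen row width; the real Kartograph and pyMint viewers are more elaborate), a memory buffer can be rendered to a PPM image with no external libraries:

```python
def bytes_to_ppm(buf, width):
    """Render a memory buffer as a PPM image: one pixel per byte,
    byte value mapped to a grey level (a stand-in for a real palette)."""
    height = len(buf) // width
    header = ("P6 %d %d 255\n" % (width, height)).encode("ascii")
    body = bytes(b for v in buf[:width * height] for b in (v, v, v))
    return header + body

# Render 4 KB of a fabricated dump, 64 bytes per image row
image = bytes_to_ppm(bytes(range(256)) * 16, 64)
with open("dump.ppm", "wb") as f:
    f.write(image)
```

Changing the `width` argument is what makes repeating structures "snap" into visible columns, the same effect as changing the row length in a hex view.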
Memory in detail
In order to classify the primitives found in memory, I’ve divided them into four groups.
1. Pointers
2. Data
a. Text
b. Time stamps
c. etc.
3. Completely Random
4. Code
Pointers can be identified by their tendency to point to something in memory. Furthermore, the CPU handles DWord-aligned addresses better, which makes the compiler, the heap manager and the OS try to keep pointers aligned whenever possible. This means most pointers end in either 0x0, 0x4, 0x8 or 0xc.
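This heuristic is easy to turn into a scanner. The sketch below (hypothetical helper, little-endian 32-bit assumed) flags every aligned DWord whose value lands inside a known mapped region:

```python
import struct

def likely_pointers(buf, regions):
    """Scan a buffer for DWord-aligned little-endian values that land
    inside one of the known mapped regions ((start, end) tuples)."""
    hits = []
    for off in range(0, len(buf) - 3, 4):      # aligned offsets only
        value = struct.unpack_from("<I", buf, off)[0]
        if any(start <= value < end for start, end in regions):
            hits.append((off, value))
    return hits

# Pretend 0x00400000-0x00500000 is a mapped module
buf = struct.pack("<4I", 0x00401234, 0xDEADBEEF, 0x0047FFF0, 7)
print(likely_pointers(buf, [(0x00400000, 0x00500000)]))
```

In a real session the region list would come from the debugger's memory map query rather than being hard-coded.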
“Data” is anything that is found in memory, that has a meaning such as IDs, handles, names, etc. “Data” is simply identified by prior knowledge of its meaning, for instance if I know that my session ID is 0x33, finding 0x33 in a memory array would guide me in the memory maze.
Contrary to common belief, truly random numbers are hardly ever found in memory. Furthermore, even memory that is not allocated at all and is not referred to by any code is not filled with random data, but with whatever was in that memory the last time it was used. In fact, when one encounters a buffer in memory that seems to be randomly generated, it usually corresponds to encrypted data, compressed data, hash digest or a pseudo-random numbers buffer, which is helpful when one is trying to recover some logic.
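One cheap way to test the "randomly generated" hypothesis is to measure the byte entropy of a buffer: encrypted, compressed or hashed data sits near 8 bits per byte, while ordinary structures score far lower. A minimal sketch:

```python
import math
import os
from collections import Counter

def shannon_entropy(buf):
    """Bits of entropy per byte; encrypted/compressed data approaches 8.0."""
    n = len(buf)
    return 0.0 - sum(c / n * math.log2(c / n) for c in Counter(buf).values())

print(shannon_entropy(bytes(4096)))       # a zeroed page scores 0.0
print(shannon_entropy(os.urandom(4096)))  # "random-looking" data scores near 8.0
```

A buffer scoring above roughly 7.5 is worth treating as ciphertext, compressed data or a digest rather than as a structure to be pattern-matched.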
To identify code one should be familiar with some assembly encoding. Almost every kind of CPU has its own signatures for function prologues / epilogues and common code. Most debuggers do a good job of separating the code from the data, and for an exotic CPU a new code-searching function could be written in a matter of hours. If we take, for example, x86 and the Visual Studio compiler, we can see that almost every function ends with 0xc3 0x90 0x90 0x90 0x90, which is the RET opcode followed by four NOPs (padding used by the MS Detours library).
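A signature scan of this kind takes only a few lines. The sketch below searches a code blob for the RET + 4*NOP sequence described above (x86 / Visual Studio assumed; the sample blob is a fabricated two-function snippet):

```python
def find_function_ends(code, signature=b"\xc3\x90\x90\x90\x90"):
    """Return the offset of every RET followed by four NOPs, a common
    Visual Studio function epilogue (Detours hot-patch padding)."""
    hits, start = [], 0
    while True:
        off = code.find(signature, start)
        if off < 0:
            return hits
        hits.append(off)
        start = off + 1

# Two tiny fabricated functions: push ebp / mov ebp,esp / pop ebp / ret + NOPs
blob = b"\x55\x8b\xec\x5d\xc3\x90\x90\x90\x90" * 2
print(find_function_ends(blob))   # [4, 13]
```

The same skeleton works for any other prologue or epilogue byte pattern; only the signature changes per CPU and compiler.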
Memory Patterns
What I mean by “Memory Patterns” is an easy, yet robust way to define C like structures with many “unknowns”. For example, I might know that one structure starts with a pointer to a virtual table followed by three DWords that I’m yet to figure out what they stand for, followed by another DWord which is known to be some kind of ID. In C the code that defines such structure could look like:
struct someObject {
    void ** vtable;
    int someDwords[3];
    int ID;
};
Let's say that I know for a fact that IDs are numbers in the range of 1300 to 3700. I would like to use this information in my search for this pattern, to eliminate as many false positives as possible. So using my tool I can define the following data structure in Python:
someObjectPattern = [
    SHAPE("VTable", 0, POINTER()),
    SHAPE("ID", 0xc, NUMBER((1300, 3700))) ]
Each pattern is defined by SHAPEs, and each SHAPE is defined by three to four elements:
1. Name of the SHAPE, to allow easy access to it in the future.
2. Location in memory (Relative to the previous SHAPE)
3. Type + Value in memory
4. Extra check function, for more advanced patterns
I tried to achieve three things in these Patterns:
1. Define and save the information I gathered about my current research.
2. Search for the patterns in real time, in case I couldn't find a single constant place in memory to rely on.
3. Automate the search over many different versions of the same program.
The problem with automating the search over different versions is the fact that the source code of the target application tends to change from version to version. A direct side effect of code changes is that the structures tend to change as well. Therefore, this patterns system supports different search ranges and unknown arrays in the structures. For example, say a new version of the structure above is introduced, with five DWords separating the VTable from the ID variable. A new pattern that catches both structures would be:
someObjectPattern = [
    SHAPE("VTable", 0, POINTER()),
    SHAPE("ID", (0xc, 0x14), NUMBER((1000, 10000))) ]
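To make the SHAPE semantics concrete, here is a hypothetical miniature re-implementation of the pattern search. It is not the actual pyMint API: I assume offsets are taken relative to the end of the previous SHAPE (consistent with 0xc placing the ID at absolute offset 0x10 in the first layout), little-endian 32-bit values, and a toy pointer heuristic:

```python
import struct

class NUMBER:
    """Matches a DWord, optionally constrained to a (low, high) range."""
    def __init__(self, bounds=None):
        self.bounds = bounds
    def is_valid(self, value):
        return self.bounds is None or self.bounds[0] <= value <= self.bounds[1]

class POINTER(NUMBER):
    """Toy heuristic: a pointer is non-NULL and DWord aligned."""
    def is_valid(self, value):
        return value != 0 and value % 4 == 0

def SHAPE(name, offset, kind, check=None):
    lo, hi = offset if isinstance(offset, tuple) else (offset, offset)
    return (name, lo, hi, kind, check)

def search(buf, pattern):
    """Yield {name: value} for every base offset where the pattern matches.
    Each SHAPE's offset range is relative to the end of the previous SHAPE."""
    def match(shapes, pos, found):
        if not shapes:
            yield dict(found)
            return
        name, lo, hi, kind, check = shapes[0]
        for off in range(pos + lo, pos + hi + 1, 4):
            if off + 4 > len(buf):
                continue
            value = struct.unpack_from("<I", buf, off)[0]
            if kind.is_valid(value) and (check is None or check(found, value)):
                found[name] = value
                yield from match(shapes[1:], off + 4, found)
                del found[name]
    for base in range(0, len(buf), 4):
        yield from match(pattern, base, {})

someObjectPattern = [
    SHAPE("VTable", 0, POINTER()),
    SHAPE("ID", (0xc, 0x14), NUMBER((1000, 10000))) ]

# Old layout: vtable, 3 unknown DWords, ID; new layout: vtable, 5 DWords, ID
old = struct.pack("<5I", 0x00401000, 0, 0, 0, 2000)
new = struct.pack("<7I", 0x00401000, 0, 0, 0, 0, 0, 2000)
print(list(search(old, someObjectPattern)))
print(list(search(new, someObjectPattern)))
```

The same pattern matches both fabricated layouts because the ID shape is allowed to float anywhere in the 0xc-0x14 range, which is exactly how version drift is absorbed.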
The idea is simple: I need as much information that defines the structure as possible. For instance, a user information structure is not the same without a valid user name, while the user's Hindi name entry could be something that is found only in the Russian versions of the application.
I have two ways to support data that is not as elementary as numbers, pointers or strings. Let's say that the ID element in the structure above is followed by a time stamp, which is the standard CTime value of sometime today.
The first approach would use the extra check function as follows:
from time import gmtime

def validateTime(context, value):
    return gmtime(value).tm_yday == gmtime().tm_yday

someObjectPattern = [
    SHAPE("VTable", 0, POINTER()),
    SHAPE("ID", (0xc, 0x14), NUMBER((1000, 10000))),
    SHAPE("SomeTime", 0, NUMBER(), validateTime) ]
The extra check function is simply a Boolean Python function that is executed in the context of the search. The function can access any data that has been found so far in the search, as well as the offsets and the addresses.
Another approach would be to define a new kind of basic data element to be used in the pattern, as follows:
from datetime import datetime

class TimeStamp( NUMBER ):
    def __init__(self, maxDaysDelta, **kw):
        NUMBER.__init__(self, size=4, **kw)
        self.maxDaysDelta = maxDaysDelta
    def isValid(self, patFinder, address, value):
        t = datetime.fromtimestamp(value)
        delta = abs(datetime.now() - t)
        if delta.days <= self.maxDaysDelta:
            yield True

someObjectPattern = [
    SHAPE("VTable", 0, POINTER()),
    SHAPE("ID", (0xc, 0x14), NUMBER((1000, 10000))),
    SHAPE("SomeTime", 0, TimeStamp(1)) ]
Future Work
Currently the implementation is not complete and I’ve focused on the aspects that were necessary for my work at Sentrigo. There is also more to be considered for future work:
• Adding features such as RegExp, faster memory scanning, better memory map query and more to the Candy / Mint Python modules. Although I didn't use any of these in my projects, other people may find these kinds of features more essential.
• Integrating all the tools / modules into one environment supporting all platforms and a variety of methods to access memory. I started working on this module last month, and you can find the work in progress under the nativDebugging SVN.
• Writing an Action-Script VM (Flash) in memory debugger / editor. This kind of tool could make Flash debugging and developing much more effective.
• Creating a proof-of-concept web server monitor. I do believe that the well proven security and monitoring method implemented by Sentrigo, should be used for many other applications.
• Considering malware, it is interesting to check what kind of data a virus can harvest from a target by monitoring the memory and staying completely invisible to logs and monitors. On the other hand, anti-viruses could use some of the techniques described here to search for and locate viruses and make signatures for them.
Thanks
The Sensor team @ Sentrigo and the rest of Sentrigo for the time and effort and the great product of Hedgehog.
Elie Bursztein for Kartograph.
Roy Fox and Anna Trainin for proofreading this paper.
Anyone who contributes to the source.