When reverse engineering malware or just programs in general, it is usually important not to focus too much on each individual assembly instruction. It’s often a better idea to get a broader overview, and then focus on areas of interest. Recognizing C Code Constructs can help us to visualize the structure when reversing.
This post will assume your binary is a compiled C application, and that the assembly is x86.
Even though some decompilers like the one featured in Ghidra or IDA Pro can convert assembly back to pseudo C, I still believe it is important to be able to understand the structure of a bunch of instructions in a basic block. What does this mean though?
It means that we would like to understand what a section of assembly code is doing by examining the order of instructions, to understand what it would look like if we had the source code ourselves.
An example could be a bunch of instructions that appears to loop. We can usually identify this if there are jump instructions to an earlier point in the code. But what kind of loop is it? What is the structure of the loop? How are the variables in the loop declared? What is the condition? When does it terminate? The questions go on. These questions, and more, will be answered in this post, along with some practical examples borrowed from a book I’m currently reading, “Practical Malware Analysis – The Hands-On Guide to Dissecting Malicious Software”.
It should be mentioned that this post is not a writeup of the exercises themselves (that could go in another category if there is interest). The exercises are merely used as a tool to facilitate learning.
Before we begin our journey into understanding C Code Constructs, I decided to list my setup so you can follow along yourself if you wish to. In my own experience, I learn a lot by doing, rather than just reading about it.
The setup has the following elements:
- IDA Freeware 5.0
- Windows XP SP3
- C compiled binaries for x86 from the aforementioned book
We open up IDA and simply choose to analyze a new file, in this case it is the one named: Lab06-01.exe.
We open it up and are met with the following:
According to the book, the function of interest is sub_401000. The main function can, however, already show an interesting code construct, namely an if-else statement. We choose to analyze the one inside the function mentioned above, since it is a bit more interesting.
If-Else construct: Investigating sub_401000
We enter the function and this time I chose the non-graph view first to see if we can deduce what construct we’re dealing with.
It appears to be an if-else statement, but what hints did we have to help us conclude that?
- We have a cmp instruction, which compares the value found at address ebp+var_4 with 0.
- We have a jump instruction right after a comparison which depends on the result from the cmp instruction.
- The result of the branch taken would leave the other one out. We can see this by looking at the various arrows on the left. But also directly where the jumps take us. If the result is zero, it will skip the code that prints the “success” message.
- We have visual help in the form of the text that is pushed as a parameter to the printf function. Either we have an error or success. Of course this cannot guarantee we’re dealing with an if-else statement, but it at least makes it a bit more obvious what is going on.
Turning on the graph view seems to agree with our investigation.
This piece of code seems to check whether there’s an internet connection or not. According to the author, this is something malware often does so it knows how to behave in various circumstances.
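To make the shape concrete, here is a minimal C sketch of source code that could plausibly compile to the cmp-and-jump pattern described above. The function name `report_connection`, its parameter, and the message strings are my own assumptions for illustration. This is not the original source of Lab06-01.exe.

```c
#include <stdio.h>

/* Hypothetical stand-in for sub_401000. The parameter plays the role of the
 * value stored at [ebp+var_4]; in the real binary that value comes from
 * elsewhere in the function. */
int report_connection(int connected)
{
    if (connected != 0) {
        /* Fall-through path: the conditional jump after the cmp is not taken. */
        printf("Success: connection available\n");
        return 1;
    } else {
        /* Jump target: reached when the compared value is zero,
         * skipping the success branch entirely. */
        printf("Error: no connection\n");
        return 0;
    }
}
```

Calling `report_connection(1)` takes the first branch and `report_connection(0)` takes the second. Only one of the two printf calls ever runs, which is exactly the "one branch leaves the other out" hint from the list above.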
Multi If-Else construct: Investigating sub_401040
The program has been changed a bit, and we now have a new function to investigate in a new binary.
The name of the binary is Lab06-02.exe and the function we’re looking at is called sub_401040. The main difference here, is that the function is a bit more advanced, and to me it proved a challenge at first even in graphical view to understand how it was constructed. We will take it one step at a time, and at the end show the graphical view to see the big picture.
We start off by digging into the function mentioned above. I have copied the nongraphical view in three parts that I felt seemed to indicate some kind of branch or structure. Hopefully that makes it a bit easier to “digest”.
To avoid repetition, I will not go into the same level of detail as last time when we discover similar constructs.
Here we are met with some interesting assembly. Remember, this is a piece of malware we’re investigating. Knowledge about malware behavior might sometimes help us to understand the “typical” structure of some functionality, an example could be the earlier section about malware checking for internet connections.
In this example we have two calls related to InternetOpen functions. The ASCII strings on the right side help us a bit again, and it looks like an if-else statement (so far). Either it can open the URL or it cannot. Again we can see this, since it skips a code section if a condition is not met. We take a look at our next code section in the following image.
Here we can see where the jump takes us if it was able to open the website. It takes us down to “.text:0040109d”. If it was not, it would eventually end up at loc_40112c (the exit part) after having closed a few handles. This part, however, seems to contain another if – else statement.
Dealing with multiple if – else statements
It seems similar to the other part we investigated. This part just attempts to read a file on the remote server, which is likely the HTML page that it set up for earlier. Knowledge about how Windows functions work helps a great deal during reverse engineering too. I highly suggest googling any function whose purpose you’re not sure of, especially if your disassembler is kind enough to provide its name for you (this is not always the case 🙁 ).
If the branch above does not go into the exit function, that is, if the InternetReadFile function successfully reads the file at the URL, then we’re taken into the last part.
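The nesting we have seen so far can be sketched in C roughly as follows. This is only a guess at the shape, not the real source: `fetch_page` and its two flag parameters are hypothetical stand-ins for the outcome of the "open the URL" and "read the file" steps, and the messages are made up.

```c
#include <stdio.h>

/* Hypothetical sketch of the nested structure. Each flag says whether that
 * step would have succeeded. Returns 1 only if every step succeeded. */
int fetch_page(int url_opened, int file_read)
{
    int result = 0;

    if (url_opened) {
        /* Inner if-else: only reached when the outer condition held. */
        if (file_read) {
            result = 1;
        } else {
            printf("Error: could not read the file\n");
        }
        /* A real program would close its handles here before leaving. */
    } else {
        printf("Error: could not open the URL\n");
    }

    /* Every path ends up at the same exit point. */
    return result;
}
```

Each failure skips everything nested inside it and falls through to the shared exit, which matches the jumps towards the exit part that we saw in the listing.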
At the last part here, we see something very interesting. We have a bunch of comparisons right next to each other where the “else” seems to lead to the same destination. I was unsure if this was done using many if statements with the exact same else statement or if it was some sort of else-if setup.
A helping hand from Godbolt
I did some digging, and I found a really handy website that was able to help me be a bit more certain of what it could be. When programming, the higher level languages abstract away these small details, but I know from experience that an “&&” operator works similarly to extra if statements. So using the website named https://godbolt.org/ we can easily create a simple C function and convert it directly to assembly. An example is shown in the image below.
The assembly on the right seems quite similar to the one found earlier, so I believe the construct we found was originally an “&&” statement. Keep in mind that the exact form of the original source code is of lesser importance. As long as you understand the overall structure, or at least what the program does, that is what matters.
So we can conclude that we’re dealing with a bunch of nested if-else statements, one of which includes a quite picky “&&” operator. But what about the weird “buffer”? And why does it increase the buffer by 1, 2, 3 and 4? This will be explained at the end. For now we’re just trying to understand the structure of this function. And to do so, we will finally enable the graphical view.
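If you want to try this yourself, a pair of functions like the following can be pasted into the site. The function names and the compared values are my own, chosen only for illustration. Without optimization, a compiler will typically turn both into a chain of compare-and-jump pairs that all share the same "false" target, which is why the two are hard to tell apart in a disassembly.

```c
/* Version 1: a single condition joined with "&&". The operator
 * short-circuits, so b is only checked if a matched, and so on. */
int all_match_and(int a, int b, int c)
{
    if (a == 1 && b == 2 && c == 3) {
        return 1;
    }
    return 0;
}

/* Version 2: the same logic written as nested if statements.
 * Every failed check ends up at the same "return 0". */
int all_match_nested(int a, int b, int c)
{
    if (a == 1) {
        if (b == 2) {
            if (c == 3) {
                return 1;
            }
        }
    }
    return 0;
}
```

Both functions return the same result for every input, so for the purpose of understanding the program it does not matter much which one the author actually wrote.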
Graphical view to the rescue
The graphical view comes to the rescue by showing us the branching that we just spent quite a bit of time trying to piece together in our heads. Of course we could draw it ourselves as we went along. This is good practice, but it falls short once we enter more complicated functions. Various anti-reversing techniques would make it next to impossible to read without having graphs enabled. Even with the graphical view turned on, the four small branches on the right (which we figured out were likely coded using “&&”) don’t tell us much about how the code was originally written. It’s important not to falsely believe it was an else-if statement. The difference might be subtle, but later we can see that there is a very specific reason for that structure.
The malware makes a bunch of web requests and, if they all succeed, it looks for a comment in the HTML code. It does so character by character. This means that the characters need to be exactly “<!--” or else it would fail. If this is indeed on the webpage, it loads the character that follows (the one at buffer + 4) into a register. This register is returned to the main function (which called the current function), and the program will sleep if it went well.
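A rough C sketch of the character-by-character check could look like this. The function name `parse_command` and the choice to return 0 on failure are assumptions on my part. Only the comparison against the four comment characters and the read at offset 4 come from the disassembly.

```c
/* Hypothetical reconstruction: returns the character that follows "<!--",
 * or 0 if the buffer does not start with an HTML comment. */
char parse_command(const char *buffer)
{
    /* Four comparisons joined by "&&". Because "&&" short-circuits,
     * a mismatch at any position skips the remaining checks. */
    if (buffer[0] == '<' && buffer[1] == '!' &&
        buffer[2] == '-' && buffer[3] == '-') {
        return buffer[4];   /* the character right after the comment opener */
    }
    return 0;
}
```

The short-circuit behaviour is also what keeps this safe on short strings: the terminating NUL never matches, so the function stops before reading past the end of the buffer.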
Figuring out what “buffer +x” actually is
Earlier we wondered why the buffer would have +1, +2 and so on added to it when being dereferenced (that is, when we take the value at a specific address). This is actually because we’re dealing with a character array in C. Remember, in plain old C we don’t have Strings by default, we have character arrays. At the lower levels, an array is simply a data structure laid out in consecutive memory. The array itself can be dereferenced, and it will then give us its first value. How else could we figure that out in this case?
- We can see [] around what is being dereferenced. This means we’re dealing with a pointer, which is what an array becomes when it is referenced in C.
- We seem to have a steady increase in the buffer (+1, +2, +3). Elements in arrays all take the same space.
- The name is buffer, and is earlier in the program pushed to a function that is named InternetReadFile.
Googling this reveals the following quote. “WinINet attempts to write the HTML to the lpBuffer buffer a line at a time.” Looking in our code we see IDA has named our buffer “lpBuffer”
- Comparing something to a String, for instance text fetched from online, is done in C using character arrays.
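The hints in the list above boil down to one rule of C: indexing an array is the same as adding an offset to a pointer and dereferencing the result. The helper below is a small illustration of that rule. The name `char_at_offset` is made up for this example.

```c
#include <stddef.h>

/* Reads one character at a given offset using explicit pointer arithmetic.
 * In C, buffer[offset] is defined as *(buffer + offset), so this is exactly
 * what an array index does. A char is 1 byte, which is why consecutive
 * elements show up as +1, +2, +3 in the disassembly. */
char char_at_offset(const char *buffer, size_t offset)
{
    return *(buffer + offset);
}
```

For an array of a larger type, such as int, the offsets seen in the assembly would grow in steps of the element size instead of 1.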
End of part 1 – A summary
Well done! You managed to read the whole part. This was my first post, so I likely have lots to improve. I hope you enjoyed reading it and managed to learn a new thing or two.
In this part we learned the basics of recognizing C code constructs in assembly. More specifically we learned:
- What an if – else statement can look like in assembly
- How to spot the hints that we’re dealing with an if – else statement
- What nested if – else statements can look like in assembly
- What an “&&” operator can look like in assembly
- How to avoid a pitfall when deducing the kind of “if statement”
- How to spot a character array in assembly and the hints we’re given
- In general how knowledge about the context of the binary can help us deduce functionality
The next part will look into more complicated structures while dealing with even more interesting malware. Thanks a lot for reading.
If you have any questions, corrections, or general feedback, you are more than welcome to contact me on Twitter: https://twitter.com/SnowballSec.