Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decompilation incorrectly uses pointer instead of the value it points to #384

Open
palant opened this issue Sep 7, 2018 · 8 comments
Open

Comments

@palant
Copy link
Contributor

palant commented Sep 7, 2018

Running RetDec built from d57764a. I'm looking at a function that starts like this in the .dsm file:

0x10049390:   55                                 	push ebp
0x10049391:   8b ec                              	mov ebp, esp
0x10049393:   83 ec 14                           	sub esp, 0x14
0x10049396:   53                                 	push ebx
0x10049397:   56                                 	push esi
0x10049398:   8b 71 04                           	mov esi, dword ptr [ecx + 4]
0x1004939b:   8b da                              	mov ebx, edx
0x1004939d:   57                                 	push edi
0x1004939e:   8b 39                              	mov edi, dword ptr [ecx]
0x100493a0:   3b fe                              	cmp edi, esi
0x100493a2:   0f 84 81 00 00 00                  	je 0x10049429 <function_10049390+0x99>

In the .c file I get the following:

int32_t function_10049390(int32_t a1, int32_t a2) {
    // 0x10049390
    int32_t v1; // bp-24
    int32_t v2 = &v1; // 0x10049393
    int32_t v3 = g4; // 0x10049396
    int32_t v4 = g8; // bp-32
    int32_t str = g5; // 0x10049398
    int32_t v5 = *(int32_t *)(str + 4); // 0x10049398
    g8 = v5;
    int32_t v6 = g7; // 0x1004939b
    int32_t v7 = g6; // bp-36
    g6 = str;
    if (v5 == str) {

So ecx is turned into g5, edi into str and esi into v5 - so far it all makes sense. However, unlike indicated in the C program, edi and ecx are not the same. Rather, ecx points to a structure with two pointers, string start and string end. This appears to be some optimization gone wrong. I would rather expect the following here:

    int32_t v5 = *(int32_t *)(g5 + 4); // 0x10049398
    int32_t str = *(int32_t *)g5; // 0x1004939e
@palant
Copy link
Contributor Author

palant commented Sep 7, 2018

A bit further down in the same function I have the following assembly:

0x100493a8:   80 3f 5b                           	cmp byte ptr [edi], 0x5b
0x100493ab:   75 19                              	jne 0x100493c6 <function_10049390+0x36>

This is translated into:

    if ((char)str != 91) {

But str/edi isn't a character - it's a pointer to a character! So the correct translation would be:

    if (*(char *)str != 91) {

Not sure whether this is caused by the same internal misrepresentation of this value.

@palant
Copy link
Contributor Author

palant commented Sep 12, 2018

I minimized that problematic function into the following NASM code:

  SECTION .text
  global main

main:
  push esi
  push edi
  mov esi, dword [ecx + 4]
  mov edi, dword [ecx]
  cmp edi, esi
  je .done
  mov eax, edi

.done:
  pop edi
  pop esi
  ret

Compiled with nasm -f elf and decompiled with RetDec I receive the following code in the .ll file:

  %v0_0 = load i32, i32* %esi.global-to-local, align 4
  %v0_1 = load i32, i32* %edi.global-to-local, align 4
  %v0_2 = load i32, i32* %ecx.global-to-local, align 4
  %v1_2 = add i32 %v0_2, 4
  %v2_2 = inttoptr i32 %v1_2 to i32*
  %v3_2 = load i32, i32* %v2_2, align 4
  store i32 %v3_2, i32* %esi.global-to-local, align 4
  store i32 %v0_2, i32* %edi.global-to-local, align 4

You can see how esi gets the proper value assigned whereas edi gets a copy of ecx instead of the value at the address referenced by it.

@palant
Copy link
Contributor Author

palant commented Sep 12, 2018

It seems that the incorrect optimization here comes from the -instcombine bin2llvmir flag, it merges the processing for these two instructions incorrectly.

@palant
Copy link
Contributor Author

palant commented Sep 12, 2018

My bad, -instcombine merely removed code that became unnecessary. The mistake was introduced in an earlier step by the -constants optimization. The original code was:

  %12 = load i32, i32* @ecx
  %13 = inttoptr i32 %12 to i32*
  %14 = load i32, i32* %13
  store i32 %14, i32* @edi

After the optimization it became:

  %12 = load i32, i32* @ecx
  %13 = inttoptr i32 %12 to i32*
  %14 = load i32, i32* @ecx
  store i32 %14, i32* @edi

So %13 was replaced by @ecx even though they aren't identical.

@palant
Copy link
Contributor Author

palant commented Sep 12, 2018

It seems that the bug is here:

https://github.com/avast-tl/retdec/blob/78c7a628eda6a46a2e15db421779fc177a12d251/src/bin2llvmir/analyses/symbolic_tree.cpp#L415-L420

The analysis here marks variables in a load statement as identical with the statement's argument. I am by no means an expert on LLVM IR, but from what I've seen so far I'm not sure that this is ever correct. For now I removed that block and am testing now whether I get more meaningful output this way.

@palant
Copy link
Contributor Author

palant commented Sep 12, 2018

This indeed fixed the issue I mentioned in #384 (comment) as well, so it's caused by the same bug.

@palant
Copy link
Contributor Author

palant commented Sep 12, 2018

With this change I finally got C code that makes sense. It's also 30% larger however, it seems that there were side-effects on the optimization quality which I don't yet understand.

@palant
Copy link
Contributor Author

palant commented Sep 14, 2018

Ok, the side effect I saw were branches getting duplicated in llvmir2hll. While it isn't clear why the duplication didn't happen when casts were different, this change merely triggered suboptimal behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants