Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement telfhash for ELF import table #936

Merged
merged 21 commits into from
Apr 14, 2021
Merged

Conversation

HoundThe
Copy link
Member

I've added tlsh https://github.com/trendmicro/tlsh code into the project and used it to calculate telfhash for ELF import tables.

@PeterMatula
Copy link
Collaborator

Lets run TC tests.

Copy link
Collaborator

@PeterMatula PeterMatula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • linux-doxygen-build failed.
  • win-build failed - looks like it has trouble compiling TLSH. Solving errors on WIndows is a major pita. If you don't have a windows machine, and cannot get you hands on some virtualbox, let me know. We could probably give you access to a windows server, but it is kinda last resort - having it locally in virutalbox is better.
  • TLSH sources must not be between other RetDec sources. Move it to /deps/tlsh. Create CMake library target for it - look for example at tinyxml2 which should be similar. I can help with CMake if needed. Also, add at least license file to sources, and some version identification, e.g. version file if it exists - preferably use some tagged TLSH version, not random commit at master. Also, add TLSH entry to our /LICENSE-THIRD-PARTY file.
  • I also expect PR in retdec-regression-tests that on some real-world input sample tests both plain and json values. Place it in /tools/fileinfo/features.
  • Can you send me a file, where it produces some hash? I tried it on retdec-regression-tests/integration/ack/2018-09-13/ack.x64.gcc.O0.g.elf and I got newly added entries for SHA256, MD5, CRC32, but no TLSH. Is it working for you? M'I doing something wrong? This is exactly the thing that regression tests would solve automatically.
  • After you fix these issues, and test the implementation on those 450k samples I sent, change the PR from draft to a full PR, and let me know. I will go over it again and merge it.

src/fileformat/types/import_table/elf_import_table.cpp Outdated Show resolved Hide resolved
src/fileformat/types/import_table/elf_import_table.cpp Outdated Show resolved Hide resolved
src/fileformat/types/import_table/elf_import_table.cpp Outdated Show resolved Hide resolved
imported_symbols.push_back(funcName);
}

std::sort(imported_symbols.begin(), imported_symbols.end());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are sorting, why not use std::set? Are duplicates possible/desirable? If yes, and we want to keep them, std::vector is ok. If we don't want them, std::set would solve it for us.
Does telfhash allow duplicates in this container?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a thing, is the sorting algorithm behaving the same as in the telfhash implementation? I think it is possible to somehow control the string sorting in C++? There might be differences. And even it there are not, a comment on this would be helpful so that no one screws it up later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original implementation seems to use a python list and not check for duplicates, so I think it allows duplicates. Python sort and C++ one on strings should sort lexicographically. But I'll be more sure after further testing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be confusing C++ with Linux, but there are definitely more ways to sort strings - Linux sort, or maybe some other app, is controlled by LOCALE (I think). Just check iPython <-> C++, and add some note. In general, because we try to exactly reimplement some other code here, comments on these things are highly desirable.

src/fileformat/types/import_table/elf_import_table.cpp Outdated Show resolved Hide resolved
src/fileformat/types/import_table/elf_import_table.cpp Outdated Show resolved Hide resolved
src/fileformat/types/import_table/import_table.cpp Outdated Show resolved Hide resolved
@HoundThe
Copy link
Member Author

  • Can you send me a file, where it produces some hash? I tried it on retdec-regression-tests/integration/ack/2018-09-13/ack.x64.gcc.O0.g.elf and I got newly added entries for SHA256, MD5, CRC32, but no TLSH. Is it working for you? M'I doing something wrong? This is exactly the thing that regression tests would solve automatically.

This one for example https://www.virustotal.com/gui/file/fb5b5659801beccc790444f5fecb6acc74c10469acf7554f8f1eeea033bbbfe8
I think the file you tested it on has too few symbols to generate the hash from. I am not sure what is the exact minimum input, from the tlsh source it seems to be this #define MIN_DATA_LENGTH 50, the binary you've tried has data length of only 6, as the only non-filtred import symbol is printf. Can be tested by the original telfhash implementation, which outputs tnull:

$ telfhash retdec-regression-tests/integration/ack/2018-09-13/ack.x64.gcc.O0.g.elf -d
ELFCLASS64
8 symbols found
1 symbols considered:
printf
retdec-regression-tests/integration/ack/2018-09-13/ack.x64.gcc.O0.g.elf  tnull

@PeterMatula
Copy link
Collaborator

Lets run TC tests.

@HoundThe
Copy link
Member Author

HoundThe commented Apr 1, 2021

I've moved TLSH to deps and wrote a CMakeList for it inspired by existing ones, but it's the first time I used CMake so I am not sure if I did it correctly, especially with the cond_add which I've set if RETDEC_ENABLE_FILEFORMAT (in option.cmake) is set.

I've added their license file content into the LICENSE-THIRD-PARTY file. But I am not sure what should I put into the license file and version file that I should put among the source files. I've used the latest tagged release https://github.com/trendmicro/tlsh/releases/tag/4.2.1 . The version values are in the CMakeList

@HoundThe
Copy link
Member Author

HoundThe commented Apr 2, 2021

After testing out the implementation on 464527 ELF binaries I've found few things about telfhash that I didn't really account for:

  • telfhash hashes also exported symbols, as in symbols that have value.
  • telfhash cares only about sections, if the binary has symbols in dynamic segments, telfhash doesn't use them even tho RetDec could because it's capable to extract them.

The first one results in a situation where we calculated telfhash, but because there are no imported symbols, there is no import table in the output, so we don't output telfhash. Maybe use something else strictly for the import table and use telfhash separately? That could also give the ability to use symbols from the dynamic segments.

@HoundThe HoundThe marked this pull request as ready for review April 7, 2021 00:45
@HoundThe
Copy link
Member Author

HoundThe commented Apr 7, 2021

I moved the telfhash to the ElfFormat and kept the existing implementation in import table as a general import hash. I've also added TLSH hash to every import table, because I thought why not, but I can remove it.

I've tested it and seems to work well compared to original implementation, I think it's ready for re-review.

@PeterMatula PeterMatula merged commit 0cdc9a1 into avast:master Apr 14, 2021
PeterMatula added a commit that referenced this pull request Apr 14, 2021
@HoundThe HoundThe deleted the telfhash branch April 14, 2021 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants