
Packaging issues #412

Open
timokau opened this issue Oct 10, 2018 · 2 comments

timokau commented Oct 10, 2018

Packaging issues

I just updated the NixOS retdec package from version 3.0 to 3.2. That was not a pleasant experience, which is probably why the package remained outdated for so long. I don't know whether distribution packaging is a priority for this project, but in case it is, I will outline the problems here.

Retdec-support is huge

It is only 127M compressed, but 4G uncompressed. That takes up space on our mirror servers but, more importantly, on every user's computer. We currently remove the PE static code patterns (3G) by default and only install them when explicitly requested by the user. I've opened avast/retdec-support#3 for that.

Dependencies are fetched at build-time

This is by far the biggest issue. NixOS and many other distros separate the build into two distinct phases: one to fetch the sources and one to build. The build phase is executed in a sandbox that does not have internet access. There are various reasons for that, including:

  • reproducibility: If the build process could be influenced by the availability and content of files on the internet, there would be no way to guarantee that two different builds will yield the same result or even succeed at all.
  • security: Some of these files could be replaced by malicious ones.
  • mirroring: We are able to mirror all the source files in a content-addressable manner on our build servers.

Since the retdec build process fetches its dependencies at build time, a lot of patching is required. We have to fetch all the dependencies ourselves and then patch every CMakeLists file to use those local versions instead. Since that essentially duplicates parts of the build system, it requires a lot of manual adjustment on every upgrade.

I've started discussing this at #279.
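
For illustration, the patch we have to carry for each dependency is roughly of this shape (a hypothetical sketch, not retdec's actual build code; names and paths are made up):

```cmake
# Sketch of the per-dependency patching a distro package has to carry.
include(ExternalProject)

# Upstream fetches a pinned source archive from the network at build time.
# The distro patch points the same external project at a source tree that
# was already unpacked during the sandboxed fetch phase:
ExternalProject_Add(capstone-project
  SOURCE_DIR       "${CMAKE_SOURCE_DIR}/deps/capstone"  # pre-fetched locally
  DOWNLOAD_COMMAND ""                                   # never touch the network
  CMAKE_ARGS       -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
)
```

Multiply that by every dependency, on every upgrade, and the maintenance cost adds up.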

Dependency versions are very specific

All dependencies are pinned by commit hash. That makes things very difficult for distros, since they usually try to ship exactly one version of every package and share it between dependents. It is generally impossible to judge the true intention behind a pinned hash:
Is it "we need at least this version since we depend on feature x", or "newer versions break the API", or "we need exactly version 2.3"?
This, together with the lack of a central file that collects the required dependency versions, makes it painful to adjust the dependency versions on an upgrade.
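
Even a single annotated file collecting the pins would go a long way. A hypothetical sketch (the file name, variable names, and annotations are all made up):

```cmake
# deps.cmake -- hypothetical central list of dependency pins, with the
# intent behind each pin spelled out so packagers can judge what is safe.
set(RETDEC_CAPSTONE_REV "<commit-hash>")  # exact: we rely on our forked changes
set(RETDEC_YARA_REV     "<commit-hash>")  # exact: fork sets custom limits
set(RETDEC_LLVM_REV     "<commit-hash>")  # minimum: needs feature x, newer should work
```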

Dependencies are forked

Capstone and YARA are forked. Can the changes not be upstreamed?

Dependencies are built from source

While a distro would ideally build all the dependencies itself and then build retdec against those pre-compiled dependencies, retdec instead insists on building everything itself. The exception is openssl, which is only built if it cannot be found on the system. That should be the behaviour for all dependencies.
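
In CMake terms, the openssl behaviour generalizes to the usual find-system-first pattern. A minimal sketch, assuming a (hypothetical) Capstone find module and target names:

```cmake
# Sketch: prefer a system-provided dependency, fall back to the bundled build.
find_package(Capstone QUIET)            # hypothetical find module
if(Capstone_FOUND)
  set(CAPSTONE_LIB Capstone::Capstone)  # use the pre-compiled system library
else()
  message(STATUS "system capstone not found, building bundled copy")
  add_subdirectory(deps/capstone)       # or an ExternalProject fallback
  set(CAPSTONE_LIB capstone)            # hypothetical bundled target
endif()
```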

@PeterMatula
Collaborator

I will try to address everything mentioned here.

0. Basic things

  • We want to stick with CMake.
  • We don't want to go to git submodules.
  • This is not a priority, since I feel there is a lot of work to be done on improving the decompilation quality. But many of the points are valid and could/should be gradually solved.
  • I will write up the possible solutions, things to do, things to discuss, etc. here, and anyone can discuss them or help solve them. I will keep these points in mind when I end up modifying related parts, and I will try to implement them if possible.

1. Retdec-support is huge

  • See comment.
  • Some signatures are not very helpful for general RetDec use. They are mostly good for our regression tests, but it is unlikely they would hit many binaries in the real world; they are too specific for that. These are mostly gcc signatures for the compilers we used to generate binaries in regression tests.
  • The biggest signatures are, however, for Delphi and MSVC, which have distinct versions used all over, and therefore a decent chance of hitting real binaries.
  • Instead of compiled YARA, we could distribute text YARA rules and add a script that compiles them. This script could be triggered when RetDec is installed to the system (see the sketch after this list). The package would be smaller, the rules would still be compiled only once, and there would be no performance penalty.
  • We could experiment with signature sizes - maybe they do not have to be so huge.
  • Make it possible to selectively choose which signatures to install (or none at all). The exact mechanism is to be discussed, e.g. different distro packages?
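
A minimal sketch of such an install-time compile step, assuming yarac is available on the installing machine (the rule paths are made up):

```cmake
# Sketch: ship text YARA rules and compile them with yarac at install time.
install(DIRECTORY support/yara_patterns/ DESTINATION share/retdec/yara)
install(CODE "
  execute_process(
    COMMAND yarac ${CMAKE_INSTALL_PREFIX}/share/retdec/yara/x86.yar
                  ${CMAKE_INSTALL_PREFIX}/share/retdec/yara/x86.yarac
  )
")
```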

2. Dependencies are fetched at build-time

  • This is caused by CMake. How can we hack it?
  • Related to #279.
  • What are you allowed to do in the first phase when you are fetching the sources? Are you allowed (can you) run CMake's configure step? I.e. if we get all the resources in this step, would it be ok? We could script CMake to download everything that is needed and then configure itself as suggested in #279 to use these local instances instead of downloading them in the build step. If this is not possible, would some script that would do this be an acceptable solution? If the only thing you can do is `git clone`, then I'm not sure we can help.
  • It would be possible to create a git meta-repository that would use git submodules as you need, and then configure (#279) the build to use them. This repo would be used only to build the package for packaging. It would not be a part of the official organization. I'm not sure if you want to do something like this. I hope some of the above solutions could work instead.

3. Dependency versions are very specific

  • Analyze the dependencies we use and determine if we can use system packages. This should be possible in most cases.
  • If not possible, devise a solution.

4. Dependencies are forked

  • Capstone could/should be merged to upstream.
  • Our YARA fork sets some custom limits that are hardcoded in upstream YARA, so it cannot be merged in its current state. A possible solution is to make it more general: modify YARA so that these limits can be set at build time, or even at runtime, and try to merge that (see the sketch below). Then YARA could use its default limits, and we could set our own. It is not guaranteed that such a pull request would be accepted.
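
Sketch of the build-time variant (the macro name and values are illustrative, not necessarily what upstream YARA actually uses):

```cmake
# Sketch: if upstream guarded each limit with an #ifndef, e.g.
#
#   #ifndef YR_MAX_STRING_MATCHES
#   #define YR_MAX_STRING_MATCHES 1000000
#   #endif
#
# then a consumer like RetDec could override the limit from its own build,
# and the fork would no longer be needed:
add_definitions(-DYR_MAX_STRING_MATCHES=5000000)  # illustrative value
```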

5. Dependencies are built from source

  • I think this would be solved by solving points 3. and 4.

@timokau
Author

timokau commented Oct 17, 2018

Thank you for your reply, I'm glad to see that you take the issues seriously even though they are not top priority. TL;DR: solving (3) would go a long way, and (2) and (4) would basically no longer matter from a packaging perspective.

1. Retdec-support is huge

* See [comment](https://github.com/avast-tl/retdec-support/issues/3#issuecomment-429791800).

* Some signatures are not very helpful for general RetDec use. They are mostly good for our regression tests, but it is unlikely they would hit many binaries in the real world; they are too specific for that. These are mostly gcc signatures for the compilers we used to generate binaries in regression tests.

* The biggest signatures are, however, for Delphi and MSVC, which have distinct versions used all over, and therefore a decent chance of hitting real binaries.

Since the non-compiled signatures are relatively small, would it be reasonable to work with those and compile them to a cache on the fly? That should significantly cut down on the space overhead and only cost additional time on first use.

* Instead of compiled YARA, we could distribute text YARA rules and add a script that compiles them. This script could be triggered when RetDec is installed to the system. The package would be smaller, the rules would still be compiled only once, and there would be no performance penalty.

I think distributing text YARA rules and building them during the regular build process would be better practice than shipping the compiled results in either case. But if all rules are compiled on first use, there would be no real space improvement. As I suggested before, would it be possible to compile them as needed?

* We could experiment with signature sizes - maybe they do not have to be so huge.

* Make it possible to selectively choose which signatures to install (or none at all). The exact mechanism is to be discussed, e.g. different distro packages?

Maybe. In that case there would need to be some way to notify users when they get sub-optimal results because of missing signatures, so that they know they have to install them.

2. Dependencies are fetched at build-time

* This is caused by CMake. How can we hack it?

* Related to #279.

* What are you allowed to do in the first phase when you are fetching the sources? Are you allowed (can you) run CMake's configure step? I.e. if we get all the resources in this step, would it be ok? We could script CMake to download everything that is needed and then configure itself as suggested in #279 to use these local instances instead of downloading them in the build step. If this is not possible, would some script that would do this be an acceptable solution? If the only thing you can do is `git clone`, then I'm not sure we can help.

Whatever the "fetch phase" produces has to be compared against a known hash. So basically the rule is that we can do anything that will be 100% binary reproducible. If the whole configure step were binary reproducible, it would be possible, although a bit of a hack.

I don't know much about CMake -- is it not possible to separate fetching the dependencies from the configure phase?
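
For what it's worth, recent CMake (>= 3.11) does seem to offer such a separation via the FetchContent module: declared downloads can be pre-seeded and the build forbidden from touching the network. A sketch (the fork URL is just an example, the pinned commit a placeholder):

```cmake
# Sketch: FetchContent separates declaring/fetching a dependency from using it.
include(FetchContent)
FetchContent_Declare(capstone
  GIT_REPOSITORY https://github.com/avast-tl/capstone.git
  GIT_TAG        <pinned-commit>   # placeholder
)
FetchContent_GetProperties(capstone)
if(NOT capstone_POPULATED)
  FetchContent_Populate(capstone)  # a no-op download when pre-seeded (below)
  add_subdirectory(${capstone_SOURCE_DIR} ${capstone_BINARY_DIR})
endif()

# A packager pre-fetches the sources in the hash-checked phase and runs:
#   cmake -DFETCHCONTENT_SOURCE_DIR_CAPSTONE=/path/to/sources \
#         -DFETCHCONTENT_FULLY_DISCONNECTED=ON ..
```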

* It would be possible to create a git meta-repository that would use git submodules as you need, and then configure (#279) the build to use them. This repo would be used only to build the package for packaging. It would not be a part of the official organization. I'm not sure if you want to do something like this. I hope some of the above solutions could work instead.

That would have all the disadvantages of both solutions, wouldn't it? I thought the difficulty of maintaining such a repository was the main reason for using CMake instead of submodules in the first place?

3. Dependency versions are very specific

* Analyze the dependencies we use and determine if we can use system packages. This should be possible in most cases.

That would really be the greatest help. If the build system used traditional autoconf-style configure checks for the dependencies (dependency x can be found and has feature y / is at least version z), it wouldn't really matter very much how the dependencies are fetched. Maybe a ./configure --local flag could disable fetching altogether and instead just say

checking for dependency x version >= y... Not found

and then error. That would solve a lot.
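
The CMake analogue of such a check already exists for openssl; generalized, it might look like this (the option name and version number are hypothetical, openssl is just the example retdec already handles this way):

```cmake
option(RETDEC_LOCAL_DEPS "Never download; require system packages." OFF)

find_package(OpenSSL 1.0.1)  # checks both presence and minimum version
if(NOT OPENSSL_FOUND)
  if(RETDEC_LOCAL_DEPS)
    message(FATAL_ERROR "checking for openssl >= 1.0.1... not found")
  endif()
  # otherwise: fall back to fetching and building the bundled copy
endif()
```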

* If not possible, devise a solution.

4. Dependencies are forked

* Capstone could/should be merged to upstream.

This would be great and should also ease the maintenance burden on the retdec team in the long run.

* Our YARA fork sets some custom limits that are hardcoded in upstream YARA, so it cannot be merged in its current state. A possible solution is to make it more general: modify YARA so that these limits can be set at build time, or even at runtime, and try to merge that. Then YARA could use its default limits, and we could set our own. It is not guaranteed that such a pull request would be accepted.

The runtime solution would be perfect. The build-time solution would improve matters but would still require two separate yara packages.

5. Dependencies are built from source

* I think this would be solved by solving points 3. and 4.

Yes.
