My GSoC'21 final report

The file already contains enough descriptive text Signed-off-by: Hritik Vijay <[email protected]>
aboutcode-org · Aug 21, 2021 · 115590a · 115590a
1 parent 6c433b8
commit 115590a
Showing 1 changed file with 173 additions and 0 deletions.
diff --git a/docs/source/contribute/gsoc21.rst b/docs/source/contribute/gsoc21.rst
@@ -0,0 +1,173 @@
+Google Summer of Code 2021 Final Report
+============================================
+
+Organization - `AboutCode <https://www.aboutcode.org>`_
+-----------------------------------------------------------
+| `Hritik Vijay <https://github.com/hritik14>`_
+| Project: `VulnerableCode <github.com/nexB/vulnerablecode>`_
+
+Overview
+---------
+VulnerableCode is a decentralized python program to collect data about open
+source software vulnerabilities across the internet.  My proposal for this
+year's Google Summer of Code involved improving the import speed, refactoring
+existing code, finding points for overall improvement and adding importers.
+
+Detailed Report
+-----------------
+
+Improve Import Time
+^^^^^^^^^^^^^^^^^^^^
+Profiling showed that a lot of time was being wasted during auto commits
+undertaken by django. Wraping the importer in an atomic block avoids lots of
+database commits and shows huge performance improvement. This simple change
+allows for much faster import times while not drastically changing the code
+structure::
+
+    Alpine: 202.7s -> 50.9s
+    Archlinux 2116.6s -> 107.8s
+    Gentoo 3176.3s -> 225.8s
+
+Yielding an average of 93% reduction in time (14x faster)
+
+More: https://github.com/nexB/vulnerablecode/pull/478
+
+Speed up upstream tests
+^^^^^^^^^^^^^^^^^^^^^^^^
+VulnerableCode performs upstream tests for all the importers to make sure that
+any change change in upstream data structure is easily spotted. This allows us
+to have a look at failing importers without actually deploying the application.
+
+Earlier, all of the importers were run one by one in order to verify that they
+are intact. While this being the obvious and the full proof way to detect any
+anomalies in the imported data schema, it did not work because the time
+required to run all the importers much exceeded 6 hours - which is the maximum
+time allowed for GitHub actions to run. 
+With this PR, the updated_advisories method of each importer is expected to
+create at least one Advisory object. If it does so, the importer is marked
+working. While this is not full proof, it stays much below the allowed resource
+usage cap. In the end, this is a trade off between resource usage and data
+accuracy.  This brings major performance improvement during the test.
+
+| Before: ~6hrs, now ~9 minutes
+| More: https://github.com/nexB/vulnerablecode/pull/490
+
+Improve Docker Configuration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+The preferred mode of deployment for VulnerableCode is deploying using Docker
+images. Docker configuration existing earlier was very insecure and
+rudimentary. I took the inspiration for a uniform Docker configuration from the
+ScanCodeIO project and provided with detailed documentation for installation
+using a docker image. The current configuration makes use of files like
+``docker.env`` to supply container's environment and ``.dockerignore`` to skip
+over any unnecessary files for deployment.
+
+| More:
+| https://github.com/nexB/vulnerablecode/pull/497
+| https://github.com/nexB/vulnerablecode/pull/521
+
+Add Makefile
+^^^^^^^^^^^^^
+Makefile usage is prevalent in sister projects like `ScanCodeIO
+<https://github.com/nexB/scancode.io>`_. It gives VulnerableCode a consistent
+behavior and provides a very friendly interface for invocations. This also
+avoids security risks like having a default django ``SECRET_KEY`` as it can be
+easily generated by a make target.  I added a Makefile which has a similar
+usage as that of ScanCodeIO, replaced all the CI tests to use make, updated the
+relevant part of the documentation and updated settings to reject insecure
+deployments.
+
+| More:
+| https://github.com/nexB/vulnerablecode/pull/497
+| https://github.com/nexB/vulnerablecode/pull/523
+
+Use svn to collects tags in GitHubTagsAPI
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Surprisingly, GitHub allows svn requests to repositories. Now we can
+have all the tags with a single request. This is much more efficient and
+gentle to the APIs.
+This was as issue since the importers based on GithubDataSource were `failing
+<https://github.com/nexB/vulnerablecode/issues/507>`_ because of being rate
+limited by GitHub.
+
+| `Philippe <https://github.com/pombredanne>`_, thank you so much for the suggestion
+| More: https://github.com/nexB/vulnerablecode/pull/508
+
+Separate import and improve operations - WIP
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This introduces a new concept of ``improver``. Earlier, data fetching and
+improvement were done as one single process by ``importer``. This meant that
+importers were convoluted and not very modular. The concept of ``improver``
+comes from the idea that an ``importer`` should only do one thing - import. Any
+further improvement on the data is delegated to the improvers. This allows for
+us to have multiple ways of improvement with certain confidence on the improved
+data making the import and improve operations modular and simpler to work with.
+As a bonus, writing importers will be very easy and welcome more contributors
+to the project. As of writing this report, this remains a work in progress
+which will be finished very soon.
+
+More: https://github.com/nexB/vulnerablecode/pull/525
+
+Others
+^^^^^^^
+- helper: split_markdown_front_matter: https://github.com/nexB/vulnerablecode/pull/443
+- Dump yaml in favor of saneyaml https://github.com/nexB/vulnerablecode/pull/452
+- Refactor package_managers https://github.com/nexB/vulnerablecode/pull/495/commits
+- Importers bugfix https://github.com/nexB/vulnerablecode/pull/505
+
+Pre GSoC
+----------
+
+I started to like VulnerableCode as soon as I laid eyes on the project. While
+exploring the codebase, I realized that there is a lot of room for improvement.
+Thus I looked for simple improvements and bugs to fix in the early stage, which
+were:
+
+- `Correct API docs path and fix pytest invocation <https://github.com/nexB/vulnerablecode/pull/379>`_
+- `Explicity provide lxml parser to beautifulsoup <https://github.com/nexB/vulnerablecode/pull/382>`_
+- `Make sure vulnerability id is_cve or is_vulcoid <https://github.com/nexB/vulnerablecode/pull/389>`_
+- `Fix istio importer <https://github.com/nexB/vulnerablecode/pull/395>`_ (cleared a huge confusion about the codebase)
+- `Add me to AUTHORS <https://github.com/nexB/vulnerablecode/pull/405>`_ (Should've done this a lot earlier)
+- `Add unspecified scoring system <https://github.com/nexB/vulnerablecode/pull/415>`_
+- `Fix redhat import failure <https://github.com/nexB/vulnerablecode/pull/418>`_ (This one took a *lot* of effort to pinpoint)
+- `expose find_all_cve helper <https://github.com/nexB/vulnerablecode/pull/439>`_
+
+Post GSoC - Future Plans
+-------------------------
+I wish to carry on with the development of VulnerableCode and implement the
+ideas suggested by my mentors. This will require a lot of effort to bring
+VulnerableCode to a stable point. I hope to see VulnerableCode integrated into
+the ScanCode toolkit happen in a near future.
+
+Further, if possible, I would like VulnerableCode to interact with other great
+open source tools like *Eclipse Steady* and *Prospector*.  VulnerableCode,
+currently, works statically to collect all the vulnerabilities from different
+data sources, meanwhile there have been some developments with the Prospector
+project of Eclipse Steady. The project aims to scan fix-commits of the git
+repository in order to find out if the vulnerable part of a library was
+actually used in a project. It is not always the case that if a library is
+vulnerable then all the projects building upon it would be vulnerable too. It
+is crucial to identify if it is worth updating the library in use and dealing
+with the breaking changes.  *Prospectus* is undergoing improvements in order to
+be released as a usable public tool. *Project KB* (Under Eclipse Steady) is
+also working on a "tool support for mining repositories and databases of
+advisories to establish the (missing) link between vulnerabilities (as
+described in natural language in the advisories) and the corresponding
+fix-commits".  When these projects are ready for public use I would like to add
+them to VulnerableCode as a modules. I hope this will benefit both the projects
+and the downstream.
+
+Closing Thoughts
+-------------------
+I really enjoyed working on the project. There were ups and downs when I met
+some weird bugs but every one of them taught me something new about Python,
+Django and programming in general. The best part of working with my amazing
+mentors - Philippe and Shivam - were the `weekly meets
+<https://github.com/nexB/vulnerablecode/wiki/WeeklyMeetings#meeting-on-tuesday-2021-08-17-at-1400-utc>`_
+where we would together try to figure out how to proceed with the development.
+I learned something new with every call and interaction we had. Thank you so
+much my mentors for providing a very smooth experience and Google for showing
+me the guiding light for participation.
+
+To the reader, I would really like you to read `this <https://en.wikipedia.org/wiki/Program_optimization#When_to_optimize>`_
+before Philippe asks you to ;)