Fix documentation #4

Merged: 4 commits merged on Aug 21, 2018
4 changes: 3 additions & 1 deletion .gitignore
@@ -72,4 +72,6 @@ ___*
dask-worker-space
*.parquet
*.zip
*.pkl
*.pkl
*.bib.bak
*.bib.sav
7 changes: 2 additions & 5 deletions TODO.rst
@@ -6,13 +6,10 @@ Proposed analyses

There are many types of analyses which are already implemented or planned.

- [x] Replication of the algorithm of [BH2007]_.
- [x] Replication of the algorithm of Bessen and Hunt (2007).
- [ ] Replacing old results of Random Forest implementation with a current
implementation.
- [ ] Improving the algorithm of Bessen and Hunt (2007) on the same indicator
data with machine learning methods.
- [ ] Machine and deep learning techniques using textual data.
- [ ] Network analysis of patents with the citation data at [PATENTSVIEW]_.

.. [BH2007] https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1530-9134.2007.00136.x
.. [PATENTSVIEW] http://www.patentsview.org/download/
- [ ] Network analysis of patents with the citation data from PT.
135 changes: 134 additions & 1 deletion src/documentation/introduction.rst
@@ -1 +1,134 @@
.. include:: ../../README.rst
Introduction
------------

This project deals with the identification of software patents and combines
multiple approaches from simple algorithms to novel machine learning models to
achieve this goal.


Background
----------

The origin of this project was a Bachelor's thesis built on the algorithmic
approach of :cite:`bessen2007empirical`. The authors wanted to estimate the
number of software patents, find out where software patents are used, and
identify which economic indicators are correlated with the number of software
patents in certain industries.

To classify patents into categories of software and non-software, the authors
developed a simple algorithm based on the evaluation of a random sample of
patents. The algorithm is as follows:

..

(("software" in specification) OR ("computer" AND "program" in
specification))

AND (utility patent excluding reissues)

ANDNOT ("chip" OR "semiconductor" OR "bus" OR "circuit" OR "circuitry" in
title)

ANDNOT ("antigen" OR "antigenic" OR "chromatography" in specification)

Whereas the title is straightforward to identify, the specification is defined
as the abstract plus the description of the patent (`PatentsView`_ splits what
:cite:`bessen2007empirical` call the description into a description and a
summary).
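
For illustration, the keyword rule above could be sketched in Python roughly
as follows; the function and argument names are illustrative and not taken
from the project's code base, which may tokenize the texts instead of using
plain substring matching.

.. code-block:: python

    def is_software_patent(title: str, specification: str, is_utility: bool) -> bool:
        """Sketch of the Bessen and Hunt (2007) keyword rule (illustrative only)."""
        title, spec = title.lower(), specification.lower()

        software_terms = "software" in spec or ("computer" in spec and "program" in spec)
        excluded_title = any(
            word in title
            for word in ("chip", "semiconductor", "bus", "circuit", "circuitry")
        )
        excluded_spec = any(
            word in spec for word in ("antigen", "antigenic", "chromatography")
        )

        return is_utility and software_terms and not excluded_title and not excluded_spec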

To replicate the algorithm, the project relies on two strategies. The first
data source is `Google Patents <https://patents.google.com/>`_, from which the
texts can be crawled. As this procedure is not feasible for the whole corpus
of patents, the second data source is `PatentsView`_, which provides large
data files covering all patents from 1976 onwards.
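
As a rough illustration of the second strategy, the following snippet streams
one of the bulk archives to disk and unpacks it; the URL and file paths are
placeholders and not the links used by the project.

.. code-block:: python

    import zipfile
    from pathlib import Path

    import requests

    # Placeholder URL; the actual bulk-download links are listed on the
    # PatentsView download page.
    url = "https://example.com/patentsview/brf_sum_text.tsv.zip"
    target = Path("data/raw/brf_sum_text.tsv.zip")
    target.parent.mkdir(parents=True, exist_ok=True)

    # Stream the download so that the multi-gigabyte archive never has to
    # be held in memory at once.
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with target.open("wb") as fh:
            for chunk in response.iter_content(chunk_size=2 ** 20):
                fh.write(chunk)

    with zipfile.ZipFile(target) as archive:
        archive.extractall(target.parent)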

The replication of the original algorithm succeeds in 398 of 400 cases: one
patent was retracted, and in one case an indicator was overlooked, which led
to an error in the classification.

Compared to the manual classification of the authors, the algorithm performed
in the following way:

+-------------------+----------+--------------+
| | Relevant | Not Relevant |
+===================+==========+==============+
| **Retrieved** | 42 | 8 |
+-------------------+----------+--------------+
| **Not Retrieved** | 12 | 337 |
+-------------------+----------+--------------+

Relevant refers to software patents according to the manual classification,
whereas retrieved indicates software patents detected by the algorithm. The
upper left cell therefore contains the true positives, whereas the lower
right cell shows the number of true negatives.
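
From these counts, standard retrieval metrics follow directly; the snippet
below is not part of the project code and only illustrates the calculation.

.. code-block:: python

    true_positives = 42     # retrieved and relevant
    false_positives = 8     # retrieved but not relevant
    false_negatives = 12    # relevant but not retrieved
    true_negatives = 337    # neither retrieved nor relevant

    precision = true_positives / (true_positives + false_positives)  # 42 / 50 = 0.84
    recall = true_positives / (true_positives + false_negatives)     # 42 / 54 ≈ 0.78

    print(f"precision: {precision:.2f}, recall: {recall:.2f}")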

Applying the algorithm to the whole patent corpus yields the following
distributions of patents and of software versus non-software patents.

.. raw:: html

<p align="center">
<b>Absolute Number of Utility Patents</b><br>
<img src="_static/fig-patents-distribution.png"
width="600" height="400">
</p>

.. raw:: html

<p align="center">
<b>Absolute Number of Software vs. Non-Software Patents</b><br>
<img src="_static/fig-patents-distribution-vs.png"
width="600" height="400">
</p>

.. raw:: html

<p align="center">
<b>Relative Number of Software vs. Non-Software Patents</b><br>
<img src="_static/fig-patents-distribution-vs-shares.png"
width="600" height="400">
</p>


Installation
------------

To play with the project, clone the repository to your disk with

.. code-block:: bash

$ git clone https://github.com/tobiasraabe/software_patents

After that, create an environment with ``conda`` and activate it by running

.. code-block:: bash

$ conda env create -n sp -f environment.yml
$ activate sp

If you only want to download the files for reproducing the analysis based on
the indicators, run the following commands for downloading and validating:

.. code-block:: bash

$ python prepare_data_for_project download --subset replication
$ python prepare_data_for_project validate

(If you want the raw data or everything, use ``--subset raw`` or
``--subset all``. Note that you need about 60 GB of free space on your disk.
Furthermore, handling the raw data requires an additional step in which the
files are split into smaller chunks so that they fit into the memory of your
machine. These steps require some knowledge of `Dask
<https://dask.pydata.org/en/latest/>`_. You can find more on this `here
<https://github.com/tobiasraabe/software_patents/blob/master/src/documentation/
data.rst>`_.)
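
As a sketch of what such a chunking step could look like, the snippet below
reads a large tab-separated PatentsView file lazily with Dask and writes it
back out as partitioned Parquet files; the file names are placeholders.

.. code-block:: python

    import dask.dataframe as dd

    # Read the file lazily in roughly 64 MB blocks so that no single
    # partition has to hold the whole file in memory.
    df = dd.read_csv(
        "data/raw/detail_desc_text.tsv",
        sep="\t",
        blocksize="64MB",
        dtype=str,
    )

    # Persist the partitions as Parquet so that later steps can process
    # one chunk at a time.
    df.to_parquet("data/processed/detail_desc_text.parquet")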

Then, run the following two commands to replicate the results.

.. code-block:: bash

$ python waf.py configure distclean
$ python waf.py build

.. _PatentsView: http://www.patentsview.org/web/
1 change: 1 addition & 0 deletions tox.ini
@@ -42,6 +42,7 @@ changedir = src/documentation
deps =
sphinx
sphinxcontrib.bibtex
ipython
jupyter_client
nbsphinx
sphinx_rtd_theme