diff --git a/.gitignore b/.gitignore index 05e3462..9ee7ffb 100644 --- a/.gitignore +++ b/.gitignore @@ -72,4 +72,6 @@ ___* dask-worker-space *.parquet *.zip -*.pkl \ No newline at end of file +*.pkl +*.bib.bak +*.bib.sav \ No newline at end of file diff --git a/TODO.rst b/TODO.rst index 8d04c6c..14ee958 100644 --- a/TODO.rst +++ b/TODO.rst @@ -6,13 +6,10 @@ Proposed analyses The are many types of analyses which are already implemented or planned. -- [x] Replication of the algorithm of [BH2007]_. +- [x] Replication of the algorithm of Bessen and Hunt (2007). - [ ] Replacing old results of Random Forest implementation with a current implementation. - [ ] Improving the algorithm of Bessen and Hunt (2007) on the same indicator data with machine learning methods. - [ ] Machine and deep learning techniques using textual data. -- [ ] Network analysis of patents with the citation data at [PATENTSVIEW]_. - -.. [BH2007] https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1530-9134.2007.00136.x -.. [PATENTSVIEW] http://www.patentsview.org/download/ +- [ ] Network analysis of patents with the citation data from PT. diff --git a/src/documentation/introduction.rst b/src/documentation/introduction.rst index a6210d3..3248278 100644 --- a/src/documentation/introduction.rst +++ b/src/documentation/introduction.rst @@ -1 +1,134 @@ -.. include:: ../../README.rst +Introduction +------------ + +This project deals with the identification of software patents and combines +multiple approaches from simple algorithms to novel machine learning models to +achieve this goal. + + +Background +---------- + +The origin of this project was a Bachelor's thesis built on the algorithmic +approach of :cite:`bessen2007empirical`. The authors wanted to estimate the +number of software patents and find out where software patents are used and +what economic indicators are correlated with the amount of software patents in +certain industries. 
+ +To classify patents into categories of software and non-software, the authors +developed a simple algorithm based on the evaluation of a random sample of +patents. The algorithm is as follows: + +.. + + (("software" in specification) OR ("computer" AND "program" in + specification)) + + AND (utility patent excluding reissues) + + ANDNOT ("chip" OR "semiconductor" OR "bus" OR "circuit" OR "circuitry" in + title) + + ANDNOT ("antigen" OR "antigenic" OR "chromatography" in specification) + +Whereas the title is simply identified, the specification is defined as the +abstract and the description of the patent (`PatentsView`_ splits what +:cite:`bessen2007empirical` define as the description into a description and a +summary). + +To replicate the algorithm, the project relies on two data sources. The first +is `Google Patents `_, where the texts +can be crawled. As this procedure is not feasible for the whole corpus of +patents, the second data source is `PatentsView`_, which provides large data +files for all patents from 1976 onward. + +The replication of the original algorithm succeeds in 398 of 400 cases: one +patent was retracted, and in one case an indicator was overlooked, which led to +an error in the classification. + +Compared to the manual classification of the authors, the algorithm performed +in the following way: + ++-------------------+----------+--------------+ +| | Relevant | Not Relevant | ++===================+==========+==============+ +| **Retrieved** | 42 | 8 | ++-------------------+----------+--------------+ +| **Not Retrieved** | 12 | 337 | ++-------------------+----------+--------------+ + +Relevant refers to software patents according to the manual +classification, whereas retrieved indicates software patents +detected by the algorithm. The upper left cell contains the +true positives, whereas the lower right cell shows the number of +true negatives. 
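The table above can be summarized with the usual retrieval metrics. The following is a minimal sketch, assuming the standard definitions of precision, recall, and accuracy; the variable names are illustrative and do not come from the project's code.

```python
# Counts copied from the confusion table above.
true_positives = 42    # retrieved and relevant
false_positives = 8    # retrieved but not relevant
false_negatives = 12   # relevant but not retrieved
true_negatives = 337   # neither retrieved nor relevant

total = true_positives + false_positives + false_negatives + true_negatives

# Share of retrieved patents that are software patents by manual classification.
precision = true_positives / (true_positives + false_positives)

# Share of manually classified software patents that the algorithm retrieved.
recall = true_positives / (true_positives + false_negatives)

# Share of all patents in the table that were classified correctly.
accuracy = (true_positives + true_negatives) / total

print(f"n={total} precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

The four cells sum to 399 patents, which matches the 400-patent sample minus the one retracted patent mentioned above.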
+ +Applying the algorithm to the whole patent corpus yields the following +distributions of utility patents and of software versus non-software patents. + +.. raw:: html + +

+ Absolute Number of Utility Patents
+ +

+ +.. raw:: html + +

+ Absolute Number of Software vs. Non-Software Patents
+ +

+ +.. raw:: html + +

+ Relative Number of Software vs. Non-Software Patents
+ +

+ + +Installation +------------ + +To play with the project, clone the repository to your disk with + +.. code-block:: bash + + $ git clone https://github.com/tobiasraabe/software_patents + +After that, create an environment with ``conda`` and activate it by running + +.. code-block:: bash + + $ conda env create -n sp -f environment.yml + $ activate sp + +If you only want to download the files for reproducing the analysis based on +the indicators, run the following commands for downloading and validating: + +.. code-block:: bash + + $ python prepare_data_for_project download --subset replication + $ python prepare_data_for_project validate + +(If you want to have the raw data or everything, use ``--subset raw`` or +``--subset all``. Note that you need about 60 GB of free space on your disk. +Furthermore, handling the raw data requires an additional step where the files +are split into smaller chunks so that they fit into the memory of your +machine. These steps require knowledge of `Dask +`_. You can find more on this `here +`_.) + +Then, run the following two commands to replicate the results. + +.. code-block:: bash + + $ python waf.py configure distclean + $ python waf.py build + +.. _PatentsView: http://www.patentsview.org/web/ diff --git a/tox.ini b/tox.ini index 3332793..5d0ba06 100644 --- a/tox.ini +++ b/tox.ini @@ -42,6 +42,7 @@ changedir = src/documentation deps = sphinx sphinxcontrib.bibtex + ipython jupyter_client nbsphinx sphinx_rtd_theme