The Search Patent application is intended to provide search functionality against a patent database hosted in Azure. The database is made accessible through Azure Search, with a Python Flask front end that uses d3 extensions.
The first time the application is opened, it takes some time to load its cache: typically about 15 seconds, but up to 1 minute. Loading the website after that initial load is much faster.
Presentation of application is available at: https://youtu.be/E3A5gvE8YCg
Detailed presentation including application install to run locally can be found at: https://youtu.be/6pafuFNN4Dg
Good news: this install is not required.
More details than required: here are the instructions to install this on your computer locally. This code has been run in a Python 3.6.1 environment in Azure and in Python 3.6.2 running locally on a PC.
Steps
- Clone Repository
$ git clone https://github.com/megado123/searchpatent.git
- Install requirements.
> python -m pip install wheel
Note that requirements.txt is located in your local repository, so the path in the example below will need to be modified accordingly.
> python -m pip install --upgrade -r requirements.txt
- Manually run the setup.py file to download the NLTK stopwords
- Note: you don't have to run this file. The SQLite database is included, but it can be dropped and created through the runserver.py file. Run runserver.py with the argument 'dropdb' to drop the database, or with the argument 'initdb' to create the database.
- To run the application, set app.py as the startup file
python.exe should indicate that the application is running on http://localhost:5555
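The 'dropdb' and 'initdb' arguments described in the steps above can be pictured with a small sketch. This is not the code in runserver.py: it uses the standard-library sqlite3 module rather than SQLAlchemy, and the table names, columns, and the `run_command` helper are all hypothetical, chosen only to illustrate the idea of dispatching on a command-line argument.

```python
import sqlite3

# Hypothetical schema for illustration; the real tables are defined
# through the application's SQLAlchemy models.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    username TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS searches (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    criteria TEXT NOT NULL,
    searched_at TEXT NOT NULL
);
"""


def initdb(conn):
    """Create the tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def dropdb(conn):
    """Drop the tables; child table first because it references users."""
    conn.executescript(
        "DROP TABLE IF EXISTS searches; DROP TABLE IF EXISTS users;"
    )


def table_names(conn):
    """List the tables currently present in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def run_command(command, conn):
    """Dispatch a command-line argument to the matching database action."""
    commands = {"initdb": initdb, "dropdb": dropdb}
    if command not in commands:
        raise ValueError(f"unknown command: {command!r}")
    commands[command](conn)
    return table_names(conn)
```

With a connection such as `sqlite3.connect(":memory:")`, calling `run_command("initdb", conn)` creates both tables and `run_command("dropdb", conn)` removes them again.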
Additional files for deployment into Azure have also been included as part of this repository, because our goal was to use one repository for both deployment and source control. The files needed for a successful deployment into Azure are listed below.
- web.config
- downloader.py (can be run in Azure environment to bring up the nltk ui)
- ptvs_virtualenv_proxy.py
- .skipPythonDeployment (skip the standard deployment)
- web.config (empty) file in the static folder
- The HTML5 Templates used for this application initially came from Initializr
- There is a great series on Pluralsight called 'Introduction to the Flask Microframework' that served as a foundation for building out the front end using Python, Flask, and Jinja2 templates
- SQLite is the database used to hold past searches and user login information
- The patent database is hosted in an Azure SQL Server database
- A view of the data was created and indexed using Azure Search
- The Python web application is hosted in Azure (see the highlights for getting that working on the Azure platform)
An organization that thrives on innovation by inventing new and unique products of value to the marketplace must have a solid footing on existing patents. A failure to defend itself against a patent infringement lawsuit could erase the profit meant to compensate for the research and development of a new product, work which can extend over many years. Due to the sensitive nature of the information that researchers are exploring, using Internet search tools is frequently prohibited, so internal proprietary tools are needed. Our vision is to put together an application that provides patent data relevant to a researcher’s needs and interests. The application will utilize text mining and analysis techniques to enhance the researcher’s experience. The ultimate vision is that the researcher gains better information more quickly.
The application is hosted at http://searchpatent.azurewebsites.net.
Alternatively, the application can be downloaded from GitHub and installed locally. In either case, the application issues a call to the Azure Search API to get the results from the primary repository hosted on Azure.
Whether the application is run remotely or locally, a SQLite database is used to store recent searches and user information. When a search is requested, the Python Flask web application calls Azure Search, which has indexed the patent information and ranks it using TF-IDF. The indexed data was pulled from http://www.patentsview.org/download/ and placed into a SQL Server database housed in Azure.
The data returned from Azure Search is then used to populate a word cloud from the titles, and gensim is used to generate LDA and HDP topic models. In addition, the top-ranked companies by number of patents are displayed to give a researcher immediate insight into competition and contributors in a relevant technical area. Finally, the search results are provided in a short table giving key information along with the abstract of each patent.
- Present form to end user for search criteria
- Provide criteria to the Azure Search API and receive results set
- Identify word tokens by frequency using the patent titles.
- Identify topics in the search results. Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) methods are used to generate topic models, and both models are displayed.
- Display results including patent metadata and patent abstract for browsing.
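As a rough illustration of the "top ranked companies" display mentioned above, the sketch below counts how often each assignee appears in a result set. It is only a sketch under assumptions: the `assignee` field name and the shape of the result rows are hypothetical, not taken from the application's code.

```python
from collections import Counter


def top_assignees(results, n=5):
    """Return the n most frequent assignee names with their patent counts.

    `results` is assumed to be a list of dicts, one per patent, with a
    hypothetical "assignee" key; rows without a usable name are skipped.
    """
    counts = Counter(
        row["assignee"].strip()
        for row in results
        if row.get("assignee") and row["assignee"].strip()
    )
    return counts.most_common(n)


# Usage with made-up rows standing in for search results.
sample = [
    {"assignee": "Acme Corp"},
    {"assignee": "Acme Corp"},
    {"assignee": "Globex"},
    {"assignee": None},
    {"title": "no assignee field"},
]
ranking = top_assignees(sample, n=2)
```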
- Creation of a user account with password
- User identification and authentication
- Maintaining a history of patent search criteria, date, time by user
- Initialize local SQLite database
- Drop and re-create local SQLite database
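The account and history features listed above could look roughly like the following. This is a minimal sketch using only the standard library; the application itself goes through SQLAlchemy and the Flask login manager, and the table layout, function names, and the choice of PBKDF2 for password hashing are assumptions made for illustration.

```python
import hashlib
import hmac
import os
import sqlite3
from datetime import datetime, timezone


def hash_password(password, salt=None):
    """Derive a salted PBKDF2 digest so the plain password is never stored."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def create_tables(conn):
    # Hypothetical schema: one table for accounts, one for search history.
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS users (
            username TEXT PRIMARY KEY,
            salt BLOB NOT NULL,
            password_hash BLOB NOT NULL
        );
        CREATE TABLE IF NOT EXISTS searches (
            id INTEGER PRIMARY KEY,
            username TEXT NOT NULL REFERENCES users(username),
            criteria TEXT NOT NULL,
            searched_at TEXT NOT NULL
        );
        """
    )


def sign_up(conn, username, password):
    salt, digest = hash_password(password)
    conn.execute(
        "INSERT INTO users (username, salt, password_hash) VALUES (?, ?, ?)",
        (username, salt, digest),
    )


def log_in(conn, username, password):
    """Return True only if the user exists and the password matches."""
    row = conn.execute(
        "SELECT salt, password_hash FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return False
    _, digest = hash_password(password, row[0])
    return hmac.compare_digest(digest, row[1])


def record_search(conn, username, criteria):
    """Store the search criteria with a UTC date and time for the user."""
    conn.execute(
        "INSERT INTO searches (username, criteria, searched_at) VALUES (?, ?, ?)",
        (username, criteria, datetime.now(timezone.utc).isoformat()),
    )


def search_history(conn, username):
    rows = conn.execute(
        "SELECT criteria, searched_at FROM searches WHERE username = ? ORDER BY id",
        (username,),
    )
    return rows.fetchall()
```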
The application was developed in Python with the Flask library to provide the interactive user features. The search results are viewed in HTML frames, which use d3 to present the data in a graphical word cloud as well as in HTML tables.
During the application install, a subdirectory named “searchpatent” is created, and several application setup files are copied to that directory. A sub-directory named “patentsearch” is also created. Patentsearch functions as a Python library and includes the important library modules forms.py, views.py and models.py. The startup program is app.py.
Hierarchy of major application files:
- Searchpatent (directory)
  - Requirements.txt
  - Setup.py
  - Runserver.py
  - App.py
  - Patentsearch (directory)
    - Forms.py
    - Views.py
    - Models.py
    - Templates (directory)
      - Find.html
      - Results2.html
__init__.py holds the configuration for the Flask login manager and the SQLite database.
AnotherTest.py is a simple test added to check NLTK library functionality from the Kudu command console within the Azure environment.
home.py was the initial application start, and remains with simple functions to retrieve information from the SQLite database.
Forms.py contains the definitions of 3 form classes for user interaction:
Form | Overview |
---|---|
Search | A set of search criteria fields is available to the user. A button labeled “submit” is available once the user has supplied the desired criteria. |
Login | Fields for user ID and password are presented to the user. A button labeled “login” is available. |
Sign-Up | Fields for user name, email address, user ID, and two password entries are available to the user. A button labeled “create account” is available. |
The classes define how the input is displayed and how it is validated before submission, using WTForms fields and validators.
Views.py contains these functions:
Function | Overview |
---|---|
Find | Confirm user-entered search criteria and call makerequest(). The makerequest function makes the API call to Azure Search. |
Login | Confirm user credentials |
Logout | Remove current user settings in app memory |
Load_user | Get user search history |
Models.py contains these functions:
Function | Overview |
---|---|
Search | Retrieve history of searches |
SearchFields | Process search fields |
SearchData | Initialize memory variables to process search results returned. Call bow(). Call GetTops(). |
bow | Tokenize title data. Remove stopwords, punctuation, and set lower case. Calculate term frequency. |
GetTops | Tokenize abstract text. Remove stopwords, punctuation, and set lower case. Call LDA model function in gensim library. Call HDP function in gensim library. Set up results for tabular display. |
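The steps listed for bow (tokenize, lower-case, drop stopwords and punctuation, count terms) can be sketched with the standard library alone. This is not the application's bow function: the tiny `STOPWORDS` set is a stand-in for the NLTK stopword list, and the regular-expression tokenizer is an assumption.

```python
import re
from collections import Counter

# Small illustrative stopword set; the application uses the NLTK stopwords.
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "the", "to", "with"})


def bow(titles, stopwords=STOPWORDS):
    """Count term frequency across patent titles.

    Each title is lower-cased and split into alphanumeric tokens, which
    drops punctuation; stopwords are removed before counting.
    """
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts


# Usage with made-up titles.
titles = [
    "Method and apparatus for battery charging",
    "Battery charging system",
    "A System for Wireless Charging",
]
frequencies = bow(titles)
```

The resulting counts are the kind of term-to-frequency mapping a word cloud needs, with more frequent terms drawn larger.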
The following HTML templates are used:
Template | Overview |
---|---|
Find.html | Form for user to enter search criteria and desired sort. |
Results2.html | Form to present results to user. Includes word cloud of frequent terms, LDA topic model, HDP topic model, and patent metadata and abstract data. |
404.html | Shown when the user requests a page that is not found, ex: http://searchpatent.azurewebsites.net/dog |
505.html | Shown when an exception occurs; provides the ability to send an email if the user thinks it occurred in error. |
Base.html | Base template from which the other templates inherit. |
form_macro.html | Template providing the ability to display field errors. |
login.html | Login template. |
signup.html | Allows a user to sign up (recall that validation is provided in forms.py). |
user.html | Allows viewing past searches, ex: http://searchpatent.azurewebsites.net/user/megado123 |
index.html | Welcome page for the application, displaying recently performed searches. |
We selected the PatentsView data source (http://patentsview.org) for patent data. The site offers all data from 1976 to 2016 in different tables organized for a relational database. The PatentsView website (http://www.patentsview.org/download) offers a total of 52 tables for download. Many are reference tables to the multiple categorization codes available. The core patent data in scope for our solution is covered in 8 tables (rawassignee, rawlocation, brf_sum_text, patent, assignee, patent_assignee, location_assignee, location). This data was combined using a SQL Server View. A SQL database hosted on Azure is loaded with more than 6 million patents and related data.
An additional SQLite database is part of the application and is used to store the user logon credential data. It also stores the search criteria submitted to date for each user. In the future this application will use Azure AD for authentication and authorization of users. Given these security considerations, we decided to use a SQLite database, which could easily be replaced thanks to the use of SQLAlchemy as the ORM.
Azure offers a search function and API. The solution designer would configure as many indices on the data as necessary. Our application requires only one index which is a search on patent abstract data. The search API is configured to receive an index name and search criteria fields including a sort field. The API is called by an HTTP GET. The results set is determined by the TF-IDF score or by the date. The result size returned can range from 10 to 30 results. Our index returns several fields including the patent title, company or organization that holds the patent, date patent granted, and location including country, state, and city.
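To make the HTTP GET described above concrete, here is a minimal sketch that only builds the request URL and does not call the service. The service name, index name, `date` sort field, and the `build_search_url` helper are hypothetical; `search`, `$top`, `$orderby`, and `api-version` are query parameters of the Azure Search REST API, and the API version shown is an assumption about which one the application targets.

```python
from urllib.parse import urlencode


def build_search_url(service, index, query, top=10, order_by=None,
                     api_version="2016-09-01"):
    """Build the GET URL for a search-documents request.

    `top` is limited to the 10-30 range of result sizes described above.
    If `order_by` is omitted, results come back ranked by relevance score.
    """
    if not 10 <= top <= 30:
        raise ValueError("top must be between 10 and 30")
    params = {"api-version": api_version, "search": query, "$top": top}
    if order_by:
        params["$orderby"] = order_by
    # urlencode percent-encodes "$" as %24 and spaces as "+".
    return (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/docs?{urlencode(params)}"
    )


# Usage with placeholder names; a real request would also need an
# "api-key" header carrying a query key for the search service.
url = build_search_url(
    "example-service", "patents", "solar panel", top=20, order_by="date desc"
)
```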