git clone [email protected]:psd-project/wordfinder.git
If it prompts "permission denied" or asks for a password, follow this step: copy the contents of your local `id_rsa.pub` file into "SSH keys" in your GitLab Settings (top-right corner; "SSH keys" appears in the left sidebar).
By the way, if you have any problems, please feel free to contact me.
Alternatively, you can clone over HTTPS with an access token:
```shell
git clone https://oauth2:[email protected]/psd-project/wordfinder.git
```
but this is not the recommended way.
We currently have a demo version; to try it:
- First, select the English language.
- Second, enter the word "sink", then click the "Find" button.
Enjoy the demo! It is for demonstration only at this stage.
What is the main functionality of our product?
- Support multiple languages.
- Enter a word and its part of speech, and return corresponding sentences as fast as possible.
- Should then "cluster" those sentences into examples with related senses, and present to the user one or more "clusters" of example sentences.
- Must allow the user to examine, and then change, the number of clusters.
- Database: gather and store text corpora in many languages in a way that makes the queries we want (word/part-of-speech lookup) fast and easy.
- Analysis: code to cluster example sentences containing a given word; there are interesting machine-learning approaches here that I'll explain eventually!
- Front end: a simple, usable interface; it must work on any platform and should support messages/menu items in multiple languages.
What is the main functionality of the alpha version (deadline March 19)?
- Support at least two languages.
- Finish the design of the database tables, including one word-tag-sentence table per language, e.g. an English table:

| word_name | pos_tag | sentence |
|---|---|---|
| sink | NOUN | Don't just leave your dirty plates in the sink! |
| sink | VERB | The wheels started to sink into the mud. |
| sink | VERB | How could you sink so low? |
- Also, with the fields above, we should run the tagger over our selected corpus to get a tag for each word, then write the results into per-language tables such as `English_data`, `Chinese_data`, etc.
- Finish the front-end design, including availability on any platform, a text box for entering the word, and messages/menu items selectable in multiple languages.
- A simple algorithm to implement the "cluster" functionality: group the sentences found by a search into examples with related senses.
- Allow users to change the number of clusters.
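The table design above can be sketched in a few lines of Python. The snippet below uses the standard-library `sqlite3` in place of the MySQL database planned for production, and the table/column names follow the draft schema; treat it as an illustration, not the final DDL.

```python
import sqlite3

# SQLite stands in here for the planned MySQL store; the schema mirrors
# the word_name | pos_tag | sentence table sketched above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE English_data (
        word_name TEXT NOT NULL,
        pos_tag   TEXT NOT NULL,
        sentence  TEXT NOT NULL
    )
""")
# A compound index keeps the core word/POS lookup fast.
conn.execute("CREATE INDEX idx_word_pos ON English_data (word_name, pos_tag)")

conn.executemany("INSERT INTO English_data VALUES (?, ?, ?)", [
    ("sink", "NOUN", "Don't just leave your dirty plates in the sink!"),
    ("sink", "VERB", "The wheels started to sink into the mud."),
    ("sink", "VERB", "How could you sink so low?"),
])

# The core query: every sentence for a given word and part of speech.
verbs = conn.execute(
    "SELECT sentence FROM English_data WHERE word_name = ? AND pos_tag = ?",
    ("sink", "VERB"),
).fetchall()
print(len(verbs))   # 2
```

The same schema and index translate directly to MySQL once the real store is set up.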
Tips:
- In the alpha version we don't need to worry too much about the number of words; maybe one million words is OK, but we need to support at least two languages.
- Note a possible little trick: sort the table in alphabetical order.
- The preferred choice of SQL database is MySQL.
- NOTE: we only return sentences that contain the exact searched word, e.g. "sink" rather than "sunk" or "sinking".
- Universal Dependencies POS tag types:
- More important and useful links about how we develop this project have been put in the `tmp` folder.
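The exact-match rule in the tips above (return "sink" but not "sunk" or "sinking") can be implemented with a word-boundary regex. A minimal sketch; note that `\b` assumes space-delimited text, so a language like Chinese would need a tokenizer instead.

```python
import re

def exact_match(word, sentences):
    """Keep only sentences containing `word` as a whole token,
    so 'sink' does not match 'sinking' or 'sunk'."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

sentences = [
    "Don't just leave your dirty plates in the sink!",
    "The boat was sinking fast.",
    "The ship sunk last year.",
    "How could you sink so low?",
]
print(exact_match("sink", sentences))
# keeps the first and last sentence; 'sinking' and 'sunk' are filtered out
```

Running the filter after the database query keeps the SQL simple; alternatively the same rule could be enforced at insert time by storing one row per token.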
What is the main functionality of the beta version (deadline April 9)?
to do
What is the main functionality of the final version (deadline TBD)?
to do
Tips:
- Build on Dr. Scannell's materials, which contain important corpora we need (like UD) and tools for POS tagging (like UDPipe). Once we have some code built, we can write data into our database tables, which is very important.
- Python as the development language, delivered as a web application.
- Our repository: https://git.cs.slu.edu/psd-project/wordfinder/-/project_members
- Flask as the web framework, for convenience.
- Unit tests.
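UDPipe (mentioned in the tips above) emits CoNLL-U, where each token line has ten tab-separated columns (FORM is column 2, UPOS is column 4). A minimal standard-library parser that turns that output into (word, pos_tag, sentence) rows for our tables might look like this; the function name and row layout are our own, and the sample below is hand-written, not real UDPipe output.

```python
def parse_conllu(conllu):
    """Turn CoNLL-U text (e.g. UDPipe output) into
    (word, pos_tag, sentence) rows for the database."""
    rows, tokens, text = [], [], None
    for line in conllu.splitlines():
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif not line.strip():            # blank line closes a sentence
            rows += [(form, upos, text) for form, upos in tokens]
            tokens = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():         # skip multiword ranges like "3-4"
                tokens.append((cols[1], cols[3]))   # FORM, UPOS
    rows += [(form, upos, text) for form, upos in tokens]
    return rows

sample = "\n".join([
    "# text = How could you sink so low?",
    "1\tHow\thow\tADV\t_\t_\t2\tadvmod\t_\t_",
    "2\tcould\tcould\tAUX\t_\t_\t4\taux\t_\t_",
    "3\tyou\tyou\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "4\tsink\tsink\tVERB\t_\t_\t0\troot\t_\t_",
    "5\tso\tso\tADV\t_\t_\t6\tadvmod\t_\t_",
    "6\tlow\tlow\tADJ\t_\t_\t4\tadvcl\t_\t_",
    "7\t?\t?\tPUNCT\t_\t_\t4\tpunct\t_\t_",
])
print(parse_conllu(sample)[3])   # ('sink', 'VERB', 'How could you sink so low?')
```

Each returned row maps one-to-one onto the word_name/pos_tag/sentence columns of the per-language tables.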
HERE we make development plans, discuss them, and approve them. Then we should follow these plans as we start. If a problem comes up during development, you should tell us in time, and the group should solve it together before the deadline.
2/16/2021 - 2/21/2021 TASKS
Sprint 1
- Develop the UI in any language
- Obtain a corpus
- Clean the corpus (tokenization, lemmatization, and stemming)
- Tag the data according to POS
Sprint 2
Discussion list:
- 1 Discuss NLTK and UDPipe; the key point is multiple-language support.
- 2 Decide on corpora for 7-8 languages.
- 3 Load a UDPipe pre-trained model, then run it over our corpus from item 2.
- 4 Write the results to our database; core fields: word, POS tag, sentence.
- 5 Cluster the sentences to get example sentences.
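Step 5 (clustering the sentences) could start from something very simple. The sketch below is a hypothetical greedy algorithm that uses word-overlap (Jaccard) similarity as a stand-in for real sense vectors; it also takes `k` as a parameter, matching the requirement that users can change the number of clusters.

```python
import re

def tokens(sentence):
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", sentence.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_sentences(sentences, k):
    """Greedy sketch: seed k clusters with the first k sentences, then
    assign every other sentence to the seed it overlaps most with.
    (Placeholder similarity; word2vec vectors would replace Jaccard.)"""
    toks = [tokens(s) for s in sentences]
    clusters = [[i] for i in range(k)]
    for i in range(k, len(sentences)):
        best = max(range(k), key=lambda c: jaccard(toks[i], toks[clusters[c][0]]))
        clusters[best].append(i)
    return [[sentences[i] for i in group] for group in clusters]

sents = [
    "Don't just leave your dirty plates in the sink!",
    "The wheels started to sink into the mud.",
    "How could you sink so low?",
]
# k is user-settable, per the "change the number of clusters" requirement.
print(cluster_sentences(sents, 2))
```

On this toy input the kitchen sense and the two verb senses land in separate clusters, but that is an artifact of the overlapping function words; a real sense clustering needs the embedding-based approach planned for later sprints.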
| User interface | English corpora | POS tag |
|---|---|---|
| Decide NLTK or CorPy | Multilingual functionality | Start writing to CSV to build the database structure |
```sql
select table_name from information_schema.tables where table_schema='mysql';
```
New features:
- Finished development of POS tagging, based on UDPipe pre-trained models, available for multiple languages, including:
  - base_model.py
  - train_model.py
  - the base data structure: result_model.py
- Finished the application for a database on hopper.slu.edu, which hosts our web servers and database storage. Our training jobs can run on this server continuously.
- Finished development of the MySQL storage model; the module is store.py.
Unfinished features:
- corpora for many other languages
- clustering
Sprint #3 planning
- 1 Methods to get corpora for many languages
- 1.1 Wikipedia language abbreviations: https://zh.wikipedia.org/wiki/ISO_639-1
- 1.2 How to get a corpus via Wikipedia: https://jdhao.github.io/2019/01/10/two_chinese_corpus/
- 2 Database table structures
  - current structure: a wordpos table and a sentence table
  - keep updating and cleaning the database over time
- Add the cluster function using word2vec; the gensim library can do it.
- Update the web interface.
- Add logging.
- Testing tasks; clean the database.
- Deploy to hopper.slu.edu.
- Release the alpha version.
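For the word2vec-based cluster function in the plan above, gensim would supply trained word vectors. The sketch below uses tiny hand-made vectors as stand-ins for real embeddings, just to show the averaging-plus-cosine step that sentence clustering would build on; every number here is invented for illustration.

```python
import math

# Toy 3-d vectors standing in for trained word2vec embeddings
# (in the real pipeline, gensim's Word2Vec model would supply these).
vectors = {
    "sink":   [0.9, 0.1, 0.0],
    "plates": [0.8, 0.2, 0.1],
    "mud":    [0.1, 0.9, 0.2],
    "wheels": [0.2, 0.8, 0.1],
}

def sentence_vector(words):
    """Average the vectors of the words we have embeddings for."""
    known = [vectors[w] for w in words if w in vectors]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

kitchen = sentence_vector(["plates", "sink"])
driving = sentence_vector(["wheels", "mud", "sink"])
print(round(cosine(kitchen, driving), 2))   # ≈ 0.69
```

Clustering then reduces to grouping sentence vectors by cosine similarity, e.g. with k-means over the averaged vectors, where the user-chosen cluster count becomes k.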