git clone git@git.cs.slu.edu:psd-project/wordfinder.git
If you get a "permission denied" error or are prompted for a password, set up an SSH key as follows:
As the last step, paste the contents of your local "id_rsa.pub" file into "SSH keys" in your GitLab Settings (open Settings from the top-right corner; "SSH keys" is in the left sidebar).
If you run into any problems, please feel free to contact me.
Alternatively, you can clone over HTTPS with an access token: git clone https://oauth2:<your-access-token>@git.cs.slu.edu/psd-project/wordfinder.git
but this is not the recommended way.
We currently have a demo version. To try it:
First, select the English language.
Second, enter the word "sink", then click the "Find" button.
Enjoy the demo! For now it only supports displaying results.
What is the main functionality of our product?
- support multiple languages
- given a word and its part of speech, return the corresponding sentences as fast as possible
- Should then “cluster” those sentences into examples with related senses, and present one or more “clusters” of example sentences to the user
- Must allow the user to examine, then change the number of clusters
- Database: gather and store text corpora in many languages in a way that makes queries of the type we want (word/part-of-speech lookup) fast and easy
- Analysis: code to cluster example sentences containing given word; interesting machine learning approaches here that I’ll explain eventually!
- Front end: simple, usable interface; must work on any platform, and should support messages/menu items in multiple languages
What is the main functionality of the alpha version (deadline March 19)?
- support at least two languages
- finish the design of the database tables, including one word-tag-sentence table per language, such as the English table:
word_name | pos_tag | sentence |
---|---|---|
sink | NOUN | Don't just leave your dirty plates in the sink! |
sink | VERB | The wheels started to sink into the mud. |
sink | VERB | How could you sink so low? |
- Also, using the fields above, we should tag every word in our selected corpus and write the results into per-language tables, such as a table called English_data, another called Chinese_data, etc.
- finish the front-end design: it should work on any platform, provide a text box for entering a word, and support messages/menu items in multiple languages, etc.
- a simple algorithm to implement the “cluster” functionality: group the sentences found by a search into examples with related senses.
- allow users to change the number of clusters.
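
A minimal sketch of the word/part-of-speech lookup planned above. It uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server. The table name English_data and the three columns come from the plan; the index name idx_word_pos and the two helper functions are hypothetical.

```python
import sqlite3

# Table name follows the "<Language>_data" pattern from the plan.
TABLE = "English_data"


def build_demo_db():
    """Create an in-memory database holding the three example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} (word_name TEXT, pos_tag TEXT, sentence TEXT)"
    )
    # Hypothetical index: keeps the word/POS lookup fast as the table grows.
    conn.execute(f"CREATE INDEX idx_word_pos ON {TABLE} (word_name, pos_tag)")
    rows = [
        ("sink", "NOUN", "Don't just leave your dirty plates in the sink!"),
        ("sink", "VERB", "The wheels started to sink into the mud."),
        ("sink", "VERB", "How could you sink so low?"),
    ]
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def find_sentences(conn, word, pos_tag):
    """Return the sentences stored for exactly this word form and POS tag."""
    cur = conn.execute(
        f"SELECT sentence FROM {TABLE} WHERE word_name = ? AND pos_tag = ?",
        (word, pos_tag),
    )
    return [row[0] for row in cur.fetchall()]


conn = build_demo_db()
print(find_sentences(conn, "sink", "VERB"))
```

Because the query compares word_name with "=", a search for sink does not return rows stored under sunk or sinking.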
Tips:
- in the alpha version we don't need to care too much about the number of words; maybe one million words is OK, but we need to support at least two languages
- Note a possible little trick: sort the table in alphabetical order.
- The preferred SQL database is MySQL
- NOTE: We only return sentences that contain the searched word exactly, e.g. sink but not sunk or sinking.
- Universal Dependencies POS tag types:
- more important and useful links about how we develop this project have been put in the tmp folder
What is the main functionality of the beta version (deadline April 9)?
to do
what's the main function of the final version (deadline )?
to do
tips:
- based on Dr. Scannell's materials, which contain the important corpora we need, like UD, and POS tagging tools like UDPipe. Once we have built some code, we can write data to our database tables, which is very important.
- Python as the development language, delivered as a web application
- our repository: https://git.cs.slu.edu/psd-project/wordfinder/-/project_members
- Flask as the web framework, for convenience
- unit tests
HERE we make development plans, discuss them, and approve them. Then we should follow these plans. If a problem comes up during development, tell us in time so that the group can solve it together before the deadline.
2/16/2021 - 2/21/2021 TASKS
- Develop UI in any language
- Obtain Corpus
- Clean the Corpus (tokenization, lemmatization, and stemming)
- Tag the data according to POS
Discussion list:
1 discuss NLTK and UDPipe; the key point is multi-language support
2 decide on the corpora for 7-8 languages
3 load the UDPipe pre-trained model, then train on our corpora from item 2
4 write the results to our database; core fields: word, POS tag, sentence
5 cluster sentences to get example sentences.
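
Steps 3 and 4 can be sketched as follows, assuming the tagger's output is in CoNLL-U format (the format UDPipe can emit). Loading the model and running the tagger are not shown; the function name conllu_to_rows and the sample text are hypothetical, and only the three core fields (word, POS tag, sentence) are kept.

```python
def conllu_to_rows(conllu_text):
    """Turn CoNLL-U text into (word, pos_tag, sentence) tuples."""
    rows = []
    # Sentences are separated by blank lines in CoNLL-U.
    for block in conllu_text.strip().split("\n\n"):
        sentence = ""
        tokens = []
        for line in block.splitlines():
            if line.startswith("# text = "):
                sentence = line[len("# text = "):]
            elif line and not line.startswith("#"):
                fields = line.split("\t")
                # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
                if fields[0].isdigit():
                    tokens.append((fields[1], fields[3]))  # FORM, UPOS
        rows.extend((form, upos, sentence) for form, upos in tokens)
    return rows


# Hand-written sample in CoNLL-U layout (10 tab-separated columns per token).
SAMPLE = "\n".join([
    "# text = Wheels sink.",
    "\t".join(["1", "Wheels", "wheel", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "sink", "sink", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

for row in conllu_to_rows(SAMPLE):
    print(row)
```

Each tuple maps directly onto one row of a word-tag-sentence table, so the same sentence is stored once per token it contains.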
- User interface
- English corpora
- POS Tag
- Decide NLTK or CorPy
- Multilingual functionality
- Start writing to csv to build database structure
select table_name from information_schema.tables where table_schema='mysql';
new features:
- finish development of POS tagging, based on the UDPipe pre-trained models and available for multiple languages, including:
- base_model.py
- train_model.py
- base data structure: result_model.py
- finish applying for a database at hopper.slu.edu, which hosts our web server and database store. Our training jobs can be put on this server to keep running all the time.
- finish development of the MySQL store model; the module is store.py
unfinished features:
- corpora for many other languages
- clustering
- 1 methods for getting corpora for many languages
- 1.1 Wikipedia: language abbreviations: https://zh.wikipedia.org/wiki/ISO_639-1
- 1.2 how to get a corpus via Wikipedia: https://jdhao.github.io/2019/01/10/two_chinese_corpus/
-
database, tables structures
- current tables structure
- wordpos table and sentence table
updating and cleaning the database all the time
- add the cluster function using word2vec; the gensim library can do it @Zhen Guo
- on the web interface we should add a display for the cluster task @Zhen Guo
-
add logging for every key step
-
test task, cleaning the database @Willie @Haris
-
deploy to hopper.slu.edu
-
alpha version release
I have found a problem in our repository that needs our attention. Everyone has their own file paths, and they differ from each other:
for example the path of the training corpus, the path of the cluster model, and the path of the database config. These file paths cannot be pushed to our base repository!
We should think of a nice way to solve this issue, and I have an idea: we should maintain a common relative path and put all data files and config data inside it. There is another important thing to remember: don't push corpora or pre-trained models to our base repository. We should keep them on a shared remote disk and share a link so that everyone in our group can use them.
I have created a folder named input with three folders inside it: corpus, udpipemodel, and word2vecmodel. All the files in them are hosted at
download: https://pan.baidu.com/s/14RzwuGjTZwsUhiyVSe-Pgg password: td3e
Download them and put them in the root directory of the wordfinder folder.
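
One way to keep personal paths out of the repository is to derive every data path from the project root. The sketch below assumes the input folder layout described above; the helper names input_paths and missing_inputs are hypothetical, and it assumes scripts are run from the wordfinder root.

```python
from pathlib import Path

# Sub-folders of "input" named in the note above.
INPUT_SUBDIRS = ("corpus", "udpipemodel", "word2vecmodel")


def input_paths(project_root):
    """Map each shared data folder to its path under <project_root>/input."""
    input_dir = Path(project_root) / "input"
    return {name: input_dir / name for name in INPUT_SUBDIRS}


def missing_inputs(project_root):
    """List the data folders that have not been downloaded yet."""
    return [
        name
        for name, path in input_paths(project_root).items()
        if not path.is_dir()
    ]


# Assumption: the current working directory is the wordfinder root.
print(missing_inputs(Path.cwd()))
```

Since nothing here depends on a user-specific absolute path, the same code works on every machine once the shared files are downloaded.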
1. database: we should build a remote DB @Willie
2. word2vec: two methods of doing that @Zhen
3. we should label every sentence and show all sentences with their label on the cluster web interface @all
- review the code we have pushed to the base github repo @all
- with the models we have, train more languages: train_model.py writes to the database, cluster_model.py builds the word2vec model (it doesn't need to be stored in the database, so everyone can do it) @all
- test every py module; everyone is welcome to report the bugs they find @all
- with the logging module, add logs before and after important events @all
- Time complexity is an issue we need to consider for this task.
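
Logging before and after important events can be done once in a decorator instead of by hand in every module. This is a minimal sketch with the standard logging module; the logger name "wordfinder", the decorator log_step, and the example function store_rows are hypothetical.

```python
import functools
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("wordfinder")  # hypothetical logger name


def log_step(func):
    """Log before and after a key step, and log the traceback on failure."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("start %s", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            logger.exception("failed %s", func.__name__)
            raise
        logger.info("finished %s", func.__name__)
        return result
    return wrapper


@log_step
def store_rows(rows):
    """Hypothetical key step: pretend to write rows and return the count."""
    return len(rows)


store_rows([("sink", "NOUN", "Don't just leave your dirty plates in the sink!")])
```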
-
DATABASE
- create an accessible db for everyone
- Will have to change util.py to connect to the new db
- check the output of the application
- Also we need to train more languages.
- Add more text files
-
KWIC
- we should highlight the selected word in each sentence
- Check the length of words on each side of selected word
- sentence by sentence
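
The KWIC items above can be sketched as follows: work sentence by sentence, highlight the selected word, and keep a fixed number of words on each side. The function name kwic, the default window of 5 words, and the "**" highlight marker are assumptions made for illustration.

```python
def kwic(sentence, word, window=5, mark="**"):
    """Return one keyword-in-context line per exact occurrence of word."""
    tokens = sentence.split()
    lines = []
    for i, token in enumerate(tokens):
        # Compare the token without surrounding punctuation, ignoring case,
        # so "sink!" matches "sink" but "sinking" does not.
        if token.strip(".,;:!?\"'()").lower() == word.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            # The whole token is highlighted, including attached punctuation.
            lines.append(" ".join(left + [mark + token + mark] + right))
    return lines


print(kwic("The wheels started to sink into the mud.", "sink", window=2))
```

In a web page the marker would more likely be an HTML tag than "**"; only the mark argument would need to change.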
-
CLUSTERING
- We should adjust our clustering algorithms
- Apply various algorithms in our cluster_model.py.
- Cluster after the user searches for a word
- For example, if we select the word excellent and find a sentence such as "He was an excellent journalist and a very fine man", then after clustering we expect to get similar sentences, such as "He is a very good man".
- Also need to set a default k value...
- Elbow Method
- will try to determine default k based on length of characters in selected
- Evaluate the quality of the clusters
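
The default-k idea (Elbow Method) listed above can be sketched in plain Python. This is only an illustration under stated assumptions: toy 2-D points stand in for sentence vectors (which in the project would come from word2vec), k-means uses a simple deterministic farthest-point start, and the "elbow" is taken as the k with the largest second difference of the inertia curve. The names kmeans and elbow_k are hypothetical and are not the contents of cluster_model.py.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; returns (labels, inertia).

    Inertia is the summed squared distance of each point to its centre.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic start: first point, then the point farthest from
    # all centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centres))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: dist2(p, centres[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    inertia = sum(dist2(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def elbow_k(points, k_max):
    """Pick k at the sharpest bend of the inertia curve (needs k_max >= 3)."""
    inertias = [kmeans(points, k)[1] for k in range(1, k_max + 1)]
    # Second difference at k = i + 1; bends[b] therefore belongs to k = b + 2.
    bends = [
        inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        for i in range(1, len(inertias) - 1)
    ]
    return bends.index(max(bends)) + 2


# Toy data: two well-separated groups of four points each.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
print(elbow_k(POINTS, 4))
```

The user can still override the suggested k afterwards, which matches the requirement that the number of clusters be adjustable.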