
regex-ml

A tool for extracting references to the Babylonian Talmud from a given corpus, using weak supervision machine learning methods.

This project tests whether machine learning tools can be useful for information-tagging tasks. It is part of a larger project, “The Jewish Book Closet”, and focuses on tagging references to Hebrew sources - in this case, the Babylonian Talmud.

In the past, regular expressions were used for the task of finding these references, but they have proven difficult to work with, especially with Hebrew sources, and therefore a machine learning approach was tested.
One of the most difficult steps when working with machine learning is the creation of a large enough data set for the machine to learn from. Our purpose was to create that data set using weak supervision machine learning methods.


This tool creates a labeled data set containing short sequences of an input text, determining which sequences are references to the Babylonian Talmud and which are not:
The program receives a CSV file as input, breaks it into sentences, and then breaks each sentence into sequences (n-grams) in a configurable range of sizes. Using the Snorkel and Pandas Python libraries, it applies manually predefined labeling functions to label the sequences and create the tagged data set.
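For illustration, here is a minimal sketch of the n-gram extraction step, assuming each row of the input CSV already holds one sentence (the real program does its own sentence splitting, and the names below are not the project's actual API):

import pandas as pd

MIN_N_GRAM_SIZE = 2  # illustrative values; the real constants live in the utility file
MAX_N_GRAM_SIZE = 5

def sentence_to_ngrams(sentence):
    # Break one sentence into word n-grams of every size in the configured range.
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(MIN_N_GRAM_SIZE, MAX_N_GRAM_SIZE + 1)
            for i in range(len(words) - n + 1)]

# Read the input corpus and collect all n-grams into one DataFrame.
corpus = pd.read_csv("Data/csvRes.csv")
df = pd.DataFrame({"text": [ng for sentence in corpus.iloc[:, 0].astype(str)
                            for ng in sentence_to_ngrams(sentence)]})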


For detailed information about the project, please visit the wiki page 📚📜.


In order to run this project, you'll need Python version 3.7.1.
Install the requirements from the requirements file:

pip install -r requirements.txt

If you run into trouble with the torch installation, try installing it manually, then install the requirements:

pip install torch===1.1.0 torchvision===0.3.0 -f https://download.pytorch.org/whl/torch_stable.html

pip install -r requirements.txt


  • Snorkel: https://www.snorkel.org/
  • scikit-learn: https://scikit-learn.org/
  • Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/
  • Pandas: https://pandas.pydata.org/


The process consisted of three steps:

  • First, creating a labeled data set - preparing the data set involved extracting the text from a CSV file, dividing it by sentences into n-grams of different sizes, creating the labeling functions, and labeling with the Snorkel MajorityLabelVoter model. It also involved cleaning the resulting labels of unnecessary duplications that using different n-gram sizes may have caused.
  • Second, using transformations on the tagged data set to enlarge it - the transformations were based on replacing masechtot and masechet-chapter names.
  • Third, training the classifier - training the classifier using the Logistic Regression linear model (scikit-learn), with the labeled data set we created as input. A minimal sketch of the first and third steps appears below.
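As a rough sketch of the first and third steps, assuming a DataFrame df of n-grams (with a "text" column) and a list lfs of labeling functions; this glue code is illustrative, not the project's actual main.py:

from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import MajorityLabelVoter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Step one: apply the labeling functions and resolve their votes by majority.
applier = PandasLFApplier(lfs=lfs)
L = applier.apply(df=df)                       # one column of votes per labeling function
df["label"] = MajorityLabelVoter().predict(L)  # -1 means the voters abstained
df = df[df["label"] != -1]                     # keep only rows that received a label

# Step three: train a Logistic Regression classifier on the labeled data.
X = CountVectorizer().fit_transform(df["text"])
X_train, X_test, y_train, y_test = train_test_split(X, df["label"], test_size=0.30)
clf = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))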
  • Q: What is a reference to the Babylonian Talmud?
    A: Here is an example (roughly: "and in chapter Tinoket", citing tractate Berakhot, folio 69):

":ובפרק תינוקת (ברכות דף ס"ט)"

  • Q: How were the labeling functions decided?
    A: By a preliminary manual overview of examples of references to the Babylonian Talmud.
    More information can be found in the wiki, and an illustrative example appears after this list.
  • Q: Why was the n-gram format chosen?
    A: It seemed the most adequate and allowed us to include references of different lengths. Since we want the tagging to be as accurate as possible, we go over a range of n-gram sizes.
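For illustration, a Snorkel labeling function for this task might look like the sketch below; the heuristic and names are made up for this example and are not the project's actual functions:

import re
from snorkel.labeling import labeling_function

REFERENCE, ABSTAIN = 1, -1

@labeling_function()
def lf_contains_daf(x):
    # Vote REFERENCE if the n-gram contains the word דף (folio) followed by Hebrew letters.
    return REFERENCE if re.search(r'דף\s+[א-ת"\']+', x.text) else ABSTAIN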

More clarifications will be added in the future if necessary.


The project consists mainly of the following files:

Root directory:
  • main.py - the main part of the project; includes the labeled-data creation and the training of the classifier.
  • labeled_function.py - contains the labeling functions and their descriptions.
  • transformation_functions.py - contains the transformation functions used to enlarge the labeled data set.
  • utility.py - contains utility functions such as text parsing.

Data directory:
  • analysis file - contains output analysis for every run: labeling-function coverage and classifier accuracy.
  • csvRes - the expected input text in CSV format.
  • df_test.csv and df_train.csv - 30-70 split of the labeled data used to train the classifier.
  • labeled_data - the output labeled data.
  • labeled_data_augmented.csv - the output labeled data, including additions from the transformation functions.

  1. 🍴 Fork or 👯 Clone this repo to your local machine.
  2. Take the input file (Hebrew text, of course), turn it into a CSV file, name it "csvRes.csv" and put it in the Data directory.
  3. Set the following constants, which appear in the utility file (example values appear after this list):
  • SAMPLE_SIZE : the number of rows to use from the CSV file.
  • MIN_N_GRAM_SIZE and MAX_N_GRAM_SIZE : determine the range of n-gram sizes.
  • TRANSFORMATION_FACTOR : determines the number of transformations for each label that contains a masechet or masechet-chapter name. It needs to be between 0 and the total number of masechtot/perakim.
  • TEST_RATIO = 0.30 : how to split the train and test datasets for the classifier training.
  4. Run main.py .
  5. Check the results in the analysis file and explore 🔨
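For example, a configuration for a small run could look like this; all values except TEST_RATIO are illustrative:

# In the utility file:
SAMPLE_SIZE = 1000         # use the first 1000 rows of csvRes.csv
MIN_N_GRAM_SIZE = 2        # shortest sequence to consider
MAX_N_GRAM_SIZE = 5        # longest sequence to consider
TRANSFORMATION_FACTOR = 3  # 3 transformed copies per label containing a masechet name
TEST_RATIO = 0.30          # 30-70 test/train split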


In main.py:

  • load_labeled_data - extracts n-grams from the CSV input file.
  • apply_lf_on_data - applies the labeling functions to the data set and tags it.
  • apply_tf_on_data - applies the transformation functions to the labeled data set (see the sketch below).
  • train_model - trains the classifier and outputs the results.
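To give a feel for what apply_tf_on_data does, a Snorkel transformation function can swap one masechet name for another; the sketch below (including its names and list) is hypothetical and not the project's actual code:

import random
from snorkel.augmentation import transformation_function

# A small, hypothetical subset of masechtot names.
MASECHTOT = ["ברכות", "שבת", "עירובין", "פסחים"]

@transformation_function()
def tf_replace_masechet(x):
    # Replace a masechet name in the n-gram with a different one, keeping the label valid.
    for name in MASECHTOT:
        if name in x.text:
            x.text = x.text.replace(name, random.choice([m for m in MASECHTOT if m != name]))
            return x
    return None  # no masechet name found; Snorkel discards None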


In general, you can take a look at the data folder to see some results.

An example of good results can be seen in the data7000.rar file - an archive containing results from a run with SAMPLE_SIZE of 7000.
The history of our data creation and classifier runs can be found in the analysis file - check it out.
Please run main.py with your own parameters to get some more results :)


The working process showed that the task at hand was much easier than with regular expressions alone, especially when dealing with Hebrew sources. Most importantly, it resulted in a large tagged data set, which would have been impossible to create manually.
To test whether the data set is satisfactory for a machine to learn from, we created a basic classifier using the data set and then checked it on a small test set.

The next step is to take the data set this tool creates and train a classifier that will tag any input text.
We believe that with a further understanding of existing machine learning tools, it will be possible to achieve even better and more meaningful results.



For further explanation, please check out the wiki page.
Also, check out the Snorkel website mentioned under resources. Consider changing the labeling and transformation functions if you see fit.
The main function calls several important functions whose purposes are described thoroughly in the code.

Good luck!
