
Remove ngrams and topic number #39

Open
AhmetCakar opened this issue May 22, 2021 · 9 comments
Labels: question (Further information is requested)

Comments

@AhmetCakar

Hi Andrew, again me :)
I want to ask two questions about the algorithm.
First, when using the BERT model, why do we remove n-grams? Can't we use it without removing them?
My second question: when using BERT we give the number of keywords and the number of topics. How does the number of threads work, i.e. what is the logic there?

andrewtavis added the question label on May 23, 2021
@andrewtavis (Owner)

@AhmetCakar, hi again :)

You could certainly try BERT without removing the n-grams, but I've found that kwx works better when they're removed. BERT is able to pick up semantics from sentences, so it's actually better if there's less cleaning and nothing added that wasn't in the texts originally. Basically we don't need to add in n-grams, as BERT is able to find the relationships between words from context itself - there's no need for word tokens representing those relationships. I honestly think that some steps in the kwx cleaning process might even be too much for a BERT model - maybe for BERT we should just be using the raw, uncleaned texts. This could be something you could try :)
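In case it helps, here's a rough sketch of that experiment - feeding near-raw texts straight to extract_kws with method="BERT" instead of a cleaned, n-gram-annotated corpus. The texts and argument values are just placeholders, and the argument names follow the repo's examples:

from kwx.model import extract_kws

# Hypothetical raw tweets, with no prepare_data cleaning pass
raw_texts = [
    "Loving the new transit line through downtown!",
    "Downtown transit is finally getting an upgrade.",
]

bert_kws = extract_kws(
    method="BERT",
    text_corpus=raw_texts,  # uncleaned, no n-grams added
    input_language="english",
    num_keywords=10,
)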

For your second question, could you explain what you mean by the number of threads? I'd be happy to get back to you with a bit more background once I know what that refers to and how it's confusing.

@Keamww2021

Hello Andrew,
I'm learning about LDA, BERT, and keyword extraction.
You apply them all in your algorithm, which is great work.

I would like your help in understanding some of your code.

What is the purpose of this piece of code:
import os
import sys

import numpy as np
import pandas as pd

from kwx.utils import load_data, prepare_data
from kwx.utils import organize_by_pos, translate_output
from kwx.model import extract_kws, gen_files
from kwx.visuals import graph_topic_num_evals, pyLDAvis_topics
from kwx.visuals import gen_word_cloud, t_sne

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")
sns.set(rc={"figure.figsize": (15, 5)})

pd.set_option("display.max_rows", 16)
pd.set_option("display.max_columns", None)
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:99% !important; }</style>"))

Another thing:
Suppose I want to apply your code to Arabic tweets. Would that work?

Lastly:
I would like to apply it to a set of documents. Can you refer me to helpful resources?

I really appreciate any help you can provide.

@andrewtavis (Owner) commented Nov 7, 2021

Hi @Eman-2021-PhD :) Thanks for your compliments and your questions!

First question:
I'm assuming that the code you're referring to is the imports at the top of examples/kw_extraction, but correct me if I'm wrong. These are the imports needed to run the notebook - everything from kwx, pandas, numpy, and the plotting packages - along with some notebook-specific imports that I always put at the top of my Jupyter notebooks. I'm assuming the notebook-specific imports are what's confusing, so here's a rundown of those :)

sns.set(style="darkgrid")
sns.set(rc={"figure.figsize": (15, 5)})

The above sets the background style of the plots with seaborn, and also determines how big all the plots will be. You can see the output in the Graph of Topic Number Evaluations section.

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:99% !important; }</style>"))

I'm noticing that the above should be together rather than separated by a blank line, and I've just fixed that. What it does is expand the display of your Jupyter notebook to close to the full width of the screen so that you have more space to work with. You first import the Jupyter (IPython) notebook's ability to interact with the HTML display, and then you set the container width to 99% of the screen's width (I've found that 100% can cause the scroll bar to disappear).

Second question:
I'm very much hoping that kwx can work for Arabic, and wish you luck on your project. In the example you should just need to change the language to "arabic" like so:

from kwx.utils import prepare_data

input_language = "arabic" # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

input_language would then also need to be passed to extract_kws as in the examples. kwx doesn't allow lemmatisation for Arabic, so it will instead stem the words using NLTK's SnowballStemmer("arabic").
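As a quick illustration of what that stemming does (the example word is arbitrary):

from nltk.stem.snowball import SnowballStemmer

# NLTK's Snowball stemmer supports Arabic as a language option
stemmer = SnowballStemmer("arabic")
print(stemmer.stem("مكتبات"))  # prints the word reduced to its stem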

Third question:
I'm not sure about resources, but you should be able to apply kwx to a set of documents directly. All you'd need to do is set up a list of the texts or put them into a pandas dataframe. Say you have a dataframe df_arabic_texts where each row is a different text in the column "texts". Following the above example, you'd do:

from kwx.utils import prepare_data

input_language = "arabic"
text_corpus = prepare_data(
    data=df_arabic_texts,
    target_cols="texts",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

You'd then use text_corpus in extract_kws as directed by the examples.
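For concreteness, a minimal sketch of that call - the num_keywords and num_topics values are placeholders, and the argument names follow the examples:

from kwx.model import extract_kws

arabic_kws = extract_kws(
    method="BERT",
    text_corpus=text_corpus,  # from prepare_data above
    input_language=input_language,
    num_keywords=10,
    num_topics=10,
)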

Hope that the above helps :) Let me know if you have further questions, and again good luck!

@Keamww2021 commented Nov 7, 2021 via email

@andrewtavis (Owner)

You're very welcome!

@Keamww2021 commented Nov 10, 2021 via email

@andrewtavis (Owner)

Hi @Eman-2021-PhD :)

method="frequency" is just going to return the words that occur the most in the documents, which can be considered to be keywords in a simplistic sense.

One way that you could extract high frequency keywords is to run kwx twice over your documents: once with method="LDA" or method="BERT"; and a second time with method="frequency". You could then compare the outputs and take only those words from the first run that also appear in the second :) You might need to increase the value for num_keywords in extract_kws so that you get enough words that overlap in the two runs, but it definitely will work. This is actually an example of one of the original use cases for kwx :D
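Here's a minimal sketch of that two-run approach, assuming (as in the examples) that extract_kws returns a list of keyword strings:

from kwx.model import extract_kws

# Run 1: semantically derived keywords
bert_kws = extract_kws(
    method="BERT",
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=30,  # a larger value so the runs overlap enough
)

# Run 2: the highest frequency words
freq_kws = extract_kws(
    method="frequency",
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=30,
)

# Keep only the BERT keywords that are also high frequency,
# preserving the BERT ordering
high_freq_kws = [kw for kw in bert_kws if kw in freq_kws]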

Let me know if you have further questions about the arguments in extract_kws.

All the best!

@Keamww2021 commented Nov 11, 2021 via email

@andrewtavis (Owner)

You're very welcome, and further regards!
