Remove ngrams and topic number #39
@AhmetCakar, hi again :) You could certainly try BERT without removing the n-grams, but I've found that kwx works better when they're removed. BERT is able to pick up semantics from sentences, so it's actually better if there's less cleaning and nothing added that wasn't in them originally. Basically we don't need to add in n-grams, as BERT is able to find the relationships of the words from context itself - we don't need to add in word tokens representing those relationships. I honestly think that some steps in the kwx cleaning process might even be too much for a BERT model - that maybe for BERT we should just be using the raw, uncleaned texts. This could be something you could try :) For your second question, could you explain what you mean by the number of threads? I'd be happy to get back to you with a bit more background once I understand what's confusing about it.
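As a side note, here's a rough sketch of the difference (not kwx's internal code, just an illustration using gensim, which isn't otherwise discussed in this thread): for a bag-of-words model you'd typically merge frequent word pairs into n-gram tokens, whereas for BERT you'd keep the sentences essentially as they are.

from gensim.models.phrases import Phrases, Phraser

docs = [
    ["new", "york", "is", "a", "big", "city"],
    ["she", "moved", "to", "new", "york", "last", "year"],
]

# For LDA-style models: detect frequent word pairs and merge them into single
# n-gram tokens such as "new_york" so the model can treat them as one unit.
bigram = Phraser(Phrases(docs, min_count=1, threshold=1))
docs_with_ngrams = [bigram[doc] for doc in docs]

# For BERT-style models: no n-gram merging is needed, since the transformer
# picks up multi-word meaning from the context of the full sentence.
raw_sentences = [" ".join(doc) for doc in docs]
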
Hello Andrew, I would like your help in understanding some of your code. What is the purpose of this piece of code:

import numpy as np
from kwx.utils import load_data, prepare_data
import matplotlib.pyplot as plt

sns.set(style="darkgrid")
pd.set_option("display.max_rows", 16)
display(HTML("<style>.container { width:99% !important; }</style>"))

Another thing:
Lastly:
I really appreciate any help you can provide.
Hi @Eman-2021-PhD :) Thanks for your compliments and your questions!

First question:

I'm assuming that the code that you're referring to is the imports at the top of examples/kw_extraction.ipynb (https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb), but correct me if I'm wrong. The code is the imports of what's needed for running the notebook - everything from kwx, pandas, numpy, and the plotting packages - and along with that are some notebook-specific imports that I always put at the top of my Jupyter notebooks. Again I'm assuming the notebook-specific imports are what's confusing. Here's a rundown of those :)

sns.set(style="darkgrid")
sns.set(rc={"figure.figsize": (15, 5)})

The above sets the background style of the plots with seaborn, and also determines how big all the plots will be. You can see the output in the Graph of Topic Number Evaluations section.

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:99% !important; }</style>"))

I'm noticing that the above should be together rather than separated by a line, and just fixed that. What this does is expand the display of your Jupyter notebook to close to the full width of the screen so that you have more space to work with. You first import the Jupyter (IPython) notebook's ability to interact with the HTML display, and then you set the width to 99% of the width of the screen (I've found that 100% can cause the scroll bar to disappear).

Second question:

I'm very much hoping that kwx can work for Arabic, and wish you luck on your project. In the example you should just need to change the language to "arabic" like so:

from kwx.utils import prepare_data

input_language = "arabic"  # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

input_language would then also need to be passed to extract_kws as in the examples. kwx doesn't allow for lemmatisation (https://en.wikipedia.org/wiki/Lemmatisation) for Arabic, so it instead will stem (https://en.wikipedia.org/wiki/Stemming) the words using NLTK's SnowballStemmer("arabic").

Third question:

I'm not sure about resources, but you should be able to apply kwx to a set of documents directly. All you'd need to do is set up a list of the texts or put them into a pandas dataframe. Say that you have a dataframe df_arabic_texts where each row is a different text that can be found in the column "texts". In the above example you'd do:

from kwx.utils import prepare_data

input_language = "arabic"
text_corpus = prepare_data(
    data=df_arabic_texts,
    target_cols="texts",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

You'd then use text_corpus in extract_kws as directed by the examples.

Hope that the above helps :) Let me know if you have further questions, and again good luck!
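As a rough illustration of that last step, here's a hedged sketch; the import path and the example values are my assumptions, mirroring the extract_kws call that appears later in this thread, so treat it as an illustration rather than verified code.

from kwx.model import extract_kws  # assumed import path

num_keywords = 10  # example values, adjust for your project
num_topics = 10

# Pass the prepared Arabic corpus to the keyword extraction step.
bert_kws = extract_kws(
    method="BERT",
    text_corpus=text_corpus,
    input_language=input_language,  # "arabic"
    num_keywords=num_keywords,
    num_topics=num_topics,
)
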
Many thanks for your reply.
Your explanation is very clear and helpful.
You're very welcome!

Hello Again,
Could you please explain the function of this code:
freq_kws = extract_kws(
method="frequency",
bert_st_model=None,
text_corpus=text_corpus,
input_language=input_language,
output_language=None,
num_keywords=num_keywords,
num_topics=num_topics,
corpuses_to_compare=None,
ignore_words=None,
prompt_remove_words=False,
)
Do you have any idea how to extract the top keywords after extracting all of the keywords? I mean, how to extract the top keywords that have the highest frequency.
Thanks in advance,
Hi @Eman-2021-PhD :)

method="frequency" is just going to return the words that occur the most in the documents, which can be considered to be keywords in a simplistic sense.

One way that you could extract high frequency keywords is to run kwx twice over your documents: once with method="LDA" or method="BERT", and a second time with method="frequency". You could then compare the outputs and take only those words from the first run that also appear in the second :) You might need to increase the value for num_keywords in extract_kws so that you get enough words that overlap in the two runs, but it definitely will work. This is actually an example of one of the original use cases for kwx :D

Let me know if you have further questions about the arguments in extract_kws.

All the best!
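In rough code, that two-run comparison could look something like the sketch below; the import path is an assumption, and it assumes extract_kws returns a flat list of keyword strings.

from kwx.model import extract_kws  # assumed import path

num_keywords = 50  # a larger value so the two runs have room to overlap
num_topics = 10

bert_kws = extract_kws(
    method="BERT",
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=num_keywords,
    num_topics=num_topics,
)

freq_kws = extract_kws(
    method="frequency",
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=num_keywords,
    num_topics=num_topics,
)

# Keep only the BERT keywords that also rank among the most frequent words,
# preserving the order in which the BERT run returned them.
high_freq_kws = [kw for kw in bert_kws if kw in set(freq_kws)]
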
Thank you for the clarification.
You're very welcome, and further regards!

Hi Andrew, it's me again :)
I want to ask two questions about the algorithm.
When using the BERT model, why do we remove n-grams, and can't we use the texts without removing them?
My second question: when using BERT we give the number of keywords and the number of topics. How does the number of threads work, and what is the logic behind it?