Remove ngrams and topic number #39
@AhmetCakar, hi again :) You could certainly try BERT without removing the n-grams, but I've found that kwx works better when they're removed. BERT is able to pick up semantics from sentences, so it's actually better if there's less cleaning and nothing added that wasn't in them originally. Basically we don't need to add in n-grams, as BERT is able to find the relationships of the words from context itself - we don't need to add in word tokens representing those relationships. I honestly think that some steps in the kwx cleaning process might even be too much for a BERT model - that maybe for BERT we should just be using the raw, uncleaned texts. This could be something you could try :) For your second question, could you explain what you mean by the number of threads? I'd be happy to get back to you with a bit more background once I understand what's confusing about it.
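As a side note, here's a rough sketch of the difference (not kwx's internal code, just an illustration using gensim, which isn't otherwise discussed in this thread): for a bag-of-words model you'd typically merge frequent word pairs into n-gram tokens, whereas for BERT you'd keep the sentences essentially as they are.

from gensim.models.phrases import Phrases, Phraser

docs = [
    ["new", "york", "is", "a", "big", "city"],
    ["she", "moved", "to", "new", "york", "last", "year"],
]

# For LDA-style models: detect frequent word pairs and merge them into single
# n-gram tokens such as "new_york" so the model can treat them as one unit.
bigram = Phraser(Phrases(docs, min_count=1, threshold=1))
docs_with_ngrams = [bigram[doc] for doc in docs]

# For BERT-style models: no n-gram merging is needed, since the transformer
# picks up multi-word meaning from the context of the full sentence.
raw_sentences = [" ".join(doc) for doc in docs]
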
Hello Andrew, I would like your help in understanding some of your code. What is the purpose of this piece of code:

import numpy as np
from kwx.utils import load_data, prepare_data
import matplotlib.pyplot as plt

sns.set(style="darkgrid")
pd.set_option("display.max_rows", 16)
display(HTML("<style>.container { width:99% !important; }</style>"))

Another thing:
Lastly:
I really appreciate any help you can provide.
Hi @Eman-2021-PhD :) Thanks for your compliments and your questions!

First question:

I'm assuming that the code that you're referring to is the imports at the top of examples/kw_extraction.ipynb (https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb), but correct me if I'm wrong. The code is the imports of what's needed for running the notebook - everything from kwx, pandas, numpy, and the plotting packages - and along with that are some notebook-specific imports that I always put at the top of my Jupyter notebooks. Again I'm assuming the notebook-specific imports are what's confusing. Here's a rundown of those :)

sns.set(style="darkgrid")
sns.set(rc={"figure.figsize": (15, 5)})

The above sets the background style of the plots with seaborn, and also determines how big all the plots will be. You can see the output in the Graph of Topic Number Evaluations section.

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:99% !important; }</style>"))

I'm noticing that the above should be together rather than separated by a line, and just fixed that. What this does is expand the display of your Jupyter notebook to close to the full width of the screen so that you have more space to work with. You first import the Jupyter (IPython) notebook's ability to interact with the HTML display, and then you set the width to 99% of the width of the screen (I've found that 100% can cause the scroll bar to disappear).

Second question:

I'm very much hoping that kwx can work for Arabic, and wish you luck on your project. In the example you should just need to change the language to "arabic" like so:

from kwx.utils import prepare_data

input_language = "arabic"  # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

input_language would then also need to be passed to extract_kws as in the examples. kwx doesn't allow for lemmatisation (https://en.wikipedia.org/wiki/Lemmatisation) for Arabic, so it instead will stem (https://en.wikipedia.org/wiki/Stemming) the words using NLTK's SnowballStemmer("arabic").

Third question:

I'm not sure about resources, but you should be able to apply kwx to a set of documents directly. All you'd need to do is set up a list of the texts or put them into a pandas dataframe. Say that you have a dataframe df_arabic_texts where each row is a different text that can be found in the column "texts". In the above example you'd do:

from kwx.utils import prepare_data

input_language = "arabic"
text_corpus = prepare_data(
    data=df_arabic_texts,
    target_cols="texts",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

You'd then use text_corpus in extract_kws as directed by the examples.

Hope that the above helps :) Let me know if you have further questions, and again good luck!
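As a rough illustration of that last step, here's a hedged sketch; the import path and the example values are my assumptions, mirroring the extract_kws call that appears later in this thread, so treat it as an illustration rather than verified code.

from kwx.model import extract_kws  # assumed import path

num_keywords = 10  # example values, adjust for your project
num_topics = 10

# Pass the prepared Arabic corpus to the keyword extraction step.
bert_kws = extract_kws(
    method="BERT",
    text_corpus=text_corpus,
    input_language=input_language,  # "arabic"
    num_keywords=num_keywords,
    num_topics=num_topics,
)
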
Many thanks for your reply.
Your explanation is very clear and helpful.
You're very welcome!

Hello Again,
Could you please explain the function of this code:
freq_kws = extract_kws(
method="frequency",
bert_st_model=None,
text_corpus=text_corpus,
input_language=input_language,
output_language=None,
num_keywords=num_keywords,
num_topics=num_topics,
corpuses_to_compare=None,
ignore_words=None,
prompt_remove_words=False,
)
Do you have any idea how to extract the top keywords after extracting all of the keywords? I mean, how to extract the top keywords that have the highest frequency.
Thanks in advance,
Hi @Eman-2021-PhD :)

method="frequency" is just going to return the words that occur the most in the documents, which can be considered to be keywords in a simplistic sense.

One way that you could extract high frequency keywords is to run kwx twice over your documents: once with method="LDA" or method="BERT", and a second time with method="frequency". You could then compare the outputs and take only those words from the first run that also appear in the second :) You might need to increase the value for num_keywords in extract_kws so that you get enough words that overlap in the two runs, but it definitely will work. This is actually an example of one of the original use cases for kwx :D

Let me know if you have further questions about the arguments in extract_kws.

All the best!
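In rough code, that two-run comparison could look something like the sketch below; the import path is an assumption, and it assumes extract_kws returns a flat list of keyword strings.

from kwx.model import extract_kws  # assumed import path

num_keywords = 50  # a larger value so the two runs have room to overlap
num_topics = 10

bert_kws = extract_kws(
    method="BERT",
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=num_keywords,
    num_topics=num_topics,
)

freq_kws = extract_kws(
    method="frequency",
    text_corpus=text_corpus,
    input_language=input_language,
    num_keywords=num_keywords,
    num_topics=num_topics,
)

# Keep only the BERT keywords that also rank among the most frequent words,
# preserving the order in which the BERT run returned them.
high_freq_kws = [kw for kw in bert_kws if kw in set(freq_kws)]
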
Thank you for the clarification.
You're very welcome, and further regards!

Hi Andrew, it's me again :)
I want to ask two questions about the algorithm.
When using the BERT model, why do we remove n-grams, and can't we use the texts without removing them?
My second question: when using BERT we give the number of keywords and the number of topics. How does the number of threads work, and what is the logic behind it?