Inconsistent exception is raised when series containing NaNs is passed to nlpretext.basic.preprocess.remove_stopwords #205

Closed
julesbertrand opened this issue Mar 22, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@julesbertrand
Collaborator

🐛 Bug Report

When using the remove_stopwords function, if your text column has empty values, nlpretext raises inconsistent exceptions (about the language choice).

🔬 How To Reproduce

Steps to reproduce the behavior:

  1. Load the data, convert it to a DataFrame, and concatenate the two text columns without a space between them. Some rows will be empty.

  2. Try using remove_stopwords

Code sample

import pandas as pd
from nlpretext.basic.preprocess import remove_stopwords

data = {'overview': {
  0: 'Comme les Mousquetaires dont elles possèdent le cran',
  1: 'New York, été 1977. Alors que la ville connait une canicule historique, un tueur en série, The Son of Sam, frappe dans le quartier italo-américain de South Bronx.',
  2: '',
  3: "Félicia, dix-sept ans, traverse la mer d'Irlande, avec pour tout renseignement le nom de la ville où habite son amant pour lui annoncer sa grossesse.",
  4: "Arthur Bishop pensait qu'il avait mis son passé de tueur à gages derrière lui. Il coule maintenant des jours heureux avec sa compagne dans l'anonymat."},
 'tagline': {0: '', 1: '', 2: '', 3: '', 4: 'Il reprend du service.'}
}

data = pd.DataFrame(data)

data["text"] = data["tagline"] +  data["overview"]

data["text"].map(lambda x: remove_stopwords(x, lang='fr'))

Environment

  • OS: Google Colab
  • Python version: 3.7

Screenshots

First exception:
[Screenshot: Capture d’écran 2022-03-22 à 15 52 41]
Then, when replacing 'fr' with 'fr_scpacy':
[Screenshot: Capture d’écran 2022-03-22 à 15 53 00]

📈 Expected behavior

Remove the stopwords without errors (convert NaNs to strings?), or raise an exception saying "your text column contains NaNs, please fix it".
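
For illustration, the second option (an explicit, consistent exception) could look roughly like the sketch below. This is not nlpretext code: `find_missing_text_rows` and `validate_text_column` are hypothetical helper names, and treating whitespace-only strings as missing is an assumption.

```python
import math


def find_missing_text_rows(texts):
    """Return the positions of entries that are None, NaN, or blank strings."""
    missing = []
    for position, value in enumerate(texts):
        if value is None:
            missing.append(position)
        elif isinstance(value, float) and math.isnan(value):
            missing.append(position)
        elif isinstance(value, str) and not value.strip():
            # Assumption: whitespace-only strings count as empty.
            missing.append(position)
    return missing


def validate_text_column(texts):
    """Raise one clear error instead of a language-related one (hypothetical helper)."""
    missing = find_missing_text_rows(texts)
    if missing:
        raise ValueError(
            f"Your text column contains NaN or empty values at positions {missing}, "
            "please fix it."
        )
```

Calling `validate_text_column(data["text"])` before the `map` call would then fail early with a message that names the offending rows.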

📎 Additional context

Workaround: data["text"] = data["tagline"] + " " + data["overview"] solves it as all rows will be non-empty strings.
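
Another possible workaround, if changing the concatenation is not an option, is to skip empty or missing rows before calling the preprocessing function. The sketch below is only an illustration: `is_blank` and `apply_if_text` are hypothetical helpers, and `drop_short_words` is a stand-in used so the example runs without nlpretext installed.

```python
import math


def is_blank(value):
    """True for None, NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def apply_if_text(value, func, **kwargs):
    """Call func(value, **kwargs) only for real text; return "" otherwise."""
    return "" if is_blank(value) else func(value, **kwargs)


def drop_short_words(text, min_len=3):
    """Stand-in for a text-cleaning function; NOT the nlpretext implementation."""
    return " ".join(word for word in text.split() if len(word) >= min_len)


samples = ["Il reprend du service.", "", None, float("nan")]
cleaned = [apply_if_text(text, drop_short_words, min_len=3) for text in samples]
```

With the code sample from this issue, the same wrapper could be used as `data["text"].map(lambda x: apply_if_text(x, remove_stopwords, lang='fr'))`, assuming that returning an empty string for empty rows is acceptable.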

@julesbertrand julesbertrand added the bug Something isn't working label Mar 22, 2022
@github-actions

Hello @julesbertrand, thank you for your interest in our work!

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
