problem parsing two-word ingredients that begin with lower-case 'a' #931

Open
Stuckyville opened this issue Jan 28, 2019 · 3 comments · May be fixed by #999

Comments

@Stuckyville

When entering a two-word ingredient whose first word begins with a lower-case 'a', the parser strips the leading 'a' and treats it as a quantity. For example, 'apple juice' becomes 'pple juice' with a quantity of '1'. Further detail is discussed at https://answers.launchpad.net/gourmet/+question/678095

@saxon-s
Collaborator

saxon-s commented Jan 28, 2019

Environment:
Gourmet 0.17.4 and master branch on Ubuntu and Windows.

Steps to reproduce:

  1. Click "New" button for new recipe
  2. Click "Ingredients" tab
  3. Add each of the following ingredients individually to "Add ingredient" text field:
    "apple juice"
    "Apple juice"
    "apricot"
    "an avocado"
    "a beet"
    "a dozen eggs"
    "a pair of Yubari King melons"

Expected Results:

  • Expect ingredients to be listed as:
    "apple juice"
    "Apple juice"
    "apricot"
    "1 avocado"
    "1 beet"
    "12 eggs"
    "2 Yubari King melons"

Actual Results:

  • Instead, ingredients are listed as:
    "1 pple juice"
    "Apple juice"
    "apricot"
    "1 avocado"
    "1 beet"
    "12 eggs"
    "2 Yubari King melons"

Analysis:
If an ingredient string contains more than one word and its first word starts with a lower-case "a", the leading "a" is stripped from that word and replaced with a quantity of "1". Likewise, "a dozen" is replaced with a quantity of "12" and "a pair" with a quantity of "2".

  • Gourmet is designed to translate word numbers into equivalent numbers, for example:
    "a" --> "1"
    "an" --> "1"
    "a couple" --> "2"
    "a dozen" --> "12"
    "twenty" --> "20"

Conclusion:

  • There appears to be a bug in the ingredient parser. The parser should translate "a" to "1" only when "a" is a single-character word on its own.
  • In addition, the ingredient parser does not translate capitalized number words correctly, for example:
    "A dozen" is not translated to quantity of "12".

@martinp26

martinp26 commented Jun 13, 2020

There are multiple problems here:

  • NUMBER_WORD_REGEXP is missing word boundaries around the individual regex elements, so it finds 'a' inside other words. Not sure whether adding the boundaries alone would be enough.
  • The number words are also NOT put through translation. The German version still has "one" ... "ten" in the regex. As a side effect, the search terminates early on the translated minutes unit: "Minuten" is cut to "Minu", which then does not parse. Re-editing recipes leads to losing time annotations.
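
Both effects can be reproduced in isolation. The following is a minimal sketch, not Gourmet's actual code: `NUMBER_WORDS` here is a small hypothetical subset of the real table, and `split_quantity`, `UNBOUNDED` and `BOUNDED` are names made up for the illustration. It compares a plain alternation with one wrapped in `\b` word boundaries and compiled case-insensitively. The sketch does not model whatever extra logic leaves single-word ingredients such as "apricot" untouched in Gourmet.

```python
import re

# Hypothetical subset of Gourmet's number-word table, for illustration only.
NUMBER_WORDS = {"a": 1, "an": 1, "a couple": 2, "a pair": 2,
                "a dozen": 12, "ten": 10, "twenty": 20}


def build_alternation(words):
    # Longest phrases first so "a dozen" is tried before "a";
    # spaces inside a phrase may be any run of whitespace.
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(
        r"\s+".join(re.escape(tok) for tok in word.split()) for word in ordered
    )


ALTERNATION = build_alternation(NUMBER_WORDS)
# No word boundaries, case-sensitive: reproduces the reported symptoms.
UNBOUNDED = re.compile(ALTERNATION)
# Word boundaries around the whole group, case-insensitive: one possible fix.
BOUNDED = re.compile(r"\b(?:" + ALTERNATION + r")\b", re.IGNORECASE)


def split_quantity(text, pattern):
    """Return (quantity, rest) if text starts with a number word, else (None, text)."""
    m = pattern.match(text)
    if m is None:
        return None, text
    key = " ".join(m.group(0).lower().split())  # normalize case and spacing
    return NUMBER_WORDS[key], text[m.end():].strip()


print(split_quantity("apple juice", UNBOUNDED))  # (1, 'pple juice')
print(split_quantity("apple juice", BOUNDED))    # (None, 'apple juice')
print(split_quantity("A dozen eggs", BOUNDED))   # (12, 'eggs')
print(UNBOUNDED.search("12 Minuten").group(0))   # ten
print(BOUNDED.search("12 Minuten"))              # None
```

Word boundaries only address the partial matches. Localizing the number words themselves is a separate problem that this sketch leaves out.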

A simple workaround is this in gourmet/convert.py:

@@ -644,7 +648,7 @@ all_number_words.sort(
     lambda x,y: ((len(y)>len(x) and 1) or (len(x)>len(y) and -1) or 0)
     )

-NUMBER_WORD_REGEXP = '|'.join(all_number_words).replace(' ','\s+')
+NUMBER_WORD_REGEXP = None
 FRACTION_WORD_REGEXP = '|'.join(filter(lambda n: NUMBER_WORDS[n]<1.0,
                                        all_number_words)
                                 ).replace(' ','\s+')
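
A side note on the context lines of this diff: the `sort` call passes a comparison function positionally, which only works on Python 2. In Python 3, `list.sort` accepts only the `key` and `reverse` keyword arguments. The sketch below shows what should be an equivalent longest-first ordering on Python 3, using a hypothetical sample list in place of the real `all_number_words`:

```python
from functools import cmp_to_key

# Hypothetical sample; the real list is derived from NUMBER_WORDS in gourmet/convert.py.
all_number_words = ["a", "twenty", "a dozen", "an", "a couple"]


def longest_first(x, y):
    # Same ordering as the lambda in the diff: longer strings sort first.
    return (len(y) > len(x)) - (len(x) > len(y))


# Direct port of the comparison function...
by_cmp = sorted(all_number_words, key=cmp_to_key(longest_first))
# ...and the more idiomatic key-based form.
by_key = sorted(all_number_words, key=len, reverse=True)

print(by_cmp)  # ['a couple', 'a dozen', 'twenty', 'an', 'a']
print(by_key)  # ['a couple', 'a dozen', 'twenty', 'an', 'a']
```

Sorting longest first matters for the alternation because the regex engine takes the first alternative that matches, so "a dozen" has to be tried before "a".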

I believe the NUMBER_FINDER.finditer(timestring) in timestring_to_seconds should not blindly look for the next num-like match, but only after the non-num words after the last match have been consumed.
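
One way to read this suggestion is as a sequential scan: match a number, then consume the whole unit word that follows it, and only then look for the next number. The sketch below illustrates the idea under stated assumptions. `NUMBER`, `BLIND`, `PAIR` and `scan_pairs` are hypothetical stand-ins and are not taken from gourmet/convert.py, and the tiny number-word list is for illustration only.

```python
import re

# Hypothetical stand-in for the number part of NUMBER_FINDER.
NUMBER = r"\d+|one|two|ten"

# Blind search: every number-like substring counts, even inside a unit word.
BLIND = re.compile(NUMBER)

# Sequential scan: a number, then (optionally) one whole run of letters as the unit.
PAIR = re.compile(r"\s*(" + NUMBER + r")\s*([^\W\d_]+)?")


def scan_pairs(timestring):
    """Consume (number, unit) pairs left to right; stop at the first non-match."""
    pos, pairs = 0, []
    while pos < len(timestring):
        m = PAIR.match(timestring, pos)
        if m is None:
            break
        pairs.append((m.group(1), m.group(2) or ""))
        pos = m.end()  # always advances: the number part is at least one character
    return pairs


print([m.group(0) for m in BLIND.finditer("12 Minuten")])  # ['12', 'ten']
print(scan_pairs("12 Minuten"))                            # [('12', 'Minuten')]
print(scan_pairs("1 Stunde 30 Minuten"))                   # [('1', 'Stunde'), ('30', 'Minuten')]
```

Because the unit word is consumed in full before the next number is searched for, the "ten" at the end of "Minuten" is never offered to the number pattern.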

"12 Minuten" is currently parsed as [12 Minu] [ten]

@saxon-s
Collaborator

saxon-s commented Jun 17, 2020

@martinp26 Thank you for investigating the issue and the simple workaround.

martinp26 pushed a commit to martinp26/gourmet that referenced this issue Jun 20, 2020
Unit detection was not considering localization in two places.

Fix the simple issue in find_errors_in_progress() by translating units
to compare against.

The second error is more complex, details are in thinkle#931.  Disable broken
parsing of number words for now.

Signed-off-by: Martin Pohlack <martinp@gmx.de>
martinp26 pushed a commit to martinp26/gourmet that referenced this issue Jun 21, 2020
All_number_words is not working perfectly here.  The number words need
to go through localization and also need word boundaries, otherwise
they match other partial ingredients or time words in other languages.
E.g., "ten" matches the tail of the German word for minutes (Minuten).

Disable broken parsing of number words for now.

Fixes thinkle#931.

Signed-off-by: Martin Pohlack <martinp@gmx.de>
@martinp26 martinp26 linked a pull request Jun 21, 2020 that will close this issue