Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Way to disable numerical parsing of s-expressions? #23

Closed
khannan-livefront opened this issue May 2, 2023 · 6 comments
Closed

Way to disable numerical parsing of s-expressions? #23

khannan-livefront opened this issue May 2, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@khannan-livefront
Copy link

khannan-livefront commented May 2, 2023

Describe the bug
I'm working with s-expressions from an NLP parser, and I'm having difficulty parsing items as a string into symbols without raising float errors.

For the following s-expression string:

(ROOT (NP (FRAG (CC And) (-LRB- -LRB-) (NP (NNP Gen.) (CD 32) (. .) (CD 1.)) (-RRB- -RRB-) (PP (IN in) (NP (NP (DT a) (NNP Vision)) (PP (IN of) (NP (NNPS Angels)))))) (: :) (CC And) (PP (IN to) (NP (NP (NNP Moses)) (-LRB- -LRB-) (NP (NP (NN Exod.)) (NP (CD 3.2.))) (-RRB- -RRB-))) (PP (IN in) (NP (NP (DT the) (NN apparition)) (PP (IN of) (NP (DT a) (NN flame)))))))

The SXP parser errors with:

ArgumentError: invalid value for Float(): "1."

To reproduce
Steps to reproduce the behavior:

  1. copy this code:
s_expression = "(ROOT (NP (FRAG (CC And) (LRB -LRB-) (NP (NNP Gen.) (CD 32) (PUNCTUATION .) (CD 1.)) (RRB -RRB-) (PP (IN in) (NP (NP (DT a) (NNP Vision)) (PP (IN of) (NP (NNPS Angels)))))) (: :) (CC And) (PP (IN to) (NP (NP (NNP Moses)) (LRB -LRB-) (NP (NP (NN Exod.)) (NP (CD 3.2.))) (RRB -RRB-))) (PP (IN in) (NP (NP (DT the) (NN apparition)) (PP (IN of) (NP (DT a) (NN flame)))))))"

SXP.read(s_expression)

  1. See error

Expected behavior
SXP parses the nodes as strings rather than as numericals. I want characters like periods to be seen as strings.

Desktop (please complete the following information):

  • OS: [Mac OS Monteray]

Additional context
It appears a possible fix for our problem is monkey patching the gem to remove numerical parsing altogether:

module SXP
  class Reader
    class Basic < Reader
      ## Monkey patch, we don't need numerics
      # @return [Object]
      def read_atom
        case buffer = read_literal
        when '.'      then buffer.to_sym
        else buffer.to_sym
        end
      end
    end
  end
end

I would like your advice on if that's correct and insight into if there is a way to fix this within the gem, perhaps as a form of configuration.

@khannan-livefront khannan-livefront added the bug Something isn't working label May 2, 2023
@gkellogg
Copy link
Collaborator

gkellogg commented May 3, 2023

The code which parses numbers from atoms is necessary, so can't easily be eliminated. However, the case of Float("1.") should be handled, as the DECIMAL expression specifically allows this, so should append a "0" in this case.

However, this may not be what your specific flavor of S-Expression requires if (1.) is expected to have the atom 1, and .) has some other specific meaning. How would you expect (NNP Gen.) and (. .) to parse?

You may need to add a reader subclass (similar to Schema, and SPARQL readers) to deal with any specific variations you need.

I'll fix the problem properly handling 1. as a DECIMAL, but this may still not be quite what you expect.

@gkellogg
Copy link
Collaborator

gkellogg commented May 3, 2023

Try the version from develop branch, and if satisfied,, I'll release a new version containing this fix.

@khannan-livefront
Copy link
Author

@gkellogg Oh yes this is an improvement! Feel free to release. :)

@khannan-livefront
Copy link
Author

Your release fixed the problem, thank you @gkellogg!

@khannan-livefront
Copy link
Author

khannan-livefront commented May 3, 2023

@gkellogg In case you're interested, I should clarify that from the original s-expression, which I accidentally copied in the first example (with (. .)), we actually sanitize that s-expression from Stanza to meet the requirements of the SXP parse and Nokogiri before we run it through SXP, which looks like below:

    S_EXPRESSION_REPLACEMENTS = {
      '(``' => '(INITIAL_QUOTE',
      "(''" => '(QUOTE',
      '(.' => '(PUNCTUATION',
      '-LRB- (' => 'LRB -LRB-',
      '-RRB- )' => 'RRB -RRB-',
      '-LRB- {' => 'LRB -LCB-',
      '-RRB- }' => 'RRB -RCB-',
      '-LRB- [' => 'LRB -LSB-',
      '-RRB- ]' => 'RRB -RSB-',
      '(,' => '(NFP', # Non-final punctuation, also known as superfluous punctuation
      /\(-([[:alpha:]]{3})-/ => '(\1', # E.g., (-LRB- to (LRB
      /[^(\\]"/ => ' "\""', # Escape unescaped right-side double quotes, places escaped double quote in quotes.
      'WP$' => 'WP-S', # You can't have a dollar sign as a valid XML element name. WP-S is the alternative POS name.
      'PRP$' => 'PRP-S', # You can't have a dollar sign as a valid XML element name. PRP-S is the alternative POS name.
      '0' => 'ZERO' # An integer that begins as zero is not a valid Integer for SXP to parse. e.g. 038
    }.freeze

      # There are several left-side atoms from the constituency parser's s-expression that would make invalid DOM
      # element names. This replaces those with names that work.
      # @example replacement
      #   (. !) means punctuation with value "!"
      #   (. !) will be transformed to (PUNCTUATION !)
      # @private
      # @param s_expression [String]
      # @return [void]
      def sanitize(s_expression)
        str = s_expression.dup
        S_EXPRESSION_REPLACEMENTS.each { |k, v| str.gsub!(k, v) }
        str
      end

Which we call with:

      sxp = SXP.read(sanitize(s_expression))

There could be possible fixes to SXP with our work here. e.g. An integer that begins as zero is not a valid Integer for SXP to parse. e.g. 038

@gkellogg
Copy link
Collaborator

gkellogg commented May 4, 2023

PRs welcome; you might consider a special reader.

Potentially the numeric components of an atom create stay unparsed in the Basic implementation, and the parsing left so subclasses for Common, SPARQL, Schema, and Common Lisp. That wouldn't be too DRY for the common use case, though.

Certainly other over-eager atom parsing for some of the workaround you've noted could potentially be addressed in the basic reader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

No branches or pull requests

2 participants