
feat: ✨ OpenAI parser #245

Merged
merged 11 commits into develop from issue-229-mlparsing
Nov 15, 2023
Conversation

chadell
Collaborator

@chadell chadell commented Oct 20, 2023

This PR addresses #225, establishing a pattern for using LLM APIs (such as OpenAI) as last-resort parsers.

TODO:

  • Documentation: how to use it (ENV variables), expected behavior, warnings
  • Tests: mock the API response
  • New Processor: when both HTML and text are available, use only one so the data is not parsed twice

@chadell chadell marked this pull request as ready for review October 25, 2023 13:44
@chadell
Collaborator Author

chadell commented Oct 25, 2023

Looking forward to your feedback!

README.md Outdated
Comment on lines 273 to 279
_llm_question = (
"Can you extract the maintenance_id, the account_id, the impact, the status "
"(e.g., confirmed, cancelled, rescheduled), the summary, the circuit ids (also defined as service or order), "
"and the global start_time and end_time as EPOCH timestamps in JSON format with keys in "
"lowercase underscore format? Reply with only the answer in JSON form and "
"include no other commentary"
)
Contributor

Should this be moved to the OpenAI parser specifically, or at least made overridable? I could see that different LLMs might need variant phrasings of the question to get the desired outcome.

Collaborator Author

My idea is to have a generic one that is used by default, and then every LLM parser could override it as needed.
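The override pattern being discussed could look like this. A hypothetical sketch only: the class names are illustrative and not taken from the PR, and only the default question string comes from the snippet above.

```python
class LLM:
    """Base class for LLM-backed parsers (illustrative sketch)."""

    # Generic default question, shared by all LLM parsers.
    _llm_question = (
        "Can you extract the maintenance_id, the account_id, the impact, the status "
        "(e.g., confirmed, cancelled, rescheduled), the summary, the circuit ids (also defined as service or order), "
        "and the global start_time and end_time as EPOCH timestamps in JSON format with keys in "
        "lowercase underscore format? Reply with only the answer in JSON form and "
        "include no other commentary"
    )


class SomeOtherLLMParser(LLM):
    """A specific LLM can override the default phrasing if it gets better results."""

    _llm_question = "Return the maintenance fields as a JSON object, with no other commentary."
```

Subclasses that are happy with the default simply inherit `_llm_question` unchanged.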


@staticmethod
def get_text_hook(raw: bytes) -> str:
"""Can be overwritten by subclasses."""
Contributor

What purpose does this method serve? When would subclasses need/want to override it?

Collaborator Author

I made its purpose more explicit: the concrete parsers may each need a different way to decode bytes to a string.
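As an illustration of why the hook is overridable, a subclass might need a different bytes-to-string decode, e.g. for a provider whose notification bodies arrive quoted-printable encoded. The class names here are hypothetical; only the `get_text_hook` signature matches the snippet above.

```python
import quopri


class Parser:
    @staticmethod
    def get_text_hook(raw: bytes) -> str:
        """Can be overwritten by subclasses that need a custom bytes-to-string decoding."""
        return raw.decode()


class QuotedPrintableParser(Parser):
    @staticmethod
    def get_text_hook(raw: bytes) -> str:
        # This hypothetical provider sends quoted-printable email bodies.
        return quopri.decodestring(raw).decode()
```

For example, `QuotedPrintableParser.get_text_hook(b"caf=C3=A9")` yields `"café"`, while the base implementation would return the raw escape sequence.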

circuit_maintenance_parser/parser.py Outdated
circuit_maintenance_parser/processor.py Outdated
circuit_maintenance_parser/provider.py
tasks.py Outdated
Comment on lines 547 to 549
- parsed_notifications = parser_class().parse(raw_data)
+ parsed_notifications = parser_class().parse(
+     raw_data, parser_class._data_types[0]  # pylint: disable=protected-access
+ )
Contributor

Can you help me understand why this change? It feels weird to be passing a private attribute of the parser_class into a method on said class - why doesn't the parser just look this up automatically?

Collaborator Author

Before this PR, each Parser could only handle one data type: HTML, CSV, text, etc. The LLM Parser supports two types, HTML and text, but it requires a specific transformation of the data into text (done here).

However, I should change this private access to use a public method instead.
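One way to drop the protected access would be a public accessor on the parser class. This is a sketch with hypothetical names (`get_data_types`, `LLMParser`); the actual PR may expose this differently.

```python
class Parser:
    """Illustrative base parser."""

    _data_types = ["text/plain"]

    @classmethod
    def get_data_types(cls) -> list:
        """Public accessor, so call sites need no protected-access pylint waiver."""
        return cls._data_types


class LLMParser(Parser):
    # The LLM parser handles both HTML and plain text.
    _data_types = ["text/html", "text/plain"]


# The call site in tasks.py could then become:
# parsed_notifications = parser_class().parse(raw_data, parser_class.get_data_types()[0])
```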

jvanderaa
jvanderaa previously approved these changes Nov 14, 2023
glennmatthews
glennmatthews previously approved these changes Nov 14, 2023
pyproject.toml Outdated
@@ -32,6 +32,7 @@ geopy = "^2.1.0"
 timezonefinder = "^6.0.1"
 backoff = "^1.11.1"
 chardet = "^5"
+openai = "^0.28.1"
Contributor

How heavyweight is this dependency? Would it be preferable to make it an optional extra instead of a hard requirement?

Collaborator Author

It doesn't look too heavyweight: https://github.com/openai/openai-python/blob/main/pyproject.toml#L10
However, I think making it optional is an extra safeguard, so people are aware they are opting into it.
And I also noticed that there is a stable version now!
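Making the dependency optional could be done with a Poetry extra, roughly like this (a sketch using Poetry's optional-dependency syntax, not taken from the PR):

```toml
[tool.poetry.dependencies]
openai = { version = "^0.28.1", optional = true }

[tool.poetry.extras]
openai = ["openai"]
```

Users who want the LLM parser would then install the package with the extra, e.g. `pip install "circuit-maintenance-parser[openai]"`, while everyone else avoids pulling in the dependency.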

@chadell chadell merged commit 13dfea4 into develop Nov 15, 2023
14 checks passed
@chadell chadell deleted the issue-229-mlparsing branch November 15, 2023 14:14