Added unstructured io parsers #274

Merged
merged 15 commits into main from ps_unstructured
Jul 16, 2024

@S1LV3RJ1NX (Contributor) commented Jul 15, 2024

  • Cognita now supports the following file extensions: ".txt", ".eml", ".msg", ".xml", ".html", ".md", ".rst", ".json", ".rtf", ".jpeg", ".png", ".doc", ".docx", ".ppt", ".pptx", ".pdf", ".odt", ".epub", ".csv", ".tsv", ".xlsx"
  • Removed the other parsers to keep the codebase lean
  • Modified the multimodal parser to directly support images: ".png", ".jpeg", ".jpg"
  • Minor modifications to the parser Pydantic models in the frontend and backend.

backend/modules/parsers/unstructured_io.py (outdated, resolved)
backend/modules/parsers/unstructured_io.py (resolved)
backend/settings.py (outdated, resolved)
docker-compose.yaml (outdated, resolved)
backend/Dockerfile (outdated, resolved)
Comment on lines +21 to 26
"additional_config": {
"model_configuration": {
"name": "truefoundry/openai-main/gpt-4-turbo"
},
"prompt": "Given an image containing one or more charts/graphs, and texts, provide a detailed analysis of the data represented in the charts. Your task is to analyze the image and provide insights based on the data it represents. Specifically, the information should include but not limited to: - Title of the Image: Provide a title from the charts or image if any. - Type of Chart: Determine the type of each chart (e.g., bar chart, line chart, pie chart, scatter plot, etc.) and its key features (e.g., labels, legends, data points). - Data Trends: Describe any notable trends or patterns visible in the data. This may include increasing/decreasing trends, seasonality, outliers, etc. - Key Insights: Extract key insights or observations from the charts. What do the charts reveal about the underlying data? Are there any significant findings that stand out? - Data Points: Identify specific data points or values represented in the charts, especially those that contribute to the overall analysis or insights. - Comparisons: Compare different charts within the same image or compare data points within a single chart. Highlight similarities, differences, or correlations between datasets. - Conclude with a summary of the key findings from your analysis and any recommendations based on those findings."
}
@chiragjn (Member) commented Jul 16, 2024

Please, let's not stuff these things into additional_config.
Can we refactor this to something like:

".pdf": {
    "class": "MultiModalParser",
    "kwargs": {
        "...": "..."
    }
}
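
One way to read the suggestion above, as a minimal sketch: each file extension maps to a parser class name plus the kwargs passed to that class's constructor, instead of nesting everything under additional_config. The names `PARSER_REGISTRY`, `PARSER_CONFIG`, `build_parser`, the stub classes, and the kwargs `model_name` and `prompt` are all hypothetical and are not taken from the Cognita codebase.

```python
from typing import Any, Dict, Type


class MultiModalParser:
    """Stub standing in for the real parser; the constructor kwargs are assumed."""

    def __init__(self, model_name: str, prompt: str = "") -> None:
        self.model_name = model_name
        self.prompt = prompt


class UnstructuredIoParser:
    """Stub standing in for a parser that needs no configuration."""


# Hypothetical registry: class name as written in the config -> class object.
PARSER_REGISTRY: Dict[str, Type[Any]] = {
    "MultiModalParser": MultiModalParser,
    "UnstructuredIoParser": UnstructuredIoParser,
}

# Hypothetical config in the shape proposed in the review comment.
PARSER_CONFIG: Dict[str, Dict[str, Any]] = {
    ".pdf": {
        "class": "MultiModalParser",
        "kwargs": {
            "model_name": "truefoundry/openai-main/gpt-4-turbo",
            "prompt": "Describe the charts in this image.",
        },
    },
    ".txt": {"class": "UnstructuredIoParser", "kwargs": {}},
}


def build_parser(extension: str, config: Dict[str, Dict[str, Any]] = PARSER_CONFIG) -> Any:
    """Instantiate the parser configured for a file extension.

    Raises KeyError if the extension or the class name is not registered.
    """
    entry = config[extension]
    parser_cls = PARSER_REGISTRY[entry["class"]]
    return parser_cls(**entry.get("kwargs", {}))
```

With this shape, the kwargs are validated by the parser's own constructor signature, so a misspelled option fails at construction time rather than being silently ignored inside a free-form dictionary.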

docker-compose.yaml (outdated, resolved)
compose.env (outdated, resolved)
backend/settings.py (outdated, resolved)
Comment on lines 44 to 56
self.session = requests.Session()
self.retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["POST"],
)
self.adapter = HTTPAdapter(max_retries=self.retry_strategy)
self.session.mount("https://", self.adapter)
self.session.mount("http://", self.adapter)
self.headers = {
    "accept": "application/json",
}
Member

All this retrying should be made a common utility across the repo

Contributor Author

will take this up as a separate PR
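
A shared retry helper along the lines requested in this thread might look like the sketch below. It reuses the same `requests` and `urllib3` calls as the snippet under review; the function name `build_retry_session` and its parameters are hypothetical, and where such a helper would live in the repo is not decided here.

```python
from typing import Dict, Iterable, Optional

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retry_session(
    total: int = 3,
    backoff_factor: float = 1.0,
    status_forcelist: Iterable[int] = (429, 500, 502, 503, 504),
    allowed_methods: Iterable[str] = ("POST",),
    headers: Optional[Dict[str, str]] = None,
) -> requests.Session:
    """Return a requests.Session that retries on the given status codes.

    Defaults mirror the values in the reviewed snippet. The allowed_methods
    argument requires urllib3 1.26 or newer.
    """
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,
        status_forcelist=list(status_forcelist),
        allowed_methods=list(allowed_methods),
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    # Mount the same adapter for both schemes, as the original code does.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update(headers or {"accept": "application/json"})
    return session
```

A parser could then call `self.session = build_retry_session()` in its constructor instead of repeating the adapter setup.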

Comment on lines 90 to 92
except Exception as e:
logger.exception(f"Final Exception: {e}")
return final_texts
Member

This error handling is too aggressive in my opinion; the caller should decide what they want to do with these errors.
This is a general comment across all the parsers that we maintain.
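
To illustrate the point about leaving the decision to the caller, here is a minimal sketch in which the parser raises instead of logging and returning partial results. `ParserError`, `parse_file`, `ingest`, and the `skip_failures` flag are hypothetical names introduced for this example only.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


class ParserError(Exception):
    """Hypothetical error type raised when a file cannot be parsed."""


def parse_file(path: str, extract: Callable[[str], List[str]]) -> List[str]:
    """Parse one file; wrap any failure and re-raise it rather than swallowing it."""
    try:
        return extract(path)
    except Exception as exc:
        # Keep the original exception as the cause so callers can inspect it.
        raise ParserError(f"Failed to parse {path}") from exc


def ingest(
    paths: Iterable[str],
    extract: Callable[[str], List[str]],
    skip_failures: bool = False,
) -> List[str]:
    """Caller-side policy: either fail fast or log and skip bad files."""
    chunks: List[str] = []
    for path in paths:
        try:
            chunks.extend(parse_file(path, extract))
        except ParserError:
            if not skip_failures:
                raise
            logger.exception("Skipping %s after parser failure", path)
    return chunks
```

The parser stays simple and the policy (fail the whole ingestion run, or skip the file and continue) lives in one place on the caller's side.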

@chiragjn (Member) left a comment

Keeping comments which will be addressed later as unresolved

@S1LV3RJ1NX merged commit 4731b7e into main on Jul 16, 2024
1 check passed
@S1LV3RJ1NX deleted the ps_unstructured branch on July 16, 2024 at 09:26
@S1LV3RJ1NX added a commit that referenced this pull request on Jul 16, 2024
Merge pull request #274 from truefoundry/ps_unstructured