File and load_file should take extra params to give to pandas #482

Closed
5 of 9 tasks
fritz-astronomer opened this issue Jun 22, 2022 · 5 comments
Labels: feature, product/python-sdk

Comments

@fritz-astronomer
Contributor

Please describe the feature you'd like to see
I may need to pass additional parameters to pandas that describe my data,
e.g. if I need to approximate:

pd.read_csv("s3://my_bucket/my_file.csv", sep='|')

I am currently unable to use `astro-sdk`.

Describe the solution you'd like
File() should take kwargs, or some other parameter, that can be passed down to pandas or whatever is doing the underlying reading.
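
A minimal sketch of the idea, not the actual `astro-sdk` API: `FileSketch` and its `read_kwargs` field are hypothetical names used only for illustration, and the in-memory buffer stands in for the S3 path above. The only real library call is `pandas.read_csv`, which receives the options unchanged.

```python
import io
from dataclasses import dataclass, field
from typing import Any, Dict

import pandas as pd


@dataclass
class FileSketch:
    """Hypothetical stand-in for File; not the astro-sdk API."""

    path: Any  # a path, URL, or file-like object accepted by pandas
    read_kwargs: Dict[str, Any] = field(default_factory=dict)

    def to_dataframe(self) -> pd.DataFrame:
        # Forward the user-supplied options unchanged to the reader.
        return pd.read_csv(self.path, **self.read_kwargs)


# In-memory data stands in for "s3://my_bucket/my_file.csv".
pipe_delimited = io.StringIO("id|name\n1|alpha\n2|beta\n")
df = FileSketch(pipe_delimited, read_kwargs={"sep": "|"}).to_dataframe()
print(df)
```

Keeping the options in a plain dict means the wrapper does not need to know which reader options exist, at the cost of tying callers to pandas parameter names.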

Are there any alternatives to this feature?
🤷

Acceptance Criteria

  • All checks and tests in the CI should pass
  • Unit tests (90% code coverage or more, once available)
  • Integration tests (if the feature relates to a new database or external service)
  • Example DAG
  • Docstrings in reStructuredText for each method, class, function, and module-level attribute (including an Example DAG showing how it should be used)
  • Exception handling in case of errors
  • Logging (are we exposing useful information to the user? e.g. source and destination)
  • Improve the documentation (README, Sphinx, and any other relevant)
  • How-to guide for the feature (example)
@fritz-astronomer added the feature label Jun 22, 2022
@tatiana
Collaborator

tatiana commented Jul 4, 2022

@fritz-astronomer, we are currently improving the load performance, and we won't be using Pandas for many of the load operations. Therefore, I believe it is unlikely that the Astro SDK would support any Pandas kwargs.

That said, I believe supporting different CSV delimiters is a good feature. We could add support for this as part of the File object, passed to the load operation. Would this solve the issue, or do you have additional needs to load this CSV?
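
One way the delimiter-on-the-File-object idea could be sketched without pandas is shown below. `DelimitedFileSketch` and its `delimiter` field are hypothetical names for illustration, not part of the Astro SDK; the parsing uses the standard-library `csv` module.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class DelimitedFileSketch:
    """Hypothetical File-like object carrying its own delimiter."""

    handle: TextIO
    delimiter: str = ","

    def rows(self) -> Iterator[List[str]]:
        # csv.reader is from the standard library, so no pandas is involved.
        return csv.reader(self.handle, delimiter=self.delimiter)


# In-memory pipe-delimited data stands in for a real file.
source = io.StringIO("id|name\n1|alpha\n2|beta\n", newline="")
rows = list(DelimitedFileSketch(source, delimiter="|").rows())
print(rows)
```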

@fritz-astronomer
Contributor Author

@tatiana - I think that would suffice.

A note though: I imagine that handling so many edge cases of file specifications could take a lot of code that may already exist elsewhere. Maybe File can generally be a thin wrapper on top of some other Python file-parsing library? 🤷 I'm thinking of CSV, Avro, and Parquet, and how each of them can vary in really small ways that are annoying to parse.

@fhoda

fhoda commented Sep 15, 2022

I had a similar issue, but I wanted to pass column names. My file doesn't have a header row, so I want to be able to use names=[col1, col2, ...], header=None when loading a file into a dataframe with the SDK, but cannot do this. I have a workaround because my next step is a transform step, but if that weren't the case it would be much more difficult to supply column names.

If we allow loading into a pandas dataframe, I think we should consider allowing kwargs to be passed down to it.
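
For reference, this is the plain pandas behavior being asked for, shown outside the SDK. The data and the column names `col1` and `col2` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical header-less data; the column names are illustrative.
headerless = io.StringIO("1,alpha\n2,beta\n")

# header=None tells pandas the first line is data, not column labels;
# names= supplies the labels instead.
df = pd.read_csv(headerless, names=["col1", "col2"], header=None)
print(df)
```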

@phanikumv added this to the 1.3.0 milestone Sep 16, 2022
@kaxil added the product/python-sdk label Oct 6, 2022
@utkarsharma2 modified the milestones: 1.3.0, 1.2.2 Nov 9, 2022
@utkarsharma2 added the priority/high label Nov 9, 2022
@pankajastro
Contributor

pankajastro commented Nov 9, 2022

As pointed out by @tatiana, we are trying to avoid pandas because of performance. Still, I have added pd_kw in #1225 so we can discuss it further with the team and reach a conclusion.

@pankajastro
Contributor

Possible duplicate of #1264


8 participants