Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor $scan_csv() and $read_csv() #455

Merged
merged 33 commits into from
Nov 6, 2023
Merged

Conversation

etiennebacher
Copy link
Collaborator

@etiennebacher etiennebacher commented Nov 1, 2023

Close #444

This PR:

  • updates names of parameters to match py-polars
  • adds missing parameters
  • deals with multiple paths to CSV files
  • improves docs

TODO:

  • more tests
  • deal with row_count_ params
  • clean Rust side
  • deal with multiple download URLs
  • clean code for dtypes and null_values
  • replace the panic! related to wrong encoding value
  • integers are parsed as i64 which can't be converted to R, see Improve handling of Polars Int64 to R #465

Multiple URLs isn't implemented yet in Python, this fails:

import polars as pl

pl.read_csv(
    ["https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv",
     "https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv"]
)

#' @rdname IO_read_csv
#' @param path Path to a file or URL. It is possible to provide multiple paths
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid writing the same description twice?
I tried using the inheritParams tag but it doesn't seem to work.

Why not avoid duplication by combining them into the same Rd file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using the inheritParams tag but it doesn't seem to work.

Yeah, I don't know why, that's annoying

Why not avoid duplication by combining them into the same Rd file?

I'd rather have separate files, but I think I saw somewhere that we could store the roxygen docs as "template" and re-use them, so I'll try to find this again

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can include Rmd files in roxygen docs but not in the parameters section, so I don't have a solution for this. I'd still like to keep separate docs so that scan_csv and read_csv appears separately in the sidebar of "Reference" on the website

DESCRIPTION Show resolved Hide resolved
@etiennebacher
Copy link
Collaborator Author

I'm blocked on the encoding handling on the Rust side. Using the following code compiles but panicks if I run pl$read_csv(tmpf, encoding = "foo") instead of returning the expected error message:

let encoding = match encoding {
        "utf8" => pl::CsvEncoding::Utf8,
        "utf8-lossy" => pl::CsvEncoding::LossyUtf8,
        e => {
            // panic!("encoding {} not implemented.", e);
            let result = Err(format!("encoding {} not implemented.", e)).unwrap();
            let out = Ok(result)
                .map_err(polars_to_rpolars_err)
                .map(LazyFrame);
            return out
        }
    };

@eitsupi can you take a look?

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 2, 2023

I will look at it tomorrow if I have time.

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 3, 2023

Done.
I think the basic strategy is to notify R of the error message, so we should avoid panic without doing something like unwrap.

@etiennebacher
Copy link
Collaborator Author

I'm not sure how many use cases there are to convert to R data.frame

You may want to scan/read a csv, apply a bunch of operations to reduce its size and then convert it to R data.frame to use other packages on it. So I think this is quite standard.

And, just having the bit64 R package installed avoids that problem, right?

It needs to be loaded, not only installed, so this is annoying

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 3, 2023

I tried searching, but perhaps there is currently no place to check the bit64 installation?

duckdb is checking.
https://github.com/duckdb/duckdb-r/blob/163d4d5aef0f5f4e009d87588a4632f2e9da14f5/R/Driver.R#L35-L38

arrow is importing.
https://github.com/apache/arrow/blob/cd6e63570f81e96375a4c51ef5d925b5f32f5a57/r/DESCRIPTION#L34

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Nov 3, 2023

On the R side, we could check that there are i64 columns in the data. If there are some, we check that bit64 is installed and throw an error if it's not (sort of like duckplyr). Importing bit64 would break the no-R-dependency feature of polars

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 3, 2023

sort of like duckplyr

Maybe you mean duckdb?

@etiennebacher
Copy link
Collaborator Author

sort of like duckplyr

Maybe you mean duckdb?

Yes, I misread your previous message

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 3, 2023

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 3, 2023

In any case, these are not directly related to CSV reader and should be discussed in a separate issue.

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 3, 2023

I have created an issue #465 about int64 handling.

@etiennebacher
Copy link
Collaborator Author

Thanks, there's just the problem with missing_utf8_is_empty_string left before this is ready to review

@etiennebacher etiennebacher mentioned this pull request Nov 3, 2023
@eitsupi eitsupi added this to the 0.10 milestone Nov 4, 2023
@etiennebacher
Copy link
Collaborator Author

I don't understand why missing_utf8_is_empty_string doesn't work. It shouldn't block this whole PR so I mark this ready for review but I can remove this arg for now if needed

@etiennebacher etiennebacher marked this pull request as ready for review November 5, 2023 12:33
@sorhawell
Copy link
Collaborator

I could not see either why missing_utf8_is_empty_string is not responding. Did it work in py-polars?

@eitsupi
Copy link
Collaborator

eitsupi commented Nov 5, 2023

I'm sorry, but I'm on a business trip and can't let it run on hand for a few days.
Please merge without waiting for me. (I think it's better to just check the operation in Python)

@etiennebacher
Copy link
Collaborator Author

@sorhawell yes it works on Python. I decided to remove this argument for now since we know it doesn't work.

@eitsupi no problem, do you have time to release 0.10 after this is merged? Otherwise I can do it if it's just usethis::use_github_release()

@etiennebacher etiennebacher merged commit ed29154 into main Nov 6, 2023
2 checks passed
@etiennebacher etiennebacher deleted the refactor-csv-reading branch November 6, 2023 07:30
@etiennebacher etiennebacher mentioned this pull request Nov 6, 2023
@eitsupi
Copy link
Collaborator

eitsupi commented Nov 8, 2023

do you have time to release 0.10 after this is merged? Otherwise I can do it if it's just usethis::use_github_release()

#466 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scan multiple files
3 participants