Refactor `$scan_csv()` and `$read_csv()` #455

etiennebacher · 2023-11-01T12:07:45Z

Close #444

This PR:

- updates names of parameters to match py-polars
- adds missing parameters
- deals with multiple paths to CSV files
- improves docs

TODO:

- more tests
- deal with row_count_ params
- clean Rust side
- deal with multiple download URLs
- clean code for dtypes and null_values
- replace the panic! related to wrong encoding value
- ~~integers are parsed as i64 which can't be converted to R~~, see Improve handling of Polars Int64 to R #465

Reading from multiple URLs isn't implemented yet in Python; this fails:

import polars as pl

pl.read_csv(
    ["https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv",
     "https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv"]
)

eitsupi · 2023-11-01T13:55:47Z

R/csv.R

 #' @rdname IO_read_csv
+#' @param path Path to a file or URL. It is possible to provide multiple paths


Is it possible to avoid writing the same description twice?
I tried using the inheritParams tag but it doesn't seem to work.

Why not avoid duplication by combining them into the same Rd file?

> I tried using the inheritParams tag but it doesn't seem to work.

Yeah, I don't know why, that's annoying

> Why not avoid duplication by combining them into the same Rd file?

I'd rather have separate files, but I think I saw somewhere that we could store the roxygen docs as "template" and re-use them, so I'll try to find this again

We can include Rmd files in roxygen docs but not in the parameters section, so I don't have a solution for this. I'd still like to keep separate docs so that scan_csv and read_csv appear separately in the sidebar of "Reference" on the website.

etiennebacher · 2023-11-02T17:07:27Z

I'm blocked on the encoding handling on the Rust side. The following code compiles but panics if I run pl$read_csv(tmpf, encoding = "foo") instead of returning the expected error message:

let encoding = match encoding {
        "utf8" => pl::CsvEncoding::Utf8,
        "utf8-lossy" => pl::CsvEncoding::LossyUtf8,
        e => {
            // panic!("encoding {} not implemented.", e);
            let result = Err(format!("encoding {} not implemented.", e)).unwrap();
            let out = Ok(result)
                .map_err(polars_to_rpolars_err)
                .map(LazyFrame);
            return out
        }
    };

@eitsupi can you take a look?

eitsupi · 2023-11-02T17:43:14Z

I will look at it tomorrow if I have time.

eitsupi · 2023-11-03T09:38:48Z

Done.
I think the basic strategy is to pass the error message back to R, so we should avoid panicking by not calling something like unwrap.
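As a rough illustration of that strategy, here is a minimal, self-contained sketch in which the unknown-encoding branch returns an `Err` value instead of unwrapping it. The `CsvEncoding` enum below is a hypothetical stand-in for `pl::CsvEncoding`, and `parse_encoding` and `new_reader` are made-up names; this is not the code that was committed to the PR.

```rust
/// Hypothetical stand-in for `pl::CsvEncoding`, so the sketch compiles on its own.
#[derive(Debug, PartialEq)]
enum CsvEncoding {
    Utf8,
    LossyUtf8,
}

/// Map the user-supplied string to an encoding. An unrecognised value
/// becomes an `Err` carrying the message, rather than a panic.
fn parse_encoding(encoding: &str) -> Result<CsvEncoding, String> {
    match encoding {
        "utf8" => Ok(CsvEncoding::Utf8),
        "utf8-lossy" => Ok(CsvEncoding::LossyUtf8),
        e => Err(format!("encoding {} not implemented.", e)),
    }
}

/// Hypothetical caller: `?` forwards the error to whoever called us
/// (in the real package, the layer that reports errors to R).
fn new_reader(path: &str, encoding: &str) -> Result<(String, CsvEncoding), String> {
    let encoding = parse_encoding(encoding)?;
    Ok((path.to_string(), encoding))
}
```

Calling `new_reader("data.csv", "foo")` yields an `Err` holding the message, which the caller can convert into an R error; nothing in the path calls `unwrap` on an `Err`.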

etiennebacher · 2023-11-03T12:33:10Z

> I'm not sure how many use cases there are to convert to R data.frame

You may want to scan/read a csv, apply a bunch of operations to reduce its size and then convert it to R data.frame to use other packages on it. So I think this is quite standard.

> And, just having the bit64 R package installed avoids that problem, right?

It needs to be loaded, not only installed, so this is annoying

eitsupi · 2023-11-03T12:34:42Z

I tried searching, but perhaps there is currently no place to check the bit64 installation?

duckdb checks for it.
https://github.com/duckdb/duckdb-r/blob/163d4d5aef0f5f4e009d87588a4632f2e9da14f5/R/Driver.R#L35-L38

arrow imports it.
https://github.com/apache/arrow/blob/cd6e63570f81e96375a4c51ef5d925b5f32f5a57/r/DESCRIPTION#L34

etiennebacher · 2023-11-03T12:40:45Z

On the R side, we could check that there are i64 columns in the data. If there are some, we check that bit64 is installed and throw an error if it's not (sort of like duckplyr). Importing bit64 would break the no-R-dependency feature of polars

eitsupi · 2023-11-03T13:01:16Z

> sort of like duckplyr

Maybe you mean duckdb?

etiennebacher · 2023-11-03T13:07:17Z

> > sort of like duckplyr
>
> Maybe you mean duckdb?

Yes, I misread your previous message

eitsupi · 2023-11-03T13:25:49Z

Since a certain range of int64 can be expressed as double, DuckDB seems to cast it to double by default.
https://github.com/duckdb/duckdb-r/blob/163d4d5aef0f5f4e009d87588a4632f2e9da14f5/R/Driver.R#L31-L41
https://github.com/duckdb/duckdb-r/blob/163d4d5aef0f5f4e009d87588a4632f2e9da14f5/src/types.cpp#L140-L144
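To make the "certain range" concrete: every integer with magnitude up to 2^53 is exactly representable as a 64-bit double, and beyond that neighbouring integers start to collapse onto the same value. The sketch below only illustrates that numeric fact; `MAX_EXACT` and `to_double` are hypothetical names, and this is not a description of how DuckDB or polars actually perform the conversion.

```rust
/// 2^53: up to this magnitude, every integer has an exact f64 representation.
const MAX_EXACT: i64 = 1 << 53;

/// Convert to f64 only when the value lies in the range where all
/// integers are exactly representable; otherwise report `None`.
fn to_double(x: i64) -> Option<f64> {
    if (-MAX_EXACT..=MAX_EXACT).contains(&x) {
        Some(x as f64)
    } else {
        None
    }
}

/// Demonstrates the precision loss just outside the range:
/// 2^53 + 1 rounds to the same double as 2^53.
fn collapses_above_limit() -> bool {
    (MAX_EXACT + 1) as f64 == MAX_EXACT as f64
}
```

A range check is used here rather than a round trip through `f64`, because casting an out-of-range double back to `i64` saturates in Rust and can make `i64::MAX` look as if it survived the conversion.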

eitsupi · 2023-11-03T13:38:10Z

In any case, these are not directly related to the CSV reader and should be discussed in a separate issue.

eitsupi · 2023-11-03T22:38:52Z

I have created an issue #465 about int64 handling.

etiennebacher · 2023-11-03T22:42:44Z

Thanks, there's just the problem with missing_utf8_is_empty_string left before this is ready to review

etiennebacher · 2023-11-05T12:33:36Z

I don't understand why missing_utf8_is_empty_string doesn't work. It shouldn't block this whole PR, so I'm marking this as ready for review, but I can remove this arg for now if needed.

sorhawell · 2023-11-05T23:06:44Z

I could not see why missing_utf8_is_empty_string has no effect either. Did it work in py-polars?

eitsupi · 2023-11-05T23:50:32Z

I'm sorry, but I'm on a business trip and can't run it locally for a few days.
Please merge without waiting for me. (I think it's better to just check the behavior in Python.)

etiennebacher · 2023-11-06T07:30:19Z

@sorhawell yes, it works in Python. I decided to remove this argument for now since we know it doesn't work.

@eitsupi no problem, do you have time to release 0.10 after this is merged? Otherwise I can do it if it's just usethis::use_github_release()

eitsupi · 2023-11-08T14:35:25Z

> do you have time to release 0.10 after this is merged? Otherwise I can do it if it's just usethis::use_github_release()

#466 (comment)

etiennebacher added 11 commits October 30, 2023 18:58

init

7148963

Merge branch 'main' into refactor-csv-reading

9d7ab66

rust side: add missing args, rename old args, use Robj as inputs, accept multiple paths

f98d950

R side: improve docs, reorder params to match py-polars, rename old params, accept multiple paths [skip ci]

32bd002

remove old comments [skip ci]

a257d45

split tests for read and write csv, add some tests

4e5425a

reenable dtypes arg [skip ci]

03d707d

Merge branch 'main' into refactor-csv-reading

9fcd3e8

reenable URL reading

c119566

more tests [skip ci]

41b51d9

more tests

1a67c8c

eitsupi reviewed Nov 1, 2023

etiennebacher added 5 commits November 1, 2023 15:12

add row_count_ params

34793ed

docs for row_count_ args [skip ci]

3d0eeea

test wrong encoding fails [skip ci]

93a9f3c

fix panic when row_count_name is NULL

9beb453

need curl for skip_if_offline()

021c037

eitsupi reviewed Nov 2, 2023

uncomment test for encoding [skip ci]

06f3b06

etiennebacher and others added 3 commits November 2, 2023 22:11

Merge branch 'main' into refactor-csv-reading

789e7ae

Merge branch 'main' into refactor-csv-reading

a5e527f

fix: shoud not panic

0e94aa1

eitsupi and others added 5 commits November 3, 2023 09:50

test: update the test case

b7cf51a

update snapshots

b00cd70

update error message

c8ad2f4

Merge branch 'main' into refactor-csv-reading

1d6df78

bump NEWS [skip ci]

6fd4810

Merge branch 'main' into refactor-csv-reading

6e8cb4c

eitsupi mentioned this pull request Nov 3, 2023

Improve handling of Polars Int64 to R #465

Closed

etiennebacher mentioned this pull request Nov 3, 2023

0.10.0 release #466

Closed

etiennebacher and others added 2 commits November 4, 2023 10:49

Merge branch 'main' into refactor-csv-reading

1331039

cargo fmt, minor simplification [skip ci]

9fcb0a6

eitsupi added this to the 0.10 milestone Nov 4, 2023

etiennebacher marked this pull request as ready for review November 5, 2023 12:33

etiennebacher requested a review from eitsupi November 5, 2023 12:33

robj_to!(Option, usize

225c96f

sorhawell approved these changes Nov 5, 2023

etiennebacher added 2 commits November 6, 2023 08:09

Merge branch 'main' into refactor-csv-reading

236cb67

remove arg missing_utf8_is_empty_string

1eeda8a

etiennebacher merged commit ed29154 into main Nov 6, 2023
2 checks passed

etiennebacher deleted the refactor-csv-reading branch November 6, 2023 07:30

etiennebacher mentioned this pull request Nov 6, 2023

Update NEWS #469

Merged

eitsupi mentioned this pull request Dec 3, 2023

test: switch to use the new release version of testthat #559

Merged
