Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CSV.read docs about WeakRefString conversions #118

Closed
wants to merge 3 commits into from

Conversation

nickeubank
Copy link

Closes #114

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@nalimilan
Copy link
Member

@quinnj Maybe the docs should even recommend using String?

src/Source.jl Outdated
@@ -244,7 +244,7 @@ Keyword Arguments:
* `categorical::Bool=true`: read string column as a `CategoricalArray` ([ref](https://github.com/JuliaData/CategoricalArrays.jl)), as long as the % of unique values seen during type detection is less than 67%. This will dramatically reduce memory use in cases where the number of unique values is small.

Note by default, "string" or text columns will be parsed as the [`WeakRefString`](https://github.com/quinnj/WeakRefStrings.jl) type. This is a custom type that only stores a pointer to the actual byte data + the number of bytes.
To convert a `String` to a standard Julia string type, just call `string(::WeakRefString)`, this also works on an entire column.
To convert a `String` to a standard Julia string type, just call `string(::WeakRefString)` for an individual observation, or `string.(::WeakRefString)` on an entire column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This looks like a nice small clarification, so I feel sorry for my lengthy comment ;) :

  • I think String has to be fixed too since it is a "standard Julia string type".
  • string.(::WeakRefString) will work, but ::WeakRefString is not a type of the column

An alternative:
"
To convert wstr::WeakRefString to a standard Julia string type, just call string(wstr), or string.(df[:wstr]) to convert an entire wstr column of df table.
"

However, the problem with this approach is that missing values in df[:wstr] would be converted into "missing" strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, good catch. So I guess the only correct approach is convert(Union{String, Missing}, ::WeakRefString) and convert(Array{Union{String, Missing}}, ::AbstractArray{WeakRefString})?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the problem with this approach is that missing values in df[:wstr] would be converted into "missing" strings.

Not sure I follow... is that a problem? Or do you mean they'd become "" instead of Missings.missing?

Copy link
Contributor

@alyst alyst Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only correct approach is convert(Union{String, Missing}, ::WeakRefString) and convert(Array{Union{String, Missing}}, ::AbstractArray{WeakRefString})?

I also think so. More precisely, the scalar form convert(Union{String, Missing}, wstr) would give the correct result both for wstr::WeakRefString and wstr === missing.
But convert(Array{Union{String, Missing}}, wstr) should only be called for wstr::AbstractArray{Union{WeakRefString, Missing}} unless we want to allow missingness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickeubank

julia> string(Missings.missing)
"missing"

string.(wstr::AbstractArray{Union{WeakRefString, Missing}}) doesn't preserve missingness: your Missings.missing missing values in the array would become non-missing strings "missing"::String.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately

julia> using CSV, Missings
julia> df = CSV.read(joinpath(Pkg.dir("CSV"), "test/test_files/attenu.csv"), null="NA", rows_for_type_detect=200);
julia> eltype(df[3])
Union{Missings.Missing, WeakRefString{UInt8}}
julia> eltype(convert.(Union{String, Missing}, df[3]))
Any

Maybe it's somehow related to this

julia> typeof(df[1,3])
String
julia> collect(skipmissing(df[3]))
ERROR: TypeError: _collect: in typeassert, expected WeakRefString{UInt8}, got String
Stacktrace:
 [1] next at /home/astukalov/.julia/v0.6/Missings/src/Missings.jl:265 [inlined]
 [2] _collect(::UnitRange{Int64}, ::Missings.EachSkipMissing{WeakRefStrings.WeakRefStringArray{Union{Missings.Missing, WeakRefString{UInt8}},1}}, ::Base.HasEltype, ::Base.SizeUnknown) at ./array.jl:442
 [3] collect(::Missings.EachSkipMissing{WeakRefStrings.WeakRefStringArray{Union{Missings.Missing, WeakRefString{UInt8}},1}}) at ./array.jl:431
 [4] macro expansion at ./REPL.jl:97 [inlined]
 [5] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like WeakRefStringArray is lying about its eltype: it should be String.

Copy link
Contributor

@alyst alyst Nov 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If WeakRefString to String conversion is now supposed to be handled automatically, maybe we don't need this docstring section at all :)
I guess we need @quinnj assistance to clarify the situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And regarding the original problem that the eltype is Any, see JuliaLang/julia#24332.

So I guess the only solution for now is to recommend using either convert(Array{String}, x) or convert(Array{Union{String, Missing}}, x) depending on whether you want to support missing values or not.

@nalimilan
Copy link
Member

Sorry, the conversation was more complex than anticipated. Would you update the PR?

@nickeubank
Copy link
Author

@nalimilan Yup, sorry -- lost track of things. Submitting now.

@codecov
Copy link

codecov bot commented Dec 8, 2017

Codecov Report

Merging #118 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #118   +/-   ##
=======================================
  Coverage   86.71%   86.71%           
=======================================
  Files           8        8           
  Lines         858      858           
=======================================
  Hits          744      744           
  Misses        114      114
Impacted Files Coverage Δ
src/Source.jl 97.47% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2101155...aa38fd5. Read the comment docs.

@quinnj
Copy link
Member

quinnj commented Jan 6, 2018

Closing in favor of doc updates in #143

@quinnj quinnj closed this Jan 6, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants