This repository has been archived by the owner on May 4, 2019. It is now read-only.

Port to Nulls.jl (#288)
Replace NA with Nulls.null and NAtype with Nulls.Null. Use Nulls.levels
instead of defining our own function. Rename all functions and arguments
to use "null" instead of "na", with deprecations.

Rounding and transpose operations have been moved to Nulls,
but drop functions from SpecialFunctions, as we don't want Nulls to
depend on SpecialFunctions and keeping them in DataArrays
would be type piracy.

Deprecate dropnull(x) in favor of efficient specialization of
collect(Nulls.drop(x)). Unexport all iterators, which are
an implementation detail and should be used via similar Nulls functions.

Stop exporting nonexistent head() and tail() functions.
Remove method redundant with ==(::AbstractArray{>:Null}, ::AbstractArray{>:Null}).
nalimilan authored Oct 19, 2017
1 parent 8b9e896 commit 8a7003b
Showing 53 changed files with 1,269 additions and 1,426 deletions.
22 changes: 7 additions & 15 deletions README.md
@@ -11,30 +11,22 @@ Documentation:
[![](https://img.shields.io/badge/docs-stable-blue.svg)](https://JuliaStats.github.io/DataArrays.jl/stable)
[![](https://img.shields.io/badge/docs-latest-blue.svg)](https://JuliaStats.github.io/DataArrays.jl/latest)

The DataArrays package provides array types for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
In particular, it provides the following:

The DataArrays package extends Julia by introducing data structures that can contain missing data. In particular, the package introduces three new data types to Julia:

* `NA`: A singleton type that represents a single missing value.
* `DataArray{T}`: An array-like data structure that can contain values of type `T`, but can also contain missing values.
* `PooledDataArray{T}`: A variant of `DataArray{T}` optimized for representing arrays that contain many repetitions of a small number of unique values -- as commonly occurs when working with categorical data.

# The `NA` Value

Many languages represent missing values using a reserved value like `NULL` or `NA`. A missing integer value, for example, might be represented as a `NULL` value in SQL or as an `NA` value in R.

Julia takes its conception of `NA` from R, where `NA` denotes missingness based on lack of information. If, for example, we were to measure people's heights as integers, an `NA` might reflect our ignorance of a specific person's height.

Conceptualizing the use of `NA` as a signal of uncertainty will help you understand how `NA` interacts with other values. For example, it explains why `NA + 1` is `NA`, but `NA & false` is `false`. In general, `NA` corrupts any computation whose results cannot be determined without knowledge of the value that is `NA`.

# DataArray's

Most Julian arrays cannot contain `NA` values: only `Array{NAtype}` and heterogeneous Arrays can contain `NA` values. Of these, only heterogeneous arrays could contain values of any type other than `NAtype`.
Most Julian arrays cannot contain `null` values: only `Array{Union{T, Null}}` and more generally `Array{>:Null}` can contain `null` values.

The generic use of heterogeneous Arrays is discouraged in Julia because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `NA` values.
The generic use of heterogeneous `Array` is discouraged in Julia versions below 0.7 because it is inefficient: accessing any value requires dereferencing a pointer. The `DataArray` type allows one to work around this inefficiency by providing tightly-typed arrays that can contain values of exactly one type, but can also contain `null` values.

For example, a `DataArray{Int}` can contain integers and NA values. We can construct one as follows:
For example, a `DataArray{Int}` can contain integers and `null` values. We can construct one as follows:

da = @data([1, 2, NA, 4])
da = @data([1, 2, null, 4])

# PooledDataArray's

1 change: 1 addition & 0 deletions REQUIRE
@@ -1,4 +1,5 @@
julia 0.6
Nulls 0.1.2
StatsBase 0.15.0
Reexport
SpecialFunctions
8 changes: 4 additions & 4 deletions benchmark/operators.jl
@@ -6,11 +6,11 @@ srand(1776)

const TEST_NAMES = [
"Vector",
"DataVector No NA",
"DataVector Half NA",
"DataVector No null",
"DataVector Half null",
"Matrix",
"DataMatrix No NA",
"DataMatrix Half NA"
"DataMatrix No null",
"DataMatrix Half null"
]

function make_test_types(genfunc, sz)
12 changes: 6 additions & 6 deletions benchmark/reduce.jl
@@ -6,10 +6,10 @@ srand(1776)

const TEST_NAMES = [
"Vector",
"DataVector No NA skipna=false",
"DataVector No NA skipna=true",
"DataVector Half NA skipna=false",
"DataVector Half NA skipna=true"
"DataVector No null skipnull=false",
"DataVector No null skipnull=true",
"DataVector Half null skipnull=false",
"DataVector Half null skipnull=true"
]

function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, replications)
println($fn)
fns = [()->$fn(Data[1]),
()->$fn(Data[2]),
()->$fn(Data[2]; skipna=true),
()->$fn(Data[2]; skipnull=true),
()->$fn(Data[3]),
()->$fn(Data[3]; skipna=true)]
()->$fn(Data[3]; skipnull=true)]
gc_disable()
df = compare(fns, $replications)
gc_enable()
12 changes: 6 additions & 6 deletions benchmark/reducedim.jl
@@ -6,10 +6,10 @@ srand(1776)

const TEST_NAMES = [
"Matrix",
"DataMatrix No NA skipna=false",
"DataMatrix No NA skipna=true",
"DataMatrix Half NA skipna=false",
"DataMatrix Half NA skipna=true"
"DataMatrix No null skipnull=false",
"DataMatrix No null skipnull=true",
"DataMatrix Half null skipnull=false",
"DataMatrix Half null skipnull=true"
]

function make_test_types(genfunc, sz)
@@ -29,9 +29,9 @@ macro perf(fn, dim, replications)
println($fn, " (region = ", $dim, ")")
fns = [()->$fn(Data[1], $dim),
()->$fn(Data[2], $dim),
()->$fn(Data[2], $dim; skipna=true),
()->$fn(Data[2], $dim; skipnull=true),
()->$fn(Data[3], $dim),
()->$fn(Data[3], $dim; skipna=true)]
()->$fn(Data[3], $dim; skipnull=true)]
gc_disable()
df = compare(fns, $replications)
gc_enable()
11 changes: 1 addition & 10 deletions docs/src/da.md
@@ -1,14 +1,7 @@
# Representing missing data

```@meta
CurrentModule = DataArrays
```

```@docs
NA
NAtype
```

## Arrays with possibly missing data

```@docs
@@ -19,9 +12,7 @@ DataArray
DataVector
DataMatrix
@data
isna
dropna
padna
padnull
levels
```

7 changes: 3 additions & 4 deletions docs/src/index.md
@@ -1,11 +1,10 @@
# DataArrays.jl

This package provides functionality for working with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia.
This package provides array types for working efficiently with [missing data](https://en.wikipedia.org/wiki/Missing_data)
in Julia, based on the `null` value from the [Nulls.jl](https://github.com/JuliaData/Nulls.jl) package.
In particular, it provides the following:

* `NA`: A singleton representing a missing value
* `DataArray{T}`: An array type that can house both values of type `T` and missing values
* `DataArray{T}`: An array type that can house both values of type `T` and missing values (of type `Null`)
* `PooledDataArray{T}`: An array type akin to `DataArray` but optimized for arrays with a smaller set of unique
values, as commonly occurs with categorical data

27 changes: 10 additions & 17 deletions spec/literals.md
@@ -19,51 +19,44 @@ Julia's parser rewrites both of these literals as calls to the `vcat`
function. The `vcat` function computes the tightest type that would
enclose all of the values in the literal array. (REVISE)

Because of the strange place occupied by `NAtype` in Julia's type
hierarchy, the tightest type that would enclose any literal array
containing a single `NA` would be `Any`, which is not very useful.
As such, the DataArrays package needs to provide an alternative
tool for writing out literal DataArray's.

This is accomplished by using two macros, `@data` and `@pdata`,
which rewrite array literals into a form that will allow proper
typing.
Two macros, `@data` and `@pdata`, rewrite array literals into a form
that will allow direct construction of `DataArray`s and `PooledDataArray`s.

# Basic Principle

The basic mechanism that powers the `@data` and `@pdata` macros is the
rewriting of array literals as a call to DataArray or PooledDataArray
with a rewritten array literal and a Boolean mask that specifies where
`NA` occurred in the original literal.
`null` occurred in the original literal.

For example,

@data [1, 2, NA, 4]
@data [1, 2, null, 4]

will be rewritten as,

DataArray([1, 2, 1, 4], [false, false, true, false])

Note the added `1` created during the rewriting of the array literal.
This value is called a `stub` and is always the first value found
in the literal array that is not `NA`. The use of stubs explains two
in the literal array that is not `null`. The use of stubs explains two
important properties of the `@data` and `@pdata` macros:

* If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure funcion may be called more times than expected.
* It is not possible to specify a literal DataArray that contains only `NA` values.
* None of the variables used in a literal array can be called `NA`. This is just good style anyway, so it is not much of a limitation.
* It is not possible to specify a literal DataArray that contains only `null` values.
* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.
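
The stub-and-mask idea described above can be sketched in plain Julia. This is only an illustration under stated assumptions: `stub_and_mask` is a hypothetical helper, not part of DataArrays, and `nothing` stands in for `null` so the snippet needs no packages. The real macros rewrite expressions at macro-expansion time rather than inspecting values at run time.

```julia
# Hypothetical helper (not part of DataArrays): computes the stub-filled values
# and the Boolean mask for the entries of a literal. `nothing` stands in for
# `null` here, which is an assumption made to keep the sketch dependency-free.
function stub_and_mask(items)
    # true wherever the original literal held a missing entry
    mask = Bool[x === nothing for x in items]

    # the stub is the first entry that is not missing
    stub = nothing
    for x in items
        if x !== nothing
            stub = x
            break
        end
    end
    stub === nothing && error("literal contains only missing entries, so no stub exists")

    # replace each missing entry with the stub so the array stays tightly typed
    vals = [x === nothing ? stub : x for x in items]
    return vals, mask
end

vals, mask = stub_and_mask(Any[1, 2, nothing, 4])
```

Here `vals` is `[1, 2, 1, 4]` and `mask` is `[false, false, true, false]`, matching the two arguments of the rewritten `DataArray` call shown earlier. The sketch copies a value, whereas the macro duplicates the stub expression itself, which is why an impure function call in the literal may run more often than expected. The error branch corresponds to the limitation that a literal containing only `null` has no stub.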

# Limitations

We restate the limitations noted above:

* If the entries of the array literal are not fixed values, but function calls, these function calls must be pure. Otherwise the impure funcion may be called more times than expected.
* It is not possible to specify a literal DataArray that contains only `NA` values.
* None of the variables used in a literal array can be called `NA`. This is just good style anyway, so it is not much of a limitation.
* It is not possible to specify a literal DataArray that contains only `null` values.
* None of the variables used in a literal array can be called `null`. This is just good style anyway, so it is not much of a limitation.


Note that the latter limitation is not very important, because a DataArray
with only `NA` values is already problematic because it has no well-defined
with only `null` values is already problematic because it has no well-defined
type in Julia.

One final limitation is that the rewriting rules are not able to
21 changes: 3 additions & 18 deletions src/DataArrays.jl
@@ -4,6 +4,7 @@ module DataArrays
using Base: promote_op
using Base.Cartesian, Reexport
@reexport using StatsBase
@reexport using Nulls
using SpecialFunctions

const DEFAULT_POOLED_REF_TYPE = UInt32
@@ -25,23 +26,10 @@ module DataArrays
DataArray,
DataMatrix,
DataVector,
dropna,
each_failna,
each_dropna,
each_replacena,
EachFailNA,
EachDropNA,
EachReplaceNA,
FastPerm,
getpoolidx,
gl,
head,
isna,
levels,
NA,
NAException,
NAtype,
padna,
padnull,
pdata,
PooledDataArray,
PooledDataMatrix,
@@ -51,11 +39,9 @@ module DataArrays
rep,
replace!,
setlevels!,
setlevels,
tail
setlevels

include("utils.jl")
include("natype.jl")
include("abstractdataarray.jl")
include("dataarray.jl")
include("pooleddataarray.jl")
@@ -71,7 +57,6 @@ module DataArrays
include("extras.jl")
include("grouping.jl")
include("statistics.jl")
include("predicates.jl")
include("literals.jl")
include("deprecated.jl")
end