Storing and retrieving data.frames #366

ThierryO · 2018-08-22T11:40:45Z

This PR replaces PR #303 and solves issue #301. The functionality is described in a vignette, hence the extra knitr and rmarkdown dependencies.

Swapping two variables when rewriting a data.frame results in a large diff even though the information content of the data hasn't changed. Therefore the variables are reordered to match the original order. Signed-off-by: Thierry Onkelinx <[email protected]>

When a line is moved in a file, the resulting diff is a deletion at the original location and an addition at the new location. Changing the order of the observations in a data.frame does not change the information content. Sorting the data before writing avoids unnecessary diffs. Signed-off-by: Thierry Onkelinx <[email protected]>
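The two commit messages above describe reordering variables and sorting observations before writing. A minimal base R sketch of that idea follows; `write_stable()`, `sort_by` and `col_order` are hypothetical names used for illustration only and are not the PR's `write_delim_git()` implementation.

```r
## Hypothetical helper: write a data.frame in a canonical row and column
## order, so that files holding the same information are byte-identical.
write_stable <- function(df, file, sort_by, col_order) {
    ## Sort the observations on the sorting variables
    rows <- do.call(order, unname(as.list(df[sort_by])))
    ## Put the variables in the order of the version that is already stored
    df <- df[rows, col_order, drop = FALSE]
    write.table(df, file, sep = "\t", row.names = FALSE, quote = FALSE)
}

df1 <- data.frame(id = c(2L, 1L, 3L), value = c("b", "a", "c"),
                  stringsAsFactors = FALSE)
## Same information, but with rows shuffled and variables swapped
df2 <- df1[c(3, 1, 2), c("value", "id")]

f1 <- tempfile()
f2 <- tempfile()
write_stable(df1, f1, sort_by = "id", col_order = c("id", "value"))
write_stable(df2, f2, sort_by = "id", col_order = c("id", "value"))

identical(readLines(f1), readLines(f2))
#> [1] TRUE
```

Because both files are identical, committing the second one after the first would not produce a diff.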

coveralls · 2018-08-22T12:02:58Z

Coverage increased (+0.8%) to 82.526% when pulling b748146 on ThierryO:datarepos into e18e8f7 on ropensci:master.

stewid · 2018-08-23T21:05:16Z

Thanks @ThierryO.

I will try to find time the next few days to go through the pull request.

stewid · 2018-09-05T18:57:01Z

Hi Thierry,

Thanks for the pull request. I apologize for the delay in reviewing it. The pull request implements functionality to handle data.frames and the associated metadata in a Git repository. I fully agree that it's important to place data under version control, and this PR clearly provides a workflow to facilitate this. My main concern is that the PR introduces much functionality that is unrelated to the core of git2r, i.e. to be an interface to the libgit2 library. Therefore, I suggest that this work is bundled in a separate package that imports git2r.

I also have a technical comment. The vignette says that "Git stores the version history under the form of diffs: a list of lines which are deleted and a list of lines which are inserted at a specific line number in a file.". That's actually not correct: Git stores the entire content of each file (for efficiency, files may later be compressed into 'pack files'). The diff is calculated on the fly. I have created an example to illustrate this, with some functions that read the internal content of a commit, tree and blob using base R. To read more about the Git internals, see https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

library("git2r")

## Create a directory in tempdir
path <- tempfile(pattern="git2r-")
dir.create(path)

## Initialize a repository
repo <- init(path)
config(repo, user.name="Alice", user.email="[email protected]")

## Create a csv file with data.
df <- data.frame(letters = c("a", "b", "c", "d", "e"),
                 numbers = 1:5,
                 date = as.Date(c("2018-08-20",
                                  "2018-08-21",
                                  "2018-08-22",
                                  "2018-08-23",
                                  "2018-08-24")))
write.csv(df, file.path(workdir(repo), "test.csv"), row.names = FALSE)

## Add the file to the repository and commit
add(repo, "test.csv")
commit(repo, "Initial commit")
#> [740b088] 2018-09-05: Initial commit

## A utility function to read a git commit or git blob data object.
read_git_object <- function(filename) {
    ## Read compressed data
    n <- file.info(filename)$size
    data <- readBin(filename, what = "raw", n = n)
    data <- memDecompress(data, "gzip")

    ## Find the first "\0" in data; it separates the header from the content
    null_byte <- which(data == 0)[1]

    ## Determine the type of the git object
    header <- rawToChar(data[seq_len(null_byte - 1)])

    ## Read content
    i <- seq(from = null_byte + 1, to = length(data))
    content <- readLines(textConnection(rawToChar(data[i])))

    cat(c(header, content), sep = "\n")

    invisible(NULL)
}

## Read the contents of a git commit object.
read_git_commit <- function(filename) {
    read_git_object(filename)
}

## Read the contents of a git blob object.
read_git_blob <- function(filename) {
    read_git_object(filename)
}

## Read the contents of a git tree object.
read_git_tree <- function(filename) {
    ## Read compressed data
    n <- file.info(filename)$size
    data <- readBin(filename, what = "raw", n = n)
    data <- memDecompress(data, "gzip")

    i <- min(which(data == 0))
    content <- rawToChar(data[1:(i-1)])

    while (i < length(data)) {
        j <- which(data == 0)
        j <- min(j[j > i])
        name <- rawToChar(data[(i+1):(j-1)])

        sha <- as.integer(data[(j+1):(j+20)])
        sha <- sprintf("%02x", sha)
        sha <- paste0(sha, collapse = "")

        content <- c(content, paste(name, sha))
        i <- j + 20
    }

    cat(content, sep = "\n")

    invisible(NULL)
}

## Let us now inspect the git commit object to see what 'Git' stores.
filename <- file.path(repo$path,                              ## Repo path
                      "objects",                              ## Objects db
                      substr(sha(last_commit(repo)), 1, 2),   ## Subdirectory
                      substr(sha(last_commit(repo)), 3, 40))  ## Filename

read_git_commit(filename)
#> commit 164
#> tree bfb7a2dfa597b2870e4634cd5c55271114527646
#> author Alice <[email protected]> 1536173471 +0200
#> committer Alice <[email protected]> 1536173471 +0200
#> 
#> Initial commit

## Inspect the tree to determine the sha-1 of the blob.
filename <- file.path(repo$path,
                      "objects",
                      substr(sha(tree(last_commit(repo))), 1, 2),
                      substr(sha(tree(last_commit(repo))), 3, 40))

read_git_tree(filename)
#> tree 36
#> 100644 test.csv ef69efbfe95512ce19af7138e6ec344ff64cfa67

## Now inspect the blob with the data.frame.
filename <- file.path(repo$path,
                      "objects",
                      substr(sha(tree(last_commit(repo))[1]), 1, 2),
                      substr(sha(tree(last_commit(repo))[1]), 3, 40))

read_git_blob(filename)
#> blob 112
#> "letters","numbers","date"
#> "a",1,2018-08-20
#> "b",2,2018-08-21
#> "c",3,2018-08-22
#> "d",4,2018-08-23
#> "e",5,2018-08-24

## Let's remove the third row in the data.frame and inspect the git
## blob.
write.csv(df[-3, ], file.path(workdir(repo), "test.csv"), row.names = FALSE)

## Add the file to the repository and commit
add(repo, "test.csv")
commit(repo, "Remove third row")
#> [ecd139a] 2018-09-05: Remove third row

filename <- file.path(repo$path,
                      "objects",
                      substr(sha(tree(last_commit(repo))[1]), 1, 2),
                      substr(sha(tree(last_commit(repo))[1]), 3, 40))

## The entire content of 'test.csv' is written to the blob.
read_git_blob(filename)
#> blob 95
#> "letters","numbers","date"
#> "a",1,2018-08-20
#> "b",2,2018-08-21
#> "d",4,2018-08-23
#> "e",5,2018-08-24

## Display the diff between the two commits.
cat(diff(tree(commits(repo)[[2]]),
         tree(commits(repo)[[1]]),
         as_char = TRUE))
#> diff --git a/test.csv b/test.csv
#> index ef69efb..3bc6b78 100644
#> --- a/test.csv
#> +++ b/test.csv
#> @@ -1,6 +1,5 @@
#>  "letters","numbers","date"
#>  "a",1,2018-08-20
#>  "b",2,2018-08-21
#> -"c",3,2018-08-22
#>  "d",4,2018-08-23
#>  "e",5,2018-08-24

Created on 2018-09-05 by the reprex package (v0.2.0).
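The object layout that `read_git_object()` parses above (a header, a null byte, then the full file content) can be sketched in the other direction with base R only. This is an illustrative round trip under the assumption that the layout is exactly as the helper reads it; it does not write into a repository and does not compute the object's SHA-1.

```r
## Build an uncompressed blob object: "<type> <size>", a null byte, the content
content <- charToRaw("\"letters\",\"numbers\"\n\"a\",1\n")
object <- c(charToRaw(sprintf("blob %d", length(content))),
            as.raw(0),
            content)

## Compress with the same type that read_git_object() uses to decompress
stored <- memCompress(object, "gzip")

## Reverse the steps: decompress and split the header from the content
restored <- memDecompress(stored, "gzip")
null_byte <- which(restored == 0)[1]
header <- rawToChar(restored[seq_len(null_byte - 1)])
body <- restored[seq(from = null_byte + 1, to = length(restored))]

header
#> [1] "blob 26"
identical(body, content)
#> [1] TRUE
```

The size in the header is the byte length of the whole file content, which matches the `blob 112` and `blob 95` headers in the reprex: removing one 17-byte row shrinks the stored content by 17 bytes, and nothing resembling a diff is stored.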

Kind regards
Stefan

ThierryO · 2018-09-06T07:42:54Z

Hi Stefan,

I agree that it is not the core business of libgit2. But it seems a nice addition, as stated by you in #303 and by a few other people who reviewed this (ThierryO#1). The extra functionality is quite lightweight, so it shouldn't place much burden on git2r.

Putting the functionality in a separate package would be overkill IMHO. It has only a few functions and it would add an extra dependency for the users. Having the functionality in git2r has the benefit that it will be much more likely to be used.

I stand corrected on the diff topic.

jennybc · 2018-09-06T13:03:24Z

Pardon my butting in uninvited (I watch this repo). But as someone who depends on git2r in a few ways, I tend to agree with @stewid's inclination to stay a fairly minimal wrapper around libgit2. I think the "git2r extension package" is an interesting idea and don't think it's overkill even for the functionality already in this PR.

There is a lot of interest in versioning data, so this PR is squarely in that space. Being its own extension package would give it room to grow organically, as I think some lightweight tools in that space would be very well-received.

ThierryO · 2018-09-06T14:26:14Z

After a discussion with my coworkers, we decided to transform this PR into a standalone package: git2rdata.

ThierryO added 30 commits July 23, 2018 13:33

add a non exported meta() function which prepares the vectors for sto…

2d000ec

…rage Signed-off-by: Thierry Onkelinx <[email protected]>

Add write_delim_git()

03d8f6d

Signed-off-by: Thierry Onkelinx <[email protected]>

Add read_delim_git()

ec2b5fd

Signed-off-by: Thierry Onkelinx <[email protected]>

write_delim_git() and read_delim_git() handle directories with dots c…

e2d0e0e

…orrectly Signed-off-by: Thierry Onkelinx <[email protected]>

repository() gains a "project" argument

ecbf3f4

Signed-off-by: Thierry Onkelinx <[email protected]>

write_delim_git() and read_delim_git() use the project concept

9086194

Signed-off-by: Thierry Onkelinx <[email protected]>

init() and clone() gain the "project" argument

18b38bb

Signed-off-by: Thierry Onkelinx <[email protected]>

export meta()

7b490df

Signed-off-by: Thierry Onkelinx <[email protected]>

workdir() takes "project" into account when set

a19f3a3

Signed-off-by: Thierry Onkelinx <[email protected]>

move meta() and read_delim_git() and use 4 spaces per tab

66817dc

Signed-off-by: Thierry Onkelinx <[email protected]>

write_delim_git() stages the files

7dfa131

Signed-off-by: Thierry Onkelinx <[email protected]>

status() takes "project" into account

ebcc9c8

Signed-off-by: Thierry Onkelinx <[email protected]>

rm_file() handles data repos

a9349ca

Signed-off-by: Thierry Onkelinx <[email protected]>

add is_data_repo()

6cf0737

Signed-off-by: Thierry Onkelinx <[email protected]>

Fix typo

3bfceb9

Signed-off-by: Thierry Onkelinx <[email protected]>

reset() handles data repositories

cdb3e81

Signed-off-by: Thierry Onkelinx <[email protected]>

fix bugs

57df033

Signed-off-by: Thierry Onkelinx <[email protected]>

add unit tests for data repositories

8b90f13

Signed-off-by: Thierry Onkelinx <[email protected]>

Create a "data_repository" class

8566c41

Signed-off-by: Thierry Onkelinx <[email protected]>

update unit tests

2f24fcb

Signed-off-by: Thierry Onkelinx <[email protected]>

rm_file() can delete all .tsv. or all .yml files

edf9ee7

Signed-off-by: Thierry Onkelinx <[email protected]>

meta.character() checks for 'NA' values

84c96f9

Signed-off-by: Thierry Onkelinx <[email protected]>

allow the storage of logicals in data repositories

1599206

Signed-off-by: Thierry Onkelinx <[email protected]>

data repositories handle complex data

091f9b0

Signed-off-by: Thierry Onkelinx <[email protected]>

data repositories handle POSIXct timestamps

b550898

Signed-off-by: Thierry Onkelinx <[email protected]>

add draft version of vignette

98514f1

Signed-off-by: Thierry Onkelinx <[email protected]>

write_delim_git() yields a warning in case of duplicate observations …

a60c248

…within the sorting variables Signed-off-by: Thierry Onkelinx <[email protected]>

don't return the call with data_repository errors

c220999

Signed-off-by: Thierry Onkelinx <[email protected]>

ThierryO and others added 16 commits July 27, 2018 21:13

write_delim_git() returns the hashes of the files

6da3722

Signed-off-by: Thierry Onkelinx <[email protected]>

data repositories handles the Date class

cbf21f1

Signed-off-by: Thierry Onkelinx <[email protected]>

Vignette & rm_file documentation: fix typos & language

6a5341d

Signed-off-by: Floris Vanderhaeghe <[email protected]>

Vignette: copy over explanation from the rm_file documentation

d20537f

Signed-off-by: Thierry Onkelinx <[email protected]>

Vignette: replace 'format' by 'extension'

04b8ca0

Signed-off-by: Thierry Onkelinx <[email protected]>

write_delim_git() and read_delim_git() work with "git_repository" ins…

1ae9f9c

…tead of "data_repository" Signed-off-by: Thierry Onkelinx <[email protected]>

remove the data_repository class

dea6be9

Signed-off-by: Thierry Onkelinx <[email protected]>

add rm_data()

5281e40

Signed-off-by: Thierry Onkelinx <[email protected]>

update vignette

42bddcd

write_delim_git() gains a stage argument

cbc9829

Signed-off-by: Thierry Onkelinx <[email protected]>

write_delim_git() gains a optimize argument

f476e18

Signed-off-by: Thierry Onkelinx <[email protected]>

revert unneeded changes

d78f9f1

Signed-off-by: Thierry Onkelinx <[email protected]>

Add more unit tests

ee881ca

Signed-off-by: Thierry Onkelinx <[email protected]>

More work on the vignette

3fe6a6f

Signed-off-by: Thierry Onkelinx <[email protected]>

bugfixes

d26aaf9

Signed-off-by: Thierry Onkelinx <[email protected]>

use file.exists() instead of dir.exists()

24dc997

dir.exists() is not available in R < 3.2.0 Signed-off-by: Thierry Onkelinx <[email protected]>
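A minimal sketch of a directory check that avoids `dir.exists()` on older R versions, in the spirit of the commit above. `dir_exists_compat()` is a hypothetical name and is not the code from the commit, which only states that `file.exists()` is used instead.

```r
## Hypothetical helper: TRUE only when 'path' exists and is a directory.
## Relies on file.exists() and file.info(), so it does not need dir.exists().
dir_exists_compat <- function(path) {
    file.exists(path) && isTRUE(file.info(path)$isdir)
}

## An existing directory
dir_exists_compat(tempdir())

## A regular file is not a directory
f <- tempfile()
writeLines("x", f)
dir_exists_compat(f)
```

Note that `file.exists()` on its own also returns `TRUE` for regular files, so the extra `isdir` check matters whenever a file and a directory could share a name.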

ThierryO added 3 commits August 24, 2018 09:50

fix typos

51c4d7e

Signed-off-by: Thierry Onkelinx <[email protected]>

add recent_commit()

fa8cddb

Signed-off-by: Thierry Onkelinx <[email protected]>

try to debug unit test on recent_commit()

1ddbb14

Signed-off-by: Thierry Onkelinx <[email protected]>

ThierryO force-pushed the datarepos branch from 429dc38 to 1ddbb14 Compare August 29, 2018 12:21

ThierryO added 2 commits August 29, 2018 14:47

use forward slashes when normalising path

b427175

Signed-off-by: Thierry Onkelinx <[email protected]>

clean_data_path() gains a normalize argument

b748146

Signed-off-by: Thierry Onkelinx <[email protected]>

ThierryO closed this Sep 6, 2018

ThierryO mentioned this pull request Nov 14, 2018

git2rdata, a companion package for git2r ropensci/software-review#263

Closed

19 tasks
