Storing and retrieving data.frames #366

Closed
ThierryO wants to merge 52 commits into ropensci:master from ThierryO:datarepos

ThierryO
Member

This PR replaces PR #303 and solves issue #301. The functionality is described in a vignette, hence the extra knitr and rmarkdown dependencies.

Signed-off-by: Thierry Onkelinx <[email protected]>
Swapping two variables when rewriting a data.frame results in a large diff
while the information content of the data hasn't changed. Therefore the
variables are reordered to match the original order.

Signed-off-by: Thierry Onkelinx <[email protected]>
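
A minimal sketch of the reordering idea described in this commit message, using base R only. The helper name `match_column_order()` is hypothetical and is not a function from this PR or from git2r; it only illustrates putting the variables back in their original order before rewriting.

```r
## Hypothetical helper (not part of git2r): put the columns of `new`
## in the same order as the columns of `old`. Columns that exist only
## in `new` are appended at the end.
match_column_order <- function(new, old) {
    common <- intersect(names(old), names(new))
    extra <- setdiff(names(new), names(old))
    new[, c(common, extra), drop = FALSE]
}

old <- data.frame(x = 1:3, y = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

## Same information content, but the two variables are swapped.
new <- data.frame(y = c("a", "b", "c"), x = 1:3,
                  stringsAsFactors = FALSE)

reordered <- match_column_order(new, old)
names(reordered)
```

Writing `reordered` instead of `new` should leave the header and every row of the file in the same column order as before, so the diff would only show real changes.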
When a line is moved in a file, the resulting diff is a deletion at the
original location and an addition at the new location. Changing the order of
the observations in a data.frame does not change the information content.
Sorting the data before writing avoids unnecessary diffs.

Signed-off-by: Thierry Onkelinx <[email protected]>
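
The sorting step can be sketched as follows. The helper `write_sorted()` and its `sorting` argument are assumptions made for this illustration, not the interface implemented in the PR.

```r
## Hypothetical helper: sort the observations on the given variables
## before writing, so that the row order in the file does not depend
## on the row order in memory.
write_sorted <- function(df, file, sorting) {
    ## unname() prevents a column name from being matched to an
    ## argument of order(), e.g. a column called "decreasing".
    ord <- do.call(order, unname(as.list(df[sorting])))
    write.csv(df[ord, , drop = FALSE], file, row.names = FALSE)
}

df <- data.frame(id = c(3L, 1L, 2L), value = c("c", "a", "b"),
                 stringsAsFactors = FALSE)

## The same observations in a different order.
shuffled <- df[c(2, 3, 1), ]

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write_sorted(df, f1, sorting = "id")
write_sorted(shuffled, f2, sorting = "id")

## Both files have identical lines, so a diff between them is empty.
same_lines <- identical(readLines(f1), readLines(f2))
same_lines
```

The sorting variables need to identify each observation uniquely; otherwise ties can still end up in a different order between two writes.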
…within the sorting variables

Signed-off-by: Thierry Onkelinx <[email protected]>
ThierryO and others added 16 commits July 27, 2018 21:13
…tead of "data_repository"

Signed-off-by: Thierry Onkelinx <[email protected]>
dir.exists() is not available in R < 3.2.0

Signed-off-by: Thierry Onkelinx <[email protected]>
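
One possible way to avoid `dir.exists()` on older R versions is to fall back on `file.info()`. The sketch below is an assumption about how such a workaround could look, not the code from this commit; `dir_exists()` is a hypothetical name.

```r
## Hypothetical stand-in for dir.exists() that also works on R < 3.2.0.
## file.info()$isdir is NA for a path that does not exist.
dir_exists <- function(path) {
    isdir <- file.info(path)$isdir
    !is.na(isdir) & isdir
}

existing_dir <- dir_exists(tempdir())
missing_path <- dir_exists(tempfile())
```

Like `dir.exists()`, the helper is vectorised over `path` and returns `FALSE` both for regular files and for missing paths.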

coveralls commented Aug 22, 2018

Coverage Status

Coverage increased (+0.8%) to 82.526% when pulling b748146 on ThierryO:datarepos into e18e8f7 on ropensci:master.

Member

stewid commented Aug 23, 2018

Thanks @ThierryO.

I will try to find time in the next few days to go through the pull request.

Signed-off-by: Thierry Onkelinx <[email protected]>
Member

stewid commented Sep 5, 2018

Hi Thierry,

Thanks for the pull request. I apologize for the delay in reviewing it. The pull request implements functionality to handle data.frames and the associated metadata in a Git repository. I fully agree that it's important to place data under version control, and this PR clearly provides a workflow to facilitate this. My main concern is that the PR introduces much functionality that is unrelated to the core of git2r, i.e. to be an interface to the libgit2 library. Therefore, I suggest that this work is bundled in a separate package that imports git2r.

I also have a technical comment. The vignette says that "Git stores the version history under the form of diffs: a list of lines which are deleted and a list of lines which are inserted at a specific line number in a file.". That's actually not correct: Git stores the entire content of each file (for efficiency, files may later be compressed into 'pack files'). The diff is calculated on the fly. I have created an example to illustrate this, with some functions that read the internal content of a commit, tree and blob using base R. To read more about the git-internals, see https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

library("git2r")

## Create a directory in tempdir
path <- tempfile(pattern="git2r-")
dir.create(path)

## Initialize a repository
repo <- init(path)
config(repo, user.name="Alice", user.email="[email protected]")

## Create a csv file with data.
df <- data.frame(letters = c("a", "b", "c", "d", "e"),
                 numbers = 1:5,
                 date = as.Date(c("2018-08-20",
                                  "2018-08-21",
                                  "2018-08-22",
                                  "2018-08-23",
                                  "2018-08-24")))
write.csv(df, file.path(workdir(repo), "test.csv"), row.names = FALSE)

## Add the file to the repository and commit
add(repo, "test.csv")
commit(repo, "Initial commit")
#> [740b088] 2018-09-05: Initial commit

## A utility function to read a git commit or git blob data object.
read_git_object <- function(filename) {
    ## Read compressed data
    n <- file.info(filename)$size
    data <- readBin(filename, what = "raw", n = n)
    data <- memDecompress(data, "gzip")

    ## Find the first "\0" in data; it separates the header from the content
    null_byte <- which(data == 0)[1]

    ## Determine the type and size of the git object
    header <- rawToChar(data[seq_len(null_byte-1)])

    ## Read content
    i <- seq(from = null_byte + 1, to = length(data))
    content <- readLines(textConnection(rawToChar(data[i])))

    cat(c(header, content), sep = "\n")

    invisible(NULL)
}

## Read the contents of a git commit object.
read_git_commit <- function(filename) {
    read_git_object(filename)
}

## Read the contents of a git blob object.
read_git_blob <- function(filename) {
    read_git_object(filename)
}

## Read the contents of a git tree object.
read_git_tree <- function(filename) {
    ## Read compressed data
    n <- file.info(filename)$size
    data <- readBin(filename, what = "raw", n = n)
    data <- memDecompress(data, "gzip")

    i <- min(which(data == 0))
    content <- rawToChar(data[1:(i-1)])

    while (i < length(data)) {
        j <- which(data == 0)
        j <- min(j[j > i])
        name <- rawToChar(data[(i+1):(j-1)])

        sha <- as.integer(data[(j+1):(j+20)])
        sha <- sprintf("%02x", sha)
        sha <- paste0(sha, collapse = "")

        content <- c(content, paste(name, sha))
        i <- j + 20
    }

    cat(content, sep = "\n")

    invisible(NULL)
}

## Let us now inspect the git commit object to see what 'Git' stores.
filename <- file.path(repo$path,                              ## Repo path
                      "objects",                              ## Objects db
                      substr(sha(last_commit(repo)), 1, 2),   ## Subdirectory
                      substr(sha(last_commit(repo)), 3, 40))  ## Filename

read_git_commit(filename)
#> commit 164
#> tree bfb7a2dfa597b2870e4634cd5c55271114527646
#> author Alice <[email protected]> 1536173471 +0200
#> committer Alice <[email protected]> 1536173471 +0200
#> 
#> Initial commit

## Inspect the tree to determine the sha-1 of the blob.
filename <- file.path(repo$path,
                      "objects",
                      substr(sha(tree(last_commit(repo))), 1, 2),
                      substr(sha(tree(last_commit(repo))), 3, 40))

read_git_tree(filename)
#> tree 36
#> 100644 test.csv ef69efbfe95512ce19af7138e6ec344ff64cfa67

## Now inspect the blob with the data.frame.
filename <- file.path(repo$path,
                      "objects",
                      substr(sha(tree(last_commit(repo))[1]), 1, 2),
                      substr(sha(tree(last_commit(repo))[1]), 3, 40))

read_git_blob(filename)
#> blob 112
#> "letters","numbers","date"
#> "a",1,2018-08-20
#> "b",2,2018-08-21
#> "c",3,2018-08-22
#> "d",4,2018-08-23
#> "e",5,2018-08-24

## Let's remove the third row in the data.frame and inspect the git
## blob.
write.csv(df[-3, ], file.path(workdir(repo), "test.csv"), row.names = FALSE)

## Add the file to the repository and commit
add(repo, "test.csv")
commit(repo, "Remove third row")
#> [ecd139a] 2018-09-05: Remove third row

filename <- file.path(repo$path,
                      "objects",
                      substr(sha(tree(last_commit(repo))[1]), 1, 2),
                      substr(sha(tree(last_commit(repo))[1]), 3, 40))

## The entire content of 'test.csv' is written to the blob.
read_git_blob(filename)
#> blob 95
#> "letters","numbers","date"
#> "a",1,2018-08-20
#> "b",2,2018-08-21
#> "d",4,2018-08-23
#> "e",5,2018-08-24

## Display the diff between the two commits.
cat(diff(tree(commits(repo)[[2]]),
         tree(commits(repo)[[1]]),
         as_char = TRUE))
#> diff --git a/test.csv b/test.csv
#> index ef69efb..3bc6b78 100644
#> --- a/test.csv
#> +++ b/test.csv
#> @@ -1,6 +1,5 @@
#>  "letters","numbers","date"
#>  "a",1,2018-08-20
#>  "b",2,2018-08-21
#> -"c",3,2018-08-22
#>  "d",4,2018-08-23
#>  "e",5,2018-08-24

Created on 2018-09-05 by the reprex package (v0.2.0).
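
The layout that `read_git_object()` parses can also be reproduced without a repository. The following base R sketch builds a header, a null byte and some content by hand, compresses it with the same `"gzip"` type used above, and splits it again. The content is made up for illustration and no SHA-1 is computed, so this is not a complete Git object writer.

```r
## Uncompressed layout: "<type> <size>", a null byte, then the content.
content <- charToRaw("a,b\n1,2\n")
header <- charToRaw(paste("blob", length(content)))
object <- c(header, as.raw(0), content)

## Round trip through the compression type used by read_git_object().
compressed <- memCompress(object, type = "gzip")
restored <- memDecompress(compressed, type = "gzip")

## Split on the first null byte, as read_git_object() does.
null_byte <- which(restored == 0)[1]
restored_header <- rawToChar(restored[seq_len(null_byte - 1)])
restored_content <- rawToChar(restored[-seq_len(null_byte)])

cat(restored_header, restored_content, sep = "\n")
```

The header records the size of the complete content, which is consistent with the point above: the object holds the whole file, not a list of changed lines.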

Kind regards
Stefan

Member Author

ThierryO commented Sep 6, 2018

Hi Stefan,

I agree that it is not the core business of libgit2. But it seems a nice addition, as stated by you in #303 and by a few other people who reviewed this (ThierryO#1). The extra functionality is quite lightweight, so it shouldn't place much burden on git2r.

Putting the functionality in a separate package would be overkill IMHO. It has only a few functions and it would add an extra dependency for the users. Having the functionality in git2r has the benefit that it will be much more likely to be used.

I stand corrected on the diff topic.

Member

jennybc commented Sep 6, 2018

Pardon my butting in uninvited (I watch this repo). But as someone who depends on git2r in a few ways, I tend to agree with @stewid's inclination to stay a fairly minimal wrapper around libgit2. I think the "git2r extension package" is an interesting idea and don't think it's overkill even for the functionality already in this PR.

There is a lot of interest in versioning data, so this PR is squarely in that space. Being its own extension package would give it room to grow organically, as I think some lightweight tools in that space would be very well-received.

Member Author

ThierryO commented Sep 6, 2018

After a discussion with my coworkers, we decided to transform this PR into a standalone package: git2rdata.

5 participants