Git tracked bare DVC repo (only tracking .dvc file, but don't checkout real file) #8626
Replies: 6 comments
-
Have you thought about using
That looks like a rather specific use case, which can be achieved with a parametrized script. The steps are repetitive, and it seems to me that this could be achieved with some bash/python parametrized with
Seems like
Depending on how you are using these logs later down the road, seems like
I am not sure I understand the DVC's responsibility here - this sounds to me like a git remote with restricted
If you use
-
@pared
First, I need to create a git repo before I can use dvc, right?
Second, I need to copy files into that git repo so that I can do
Third, if I want to save space, I need to remove the files in that git repo, only leave
I think dvc's idea is great, but in many production environments data behaves like a stream, flowing into the data lake or data warehouse. If we want to use dvc to replace our current commercial data management tool, it has many inconveniences because it needs a git repo. But if dvc could do the above out of the box, without a git repo and without self-written scripts, I think it could be very competitive with commercial data management tools.
-
Ah, got it, so the problem here is data of a big size, where copying it between different machines does not make sense.
From existing features - external outputs and dependencies might be helpful here. In this case locally you only have a git repo and the
True, in aforementioned
True.
Please ping us if you have considered using the external use case - it seems it might be helpful here.
-
@allenyllee @pared mentioned the
-
@shcheklein Sorry, I haven't tested it yet. But I saw this question in the dvc forum: I think his problem is similar to ours, and I think what I proposed can solve his problem too.
-
@allenyllee yep, and I'm trying to understand what exactly is missing and how it differs from the proposal I have in that thread. It would be really helpful if you could try it and let us know if something is missing.
-
Background
We have a lot of daily generated log files, and we want to use dvc to track them.
Current Method
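If we want to use dvc to track our daily logs, for now we have to:
1. Create a git repo and run `dvc init`.
2. Copy the log files into that git repo.
3. `dvc add` those files and `dvc commit` to generate the `.dvc` file, and then `dvc push` to transfer the files to the remote.
4. `git commit` the generated `.dvc` file, and `git tag` to add a timestamp (or version).
5. Remove the files to save space, leaving only the `.dvc` files.

When new daily logs come in, we need to repeat steps 2-5 to track them.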
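
The repeated add / push / commit / tag round can be scripted today, which is roughly what a "parametrized script" would look like. Below is a minimal sketch, not an official DVC workflow: the function names, the tag format, and the commit message are hypothetical, and it assumes the log files are already inside an initialized git + DVC repo. It leaves out `dvc commit`, since `dvc add` already writes the `.dvc` file and caches the data.

```python
import posixpath
import shlex
import subprocess
from typing import List, Optional, Sequence


def tracking_commands(
    log_paths: Sequence[str], tag: str, remote: Optional[str] = None
) -> List[List[str]]:
    """Build the commands for one tracking round: add, push, commit, tag."""
    # `dvc add <path>` writes `<path>.dvc` and updates a .gitignore next to it.
    dvc_files = [f"{path}.dvc" for path in log_paths]
    ignore_files = sorted(
        {posixpath.join(posixpath.dirname(path), ".gitignore") for path in log_paths}
    )
    # `-r` selects a named DVC remote; without it the default remote is used.
    push = ["dvc", "push"] + (["-r", remote] if remote else [])
    return [
        ["dvc", "add", *log_paths],
        push,
        ["git", "add", *dvc_files, *ignore_files],
        ["git", "commit", "-m", f"Track logs for {tag}"],  # hypothetical message
        ["git", "tag", tag],
    ]


def run(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(list(argv), check=True)


if __name__ == "__main__":
    # Dry run: only prints the commands for one day's log file.
    run(tracking_commands(["logs/2022-11-30.log"], tag="logs-2022-11-30"))
```

Calling `run(..., dry_run=False)` from a daily cron job would execute the round for real. Deleting the local copies afterwards (the space-saving step) is left out of the sketch on purpose.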
When someone needs to analyse the log files, they need to:
Clone the git repo, `git checkout` a tagged version, and `dvc checkout` to download the files locally.

Proposed Method
1. A command (`dvc init --bare --remote` or a Python API) to create a "git tracked bare dvc repo" on the remote machine.
2. A command (`dvc push --transfer --remote` or a Python API) to transfer the daily logs directly to the remote. This command has a `--tag` option and performs steps 2-5 above on the remote machine.
3. Clone the "git tracked bare dvc repo", which contains only `.dvc` files, `git checkout` a tagged version, and `dvc checkout` to download the files.

Further, because the "git tracked bare dvc repo" should only be modified by the data owner, others cannot push their code to its remote. Instead, they create a new git repo and add the "git tracked bare dvc repo" as another git remote. In the git graph they then see two parallel lines: one for our data repo, one for their code repo.
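
The "second git remote" part of this needs nothing new from DVC; it can be approximated with existing commands. The sketch below is only an illustration under several assumptions: the function name, the remote name `data`, and the example URL are hypothetical, and the developer's code repo is assumed to be DVC-initialized with a DVC remote that points at the same storage as the data repo. It copies one `.dvc` file out of a tagged data commit with `git checkout <tag> -- <path>`, and uses `dvc pull` rather than `dvc checkout` because a fresh repo has an empty cache.

```python
from typing import List


def consume_commands(
    data_repo_url: str, tag: str, dvc_file: str, remote_name: str = "data"
) -> List[List[str]]:
    """Build the commands a developer would run inside their own code repo."""
    return [
        # Register the data repo as an additional (read-only in practice) remote.
        ["git", "remote", "add", remote_name, data_repo_url],
        # Fetch its history and tags; the code repo's own branches are untouched.
        ["git", "fetch", "--tags", remote_name],
        # Copy a single .dvc file from the tagged data commit into the work tree.
        ["git", "checkout", tag, "--", dvc_file],
        # Download the data that the .dvc file points to from DVC storage.
        ["dvc", "pull", dvc_file],
    ]


if __name__ == "__main__":
    # Hypothetical URL, tag, and path, printed for illustration only.
    for argv in consume_commands(
        "git@example.com:team/log-data.git", "logs-2022-11-30", "logs/2022-11-30.log.dvc"
    ):
        print(" ".join(argv))
```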
They can cherry-pick a commit from the data repo, move the `.dvc` file into another folder, and then run `dvc checkout`. The file is pulled from our data repo and downloaded into their folder; they can then start writing their code and commit to their own git remote.

Sum up
We can treat the "git tracked bare dvc repo" as a combination of a `git bare repo` and a `dvc cache`: a whole structure used only for tracking data blobs. It can be seen as a regular git remote or imported as a `git submodule`, but it can only be modified by the data owner. Developers just include it, pull the data, run their experiments, and push to their own repo without touching the data repo.

Also, if you don't use git, you can still treat it as a regular dvc cache remote. But with git, you have the full power of git!
Advanced
If you have multiple data sources and want them to share a single data repo, you can pass a `--source` option in proposed step 2; the command then creates a git branch named after the source. This new branch is parallel to the other source branches (it shares no common commit with them). From the developer's point of view, the data repo holds many parallel branches, and they just pick a branch (a data source) to merge into their local working branch.

In case the data owner needs to merge two data sources into one, it can be as easy as using `git merge` in the data repo to merge the two parallel source branches into one!
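
Branches with no common commit are what git calls orphan branches, so the per-source layout can be approximated with plain git today. The sketch below is an assumption-laden illustration, not the proposed `--source` option itself: the function names, branch names, and commit message are hypothetical. One detail worth knowing is that since Git 2.9 a plain `git merge` refuses to join histories that share no commit unless `--allow-unrelated-histories` is passed; `git switch --orphan` needs Git 2.23 or newer.

```python
from typing import List, Sequence


def new_source_commands(source: str, dvc_files: Sequence[str]) -> List[List[str]]:
    """Start a parallel branch for one data source and commit its .dvc files."""
    return [
        # Creates a branch with no parent commit and an empty index.
        ["git", "switch", "--orphan", source],
        ["git", "add", *dvc_files],
        ["git", "commit", "-m", f"Initial data for source {source}"],
    ]


def merge_sources_commands(target: str, other: str) -> List[List[str]]:
    """Merge one source branch into another despite their unrelated histories."""
    return [
        ["git", "switch", target],
        ["git", "merge", "--allow-unrelated-histories", other],
    ]


if __name__ == "__main__":
    # Hypothetical source names, printed for illustration only.
    for argv in new_source_commands("sensor-a", ["logs/2022-11-30.log.dvc"]):
        print(" ".join(argv))
    for argv in merge_sources_commands("sensor-a", "sensor-b"):
        print(" ".join(argv))
```

If both sources use the same file paths, the merge can still conflict on the `.dvc` files, so keeping each source under its own directory is probably the safer layout.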