Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic versioning #238

Merged
merged 416 commits into from
Sep 8, 2021
Merged

Add basic versioning #238

merged 416 commits into from
Sep 8, 2021

Conversation

zhiltsov-max
Copy link
Contributor

@zhiltsov-max zhiltsov-max commented May 11, 2021

Summary

Related: #131
Fixed: #328
Closes #402 (not applicable anymore)

General:

  • Added an option to record dataset operations and navigate over them
  • ​Added working tree and revision cache
  • Added data deduplication between revisions

Glossary:

  • Added new versioning concepts:
    ​- revision - corresponds to a Git commit
    ​- working / head / revision tree - a project build tree and plugins at a specified revision
    ​- object - a revision or a dataset
  • Added new dataset path concepts:
    ​- source / dataset / revision / project paths - a path to a dataset in a special format
    ​- local revpath - a path to a dataset within the current project
    ​- full revpath - an absolute path to a dataset
  • Added new dataset building concepts:
    ​- data source - basically, an URL + dataset format name
    ​- stage - a modification of a data source. A transformation, filter or something else.
    ​- build tree - a directed graph (tree) with leaf nodes at data sources and a single root node called "project"
    ​- build target - a data source or a stage
    ​- pipeline - a subgraph of a build target

CLI:

  • Added local revpath and full revpath concepts in CLI
  • Added source add, commit, checkout, log CLI commands
  • Removed import, project merge CLI commands
  • diff and ediff are joined into a single diff command
  • diff, merge, explain now accept dataset revpaths

API:

  • Project is completely rewritten and has a new interface.
  • Project file layout is changed
    ​- Added v1->v2 migration. It can be done with datum project migrate

How to test

Check this test file for examples: tests/cli/test_project.py

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@zhiltsov-max
Copy link
Contributor Author

zhiltsov-max commented May 26, 2021

@kirill-sizov, @yasakova-anastasia, please test the PR. To test, do:

pip install 'git+https://github.com/openvinotoolkit/datumaro@zm/vcs-simple-exp' [--force-reinstall]

datum create [-o dir/]
datum add <path> -f <format>
datum commit -m "pretty commit"
datum filter/transform/export [target_name] # targets are [source][.stage_name] or project
datum commit
datum log
datum checkout HEAD~2 # accepts commit ids like git
datum checkout master
# rm -rf .datumaro/cache
datum project info/stats HEAD~1:source1.transform1 # accepts [revision]:[target][.stage_name]

Look at tests/cli/test_project.py for examples.

What is done:

  • Build tree and recording of operations
  • Commits & history navigation, viewing
  • Independent use of arbitrary commits, few stages simultaneously and the working dir
  • Caching, retrieval of stages from the cache, re-downloading sources and reproducing sources from scratch
  • Migration from v1 projects

What is not done:

  • Pretty error checking
  • Full update of CLI
  • Model interaction
  • Remotes
  • Remote sources
  • Status
  • Comparison of different projects and revisions

@sizov-kirill
Copy link

@zhiltsov-max
I tested the PR from the user's point of view. In general, everything works well.
But I am having problems in some scenarios:

  1. Perform a sequence of transformations on a dataset
    My actions:
    datum create
    datum add -f yolo ../yolo_dataset
    datum commit -m "add yolo dataset"
    datum transform -t remap_labels -- -l label_0:apple
    datum filter -e '/item[subset="train"]'
    # ...
    # PermissionError: [Errno 13] Permission denied: <path/to/datumaro/cache>

My expectations: successful filtering
Current behavior: Permission denied error. It's a strange error because I'm working from user directory and all works well, if I commit changes before datum filter.

  1. Filter dummy dataset
    In this PR I could not perform simple filtering for dummy dataset from tests/assets/voc_dataset
    datum create 
    datum add -f voc tests/assets/voc_dataset
    datum commit -m "add dataset"
    datum filter -e '/item[subset="train"]'
    # ...
    # AttributeError: 'DatasetItemStorageDatasetView' object has no attribute 'select'

My expectations: successful filtering
Current behavior: AttributeError

3.Commit with no changes
As I know, git commit not allowed make commit without changes.
And I think in Datumaro we should do the same, because a user may not pay attention to the unsuccessful change of the dataset, make a commit and think that he has committed the changes, but in fact there are no changes.

Also, I have a question: how can I undo changes that I have not yet committed (in git I use git checkout . for this) ?

@yasakova-anastasia
Copy link

I tested this PR and got the following notes:

  1. git commit prints all the committed files, probably it would be nice to display the files, for example, as in datum checkout.
  2. datum checkout uses revisions. Is the revision a commit? Do I understand correctly if HEAD~N, N is the commit number? If there are a lot of commits, how do I know where I want to go? Also I think datum checkout should print what data we are on.
  3. If you use datum checkout on uncommitted data, there will be an unhandled exception (as I understand it, it's just not done yet)

@zhiltsov-max
Copy link
Contributor Author

@kirill-sizov, @yasakova-anastasia, thanks for the review and questions.

1 and 2 I'll check, as well as performance.

  1. You're right about these things. It is to be done.

@yasakova-anastasia,

  1. Yes, revisions == git commits. HEAD~N is the same as in Git: https://git-scm.com/docs/revisions#Documentation/revisions.txt-emltrevgtltngtemegemHEADmaster3em - the N-th parent commit of the current HEAD reference. Basically, you do datum log and find the desired revision by commit comments.

@zhiltsov-max
Copy link
Contributor Author

zhiltsov-max commented May 31, 2021

TBD:

  • Fix PermissionError: [Errno 13] Permission denied: <path/to/datumaro/cache>
  • [ ] Fix AttributeError: 'DatasetItemStorageDatasetView' object has no attribute 'select' To be addressed in Dataset.filter fails on merged datasets #259
  • Improve dataset import performance (~10 minutes for PASCAL VOC)
    • Improve copying performance (about 3m -> 20s)
    • Improve hashing performance (about 5m)
    • Allow to disable downloading
  • Add empty commit check and an option to ignore it
  • An option to retrieve specific files / sources (aka git checkout [rev] -- <paths>)
  • Display what is going to be committed in commit
  • Update models, do not provide versioning for them for now
  • Update diff, ediff and merge to support multiple revisions as arguments
  • Add pretty status command
  • Update old and write new documentation

@zhiltsov-max
Copy link
Contributor Author

zhiltsov-max commented Jun 4, 2021

Blocked by in-place saving problems: #257, #260

  • Filter does not provide info about removed items
  • Formats can't maintain input layout (nesting, multiple sources, names for subformats) in complex cases
  • Can't import sources with exact paths as whole datasets

Found a problem with Pytest and logging, which can lead to an infinite loop in testing pytest-dev/pytest#5502. Can be resolved with --capture=no.

"Project migration cannot be done inplace.")

old_aux_dir = osp.join(src_dir, '.datumaro')
old_config = Config.parse(osp.join(old_aux_dir, 'config.yaml'))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the old config being deleted anywhere. Won't that cause the new project code to still complain about the old config after migration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the meaning of the operation. Now, it creates a new project in the destination directory, the old project is unaffected.


old_config_path = osp.join(found_path, 'config.yaml')
if osp.isfile(old_config_path):
if Config.parse(old_config_path).format_version != 2:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check shouldn't be here. New code can't handle old project files, so if one is found, the user should be told to run migration regardless of its content (the migration code will then tell them that 2 is not a valid version in old config files).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I told this before - the difference in the config file names is occasional. It is very simple to forget about this when changing the code.

@zhiltsov-max zhiltsov-max merged commit e3e5c0f into develop Sep 8, 2021
@zhiltsov-max zhiltsov-max deleted the zm/vcs-simple-exp branch December 1, 2021 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants