Releases: tensorwerk/hangar-py
v0.5.2 Release
Release v0.5.1
v0.5.1 (2020-04-05)
Bug Fixes
v0.5.0 Release
v0.5.0 (2020-04-04)
Improvements
- Python 3.8 is now fully supported. (#193) @rlizzo
- Major backend overhaul which defines column layouts and data types in the same interchangeable / extensible manner as storage backends. This will allow rapid development of new layouts and data type support as new use cases are discovered by the community. (#184) @rlizzo
- Column and backend classes are now fully serializable (pickleable) for read-only checkouts. (#180) @rlizzo
- Modularized internal structure of API classes to easily allow new column layouts / data types to be added in the future. (#180) @rlizzo
- Improved type / value checking of manual specification for column `backend` and `backend_options`. (#180) @rlizzo
- Standardized column data access API to follow python standard library `dict` methods API. (#180) @rlizzo
- Memory usage of arrayset checkouts has been reduced by ~70% by using C-structs for allocating sample record locating info. (#179) @rlizzo
- Read times from the `HDF5_00` and `HDF5_01` backends have been reduced by 33-38% (or more for arraysets with many samples) by eliminating redundant computation of chunked storage B-Tree. (#179) @rlizzo
- Commit times and checkout times have been reduced by 11-18% by optimizing record parsing and memory allocation. (#179) @rlizzo
New Features
- Added `str` type column with the same behavior as the `ndarray` column (supporting both single-level and nested layouts) to replace the functionality of the removed `metadata` container. (#184) @rlizzo
- New backend based on `LMDB` has been added (specifier of `lmdb_30`). (#184) @rlizzo
- Added `.diff()` method to `Repository` class to enable diffing changes between any pair of commits / branches without needing to open the diff base in a checkout. (#183) @rlizzo
- New CLI command `hangar diff` which reports a summary view of changes made between any pair of commits / branches. (#183) @rlizzo
- Added `.log()` method to `Checkout` objects so graphical commit graph or machine readable commit details / DAG can be queried when operating on a particular commit. (#183) @rlizzo
- "string" type columns now supported alongside "ndarray" column type. (#180) @rlizzo
- New "column" API, which replaces "arrayset" name. (#180) @rlizzo
- Arraysets can now contain "nested subsamples" under a common sample key. (#179) @rlizzo
- New API to add and remove samples from an arrayset. (#179) @rlizzo
- Added `repo.size_nbytes` and `repo.size_human` to report disk usage of a repository on disk. (#174) @rlizzo
- Added method to traverse the entire repository history and cryptographically verify integrity. (#173) @rlizzo
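The history-verification feature (#173) can be pictured as walking a hash chain from a head commit back to the root. The following is a minimal sketch only, not Hangar's implementation: the record layout and the names `commit_digest` and `verify_history` are hypothetical, SHA-256 is an assumed hash function, and the history is simplified to a single-parent chain (a real commit DAG also has merge commits).

```python
import hashlib


def commit_digest(parent: str, payload: bytes) -> str:
    """Hypothetical commit hash: digest of the parent digest plus the payload."""
    return hashlib.sha256(parent.encode() + payload).hexdigest()


def verify_history(commits: dict, head: str) -> bool:
    """Walk from `head` to the root, recomputing every digest on the way.

    `commits` maps digest -> (parent_digest, payload); the root commit
    uses an empty string as its parent.
    """
    current = head
    while current:
        if current not in commits:
            return False  # history references a commit that is missing
        parent, payload = commits[current]
        if commit_digest(parent, payload) != current:
            return False  # stored content no longer matches its digest
        current = parent
    return True


# Build a two-commit history (assumed example data) and verify it.
root = commit_digest("", b"initial")
head = commit_digest(root, b"add samples")
commits = {root: ("", b"initial"), head: (root, b"add samples")}
ok = verify_history(commits, head)
```

Because each digest covers its parent's digest, altering any earlier record makes verification fail somewhere along the walk.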
Changes
- Argument syntax of `__getitem__()` and `get()` methods of `ReaderCheckout` and `WriterCheckout` classes. The new format supports handling arbitrary arguments specific to retrieval of data from any column type. (#183) @rlizzo
Removed
- `metadata` container for `str` typed data has been completely removed. It is replaced by a highly extensible and much more user-friendly `str` typed column. (#184) @rlizzo
- `__setitem__()` method in `WriterCheckout` objects. Writing data to columns via a checkout object is no longer supported. (#183) @rlizzo
Bug Fixes
- Backend data stores no longer use file symlinks, improving compatibility with some types of file systems. (#171) @rlizzo
- All arrayset types ("flat" and "nested subsamples") and backend readers can now be pickled -- for parallel processing -- in a read-only checkout. (#179) @rlizzo
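One common way to make an object that holds an open file handle pickleable for parallel processing is to drop the handle when pickling and reopen it when unpickling. The sketch below illustrates that general technique with a hypothetical `FileReader` class; it is not Hangar's backend reader code, and the file name and contents are made up for the example.

```python
import os
import pickle
import tempfile


class FileReader:
    """Hypothetical read-only reader that keeps an open file handle."""

    def __init__(self, path):
        self.path = path
        self._fh = open(path, "rb")

    def read(self):
        self._fh.seek(0)
        return self._fh.read()

    def close(self):
        self._fh.close()

    def __getstate__(self):
        # File handles cannot be pickled, so leave the handle out of the state.
        state = self.__dict__.copy()
        del state["_fh"]
        return state

    def __setstate__(self, state):
        # Restore plain attributes, then reopen the file in the new process.
        self.__dict__.update(state)
        self._fh = open(self.path, "rb")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"sample")
    reader = FileReader(path)
    clone = pickle.loads(pickle.dumps(reader))  # what a worker process would receive
    roundtrip = clone.read()
    reader.close()
    clone.close()
```

This only works safely for read-only access, which matches the restriction to read-only checkouts stated above.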
Breaking changes
- New backend record serialization format is incompatible with repositories written in version 0.4 or earlier.
- New arrayset API is incompatible with Hangar API in version 0.4 or earlier.
v0.5.0 Pre-Release 2
Pre-Release for v0.5.0. Full Changelog To Follow.
v0.5.0 Pre-Release
Pre-Release for v0.5.0. Full Changelog To Follow.
Release v0.4.0
Release Notes
New Features
- Added ability to delete branch names/pointers from a local repository via both API and CLI. #128 @rlizzo
- Added `local` keyword arg to arrayset key/value iterators to return only locally available samples #131 @rlizzo
- Ability to change the backend storage format and options applied to an `arrayset` after initialization. #133 @rlizzo
- Added blosc compression to HDF5 backend by default on PyPi installations. #146 @rlizzo
- Added Benchmarking Suite to Test for Performance Regressions in PRs. #155 @rlizzo
- Added new backend optimized to increase speeds for fixed size arrayset access. #160 @rlizzo
Improvements
- Removed `msgpack` and `pyyaml` dependencies. Cleaned up and improved remote client/server code. #130 @rlizzo
- Multiprocess Torch DataLoaders allowed on Linux and MacOS. #144 @rlizzo
- Added CLI options `commit`, `checkout`, `arrayset create`, & `arrayset remove`. #150 @rlizzo
- Plugin system revamp. #134 @hhsecond
- Documentation Improvements and Typo-Fixes. #156 @alessiamarcolini
- Removed implicit removal of arrayset schema from checkout if every sample was removed from arrayset. This could potentially result in dangling accessors which may or may not self-destruct (as expected) in certain edge-cases. #159 @rlizzo
- Added type codes to hash digests so that calculation function can be updated in the future without breaking repos written in previous Hangar versions. #165 @rlizzo
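The idea of tagging hash digests with a type code (#165) can be sketched as follows. This is an illustration under stated assumptions, not Hangar's record format: the `code=digest` layout, the registry `HASHERS`, and the functions `tagged_digest` and `verify` are all hypothetical, and the two hash functions were chosen only for the example.

```python
import hashlib

# Hypothetical registry mapping a type code to the digest function it names.
HASHERS = {
    "0": lambda data: hashlib.blake2b(data, digest_size=20).hexdigest(),
    "1": lambda data: hashlib.sha256(data).hexdigest(),
}
CURRENT_CODE = "0"  # code used for newly written digests (assumed)


def tagged_digest(data: bytes, code: str = CURRENT_CODE) -> str:
    """Return the digest prefixed with the code of the function that made it."""
    return f"{code}={HASHERS[code](data)}"


def verify(data: bytes, tagged: str) -> bool:
    """Recompute the digest with whichever function the stored code names."""
    code, _, _ = tagged.partition("=")
    return tagged_digest(data, code) == tagged


new_style = tagged_digest(b"abc")
other_style = tagged_digest(b"abc", code="1")
```

Since every stored digest names its own function, the default in `CURRENT_CODE` can change later while digests written under another code still verify.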
Bug Fixes
- Programmatic access to repository log contents now returns branch heads alongside other log info. #125 @rlizzo
- Fixed minor bug in types of values allowed for `Arrayset` names vs `Sample` names. #151 @rlizzo
- Fixed issue where using checkout object to access a sample in multiple arraysets would try to create a `namedtuple` instance with invalid field names. Now incompatible field names are automatically renamed with their positional index. #161 @rlizzo
- Explicitly raise error if `commit` argument is set while checking out a repository with `write=True`. #166 @rlizzo
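The positional renaming described in the `namedtuple` fix (#161) matches the behavior of the standard library's `rename=True` option. The snippet below shows that option in isolation; whether Hangar uses exactly this call is an assumption, and the type name `ArraysetData` and the arrayset names are hypothetical.

```python
from collections import namedtuple

# Hypothetical arrayset names: "2d-images" is not a valid identifier
# and "class" is a Python keyword, so neither can be a field name.
names = ["images", "2d-images", "class", "labels"]

# rename=True replaces each invalid field name with "_<positional index>".
ArraysetData = namedtuple("ArraysetData", names, rename=True)

sample = ArraysetData("img", "img2d", 3, "cat")
fields = ArraysetData._fields  # ('images', '_1', '_2', 'labels')
```

Valid names keep attribute access (`sample.images`), while renamed fields are reached by index or by their `_1`, `_2` aliases.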
Breaking changes
- New commit reference serialization format is incompatible with repositories written in version 0.3.0 or earlier.
v0.4.0b0 Beta Pre-Release
Merge pull request #145 from rlizzo/version-0-4-0b0 Version 0.4.0b0
v0.3.0 Release
New Features
- API addition allowing reading and writing arrayset data from a checkout object directly. (#115) @rlizzo
- Data importer, exporters, and viewers via CLI for common file formats. Includes plugin system for easy extensibility in the future. (#103) (@rlizzo, @hhsecond)
Improvements
- Added tutorial on working with remote data. (#113) @rlizzo
- Added Tutorial on Tensorflow and PyTorch Dataloaders. (#117) @hhsecond
- Large performance improvement to diff/merge algorithm (~30x previous). (#112) @rlizzo
- New commit hash algorithm which is much more reproducible in the long term. (#120) @rlizzo
- HDF5 backend updated to increase speed of reading/writing variable sized dataset compressed chunks (#120) @rlizzo
Bug Fixes
- Fixed ML Dataloaders errors for a number of edge cases surrounding partial-remote data and non-common keys. (#110) (@hhsecond, @rlizzo)
Breaking changes
- New commit hash algorithm is incompatible with repositories written in version 0.2.0 or earlier
v0.2.0 Release
See changelog for full details
New Features
- Numpy memory-mapped array file backend added.
- Remote server data backend added.
- Selection heuristics to determine appropriate backend from arrayset schema.
- Partial remote clones and fetch operations now fully supported.
- CLI has been placed under test coverage, added interface usage to docs.
- TensorFlow and PyTorch Machine Learning Dataloader Methods (Experimental Release).
Improvements
- Record format versioning and standardization so as not to break backwards compatibility in the future.
- Backend addition and update developer protocols and documentation.
- Read-only checkout arrayset sample `get` methods now are multithread and multiprocess safe.
- Read-only checkout metadata sample `get` methods are thread safe if used within a context manager.
- Samples can be assigned integer names in addition to `string` names.
- Forgetting to close a `write-enabled` checkout before terminating the python process will close the checkout automatically for many situations.
- Repository software version compatibility methods added to ensure upgrade paths in the future.
- Many tests added (including support for Mac OSX on Travis-CI).
Bug Fixes
- Diff results for fast forward merges now returns sensible results.
- Many type annotations added, and developer documentation improved.
Breaking changes
- Renamed all references to `datasets` in the API / world-view to `arraysets`.
- These are backwards incompatible changes. For all versions > 0.2, repository upgrade utilities will be provided if breaking changes occur.
v0.1.1 Release
Fix for the readme, which had typos and was pushed to PyPI.