gcs/CHANGES.txt

1.6.10 - 2018-09-20

  1. Add an option to specify max bytes rewritten per rewrite request when
     `fs.gs.copy.with.rewrite.enable` is set to `true`:

       fs.gs.rewrite.max.bytes.per.call (default: 536870912)

     Even though GCS does not require this parameter for rewrite requests,
     rewrite requests are flaky without it.


1.6.9 - 2018-09-04

  1. Change default values for GCS batch/directory operations properties to
     improve performance:

       fs.gs.copy.max.requests.per.batch (default: 1 -> 15)
       fs.gs.copy.batch.threads (default: 50 -> 15)
       fs.gs.max.requests.per.batch (default: 25 -> 15)
       fs.gs.batch.threads (default: 25 -> 15)

  2. Update all dependencies to latest versions.


1.6.8 - 2018-08-10

  1. Support parallel execution of GCS batch requests.

     Number of threads to execute batch requests configurable via property:

       fs.gs.batch.threads (default: 0)

     If `fs.gs.batch.threads` value is set to 0 then batch requests will be
     executed sequentially by caller thread.

  2. Do not send batch request when performing operations (rename, delete, copy)
     on 1 object.

  3. Add configuration properties to control batching of copy operations
     separately from other operations:

       fs.gs.copy.max.requests.per.batch (default: 30)
       fs.gs.copy.batch.threads (default: 0)

  4. Fix `RejectedExecutionException` during parallel execution of GCS batch
     requests.

  5. Change default values for GCS batch/directory operations properties:

       fs.gs.copy.with.rewrite.enable (default: false -> true)
       fs.gs.copy.max.requests.per.batch (default: 30 -> 1)
       fs.gs.copy.batch.threads (default: 0 -> 50)
       fs.gs.max.requests.per.batch (default: 30 -> 25)
       fs.gs.batch.threads (default: 0 -> 25)


1.6.7 - 2018-06-12

  1. Remove Hadoop3 support.

  2. Add interface through which user can directly provide the access token.

  3. Update Hadoop dependencies to 2.8.4 version.

  4. Add more retries and error handling in GoogleCloudStorageReadChannel,
     to make it more resilient to network errors; also add a property to allow users
     to specify number of retries on low level GCS HTTP requests in case of server
     errors and I/O errors.

  5. Add properties to allow users to specify connect timeout and read timeout on
     low level GCS HTTP requests.

  6. Include prefix/directory objects metadata into `storage.objects.list`
     requests response to improve performance (i.e. set
     `includeTrailingDelimiter` parameter for `storage.objects.list` GCS
     requests to `true`).


1.6.6 - 2018-05-08

  1. Support Hadoop 3.

  2. Change Maven project structure to be better compatible with IDEs.

  3. Fix thread leaks that were occurring when YARN log aggregation uploaded
     logs to GCS.


1.6.5 - 2018-04-12

  1. Add support for using Cloud Storage Rewrite requests for copy operation:

       fs.gs.copy.with.rewrite.enable (default: false)

     This allows to copy files between different locations and storage classes.

  2. Update all dependencies to latest versions.

  3. Decrease default value for max requests per batch from 1,000 to 30.

  4. Make max requests per batch value configurable with property:

       fs.gs.max.requests.per.batch (default: 30)


1.6.4 - 2018-03-19

  1. Fixed an issue where JSON auth files containing user auth e.g.
     application_default_credentials.json does not work with
     google.cloud.auth.service.account.json.keyfile

  2. Honor GOOGLE_APPLICATION_DEFAULT_CREDENTIALS environment variable. For
     Google Application Default Credentials (but not other defaults).

  3. Make fs.gs.project.id optional. It is still required for listing buckets,
     creating buckets, and entire BigQuery connector.

  4. Disable GCS Metadata Cache by default (e.g. set default value of
     `fs.gs.metadata.cache.enable` property to `false`).

  5. Support GCS Requester Pays feature that could be configured with new
     properties:

       fs.gs.requester.pays.mode (default=DISABLED)
       fs.gs.requester.pays.project.id (no default value)
       fs.gs.requester.pays.buckets (no default value)

  6. Add support for specifying marker files pattern that should be copied
     last during folder rename operation. Pattern is configured with property:

       fs.gs.marker.file.pattern


1.6.3 - 2018-01-25

  1. Use new GCS batch requests endpoint.

  6. Disable GCS Metadata Cache by default (e.g. set default value of
     `fs.gs.metadata.cache.enable` property to `false`).


1.6.2 - 2017-11-21

  1. Wire HTTP transport settings into Credential logic.


1.6.1 - 2017-04-14

  1. Added a polling loop when determining if a createEmptyObjects error can
     safely be ignored and expanded the cases in which we will attempt to
     determine if an empty object already exists.

     Previously, if a rate limiting exception was encountered while creating
     empty objects the connector would issue a single get request for that
     object. If the object exists and is zero length we would consider the
     createEmptyObjects call successful and suppress the rate limit exception.

     The new implementation will poll for the existence of the object, up to
     a user-configurable maximum, and will poll when either a rate limiting
     error occurs or when a 500-level error occurs. The maximum can be
     configured by the following setting:

     fs.gs.max.wait.for.empty.object.creation.ms

     Any positive value for this setting will be interpreted to mean "poll
     for up to this many milliseconds before making a final determination".
     The default value will cause a maximum wait of 3 seconds. Polling can
     be disabled by setting this key to 0.


1.6.0 - 2016-12-16

  1. Added new PerformanceCachingGoogleCloudStorage; unlike the existing
     CacheSupplementedGoogleCloudStorage which only serves as an advisory
     cache for enforcement of list consistency, the new optional caching
     layer is able to serving certain metadata and listing requests purely
     out of a short-lived in-memory cache to enhance performance of
     some workloads. By default this feature is disabled, and can be
     controlled with the config settings:

       fs.gs.performance.cache.enable=true (default=false)
       fs.gs.performance.cache.list.caching.enable=true (default=false)

     The first option enables the cache to serve getFileStatus requests,
     while the second option additionally enables serving listStatus. The
     duration of cache entries can be controlled with:

       fs.gs.performance.cache.max.entry.age.ms (default=3000)

     It is not recommended to always run with this feature enabled; it should
     be used specifically to address cases where frameworks perform redundant
     sequential list/stat operations in a non-distributed manner, and on
     datasets which are not frequently changing. It is additionally advised
     to validate data integrity separately whenever using this feature. There
     is no cooperative cache invalidation between different processes when
     using this feature, so concurrent mutations to a location from multiple
     clients *will* produce inconsistent/stale results if this feature is
     enabled.


1.5.5 - 2016-11-04

  1. Minor refactoring of logic in CacheSupplementedGoogleCloudStorage to
     extract a reusable ForwardingGoogleCloudStorage that can be used
     for other GCS-delegating implementations.


1.5.4 - 2016-10-05

  1. Fixed a bug in GoogleCloudStorageReadChannel where multiple in-place
     seeks without any "read" in between could throw a range exception.
  2. Fixed plumbing of GoogleCloudStorageReadOptions into the in-memory
     test helpers to improve unittest alignment with integration tests.
  3. Fixed handling of parent timestamp updating when a full URI is provided
     as a path for which timestamps should be updated. This allows specifying
     gs://bucket/p1/p2/object instead of simply p1/p2/object.
  4. Updated Hadoop 2 dependency to 2.7.2.
  5. Imported full set of Hadoop FileSystem contract tests, except for the
     currently-unsupported Concat and Append tests. Minor changes to pass
     all the tests:

     -available() now throws ClosedChannelException if called after close()
      instead of returning 0.
     -read() and seek() throw ClosedChannelException if called after close()
      instead of NullPointerException.
     -Out-of-bounds seeks now throw EOFException instead of
      IllegalArgumentException.
     -Blocked overwrites throw org.apache.hadoop.fs.FileAlreadyExistsException
      instead of generic IOException.
     -Deleting root '/' recursively or non-recursively when empty will no
      longer delete the underlying GCS bucket by default. To re-enable
      deletion of GCS buckets when recursively deleting root, set
      'fs.gs.bucket.delete.enable' = true.


1.5.3 - 2016-09-21

  1. Misc updates in credential acquisition on GCE.


1.5.2 - 2016-08-23

  1. Updated AbstractGoogleAsyncWriteChannel to always set the
     X-Goog-Upload-Desired-Chunk-Granularity header independently from the
     deprecated X-Goog-Upload-Max-Raw-Size; in general this improves
     performance of large uploads.


1.5.1 - 2016-07-29

  1. Optimized InMemoryDirectoryListCache to use a TreeMap internally instead
     of a HashMap, since most of its usage is "listing by prefix"; aside
     from some benefits for normal list-supplementation, glob listings
     with a large number of subdirectories should be orders of magnitude
     faster.


1.5.0 - 2016-07-15

  1. Changed the single-argument constructor
     GoogleCloudStorageFileSystem(GoogleCloudStorage) to inherit the inner
     GoogleCloudStorageOptions from the passed-in GoogleCloudStorage rather
     than simply falling back to default GoogleCloudStorageFileSystemOptions.
  2. Changed RetryHttpInitializer to treat HTTP code 429 as a retriable
     error with exponential backoff instead of erroring out.
  3. Added a new output stream type which can be used by setting:

         fs.gs.outputstream.type=SYNCABLE_COMPOSITE

     With the SYNCABLE_COMPOSITE type, output streams obtained with a call
     to FileSystem.create(Path) will have (limited) support for Hadoop's
     Syncable interface, where calling hsync() will commit the data written
     so far into persistent storage without closing the output stream; once
     a writer calls hsync(), readers will be able to discover and read the
     contents of the file written up to the last successful hsync() call
     immediately. This feature is implemented using "composite objects" in
     Google Cloud Storage, and thus has a current hard limit of 1023 calls
     to hsync() over the lifetime of the stream after which subsequent hsync()
     calls will throw a CompositeLimitExceededException, but will still
     allow writing additional data and closing the stream without losing data
     despite a failed hsync().
  4. Added several optimizations and options controlling the behavior of
     the GoogleHadoopFSInputStream:

     -Removed internal prepopulated ByteBuffer inside GoogleHadoopFSInputStream
      in favor of non-blocking buffering in the lower layer channel; this
      improves performance for independent small reads without noticeable
      impact on large reads. Old behavior can be specified by setting

         fs.gs.inputstream.internalbuffer.enable=true (default=false)

     -Added option to save an extra metadata GET on first read of a channel
      if objects are known not to use the 'Content-Encoding' header. To
      use this optimization set:

         fs.gs.inputstream.support.content.encoding.enable=false (default=true)

     -Added option to save an extra metadata GET on FileSystem.open(Path) if
      objects are known to exist already, or the caller is resilient to delayed
      errors for nonexistent objects which occur at read time, rather than
      immediately on open(). To use this optimization set:

         fs.gs.inputstream.fast.fail.on.not.found.enable=false (default=true)

     -Added support for "in-place" small seeks by defining a byte limit where
      forward seeks smaller than the limit will be done by reading/discarding
      bytes rather than opening a brand-new channel. This is set to 8MB by
      default, which is approximately the point where the cost of opening a
      new channel is equal to the cost of reading/discarding the bytes in-place.
      Disable by setting the value to 0, or adjust to other thresholds:

         fs.gs.inputstream.inplace.seek.limit=0 (default=8388608)


1.4.5 - 2016-03-24

  1. Add support for paths that cannot be parsed by Java's URI.create method.
     This support is off by default, but can be enabled by setting Hadoop
     configuration key fs.gs.path.encoding to the string uri-path. The
     current behavior, and default value of fs.gs.path.encoding, is 'legacy'.
     The new path encoding scheme will become the default in a future release.
  2. VerificationAttributes are now exposed in GoogleCloudStorageItemInfo. The
     current support is limited to reading these attributes from what was
     computed by GCS server side. A future release will add support for
     specifying VerificationAttributes in CreateOptions when creating new
     objects.


1.4.4 - 2016-02-02

  1. Add support for JSON keyfiles via a new configuration key:
     google.cloud.auth.service.account.json.keyfile. This key should
     point to a file that is available locally to jobs to clusters.


1.4.3 - 2015-11-12

  1. Minor bug fixes and enhancements.


1.4.2 - 2015-09-15

  1. Added checking in GoogleCloudStorageImpl.createEmptyObject(s) to handle
     rateLimitExceeded (429) errors by fetching the fresh underlying info
     and ignoring the error if the object already exists with the intended
     metadata and size. This fixes an issue which mostly affects Spark:
     https://github.com/GoogleCloudPlatform/bigdata-interop/issues/10
  2. Added logging in GoogleCloudStorageReadChannel for high-level retries.
  3. Added support for configuring the permissions reported to the Hadoop
     FileSystem layer; the permissions are still fixed per FileSystem instance
     and aren't actually enforced, but can now be set with:

        fs.gs.reported.permissions [default = "700"]

    This allows working around some clients like Hive-related daemons and tools
    which pre-emptively check for certain assumptions about permissions.


1.4.1 - 2015-07-09

  1. Switched from the custom SeekableReadableByteChannel to
     Java 7's java.nio.channels.SeekableByteChannel.
  2. Removed the configurable but default-constrained 250GB upload limit;
     uploads can now exceed 250GB without needing to modify config settings.
  3. Added helper classes related to GCS retries.
  4. Added workaround support for read retries on objects with content-encoding
     set to gzip; such content encoding isn't generally correct to use since
     it means filesystem reported bytes will not match actual read bytes, but
     for cases which accept byte mismatches, the read channel can now manually
     seek to where it left off on retry rather than having a GZIPInputStream
     throw an exception for a malformed partial stream.
  5. Added an option for enabling "direct uploads" in
     GoogleCloudStorageWriteChannel which is not directly used by the Hadoop
     layer, but can be used by clients which directly access the lower
     GoogleCloudStorage layer.
  6. Added CreateBucketOptions to the GoogleCloudStorage interface so that
     clients using the low-level GoogleCloudStorage directly can create buckets
     with different locations and storage classes.
  7. Fixed https://github.com/GoogleCloudPlatform/bigdata-interop/issues/5 where
     stale cache entries caused stuck phantom directories if the directories
     were deleted using non-Hadoop-based GCS clients.
  8. Fixed a bug which prevented the Apache HTTP transport from working with
     Hadoop 2 when no proxy was set.
  9. Misc updates in library dependencies; google.api.version
     (com.google.http-client, com.google.api-client) updated from 1.19.0 to
     1.20.0, google-api-services-storage from v1-rev16-1.19.0 to
     v1-rev35-1.20.0, and google-api-services-bigquery from v2-rev171-1.19.0
     to v2-rev217-1.20.0, and Guava from 17.0 to 18.0.


1.4.0 - 2015-05-27

  1. The new inferImplicitDirectories option to GoogleCloudStorage tells
     it to infer the existence of a directory (such as foo) when that
     directory node does not exist in GCS but there are GCS files
     that start with that path (such as as foo/bar).  This allows
     the GCS connector to be used on read-only filesystems where
     those intermediate directory nodes can not be created by the
     connector. The value of this option can be controlled by the
     Hadoop boolean config option "fs.gs.implicit.dir.infer.enable".
     The default value is true.
  2. Increased Hadoop dependency version to 2.6.0.
  3. Fixed a bug introduced in 1.3.2 where, during marker file creation,
     file info was not properly updated between attempts. This lead
     to backoff-retry-exhaustion with 412-preconditon-not-met errors.
  4. Added support for changing the HttpTransport implementation to use,
     via fs.gs.http.transport.type = [APACHE | JAVA_NET]
  5. Added support for setting a proxy of the form "host:port" via
     fs.gs.proxy.address, which works for both APACHE and JAVA_NET
     HttpTransport options.
  6. All logging converted to use slf4j instead of the previous
     org.apache.commons.logging.Log; removed the LogUtil wrapper which
     previously wrapped org.apache.commons.logging.Log.
  7. Automatic retries for premature end-of-stream errors; the previous
     behavior was to throw an unrecoverable exception in such cases.
  8. Made close() idempotent for GoogleCloudStorageReadChannel
  9. Added a low-level method for setting Content-Type metadata in the
     GoogleCloudStorage interface.
  10.Increased default DirectoryListCache TTL to 4 hours, wired out TTL
     settings as top-level config params:
     fs.gs.metadata.cache.max.age.entry.ms
     fs.gs.metadata.cache.max.age.info.ms


1.3.3 - 2015-02-26

  1. When performing a retry in GoogleCloudStorageReadChannel, attempts to
     close() the underlying channel are now performed explicitly instead of
     waiting for performLazySeek() to do it, so that SSLException can be
     caught and ignored; broken SSL sockets cannot be closed normally,
     and are responsible for already cleaning up on error.
  2. Added explicit check of currentPosition == size when -1 is read from
     underlying stream in GoogleCloudStorageReadChannel, in case the
     stream fails to identify an error case and prematurely reaches
     end-of-stream.


1.3.2 - 2015-01-22

  1. In the create file path, marker file creation is now configurable. By
     default, marker files will not be created. The default is most suitable
     for MapReduce applications. Setting fs.gs.create.marker.files.enable to
     true in core-site.xml will re-enable marker files. The use of marker files
     should be considered for applications that depend on early failing when
     two concurrent writes attempt to write to the same file. Note that file
     overwrite semantics are preserved with or without marker files, but
     failures will occur sooner with marker files present.


1.3.1 - 2014-12-16

  1. Fixed a rare NullPointerException in FileSystemBackedDirectoryListCache
     which can occur if a directory being listed is purged from the cache
     between a call to "exists()" and "listFiles()".
  2. Fixed a bug in GoogleHadoopFileSystemCacheCleaner where cache-cleaner
     fails to clean any contents when a bucket is non-empty but expired.
  3. Fixed a bug in FileSystemBackedDirectoryListCache which caused garbage
     collection to require several passes for large directory hierarchies;
     now we can successfully garbage-collect an entire expired tree in a
     single pass, and cache files are also processed in-place without having
     to create a complete in-memory list.
  4. Updated handling of new file creation, file copying, and file deletion
     so that all object modification requests sent to GCS contain preconditions
     that should prevent race-conditions in the face of retried operations.


1.3.0 - 2014-10-17

  1. Directory timestamp updating can now be controlled via user-settable
     properties "fs.gs.parent.timestamp.update.enable",
     "fs.gs.parent.timestamp.update.substrings.excludes". and
     "fs.gs.parent.timestamp.update.substrings.includes" in core-site.xml. By
     default, timestamp updating is enabled for the YARN done and intermediate
     done directories and excluded for everything else. Strings listed in
     includes take precedence over excludes.
  2. Directory timestamp updating will now occur on a background thread inside
     GoogleCloudStorageFileSystem.
  3. Attempting to acquire an OAuth access token will be now be retried when
     using .p12 or installed application (JWT) credentials if there is a
     recoverable error such as an HTTP 5XX response code or an IOException.
  4. Added FileSystemBackedDirectoryListCache, extracting a common interface
     for it to share with the (InMemory)DirectoryListCache; instead of using
     an in-memory HashMap to enforce only same-process list consistency, the
     FileSystemBacked version mirrors GCS objects as empty files on a local
     FileSystem, which may itself be an NFS mount for cluster-wide or even
     potentially cross-cluster consistency groups. This allows a cluster to
     be configured with a "consistent view", making it safe to use GCS as the
     DEFAULT_FS for arbitrary multi-stage or even multi-platform workloads.
     This is now enabled by default for machine-wide consistency, but it is
     strongly recommended to configure clusters with an NFS directory for
     cluster-wide strong consistency. Relevant configuration settings:
     fs.gs.metadata.cache.enable [default: true]
     fs.gs.metadata.cache.type [IN_MEMORY (default) | FILESYSTEM_BACKED]
     fs.gs.metadata.cache.directory [default: /tmp/gcs_connector_metadata_cache]
  5. Optimized seeks in GoogleHadoopFSDataInputStream which fit within
     the pre-fetched memory buffer by simply repositioning the buffer in-place
     instead of delegating to the underlying channel at all.
  6. Fixed a performance-hindering bug in globStatus where "foo/bar/*" would
     flat-list "foo/bar" instead of "foo/bar/"; causing the "candidate matches"
     to include things like "foo/bar1" and "foo/bar1/baz", even though the
     results themselves would be correct due to filtering out the proper glob
     client-side in the end.
  7. The versions of java API clients were updated to 1.19 derived versions.


1.2.9 - 2014-09-18

  1. When directory contents are updated e.g., files or directories are added,
     removed, or renamed the GCS connector will now attempt to update a
     metadata property on the parent directory with a modification time. The
     modification time recorded will be used as the modification time in
     subsequent FileSystem#getStatus(...), FileSystem#listStatus and
     FileSystem#globStatus(...) calls and is the time as reported by
     the system clock of the system that made the modification.


1.2.8 - 2014-08-07

  1. Changed the manner in which the GCS connector jar is built to A) reduce
     included dependencies to only those parts which are used and B) repackaged
     dependencies whose versions conflict with those bundled with Hadoop.
  2. Deprecated fs.gs.system.bucket config.


1.2.7 - 2014-06-23

  1. Fixed a bug where certain globs incorrectly reported the parent directory
     being not found (and thus erroring out) in Hadoop 2.2.0 due to an
     interaction with the fs.gs.glob.flatlist.enable feature; doesn't affect
     Hadoop 1.2.1 or 2.4.0.


1.2.6 - 2014-06-05

  1. When running in hadoop 0.23+ (hadoop 2+), listStatus will now throw a
     FileNotFoundException when a non-existent path is passed in.
  2. The GCS connector now uses the v1 JSON API when accessing Google
     Cloud Storage.
  3. The GoogleHadoopFileSystem now treats the parent of the root path as if
     it is the root path. This behavior mimics the POSIX behavior of "/.."
     being the same as "/".
  4. When creating a new file, a zero-length marker file will be created
     before the FSDataOutputStream is returned in create(). This allows for
     early detection of overwrite conflicts that may occur and prevents
     certain race conditions that could be encountered when relying on
     a single exists() check.
  5. The dependencies on cglib and asm were removed from the GCS connector
     and the classes for these are no longer included in the JAR.


1.2.5 - 2014-05-08

  1. Fixed a bug where fs.gs.auth.client.file was unconditionally being
     overwritten by a default value.
  2. Enabled direct upload for directory creation to save one round-trip call.
  3. Added wiring for GoogleHadoopFileSystem.close() to call through to close()
     its underlying helper classes as well.
  4. Added a new batch mode for creating directories in parallel which requires
     manually parallelizing in the client. Speeds up nested directory creation
     and repairing large numbers of implicit directories in listStatus.
  5. Eliminated redundant API calls in listStatus, speeding it up by ~half.
  6. Fixed a bug where globStatus didn't correctly handle globs containing '?'.
  7. Implemented new version of globStatus which initially performs a flat
     listing before performing the recursive glob logic in-memory to
     dramatically speed up globs with lots of directories; the new behavior is
     default, but can disabled by setting fs.gs.glob.flatlist.enable = false.


1.2.4 - 2014-04-09

  1. The value of fs.gs.io.buffersize.write is now rounded up to 8MB if set to
     a lower value, otherwise the backend will error out on unaligned chunks.
  2. Misc refactoring to enable reuse of the resumable upload classes in other
     libraries.


1.2.3 - 2014-03-21

  1. Fixed a bug where renaming a directory could cause the file contents to get
     shuffled between files when the fully-qualified file paths have different
     lengths. Does not apply to renames on files directly, such as when using
     a glob expression inside a flat directory.
  2. Changed the behavior of batch request API calls such that they are retried
     on failure in the same manner as non-batch requests.
  3. Eliminated an unnecessary dependency on com/google/protobuf which could
     cause version-incompatibility issues with Apache Shark.


1.2.2 - 2014-02-12

  1. Fixed a bug where filenames with '+' were unreadable due to premature
     URL-decoding.
  2. Modified a check to allow fs.gs.io.buffersize.write to be a non-multiple
     of 8MB, just printing out a warning instead of check-failing.
  3. Added some debug-level logging of exceptions before throwing in cases
     where Hadoop tends to swallows the exception along with its useful info.


1.2.1 - 2014-01-23

  1. Added CHANGES.txt for release notes.
  2. Fixed a bug where accidental URI decoding make it impossible to use
     pre-escaped filenames, e.g. foo%3Abar. This is necessary for Pig to work.
  3. Fixed a bug where an IOException was thrown when trying to read a
     zero-byte file. Necessary for Pig to work.


1.2.0 - 2014-01-14

  1. Preview release of GCS connector.