Releases: GoogleCloudDataproc/hadoop-connectors
2021-11-16 (GCS 2.2.4)
- Support GCS fine-grained action in AuthorizationHandlers.
- Upgrade Google Auth library to support ExternalAccount.
- Decrease log level for the hflush rate-limit warning log message.
- Update Google dependencies to LTS versions.
2021-10-19 (GCS 2.2.3)
- Update all dependencies to latest versions.
- Add support for downscoped tokens in AccessTokenProvider.
- Restore compatibility with pre-2.8 Hadoop versions.
- Migrate gRPC channels to GCS v2 APIs.
- Add zero-copy deserializer for gRPC reads, toggled with option:
  fs.gs.grpc.read.zerocopy.enable (default: true)
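If the zero-copy deserializer needs to be switched off, the toggle is an ordinary Hadoop configuration key. A minimal sketch of a core-site.xml fragment, assuming the property name and default from the entry above (the surrounding file layout is illustrative):

```xml
<!-- core-site.xml: disable the gRPC zero-copy read deserializer (enabled by default) -->
<property>
  <name>fs.gs.grpc.read.zerocopy.enable</name>
  <value>false</value>
</property>
```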
2021-06-29 (GCS 2.2.2)
Changelog
Cloud Storage connector:
- Support footer prefetch in gRPC read channel.
- Fix in-place seek functionality in gRPC read channel.
- Add option to buffer requests for resumable upload over gRPC:
  fs.gs.grpc.write.buffered.requests (default: 20)
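The write-buffering option is tuned like any other connector property. A hedged core-site.xml sketch raising the buffer above the default of 20 requests; the value 40 is only an example, not a recommendation:

```xml
<!-- core-site.xml: buffer more requests for gRPC resumable uploads (default: 20) -->
<property>
  <name>fs.gs.grpc.write.buffered.requests</name>
  <value>40</value>
</property>
```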
2021-05-28 (GCS 2.2.1)
Changelog
Cloud Storage connector:
- Fix proxy configuration for Apache HTTP transport.
- Update gRPC dependency to latest version.
2021-01-07 (GCS 2.2.0, BQ 1.2.0)
Changelog
Cloud Storage connector:
- Delete deprecated methods.
- Update all dependencies to latest versions.
- Add support for CSEK encryption of Cloud Storage objects:
  fs.gs.encryption.algorithm (not set by default)
  fs.gs.encryption.key (not set by default)
  fs.gs.encryption.key.hash (not set by default)
- Add a property to override the storage service path:
  fs.gs.storage.service.path (default: storage/v1/)
- Added a new output stream type which can be used by setting:
  fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
  The FLUSHABLE_COMPOSITE output stream type behaves similarly to the SYNCABLE_COMPOSITE type, except that it also supports hflush(), which uses the same implementation as hsync() in the SYNCABLE_COMPOSITE output stream type.
- Added a new output stream parameter to configure the minimum time interval (in milliseconds) between consecutive syncs, to avoid getting rate limited by GCS:
  fs.gs.outputstream.sync.min.interval.ms (default: 0)
  The default of 0 means no wait between syncs. When rate limited, hsync() will block waiting for permits, but hflush() will simply do nothing and return.
Added a new parameter to configure output stream pipe type:
fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
Valid values are
NIO_CHANNEL_PIPE
andIO_STREAM_PIPE
.Output stream now supports (when property value set to
NIO_CHANNEL_PIPE
) Java NIO Pipe that allows to reliably write in the output stream from multiple threads without "Pipe broken" exceptions.Note that when using
NIO_CHANNEL_PIPE
option maximum upload throughput can decrease by 10%. -
Add a property to impersonate a service account:
fs.gs.auth.impersonation.service.account (not set by default)
If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (
roles/iam.serviceAccountTokenCreator
) on the service account to impersonate. -
Throw
ClosedChannelException
inGoogleHadoopOutputStream.write
methods if stream already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage. -
Add properties to impersonate a service account through user or group name:
fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default) fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
If any of these properties are set, an access token will be generated for the service account associated with specified user name or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (
roles/iam.serviceAccountTokenCreator
) on the service account to impersonate. -
- Fix globbing of complex patterns.
- Added support for an authorization handler for Cloud Storage requests. This feature is configurable through the properties:
  fs.gs.authorization.handler.impl=<FULLY_QUALIFIED_AUTHORIZATION_HANDLER_CLASS>
  fs.gs.authorization.handler.properties.<AUTHORIZATION_HANDLER_PROPERTY>=<VALUE>
  If the fs.gs.authorization.handler.impl property is set, the specified authorization handler will be used to authorize Cloud Storage API requests before executing them. The handler will throw AccessDeniedException for rejected requests if the user does not have sufficient permissions (is not authorized) to execute them. All properties with the fs.gs.authorization.handler.properties. prefix are passed to an instance of the configured authorization handler class after instantiation, before any Cloud Storage request handling methods are called.
- Set the default value of the fs.gs.status.parallel.enable property to true.
- Tune exponential backoff configuration for Cloud Storage requests.
- Increment Hadoop FileSystem.Statistics counters for read and write operations.
- Always infer implicit directories and remove the fs.gs.implicit.dir.infer.enable property.
- Replace 2 glob-related properties (fs.gs.glob.flatlist.enable and fs.gs.glob.concurrent.enable) with a single property to configure the glob search algorithm:
  fs.gs.glob.algorithm (default: CONCURRENT)
- Do not create parent directory objects (including buckets) when creating a new file or a directory; instead, rely on implicit directory inference.
- Use default logging backend for Google Flogger instead of Slf4j.
- Add FsBenchmark tool for benchmarking HCFS.
- Remove obsolete fs.gs.inputstream.buffer.size property and related functionality.
- Fix unauthenticated access support (fs.gs.auth.null.enable=true).
- Improve cache hit ratio when the fs.gs.performance.cache.enable property is set to true.
- Remove obsolete configuration properties and related functionality:
  fs.gs.auth.client.id
  fs.gs.auth.client.file
  fs.gs.auth.client.secret
- Add a property that allows disabling HCFS semantics enforcement. If set to false, the GCS connector will not check whether a directory with the same name already exists when creating a new file, and vice versa:
  fs.gs.create.items.conflict.check.enable (default: true)
- Remove redundant properties:
  fs.gs.config.override.file
  fs.gs.copy.batch.threads
  fs.gs.copy.max.requests.per.batch
- Change default value of fs.gs.inputstream.min.range.request.size property from 524288 to 2097152.
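Several of the 2.2.0 options above are plain Hadoop configuration keys that can be combined in one file. A hedged core-site.xml sketch, assuming AES256 as the CSEK algorithm (the usual value for customer-supplied encryption keys); the key and key-hash values are placeholders, not working credentials:

```xml
<!-- core-site.xml sketch for GCS connector 2.2.0 options; all values are illustrative -->
<configuration>
  <!-- CSEK: customer-supplied encryption key (base64-encoded in practice) -->
  <property>
    <name>fs.gs.encryption.algorithm</name>
    <value>AES256</value>
  </property>
  <property>
    <name>fs.gs.encryption.key</name>
    <value>BASE64_ENCODED_KEY_PLACEHOLDER</value>
  </property>
  <property>
    <name>fs.gs.encryption.key.hash</name>
    <value>BASE64_ENCODED_KEY_HASH_PLACEHOLDER</value>
  </property>
  <!-- Glob search algorithm (CONCURRENT is already the default) -->
  <property>
    <name>fs.gs.glob.algorithm</name>
    <value>CONCURRENT</value>
  </property>
  <!-- HCFS conflict checking; disable only if the consequences are understood -->
  <property>
    <name>fs.gs.create.items.conflict.check.enable</name>
    <value>true</value>
  </property>
</configuration>
```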
Big Query connector:
- Update all dependencies to latest versions.
- Fix BigQuery job status retrieval in non-US locations.
- Use default logging backend for Google Flogger instead of Slf4j.
- Remove unused mapred.bq.output.buffer.size configuration property.
- Fix unauthenticated access support (mapred.bq.auth.null.enable=true).
- Remove obsolete configuration properties and related functionality:
  mapred.bq.auth.client.id
  mapred.bq.auth.client.file
  mapred.bq.auth.client.secret
2020-11-09 (GCS 2.1.6, BQ 1.1.6)
Changelog
Cloud Storage connector:
- Increment Hadoop FileSystem.Statistics counters for read and write operations.
- Add FsBenchmark tool for benchmarking HCFS.
- Update all dependencies to latest versions.
Big Query connector:
- Fix reads using DirectBigQueryInputFormat.
- Update all dependencies to latest versions.
2020-09-11 (GCS 2.1.5, BQ 1.1.5)
Changelog
Cloud Storage connector:
- Fix globbing of complex patterns.
- Tune exponential backoff configuration for Cloud Storage requests.
- Add a property to ignore Cloud Storage precondition failures when overwriting objects in a concurrent environment:
  fs.gs.overwrite.generation.mismatch.ignore (default: false)
- Update all dependencies to latest versions.
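The precondition-failure option above is off by default; turning it on trades safety against generation-mismatch errors for smoother concurrent overwrites. A minimal sketch of the core-site.xml fragment that enables it:

```xml
<!-- core-site.xml: ignore generation-mismatch precondition failures on overwrite -->
<property>
  <name>fs.gs.overwrite.generation.mismatch.ignore</name>
  <value>true</value>
</property>
```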
Big Query connector:
- Fix BigQuery job status retrieval in non-US locations.
- Update all dependencies to latest versions.
2020-08-07 (GCS 1.9.18, BQ 0.13.18)
Changelog
Cloud Storage connector:
- Fix globbing of complex patterns.
- Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if the stream is already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.
- Fix proxy authentication when using JAVA_NET transport.
Big Query connector:
- POM updates for GCS connector 1.9.18.
- Fix proxy authentication when using JAVA_NET transport.
2020-07-15 (GCS 2.1.4, BQ 1.1.4)
Changelog
Cloud Storage connector:
- Added a new parameter to configure the output stream pipe type:
  fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
  Valid values are NIO_CHANNEL_PIPE and IO_STREAM_PIPE. When the property is set to NIO_CHANNEL_PIPE, the output stream uses a Java NIO Pipe that allows reliable writes to the output stream from multiple threads without "Pipe broken" exceptions. Note that with the NIO_CHANNEL_PIPE option the maximum upload throughput can decrease by 10%.
- Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if the stream is already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.
- Add a property to impersonate a service account:
  fs.gs.auth.impersonation.service.account (not set by default)
  If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
- Add properties to impersonate a service account through a user or group name:
  fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default)
  fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
  If any of these properties is set, an access token will be generated for the service account associated with the specified user or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
- Update all dependencies to latest versions.
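The impersonation properties above take service account email addresses as values. A hedged core-site.xml sketch; the project and account names are hypothetical examples, not real identities:

```xml
<!-- core-site.xml: service account impersonation (placeholder account names) -->
<!-- Impersonate one service account for all GCS calls -->
<property>
  <name>fs.gs.auth.impersonation.service.account</name>
  <value>gcs-writer@example-project.iam.gserviceaccount.com</value>
</property>
<!-- Or map a specific user to its own service account -->
<property>
  <name>fs.gs.auth.impersonation.service.account.for.user.alice</name>
  <value>alice-sa@example-project.iam.gserviceaccount.com</value>
</property>
```

Remember that the caller's own identity needs roles/iam.serviceAccountTokenCreator on each target service account, as the entries above note.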
Big Query connector:
- Update all dependencies to latest versions.
2020-05-08 (GCS 2.1.3, BQ 1.1.3)
Changelog
Cloud Storage connector:
- Add support for CSEK encryption of Cloud Storage objects:
  fs.gs.encryption.algorithm (not set by default)
  fs.gs.encryption.key (not set by default)
  fs.gs.encryption.key.hash (not set by default)
- Update all dependencies to latest versions.
- Added a new output stream type which can be used by setting:
  fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
  The FLUSHABLE_COMPOSITE output stream type behaves similarly to the SYNCABLE_COMPOSITE type, except that it also supports hflush(), which uses the same implementation as hsync() in the SYNCABLE_COMPOSITE output stream type.
- Added a new output stream parameter to configure the minimum time interval (in milliseconds) between consecutive syncs, to avoid getting rate limited by GCS:
  fs.gs.outputstream.sync.min.interval.ms (default: 0)
  The default of 0 means no wait between syncs. When rate limited, hsync() will block waiting for permits, but hflush() will simply do nothing and return.
- Restore compatibility with pre-2.8 Hadoop versions.
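Jobs that call hflush(), such as Spark Streaming checkpointing, can combine the new stream type with a minimum sync interval to stay under GCS rate limits. A sketch of the two properties together; the 1000 ms interval is only an example value:

```xml
<!-- core-site.xml: hflush()-capable output stream with a minimum sync interval -->
<property>
  <name>fs.gs.outputstream.type</name>
  <value>FLUSHABLE_COMPOSITE</value>
</property>
<property>
  <name>fs.gs.outputstream.sync.min.interval.ms</name>
  <value>1000</value>
</property>
```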
Big Query connector:
- Update all dependencies to latest versions.