
Add MSQ Durable Storage Connector for Google Cloud Storage and change current Google Cloud Storage client library #15398

Merged · 30 commits · Dec 14, 2023
f40e285
Initial commit with the new GCS library
gargvishesh Nov 16, 2023
b4e719a
Changes for GCS Storage Connector
gargvishesh Nov 17, 2023
09fef50
Remove exception
gargvishesh Nov 17, 2023
10bad2f
Add prefix to paths
gargvishesh Nov 20, 2023
9211a69
Fixes
gargvishesh Nov 20, 2023
9a6ab03
Working version with fixes for removing try-with-resource and objectP…
gargvishesh Nov 21, 2023
546b1ef
Minor changes
gargvishesh Nov 21, 2023
26be863
Use full path for deleteRecursively and fix tests
gargvishesh Nov 21, 2023
937056b
Refactoring and checkstyle fixes
gargvishesh Nov 21, 2023
a166099
Doc updates
gargvishesh Nov 21, 2023
a84a368
Remove GCS BOM in dependency management
gargvishesh Nov 21, 2023
1e8b38f
Remove comments
gargvishesh Nov 21, 2023
8319d4c
Create temp dir in google storage connector constructor
gargvishesh Nov 30, 2023
dad4289
Address review comments
gargvishesh Dec 4, 2023
b0d15be
Limit internal buffer size of GCS storage reader and writer channels
gargvishesh Dec 4, 2023
f9105a9
Update write buffer size to 2 MB
gargvishesh Dec 6, 2023
c05a9c0
Add GoogleStorageConnectorTest
gargvishesh Dec 7, 2023
93a6df3
Merge branch 'master' into 35053-gcs-durable-storage-connector
gargvishesh Dec 7, 2023
55febef
Address review comments
gargvishesh Dec 7, 2023
f09c461
Remove exception handling
gargvishesh Dec 8, 2023
44e9b10
Fix exceptions and handling
gargvishesh Dec 8, 2023
4454afb
Fix exceptions and handling - part 2
gargvishesh Dec 12, 2023
9f30fd2
Fix exceptions and handling - part 3
gargvishesh Dec 12, 2023
fc51307
Fix build errors and pom
gargvishesh Dec 12, 2023
54a9687
Add licenses and tests
gargvishesh Dec 12, 2023
306d71a
fix license.yaml module name
gargvishesh Dec 13, 2023
3bd3125
Add toString to GoogleInputRange
gargvishesh Dec 13, 2023
cb38ccd
Add GoogleStorageTest
gargvishesh Dec 13, 2023
8294a33
Fix forbidden api error
gargvishesh Dec 13, 2023
9e08c84
Fix forbidden api error - part 2
gargvishesh Dec 13, 2023
1 change: 1 addition & 0 deletions distribution/bin/check-licenses.py
@@ -247,6 +247,7 @@ def build_compatible_license_names():
compatible_licenses['The BSD 3-Clause License'] = 'BSD-3-Clause License'
compatible_licenses['Revised BSD'] = 'BSD-3-Clause License'
compatible_licenses['New BSD License'] = 'BSD-3-Clause License'
+compatible_licenses['BSD New license'] = 'BSD-3-Clause License'
compatible_licenses['3-Clause BSD License'] = 'BSD-3-Clause License'
compatible_licenses['BSD 3-Clause'] = 'BSD-3-Clause License'
compatible_licenses['BSD-3-Clause'] = 'BSD-3-Clause License'
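The hunk above adds one more vendor spelling to the normalization map. As a sketch, the lookup behaves like this (entries copied from the hunk; the function name matches the script's, the rest of the mapping is elided):

```python
def build_compatible_license_names():
    # Subset of the mapping in distribution/bin/check-licenses.py,
    # including the entry this PR adds ('BSD New license'). Each vendor
    # spelling normalizes to one canonical license name.
    compatible_licenses = {}
    compatible_licenses['The BSD 3-Clause License'] = 'BSD-3-Clause License'
    compatible_licenses['Revised BSD'] = 'BSD-3-Clause License'
    compatible_licenses['New BSD License'] = 'BSD-3-Clause License'
    compatible_licenses['BSD New license'] = 'BSD-3-Clause License'
    compatible_licenses['3-Clause BSD License'] = 'BSD-3-Clause License'
    return compatible_licenses

# A dependency reporting 'BSD New license' (the new GCS client's spelling)
# now resolves instead of failing the license check.
names = build_compatible_license_names()
```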
8 changes: 4 additions & 4 deletions docs/multi-stage-query/reference.md
@@ -356,24 +356,24 @@ SQL-based ingestion supports using durable storage to store intermediate files t

### Durable storage configurations

-Durable storage is supported on Amazon S3 storage and Microsoft's Azure Blob Storage.
+Durable storage is supported on Amazon S3 storage, Microsoft's Azure Blob Storage and Google Cloud Storage.
There are common configurations that control the behavior regardless of which storage service you use. Apart from these common configurations, there are a few properties specific to S3 and to Azure.

Common properties to configure the behavior of durable storage

|Parameter | Required | Description | Default |
|--|--|--|--|
|`druid.msq.intermediate.storage.enable` | Yes | Whether to enable durable storage for the cluster. Set it to true to enable durable storage. For more information about enabling durable storage, see [Durable storage](../operations/durable-storage.md). | false |
-|`druid.msq.intermediate.storage.type` | Yes | The type of storage to use. Set it to `s3` for S3 and `azure` for Azure | n/a |
+|`druid.msq.intermediate.storage.type` | Yes | The type of storage to use. Set it to `s3` for S3, `azure` for Azure and `google` for Google | n/a |
|`druid.msq.intermediate.storage.tempDir`| Yes | Directory path on the local disk to store temporary files required while uploading and downloading the data | n/a |
|`druid.msq.intermediate.storage.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
|`druid.msq.intermediate.storage.chunkSize` | No | Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls made to the durable storage, however it requires more disk space to store the temporary chunks. Druid uses a default of 100MiB if the value is not provided.| 100MiB |

-To use S3 for durable storage, you also need to configure the following properties:
+To use S3 or Google for durable storage, you also need to configure the following properties:

|Parameter | Required | Description | Default |
|-------------------|----------------------------------------|----------------------| --|
-|`druid.msq.intermediate.storage.bucket` | Yes | The S3 bucket where the files are uploaded to and downloaded from | n/a |
+|`druid.msq.intermediate.storage.bucket` | Yes | The S3 or Google bucket where the files are uploaded to and downloaded from | n/a |
|`druid.msq.intermediate.storage.prefix` | Yes | Path prepended to all the paths uploaded to the bucket to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. | n/a |
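Taken together, a hypothetical `runtime.properties` fragment enabling GCS durable storage could look like the following (bucket, prefix, and tempDir values are placeholders; maxRetry and chunkSize are shown at their documented defaults and could be omitted):

```
druid.msq.intermediate.storage.enable=true
druid.msq.intermediate.storage.type=google
druid.msq.intermediate.storage.bucket=your-gcs-bucket
druid.msq.intermediate.storage.prefix=durable/cluster-a
druid.msq.intermediate.storage.tempDir=/tmp/msq-intermediate
druid.msq.intermediate.storage.maxRetry=10
druid.msq.intermediate.storage.chunkSize=100MiB
```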

To use Azure for durable storage, you also need to configure the following properties:
Expand Down
2 changes: 1 addition & 1 deletion docs/operations/durable-storage.md
@@ -25,7 +25,7 @@ sidebar_label: "Durable storage"

You can use durable storage to improve querying from deep storage and SQL-based ingestion.

-> Note that only S3 is supported as a durable storage location.
+> Note that S3, Azure and Google are all supported as durable storage locations.

Durable storage for queries from deep storage provides a location where you can write the results of deep storage queries to. Durable storage for SQL-based ingestion is used to temporarily house intermediate files, which can improve reliability.

18 changes: 4 additions & 14 deletions extensions-core/google-extensions/pom.xml
@@ -48,16 +48,11 @@
</dependency>

<dependency>
-      <groupId>com.google.apis</groupId>
-      <artifactId>google-api-services-storage</artifactId>
-      <version>${com.google.apis.storage.version}</version>
-      <exclusions>
-        <exclusion>
-          <groupId>com.google.api-client</groupId>
-          <artifactId>google-api-client</artifactId>
-        </exclusion>
-      </exclusions>
+      <groupId>com.google.cloud</groupId>
+      <artifactId>google-cloud-storage</artifactId>
+      <version>${com.google.cloud.storage.version}</version>
    </dependency>

Contributor commented:

> We might want to update the `com.google.apis:google-api-services-storage` dependency (with its `com.google.api-client:google-api-client` exclusion) as well.

Author replied:

> Removed now.

<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
@@ -114,11 +109,6 @@
<artifactId>google-http-client</artifactId>
<scope>provided</scope>
</dependency>
-    <dependency>
-      <groupId>com.google.api-client</groupId>
-      <artifactId>google-api-client</artifactId>
-      <scope>provided</scope>
-    </dependency>
<dependency>
<groupId>com.google.code.findbugs</groupId>
<artifactId>jsr305</artifactId>
@@ -23,7 +23,6 @@
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.annotation.JsonProperty;
-import com.google.api.services.storage.model.StorageObject;
import com.google.common.collect.Iterators;
import org.apache.druid.data.input.InputEntity;
import org.apache.druid.data.input.InputSplit;
@@ -37,12 +36,12 @@
import org.apache.druid.storage.google.GoogleInputDataConfig;
import org.apache.druid.storage.google.GoogleStorage;
import org.apache.druid.storage.google.GoogleStorageDruidModule;
+import org.apache.druid.storage.google.GoogleStorageObjectMetadata;
import org.apache.druid.storage.google.GoogleUtils;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.io.IOException;
-import java.math.BigInteger;
import java.net.URI;
import java.util.Collections;
import java.util.Iterator;
@@ -139,23 +138,23 @@ public Iterator<LocationWithSize> getDescriptorIteratorForPrefixes(List<URI> pre
@Override
public long getObjectSize(CloudObjectLocation location) throws IOException
{
-    final StorageObject storageObject = storage.getMetadata(location.getBucket(), location.getPath());
+    final GoogleStorageObjectMetadata storageObject = storage.getMetadata(location.getBucket(), location.getPath());
return getSize(storageObject);
}
}

return new SplitWidget();
}

-private static long getSize(final StorageObject object)
+private static long getSize(final GoogleStorageObjectMetadata object)
{
-  final BigInteger sizeInBigInteger = object.getSize();
+  final Long sizeInLong = object.getSize();

-  if (sizeInBigInteger == null) {
+  if (sizeInLong == null) {
return Long.MAX_VALUE;
} else {
try {
-      return sizeInBigInteger.longValueExact();
+      return sizeInLong;
}
catch (ArithmeticException e) {
LOG.warn(
Expand All @@ -164,7 +163,7 @@ private static long getSize(final StorageObject object)
+ "The max long value will be used for its size instead.",
object.getBucket(),
object.getName(),
-        sizeInBigInteger
+        sizeInLong
);
return Long.MAX_VALUE;
}
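The switch from `BigInteger` to a boxed `Long` means `getSize()` can no longer overflow, so the old `longValueExact()` guard reduces to a null check. A minimal Python sketch of the resulting behavior (the sentinel value mirrors Java's `Long.MAX_VALUE`; the function name is illustrative):

```python
LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE, used as a "size unknown" sentinel

def get_size(metadata_size):
    # Mirrors the patched getSize(): GoogleStorageObjectMetadata.getSize()
    # may return null (None here), in which case the connector falls back
    # to the max-long sentinel so range logic treats the object as unbounded.
    return LONG_MAX if metadata_size is None else metadata_size
```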
@@ -51,12 +51,12 @@ public String getPath()
@Override
public InputStream openStream() throws IOException
{
-    return storage.get(bucket, path);
+    return storage.getInputStream(bucket, path);
}

public InputStream openStream(long start) throws IOException
{
-    return storage.get(bucket, path, start);
+    return storage.getInputStream(bucket, path, start);
}

@Override
@@ -83,7 +83,7 @@ FileUtils.FileCopyResult getSegmentFiles(final String bucket, final String path,
public InputStream getInputStream(URI uri) throws IOException
{
String path = StringUtils.maybeRemoveLeadingSlash(uri.getPath());
-    return storage.get(uri.getHost() != null ? uri.getHost() : uri.getAuthority(), path);
+    return storage.getInputStream(uri.getHost() != null ? uri.getHost() : uri.getAuthority(), path);
}
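The `uri.getHost() != null ? uri.getHost() : uri.getAuthority()` fallback exists because Java's `URI.getHost()` returns null for bucket names that are not valid hostnames (for example, names containing underscores), while `getAuthority()` still returns them. A rough Python sketch of the bucket/path extraction (function name is illustrative; Python's `urlparse` does not distinguish host from authority the way Java does):

```python
from urllib.parse import urlparse

def bucket_and_path(uri):
    # Approximates the connector's getInputStream(URI): the bucket is the
    # URI authority (netloc) and the object path drops its leading slash,
    # as StringUtils.maybeRemoveLeadingSlash does in the Java code.
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")
```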

@Override