
GCSHook only perform list if delimiter specified #36130

Closed

Conversation

atrbgithub
Contributor

This fixes the issue raised here against #34919

When testing locally, the list() call only appears to be required when a delimiter is passed in.

A sample test can be performed with:

from airflow.providers.google.cloud.hooks.gcs import GCSHook
res = GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/'
)
print(res)

With this fix, the code no longer fails with:

ValueError: ('Iterator has already started', <google.api_core.page_iterator.HTTPIterator object at 0x7efe067c3880>)

To confirm that this block:

                if delimiter:
                    list(blobs)

is still required: when it is removed and the following is run:

res = GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/',
        delimiter='.csv'
)
print(res)

No results are returned. Once the block is in place, however, everything works as expected.
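For illustration, the behaviour described above can be reproduced without GCS at all. The sketch below uses a hand-rolled stand-in for the page iterator (an assumption for demonstration only, not the real google.api_core.page_iterator.HTTPIterator): prefixes stays empty until iteration starts, and starting iteration a second time raises the ValueError quoted above.

```python
# Hand-rolled stand-in (NOT the real client library) mimicking the observed
# behaviour: ``prefixes`` is only populated once iteration starts, and
# iterating twice raises ValueError.
class FakeHTTPIterator:
    def __init__(self, blob_names, prefixes):
        self._blob_names = blob_names
        self._pending_prefixes = prefixes
        self.prefixes = set()
        self._started = False

    def __iter__(self):
        if self._started:
            raise ValueError("Iterator has already started")
        self._started = True
        self.prefixes = set(self._pending_prefixes)
        return iter(self._blob_names)


def list_names(blobs, delimiter=None):
    # Mirrors the fix: only force-start the iterator when a delimiter is used.
    if delimiter:
        list(blobs)
    if blobs.prefixes:
        return sorted(blobs.prefixes)
    return [name for name in blobs]


# With a delimiter: list(blobs) starts the iterator, surfacing the prefixes.
with_delim = FakeHTTPIterator([], {"a/prefix/one/", "a/prefix/two/"})
print(list_names(with_delim, delimiter="/"))  # ['a/prefix/one/', 'a/prefix/two/']

# Without a delimiter: iterating only once avoids "Iterator has already started".
no_delim = FakeHTTPIterator(["a/file1.csv", "a/file2.csv"], set())
print(list_names(no_delim))  # ['a/file1.csv', 'a/file2.csv']
```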



@atrbgithub
Contributor Author

@eladkal @shahar1 Would you be able to take a look at this if you get a chance?

Without this, #34919 should not be released (#36117)

@shahar1
Contributor

shahar1 commented Dec 8, 2023

Good catch!
If it's not too hard, could you try adding tests for both conditions? (e.g., by patching builtins.list function)
I wouldn't be too strict about it - but if it's possible, it would be quite nice to have (for example, if we make any changes to the delimiter param in the future).
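A rough sketch of what such a test could look like, using unittest.mock rather than patching builtins.list. The list_objects helper below is a stand-in for the hook logic under test (an assumption for illustration; the real test would target airflow.providers.google.cloud.hooks.gcs.GCSHook.list):

```python
from unittest import mock

# ``list_objects`` is a hypothetical stand-in for the hook logic under test.
def list_objects(bucket, prefix=None, delimiter=None):
    blobs = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    if delimiter:
        list(blobs)  # force-start the iterator only when a delimiter is set
    if blobs.prefixes:
        return sorted(blobs.prefixes)
    return [blob.name for blob in blobs]

def test_iterated_once_without_delimiter():
    blobs = mock.MagicMock()
    blobs.prefixes = set()
    blob = mock.Mock()
    blob.name = "a.csv"
    blobs.__iter__.return_value = iter([blob])
    bucket = mock.Mock()
    bucket.list_blobs.return_value = blobs
    assert list_objects(bucket, prefix="p/") == ["a.csv"]
    blobs.__iter__.assert_called_once()  # the iterator was consumed exactly once

def test_delimiter_forces_iteration():
    blobs = mock.MagicMock()
    blobs.prefixes = set()

    def start():
        # Simulate the client populating prefixes once iteration begins.
        blobs.prefixes = {"p/one/"}
        return iter([])

    blobs.__iter__.side_effect = start
    bucket = mock.Mock()
    bucket.list_blobs.return_value = blobs
    assert list_objects(bucket, prefix="p/", delimiter="/") == ["p/one/"]

test_iterated_once_without_delimiter()
test_delimiter_forces_iteration()
print("ok")
```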

@eladkal eladkal changed the title Only perform list if delimiter specified GCSHook only perform list if delimiter specified Dec 8, 2023
Contributor

@eladkal eladkal left a comment


I agree with @shahar1 we need tests to avoid regression.

Can you please add unit tests?

Comment on lines 824 to 825

-                list(blobs)
+                if delimiter:
+                    list(blobs)
Copy link
Member


IMHO, there is a problem with the original PR (#34919) that added this statement: we create the generators to avoid loading all the data into memory, but by applying list() to the blobs generator you force the generator to loop over all the pages and load them into memory.

Instead, the solution should be based on a loop over the generator elements, similar to what we do here:

ids.extend(
    blob.name
    for blob in blobs
    if timespan_start <= blob.updated.replace(tzinfo=timezone.utc) < timespan_end
)

Contributor Author


When a delimiter is used, the iterator must be started so that blobs.prefixes is populated, which is how the blob names are then captured.

Whereas, when match_glob is used, blobs.prefixes is never populated and instead you must iterate over blobs and access each blob.name.

I'm not sure why this is the case. The original version of the code started the iterator by using a for loop to iterate over blobs, whether or not a delimiter had been passed in.

blob_names = []
for blob in blobs:
    blob_names.append(blob.name)
prefixes = blobs.prefixes
if prefixes:
    ids += list(prefixes)
else:
    ids += blob_names

When delimiter is used, if you modify the code to be the following:

for blob in blobs:
    print("here")
    blob_names.append(blob.name)

Nothing is printed. Simply starting the iterator populates blobs.prefixes.

The code has been modified so that rather than using list(), it uses next(). I don't think it will make any difference, since blobs appears to be an empty list.

The code has been tested against a bucket with over 100k files; the same number of files is returned whether delimiter or match_glob is passed in.
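As a side note on the list( to next( change, a minimal stand-alone sketch of the difference between the two calls on a plain generator:

```python
# Contrast next(blobs, None) with list(blobs) on a plain generator:
# next() pulls at most one element to start the iterator, while list()
# drains everything into memory.
def pages():
    for i in range(5):
        yield i

g = pages()
first = next(g, None)    # starts the iterator, pulls a single element
remaining = list(g)      # drains the rest
print(first, remaining)  # 0 [1, 2, 3, 4]

# next(..., None) is also safe on an empty iterator: no StopIteration.
empty = next(iter([]), None)
print(empty)  # None
```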

You also make an interesting point in referencing the code in list_by_timespan. I don't believe that method would ever have filtered the files based on timespan when a delimiter has been passed in:

if blobs.prefixes:
    ids.extend(blobs.prefixes)
else:
    ids.extend(
        blob.name
        for blob in blobs
        if timespan_start <= blob.updated.replace(tzinfo=timezone.utc) < timespan_end
    )

It would simply add all the prefixes without doing any filtering.
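To illustrate, a stand-alone sketch of that branch (FakeBlob is a hypothetical stand-in for a storage blob, not part of the real client library) shows the prefixes path bypassing the timespan filter:

```python
from datetime import datetime, timezone

# FakeBlob is a hypothetical stand-in for a storage blob with an ``updated``
# timestamp; it is not part of the real client library.
class FakeBlob:
    def __init__(self, name, updated):
        self.name = name
        self.updated = updated

def collect(blobs, prefixes, timespan_start, timespan_end):
    # Same shape as the quoted list_by_timespan branch.
    ids = []
    if prefixes:
        ids.extend(prefixes)  # note: no timespan check on this path
    else:
        ids.extend(
            blob.name
            for blob in blobs
            if timespan_start <= blob.updated.replace(tzinfo=timezone.utc) < timespan_end
        )
    return ids

start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 2, tzinfo=timezone.utc)
old = FakeBlob("old.csv", datetime(2020, 1, 1))

# With prefixes (the delimiter case) everything is kept, unfiltered:
print(collect([old], {"a/2020/"}, start, end))  # ['a/2020/']

# Without prefixes, the out-of-range blob is filtered out:
print(collect([old], set(), start, end))  # []
```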

@atrbgithub
Contributor Author

atrbgithub commented Dec 11, 2023

@hussein-awala I agree; I think that would be better.

I found an issue where this would fail if a prefix which does not exist is passed in: it would throw the Iterator has already started exception. To fix that, and to iterate over the generator rather than loading all objects into memory, how does something like this look?

            else:
                blobs = bucket.list_blobs(
                    max_results=max_results,
                    page_token=page_token,
                    prefix=prefix,
                    delimiter=delimiter,
                    versions=versions,
                )
                if delimiter:
                    blobs_iter = next(blobs, None)

            if blobs.prefixes:
                ids.extend(blobs.prefixes)
            elif not delimiter:
                ids.extend(blob.name for blob in blobs)

            page_token = blobs.next_page_token
            if page_token is None:
                # empty next page token
                break
        return ids

I don't understand why the Google storage client behaves differently when a delimiter is used and results exist. The issue seems to stem from that.

The above snippet, however, is able to list objects when a delimiter is passed in, handles prefixes with no objects, and avoids using list(). The correct number of objects is also returned.

Using the above, however, it doesn't seem to page. Let's say we have:

GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/',
        delimiter='.csv',
        max_results=10
)

When running this on a bucket with thousands of files, everything is captured with ids.extend(blobs.prefixes).

However, if called with:

GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/',
        match_glob='**/*/*.csv',
        max_results=10
)

page_token is populated and we loop through the while True: loop here.

I'm not sure if this is expected behaviour of the GCS client. It's possible that this behaviour may change in the future, and I'm not sure how the above code would handle that.
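For reference, the paging pattern in question can be sketched stand-alone with a fake client (all names below are hypothetical stand-ins, not the real library):

```python
# FakePage stands in for one page of list_blobs results, carrying a
# next_page_token like the real iterator; fake_list_blobs stands in for
# the client call. Both are hypothetical, for illustration only.
class FakePage:
    def __init__(self, names, next_page_token):
        self._names = names
        self.next_page_token = next_page_token

    def __iter__(self):
        return iter(self._names)

def fake_list_blobs(pages, page_token=None):
    return pages[page_token or 0]

pages = {
    0: FakePage(["a.csv", "b.csv"], next_page_token=1),
    1: FakePage(["c.csv"], next_page_token=None),
}

ids = []
page_token = None
while True:
    blobs = fake_list_blobs(pages, page_token=page_token)
    ids.extend(name for name in blobs)
    page_token = blobs.next_page_token
    if page_token is None:
        # empty next page token
        break

print(ids)  # ['a.csv', 'b.csv', 'c.csv']
```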

@atrbgithub
Contributor Author

atrbgithub commented Dec 11, 2023

The old code, before this file was refactored, seems to work fine for both match_glob and delimiter; perhaps we should just revert to that?

blob_names = []
for blob in blobs:
    blob_names.append(blob.name)
prefixes = blobs.prefixes
if prefixes:
    ids += list(prefixes)
else:
    ids += blob_names

@atrbgithub atrbgithub force-pushed the fix-listing-no-delimiter branch 2 times, most recently from 70352af to 2f9d483 Compare December 11, 2023 19:00
When delimiter is passed into list_blobs, prefixes is not populated, loading is forced with next(blobs, None).

Do not attempt to iterate again over blobs if blobs.prefixes is empty and delimiter has been passed in.

This prevents the "iterator has already started" error message.
@atrbgithub
Contributor Author

I think we should close this in favour of #36202

@atrbgithub atrbgithub closed this Dec 13, 2023
Labels
area:providers, provider:google