
GCSHook only perform list if delimiter specified #36130

Closed

Conversation

atrbgithub
Contributor

This fixes the issue raised here against #34919

When testing locally, the list() call only appears to be required when a delimiter is passed in.

A sample test can be performed with:

from airflow.providers.google.cloud.hooks.gcs import GCSHook
res = GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/'
)
print(res)

With this fix, the code no longer fails with:

ValueError: ('Iterator has already started', <google.api_core.page_iterator.HTTPIterator object at 0x7efe067c3880>)

To confirm that this block:

                if delimiter:
                    list(blobs)

is still required: when it is removed and the following is run:

res = GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/',
        delimiter='.csv'
)
print(res)

No results are returned. Once the block is in place, however, everything works as expected.
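For illustration, the behaviour described above can be reproduced without GCS at all. The sketch below uses a hand-rolled stand-in for the page iterator (an assumption for demonstration only, not the real google.api_core.page_iterator.HTTPIterator): prefixes stays empty until iteration starts, and starting iteration a second time raises the ValueError quoted above.

```python
# Hand-rolled stand-in (NOT the real client library) mimicking the observed
# behaviour: ``prefixes`` is only populated once iteration starts, and
# iterating twice raises ValueError.
class FakeHTTPIterator:
    def __init__(self, blob_names, prefixes):
        self._blob_names = blob_names
        self._pending_prefixes = prefixes
        self.prefixes = set()
        self._started = False

    def __iter__(self):
        if self._started:
            raise ValueError("Iterator has already started")
        self._started = True
        self.prefixes = set(self._pending_prefixes)
        return iter(self._blob_names)


def list_names(blobs, delimiter=None):
    # Mirrors the fix: only force-start the iterator when a delimiter is used.
    if delimiter:
        list(blobs)
    if blobs.prefixes:
        return sorted(blobs.prefixes)
    return [name for name in blobs]


# With a delimiter: list(blobs) starts the iterator, surfacing the prefixes.
with_delim = FakeHTTPIterator([], {"a/prefix/one/", "a/prefix/two/"})
print(list_names(with_delim, delimiter="/"))  # ['a/prefix/one/', 'a/prefix/two/']

# Without a delimiter: iterating only once avoids "Iterator has already started".
no_delim = FakeHTTPIterator(["a/file1.csv", "a/file2.csv"], set())
print(list_names(no_delim))  # ['a/file1.csv', 'a/file2.csv']
```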



@atrbgithub
Contributor Author

@eladkal @shahar1 Would you be able to take a look at this if you get a chance?

Without this, #34919 should not be released (#36117)

@shahar1
Contributor

shahar1 commented Dec 8, 2023

Good catch!
If it's not too hard, could you try adding tests for both conditions? (e.g., by patching builtins.list function)
I wouldn't be too strict about it - but if it's possible, it would be quite nice to have (for example, if we make any changes to the delimiter param in the future).
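A rough sketch of what such a test could look like, using unittest.mock rather than patching builtins.list. The list_objects helper below is a stand-in for the hook logic under test (an assumption for illustration; the real test would target airflow.providers.google.cloud.hooks.gcs.GCSHook.list):

```python
from unittest import mock

# ``list_objects`` is a hypothetical stand-in for the hook logic under test.
def list_objects(bucket, prefix=None, delimiter=None):
    blobs = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    if delimiter:
        list(blobs)  # force-start the iterator only when a delimiter is set
    if blobs.prefixes:
        return sorted(blobs.prefixes)
    return [blob.name for blob in blobs]

def test_iterated_once_without_delimiter():
    blobs = mock.MagicMock()
    blobs.prefixes = set()
    blob = mock.Mock()
    blob.name = "a.csv"
    blobs.__iter__.return_value = iter([blob])
    bucket = mock.Mock()
    bucket.list_blobs.return_value = blobs
    assert list_objects(bucket, prefix="p/") == ["a.csv"]
    blobs.__iter__.assert_called_once()  # the iterator was consumed exactly once

def test_delimiter_forces_iteration():
    blobs = mock.MagicMock()
    blobs.prefixes = set()

    def start():
        # Simulate the client populating prefixes once iteration begins.
        blobs.prefixes = {"p/one/"}
        return iter([])

    blobs.__iter__.side_effect = start
    bucket = mock.Mock()
    bucket.list_blobs.return_value = blobs
    assert list_objects(bucket, prefix="p/", delimiter="/") == ["p/one/"]

test_iterated_once_without_delimiter()
test_delimiter_forces_iteration()
print("ok")
```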

@eladkal eladkal changed the title Only perform list if delimiter specified GCSHook only perform list if delimiter specified Dec 8, 2023
Contributor

@eladkal eladkal left a comment


I agree with @shahar1 we need tests to avoid regression.

Can you please add unit tests?

Comment on lines 824 to 825

-                list(blobs)
+                if delimiter:
+                    list(blobs)
Copy link
Member


IMHO, there is a problem with the original PR (#34919) that added this statement: we create the generators to avoid loading all the data into memory, but by applying list() to the blobs generator you force the generator to loop over all the pages and load them into memory.

Instead, the solution should be based on a loop over the generator elements, similar to what we do here:

ids.extend(
    blob.name
    for blob in blobs
    if timespan_start <= blob.updated.replace(tzinfo=timezone.utc) < timespan_end
)

Contributor Author


When a delimiter is used, the iterator must be started so that blobs.prefixes is populated, which is how the blob names are then captured.

Whereas, when match_glob is used, blobs.prefixes is never populated and instead you must iterate over blobs and access each blob.name.

I'm not sure why this is the case. The original version of the code started the iterator by using a for loop to iterate over blobs, whether or not a delimiter had been passed in.

blob_names = []
for blob in blobs:
    blob_names.append(blob.name)
prefixes = blobs.prefixes
if prefixes:
    ids += list(prefixes)
else:
    ids += blob_names

When delimiter is used, if you modify the code to be the following:

for blob in blobs:
    print("here")
    blob_names.append(blob.name)

Nothing is printed. Simply starting the iterator populates blobs.prefixes.

The code has been modified so that rather than using list(), it uses next(). I don't think it will make any difference, since blobs appears to be an empty list.

The code has been tested against a bucket with over 100k files; the same number of files is returned whether delimiter or match_glob is passed in.
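As a side note on the list( to next( change, a minimal stand-alone sketch of the difference between the two calls on a plain generator:

```python
# Contrast next(blobs, None) with list(blobs) on a plain generator:
# next() pulls at most one element to start the iterator, while list()
# drains everything into memory.
def pages():
    for i in range(5):
        yield i

g = pages()
first = next(g, None)    # starts the iterator, pulls a single element
remaining = list(g)      # drains the rest
print(first, remaining)  # 0 [1, 2, 3, 4]

# next(..., None) is also safe on an empty iterator: no StopIteration.
empty = next(iter([]), None)
print(empty)  # None
```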

You also make an interesting point in referencing the code in list_by_timespan. I don't believe that method would ever have filtered the files based on timespan when a delimiter has been passed in:

if blobs.prefixes:
    ids.extend(blobs.prefixes)
else:
    ids.extend(
        blob.name
        for blob in blobs
        if timespan_start <= blob.updated.replace(tzinfo=timezone.utc) < timespan_end
    )

It would simply add all the prefixes without doing any filtering.
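To illustrate, a stand-alone sketch of that branch (FakeBlob is a hypothetical stand-in for a storage blob, not part of the real client library) shows the prefixes path bypassing the timespan filter:

```python
from datetime import datetime, timezone

# FakeBlob is a hypothetical stand-in for a storage blob with an ``updated``
# timestamp; it is not part of the real client library.
class FakeBlob:
    def __init__(self, name, updated):
        self.name = name
        self.updated = updated

def collect(blobs, prefixes, timespan_start, timespan_end):
    # Same shape as the quoted list_by_timespan branch.
    ids = []
    if prefixes:
        ids.extend(prefixes)  # note: no timespan check on this path
    else:
        ids.extend(
            blob.name
            for blob in blobs
            if timespan_start <= blob.updated.replace(tzinfo=timezone.utc) < timespan_end
        )
    return ids

start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 2, tzinfo=timezone.utc)
old = FakeBlob("old.csv", datetime(2020, 1, 1))

# With prefixes (the delimiter case) everything is kept, unfiltered:
print(collect([old], {"a/2020/"}, start, end))  # ['a/2020/']

# Without prefixes, the out-of-range blob is filtered out:
print(collect([old], set(), start, end))  # []
```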

@atrbgithub
Contributor Author

atrbgithub commented Dec 11, 2023

@hussein-awala I agree; I think that would be better.

I found an issue where this would fail if a prefix which does not exist is passed in: it would throw the Iterator has already started exception. To fix that, and to iterate over the generator rather than loading all objects into memory, how does something like this look?

            else:
                blobs = bucket.list_blobs(
                    max_results=max_results,
                    page_token=page_token,
                    prefix=prefix,
                    delimiter=delimiter,
                    versions=versions,
                )
                if delimiter:
                    blobs_iter = next(blobs, None)

            if blobs.prefixes:
                ids.extend(blobs.prefixes)
            elif not delimiter:
                ids.extend(blob.name for blob in blobs)

            page_token = blobs.next_page_token
            if page_token is None:
                # empty next page token
                break
        return ids

I don't understand why the Google storage client behaves differently when a delimiter is used and results exist. The issue seems to stem from that.

The above snippet, however, is able to list objects when a delimiter is passed in, handles prefixes with no objects, and avoids using list(). The correct number of objects is also returned.

Using the above, however, it doesn't seem to page. Let's say we have:

GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/',
        delimiter='.csv',
        max_results=10
)

When running this on a bucket with thousands of files, everything is captured with ids.extend(blobs.prefixes).

However, if called with:

GCSHook().list(
        bucket_name='a-testbucket',
        prefix='a/prefix/in/the/bucket/',
        match_glob='**/*/*.csv',
        max_results=10
)

page_token is populated and we loop through the while True: loop here.

I'm not sure if this is expected behaviour of the GCS client. It's possible that this behaviour may change in the future, and I'm not sure how the above code would handle that.
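For reference, the paging pattern in question can be sketched stand-alone with a fake client (all names below are hypothetical stand-ins, not the real library):

```python
# FakePage stands in for one page of list_blobs results, carrying a
# next_page_token like the real iterator; fake_list_blobs stands in for
# the client call. Both are hypothetical, for illustration only.
class FakePage:
    def __init__(self, names, next_page_token):
        self._names = names
        self.next_page_token = next_page_token

    def __iter__(self):
        return iter(self._names)

def fake_list_blobs(pages, page_token=None):
    return pages[page_token or 0]

pages = {
    0: FakePage(["a.csv", "b.csv"], next_page_token=1),
    1: FakePage(["c.csv"], next_page_token=None),
}

ids = []
page_token = None
while True:
    blobs = fake_list_blobs(pages, page_token=page_token)
    ids.extend(name for name in blobs)
    page_token = blobs.next_page_token
    if page_token is None:
        # empty next page token
        break

print(ids)  # ['a.csv', 'b.csv', 'c.csv']
```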

@atrbgithub
Contributor Author

atrbgithub commented Dec 11, 2023

The old code, before this file was refactored, seems to work fine for both match_glob and delimiter; perhaps we should just revert to that?

blob_names = []
for blob in blobs:
    blob_names.append(blob.name)
prefixes = blobs.prefixes
if prefixes:
    ids += list(prefixes)
else:
    ids += blob_names

@atrbgithub atrbgithub force-pushed the fix-listing-no-delimiter branch 2 times, most recently from 70352af to 2f9d483 Compare December 11, 2023 19:00
When delimiter is passed into list_blobs, prefixes is not populated, loading is forced with next(blobs, None).

Do not attempt to iterate again over blobs if blobs.prefixes is empty and delimiter has been passed in.

This prevents the "iterator has already started" error message.
@atrbgithub
Contributor Author

I think we should close this in favour of #36202

@atrbgithub atrbgithub closed this Dec 13, 2023
Labels
area:providers, provider:google