[BUG] Failed to run BQ task when cache is enabled #2864
Comments
I'm not sure I have enough context on StructuredDatasets to completely understand this issue. It sounds like this type supports many different formats (e.g. parquet, csv), but the column definitions are not known at compile time, so currently we type-check only the format. The issue is that sometimes the format can be defined as empty (i.e. `""`)? Given how opinionated Flyte is about statically typing data, allowing an empty format to satisfy everything is only asking for problems downstream. Where does the default parquet format come from? Where exactly does the empty format come from?
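For concreteness, here is a minimal sketch of the two cases being asked about, assuming the behavior described later in this thread (the task names and dataframe contents are made up for illustration):

```python
import pandas as pd
from typing_extensions import Annotated
from flytekit import task
from flytekit.types.structured import StructuredDataset

@task
def produce_csv() -> Annotated[StructuredDataset, "csv"]:
    # The string annotation pins the storage format; the columns stay unchecked.
    return StructuredDataset(dataframe=pd.DataFrame({"a": [1, 2]}))

@task
def produce_unannotated() -> StructuredDataset:
    # No annotation: the declared type carries an empty format (""),
    # which is the value that later collides with the "parquet" default.
    return StructuredDataset(dataframe=pd.DataFrame({"a": [1, 2]}))
```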
Type validation happens at compile (register) time and at runtime.

```python
@task(cache=True, cache_version="2.0")
def t1() -> StructuredDataset:
    ...
    return StructuredDataset(dataframe=df, uri="bq://...")

@task(cache=True, cache_version="2.0")
def t2(sd: StructuredDataset) -> Annotated[StructuredDataset, "csv"]:
    return StructuredDataset(dataframe=df)

@workflow
def wf():
    t2(sd=t1())
```
OK, last two questions:
In this case it will obviously break in local execution. When executed on a cluster, how does this break? Will flytekit try to read/write the data and fail? If we just remove the runtime check on the StructuredDataset format, will this fail the same way?
Quick update: the following example:
fails with:
I'm still wondering whether it makes sense to remove the runtime check on the data type during the cache lookup and just let flytekit fail if there's an issue.
There are still some benefits to doing a runtime check on the data type during the cache lookup. For example, if the error happens on the propeller side, we don't need to spend additional time and resources running the task.
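As a rough illustration of that trade-off (a self-contained sketch with hypothetical names, not the actual propeller/datacatalog code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedArtifact:
    uri: str
    format: str  # "" when the producing task's output type was unannotated

def check_cache_hit(artifact: Optional[CachedArtifact], expected_format: str) -> Optional[CachedArtifact]:
    # Validating at lookup time fails fast, before any cluster time is
    # spent running the task or its downstream consumers.
    if artifact is None:
        return None  # cache miss: the task simply runs
    if artifact.format != expected_format:  # strict equality is what breaks here
        raise TypeError(f"cached format {artifact.format!r} != expected {expected_format!r}")
    return artifact
```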
I think you're probably right.

```python
@task(cache=True, cache_version="2.0")
def t1() -> Annotated[StructuredDataset, "csv"]:
    ...
    return StructuredDataset(dataframe=df)  # here we already have "format" and "uri" in StructuredDatasetMetadata

@task(cache=True, cache_version="2.0")
def t2(sd: StructuredDataset):
    ...

@workflow
def wf():
    t2(sd=t1())
```

In the above example, it fails at compile time because the format in the SD doesn't match. As a result, I think we can remove this line, i.e. just remove the check on the format.
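In other words, the proposal amounts to something like the following relaxation (a sketch with made-up type and field names, not the actual line being referenced): keep the schema comparison and drop the format equality check.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SDType:
    columns: Tuple[str, ...]
    format: str

def types_match(actual: SDType, expected: SDType) -> bool:
    if actual.columns != expected.columns:  # schema is still validated
        return False
    # The format check below is what the fix would remove; with it gone,
    # a "" output no longer fails against the "parquet" default:
    # if actual.format != expected.format:
    #     return False
    return True
```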
Any update on the fix for this issue?
Describe the bug
Slack Thread
Failed to run BQ task when the cache is enabled because type validation is failing.
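A minimal setup along the lines discussed in the comments above (the BigQuery URI is the same placeholder used there; the failure shows up on the second, cached run):

```python
import pandas as pd
from flytekit import task, workflow
from flytekit.types.structured import StructuredDataset

@task(cache=True, cache_version="2.0")
def write_bq() -> StructuredDataset:
    # Unannotated return type, so the output's format is recorded as ""
    return StructuredDataset(dataframe=pd.DataFrame({"a": [1, 2]}), uri="bq://...")

@workflow
def wf() -> StructuredDataset:
    return write_bq()
```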
When the cache is enabled, we'll retrieve artifacts from datacatalog and check whether the structured dataset's schema and format match the expected type. However, the default format of the structured dataset in the expected type is always `Parquet`, while the format of the output structured dataset is `""`. There are two ways to fix it. In the relevant check, `structuredDatasetType` is the input type and `t.literalType.GetStructuredDatasetType()` is the expected type.

Expected behavior
BQ task should run successfully even if the cache is enabled
Additional context to reproduce
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?