[FEAT]: Throw error for invalid ** usage outside folder segments (e.g. /tmp/**.csv) #3100
Some things I'm not too sure about:
CodSpeed Performance Report: Merging #3100 will not alter performance.
8f0db00 to e46264a
05113af to 3d0a066
Great work!
Regarding tests, we don't need to weave this into our integration tests for every single object source. It's totally fine to have verify_glob unit-tested rigorously in Rust in the object_store_glob.rs module, and then have one (non-integration) test that hits the local disk to verify the user experience from Python.
src/daft-io/src/object_store_glob.rs
Outdated
if re.is_match(segment) && segment != "**" {
    return Err(super::Error::InvalidArgument {
        msg: format!(
            "Invalid usage of '**' in glob pattern. The '**' wildcard must occupy an entire path segment and be surrounded by '{}' characters. Found invalid usage in '{}'.",
Could we be more helpful with the error message as well? Would love to add a suggestion here for the user to use this glob path instead: {re_group_1}/**/*{re_group_2}/{re_group_3}.
Good point. I've slightly rewritten the regex to process the full path instead of splitting it into delimiter-separated segments.
This should allow us to suggest a corrected glob path to the user as well.
I've rewritten it to give suggestions in this manner; is this the behaviour you expect?
- Original: invalid/blahblah**.txt → Corrected: invalid/blahblah/**/*.txt
- Original: invalid/\***.txt → Corrected: invalid/\*/**/*.txt
- Original: invalid/\**blahblah**.txt → Corrected: invalid/\**blahblah/**/*.txt
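The three suggestions above can be reproduced with a small standard-library-only sketch. This is not the PR's implementation (which uses a regex over the full path); `suggest_glob_fix` and `find_double_star` are hypothetical names, and the handling of edge cases not listed above (for example a `**` with nothing after it) is an assumption.

```rust
/// Hypothetical helper: returns Ok(()) when every `**` occupies a whole path
/// segment, otherwise Err(suggested_pattern).
/// Assumption: `/` is the only segment delimiter and `\` escapes the next character.
fn suggest_glob_fix(glob: &str) -> Result<(), String> {
    let mut fixed: Vec<String> = Vec::new();
    let mut invalid = false;
    for segment in glob.split('/') {
        match find_double_star(segment) {
            // A run of unescaped `*` that shares its segment with other text.
            Some((start, end)) if segment != "**" => {
                invalid = true;
                let prefix = &segment[..start];
                let suffix = &segment[end..];
                if prefix.is_empty() {
                    fixed.push(format!("**/*{}", suffix));
                } else {
                    fixed.push(format!("{}/**/*{}", prefix, suffix));
                }
            }
            // Either no `**` at all, or the segment is exactly `**`.
            _ => fixed.push(segment.to_string()),
        }
    }
    if invalid {
        Err(fixed.join("/"))
    } else {
        Ok(())
    }
}

/// Byte range of the first run of two or more unescaped `*` in a segment.
fn find_double_star(segment: &str) -> Option<(usize, usize)> {
    let bytes = segment.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // Skip the backslash and the character it escapes.
            b'\\' => i += 2,
            b'*' => {
                let start = i;
                while i < bytes.len() && bytes[i] == b'*' {
                    i += 1;
                }
                if i - start >= 2 {
                    return Some((start, i));
                }
            }
            _ => i += 1,
        }
    }
    None
}
```

Under these assumptions, an escaped `\*` followed by a single `*` is not treated as a double star, which is why the third example keeps its `\**blahblah` prefix intact.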
src/daft-io/src/object_store_glob.rs
Outdated
@@ -404,6 +404,29 @@ pub async fn glob(
    };
    let glob = glob.as_str();

    // We need to do some validation on the glob pattern before compiling it, since the globset crate is very permissive
    // and will happily compile patterns that don't make sense without throwing an error.
    fn verify_glob(glob: &str) -> super::Result<()> {
Nit: we can move this out of the function, and just run Rust unit-tests!
Ah, makes a lot of sense.
I wasn't familiar with how Rust unit tests work; I have moved the testing logic for the verify_glob function to object_store_glob.rs.
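For readers unfamiliar with the layout being discussed, here is a minimal sketch of a module-level function with an inline unit-test module in the same file. The body of `verify_glob` below is a simplified stand-in: it returns `Result<(), String>` rather than the crate's own error type, ignores escapes such as `\*`, and is not the code merged in this PR.

```rust
/// Simplified stand-in for the PR's `verify_glob`: rejects any path segment
/// that contains `**` together with other characters.
/// Assumption: escape sequences are ignored here for brevity.
fn verify_glob(glob: &str) -> Result<(), String> {
    for segment in glob.split('/') {
        if segment.contains("**") && segment != "**" {
            return Err(format!(
                "Invalid usage of '**' in glob pattern segment '{}'",
                segment
            ));
        }
    }
    Ok(())
}

// Compiled only under `cargo test`, so the tests can live next to the function.
#[cfg(test)]
mod tests {
    use super::verify_glob;

    #[test]
    fn accepts_double_star_as_whole_segment() {
        assert!(verify_glob("/tmp/**/*.csv").is_ok());
    }

    #[test]
    fn rejects_double_star_inside_segment() {
        assert!(verify_glob("/tmp/**.csv").is_err());
    }
}
```

Because the function sits at module level rather than nested inside `glob`, the test module can reach it through `super::` without any I/O or object-store setup.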
Hey @jaychia, thanks for the comments. I've made some changes accordingly:
Hey @jaychia, sorry to ping! Do these changes look good to you?
Nice!
Awesome, thanks for the contribution! 🚀 🚀 🚀
…. /tmp/**.csv) (Eventual-Inc#3100) Closes Eventual-Inc#1820. The main issue seems to be that the `globset` crate is permissive about the patterns it builds (no error is thrown when we try to build a pattern for `/tmp/**.csv`, for instance), so we have to check for such patterns ourselves.
Closes #1820.
The main issue seems to be that the globset crate is permissive about the patterns it builds: no error is thrown when we try to build a pattern for /tmp/**.csv, for instance, so we have to check for such patterns ourselves.