Fix Orca read file path #2525
Conversation
hkvision commented Jul 2, 2020 (edited)
- Fix analytics-zoo#750: Orca read file would sometimes report that the path does not exist if file:/// is not specified for a local file. Do not use the Scala Hadoop FS API to list local file paths.
- Do not filter files by suffix, as neither Spark nor pandas does this. We simply pass all the files to Spark or pandas and let them handle them or raise an error. For pandas, an error is raised at read time if the file format is wrong. Spark seems able to read anything, but when you run a transformation on the resulting DataFrame, an error is raised if the file format is not as expected. A sketch of the listing logic appears after this list.
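For illustration, here is a minimal sketch of how local paths could be listed for the pandas backend without the Scala Hadoop FS API and without suffix filtering. The helper name `list_local_paths` is hypothetical, not the exact code in this PR:

```python
import os
from os.path import abspath, isdir, isfile, join

def list_local_paths(file_path):
    # Hypothetical helper: expand a local file or directory into absolute
    # file paths without filtering by suffix; pandas or Spark will raise
    # later if a file has an unexpected format.
    if isdir(file_path):
        return [abspath(join(file_path, f))
                for f in sorted(os.listdir(file_path))
                if isfile(join(file_path, f))]
    if isfile(file_path):
        return [abspath(file_path)]
    raise Exception("The file path is invalid/empty")
```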
LGTM
pyzoo/zoo/orca/data/utils.py
Outdated
file_paths = [abspath(join(file_path, file)) for file in listdir(file_path)]
# Only get json/csv files.
file_paths = [file for file in file_paths
              if os.path.isfile(file) and os.path.splitext(file)[1] == "." + file_type]
This would not work for the mastercard case. The input there is a single *.dat file, so my original logic was not to check the suffix when the path is a file. For a directory, I filter by suffix to remove unrelated files. Actually, pandas does not check the file suffix. We could change the logic to not filter, and instead skip the pandas read result if pandas returns null or throws an exception, but this may leave some partitions empty.
But your previous logic was wrong: it only covers mastercard's single-file case, not the general one. What if a user has a directory of .dat files? Then you still filter all of them out and get nothing.
Since the function name is read_csv, it is reasonable to expect all the input files to be .csv. We can add read_dat if necessary. What do you think? @jason-dai
I understand your concern. My original design assumed the file suffix is csv, but that doesn't apply to the mastercard case, so I made this workaround. Pandas has no constraint on file names. Maybe we should check what Spark does. I don't think read_dat is necessary, as a .dat file can store different formats.
What's the format of the .dat file?
It would throw an exception. Shall we remove the filename filter and throw an exception if pandas fails?
OK. I checked the mastercard code; it uses the ml-1m dataset, which consists of .dat files.
It seems pd.read_csv only supports reading a single file: if I pass a directory, it raises an exception directly, and if it can't parse the file, it also raises an exception.
But since we support multiple files, simply throwing an exception when one file fails may not be reasonable...
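Both behaviors are easy to reproduce (the directory path below is illustrative):

```python
import pandas as pd

# Passing a directory raises immediately (e.g. IsADirectoryError on Linux).
try:
    pd.read_csv("/tmp")  # illustrative directory path
except Exception as e:
    print(type(e).__name__)

# An unparseable file likewise raises at read time
# (e.g. ParserError or UnicodeDecodeError).
```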
Or maybe we can do something like ImageProcessing and provide a flag to ignore read errors. If the user chooses to ignore them, we only give a warning message; if not, we throw an exception. It's possible that a user won't want to continue if some of the data is incorrect, so offering the option may satisfy different requirements. A rough sketch of this idea follows.
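As a rough illustration only (the flag name `ignore_errors` and the helper `read_one_file` are hypothetical, not part of this PR):

```python
import warnings
import pandas as pd

def read_one_file(path, ignore_errors=False, **kwargs):
    # Hypothetical helper: wrap pd.read_csv so a bad file either raises
    # or is skipped with a warning, depending on the flag.
    try:
        return pd.read_csv(path, **kwargs)
    except Exception as e:
        if ignore_errors:
            warnings.warn("Skipping %s: %s" % (path, e))
            return None  # caller drops None results; may leave a partition empty
        raise
```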
But as you mentioned before, how do we handle possible empty partitions? Do we still need to add a method to remove empty partitions?
For the functionality, I think our code may work even if there are empty partitions. I checked the SparkXShards operations: only partition_by() and to_ray() use rdd.mapPartitions(); the others use rdd.map(), which works even when there are empty partitions. partition_by() first calls rdd.partitionBy(), which redistributes the elements almost evenly across partitions so the empty partitions are removed, and then calls rdd.mapPartitions(). For to_ray(), I think we would often call xshards.repartition() before to_ray(); repartition re-dispatches the elements too, so to_ray() may not store empty partition data to plasma. For performance there may be some influence; we could repartition right after the read and then do the other operations. The sketch below illustrates how repartitioning removes empty partitions.
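A small PySpark sketch of the behavior described above, assuming a local Spark context (not code from this PR):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Four slices but only two elements, so two partitions start out empty.
rdd = sc.parallelize([1, 2], numSlices=4)
print(rdd.glom().collect())                  # e.g. [[], [1], [], [2]]

# map() is unaffected by empty partitions.
print(rdd.map(lambda x: x * 10).collect())   # [10, 20]

# Repartitioning shuffles elements across partitions, which in practice
# removes the empty ones (the exact layout may vary).
print(rdd.repartition(2).glom().collect())
```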
I tested that.
@jenniew Take a look. Removed the file filtering logic; we let pandas or Spark handle and report the error. Since spark.read.csv/json already accepts multiple files or a folder, the file extraction logic only applies to the pandas backend.
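For reference, spark.read.csv accepts either a folder or a list of paths, which is why no extraction is needed for the Spark backend (the paths below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A folder: Spark reads every file inside it.
df_dir = spark.read.csv("/data/records/", header=True)

# Multiple explicit files: csv() also accepts a list of paths.
df_files = spark.read.csv(["/data/a.csv", "/data/b.csv"], header=True)
```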
if not file_paths:
    raise Exception("The file path is invalid/empty or does not include csv/json files")
raise Exception("The file path is invalid/empty). As we don't filter csv/json files, we cannot tell if the folder don't include csv/json files.
@@ -40,7 +40,7 @@ def test_read_local_csv(self):
         file_path = os.path.join(self.resource_path, "abc")
         with self.assertRaises(Exception) as context:
             xshards = zoo.orca.data.pandas.read_csv(file_path)
-        self.assertTrue('The file path is invalid/empty' in str(context.exception))
+        self.assertTrue('No such file or directory' in str(context.exception))
Can you add a negative test where the file/folder is/contains an invalid csv/json file?
Added. Take a look.
LGTM.
Jenkins passed: http://10.239.47.210:18888/job/ZOO-PR-Validation/3786/
Thanks for your review @qiyuangong @jenniew
* resolve conflict * update * fix * meet review