Fix Orca read file path #2525

Merged: 4 commits into intel-analytics:master from fix-local-path on Jul 16, 2020
Conversation

@hkvision (Contributor) commented Jul 2, 2020:

  • Fix "Orca read file sometimes reports path not exist if file:/// is not specified for local files" (analytics-zoo#750): do not use the Scala Hadoop FS API to list local file paths (see the sketch below).
  • Do not filter files by suffix, since neither Spark nor pandas does this. We simply pass all the files to Spark or pandas and let them handle it or raise an error. For pandas, an error is raised at read time if the file format is wrong. Spark seems able to read anything, but when you transform the resulting DataFrame, an error is raised if the file format is not as expected.
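
A minimal sketch of the local-path listing described in the first bullet (the function name and the file:// stripping are illustrative assumptions, not the actual Orca code):

    import os
    from os.path import abspath, isfile, join

    def list_local_files(file_path):
        # Hypothetical helper: strip an optional file:// scheme so plain
        # local paths work without the prefix.
        if file_path.startswith("file://"):
            file_path = file_path[len("file://"):]
        if isfile(file_path):
            return [abspath(file_path)]
        # Pass every regular file through with no suffix filter; pandas or
        # Spark will raise later if a file cannot be parsed.
        return [abspath(join(file_path, f)) for f in os.listdir(file_path)
                if isfile(join(file_path, f))]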

@hkvision hkvision requested review from jason-dai and jenniew July 2, 2020 06:57
@hkvision hkvision added the orca label Jul 2, 2020
@qiyuangong (Contributor) left a comment:

LGTM

import os
from os import listdir
from os.path import abspath, join

file_paths = [abspath(join(file_path, file)) for file in listdir(file_path)]
# Only keep json/csv files.
file_paths = [file for file in file_paths
              if os.path.isfile(file) and os.path.splitext(file)[1] == "." + file_type]
@jenniew (Contributor) commented Jul 2, 2020:

This would not work for the mastercard case. The input is a single *.dat file, so my original logic was not to check the suffix if the path is a file; for a directory, I filter by suffix to remove unrelated files. Actually, pandas does not check the file suffix. We could change the logic to not filter, but skip the pandas read result if pandas returns null or throws an exception; this may leave some partitions empty, though.

@hkvision (Contributor, Author) replied:

But your previous logic was wrong: you were only covering mastercard's single-file case, not the general one. What if a user has a directory of .dat files? Then you still filter all of them out and get nothing.

@hkvision (Contributor, Author) replied:

Since the function name is read_csv, it is reasonable to expect all input files to be .csv. We can add read_dat if necessary. What do you think? @jason-dai

Contributor replied:

I know your concern. My original design assumed the file suffix is csv, but that cannot apply to the mastercard case, so I made this workaround. Pandas has no constraint on file names. Maybe we should check what Spark does. I don't think read_dat is necessary, as .dat files can store different formats.

Contributor replied:

What's the format of the .dat file?

Contributor replied:

It would throw an exception. Should we remove the filename filter and throw an exception if pandas fails?

@hkvision (Contributor, Author) replied:

OK. I checked the mastercard code; it uses the ml-1m dataset, which consists of .dat files.
It seems pd.read_csv can only read one file: if I pass a directory, it raises an exception directly, and if it can't parse the file, it also raises an exception.
But since we support multiple files, simply throwing an exception when one file fails may not be reasonable...
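
For reference, a minimal sketch of what multi-file reading takes with pandas, since pd.read_csv only accepts a single path (the directory path here is illustrative):

    import glob
    import pandas as pd

    # Expand the directory ourselves, read each file, then concatenate;
    # pd.read_csv raises as soon as one file cannot be parsed.
    paths = sorted(glob.glob("/path/to/data/*"))
    dfs = [pd.read_csv(p) for p in paths]
    df = pd.concat(dfs, ignore_index=True)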

Contributor replied:

Or maybe we can do something like ImageProcessing: provide a flag to ignore read errors. If the user chooses to ignore, we only give a warning message; if not, we throw an exception. It's possible that a user won't want to continue if some data is incorrect, so giving the option may satisfy different requirements.
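
A minimal sketch of the proposed flag (the ignore_errors parameter and the helper are hypothetical, not an existing Orca API):

    import warnings
    import pandas as pd

    def read_one(path, ignore_errors=False):
        # Hypothetical helper: on a read failure, either warn and skip the
        # file (return None) or propagate the exception, per the flag.
        try:
            return pd.read_csv(path)
        except Exception as e:
            if ignore_errors:
                warnings.warn("Skipping %s: %s" % (path, e))
                return None
            raise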

@hkvision (Contributor, Author) replied:

But as you mentioned before, how do we handle possible empty partitions? Do we still need to add a method to remove them?

@jenniew (Contributor) commented Jul 6, 2020:

For functionality, I think our code may work even with empty partitions. I checked the SparkXShards operations: only partition_by() and to_ray() use rdd.mapPartitions(); the others use rdd.map(), which works even with empty partitions. partition_by() first calls rdd.partitionBy(), which rearranges the elements almost evenly across partitions, so empty partitions are removed before it calls rdd.mapPartitions(). For to_ray(), I think we would often call xshards.repartition() before to_ray(); repartition re-dispatches elements too, so to_ray() may not store empty partition data to Plasma. For performance there may be some influence; we could do the repartition after the read and then the other operations.
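
A minimal sketch of the suggested pattern, repartitioning right after the read so elements are redistributed before mapPartitions-based operations (the path and partition count are illustrative):

    import zoo.orca.data.pandas

    xshards = zoo.orca.data.pandas.read_csv("/path/to/data")
    # repartition re-dispatches elements across partitions, smoothing out
    # empty partitions before operations such as to_ray().
    xshards = xshards.repartition(8)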

@hkvision (Contributor, Author) commented Jul 9, 2020:

I tested that spark.read.csv can accept a folder and will read all the files under it. Surprisingly, it can read any file, including images, but the result may be wrong or messy.
My opinion is to filter for csv files only, to play it safe, since neither pandas nor Spark guarantees a correct result for arbitrary file types. @jason-dai @jenniew
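
What was tested can be reproduced with something like the following (the folder path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # spark.read.csv accepts a directory and reads every file under it,
    # even non-CSV files; bad content shows up as messy rows or only
    # fails later when the DataFrame is transformed.
    df = spark.read.csv("/path/to/folder")
    df.show()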

@hkvision (Contributor, Author) commented:

@jenniew Take a look. I removed the file-filtering logic and let pandas or Spark handle and report the error. Since spark.read.csv/json already accepts multiple files or a folder, the file-listing logic now only applies to the pandas backend.

if not file_paths:
    raise Exception("The file path is invalid/empty or does not include csv/json files")
@jenniew (Contributor) commented Jul 14, 2020:

raise Exception("The file path is invalid/empty"). As we don't filter csv/json files, we cannot tell whether the folder includes csv/json files.

@@ -40,7 +40,7 @@ def test_read_local_csv(self):
         file_path = os.path.join(self.resource_path, "abc")
         with self.assertRaises(Exception) as context:
             xshards = zoo.orca.data.pandas.read_csv(file_path)
-        self.assertTrue('The file path is invalid/empty' in str(context.exception))
+        self.assertTrue('No such file or directory' in str(context.exception))
Contributor replied:

Can you add a negative test where the file/folder is, or contains, an invalid csv/json file?
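
A hedged sketch of what such a negative test might look like, mirroring the existing test above (the "invalid" fixture folder is hypothetical):

    def test_read_invalid_csv(self):
        # Hypothetical fixture: a folder containing a file pandas cannot
        # parse; with the suffix filter removed, the parse error surfaces.
        file_path = os.path.join(self.resource_path, "invalid")
        with self.assertRaises(Exception) as context:
            xshards = zoo.orca.data.pandas.read_csv(file_path)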

@hkvision (Contributor, Author) replied:

Added. Take a look.

@jenniew (Contributor) commented Jul 16, 2020:

LGTM.

@hkvision (Contributor, Author) commented:

Thanks for your review @qiyuangong @jenniew

@hkvision hkvision merged commit 35149b1 into intel-analytics:master Jul 16, 2020
@hkvision hkvision deleted the fix-local-path branch July 16, 2020 07:57
yangw1234 pushed commits to yangw1234/analytics-zoo that referenced this pull request on Sep 23, Sep 26, and Sep 27, 2021, and dding3 pushed a commit to dding3/analytics-zoo that referenced this pull request on Oct 4, 2021, each with the same message:

* resolve conflict

* update

* fix

* meet review
Successfully merging this pull request may close these issues.

Orca read file sometimes reports path not exist if file:/// is not specified for local files
4 participants