Fix Orca read file path #2525

Merged: 4 commits into intel-analytics:master from fix-local-path on Jul 16, 2020
Conversation

@hkvision (Contributor) commented Jul 2, 2020:

  • Fix "Orca read file sometimes reports path not exist if file:/// is not specified for local files" (analytics-zoo#750): do not use the Scala Hadoop FS API to list local file paths (see the sketch below).
  • Do not filter files by suffix, since neither Spark nor pandas does this. We simply pass all the files to Spark or pandas and let them handle it or raise an error. For pandas, an error is raised at read time if the file format is wrong. Spark seems able to read anything, but when you transform the resulting DataFrame, an error is raised if the file format is not as expected.
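
A minimal sketch of the local-path listing described in the first bullet (the function name and the file:// stripping are illustrative assumptions, not the actual Orca code):

    import os
    from os.path import abspath, isfile, join

    def list_local_files(file_path):
        # Hypothetical helper: strip an optional file:// scheme so plain
        # local paths work without the prefix.
        if file_path.startswith("file://"):
            file_path = file_path[len("file://"):]
        if isfile(file_path):
            return [abspath(file_path)]
        # Pass every regular file through with no suffix filter; pandas or
        # Spark will raise later if a file cannot be parsed.
        return [abspath(join(file_path, f)) for f in os.listdir(file_path)
                if isfile(join(file_path, f))]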

@hkvision hkvision requested review from jason-dai and jenniew July 2, 2020 06:57
@hkvision hkvision added the orca label Jul 2, 2020
@qiyuangong (Contributor) left a comment:

LGTM

import os
from os import listdir
from os.path import abspath, join

file_paths = [abspath(join(file_path, file)) for file in listdir(file_path)]
# Only keep json/csv files.
file_paths = [file for file in file_paths
              if os.path.isfile(file) and os.path.splitext(file)[1] == "." + file_type]
@jenniew (Contributor) commented Jul 2, 2020:

This would not work for the mastercard case. The input is a single *.dat file, so my original logic was not to check the suffix if the path is a file; for a directory, I filter by suffix to remove unrelated files. Actually, pandas does not check the file suffix. We could change the logic to not filter, but skip the pandas read result if pandas returns null or throws an exception; this may leave some partitions empty, though.

@hkvision (Contributor, Author) replied:

But your previous logic was wrong: you were only covering mastercard's single-file case, not the general one. What if a user has a directory of .dat files? Then you still filter all of them out and get nothing.

@hkvision (Contributor, Author) replied:

Since the function name is read_csv, it is reasonable to expect all input files to be .csv. We can add read_dat if necessary. What do you think? @jason-dai

Contributor replied:

I know your concern. My original design assumed the file suffix is csv, but that cannot apply to the mastercard case, so I made this workaround. Pandas has no constraint on file names. Maybe we should check what Spark does. I don't think read_dat is necessary, as .dat files can store different formats.

Contributor replied:

What's the format of the .dat file?

Contributor replied:

It would throw an exception. Should we remove the filename filter and throw an exception if pandas fails?

@hkvision (Contributor, Author) replied:

OK. I checked the mastercard code; it uses the ml-1m dataset, which consists of .dat files.
It seems pd.read_csv can only read one file: if I pass a directory, it raises an exception directly, and if it can't parse the file, it also raises an exception.
But since we support multiple files, simply throwing an exception when one file fails may not be reasonable...
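
For reference, a minimal sketch of what multi-file reading takes with pandas, since pd.read_csv only accepts a single path (the directory path here is illustrative):

    import glob
    import pandas as pd

    # Expand the directory ourselves, read each file, then concatenate;
    # pd.read_csv raises as soon as one file cannot be parsed.
    paths = sorted(glob.glob("/path/to/data/*"))
    dfs = [pd.read_csv(p) for p in paths]
    df = pd.concat(dfs, ignore_index=True)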

Contributor replied:

Or maybe we can do something like ImageProcessing: provide a flag to ignore read errors. If the user chooses to ignore, we only give a warning message; if not, we throw an exception. It's possible that a user won't want to continue if some data is incorrect, so giving the option may satisfy different requirements.
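
A minimal sketch of the proposed flag (the ignore_errors parameter and the helper are hypothetical, not an existing Orca API):

    import warnings
    import pandas as pd

    def read_one(path, ignore_errors=False):
        # Hypothetical helper: on a read failure, either warn and skip the
        # file (return None) or propagate the exception, per the flag.
        try:
            return pd.read_csv(path)
        except Exception as e:
            if ignore_errors:
                warnings.warn("Skipping %s: %s" % (path, e))
                return None
            raise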

@hkvision (Contributor, Author) replied:

But as you mentioned before, how do we handle possible empty partitions? Do we still need to add a method to remove them?

@jenniew (Contributor) commented Jul 6, 2020:

For functionality, I think our code may work even with empty partitions. I checked the SparkXShards operations: only partition_by() and to_ray() use rdd.mapPartitions(); the others use rdd.map(), which works even with empty partitions. partition_by() first calls rdd.partitionBy(), which rearranges the elements almost evenly across partitions, so empty partitions are removed before it calls rdd.mapPartitions(). For to_ray(), I think we would often call xshards.repartition() before to_ray(); repartition re-dispatches elements too, so to_ray() may not store empty partition data to Plasma. For performance there may be some influence; we could do the repartition after the read and then the other operations.
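
A minimal sketch of the suggested pattern, repartitioning right after the read so elements are redistributed before mapPartitions-based operations (the path and partition count are illustrative):

    import zoo.orca.data.pandas

    xshards = zoo.orca.data.pandas.read_csv("/path/to/data")
    # repartition re-dispatches elements across partitions, smoothing out
    # empty partitions before operations such as to_ray().
    xshards = xshards.repartition(8)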

@hkvision (Contributor, Author) commented Jul 9, 2020:

I tested that spark.read.csv can accept a folder and will read all the files under it. Surprisingly, it can read any file, including images, but the result may be wrong or messy.
My opinion is to filter for csv files only, to play it safe, since neither pandas nor Spark guarantees a correct result for arbitrary file types. @jason-dai @jenniew
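
What was tested can be reproduced with something like the following (the folder path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # spark.read.csv accepts a directory and reads every file under it,
    # even non-CSV files; bad content shows up as messy rows or only
    # fails later when the DataFrame is transformed.
    df = spark.read.csv("/path/to/folder")
    df.show()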

@hkvision (Contributor, Author) commented:

@jenniew Take a look. I removed the file-filtering logic and let pandas or Spark handle and report the error. Since spark.read.csv/json already accepts multiple files or a folder, the file-listing logic now only applies to the pandas backend.

if not file_paths:
    raise Exception("The file path is invalid/empty or does not include csv/json files")
@jenniew (Contributor) commented Jul 14, 2020:

raise Exception("The file path is invalid/empty"). As we don't filter csv/json files, we cannot tell whether the folder includes csv/json files.

@@ -40,7 +40,7 @@ def test_read_local_csv(self):
         file_path = os.path.join(self.resource_path, "abc")
         with self.assertRaises(Exception) as context:
             xshards = zoo.orca.data.pandas.read_csv(file_path)
-        self.assertTrue('The file path is invalid/empty' in str(context.exception))
+        self.assertTrue('No such file or directory' in str(context.exception))
Contributor replied:

Can you add a negative test where the file/folder is, or contains, an invalid csv/json file?
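
A hedged sketch of what such a negative test might look like, mirroring the existing test above (the "invalid" fixture folder is hypothetical):

    def test_read_invalid_csv(self):
        # Hypothetical fixture: a folder containing a file pandas cannot
        # parse; with the suffix filter removed, the parse error surfaces.
        file_path = os.path.join(self.resource_path, "invalid")
        with self.assertRaises(Exception) as context:
            xshards = zoo.orca.data.pandas.read_csv(file_path)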

@hkvision (Contributor, Author) replied:

Added. Take a look.

@jenniew (Contributor) commented Jul 16, 2020:

LGTM.

@hkvision (Contributor, Author) commented:

Thanks for your review @qiyuangong @jenniew

@hkvision hkvision merged commit 35149b1 into intel-analytics:master Jul 16, 2020
@hkvision hkvision deleted the fix-local-path branch July 16, 2020 07:57
yangw1234 pushed commits to yangw1234/analytics-zoo that referenced this pull request on Sep 23, Sep 26, and Sep 27, 2021, and dding3 pushed a commit to dding3/analytics-zoo that referenced this pull request on Oct 4, 2021, each with the same message:

* resolve conflict

* update

* fix

* meet review
Successfully merging this pull request may close these issues.

Orca read file sometimes reports path not exist if file:/// is not specified for local files
4 participants