
[onprem] HDFS Interface #719

Merged: 19 commits from on-prem-hdfs into dev/shishir/on-prem, Dec 30, 2022
Conversation

HaileyJang (Contributor)
Continued from previous PR

```python
def region_tag(self) -> str:
    return ""

def bucket(self) -> str:
```
HaileyJang (Contributor, Author) commented:

In the previous meeting we said that buckets can be thought of as directories. However, given that we are stateless with respect to files, would it be acceptable to keep track of the directory entered when the connection is first initiated?
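A minimal sketch of the idea being proposed (the class and method names here are hypothetical illustrations, not the PR's actual implementation): the interface remembers the directory supplied at connection time, treats it as the "bucket", and resolves object keys relative to it.

```python
import posixpath


class HDFSInterfaceSketch:
    """Hypothetical sketch: treat the directory entered at connection
    time as the 'bucket' and resolve object keys relative to it."""

    def __init__(self, base_dir: str):
        # Remember the directory the connection was initiated with.
        self.base_dir = base_dir.rstrip("/")

    def bucket(self) -> str:
        # The 'bucket' is simply the tracked base directory.
        return self.base_dir

    def full_path(self, key: str) -> str:
        # Resolve an object key relative to the bucket directory.
        return posixpath.join(self.base_dir, key.lstrip("/"))


iface = HDFSInterfaceSketch("/data/skyplane/")
print(iface.bucket())              # /data/skyplane
print(iface.full_path("a/b.bin"))  # /data/skyplane/a/b.bin
```

This keeps the interface itself stateless with respect to files: only the base directory is stored, and every key is resolved on demand.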

```python
def delete_bucket(self):
    return None

def bucket_exists(self) -> bool:
```
HaileyJang (Contributor, Author) commented:

Where would we need this function in the scope of HDFS? If we consider the "bucket" to be the directory the connection enters, then a nonexistent bucket would mean the connection could not have been made in the first place.

gilv (Collaborator) commented Dec 25, 2022:

@HaileyJang how will you obtain parallelism when reading data from HDFS?

ShishirPatil (Member) replied:

> @HaileyJang how will you obtain parallelism when reading data from HDFS?

Hey @gilv, for now parallelism is restricted to reading multiple files concurrently. Obviously this isn't very smart, since we are bottlenecked by the NameNode. Once we have an end-to-end system and a good understanding of how this impacts performance, we plan to revisit it. If you have thoughts, happy to hear them!

@ShishirPatil ShishirPatil merged commit ee4fa78 into dev/shishir/on-prem Dec 30, 2022
@ShishirPatil ShishirPatil deleted the on-prem-hdfs branch December 30, 2022 01:12
gilv (Collaborator) commented Dec 30, 2022:

@HaileyJang @ShishirPatil HDFS can deal with many concurrent reads perfectly well, and the NameNode is never an issue, as everything is cached in memory. I would suggest extending hadoop-distcp. That way you get full parallelism over the HDFS cluster and can then use your techniques to copy data efficiently.
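For context on the suggestion: DistCp runs a MapReduce job whose map tasks copy files in parallel across the cluster. A typical invocation (the NameNode addresses and paths below are placeholders) looks like:

```shell
# Copy /data/src between clusters using up to 20 parallel map tasks.
# nn1/nn2 and the paths are placeholders for real NameNode addresses.
hadoop distcp -m 20 hdfs://nn1:8020/data/src hdfs://nn2:8020/data/dst
```

The `-m` flag caps the number of simultaneous copy tasks, which is how DistCp exposes its parallelism knob.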
