[onprem] HDFS Interface #719
Conversation
```python
def region_tag(self) -> str:
    return ""

def bucket(self) -> str:
```
In the previous meeting, we said that we can think of buckets as directories. However, given that we are stateless with respect to files, would it be okay to keep track of the directory the user entered when they initially initiate the connection?
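A minimal sketch of the "bucket as entered directory" idea, assuming a pyarrow `HadoopFileSystem` client; the class name `HDFSInterface` and the `base_path` attribute are illustrative, not the actual PR code:

```python
from pyarrow import fs

class HDFSInterface:
    def __init__(self, host: str, port: int = 8020, base_path: str = "/"):
        # Remember the directory entered when the connection is initiated,
        # so later calls can treat it as the "bucket".
        self.base_path = base_path
        self.hdfs = fs.HadoopFileSystem(host, port)

    def bucket(self) -> str:
        # The tracked directory stands in for a bucket name.
        return self.base_path
```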
```python
def delete_bucket(self):
    return None

def bucket_exists(self) -> bool:
```
Where would we need this function in the scope of HDFS? If we treat the "bucket" as the directory the connection enters on initialization, then a nonexistent directory would cause the connection to fail from the start, so there would be nothing left for `bucket_exists` to check.
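If the method is kept anyway (e.g. for interface parity with object stores), one hedged sketch, reusing the hypothetical pyarrow client from above, is a plain directory-existence check:

```python
from pyarrow import fs

def bucket_exists(hdfs: fs.HadoopFileSystem, base_path: str) -> bool:
    # Treat "bucket exists" as "the entered directory exists".
    info = hdfs.get_file_info(base_path)
    return info.type == fs.FileType.Directory
```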
@HaileyJang how will you obtain parallelism when reading data from HDFS?
Hey @gilv, for now the parallelism is restricted to reading multiple files concurrently. Obviously this isn't very smart, as we are bottlenecked by the NameNode. Once we have an end-to-end system and a good understanding of how this impacts performance, we plan to revisit it. If you have thoughts, happy to hear them!
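A minimal sketch of what "reading multiple files concurrently" could look like with a thread pool, again assuming the hypothetical pyarrow client rather than the actual PR code:

```python
from concurrent.futures import ThreadPoolExecutor
from pyarrow import fs

def read_files(hdfs: fs.HadoopFileSystem, paths: list[str]) -> list[bytes]:
    def read_one(path: str) -> bytes:
        # Each worker opens its own stream; metadata lookups still all go
        # through the single NameNode, which is the bottleneck noted above.
        with hdfs.open_input_stream(path) as f:
            return f.read()

    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(read_one, paths))
```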
@HaileyJang @ShishirPatil HDFS can deal with many reads perfectly well, and the NameNode is never an issue, as its metadata is all cached in memory. I would suggest you extend hadoop-distcp. This way you get proper parallelism across the HDFS cluster, and can then use your techniques to copy data efficiently.
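For context, one way to drive distcp from Python is to shell out to the standard CLI; the namenode addresses below are placeholders. distcp runs as a MapReduce job, so the copy is parallelized across the cluster rather than through a single client:

```python
import subprocess

def distcp(src: str, dst: str, num_maps: int = 20) -> None:
    # `hadoop distcp -m <maps> <src> <dst>` is the standard invocation;
    # -m controls the number of parallel map tasks doing the copy.
    subprocess.run(
        ["hadoop", "distcp", "-m", str(num_maps), src, dst],
        check=True,
    )

# Example with placeholder namenode addresses:
# distcp("hdfs://nn1:8020/data", "hdfs://nn2:8020/data")
```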
Continued from previous PR