
[onprem] HDFS Interface #719

Merged: 19 commits from on-prem-hdfs into dev/shishir/on-prem, Dec 30, 2022
Conversation

HaileyJang (Contributor)
Continued from previous PR

```python
def region_tag(self) -> str:
    return ""

def bucket(self) -> str:
```
HaileyJang (Contributor, Author) commented:

In the previous meeting we said that buckets can be thought of as directories. However, given that we are stateless with respect to files, would it be acceptable to keep track of the directory entered when the connection is first initiated?
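A minimal sketch of the idea being proposed (the class and method names here are hypothetical illustrations, not the PR's actual implementation): the interface remembers the directory supplied at connection time, treats it as the "bucket", and resolves object keys relative to it.

```python
import posixpath


class HDFSInterfaceSketch:
    """Hypothetical sketch: treat the directory entered at connection
    time as the 'bucket' and resolve object keys relative to it."""

    def __init__(self, base_dir: str):
        # Remember the directory the connection was initiated with.
        self.base_dir = base_dir.rstrip("/")

    def bucket(self) -> str:
        # The 'bucket' is simply the tracked base directory.
        return self.base_dir

    def full_path(self, key: str) -> str:
        # Resolve an object key relative to the bucket directory.
        return posixpath.join(self.base_dir, key.lstrip("/"))


iface = HDFSInterfaceSketch("/data/skyplane/")
print(iface.bucket())              # /data/skyplane
print(iface.full_path("a/b.bin"))  # /data/skyplane/a/b.bin
```

This keeps the interface itself stateless with respect to files: only the base directory is stored, and every key is resolved on demand.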

```python
def delete_bucket(self):
    return None

def bucket_exists(self) -> bool:
```
HaileyJang (Contributor, Author) commented:

Where would we need this function in the scope of HDFS? If we consider the "bucket" to be the directory the connection enters, then a nonexistent bucket would mean the connection could not have been made in the first place.

gilv (Collaborator) commented Dec 25, 2022:

@HaileyJang how will you obtain parallelism when reading data from HDFS?

ShishirPatil (Member) replied:

> @HaileyJang how will you obtain parallelism when reading data from HDFS?

Hey @gilv, for now parallelism is restricted to reading multiple files concurrently. Obviously this isn't very smart, since we are bottlenecked by the NameNode. Once we have an end-to-end system and a good understanding of how this impacts performance, we plan to revisit it. If you have thoughts, happy to hear them!

@ShishirPatil ShishirPatil merged commit ee4fa78 into dev/shishir/on-prem Dec 30, 2022
@ShishirPatil ShishirPatil deleted the on-prem-hdfs branch December 30, 2022 01:12
gilv (Collaborator) commented Dec 30, 2022:

@HaileyJang @ShishirPatil HDFS can deal with many concurrent reads perfectly well, and the NameNode is never an issue, as everything is cached in memory. I would suggest extending hadoop-distcp. That way you get full parallelism over the HDFS cluster and can then use your techniques to copy data efficiently.
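For context on the suggestion: DistCp runs a MapReduce job whose map tasks copy files in parallel across the cluster. A typical invocation (the NameNode addresses and paths below are placeholders) looks like:

```shell
# Copy /data/src between clusters using up to 20 parallel map tasks.
# nn1/nn2 and the paths are placeholders for real NameNode addresses.
hadoop distcp -m 20 hdfs://nn1:8020/data/src hdfs://nn2:8020/data/dst
```

The `-m` flag caps the number of simultaneous copy tasks, which is how DistCp exposes its parallelism knob.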
