Investigate Data Lake integration #134
Comments
Isaac: agreed. To frame it differently, having a 'real big data' example would be great. What do you think would be the best way to approach that? Perhaps work through an example?
That's a good question :-) I'm speaking to the Data Lake guys to see what the story is for plugging third-party components onto the Data Lake store (MBrace has CloudFlow, so no need for the U-SQL side of things). I do think investigating the HDFS side of things is worth spending some time on, though. cc: @palladin @dsyme @eiriktsarpalis
Totally agree on the HDFS side. Having guidance / a story on how to work against 'stuff in HDFS' would be awesome.
What kind of HDFS support do you have in mind? Because AFAIK HDInsight's HDFS acts as an access interface over blob storage.
@palladin there's definitely a side of WASB/HDFS interop that allows HDFS to talk to blob storage without realising it. I'm talking about things from the other side of the fence, i.e. allowing people to access resources in MBrace via HDFS or WASB. Currently the mechanism for accessing blobs in MBrace is a plain path against the cluster's own storage account (see the sketch below).
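A minimal sketch of that current, store-relative style, assuming the MBrace.Azure / MBrace.Core APIs of the time (connection strings and paths are placeholders, and the exact overloads shown are illustrative rather than definitive):

```fsharp
open MBrace.Core
open MBrace.Azure

// Assumed overload: connect against a single storage account + service bus pair.
let cluster =
    AzureCluster.Connect("<storage connection string>", "<service bus connection string>")

// Blobs are addressed with store-relative paths against that one default
// storage account; the path carries no scheme and no account name.
let text =
    cloud { return! CloudFile.ReadAllText "/mycontainer/data/input.txt" }
    |> cluster.Run
```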
Support for WASB addressing would allow us to address blobs consistently, particularly when connecting multiple storage accounts. IMHO this is a really important feature, because it lets us create clusters and perform data analysis over other storage accounts (e.g. customer data) without writing MBrace-specific data into them. For me this issue is about looking at Data Lake integration, and there are three routes: via WASB, via HDFS, and via Data Lake's own ADL naming format. The HDFS part would be interesting if you have, say, an HDFS cluster running somewhere with data on it - can we access it? Maybe that's another issue (there is definitely one either here or on MBrace.Core about this), but the ability to index files based on wildcard paths, e.g. "data/customers/january/*" in HDFS notation, rather than explicitly providing a list of files to operate over, would be great.
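A rough sketch of the wildcard idea, assuming a directory-listing primitive like CloudFile.Enumerate and the CloudFlow.OfCloudFilesByLine / CloudFlow.length combinators; expandGlob is invented here for illustration and does not exist in MBrace:

```fsharp
open MBrace.Core
open MBrace.Flow

/// Expand a single trailing-'*' pattern such as "data/customers/january/*"
/// into the concrete file paths it matches (illustrative helper only).
let expandGlob (pattern : string) = cloud {
    let i = pattern.LastIndexOf '/'
    let dir = pattern.Substring(0, i)
    let prefix = pattern.Substring(i + 1).TrimEnd '*'
    let! files = CloudFile.Enumerate dir   // assumed listing primitive
    return
        files
        |> Array.map (fun f -> f.Path)
        |> Array.filter (fun p ->
            let name = p.Substring(p.LastIndexOf '/' + 1)
            name.StartsWith prefix) }

// Usage: count lines across January's customer files without naming each file.
let lineCount =
    cloud {
        let! paths = expandGlob "data/customers/january/*"
        return! paths |> CloudFlow.OfCloudFilesByLine |> CloudFlow.length }
    |> cluster.Run
```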
@isaacabraham Support for wasb URLs is certainly useful, and Eirik actually has some ideas for multi-storage-account management that would enable the account-resolution part of wasb URLs.
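For reference, the account-resolution information is all carried in the wasb URI itself. A self-contained sketch of pulling the parts out, using only System.Uri (no MBrace API assumed):

```fsharp
open System

/// Split a URI of the documented form
///   wasb[s]://<container>@<account>.blob.core.windows.net/<path>
/// into its (account, container, blob path) components.
let parseWasb (uri : string) =
    let u = Uri uri
    let account = u.Host.Split('.').[0]        // "<account>"
    let container = u.UserInfo                 // "<container>"
    let path = u.AbsolutePath.TrimStart '/'    // "<path>"
    account, container, path

let account, container, path =
    parseWasb "wasb://mycontainer@myaccount.blob.core.windows.net/data/input.csv"
// -> ("myaccount", "mycontainer", "data/input.csv")
```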
That's great. @eiriktsarpalis and I have had some chats about this. Whilst we're on the subject, something that would be valuable is the ability to clearly segregate MBrace's internal Store data from user-facing data access. I'm thinking here of running MBrace on Service Fabric, which has built-in support for local, replicated state across a cluster. That could be a perfect fit for MBrace's internal state and would have potentially large performance benefits for things like persisted cloud flows.
https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-overview/
Data Lake is Microsoft's new "big data" store: auto-scaling, HDFS-compatible, and so on. There are several components to it, such as Data Lake Analytics (i.e. U-SQL), but we should look into the possibility of hooking an MBrace cluster up to Data Lake in addition to plain blob storage.
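To make the addressing question concrete, these are the URI shapes in play; the schemes are the documented ones, while the account, container, and file names below are made up:

```fsharp
// The same logical file under each addressing scheme discussed in this issue.
let storeRelative = "/mycontainer/data/input.csv"                                  // MBrace today
let wasbUri = "wasb://mycontainer@myaccount.blob.core.windows.net/data/input.csv"  // blob storage via WASB
let hdfsUri = "hdfs://namenode:8020/data/input.csv"                                // a plain HDFS cluster
let adlUri  = "adl://myaccount.azuredatalakestore.net/data/input.csv"              // Data Lake Store (ADL)
```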