Investigate Data Lake integration #134
Comments
Isaac: agreed. To frame it differently, having a 'real big data' example would be great. What do you think would be the best way to approach that? Perhaps work through an example?
That's a good question :-) I'm speaking to the Data Lake guys to see what the story is for plugging third-party components onto the Data Lake store (MBrace has CloudFlow, so no need for the U-SQL side of things). I do think investigating the HDFS side of things is worth spending some time on, though. cc: @palladin @dsyme @eiriktsarpalis
Totally agree on the HDFS side. Having guidance / a story on how to work against 'stuff in HDFS' would be awesome.
What kind of HDFS support do you have in mind? Because AFAIK HDInsight's HDFS acts as an access interface over blob storage.
@palladin there's definitely a side of WASB/HDFS interop that allows HDFS to talk to blob storage without realising it. I'm talking about things from the other side of the fence, i.e. allowing people to access resources in MBrace via HDFS or WASB. Currently the mechanism for accessing blobs in MBrace is a plain path against the cluster's own storage account (see the sketch below).
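A minimal sketch of that current, store-relative style, assuming the MBrace.Azure / MBrace.Core APIs of the time (connection strings and paths are placeholders, and the exact overloads shown are illustrative rather than definitive):

```fsharp
open MBrace.Core
open MBrace.Azure

// Assumed overload: connect against a single storage account + service bus pair.
let cluster =
    AzureCluster.Connect("<storage connection string>", "<service bus connection string>")

// Blobs are addressed with store-relative paths against that one default
// storage account; the path carries no scheme and no account name.
let text =
    cloud { return! CloudFile.ReadAllText "/mycontainer/data/input.txt" }
    |> cluster.Run
```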
Support for WASB addressing would allow us to address blobs consistently, particularly when connecting multiple storage accounts. IMHO this is a really important feature, because it lets us create clusters and perform data analysis over other storage accounts (e.g. customer data) without writing MBrace-specific data into them. For me this issue is about looking at Data Lake integration, and there are three routes: via WASB, via HDFS, and via Data Lake's own ADL naming format. The HDFS part would be interesting if you have, say, an HDFS cluster running somewhere with data on it - can we access it? Maybe that's another issue (there is definitely one either here or on MBrace.Core about this), but the ability to index files based on wildcard paths, e.g. "data/customers/january/*" in HDFS notation, rather than explicitly providing a list of files to operate over, would be great.
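A rough sketch of the wildcard idea, assuming a directory-listing primitive like CloudFile.Enumerate and the CloudFlow.OfCloudFilesByLine / CloudFlow.length combinators; expandGlob is invented here for illustration and does not exist in MBrace:

```fsharp
open MBrace.Core
open MBrace.Flow

/// Expand a single trailing-'*' pattern such as "data/customers/january/*"
/// into the concrete file paths it matches (illustrative helper only).
let expandGlob (pattern : string) = cloud {
    let i = pattern.LastIndexOf '/'
    let dir = pattern.Substring(0, i)
    let prefix = pattern.Substring(i + 1).TrimEnd '*'
    let! files = CloudFile.Enumerate dir   // assumed listing primitive
    return
        files
        |> Array.map (fun f -> f.Path)
        |> Array.filter (fun p ->
            let name = p.Substring(p.LastIndexOf '/' + 1)
            name.StartsWith prefix) }

// Usage: count lines across January's customer files without naming each file.
let lineCount =
    cloud {
        let! paths = expandGlob "data/customers/january/*"
        return! paths |> CloudFlow.OfCloudFilesByLine |> CloudFlow.length }
    |> cluster.Run
```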
@isaacabraham Support for wasb URLs is certainly useful, and Eirik actually has some ideas for multi-storage-account management that would enable the account-resolution part of wasb URLs.
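For reference, the account-resolution information is all carried in the wasb URI itself. A self-contained sketch of pulling the parts out, using only System.Uri (no MBrace API assumed):

```fsharp
open System

/// Split a URI of the documented form
///   wasb[s]://<container>@<account>.blob.core.windows.net/<path>
/// into its (account, container, blob path) components.
let parseWasb (uri : string) =
    let u = Uri uri
    let account = u.Host.Split('.').[0]        // "<account>"
    let container = u.UserInfo                 // "<container>"
    let path = u.AbsolutePath.TrimStart '/'    // "<path>"
    account, container, path

let account, container, path =
    parseWasb "wasb://mycontainer@myaccount.blob.core.windows.net/data/input.csv"
// -> ("myaccount", "mycontainer", "data/input.csv")
```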
That's great. @eiriktsarpalis and I have had some chats about this. Whilst we're on the subject, something that would be valuable is the ability to clearly segregate MBrace's internal Store data from user-facing data access. I'm thinking here of running MBrace on Service Fabric, which has built-in support for local, replicated state across a cluster. That could be a perfect fit for MBrace's internal state and would have potentially large performance benefits for things like persisted cloud flows.
https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-overview/
Data Lake is Microsoft's new "big data" store: auto-scaling, HDFS-compatible, and so on. There are several components to it, such as Data Lake Analytics (i.e. U-SQL), but we should look into the possibility of hooking an MBrace cluster up to Data Lake in addition to plain blob storage.
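To make the addressing question concrete, these are the URI shapes in play; the schemes are the documented ones, while the account, container, and file names below are made up:

```fsharp
// The same logical file under each addressing scheme discussed in this issue.
let storeRelative = "/mycontainer/data/input.csv"                                  // MBrace today
let wasbUri = "wasb://mycontainer@myaccount.blob.core.windows.net/data/input.csv"  // blob storage via WASB
let hdfsUri = "hdfs://namenode:8020/data/input.csv"                                // a plain HDFS cluster
let adlUri  = "adl://myaccount.azuredatalakestore.net/data/input.csv"              // Data Lake Store (ADL)
```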