DBFS (Databricks) Support #4336
Hi @Guichaguri ! I'm not familiar with DBFS, but I would double check if maybe it supports one of the already supported APIs (e.g. we have s3 support, ssh, hdfs etc). From quickly looking into the docs, seems like it is based on s3 somewhere internally, but I can't see any common APIs that we could use from the list that we already support. So looks like a new remote type would indeed be necessary. Do you use databricks yourself, or do you just want to contribute the support for it? |
Well they have their dbutils.fs in Python. I think that would be a bit more 'native'. I'm using databricks. |
Hello, Thanks, |
@omrihar, unfortunately, no one has worked on it yet. We are more than happy to accept the contribution for the feature. |
Besides the mapping, I don't know how this would work when using databricks notebooks; there is no git integration. |
Unfortunately I'm not sure I am able to add the databricks support - I'm not well versed enough in databricks and dvc for that. @majidaldo I was thinking of using databricks DBFS for data storage (as S3 is used). You are correct that the current git integration is very lacking... However, it still may be possible to use the dvc API to get the files from a databricks notebook with data lying somewhere in dbfs, maybe? I'm quite new to databricks and am not yet 100% sure what's the best strategy to deal with the various aspects of a solid, full-blown ML workflow on databricks. |
If you're just getting and uploading files from dbfs, dvc isn't adding value here. The dbfs api and dbfs mount is pretty straightforward to use. Databricks wants you to use MLFlow. After using Databricks for a while, I almost despise any platform that throws me into notebook-land. |
We are rapidly moving towards fsspec, and it comes bundled with a databricks implementation. So if there is nothing about dbfs that conflicts with DVC's assumptions about filesystem behavior, DVC might start supporting it after the full adoption (1-2 months): https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/dbfs.py |
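For reference, that fsspec DBFS filesystem can already be exercised directly; a minimal sketch, assuming a workspace host and a personal access token (both placeholders here):

```python
# Quick sketch of the fsspec DBFS implementation linked above. The workspace
# host and token are placeholders, not real values.
from fsspec.implementations.dbfs import DatabricksFileSystem

fs = DatabricksFileSystem(
    instance="adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace host
    token="dapiXXXXXXXX",                                    # placeholder personal access token
)
print(fs.ls("/"))  # list the DBFS root via the REST API
```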
I'd like to resurrect this issue. I can think of several requirements here.
I'm considering how to start using Databricks for ML project workflows, so the second requirement is very important to me. For now I'm thinking about using an EC2 instance with an S3 bucket mounted via DBFS and checking out multiple versions of code and datasets into separate folders. So for Databricks tools the version information would be lost - they would just see multiple folders with files. I'm also going to try using the Databricks Repos feature with a git repo that has DVC set up in it - and run DVC commands after git checkout and before git commit. |
Was this resurrected? |
@edmondop How are you trying to use DVC with DBFS? |
@dberenbaum the idea is to put some versioned excel files in DVC (the input data source is really in excel) and have a job that clones the code and ingests that as a delta table on databricks. DBFS is not the concern here; the concern is how to version the input of the job. |
Are you hitting an issue trying to do that, or are you looking for general advice on what the workflow should be? |
Should I install DVC in a notebook and perform a dvc pull from the notebook? That would probably work |
Sorry, I don't have handy access to Databricks to test, but you could try to set up your git repo on there and try it. You might also want to configure your cache directory to be on DBFS or some mounted storage since the default may be limited and unreliable. |
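A minimal sketch of that suggestion, assuming the git repo is already cloned into the Databricks workspace and DVC is installed on the cluster; the paths below are placeholders, not required locations:

```python
# Sketch only: keep the DVC cache on mounted storage and pull the tracked data.
# REPO is a hypothetical location of the cloned repo inside the Databricks workspace.
import subprocess

REPO = "/Workspace/Repos/some-user/some-repo"

# Put the cache on DBFS/mounted storage instead of the driver's local disk.
subprocess.run(["dvc", "cache", "dir", "/dbfs/tmp/dvc-cache"], cwd=REPO, check=True)

# Fetch the DVC-tracked files into the working tree.
subprocess.run(["dvc", "pull"], cwd=REPO, check=True)
```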
@dberenbaum This is actually very interesting for me as well, since we work with Databricks quite a lot (though not exclusively) and an increasing number of our clients are interested in data traceability and versioning. Specifically, I am looking for a way to be able to work inside a Databricks notebook. It could be interactively, as a data scientist might, or programmatically in case those notebooks are run in a pipeline. By the way, Databricks notebooks are really just .py files with some sugar - they are not the same as Jupyter notebooks - so they are a bit more production and versioning friendly, and using them in production is a real possibility. Either way, that is probably of little import, as the biggest issue seems to be the way Databricks interacts with the remote git versioning. You can set up what is called "Databricks repos", which basically allows Databricks to clone a repository from GitHub/GitLab/whatever to your Databricks environment and provides a simplified UI which can be used to version changes users make in their local copy (local = in the Databricks environment in this case). Basically, it can be seen as a VM which serves as your "local" development environment where you cloned the repo... except that it cannot, because that is not how the repos are handled by Databricks. When you run Databricks notebooks (or .py scripts, doesn't matter) in the Databricks environment, you run them on a Databricks cluster, or rather its driver node. Databricks supports repo files programmatically from inside the notebooks, so you are able to use python to read, write and modify the files which are notionally in the repo. If you use the GUI, you can then commit and push the changes. Note that you can only commit AND push at the same time - there are no local commits, because all the actual code versioning with git is hidden from the users for some reason. If you try to use the git python API to get info about the repository from inside databricks, you get an error. Running a git shell command (git status, for instance) from inside the notebook (there is a %sh magic which allows you to do that) has the same effect. |
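The original snippet and error output are not reproduced above; as an illustration of the kind of GitPython call being described (the path and the expected exception are assumptions on my part):

```python
# Illustrative only - the path is a hypothetical Databricks repo location, and the
# exception type is a plausible failure mode, not the exact error from the thread.
import git

try:
    repo = git.Repo("/Workspace/Repos/some-user/some-repo", search_parent_directories=True)
    print(repo.active_branch, repo.head.commit.hexsha)
except git.exc.InvalidGitRepositoryError as exc:
    # Databricks Repos hide the .git metadata from the cluster, so GitPython
    # cannot locate a repository at this path.
    print(f"GitPython cannot see a repo here: {exc}")
```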
To be honest, I am not quite sure how code versioning is handled by Databricks under the hood. It restricts you to using their GUI, which limits what you can do, and to access info about the repository from inside the notebooks you have to resort to undocumented internals. I am not sure why Databricks decided to implement versioning inside their environment like this, but as it stands I think it is in fact quite difficult to combine Databricks with DVC in a meaningful way. A shame, really, since if nothing else Databricks is a great platform for processing big data (as Databricks is, among other things, a managed Spark environment) and these big data pipelines can often benefit from data versioning (especially in large corporate environments where traceability and certification are important). I'd love to use the DVCFileSystem and the new DVC cloud versioning feature for exactly these types of use-cases, but unfortunately it clashes with the way Databricks handles code versioning. I guess I might actually ask them (Databricks devs) about this as well. cc @edmondop - this might be interesting - and I guess also @shcheklein - you were asking me a few months ago about any feature requests I might have on Iterative, and this would be a big one. But I am not sure if Iterative can actually help with this; the way I understand it, the issue seems to be on the Databricks side. At best a feature request from you might help, or perhaps they could help you find a workaround for integration... that is of course if you find working on something like this worthwhile. |
Come to think of it, I guess there is a workaround that doesn't do everything, but can at least make it partially work together: keep the DVC-tracked data in a separate data registry repository and consume it from Databricks notebooks through the DVCFileSystem (or dvc.api), instead of relying on the Databricks repo integration for DVC itself.
Unfortunately, this won't allow you to store references to the data directly with the code for your ML training/big data pipelines/whatever, but it will at least allow you to be sure the pipelines always run using a clearly specified and versioned dataset as an input. cc @edmondop By the way, let me know @dberenbaum if you find this interesting, I probably want to try this approach out anyway and I might as well write a short blog post about it while I'm at it... |
Thanks as always @tibor-mach! I have been a Databricks user before (including Databricks repos) but am out of practice, so I appreciate the refresher. Your idea to use DVC purely as a data registry for consumption through the DVCFileSystem makes sense as a starting point.
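A rough sketch of that data-registry consumption pattern, assuming a separate Git repo tracks the data with DVC and the cluster has credentials for the DVC remote (the repo URL, revision, and paths are made up for illustration):

```python
# Sketch only: read DVC-tracked data from a hypothetical "data registry" repo.
import pandas as pd
from dvc.api import DVCFileSystem

fs = DVCFileSystem(
    "https://github.com/example-org/data-registry",  # hypothetical registry repo
    rev="v1.2.0",                                     # any branch, tag, or commit
)

# List DVC-tracked files at that revision, then stream one into pandas
# (reading .xlsx additionally needs openpyxl on the cluster).
print(fs.find("datasets/reference", detail=False, dvc_only=True))
with fs.open("datasets/reference/customers.xlsx", mode="rb") as f:
    df = pd.read_excel(f)
```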
I think running Databricks notebooks as jobs is another potential integration point. It should be possible to run them using the jobs CLI in a DVC pipeline. Interactively running DVC pipelines or experiments from within a notebook might be possible, but how useful would you find it? As you mention, Databricks is especially useful for large Spark data processing jobs, and I think DVC is often (but not always) more useful downstream from these jobs. Caching and versioning all of the data for a large Spark job may not be realistic or desirable, but it may be useful to have a "no-cache" DVC pipeline that can provide traceability about changes to the dependencies and outputs in the pipeline, even if it can't recover past datasets. |
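To make the jobs idea concrete, here is a hedged sketch of wrapping a Databricks job in a DVC stage with an uncached output; jobs/run_databricks_job.py is a hypothetical wrapper that calls the jobs API/CLI and writes run metadata locally, and every name below is a placeholder:

```python
# Sketch only: register a DVC stage whose command triggers a Databricks job via a
# hypothetical wrapper script. "--outs-no-cache" records the output's hash in
# dvc.lock without copying the file into the DVC cache.
import subprocess

subprocess.run(
    [
        "dvc", "stage", "add", "--name", "databricks_etl",
        "--deps", "jobs/run_databricks_job.py",   # hypothetical wrapper around the jobs API
        "--outs-no-cache", "data/etl_run.json",   # hypothetical run-metadata file it writes
        "python jobs/run_databricks_job.py --job-id 123 --out data/etl_run.json",
    ],
    check=True,
)
```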
@tibor-mach thanks for following up! @dberenbaum the idea is that you have reference tables that you want to store under version control and version them with your code. These are typically small tables, so you don't need Spark, but they are not small enough to be stored in Git. |
Great, thanks @edmondop! @tibor-mach Do you plan to give this a shot with the data registry + dvcfs idea you proposed? |
@dberenbaum Sure, I intend to test it either tomorrow or next week, depending on how much time I have. Triggering Databricks jobs as a part of a dvc pipeline is actually a very interesting idea as well for some use-cases. I suppose that with a combination of that and the external data feature of DVC, you should actually be able to run arbitrary pipelines in Databricks while taking full advantage of DVC and even dvc pipelines (which might be very relevant, because the ability to skip some steps of the pipeline in case nothing changed since the previous run can save a lot of costs in scenarios where you actually need this big data wrangling as a part of your ML training pipeline). So my plan is basically to try the first approach first and then also experiment with jobs + external data management. I'll see how that goes. Since this would address a few things that are related to our current projects in my job, I think I should be able to find some time for both. |
Sounds great @tibor-mach!
👍 The external data part is where it gets a bit awkward IMO because it is both clunky to set up and doesn't fit well into the typical DVC workflow. Without each team member having their own copy of the dataset, lots of problems can arise, like accidentally overwriting each other's work. However, I think for the use case of simply skipping steps in case nothing has changed (which I agree is probably the most useful part of this scenario), it may make sense to set cache: false on those outputs so that DVC tracks their hashes without storing copies. |
@dberenbaum It seems that in the newest version of databricks, the databricks Repos actually work like git repositories. However, I would still need to pass credentials (say, storage account name and key on Azure) to DVC somehow. I assume that normally this information is looked up in the local DVC config or the environment. |
@tibor-mach Sadly we don't have a way to do that yet (see #9154) |
@dberenbaum I see. I guess it will be difficult to go forward with the way I wanted to test the integration with Databricks (though I think I can still try out the option with jobs you suggested). I will link this to #9154 , thanks. |
Good discussion. @tibor-mach thanks for the input. I've installed the environment on Azure to play with it a bit.
Could you clarify this? I have some idea, but it would be great to hear your thoughts. |
In a lot of companies, particularly in engineering, ML models are used as one of the tools in the development of new machines. When you are talking about things like trains or airplanes (even cars, though there it is slightly different), there is a certification process they have to go through, and as a part of that they need all data used as part of the development to be clearly traceable and reproducible. This is also why tools like DVC are interesting for these companies. At the same time, DeltaLake does versioning, but in a way that does not lead to clear reproducibility in the context of ML. To me DeltaLake table versioning is great, for instance, when you accidentally overwrite a critical table that is computed in a large batch job that runs once a month... then you can simply revert to the previous version. But it is not quite as handy as a tool for ensuring reproducibility of ML models. DVC is great in this regard in that it really provides a GitOps approach... so in the end, git is the only source of truth and (at least as long as you also use containers for training) you can talk about full reproducibility. That is an ironclad argument for any certification process which requires you to trace back and be able to reproduce all the outcome data (some of which is generated with ML) which you use as the basis of a certification of your new train motor, for instance.
I guess so. It is actually one of the things I would like to figure out how to do better with DVC - not having to actually pull anything while still versioning the data. If we are talking about working with either big or somehow restricted data (often the case with things like airplane turbines and the like!), then pulling it locally might not be an option.
I guess the best case scenario would be to be able to use DVC for versioning of DeltaLake (or even simple parquet) files stored in a datalake without the need to pull them for processing. So for instance if you run any actual ML on Databricks you could still use a GitOps approach for the entire data processing and ML (this to me is one of the biggest strengths of DVC and biggest weaknesses of Databricks). If you just run your ML training in a container on an EC2 instance (for instance) then this is not that big an issue I guess since you can always use CML for that and keep all data secure and out of reach of any user's local machine (it is still a problem with data which are not allowed on cloud but then you can simply use dvc with self-hosted on-prem runners). But Databricks is also quite handy as a tool for data exploration and prototyping (basically using it like you would use jupyter lab with the added benefit of having access to managed spark). There, what I would love to see is the ability to have data versioned via git so that if I want to share a project with a colleague, I just tell them a name of a specific branch/tag/commit in the project repo and rest assured that they will be looking at the same data I do. You can do that with DVC already, but not in the context of the Databricks environment and so I am also trying to figure out how to do that best. |
I was able to do at least rudimentary work in a DVC repo using Databricks repos by configuring a few settings at the top of my notebook:
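The exact settings aren't shown in this thread; purely as an illustration of the kind of notebook-top configuration being discussed (every value below is an assumption, not the settings actually used):

```python
# Illustration only - these are guesses at plausible settings, not the ones referenced above.
import subprocess

REPO = "/Workspace/Repos/some-user/some-repo"  # hypothetical Databricks repo path

for args in (
    ["dvc", "config", "core.no_scm", "true"],               # Databricks Repos hide .git from DVC
    ["dvc", "config", "cache.type", "copy"],                 # file links may not work on this filesystem
    ["dvc", "config", "cache.dir", "/dbfs/tmp/dvc-cache"],   # keep the cache on mounted storage
):
    subprocess.run(args, cwd=REPO, check=True)
```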
@tibor-mach Does it seem useful to you? Do you want to try working with those settings? Edit: dropped this part as I realized the linking wasn't working on dbfs. |
@dberenbaum Thanks for the update. I will have a look at it, but now I will be away (and without a computer) until the end of next week. I'll let you know afterwards. |
Thanks everyone for a great and detailed discussion, it was a pleasure going through it. After reading all the points made here, discussing this with @dberenbaum and @shcheklein , and playing a little bit with databricks myself, it seems like we have a few points here:
Then we have different levels of functionality that one could want:
So far we've been mostly worried here about tracking large data in clouds/dbfs, but not about caching smaller stuff locally. The latter should already work as-is, and we might only want to recommend creating a local remote or local cache (as Dave suggested above) on dbfs to be able to easily share the cache. So in terms of action points, I think I'll try to get the dvcfs/get creds problems sorted first, so that one could at least use dvc more comfortably within databricks and similar environments. Let me know what you think. |
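A small sketch of the "local remote or local cache on dbfs" recommendation, with placeholder paths (assumes a DBFS mount is visible from the driver as a regular path):

```python
# Sketch only: make the default DVC remote a directory on shared DBFS storage so
# every cluster in the workspace pushes/pulls to the same place.
import subprocess

REPO = "/Workspace/Repos/some-user/some-repo"  # hypothetical repo path

subprocess.run(
    ["dvc", "remote", "add", "-d", "dbfs-storage", "/dbfs/mnt/dvc-storage"],
    cwd=REPO,
    check=True,
)
```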
@dberenbaum I finally tried it out. Your workaround seems to solve one issue, but there is still the problem with authentication: if I want to use the DVC Python API from the notebook, I still need to provide credentials for the remote storage. I could of course simply copy them into the notebook environment, but that is hardly a good practice. One way to solve this is to enable setting credentials (#9154) via the Python API. Then the credentials can be stored in a secret vault (be it the one from Databricks or native to the specific cloud) and retrieved from there each time they are needed. |
Btw @efiop this is somewhat relevant. You can at least get info about the repo metadata this way and find out the url to the original repository, among other things... but it is on veeery shaky grounds, as this is not documented anywhere and subject to change by Databricks at any time (although I guess it should hopefully at least stay fixed for a given Databricks runtime). You get this as the content of the output JSON.
Originally this also gave you the state of the working tree (whether it is dirty), but then they removed that from the output json for some reason. Maybe there is a way to access it somehow, but given the lack of documentation you'd need to guess and maybe just do a grid search of all the methods of the various internal objects. |
For the record: #9154 (dvcfs support for creds passing through config) works now. Feel free to give it a try. I'll add a corresponding example to the docs. EDIT: api read/open now also support passing config/remote_config. |
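A hedged sketch of what passing credentials through the API might look like on a recent DVC 3.x (the repo URL and paths are made up; the option names are the standard DVC Azure remote settings, and the secret lookup via a Databricks secret scope is only one possible source):

```python
# Sketch only: hand remote credentials to DVCFileSystem at call time instead of
# storing them in .dvc/config. All values are placeholders.
from dvc.api import DVCFileSystem

# In a Databricks notebook this could come from a secret scope, e.g.
# dbutils.secrets.get(scope="ml", key="azure-account-key").
account_key = "<fetched-from-a-secret-store>"

fs = DVCFileSystem(
    "https://github.com/example-org/data-registry",  # hypothetical registry repo
    rev="main",
    remote_config={"account_name": "mystorageaccount", "account_key": account_key},
)
data = fs.read_bytes("datasets/reference/customers.xlsx")
```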
@efiop Cool, I will give it a try. Also, Databricks is developing a Python SDK, which might also be worth checking out and maybe a good tool for integration with Databricks. But the development has only recently started (which might be a good opportunity to get in touch with the devs, as things are likely going to be less defined on their part at this point and they might be more receptive to feedback). |
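For reference, a tiny sketch of that SDK as I understand it (package databricks-sdk; class and method names may well change given how new it is, and the host/token are placeholders):

```python
# Sketch only, based on the early databricks-sdk for Python; not an official example.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="https://adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace URL
    token="dapiXXXXXXXX",                                        # placeholder personal access token
)

print(w.current_user.me().user_name)   # quick check that authentication works
for job in w.jobs.list():              # enumerate jobs that a DVC stage could trigger
    print(job.job_id, job.settings.name)
```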
Related discussion in #8411 |
For the record: if anyone is running into |
For the record: we now have some basic info on databricks + dvc in our docs https://dvc.org/doc/user-guide/integrations/databricks |
Databricks has its own file system format.
I've read a bit of the DVC code and I think I can work on a PR adding support for it by using its REST API. Is there any reason not to work on it?
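For context, the REST endpoints involved are fairly simple; a minimal sketch of listing a DBFS path (host and token are placeholders, endpoint as in the public DBFS API docs):

```python
# Sketch only: list a DBFS directory through the REST API that a DVC remote could build on.
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "dapiXXXXXXXX"                                        # placeholder personal access token

resp = requests.get(
    f"{HOST}/api/2.0/dbfs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"path": "/"},
)
resp.raise_for_status()
for entry in resp.json().get("files", []):
    print(entry["path"], entry["is_dir"], entry["file_size"])
```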