status: takes too long to get status #6543
Comments
Yes, I agree that walking through the entire repo to collect the stages is what happens here.
That will result in a very long wait, because resources is a very deep folder holding 112 GB: all the data files for all active jobs are under resources. Why not just review what is in the folder specified?
I mean that only the status inside resources\WI_Ozaukee_20201103\dvc might be stored in the dvc.yaml/dvc.lock outside of the resource path.
The dvc.lock must be saved in the resource/(jobname)/dvc path, the way I understand it, because the dvc.lock file will be unique per job. So "resource status inside resources\WI_Ozaukee_20201103\dvc might be stored in the dvc.yaml/dvc.lock outside of the resource path" could be true, but it would mean I would need yet another tree-type structure to disambiguate the different dvc.yaml files. I thought about this, and it seems better to keep the .lock file outside of the GitHub content, such as in resources, even though the dvc.yaml files could live in the GitHub-saved area.

For this application, there may be a chance that a single pipeline definition is feasible IF there is sufficient capability in the params substitution functionality, and even then I will need to translate from the current settings-file input to the params needed. By the time I do all that, I can just create the pipeline from a template. All of that will be within the GitHub-saved code space, so the actual dvc.yaml stage file(s) will be dynamically produced by my code and placed in the resource/(jobname)/dvc/(stage)/ folder.
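For reference, a minimal sketch of what DVC's params substitution can look like in a dvc.yaml, assuming a params.yaml that carries a job section; the stage name, script, and paths here are hypothetical:

```yaml
# dvc.yaml -- ${...} values are interpolated from params.yaml
stages:
  precheck:
    cmd: python precheck.py --job ${job.name}
    deps:
      - precheck.py
    params:
      - job.name
    outs:
      - resources/${job.name}/precheck_result.json
```

With params.yaml holding, say, `job: {name: WI_Ozaukee_20201103}`, the same stage definition could in principle be reused across jobs.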
To clarify, the reason for the current (stage/pipeline collection) behavior is that status targets can be individual outputs, not just stage files. So if I had a repo with:

```yaml
stages:
  foo:
    outs:
      - path/to/dir
```

Given a command like `dvc status path/to/dir`, DVC has no way of knowing which dvc.yaml declares that output without first collecting every stage in the repo. But I think the issue here is that when using the -R flag on a directory of stage files, that repo-wide collection should not be needed.
Would it help to decouple pipeline status (whether a stage is up to date) from data status (whether the tracked data is in sync)?
Normally, the -R switch in Linux commands does not extend above the starting point. So when I specified -R, I assumed it would only search the specified folder and everything contained in it and below. Stage outputs are specified in each stage; I don't think that the search for stage specifications, i.e. dvc.yaml files, should involve any searching for outputs. In fact, all my outputs are on s3. I did not get the impression that the product of a stage would be to create another stage spec (dvc.yaml file) and put it in a strange place. So my impression is that -R is searching too widely.
The need to use -R at all stems from the fact that the dvc system likes to name everything dvc, rather than use descriptive names. Stages could be called, say, 'precheck.yaml', and they could all live in a folder named 'pipeline' or 'stages'. Instead, they are all called dvc.yaml, so each one has to go in its own folder, like precheck/dvc.yaml, to distinguish it from the other stage files. Thus, with the system designed as it is, -R will be used a lot, so it needs to be well behaved. I don't see the need to search above the starting point.
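To illustrate the layout this naming convention forces (the job name is from this thread; the stage names are hypothetical):

```
resources/
  WI_Ozaukee_20201103/
    dvc/
      precheck/
        dvc.yaml
      tabulate/
        dvc.yaml
```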
@daavoo asked:
I thought that status could be requested for a specific stage in the pipeline. But I don't really understand stage status except in terms of whether it has been built, is not built but is ready (dependencies are all built), or is not built and not ready (the dependencies are not built, or are based on dependencies that are not built). The dvc status command does not actually provide a clear-cut status for the stage, especially when using --show-json: when there are no issues, it returns {}. It should return at least the status of the stage specified. So I understand that if status -R is used, it should locate all the stages under the starting point provided and give an overall status of the pipeline.

I still don't have a lot of experience with anything more than one stage. In our use case, at least some of the stage specs (dvc.yaml) will be determined by the specific inputs for that job; other, later stages may be constant for all or most jobs. I still don't know exactly how to organize that, especially if I discover that the stage definitions can be constant (and saved with the code) while the param files vary. No matter what, we will have to change the stage specs to some extent based on the job. Moving those changes into param files is likely not going to buy much, so my inclination right now is to build the stages programmatically, and I am happy to see there is the 'dvc stage' command, which will help to add each stage and create the proper syntax. (It may be easier to produce the .yaml directly from a Python object than to go through the dvc stage command; see the sketch below.)

It might be worth thinking about a dvc-lite design that does not do all the hand-holding or attempt to use a github-like cache, and instead just does bare-bones status and repro operations. In essence, that is what I am trying to get to, because after my review, I don't think the github 'cache' duplication of the data helps us. The result of a stage is deterministic and can be regenerated anyway, albeit at a cost, and I am saving directly to s3 while also maintaining a local cache of the data in the code called by the stage. My goal is to cut CPU costs by not regenerating a result when a prior stage may have been rerun but produced the same output, so a subsequent stage need not run. I also want to provide the status of each stage to our frontend, which only needs built, unbuilt-and-ready, or unbuilt-and-not-ready. dvc status does not provide that, so it seems either it can be added, or I will need to process the status JSON block and determine the status that way. Well, that's probably enough detail for now.
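As an illustration of that last point, a minimal sketch of producing a per-job dvc.yaml from a Python dict rather than going through `dvc stage add`; the job name, script, and paths are hypothetical:

```python
# Build a per-job dvc.yaml from a plain dict and write it with PyYAML.
import yaml  # pip install pyyaml

def write_pipeline(job_name: str) -> None:
    pipeline = {
        "stages": {
            "precheck": {
                "cmd": f"python precheck.py --job {job_name}",
                "deps": ["precheck.py"],
                "outs": [f"resources/{job_name}/precheck_result.json"],
            },
        },
    }
    # Assumes resources/<job_name>/dvc/ already exists.
    with open(f"resources/{job_name}/dvc/dvc.yaml", "w") as f:
        yaml.safe_dump(pipeline, f, sort_keys=False)

write_pipeline("WI_Ozaukee_20201103")
```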
The way I understand this, DVC needs to create a DAG of operations even if we talk about "bare-bones status and repro operations". In the general case that means traversing the tree to see all the inputs, outputs, etc. The implementation of this specific case (status -R on a subdirectory) could probably be improved, though.
Yes, building the DAG takes place first, before any filtering, and when we build the DAG we collect all possible stages through the entire repo. I think the collection could be narrowed to the requested path in the future. Alternatively, you can also just pass the path of the dvc.yaml file explicitly as the status target.
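For example, with the job layout described earlier in this thread:

```
dvc status resources/WI_Ozaukee_20201103/dvc/dvc.yaml
```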
I find this under 'init':

How can I change this default behavior? Maybe something akin to PATH to define the search? As a reminder, I am not using an SCM, and also not a dvc cache to maintain versioned files. I think this part of the design is wrong, but it appears the design is not too flexible, so I need to just find a solution. It looks like I may be able to work around this problem by defining a single stage file and providing it explicitly in the 'status' call. I might be able to use 'subdir' in the init to reduce the search.
Yes, @raylutz, in your specific case a monorepo setup can be a workaround: do a separate `dvc init --no-scm` inside each job directory, so that DVC only walks that job's subtree.
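A sketch of that workaround, using the paths from this thread:

```
cd resources/WI_Ozaukee_20201103/dvc
dvc init --no-scm    # creates a .dvc directory scoped to this job
dvc status -R .      # now only this job's subtree is walked
```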
That might work, but let me make sure it makes sense. I am troubled a bit by the idea of running init "inside" a job, because I normally don't think of it that way, but I am willing to change my thinking process.

The way I normally view this is that there is a root directory, audit-engine/, which is also the GitHub repo base for the project, and the code runs from there. I know that dvc stages can include the wdir declaration to make the script run from where it needs to. This code is used in common with all the jobs. Then in /resources (a gitignored folder), each job has a folder with many subfolders where the active data files for that job are stored, if they exist locally. However, the job files are mainly stored in s3 and may be mirrored locally; "locally" can be a desktop, either Windows or Linux, or an EC2 instance, which is the targeted operating environment.

I was planning on putting the stage files in audit-engine/resources/(jobname)/dvc/(stage_name)/dvc.yaml. Because you have adopted the notion that all stage files have the same name, it seems necessary to keep them in separate folders in order to name them, or the dvc.yaml files will collide. It is also the case that the dvc.lock is in resources/(jobname)/dvc/. I have found for now that multiple stage files don't work, due to the way -R works, so I am now using one dvc.yaml file with everything in it, also in /resources/(jobname)/dvc (see the sketch below). But I currently have the .dvc folder, with plots/, tmp/, and config, at audit-engine/.dvc.

Then, if I understand you correctly, for each job I should cd to /resources/(jobname)/dvc and run dvc init --no-scm there. That might work; let me know if I have this right. I suggest that perhaps just specifying the search path may be a good way to solve this for the future. I am moving on now with this single pipeline file, which it seems will be built for each job, perhaps a bit differently, because I don't see any syntax for conditional stages. Thanks!
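A minimal sketch of such a single multi-stage dvc.yaml using wdir, placed at resources/(jobname)/dvc/dvc.yaml; the stage names and scripts are hypothetical:

```yaml
stages:
  precheck:
    wdir: ../../..     # run from the audit-engine/ project root
    cmd: python precheck.py
    deps:
      - precheck.py
    outs:
      - resources/WI_Ozaukee_20201103/precheck_result.json
  tabulate:
    wdir: ../../..
    cmd: python tabulate.py
    deps:
      - resources/WI_Ozaukee_20201103/precheck_result.json
    outs:
      - resources/WI_Ozaukee_20201103/tabulation.json
```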
We have decided not to use DVC and have implemented our own similar functionality. Thanks for your time. |
DAG collection will be simplified in 3.0, closing in favor of #7093 |
Bug Report
Description
I have dvc set up in the root of my project folder, which is at
The stage file is established in
I issue this command:
And I expect that it will walk the subtree under
to look for dvc.yaml stage files. Instead, it appears to walk the full tree below
and this takes 75 seconds (there is 112 GB of data).
But this is just a hunch. We temporarily moved the .dvc folder to inside the folder
and it takes only 5.6 seconds (which is still pretty long). This should probably take only a second or two, because getting the etags from the three s3 files is very fast and it needs only to find one stage file. It seems something is wrong here.
Reproduce
To reproduce this, dvc must be configured with no scm, no remote, and no cache, and -R must be used with status so it can find the dvc.yaml stage files. We have only one.
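A sketch of such a setup (the stage and file names are placeholders; the output is marked cache: false since no dvc cache is used):

```
dvc init --no-scm
mkdir -p resources/job1/dvc
cat > resources/job1/dvc/dvc.yaml <<'EOF'
stages:
  prep:
    cmd: echo prep > result.json
    outs:
      - result.json:
          cache: false
EOF
time dvc status -R resources
```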
Expected
See above.
Environment information
Output of dvc doctor:

Additional Information (if any):
I will attach the profile dump and plot.
Profile Dump
https://cdn.discordapp.com/attachments/882823608949411850/884465153716920380/dump.prof
https://cdn.discordapp.com/attachments/882823608949411850/884467942111203348/image_output.png