
status: takes too long to get status #6543

Closed
raylutz opened this issue Sep 6, 2021 · 16 comments
Labels
A: status (Related to the dvc diff/list/status) · performance (improvement over resource / time consuming tasks)

Comments

@raylutz

raylutz commented Sep 6, 2021

Bug Report

Description

I have dvc setup in the root of my project folder, which is at

C:\Users\raylu\Documents\Github\audit-engine

the stage file is established in

resources\WI_Ozaukee_20201103\dvc\precheck\dvc.yaml

I issue this command:

dvc status -R -v -v -v --show-json  resources\WI_Ozaukee_20201103\dvc

And I expect that it will walk the subtree under

C:\Users\raylu\Documents\Github\audit-engine\resources\WI_Ozaukee_20201103\dvc

to look for dvc.yaml stage files. Instead, it appears to walk the full tree below

C:\Users\raylu\Documents\Github\audit-engine

and this takes 75 seconds (there is 112 GB of data).
But this is just a hunch. We temporarily moved the .dvc folder to inside the folder

C:\Users\raylu\Documents\Github\audit-engine\resources\WI_Ozaukee_20201103\dvc

and it takes only 5.6 seconds (which is still pretty long). This should probably take only a second or two, because getting the etags from the three s3 files is very fast and it needs only to find one stage file. It seems something is wrong here.

Reproduce

To reproduce this, dvc must be configured with no scm, no remote, and no cache, and -R must be used in status so it can find the dvc.yaml stage files. We have only one.

Expected

See above.

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 2.6.4 (pip)
---------------------------------
Platform: Python 3.7.6 on Windows-10-10.0.19041-SP0
Supports:
        http (requests = 2.24.0),
        https (requests = 2.24.0),
        s3 (s3fs = 2021.8.0, boto3 = 1.17.106)

Additional Information (if any):
I will attach the profile dump and plot.

Profile Dump

https://cdn.discordapp.com/attachments/882823608949411850/884465153716920380/dump.prof

https://cdn.discordapp.com/attachments/882823608949411850/884467942111203348/image_output.png

@shcheklein shcheklein added the performance improvement over resource / time consuming tasks label Sep 6, 2021
@karajan1001
Contributor

Yes, I agree that walking through resources\WI_Ozaukee_20201103\dvc and looking into the dvc.yaml files under ., resources, and resources\WI_Ozaukee_20201103 is enough.

@raylutz
Author

raylutz commented Sep 7, 2021

That will result in a very long wait, because resources is a very deep folder with 112 GB; all the data files for all active jobs are under resources. Why not just review what is in the folder specified, i.e. resources\WI_Ozaukee_20201103\dvc, and not . and resources? It makes almost no sense to specify a folder and then still look everywhere else.

@karajan1001
Contributor

karajan1001 commented Sep 7, 2021

> That will result in a very long wait, because resources is a very deep folder with 112 GB; all the data files for all active jobs are under resources. Why not just review what is in the folder specified, i.e. resources\WI_Ozaukee_20201103\dvc, and not . and resources? It makes almost no sense to specify a folder and then still look everywhere else.

I meant only the dvc.yaml files in the parent paths, not all of the paths. The status of a resource inside resources\WI_Ozaukee_20201103\dvc might be stored in a dvc.yaml/dvc.lock outside of that path.

@raylutz
Author

raylutz commented Sep 7, 2021

The dvc.lock must be saved in the resource/(jobname)/dvc path, the way I understand it, because the dvc.lock file will be unique per job. So "the status of a resource inside resources\WI_Ozaukee_20201103\dvc might be stored in a dvc.yaml/dvc.lock outside of that path" could be true, but it would mean I would have to have yet another tree structure to disambiguate the different dvc.yaml files. I thought about this, and it seems better to keep the .lock file outside of the GitHub content, such as in resources, even though the dvc.yaml files could be in the GitHub-saved area.

For this application, there may be a chance that a single pipeline definition is feasible IF there is sufficient capability in the params substitution functionality, and even then I will need to translate from the current settings-file input to the params needed. By the time I do all that, I can just create the pipeline from a template. All of that will be within the GitHub-saved code space, so the actual dvc.yaml stage file(s) will be dynamically produced by my code and placed in the resource/(jobname)/dvc/(stage)/ folder.

@pmrowla
Contributor

pmrowla commented Sep 7, 2021

To clarify, the reason for the current (stage/pipeline collection) behavior is that for dvc status <target>, <target> could be either a directory containing a dvc.yaml file, or the output for some dvc.yaml file outside of <target>.

So if I had a repo with path/dvc.yaml containing:

stages:
  foo:
    outs:
      - path/to/dir

Given the command dvc status path/to/dir, DVC still has to search the parent directories path/ and path/to/ for the correct dvc.yaml file with the output path/to/dir, instead of limiting the search to path/to/dir itself.

But I think the issue here is that when using -R/--recursive with a <target>, the user is explicitly telling DVC to look recursively for dvc.yaml and .dvc files inside the target path (meaning -R implies that <target> is not a stage output). So we could potentially skip the parent-directory search when using -R.
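The proposed behavior can be sketched as follows (a minimal illustration under stated assumptions, not DVC's actual implementation; `collect_pipeline_files` is a hypothetical helper):

```python
import os

def collect_pipeline_files(target, recursive):
    """Collect candidate dvc.yaml paths for `target`.

    With recursive=True, only the subtree under `target` is walked and the
    parent-directory search is skipped, since -R implies `target` is a
    directory of pipelines rather than a stage output. Without -R, `target`
    may be an output of a stage defined higher up, so the parent
    directories must be checked as well.
    """
    found = []
    if recursive:
        # -R: walk only the subtree below target.
        for root, _dirs, files in os.walk(target):
            if "dvc.yaml" in files:
                found.append(os.path.join(root, "dvc.yaml"))
    else:
        # No -R: check target and each of its ancestors for a dvc.yaml.
        path = os.path.abspath(target)
        while True:
            candidate = os.path.join(path, "dvc.yaml")
            if os.path.isfile(candidate):
                found.append(candidate)
            parent = os.path.dirname(path)
            if parent == path:  # reached the filesystem root
                break
            path = parent
    return found
```

With a layout like the one in this issue, the recursive call never touches siblings of the target, which is the behavior being requested.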

@daavoo
Contributor

daavoo commented Sep 7, 2021

Would it help to decouple pipeline status (dvc stage status) from data status (dvc status), similar to how dvc add / dvc stage add were decoupled?

@raylutz
Author

raylutz commented Sep 7, 2021

Normally, the -R switch in Linux commands does not extend above the starting point. So when I specified -R, I assumed it would only search the specified folder and everything contained in it, and below.

Stage outputs are specified in each stage. I don't think that, during the search for stage specifications (i.e. dvc.yaml files), there should be any searching for outputs. In fact, all my outputs are on S3. I did not get the impression that the product of a stage would be to create another stage spec (a dvc.yaml file) and put it in a strange place.

So my impression is that -R is searching too widely.

@raylutz
Author

raylutz commented Sep 7, 2021

The need to use -R at all stems from the fact that the dvc system likes to name everything dvc, rather than use descriptive names. Stages could be called, say, 'precheck.yaml', and they could all be in a folder named 'pipeline' or 'stages'. Instead, they are all called dvc.yaml, so you need to put each one in its own folder, like precheck/dvc.yaml, so it can be distinguished from the other stage files. Thus, with the system designed as it is, -R will be used a lot, so it needs to be well behaved. I don't see the need to search above the starting point.

@raylutz
Author

raylutz commented Sep 7, 2021

@daavoo asked:

> Would it help to decouple pipeline status (dvc stage status) from data status (dvc status), similar to how dvc add / dvc stage add were decoupled?

I thought that status could be requested for a specific stage in the pipeline. But I don't really understand stage status except in terms of whether it has been built, is not built but is ready (dependencies are all built), or is not built and not ready (the dependencies are not built, or are based on dependencies that are not built). The dvc status command does not actually provide a clear-cut status for the stage, especially when using --show-json: when there are no issues, it returns {}. This should return at least the status of the stage specified.
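The tri-state classification described here could be derived from the JSON output. A sketch, assuming the output shape where a clean status is an empty dict and each changed stage maps to a list of entries such as `{"changed deps": ...}` (`stage_state` is a hypothetical helper, not part of DVC):

```python
def stage_state(status_json, stage_name):
    """Map `dvc status --show-json` output to a coarse per-stage state.

    Assumed JSON shape: {} when everything is up to date; otherwise each
    changed stage is a key mapping to a list of change entries like
    {"changed deps": {...}} or {"changed outs": {...}}.
    """
    if stage_name not in status_json:
        return "built"  # no reported changes for this stage
    for entry in status_json[stage_name]:
        # Changed/missing dependencies mean upstream work is needed first.
        if isinstance(entry, dict) and "changed deps" in entry:
            return "not ready"
    # Only outputs changed: the stage needs a rerun, but its inputs exist.
    return "ready"
```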

So I understand that if status -R is used, it should locate all the stages, based on the starting point provided, and provide an overall status of the pipeline. I still don't have a lot of experience with anything more than one stage.

In our use case, at least some of the stage specs (dvc.yaml) will be determined by the specific inputs for that job. Other, later stages may be constant for all or most jobs. I still don't know exactly how to organize that, especially if I discover that the stage definitions can be constant (and then saved with the code) while using param files that vary. No matter what, we will have to change the stage specs to some extent based on the job. Moving those changes to param files is likely not going to buy too much, so my inclination right now is to build the stages programmatically, and I am happy to see that there is the 'dvc stage' command, which will help to add each stage and create the proper syntax.

(It may be that it will be easier to produce the .yaml directly from a python object than going through the dvc stage command)

It might be worth thinking about a dvc-lite design that does not do all the hand-holding or attempt to use a Git-like cache, and instead just does bare-bones status and repro operations. In essence, that is what I am trying to get to, because after my review I don't think it helps to have the Git-style 'cache' duplication of the data. The result of stages is deterministic and can be regenerated anyway, albeit with a cost, and I am saving directly to S3 and also maintaining a local cache of the data in the code that is called by the stage. My goal is to cut CPU costs by not regenerating a result when a prior stage may have been rerun but produced the same result, in which case a subsequent stage need not run. I also want to be able to provide the status of each stage to our frontend, which only needs built, unbuilt-and-ready, or unbuilt-and-not-ready. dvc status does not provide that, so it seems either it can be added, or I will need to process the status JSON block and determine the status that way.
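The skip-if-unchanged idea boils down to fingerprinting a stage's dependencies and comparing against the fingerprint recorded on the last successful run, which is essentially what dvc.lock does with checksums. A minimal sketch under that assumption (the `deps_fingerprint`/`should_run` helpers and the JSON lock file are hypothetical, not DVC's format):

```python
import hashlib
import json
import os

def deps_fingerprint(paths):
    """Hash the contents of a stage's dependency files (illustrative only)."""
    h = hashlib.md5()
    for p in sorted(paths):
        with open(p, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

def should_run(stage_name, dep_paths, lock_file="stage.lock.json"):
    """Return True if the stage must be rerun, i.e. its dependency
    fingerprint differs from the one recorded on the last run."""
    current = deps_fingerprint(dep_paths)
    if os.path.exists(lock_file):
        with open(lock_file) as f:
            recorded = json.load(f).get(stage_name)
        if recorded == current:
            return False  # same inputs, deterministic stage: skip the rerun
    return True
```

A runner would call `should_run` per stage in topological order and, after a successful run, write the new fingerprint back to the lock file.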

Well that's probably enough detail for now.

@shcheklein
Member

The way I understand this: DVC needs to create a DAG of operations even if we talk about "bare-bones status and repro operations". In the general case, that means traversing the tree to see all the inputs, outputs, etc.

The implementation of this specific case (-R) is probably not optimal. It might be filtering the stages to execute or calculate status for only after it has prepared the full pipeline, right? (cc @efiop @pmrowla)

@pmrowla
Contributor

pmrowla commented Sep 8, 2021

Yes, building the DAG takes place before any filtering, and when we build the DAG we collect all possible stages across the entire repo.

I think with -R we can just limit that collection to all stages inside the target dir, instead of collecting all stages in the full repo, although maybe this should be a separate flag? (cc @skshetry)


Alternatively, you can also just use .dvcignore to prevent DVC from traversing any directories that the user already knows will never contain pipeline/dvc files (to speed up the time it takes DVC to build the DAG for an entire repo).
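A .dvcignore file lives at the repo root and uses .gitignore-style patterns; for example (the folder names below are hypothetical, not from this project):

```
# .dvcignore — keep DVC from walking large data trees that never
# contain dvc.yaml or .dvc files
resources/other_job_*/
resources/*/raw_scans/
*.tmp
```

Every directory matched here is pruned from DVC's traversal, which directly cuts the DAG-collection time described above.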

@raylutz
Author

raylutz commented Sep 8, 2021

I find this under 'init'

> By default, DVC commands like dvc pull and dvc repro explore the whole DVC repository to find DVC-tracked data and pipelines to work with. This can be inefficient for large monorepos.

How can I change this default behavior? Maybe something akin to PATH to define the search?

As a reminder, I am not using a SCM and also not a dvc cache to maintain versioned files.

I think the design of this is wrong, but it appears the design is not very flexible, so I need to just find a solution. It looks like I may be able to work around this problem by defining a single stage file and providing it explicitly in the 'status' call.

I might be able to use 'subdir' in the init to reduce the search.

@shcheklein
Member

> I might be able to use 'subdir' in the init to reduce the search.

Yes, @raylutz, in your specific case a monorepo setup can be a workaround: run dvc init --no-scm inside each isolated part of the project. That will force DVC to collect only information within those boundaries.

@raylutz
Author

raylutz commented Sep 11, 2021

@shcheklein

That might work, but let me make sure it makes sense. I am troubled a bit by the idea of running init "inside" a job, because I normally don't think of it that way, but I am willing to change my thinking process.

The way I normally view this is that there is a root directory audit-engine/, which is also the GitHub repo base for the project. The code runs from there. I know that the dvc stages can include the wdir declaration to cause the script to run from where it needs to. This code is used in common by all the jobs. Then in /resources (a Git-ignored folder), each job has a folder with many subfolders where the active data files for that job are stored, if they exist locally. However, the job files are mainly stored in S3 and may be mirrored locally. Locally can mean a desktop (Windows or Linux) or an EC2 instance, which is the targeted operating environment.

I was planning on putting the stage files in audit-engine/resources/(jobname)/dvc/(stage_name)/dvc.yaml

Because you have adopted the notion that all stage files have the same name, it seems necessary to keep them in separate folders, or the dvc.yaml files will collide. It is also the case that the dvc.lock is in resources/(jobname)/dvc/.

I have found for now that this multiple-stage-file layout doesn't work, due to the way -R works, so I am now using one dvc.yaml file with everything in it, also in /resources/(jobname)/dvc.

But I currently have the .dvc folder with plots/ tmp/ and config there at audit-engine/.dvc

Then, if I understand you correctly, for each job, I should cd to /resources/(jobname)/dvc, and run dvc init --no-scm there.
That would establish the .dvc folder inside the (jobname)/dvc folder.

That might work. Let me know if I have this right.
For now, the workaround that may work fine is that I am using a single dvc.yaml file with all stages in it. Since I am specifying the file completely in each invocation, it does not take any time to search. This is a reasonable workaround for my purposes.

I suggest that perhaps just specifying the search path may be a good way to solve this for the future.

I am moving on now with this single pipeline file, which it seems will be built for each job, perhaps a bit differently, because I don't see any syntax for conditional stages.
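Building that per-job pipeline file programmatically is straightforward; a minimal sketch (the `render_dvc_yaml` helper is hypothetical and covers only the flat cmd/deps/outs case, not the full dvc.yaml schema with params, wdir, etc.):

```python
def render_dvc_yaml(stages):
    """Render a minimal dvc.yaml string from a dict of stage definitions.

    `stages` maps stage name -> {"cmd": str, "deps": [str], "outs": [str]}.
    Illustrative only; for anything beyond this shape, a YAML library
    (or `dvc stage add`) is the safer route.
    """
    lines = ["stages:"]
    for name, spec in stages.items():
        lines.append(f"  {name}:")
        lines.append(f"    cmd: {spec['cmd']}")
        for key in ("deps", "outs"):
            if spec.get(key):
                lines.append(f"    {key}:")
                lines.extend(f"      - {item}" for item in spec[key])
    return "\n".join(lines) + "\n"
```

A job-specific generator can then write the result to resources/(jobname)/dvc/dvc.yaml, varying the deps/outs per job.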

Thanks!

@daavoo daavoo added the A: status Related to the dvc diff/list/status label Oct 20, 2021
@raylutz
Author

raylutz commented Dec 4, 2021

We have decided not to use DVC and have implemented our own similar functionality. Thanks for your time.

@efiop
Contributor

efiop commented Dec 5, 2021

DAG collection will be simplified in 3.0, closing in favor of #7093

@efiop efiop closed this as completed Dec 5, 2021