Oracles.ModuleFiles: performance, correctness and tests #210
Comments
Btw just to expand on the email I sent earlier: the issue about cabal traversing everything is this one: haskell/cabal#3019 In my case it's even worse, as in some projects I have many files and folders in my project folder which may be data generated by my program or tests, or other stuff which cabal is not supposed to look at, but I still keep in the cabal project. Then there are also issues with the traversal running into (auto)mountpoints if cabal starts looking in places it isn't supposed to. Another related issue is that cabal should become more explicit, and IMO require any preprocessors to be requested via |
It looks like a moderately complex solution to a moderately complex problem - it seems about right for the difficulty of the problem it's solving. Do you have any profile measurements that show it is a bottleneck? And I can see automated tests, namely the Travis pieces. This strikes me very much as a task that is difficult to calculate, but easy to check, since if you get it wrong nothing will work - and thus we are checking it extensively. Only one note from reading through it - there's no point caching something that is computed by an oracle so directly. Every oracle call is cached, so you can drop the newCache bit. |
@hvr I only traverse folders of the form
Making |
I think this is a good idea; the requested build-tools can be used to limit the extensions to search for. I still think we should be caching the results of this search, though. Actually, the same applies to One thing that is unclear to me: how does Shake handle cache invalidation in this case? For example, if I'm developing a package and I move |
@ndmitchell I don't have any profiles, but when this was implemented naively, zero builds took minutes for me. So the optimisation was a forced one.
That's true, but I would want to have some assurance that the current solution will keep working when/if we start building new packages in future (e.g.,
Oh. Silly me. This means most |
@ttuegel The following should happen:
|
I find the idea that A.B.C can be anywhere but src/A/B/C.hs to be a little bonkers. Everywhere I've worked there was a precise isomorphism between module name and file name, which is absolutely necessary when you have 10,000+ modules. Even end user tools were not named @ttuegel - things like @snowleopard profile before optimising - I appreciate a naive implementation is insufficient, but this one might be enough. I think there are a few tweaks/simplifications that might be possible, but it's in the ballpark I was expecting. I suggest that for each of the 3 top-level functions in this module you give a couple of representative queries/answers (edited for length), exactly like you did on this ticket. With that, someone should be able to implement it from scratch without knowing anything more, and then I might be able to give more feedback. |
@ndmitchell Sure, I will do that. |
👍
Having to run
So, if I move |
In this case it's not |
@ttuegel there will never be a requirement for manual intervention. The cost of the communication associated with a manual intervention is ridiculously high, so at the absolute worst, you can just bump the internal version number of the build system and rebuild everything - but that's the absolute worst case. Things like files moving around should be easy. |
@snowleopard - I notice you only commented |
@ndmitchell This is because Two other functions in this module ( Let me think about a possible improvement first and then I will add more comments. |
Detailed review:
Group doesn't do a sort first, so if the arguments are Only other minor point is that you are storing the list of modules as both the argument to the oracle, and in the result. The size of the oracle will increase the size of the database, which increases the time it takes to stream (although is currently < 0.2s in total, so can't be having that big an effect). You could return the modules in the order they were passed, perhaps even demanding the modules are sorted before giving them to the oracle (which also means your groupBy would catch everything) and then not have to also store the modules on the way out as well and let the user just zip. But basically, it looks right, and about as performant as is reasonable. I checked and in the HEAD version of Shake the getDirectoryFiles pattern does only one getDirectoryContents, which is all you can really hope for. |
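The point about grouping without sorting can be illustrated with a small sketch. It assumes, purely for illustration, that modules are grouped by the directory part of their dotted name; `moduleDir`, `groupByDir` and `groupByDirSorted` are hypothetical names, not the functions used in Oracles.ModuleFiles. `groupBy` from `Data.List` only merges adjacent elements, so equal keys that are separated by another key end up in different groups unless the input is sorted first.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- | Directory part of a dotted module name (hypothetical grouping key):
-- "A.B.C" gives "A.B", and a top-level module such as "Lexer" gives "".
moduleDir :: String -> String
moduleDir m = case break (== '.') (reverse m) of
    (_, [])       -> ""
    (_, _ : rest) -> reverse rest

-- | Group module names that share a directory part.
-- groupBy only merges ADJACENT elements, so unsorted input can
-- produce several groups with the same key.
groupByDir :: [String] -> [[String]]
groupByDir = groupBy ((==) `on` moduleDir)

-- | Sort by the key first, so that all equal keys become adjacent
-- and each key yields exactly one group.
groupByDirSorted :: [String] -> [[String]]
groupByDirSorted = groupByDir . sortOn moduleDir
```

With the input `["A.X", "B.Y", "A.Z"]`, `groupByDir` yields three singleton groups, while `groupByDirSorted` yields one group for `A` and one for `B`. If the caller is required to pass a sorted list instead, the sort can be dropped, but that invariant is then worth documenting next to the `groupBy`.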
@ndmitchell Great, many thanks for reviewing!
|
@snowleopard - leaving the sort out is fine, but I would document that invariant in that area, near to where you rely on it. I might be tempted to sort anyway, since it is a trivial operation, and makes it easier to guarantee correctness. Agreed with your new top-level structure, that makes sense. |
A new description of This is an important oracle whose role is to find and cache module source files. More specifically:
For example, for the
|
👍 |
@ndmitchell What do you think of having oracle test like the one I added in 1136a62? It slightly interferes with the Shake database by adding a request which is not exercised during a normal build. |
@snowleopard - a good idea, I don't think it will harm anything |
Summary of recent changes:
I think this corner of the build system is now much clearer, more robust and more efficient. |
I reviewed the code, a few notes:
Overall looks much cleaner and more understandable. |
@ndmitchell Many thanks!
Regarding
Previously the key was the pair
That's true, but I ran into problems while doing this. I might create a new issue to discuss how to proceed. In any case, here we can actually switch to using |
@snowleopard - that all makes sense. Is the |
@ndmitchell You are right, it is not necessary, but it is a good optimisation: there are a lot of I think I should add a comment about this to the code. |
@snowleopard That makes sense - good idea. |
OK, I think I can close this now. |
Oracles.ModuleFiles is an important oracle in the build system whose role is to find and cache module source files. More specifically (also see update below):
It takes a list of module names modules and a list of directories dirs as arguments.
It computes a list of pairs (A.B.C, dir/A/B/C.extension), such that A.B.C belongs to modules, dir belongs to dirs, and file dir/A/B/C.extension exists.
For example, for the compiler package, given modules = ["CodeGen.Platform.ARM", "Lexer"] and dirs = ["codeGen", "parser"], it produces [("CodeGen.Platform.ARM", "codeGen/CodeGen/Platform/ARM.hs"), ("Lexer", "parser/Lexer.x")].
The implementation is pretty sophisticated and currently has no automated tests: https://github.com/snowleopard/shaking-up-ghc/blob/master/src/Oracles/ModuleFiles.hs.
Let us discuss possible correctness and performance issues here. I will also attempt to decompose the code into testable pieces and add some automated tests.
Updated definition given in #210 (comment):
This is an important oracle whose role is to find and cache module source files. More specifically:
It takes a list of directories dirs and a sorted list of module names modules as arguments.
For each module A.B.C, it returns a FilePath of the form dir/A/B/C.extension, such that dir belongs to dirs, and file dir/A/B/C.extension exists, or Nothing if there is no such file. If more than one matching file is found an error is raised.
For example, for the compiler package: given dirs = ["compiler/codeGen", "compiler/parser"] and modules = ["CodeGen.Platform.ARM", "Lexer", "Missing.Module"], it returns [Just "compiler/codeGen/CodeGen/Platform/ARM.hs", Just "compiler/parser/Lexer.x", Nothing].
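The updated definition can be read as a small executable specification. The sketch below is a pure model, not the oracle itself: it assumes the set of existing files is passed in as a plain list instead of being read from the file system through Shake, and the names `moduleToPath` and `findModuleFiles` are hypothetical. A model like this could serve as a reference when testing a real implementation.

```haskell
import Data.List (stripPrefix)

-- | Turn a dotted module name into a relative path without extension:
-- "A.B.C" gives "A/B/C".
moduleToPath :: String -> FilePath
moduleToPath = map (\c -> if c == '.' then '/' else c)

-- | Pure model of the definition above (an assumption-laden sketch).
-- Arguments: all existing files, the directories to search, the modules.
-- For each module the result is 'Just' the unique matching file,
-- 'Nothing' if there is none, or an error if several files match.
findModuleFiles :: [FilePath] -> [FilePath] -> [String] -> [Maybe FilePath]
findModuleFiles existing dirs = map findOne
  where
    findOne m = case [ f | d <- dirs, f <- existing, matches d m f ] of
        []  -> Nothing
        [f] -> Just f
        fs  -> error ("Multiple source files for module " ++ m ++ ": " ++ unwords fs)

    -- A file matches if it is dir/A/B/C.extension with a non-empty
    -- extension that does not descend into a further directory.
    matches d m f = case stripPrefix (d ++ "/" ++ moduleToPath m ++ ".") f of
        Just ext -> not (null ext) && '/' `notElem` ext
        Nothing  -> False
```

Applied to the compiler example with `compiler/codeGen/CodeGen/Platform/ARM.hs` and `compiler/parser/Lexer.x` as the existing files, the model returns the same three-element list as the definition, with `Nothing` for `Missing.Module`.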