Should PhET transition to a monorepo? #1242

Closed
samreid opened this issue Apr 29, 2022 · 7 comments

@samreid
Member

samreid commented Apr 29, 2022

Today Quick CT showed a false positive error because it pulled code between two consecutive pushes. If we did all development within one repo, commits/pushes/pulls would be truly atomic and this kind of problem wouldn't happen. We have discussed monorepos every now and then in developer meetings, but I thought it would be nice to have an issue to track that discussion and to enumerate the pros and cons.

Dev meeting: March 17, 2022
@jonathanolson: It’s hard to do the feature branch method with as many repos as we have
@samreid: Other companies have gotten around this by having all code under one big mono repo
@chrisklus: Google does this
@pixelzoom: Never worked somewhere that uses one repo for everything, but how about trimming down our repos by grouping common code, phet-io, etc

Pros of a Mono Repo

  • Easier to check out all the code at once
  • Easy to make feature branches. Would this help us get to a "master is more stable" philosophy?
  • Atomic commits/pushes/pulls. Can ensure code is always coherent.
  • dependencies.json gets simpler
  • Precommit hooks will be simpler/faster--only one repo to commit.
  • CT will be simpler and faster since it doesn't need to wait for a stable point
  • Reduce time to initialize a new sim
  • Maintenance releases may get a lot easier, if you don't need to branch dependencies?
  • Daily tasks of "check repo status", "pull all", "clone missing repos" etc. would be more streamlined. Risk of leaving one repo unpushed would be reduced/eliminated

Cons of a Mono Repo

  • Huge initial up front cost to design/implement/migrate. How would we estimate this?
  • How would this impact our maintenance release processes?
  • Would we be merging more?
  • Would the repo be too big/unwieldy? How would we estimate/vet that in advance?
  • Will we have a million branches? (may depend on how maintenance works)
  • Polyrepos help us enforce modularity
  • Would actually need to be a dual repo so we can maintain a private phet-io repo? Maybe we could have a true monorepo for PhET developers, and push a filtered version of the public code.
  • GitHub issue tracking/milestones would need to be redesigned. Or we could keep all the code in 2 repos, but have GitHub repos for issues/milestones?
  • Our subprojects like "scenery", "query string machine", "utterance queue", "axon" etc. make sense as standalone open source projects (?), but we could push filtered versions of these subdirs to separate repos for the community. But it would make community contribution more difficult to propagate back.

Also, please be aware that there are many articles about "monorepo vs polyrepo" with many good points.

I don't think we should take any action on this at the moment, but it may be good if @jonathanolson wants to chime in to round out the pros/cons. If we ever undertake this, it would be a project of epic proportions.

@samreid changed the title from "Should PhET transition to a monorepo? (hint: probably not)" to "Should PhET transition to a monorepo?" on Apr 29, 2022
@pixelzoom
Contributor

pixelzoom commented Apr 29, 2022

To be frank... A single repo for everything is not at all practical. I can't recommend against it strongly enough. Based on past experience working with diverse organizations (startups to Fortune 500 companies), here's what I see: PhET is a growing project, experiencing growing pains. And modularity becomes MORE important, not LESS important, as a project grows. Moving to a mono repo is a giant step to LESS modularity. That said...

Before putting everything in one big repo, I'd consider consolidating common code into a smaller number of repos (possibly even 1 repo), while continuing to keep sims and other "products" in separate repos. See how that goes, if it addresses issues that PhET is having, etc.

If there are still issues... Experiment with using versioned framework repos, instead of always working in master. That's the typical approach used by organizations that have many separate products supported by a common framework. There are definitely costs; merges are probably the biggest cost/hassle. It will feel inconvenient for devs who are used to working in master. And it requires robust project management and scheduling (which PhET currently lacks) to roll out new versions of framework libraries. These costs are the trade-offs of keeping a large/growing development team running smoothly.

Other cons of a mono repo:

  • PhET will lose many forms of modularity: of day-to-day work, of dependencies, of responsibility, of scheduling,...
  • Moving from many repos to 1 repo is a huge risk. And practically speaking, there will be no going back.
  • You'll be skipping over the more traditional approach described above. You can get there from PhET's current organization. Getting there from 1 repo would be much more difficult.
  • Would not be able to version "framework libraries" separately, which is what most software projects do.
  • With separate repos for sims, it's OK if there's a problem or something is badly broken in a sim repo. Is that still the case with a mono repo, or does everything need to be kept clean? If so, that's not practical.

@samreid
Member Author

samreid commented Apr 29, 2022

To clarify what a monorepo might look like, I was picturing a structure like this:

monorepo/
  public/
    axon/
    chipper/
    energy-skate-park/
    geometric-optics/
    scenery/
    sun/
  private/
    phet-io/
    phet-io-sim-specific/

To manage issues + milestones, we would still have GitHub repos for each common code repo and sim repo, like "axon", "scenery", "energy-skate-park", etc. But they would have no code. The monorepo would be a solution to the problems "how can we easily push/pull multiple repos at once?" and "how can we create cross-cutting feature branches easily?" and "how can we prevent CT from running on a test where it only pulled half of the pushes?"

tsc would work the same. chipper builds and grunt would work the same. So I am having trouble understanding how we would lose modularity of "day to day work" or "responsibility" or "scheduling". We would still have responsible-devs, but it would be for directories instead of repos.

Would not be able to version "framework libraries" separately, which is what most software projects do.

We currently have 27 common code repos. It's difficult for me to imagine releasing each of those as a separate versioned product with our team scale, management and priorities. Is it one of our long-term goals to release each of those independently under semantic versioning?

With separate repos for sims, it's OK if there's a problem or something is badly broken in a sim repo. Is that still the case with a mono repo, or does everything need to be kept clean?

I don't see how bugs/lint errors/type errors/etc in one sim could impact an unrelated sim in the monorepo structure described above.

I agree with many of the points in the preceding comment, just wanted to jot down that polyrepos have triggered a number of pain points. It is not win/win to go to monorepo, but I wanted to start tracking the discussion and pros and cons. And maybe we just need to keep building better tooling (like a tool that makes it easy to manage feature branches across N repos).
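
As a rough illustration of the "feature branches across N repos" tooling idea, here is a minimal sketch. It is not an existing PhET tool: the function name, the `GitCommand` type, and the assumption that repos are checked out as sibling directories under `..` are all hypothetical. It only plans the git commands rather than running them.

```typescript
// Hypothetical sketch, not an existing PhET tool: plan the git commands
// needed to create and publish one feature branch in several sibling repos.
type GitCommand = { cwd: string; args: string[] };

function planFeatureBranch( repos: string[], branch: string, root: string = '..' ): GitCommand[] {
  const plan: GitCommand[] = [];
  for ( const repo of repos ) {
    // Assumption: every repo is checked out side by side under `root`.
    const cwd = `${root}/${repo}`;
    plan.push( { cwd: cwd, args: [ 'checkout', '-b', branch ] } );
    plan.push( { cwd: cwd, args: [ 'push', '--set-upstream', 'origin', branch ] } );
  }
  return plan;
}

// Print the plan as shell one-liners instead of executing anything.
const featurePlan = planFeatureBranch( [ 'axon', 'scenery', 'sun' ], 'my-feature' );
for ( const step of featurePlan ) {
  console.log( `(cd ${step.cwd} && git ${step.args.join( ' ' )})` );
}
```

Note that a tool like this still issues N separate pushes, so it would make cross-cutting branches more convenient without making the pushes atomic.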

@zepumph
Member

zepumph commented May 11, 2022

I was interested enough in this idea after a push failed for me last night on 1 of 2 repos and caused CT to fail all night. I read https://www.perforce.com/blog/vcs/what-monorepo because I remembered that Google famously uses one with 86 terabytes of data. Reading this article I am not convinced it is right for our project. It highly recommends against it when using git as the VCS, as it isn't "scalable."

It also feels like a step in the wrong direction for open source. If we get the POSE grant, I feel like having a mono repo would be a step in the wrong direction for creating reusable libraries that could be called an open source environment. Can you imagine the "README" for a sim saying "to run example sim locally, please download the entirety of all simulations."

How would PhET-iO work? It seems like the majority of examples are based in fully proprietary codebases where everyone having access to all code is acceptable. I don't see a solution where PhET-iO can exist in a mono repo, so aren't we immediately saying duo-repo, one for the open source stuff, and one for the private stuff. I read #1242 (comment), but can you actually have a private section of a repository? I don't think you can, at least not with git/GitHub. (I read https://24ways.org/2013/keeping-parts-of-your-codebase-private-on-github/ and it seems unwieldy to host 2 remotes or have private code in branches.)

In general over the last 20 minutes I have convinced myself that it isn't worth the time or energy personally to investigate this further. Sometimes with these larger issues it is harder to know how much discussion is enough to come to a consensus. @samreid I'm happy to be convinced, let me know if you want to discuss further.

@pixelzoom
Contributor

Re questions in #1242 (comment) ...

We currently have 27 common code repos. It's difficult for me to imagine releasing each of those as a separate versioned product with our team scale, management and priorities. Is it one of our long-term goals to release each of those independently under semantic versioning?

I'm not advocating versioning 27 repos. I'm suggesting that, as a scalable alternative to a mono repo, common-code repos could be combined into a manageable number of repos that could be versioned. PhET-iO repos could be combined into 1 private repo and versioned. And sims could remain in their own repos, versioned as they currently are.

I don't see how bugs/lint errors/type errors/etc in one sim could impact an unrelated sim in the monorepo structure described above.

I have the entire mono repo checked out. I want to run lint or tsc for my sim and its dependencies. If you create new build tools that know how to do that (for only my sim and its dependencies) then great. If not, then I'll have to lint/tsc everything, and I'm going to see lint and tsc errors in code that's unrelated to my sim.

@samreid
Member Author

samreid commented May 11, 2022

My responses will be somewhat brainstormy and I'm not strongly advocating this, just thinking it through. Mainly I'm trying to ask "can we get atomic pushes and pulls without causing too many other problems"?

It highly recommends against it when using git as the VCS, as it isn't "scalable."

The article also says things like: "Using a mono repository is a good idea for many companies. " and lists many advantages of the monorepo.

It also feels like a step in the wrong direction for open source. If we get the POSE grant, I feel like having a mono repo would be a step in the wrong direction for creating reusable libraries that could be called an open source environment.

Agreed!

Can you imagine the "README" for a sim saying "to run example sim locally, please download the entirety of all simulations."

Yes, but I can also imagine that it would be easier for third parties to clone one repo instead of two dozen.

How would PhET-iO work? It seems like the majority of examples are based in fully proprietary codebases where everyone having access to all code is acceptable. I don't see a solution where PhET-iO can exist in a mono repo, so aren't we immediately saying duo-repo, one for the open source stuff, and one for the private stuff.

I think we would have one complete repo with everything, including phet-io, then mirror the public part of it using filter-branch. But this would make contributions from 3rd parties difficult or impossible?

In general over the last 20 minutes I have convinced myself that it isn't worth the time or energy personally to investigate this further.

I agree this is probably not going to be in our best interest. But it seems good to understand (a) what the costs of the multi-repo approach are, and (b) why it is still preferable to the alternative.

I'm suggesting that, as a scalable alternative to a mono repo, common-code repos could be combined into a manageable number of repos that could be versioned.

That makes sense and may work well for the POSE grant.

I have the entire mono repo checked out. I want to run lint or tsc for my sim and its dependencies. If you create new build tools that know how to do that (for only my sim and its dependencies) then great.

Our existing tools will do that nicely. We won't put code for circuit construction kit and geometric optics in the same directory or anything.
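
To make "only my sim and its dependencies" concrete, here is a minimal sketch of computing that set of directories. The directory names come from the layout earlier in this thread, but the dependency edges in `directDeps` are invented for illustration, and `dependencyClosure` is a hypothetical helper, not part of chipper or any existing PhET tooling.

```typescript
// Hypothetical dependency graph: directory -> directories it depends on directly.
// The edges are made up for illustration only.
const directDeps: Record<string, string[]> = {
  'geometric-optics': [ 'scenery', 'sun', 'axon' ],
  'energy-skate-park': [ 'scenery', 'axon' ],
  sun: [ 'scenery', 'axon' ],
  scenery: [ 'axon' ],
  axon: []
};

// Collect `start` plus everything reachable from it (its transitive dependencies).
function dependencyClosure( start: string, graph: Record<string, string[]> ): string[] {
  const seen: Record<string, boolean> = {};
  const stack: string[] = [ start ];
  while ( stack.length > 0 ) {
    const dir = stack.pop() as string;
    if ( seen[ dir ] === true ) {
      continue; // already visited, which also guards against cycles
    }
    seen[ dir ] = true;
    for ( const dep of graph[ dir ] || [] ) {
      stack.push( dep );
    }
  }
  return Object.keys( seen ).sort();
}

// Directories a lint or tsc run for one sim would need to cover.
console.log( dependencyClosure( 'geometric-optics', directDeps ) );
```

In this sketch an unrelated sim such as energy-skate-park never enters the result, so its lint or type errors would not show up when checking geometric-optics.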

@liammulh
Member

liammulh commented Jul 8, 2022

In phetsims/rosetta#283 (comment), I wrote up some notes on Yarn that might be helpful here:

Overview

  • Started in 2016 at Facebook as replacement for NPM.
  • Goal was to create more secure, stable, and efficient package manager.
  • Initially added features NPM didn't have.
  • NPM has since implemented some of these features.

Killer Features of Yarn

  • Generally much faster than NPM.
  • Much better support for de-duplicating packages in monorepos.

Yarn V1

  • Yarn V1 is more akin to NPM than Yarn V2. It uses a node_modules directory and one or more package.json files.

Yarn V2

  • By default, V2 abandons node_modules in favor of .yarn/cache.
    • Q: Why does it do this?
    • A: node_modules is huge, and it negatively impacts the performance of the package manager.
    • Q: How does Yarn V2 resolve dependencies?
    • A: It has a file called .pnp.cjs that contains two maps: one map links package names and versions to their location on the disk, and the other links package names and versions to their list of dependencies.
  • This new scheme allows for "Zero Install".
    • Q: What?
    • A: Configure PNP to resolve dependencies via the .yarn/cache directory rather than the node_modules directory, and check .yarn/cache into version control.
    • Q: Wait, isn't that the same thing as checking node_modules into version control?
    • A: No! To give you an idea, a node_modules folder of 135k uncompressed files (for a total of 1.2GB) gives a Yarn cache of 2k binary archives (for a total of 139MB). The .yarn/cache directory contains exactly one (compressed) file per package, as opposed to node_modules, which contains a gigantic amount of files.
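
The two-map idea in the notes above can be sketched as follows. This is not the real .pnp.cjs format or Yarn's API: the package names, versions, cache paths, and the `resolveLocation` function are all made up to show how a lookup could work without a node_modules directory.

```typescript
// Illustrative only: NOT the real .pnp.cjs layout. All names and paths are hypothetical.

// Map 1: "name@version" -> location on disk.
const locations: Record<string, string> = {
  'app@1.0.0': './',
  'example-lib@2.1.0': '.yarn/cache/example-lib-2.1.0.zip',
  'example-util@1.3.0': '.yarn/cache/example-util-1.3.0.zip'
};

// Map 2: "name@version" -> its dependencies (dependency name -> resolved version).
const dependencies: Record<string, Record<string, string>> = {
  'app@1.0.0': { 'example-lib': '2.1.0' },
  'example-lib@2.1.0': { 'example-util': '1.3.0' },
  'example-util@1.3.0': {}
};

// Resolve a request made by `issuer` using only the two maps.
function resolveLocation( issuer: string, request: string ): string {
  const declared = dependencies[ issuer ];
  if ( !declared ) {
    throw new Error( `unknown package: ${issuer}` );
  }
  const version = declared[ request ];
  if ( !version ) {
    throw new Error( `${issuer} does not declare a dependency on ${request}` );
  }
  const location = locations[ `${request}@${version}` ];
  if ( !location ) {
    throw new Error( `no location recorded for ${request}@${version}` );
  }
  return location;
}

console.log( resolveLocation( 'app@1.0.0', 'example-lib' ) );
```

Because every lookup goes through the issuer's declared dependencies, this sketch rejects a request for a package the issuer never listed, even if that package is present in the cache.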

@samreid
Member Author

samreid commented Nov 10, 2022

This has been a good discussion, and I agree we should not transition to a monorepo. There are side issues related to other levels of consolidation, and other issues about dealing with the atomicity of commits, and versioning common code repos together. Closing.

@samreid samreid closed this as completed Nov 10, 2022