- Dependency Managers Don’t Manage Your Dependencies
In the previous post we established that:
Adding many large dependencies tends to slow down install times significantly, and make all operations slower for everyone globally, even if individuals only use a subset of tools in a project.
We can look at this problem another way: How long does it take to get started on a project after checking out the repository or when dependencies require an update after rebasing? I’ve seen this process take minutes when it can be seconds.
Continuous Integration (CI) pipelines are a concrete example of this: You may have multiple workflows that each verify a different aspect of your project, yet all workflows tend to materialize the entire dependency tree. Imagine only installing the dependencies you need for each task! When we designed Yarn’s workspaces feature, we were guided by solving issues related to organizing monorepos and keeping compatibility with Lerna. While we have a neat separation of concerns across packages, we don’t leverage that separation to improve the performance of dependency installation.
The node_modules installation process should be per-workspace, incrementally installing more dependencies based on the operations executed in a repository.
So let’s take all of this in another direction: What would it look like if we eliminated the ongoing need to install dependencies from our development iteration cycle completely? My solution: check all of our dependencies into source control and make all development tools available as pre-compiled binaries. This would speed up all repository operations (like getting everything up-to-date after a rebase) and reduce our in-band reliance on dependency management tools.
- DevDependencies were a mistake
- Checking third-party product code into version control
- Build Zero-Overhead Tooling
The idea of having product and tooling code integrated made a lot of sense early in the Node.js/front-end ecosystem and still makes sense for libraries. In the past, entries in the dependencies field usually meant that code is part of the actual production build artifact, while entries in devDependencies are only used during development.
However, for applications, as the ecosystem matured, it has become evident that this system no longer serves its purpose.1 Nowadays, product dependencies are usually just inputs into a complex compilation pipeline, type system, or test framework. Quite often, there is no meaningful distinction between what is part of devDependencies and what ships to production. A simple example is a frontend UI library installed in devDependencies even though it is compiled into the bundle that ships to users.
For applications, I propose a slightly different way of thinking about dependencies that leads to a clearer distinction between product and development dependencies: Everything that behaves or is used as if it were first-party code should be a product dependency. Treat it like any other code that your team is writing, and don’t think of node_modules for product code as a magic directory. The code and type definitions for your UI library go into production dependencies, while your compiler toolchain remains a development dependency. This perspective may be completely obvious to you, but I encourage you to open the package.json file of a large application; you will likely find some packages violating this principle.
Now that we have a clear rule around what goes into devDependencies, we can split them into two separate folders, each with its own package.json: one for product and one for tooling. At the end of this process, we’ll end up with one smaller node_modules folder with all product-related code and one large node_modules folder containing all of the tools that operate on the product and third-party product code, which leads us directly to the next step:
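Before moving on, the split above can be sketched on disk. This is a minimal, hypothetical layout: the workspace names (product, tooling) and the specific packages (react as a product dependency; jest and @babel/core as tooling, both mentioned later in this post) are illustrative, not prescriptive.

```shell
# Two workspaces, each with its own package.json and, after an
# install, its own node_modules folder.
mkdir -p product tooling

# Product workspace: third-party code that ships to users.
cat > product/package.json <<'EOF'
{
  "name": "product",
  "dependencies": {
    "react": "^18.0.0"
  }
}
EOF

# Tooling workspace: everything that only operates on the product.
cat > tooling/package.json <<'EOF'
{
  "name": "tooling",
  "dependencies": {
    "jest": "^29.0.0",
    "@babel/core": "^7.0.0"
  }
}
EOF

ls product tooling
```

With a workspace-aware package manager, an install in one workspace no longer has to materialize the other workspace’s dependency tree.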
Now that we have taken the initial steps to separate product and tooling dependencies, we can go a step further and check all of our product dependencies into version control. This runs counter to the common practice of keeping node_modules out of source trees. Historically, projects with checked-in node_modules have been painful to manage. However, only checking product dependencies into source control avoids most of the downsides. Let’s analyze some of the trade-offs:
Mitigation: Product dependencies usually make up only about 10-30% of the total size of a node_modules folder. Because product dependencies are compiled into production bundles, it is unusual for them to be more than an order of magnitude larger than first-party code. This means the checked-in node_modules folder should not be much larger than the first-party code already in the repository, resulting in at most roughly twice as much code. To keep these dependencies in check, you can use Yarn’s flat option to ensure you are only using a single version of each package, and Yarn’s autoclean feature combined with a strictly managed custom exclusion list helps remove unused and unnecessary files. Additional repository size can be the biggest downside of this strategy, so I recommend analyzing the current size and predicting future growth before committing to checking third-party dependencies into version control.
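The exclusion list mentioned above can be sketched as a .yarnclean file, which Yarn’s autoclean feature reads. The specific patterns below are examples to tune per repository, not a recommended set.

```shell
# A curated exclusion list for Yarn's autoclean feature. Files in
# node_modules matching these patterns are stripped on clean.
cat > .yarnclean <<'EOF'
# test files and fixtures that never ship
__tests__
test
tests
# docs and examples that never ship
docs
example
examples
*.md
EOF

# With this file committed, `yarn autoclean --force` removes the
# matching files from node_modules (command not run in this sketch).
```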
Mitigation: The difference between third-party dependencies for product code and tooling code is that the dependencies for product code are deployed with applications as if they were first-party code. Given that this code impacts the size of an application, it is unlikely to grow by orders of magnitude or to significantly outpace the creation of first-party code. Further, teams have an incentive to reduce the size of their production bundles rather than increase them substantially.
Mitigation: Managing a checked-in node_modules folder is painful, especially when upgrading large dependencies or large trees of dependencies, for example, Babel and Jest, which each consist of dozens of packages. However, since we are exclusively checking in product dependencies, we are unlikely to encounter such tightly connected package trees. Most of the time, people will only add or update a small number of third-party product dependencies.
Mitigation: This problem can be avoided by building a CI step that verifies the integrity of the node_modules folder and prevents people from patching files directly.
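Such a CI step can be sketched with nothing but coreutils: record a manifest of content hashes when dependencies change, and fail the build if any checked-in file deviates from it. The left-pad package below is a synthetic stand-in so the sketch runs anywhere; a real setup would hash the actual folder.

```shell
# Create a synthetic checked-in dependency for the demonstration.
mkdir -p node_modules/left-pad
echo "module.exports = s => s;" > node_modules/left-pad/index.js

# When dependencies are intentionally updated, regenerate and commit
# this manifest of content hashes.
find node_modules -type f | sort | xargs sha256sum > .node_modules.sha256

# In CI: verify nothing was patched by hand since the last install.
if sha256sum --check --quiet .node_modules.sha256; then
  echo "node_modules integrity OK"
else
  echo "node_modules was modified directly" >&2
  exit 1
fi
```

A real pipeline could also simply re-run the install and fail on a non-empty git diff, which catches the same class of manual edits.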
There are also upsides to checking in product dependencies: There is usually little scrutiny of newly added dependencies during code reviews. Somebody may add a single line to a package.json file that pulls in hundreds of other dependencies. This problem is exacerbated, and easy to miss, because GitHub hides changes to yarn.lock files by default.
When third-party dependencies are materialized in the repository via yarn install, changes and additions are visible during the code review stage. This makes reviewers aware of large trees of transitive dependencies. If somebody adds a thousand files just to use a single utility function, it makes sense to apply more scrutiny during code review and perhaps even recommend alternative solutions.
Checking node_modules into the repository will also reduce the reliance on a single package manager and increase option value. In this case, it enables the option to switch to another package manager with less work, as we eliminate Yarn from the critical path and only use the output of the install operation (the node_modules folder). From Principles of Developer Experience:
[Maximizing Option Value] is about retaining or gaining option value, which means any change to a system should unlock more options for improvements and significant future changes. […] There is usually little option value embedded in the design of existing systems. If we keep option value in mind when redesigning infrastructure, we can naturally adapt to new requirements in the future.
To summarize, we can mitigate many downsides and gain significant upsides by taking more ownership of the dependency management process with this strategy. We are not done yet, though! Now that product and tooling dependencies are neatly separated and product dependencies are part of version control, one more step will make everything fall into place:
To get to a state where we can immediately start working after checking out a repository or after rebasing, we need fast access to all of our tools: the bundling infrastructure, web server, test frameworks, linter, type checker, and everything else. Installing them as part of the node_modules install process is slow: the vast majority of time is spent resolving dependencies and copying tens of thousands of files from tarballs. Even after everything is installed, the tools are slow to start because many of them load thousands of source files into memory on each run. The solution is to compile them into binaries and vendor them into your projects so they don’t require installing and running third-party dependencies from source.
Various tools are already moving the ecosystem in this direction, like deno’s compile command, which helps create executables. Next.js is also compiling many of its dependencies into pre-compiled bundles, which has already had a meaningful impact on its install times and startup times. You can pre-compile tools in one of the following ways:
- Use Vercel’s pkg to create optimized binaries for your tool.
- Use deno compile to produce binaries for tools written using deno.
The next step will be to deploy the tool. Here are some example strategies:
- Maintain a private homebrew tap.
- Build a custom system that builds packages on GitHub and transparently downloads and executes them.
- Not recommended: Check binary artifacts into your repository.2
While we’ll still have an install process, it is usually an order of magnitude faster than installing all the source files. The process can be hidden from the user by integrating it into the tools themselves so they manage their own updates. It’s essential to version the tools together with the state of the repository: A version or hash must be committed to the repository every time a tool changes. This way, you can roll back tools by updating the version or hash in the repository if there are issues, and navigating to older commits will use the older versions of the tools instead of a possibly incompatible newer one.
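The version-pinning idea can be sketched as a small wrapper script. Everything here is hypothetical: the tool name (lint), the pin file (tools/lint.version), and the cache directory; the download step is simulated so the sketch runs anywhere.

```shell
mkdir -p tools .tool-cache

# The pinned version lives in the repository, committed alongside
# the code, so older commits automatically pin older tool versions.
echo "1.4.2" > tools/lint.version

# Simulate an already-downloaded pre-compiled binary for that version.
mkdir -p .tool-cache/lint-1.4.2
printf '#!/bin/sh\necho "lint 1.4.2"\n' > .tool-cache/lint-1.4.2/lint
chmod +x .tool-cache/lint-1.4.2/lint

# The wrapper developers actually invoke (e.g. committed as tools/lint):
version="$(cat tools/lint.version)"
binary=".tool-cache/lint-$version/lint"
if [ ! -x "$binary" ]; then
  echo "would download lint $version here"  # fetch step omitted in this sketch
fi
"$binary"
```

Rolling back a broken tool is then an ordinary revert of the commit that bumped tools/lint.version.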
Not all scripts and tools need to be pre-compiled, only the ones used by a large population of developers or tools with specific performance constraints. Ideally, you can separate tools that aren’t used often into a different workspace so their dependencies are only installed when the tool is used. An excellent example is end-to-end testing frameworks: they usually come with many dependencies, but only a few developers run end-to-end tests locally. Consider isolating these tools into a separate part of the repository and writing a script that automatically installs and updates their dependencies when developers invoke the tool.
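A lazy-install wrapper for such a rarely-used workspace might look like the sketch below. The e2e workspace name is hypothetical, and the actual package-manager invocation is stubbed out so the sketch is self-contained.

```shell
# One-time setup of the isolated workspace for the demonstration.
mkdir -p e2e
[ -f e2e/package.json ] || echo '{ "name": "e2e" }' > e2e/package.json

run_e2e() {
  # Install the workspace's dependencies only on first invocation.
  if [ ! -d e2e/node_modules ]; then
    echo "installing e2e dependencies..."
    # In a real setup: (cd e2e && yarn install)
    mkdir -p e2e/node_modules
  fi
  echo "running end-to-end tests"
}

run_e2e   # first run triggers the install
run_e2e   # subsequent runs skip it
```

Developers who never run end-to-end tests locally never pay for these dependencies at all.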
I’m not aware of any ready-to-use solution that unifies both the compilation and deployment process into a smooth experience. If you are building one, please let me know!
With all of the steps above applied, a repository can be checked out or rebased, and engineers can immediately start developing inside of it. Engineers gain more control over the code they deploy to production, spend much less time waiting for dependencies to install, and benefit from a better separation of concerns. Further, because the tooling for bundling, type-checking, testing, and linting is separated into different packages, each can be improved and managed separately.
Here, libraries are defined as packages that end up being published and consumed by other libraries or applications. Applications are defined as repositories that use something like Yarn workspaces to manage dependencies. ↩
I do not recommend this approach because it will negatively affect repository performance. It is only acceptable if the binary in question rarely changes, say once a year. ↩