Nix solves this as a byproduct (as it does with many things) of its design. You can have your tests be a "build", where the build succeeds if your tests pass. Any build can be cached, which means you're essentially caching tests you've already run. Since Nix is deterministic, you never have to rerun a test until something that could change its evaluation actually changes.
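(If it helps to see the principle without any Nix: the whole trick is keying a test's result on a hash of everything that could affect it. A minimal Python sketch of that idea - the cache directory and command layout are made up, it's the shape that matters:)

    import hashlib
    import json
    import subprocess
    from pathlib import Path

    CACHE = Path(".test-cache")  # hypothetical local cache dir, standing in for the Nix store

    def input_hash(test_cmd: list[str], inputs: list[Path]) -> str:
        """Hash everything that could change the outcome: the command line
        and the exact contents of every declared input file."""
        h = hashlib.sha256(json.dumps(test_cmd).encode())
        for f in sorted(inputs):
            h.update(f.read_bytes())
        return h.hexdigest()

    def run_cached(test_cmd: list[str], inputs: list[Path]) -> bool:
        """Reuse the previous result if the declared inputs are bit-for-bit identical."""
        CACHE.mkdir(exist_ok=True)
        marker = CACHE / input_hash(test_cmd, inputs)
        if marker.exists():                      # cache hit: this "build" already ran
            return marker.read_text() == "pass"
        passed = subprocess.run(test_cmd).returncode == 0
        marker.write_text("pass" if passed else "fail")
        return passed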
fire_lake 20 days ago [-]
Nix is mostly deterministic due to lots of effort by the community but it’s hard to maintain full determinism in your own tests.
IshKebab 20 days ago [-]
Basically the only sane answer is Bazel (or similar). We currently use path based test running because I couldn't convince people it was wrong, and it regularly results in `master` being broken.
I don't really have a great answer for when that stops scaling - especially for low level things e.g. when Android changes bionic do they run every test because everything depends on it? Technically they should.
The only other cool technique I know of is coverage-based test ranking, but as far as I know it's only used in silicon verification. Basically if you have more tests than you can run, only run the top N based on coverage metrics.
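(For anyone curious, the usual version of that ranking is just greedy set cover over coverage data. A toy sketch, with made-up coverage sets:)

    def rank_by_coverage(coverage: dict[str, set[str]], budget: int) -> list[str]:
        """Greedy ranking: repeatedly pick the test that adds the most
        not-yet-covered lines until the budget is spent or nothing adds coverage."""
        covered: set[str] = set()
        ranked: list[str] = []
        remaining = dict(coverage)
        while remaining and len(ranked) < budget:
            best = max(remaining, key=lambda t: len(remaining[t] - covered))
            if not remaining[best] - covered:
                break                      # every remaining test is redundant
            covered |= remaining.pop(best)
            ranked.append(best)
        return ranked

    # made-up coverage data: run only the top 2 tests
    print(rank_by_coverage({
        "test_a": {"foo.c:1", "foo.c:2"},
        "test_b": {"foo.c:2", "bar.c:9"},
        "test_c": {"foo.c:1"},
    }, budget=2))                          # -> ['test_a', 'test_b']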
turboponyy 20 days ago [-]
There's always the escape hatch of marking a derivation as impure.
However, most tests should really have no reason to be impure by design. And tests that do depend on side effects might as well be re-evaluated every time.
fire_lake 19 days ago [-]
The problem is it’s hard to know which tests are impure.
ay 20 days ago [-]
The better you can describe the interdependencies between the components, the more chances the selective approaches have.
However, often if you knew about a given dependency, you might have avoided bugs in the first place!
A simple scenario to illustrate what I have in mind: a system with two plugins, A and B. They provide completely independent functionality, and are otherwise entirely unrelated.
Plugin A adds a new function which allocates a large amount of memory. All tests for A pass. All tests for B pass. The tests for B when A is loaded fail.
Turns out A has done two things:
1) the new memory allocation together with the existing memory allocations in B causes an OOM when both are used.
2) the new function addition offset the function table in the main program, and plugin B was relying on a hardcoded function index which, by sheer chance, hadn't changed for years.
Those are tales based on real-world experience, which made me abandon the idea for the project I am working on (VPP) - the trade-offs didn't seem to be worth it. For some other scenario they may be different though, so thanks for looking into this issue!
deathanatos 20 days ago [-]
Path-based selection of tests, in every single CI system I have ever seen it implemented in, is wrong. TFA thankfully gets the "run the downstream dependent tests" part right, which is the biggest miss, but even then, you can get the wrong answer.
Say I have the following: paths a/* need to run tests A, and paths b/* need to run tests B. I configure the CI system as such. Commit A' is pushed, changing paths under a/* only, so the CI runs tests A, only. The tests fail. A separate engineer, quickly afterwards (perhaps, let's say, even before the prior commit has finished; this could just be two quick merges) pushes commit B', changing paths under b/* only. The CI system runs tests B only, and they all pass, so the commit is marked green incorrectly. Automated deployment systems, or other engineers, proceed to see passing tests, and use a broken but "green" build.
Since I rarely ever see the downstream-tests requirement done correctly, and almost always see the "successive miss" bug, I'm convinced for these reasons that path-based is just a broken approach. I think a better approach is to a.) compute your inputs to the tests, and cache results, and b.) have the CI system be able to suss out flakes and non-determinism. But I think a.) is actually quite difficult (enough to be one of the hardest problems in CS, remember?) and b.) is not at all well supported by the current CI systems.
(I have seen both of these approaches result in bad (known bad, had the tests run!) builds pushed to production. The common complaint is "but CI is slow" followed by a need to do something, but without care towards the correctness of that something. Responsibility for a slow CI is often diffused across the whole company, managers do not want to do the hard task of getting their engineers to fix the tests they're responsible for, since that just doesn't get a promotion. So CI remains slow, and brittle.)
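(On b.), the crude flake triage most teams end up hand-rolling is "rerun failures, and anything that flips gets reported and quarantined instead of blocking the build". A sketch, assuming each test is invocable as a plain command:)

    import subprocess

    def classify(test_cmd: list[str], retries: int = 2) -> str:
        """'pass', 'fail', or 'flaky': a test that fails and then passes on a
        retry is flagged as flaky rather than failing the build outright."""
        if subprocess.run(test_cmd).returncode == 0:
            return "pass"
        for _ in range(retries):
            if subprocess.run(test_cmd).returncode == 0:
                return "flaky"
        return "fail"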
emidln 20 days ago [-]
It's possible to use bazel to do this. You need to be very explicit (The Bazel Way(tm)), but in exchange, you can ask the graph for everything that is an rdep of a given file. This isn't always necessary in bazel (if you avoid weird integration targets, deploy targets, etc) where `bazel test //...` generally does this by default anyway. It's sometimes necessary to manually express due to incomplete graphs, tests that are not executed every time (for non-determinism, execution cost, etc), and a few other reasons but at least it's possible.
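(The query itself is short. `rdeps` and `kind` are standard Bazel query functions; the //lib:core target below is made up, and in practice you'd feed in whatever labels own your changed files:)

    import subprocess

    def affected_tests(changed_label: str) -> list[str]:
        """Every test target that transitively depends on `changed_label`."""
        query = f'kind(".*_test", rdeps(//..., {changed_label}))'
        result = subprocess.run(["bazel", "query", query],
                                check=True, capture_output=True, text=True)
        return result.stdout.split()

    tests = affected_tests("//lib:core")          # hypothetical target
    if tests:
        subprocess.run(["bazel", "test", *tests], check=True)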
deathanatos 20 days ago [-]
Yeah, bazel is sort of the exception that proves the rule, here. I wish it did not have such an absurd learning curve; I've found it next to impossible to get started with it.
danpalmer 20 days ago [-]
Bazel is weird. I work at Google, and it's fundamental to our engineering, and I love it for that, but to me it's an internal tool. Nearly all Bazel code I work with is internal (most open source uses of it I see aren't doing enough customisation to be worth it in my opinion), it integrates into so many parts of our engineering workflow, and uses so many internal services like our build farm.
I'm not sure I see much point in using it outside of Google. Maybe if you're a company within 1/10th the size and have a lot of Xooglers? It seems like the sort of thing where most companies don't need it and therefore shouldn't use it, and those that do need it probably need to build their own that's right for them.
emidln 20 days ago [-]
It worked well enough for my last company. It does require a team to teach it and to do the heavy lifting on custom rules and tooling. I'd rather do that than worry about whether my c++ that was exposing me to millions/billions of risk might have skipped some tests in our haste for an intraday release.
esafak 20 days ago [-]
What would you use for a polyglot project outside of Google?
pianoben 20 days ago [-]
Not only is it hard to get started with Bazel, it's hard to continue with Bazel even once you do. Unless you are using Google practices (say, vendoring and compiling all dependencies), you will certainly end up mired in poorly-maintained third-party tools that will cause no end of fun side-quests.
My work became a lot more, ahem, linear, once I moved us off of Bazel.
Some of this is pretty simple, though, at the coarse-grained level. If you have a frontend and a backend, and you change the backend, you run all backend unit tests, backend component tests, and the end to end tests. If you change the frontend, you run the frontend unit tests, the frontend component tests, and the end to end tests.
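(At that coarse-grained level it really is just a prefix table. A sketch with placeholder prefixes and suite names, falling back to "run everything" for paths it doesn't recognise:)

    # Placeholder mapping of path prefixes to the suites they should trigger.
    SUITES = {
        "backend/":  ["backend-unit", "backend-component", "e2e"],
        "frontend/": ["frontend-unit", "frontend-component", "e2e"],
    }

    def suites_for(changed_files: list[str]) -> set[str]:
        """Union of suites for every touched prefix; unknown paths run everything."""
        selected: set[str] = set()
        for path in changed_files:
            groups = [s for prefix, s in SUITES.items() if path.startswith(prefix)]
            if not groups:
                return {suite for group in SUITES.values() for suite in group}
            for group in groups:
                selected.update(group)
        return selected

    print(suites_for(["backend/api/handlers.py"]))
    # -> {'backend-unit', 'backend-component', 'e2e'}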
To stop the problem you mentioned you can either tick the box in Github that says only up to date branches can merge, or in Gitlab you can use merge trains. "Running tests on an older version of the code" is a bigger problem than just this case, but in all cases I can think of enabling those features solves it.
deathanatos 19 days ago [-]
> To stop the problem you mentioned you can either tick the box in Github that says only up to date branches can merge
No, that checkbox doesn't save you. The example above still fails: the second branch, B, after rebase, still only touches files in b/*, and thus still fails to notice the failure in A. The branch CI run would be green, and the merge would be green, both false positives.
agos 20 days ago [-]
what if you ran tests based on the paths touched by the branch, instead of the single commit?
deathanatos 19 days ago [-]
The example assumes two branches with a single commit being merged to the repo HEAD, so "testing the branch" and "testing the commit" are equivalent / testing the whole branch does not save you.
(But if you only test the last commit — and some implementations of "did path change?" in CI systems do this — then it is more incorrect.)
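(For concreteness, these are the two diffs CI path filters tend to conflate - the whole branch against its merge base versus the tip commit only. In the example above both sets still contain only b/* paths, which is exactly why neither saves you:)

    import subprocess

    def changed_paths(rev_range: str) -> set[str]:
        out = subprocess.run(["git", "diff", "--name-only", rev_range],
                             check=True, capture_output=True, text=True).stdout
        return set(out.splitlines())

    # the whole branch relative to its merge base with the target branch
    branch_paths = changed_paths("origin/main...HEAD")
    # only the tip commit (what some CI path filters actually look at)
    tip_paths = changed_paths("HEAD~1..HEAD")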
motorest 20 days ago [-]
The premise of this article sounds an awful lot like a solution desperately searching for a problem.
The scenario used to drive this idea is a slippery slope fallacy that tests can take over an hour to run after years. That's rather dubious, but it also leaves out the fact that tests can be run in parallel. In fact, that is also a necessary condition of selective testing. So why bother introducing yet more complexity?
To make matters worse, if your project grows so large that your hypothetical tests take over an hour to run, it sounds like the project would already be broken down into modules. That, alone, already allows tests to run only when a specific part of the project changes.
So it's clear that test run time is not a problem that justifies throwing complexity to solve it. Excluding the test runtime argument, is there anything at all that justifies this?
atq2119 20 days ago [-]
> To make matters worse if your project grows so large that your hypothetical tests take over an hour to run, it sounds like the project would already be broken down into modules.
Story time so that you can revisit your assumptions.
Imagine your product is a graphics driver. Graphics APIs have extensive test suites with millions of individual tests. Running them serially typically takes many hours, depending on the target hardware.
But over the years you invariably also run across bugs exposed by real applications that the conformance suites don't catch. So, you also accumulate additional tests, some of them distilled versions of those triggers, some of them captured frames with known good "golden" output pictures. Those add further to the test runtime.
Then, mostly due to performance pressures, there are many options that can affect the precise details of how the driver runs. Many of them are heuristics auto-tuned, but the conformance suite is unlikely to hit all the heuristic cases, so really you should also run all your tests with overrides for the heuristic. Now you have a combinatorial explosion that means the space of tests you really ought to run is at least in the quadrillions.
It's simply infeasible to run all tests on every PR, so what tends to happen in practice is that a manually curated subset of tests is run on every commit, and then more thorough testing happens on various asynchronous schedules (e.g. on release branches, daily on the development branch).
I'm not convinced that the article is the solution, but it could be part of one. New ideas in this space are certainly welcome.
lihaoyi 20 days ago [-]
> So it's clear that test run time is not a problem that justifies throwing complexity to solve it.
There's a lot I can respond to in this post, but I think the bottom line is: if you have not experienced the problem, count your blessings.
Lots of places do face these problems, with test suites that take hours or days to run if not parallelized. And while parallelization reduces latency, it does not reduce costs, and test suites costing 10, 20, or 50 USD every time you update a pull request are not uncommon either.
If you never hit these scenarios, just know that many others are not so lucky
motorest 20 days ago [-]
> There's a lot I can respond to in this post, but I think the bottom line is: if you have not experienced the problem, count your blessings.
You don't even specify what the problem is. Again, this is a solution searching for a problem, and one you can't even describe.
> Lots of places do face these problems, with test suites that take hours or days to run if not parallelized.
What do you mean "if not parallelized"? Are you running tests sequentially and then complaining about how long they take to run?
I won't even touch on the red flag which is the apparent lack of modularization.
> And while parallelization reduces latency, it does not reduce costs, and test suites taking 10, 20, or 50USD every time you update a pull request are not uncommon either
Please explain exactly how you managed to put together a test suite that costs up to 50€ to run.
I assure you the list of problems and red flags you state along the way will never even feature selective testing as a factor or as a solution.
TypingOutBugs 20 days ago [-]
> What do you mean "if not parallelized"? Are you running tests sequentially and then complaining about how long they take to run?
Some CI runners are single core unless you pay more, some tests are very hard to parallelise (integration, E2E), some test code bases are large and unwieldy and would require a lot of cost to improve to allow parallelisation.
> Please explain exactly how you managed to put together a test suite that costs up to 50€ to run.
If you need to run a hefty windows runner on GitHub it’s $0.54 per minute, so a 90 minute full suite will cost $50.
You could be better with different tests running at PR/Nightly but if quality is a known issue some orgs will push to run total test coverage each PR (saw this at a large finance company).
lihaoyi 20 days ago [-]
Notably Github Actions runners are about 4x more expensive than AWS on demand, and windows is also more expensive than Linux.
50 USD gets you about 960 Linux core-hours on AWS. That's a lot more than 90 minutes on one Windows/GHA box, but not so much that someone running a lot of unit/integration/e2e tests in a large codebase won't be able to use it.
motorest 20 days ago [-]
> Some CI runners are single core unless you pay more
I don't know which hypothetical CI runner you have in mind. Back in reality, most CI/CD services bill you on time spent running tests, not on "cores". Even core count is reflected as runtime multipliers. Moreover, services like GitHub not only offer a baseline of minutes per month, but also provide support for self-hosted runners that cost you nothing.
> some tests are very hard to parallelise (integration, E2E)
No, that's simply false and fundamentally wrong. You can run any test on any scenario independently of any other test from another scenario. You need to go way out of your way to screw up a test suite so badly you have dependencies between tests covering different sets of features.
> If you need to run a hefty windows runner on GitHub it’s $0.54 per minute, so a 90 minute full suite will cost $50.
This argument just proves you're desperately grasping at straws.
The only Windows test runner from GitHub that costs that is their largest, most expensive runner: Windows 64-core runner. Do you need a Threadripper to use the app you're testing?
Even so, that hypothetical cost is only factored in if you run out of minutes from your plan, and you certainly do not order Windows 64-core runners from your free tier plan.
Even in your hypothetical cost-conscious scenario, GitHub actions do support self-hosted runners, which cost zero per minute.
> You could be better with different tests running at PR/Nightly (...)
None of the hypothetical scenarios you fabricated pose a concern. Even assuming you need a Windows machine with 64 processor cores to do an E2E test run of your app, the most basic intro tutorial on automated testing and test pyramids mentions how these tests can, and often do, run on deployments to pre-prod and prod stages. This means that anyone following the most basic intro tutorial on the topic will manage to sidestep any of your hypothetical scenarios by gating auto-promotions to prod.
TypingOutBugs 20 days ago [-]
> I don't know which hypothetical CI runner you have in mind. Back in reality, most CICD services bill you on time spent running tests, not "core". Even core count is reflected as runtime multpliers.
If you're charging a runtime multiplier per core then there's a cost per core. Included minutes on most runners are limited to basic versions with limited cores. Try pytest-xdist on a default GitHub runner and see whether you get any speed-up…
> Moreover, services like GitHub not only offer a baseline of minutes per month, and on top of that provide support for self-hosted runners that cost you nothing.
Except the cost of the hardware, sysadmin time for setting up and supporting ephemeral runners, monitoring, etc…
> No, that's simply false and fundamentally wrong. You can run any test on any scenario independently of any other test from another scenario.
In end-to-end system tests, if you have 100 tests hitting the system at the same time, how do you guarantee the underlying state is the same so the tests are idempotent?
Also another example, I set up testing pipelines for an OS that ran in an FPGA in a HIL CI test. I had three of these due to operating costs. How could I parallelise tests that required flashing firmware AND have the most pipelines running as possible?
csomar 20 days ago [-]
I can see tests costing $50 if you are running them on GitHub compute, which is 10-15x more expensive than AWS. Still, even at $2-3 that's quite expensive. Maybe if you have a really large low-level build. I have a really small one and it consumes roughly 10 minutes on GitHub.
mhlakhani 20 days ago [-]
> Please explain exactly how you managed to put together a test suite that costs up to 50€ to run.
I'm not OP but have worked with them: have you considered a repo that might have tens of thousands of committers over decades? It's very easy to just have an insane amount of tests.
csomar 20 days ago [-]
How does modules solve this problem? If you changed Module A, don't you need to also test every component that depends on that module to account for any possible regression?
> The scenario used to drive this idea is a slippery slope fallacy that tests can take over an hour to run after years.
It doesn't really take much for a project's tests to grow to an hour. If you have a CI/CD pipeline that executes tests on every commit and you have 30-40 commits daily, that's 30-40 machine-hours of testing per day.
Still, I'd rather set up two full machines to run the tests than have to deal with selective testing. It might make sense if you have hundreds of commits per day.
mhlakhani 20 days ago [-]
In large mono-repos, like this one is presumably targeting, running all tests in the repo for a given PR would take years (maybe even decades/centuries) of compute time. You have to do some level of test selection, and there are full time engineers who just work on optimizing this.
The test runtime argument is the main one IMO.
(source: while I did not work on this at a prior job, I worked closely with the team that did this work).
motorest 20 days ago [-]
> In large mono-repos, like this one is presumably targeting, running all tests in the repo for a given PR would take years (maybe even decades/centuries) of compute time.
No, not really. That's a silly example. Your assertion makes as much sense as arguing that your monorepo would take years to build because all the code is tracked by the same repo. It doesn't, does it? Why not?
How you store your code has nothing to do with how you build it.
danpalmer 20 days ago [-]
Being a monorepo is typically about more than just how code is stored, it leads to very different practices about dependency management, building, etc. On the monorepo I work on, the fact it is a monorepo is intrinsically linked to the build, testing, and deployment processes in many ways.
The ideal is that your build system by-necessity contains the data to be able to selectively test – typically the case if you're linking code in some way. You import a library, now your tests get run if that library changes. As the article suggests, this breaks down over service boundaries, but as you suggest, you still hopefully have modules you can link up.
The problem is when you have hundreds of services, maintaining those dependencies manually could be hard. When you have thousands it may be nearly impossible. When you have hundreds of thousands it may be impossible. I think that's where applying ML to the problem comes in, so that you can incrementally understand the ever-changing dependencies across services.
I can also assure you that however smart the build system is, there will always be spooky action at a distance between components.
If you change a low-level library that's the equivalent of the C++ standard library and you want to test the changes, you effectively have to rebuild the world. And you don't want to.
danpalmer 20 days ago [-]
Exactly, or when "one" build rule connecting a service to another is actually a hundred deeply nested build rules doing a lot more work, and every one of those code paths would need to correctly convey whether a dependency is required at the test level or not.
rurban 20 days ago [-]
When your CI is too big, do at least a random selection. That way you'll still catch bugs eventually. With selective testing you just ignore them.
brunoarueira 20 days ago [-]
On my last job, since the project is based on Ruby on Rails, we implemented this https://github.com/toptal/crystalball plus additional modifications based on the GitLab setup. After that, the pull request test suite runs pretty fast.
lbriner 19 days ago [-]
Like others, I think this is a solution describing an idealised problem but it very quickly breaks down.
Firstly, if we could accurately know the dependencies that potentially affect a top-level test, we would not be likely to have a problem in the first place. Our code base is not particularly complex and is probably around 15 libraries plus a web app and API in a single solution. A change to something in a library potentially affects about 50 places (but might not affect any of them), and most of the time there is no direct/easy visibility of what calls what, which calls what, which calls what. There is also no correlation between folders and top-level tests. Most code is shared, so how would that work?
Secondly, we use some front-end code (like many on HN), where a simple change could break every single other front-end page. Might be bad architecture, but that is what it is, and so any front-end change would need to run every UI test. The breakage might be subtle, like a specific button now disappearing behind a sidebar - not noticeable on the other pages, but it will definitely break a test.
Thirdly, you have to run all of your tests before deploying to production anyway, so the fact that you might get some fast feedback early on is nice, but most likely you won't notice the bad stuff until the 45-minute test suite has run, at which point you have blocked production and will have to prove that you have fixed it before waiting another 45 minutes.
Fourthly, a big problem for us (maybe 50% of the failures) are flaky tests (maybe caused by flaky code, timing issues, database state issue or just hardware problems) and running selective tests doesn't deal with this.
And lastly, we already run tests somewhat selectively - we run unit tests on branch builds before building main, and we have a number of test projects running in parallel. But with less-than-perfect developers, less-than-perfect architecture, and less-than-perfect CI tools and environments, I think we are just left to try and incrementally improve things by identifying parallelisation opportunities, not over-testing functionality that is not on the main paths, etc.
atq2119 20 days ago [-]
Another helpful tool that should be mentioned in this context is the idea of merge trains, where a thorough test run is amortized over many commits (that should each have first received lighter selective testing).
This doesn't necessarily reduce the latency until a commit lands, though it might by reducing the overall load on the testing infrastructure. But it ensures that all commits ultimately get the same test coverage with less redundant test expense.
That avoids the problem where a change in component A can accidentally break a test that is only tested on changes to component B.
(It also eliminates regressions that occur when two changes land that have a semantic conflict that wasn't detected because there was no textual conflict and the changes were only tested independently.)
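To sketch the amortisation: batch the queued changes, run the thorough suite once on the combined result, and bisect only when that fails. Real merge trains (GitLab's, GitHub's merge queue) pipeline this speculatively, but the shape is roughly:

    def land_train(main, queue, merge, run_full_tests):
        """Batch the queued changes, run the thorough suite once on the combined
        result, and bisect only on failure. `merge(base, change)` returns a
        candidate commit; `run_full_tests(commit)` returns True/False.
        Returns (new main, rejected changes)."""
        if not queue:
            return main, []
        candidate = main
        for change in queue:
            candidate = merge(candidate, change)
        if run_full_tests(candidate):
            return candidate, []                   # one thorough run amortised over the batch
        if len(queue) == 1:
            return main, [queue[0]]                # isolated the offending change
        mid = len(queue) // 2
        head, rejected_a = land_train(main, queue[:mid], merge, run_full_tests)
        head, rejected_b = land_train(head, queue[mid:], merge, run_full_tests)
        return head, rejected_a + rejected_b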
the_gipsy 20 days ago [-]
> Selective testing is a key technique necessary for working with any large codebase or monorepo: picking which tests to run to validate a change or pull-request, because running every test every time is costly and slow.
Wait, is it really? Or is it preemptively admitting defeat?
gorset 20 days ago [-]
Selective testing is great also for smaller projects to help keep velocity high for merging and working with branches with many commits.
I implemented selective testing using Bazel for a CI some years ago, and it was painful to get right. When finished, even bigger branches would only take seconds to minutes to go through the pipeline, which was a significant improvement over the ~30-minute build when I started working with the project, even though the project size grew a lot.
Glad to see mill-build is prioritizing this feature.
reynaldi 20 days ago [-]
Interesting, I only just learned about selective testing from reading this post. But now I wonder: when do you do selective testing, and when do you just break up the codebase?
lihaoyi 20 days ago [-]
Even if you break up a codebase, you need selective testing.
Let's say you have 100 small repos, and make a change to one. How confident are you in the changed repo's test suite that you can guarantee there are no bugs that will affect other related repos? If not, which other repos do you test with your change to get that confidence?
menaerus 20 days ago [-]
Splitting the code into 10s or 100s of repositories is a cancer. It's almost as if you're intentionally trying to make your dev life miserable by pretending that you sped up dev turnaround times by not running the full test cycle.
atmosx 20 days ago [-]
You break up the codebase when the communication patterns in the org call for it, e.g. engineers complain about it.
It’s not a purely technical problem.
hinkley 20 days ago [-]
Years ago there was a system that would track code coverage per test and would rerun only the tests that intersected with the lines you changed, as a more specific watch command.
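(The selection half of that trick is tiny once you have the per-test coverage map; maintaining the map is the expensive part. A sketch with made-up data:)

    def tests_touching_diff(coverage_map: dict[str, set[tuple[str, int]]],
                            changed_lines: set[tuple[str, int]]) -> set[str]:
        """coverage_map: test -> (file, line) pairs it executed on its last run.
        Rerun any test whose recorded coverage intersects the diff."""
        return {test for test, lines in coverage_map.items() if lines & changed_lines}

    cov = {   # made-up data
        "test_login":  {("auth.py", 10), ("auth.py", 11)},
        "test_report": {("report.py", 40)},
    }
    print(tests_touching_diff(cov, {("auth.py", 11)}))   # -> {'test_login'}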
But watch itself still has a lot of value, because it starts in the gap between when you save changes and get curious about the state of the tests.
Ar-Curunir 20 days ago [-]
I’ve wanted something like this deeply integrated into my programming language. The compiler already knows that fine-grained dependency information; it should be able to use that to ignore tests for modules that are not in the dependency subgraph of the diff
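The walk itself is small once something hands you the edges - reverse the dependency graph and flood-fill from the changed modules. A sketch with invented module names:

    from collections import defaultdict, deque

    def affected_modules(deps: dict[str, set[str]], changed: set[str]) -> set[str]:
        """deps maps module -> modules it depends on; anything reachable along
        the reversed edges from a changed module needs its tests run."""
        rdeps = defaultdict(set)
        for mod, uses in deps.items():
            for dep in uses:
                rdeps[dep].add(mod)
        seen, queue = set(changed), deque(changed)
        while queue:
            for dependent in rdeps[queue.popleft()]:
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen

    deps = {"app": {"core", "ui"}, "ui": {"core"}, "core": set()}
    print(affected_modules(deps, {"core"}))   # -> {'core', 'ui', 'app'}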
k3vinw 20 days ago [-]
Another approach that helped us was relocating integration tests out of the main CI build which were costly and not critical to run as part of every CI build. We now run them daily against the latest version of the build.
dshacker 20 days ago [-]
Hah, funnily enough this is exactly what I was working on this year :)
pluto_modadic 20 days ago [-]
I think... an approach that might work... is testing the immediate unit first (fastest, guess by changed files), then dependent tests you didn't test yet (e.g., guessing based on imports), then integration tests that use the feature, then full tests (regardless), then fuzzing...
order the tests by what is likely to fail quickest. you could even have deployed it during the dependent tests (and run the full and fuzz tests overnight), and have it report back that "no, the commit actually didn't pass".
or, bonus points, if you just made a new test case as part of a bugfix or regression ticket, it should definitely run that test unit you just modified first: any commit in a tests/ folder or touching a `test_` file.
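(roughly that ordering as a sort key - changed test files first, then recent failures, then duration so quick feedback lands first; the fields are whatever your test history happens to record:)

    from dataclasses import dataclass

    @dataclass
    class TestInfo:
        name: str
        path: str
        recent_failures: int     # e.g. failures over the last 50 runs
        avg_seconds: float

    def prioritise(tests: list[TestInfo], changed_paths: set[str]) -> list[TestInfo]:
        """Run the likeliest-to-fail, cheapest-to-run tests first."""
        def key(t: TestInfo):
            touched = t.path in changed_paths     # the test file itself was edited
            return (not touched, -t.recent_failures, t.avg_seconds)
        return sorted(tests, key=key)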
I mean... maybe it should be running on your laptop...