When Python Environments Meet Distributed Filesystems: Enter chaosfs
How distributed filesystems can make two hosts disagree about the same Python environment
In my When You Don’t Control the Python Environment post, I wrote about picopip: what happens when you need to run Python in an environment you do not fully control, and how vendoring can be a solution to the problem.
That post was about package installation. This one is about filesystems.
It turns out those are not really separate topics.
In a system like Posit Connect, where Python environments are created on behalf of users and then reused to run their content, you very quickly stop living in the nice world where Python is just running out of a folder on your laptop. Connect has to create those environments, make sure they are compatible with the content and in a state where the content can run without surprises, and make sure they are immediately available on demand for any new content. Deployment environments should not be something users have to think about: if Connect does its job well, content works, and users don’t need to know how or why.
That sounds great, but it also means the filesystem holding those environments becomes part of the runtime contract.
And filesystems, unfortunately, are not always as simple as Python would like them to be.
This post is the second in what will probably be a small series on running Python in hostile environments.
The first problem that picopip tried to solve was: what do you do when you cannot trust the package manager story?
The second one is: what do you do when you cannot trust that two machines looking at the same environment are actually seeing the same state?
That is where chaosfs came from.
A virtualenv on shared storage is not just a folder
Python virtual environments look pleasantly boring. They are directories with an interpreter, some scripts, a pyvenv.cfg, and a site-packages tree.
On a local disk, that is usually the end of the story.
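If you want to see just how boring, you can walk one from Python. A quick sketch; the path is a placeholder for any venv (python -m venv /tmp/example-venv would create one), and the layout shown is the POSIX one (Windows uses Scripts\ and Lib\ instead):

from pathlib import Path

env = Path("/tmp/example-venv")  # placeholder: any venv on local disk

print((env / "pyvenv.cfg").read_text())                 # interpreter metadata
print(sorted(p.name for p in (env / "bin").iterdir()))  # python, pip, console scripts
site = next((env / "lib").glob("python*/site-packages"))
print(sorted(p.name for p in site.iterdir()))           # the installed package universe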
But on shared storage, it is the beginning of a different one. Imagine a distributed setup where one host creates a virtual environment, installs packages into it, validates it, and marks it as ready. Then another host gets a request that needs to use that same environment.
The first host can see the environment just fine.
The second one cannot.
Not because the install failed. Not because the environment is corrupt. Just because the state has not propagated yet, or because that host is still looking at a stale cached view of the same underlying data.
It means the machine that created the environment and the machine that needs to use it can disagree on whether the environment exists in a usable state.
And once that happens, a virtual environment stops being just a folder: it becomes distributed state pretending to be a folder.
Enter the multiverse
These failures are annoying because they surface at the Python layer, but the real problem is somewhere else.
For example, imagine Host A creates /environments/abc123, installs fastapi, and marks the environment ready.
Host B receives work that should run using /environments/abc123.
Host B can see the directory. It may even see bin/python. But its view of site-packages is stale, so fastapi is still missing from what it observes.
From Host A’s point of view, the environment is complete.
From Host B’s point of view, the environment is not ready yet.
Both are looking at the same environment.
Both are technically correct.
And your system is now in trouble.
The error you get is something boring like ModuleNotFoundError, which makes it tempting to debug packaging, dependency resolution, or the build process.
But the real bug is that your platform decided “environment is ready” based on one machine’s view of shared state, and another machine had to live with a different one.
Why this is hard to develop for and debug
Most Python tooling quietly assumes local filesystem semantics.
Write a file, and the next reader sees it.
Create a directory, and listings are current.
Finish pip install, and the environment is now ready to use (the sketch below spells these out).
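Spelled out as code, the pattern looks harmless; on a local disk these checks cannot fail, while on lagging shared storage the same checks, run from another host, can all observe the past. The path here is hypothetical:

from pathlib import Path

env = Path("/shared/envs/abc123")  # hypothetical shared-storage path
env.mkdir(parents=True, exist_ok=True)

(env / "READY").write_text("ok")                   # write a file...
assert (env / "READY").exists()                    # ...the next reader sees it
assert "READY" in {p.name for p in env.iterdir()}  # ...and listings are current

# Both asserts always hold for the writer itself. The trap is assuming
# the same checks hold for a reader on another host at the same moment.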
Those assumptions are usually fine on 99.9% of the systems out there.
They are much less safe once environments live on remote or shared storage and multiple hosts are involved. The difficult part is that these bugs are often transient: retry a moment later and everything works.
That makes them epic production bugs that your tests can’t catch and your developers can’t reproduce.
Do not play Whack-A-Mole
The theory around eventually consistent systems is well known and well studied. But when I write code, I want solid proof that it works as expected; I did not want to design my solution or code based on theoretical assumptions alone.
I wanted a way to reproduce the exact failure shape that matters here: one machine writes environment state, another machine consumes it, and they do not agree yet on what exists.
That is the core idea behind chaosfs.
The important feature is not just “add delay to filesystem operations”. The interesting part is that chaosfs can mount multiple filesystems backed by the same underlying data, so different processes can observe different views of the same environment with slightly different delays or caching behavior.
That matters because it lets a single machine simulate the distributed case that is otherwise annoying to reproduce:
one process behaves like the host creating the environment
another behaves like the host trying to use it
both point to the same backing data, but they do not see the same state at the same time.
A concrete example
Suppose your platform does something like this:
env = create_virtualenv("/shared/envs/abc123")
install_requirements(env)
validate_environment(env)
mark_environment_ready(env)

That logic looks perfectly reasonable.
Now add a second host:
run_job(env="/shared/envs/abc123")

If Host A runs the first sequence and Host B runs the second, the hidden assumption is that once mark_environment_ready happens, Host B will observe the same environment Host A just validated.
That assumption is exactly what breaks on hostile storage.
Host A may have seen:
pyvenv.cfg
bin/python
all the installed packages
all metadata files
all console scripts
Host B may still see only part of that.
So you get the worst kind of failure. Not “environment creation failed”. Not “storage is down”. Just “the system said this environment was ready, and another host disagreed”.
That is the kind of bug chaosfs can force into the open.
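The tempting first fix is to make Host B verify readiness from its own mount instead of trusting Host A’s signal. Here is a minimal sketch of that pattern, with entirely hypothetical names; note that on storage with per-file caching it narrows the race without closing it, since the marker can become visible before the packages do:

import time
from pathlib import Path

READY_MARKER = "READY"  # hypothetical sentinel the writer creates last

def wait_until_visible(env: Path, timeout: float = 30.0) -> bool:
    # Poll from this host's own view of shared storage until the
    # sentinel file is observable, or give up after the timeout.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if (env / READY_MARKER).exists():
            return True
        time.sleep(0.5)
    return False

if not wait_until_visible(Path("/shared/envs/abc123")):
    raise RuntimeError("environment not visible from this host yet")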
Why this matters for Python specifically
Imports depend on files being there.
Virtual environments depend on a directory tree being complete and visible enough to reconstruct an interpreter plus package universe.
That is very convenient when the filesystem behaves well.
It is much less convenient when the filesystem is allowed to lag behind the writer.
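You can watch that dependency directly; an import is just the filesystem answering a question, and ModuleNotFoundError only means no matching file was visible at the moment Python looked:

import importlib.util

spec = importlib.util.find_spec("fastapi")
if spec is None:
    # From this host's view, the package's files are not (yet) there.
    print("fastapi is not importable from here")
else:
    print("fastapi resolves to", spec.origin)  # a concrete path on disk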
And if you are building infrastructure that creates Python environments on demand, reuses them, caches them, and expects multiple machines to consume them, then you are not just running Python anymore.
You are running a distributed protocol made of files and directories. It just does not look like one at first.
Enter chaosfs
At that point I wanted something more useful than another theory about distributed filesystems: a way to run the existing test suite on top of a filesystem that behaves a bit more like the hostile environments I care about.
That is where chaosfs became really convenient.
It is a local FUSE harness for reproducing NFS-style consistency issues without needing a real NFS setup. It can mount multiple clients over the same backing directory and inject the kinds of problems that actually matter here: delayed visibility of writes, stale metadata and directory listings, delayed rename visibility, and random failures. It also supports deterministic seeds, so once you hit a bad interleaving you can reproduce it instead of hoping it happens again.
The multi-client part is especially useful. Two processes can operate on the same backing data while seeing slightly different views of it, which is exactly what I needed to model the “host A created the environment, host B tries to use it” case.
For example, the CLI can expose two mount points over the same backing directory:
BACKING_DIR=/tmp/chaosfs/backing
MOUNT_BASE=/tmp/chaosfs/mnt
LOG_DIR=/tmp/chaosfs/logs

mkdir -p "$BACKING_DIR" "$MOUNT_BASE" "$LOG_DIR"

chaosfs mount "$BACKING_DIR" "$MOUNT_BASE/clientA" \
  --client-id clientA --log-dir "$LOG_DIR" --background

chaosfs mount "$BACKING_DIR" "$MOUNT_BASE/clientB" \
  --client-id clientB --log-dir "$LOG_DIR" --background

As a Python engineer, I also wanted this to fit naturally into pytest, without dedicated infrastructure or some separate test environment nobody wants to maintain. That is probably my favorite part: the fake distributed filesystem can be created directly inside the test itself.
import time

from chaosfs import dual_mount

def test_writer_reader(tmp_path):
    with dual_mount(tmp_path) as (writer, reader):
        (writer / "data.txt").write_text("content")
        time.sleep(2.5)  # wait for reader TTL to expire
        assert (reader / "data.txt").read_text() == "content"

That makes it much easier to turn a vague reliability concern into something the test suite can answer.
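And since dual_mount is a context manager, it drops straight into a fixture when many tests need the same two views. A small sketch; chaos_views is my name for it, not part of chaosfs:

import pytest

from chaosfs import dual_mount

@pytest.fixture
def chaos_views(tmp_path):
    # One view for the "writer" host, one for the "reader" host,
    # both backed by the same temporary directory.
    with dual_mount(tmp_path) as (writer, reader):
        yield writer, reader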
If tests pass under those conditions, great. If they fail or turn flaky, there is still some hidden assumption in the code about the filesystem telling the truth immediately.
And that is the part I like most about tools like this: they let you stop debating whether a system is “probably robust” and start checking.
picopip came from not trusting the packaging story.
chaosfs came from not trusting the filesystem story.
I doubt it will be the last one.


