I know that someone has already written a distributed multimountable filesystem for S3
. But it's commercial and closed source.
I've not looked at how it works. But I've been thinking...
There exist already filesystems that are based on preallocated extents, and filesystems that are based on immutable extents. One can combine the two, and build a filesystem that builds on S3, like so...
Each inode structure is an S3 item. Also, each extent is likewise an S3 item. Actually, they are a sequence of S3 items, because they will be versioned. Every time an inode is changed, or an extent is modified, what actually happens is a new one gets written to S3, and the item names for them have a delimited suffix with the version number.
This allows multiple hosts to mount the filesystem readwrite, without being incoherent, and without needing a "live" distributed lock manager. If a host has it mounted, and is reading from some extent, and some other host writes to that extent, the first host will keep reading from the old one.
On a regular basis, such as on sync, a host will issue a list request against all the extents and inodes it is using. It will then thus discover any updated ones, and act accordingly.
Also, each host will write a "ping" item, probably at every such sync. Something can monitor the bucket, and delete all extents and inodes that are older than newest ping of the farthest behind mounting host.
If instead old extents are not deleted after they are obsoleted, it would in fact be possible to mount the filesystem readonly as it appeared at time X, for any arbitrary time between "now" and "just ahead of the reaper process".