Bazillion bytes

Status update: a distributed filesystem 2015-07-07

A lot of time has passed, and a lot of code has been written. Bazil is still in heavy development, but it has reached a good milestone to blog about: it can synchronize changes from one peer to another.

Warning: at this stage in development, we will put no effort into compatibility of file formats or protocols. Do not stare into laser with remaining eye.

What follows is a walkthrough of a scenario where we have two computers sharing files – find me at GopherCon for a live demo, or follow the steps and run it yourself.

Installation

First, make sure you have a working Go (>=1.4) installation. At this point in development, you are expected to have basic familiarity with Go.

Unfortunately, to work around a missing gRPC feature, we need a custom branch of it for now. Let's check that out:

$ go get google.golang.org/grpc
$ cd $GOPATH/src/google.golang.org/grpc
$ git remote add bazil https://github.com/bazil/grpc-go
$ git fetch bazil
$ git checkout -b auth bazil/auth

And then install Bazil itself:

$ go get bazil.org/bazil

Initialization

For the rest, we'll assume you have two computers, virtual machines or containers that will talk to each other.

You can also run the steps on one host, by passing the -data-dir=PATH option to bazil as appropriate to keep two separate state directories.

We'll call our two environments black and white, and differentiate them with that hostname in the prompt.

white$ bazil create
black$ bazil create

Public keys

To introduce the peers to each other, we need to pass their public keys to each other. As the current code doesn't actually keep track of any nicknames or aliases for peers, we'll need to refer to these public keys a lot. Let's set shell variables to remember them.

To see the public key of a node, run

white$ bazil debug pubkey

Typically, debug commands access the database directly, and will only work if the server is not running.

Now set the variable $BLACK on the host white with the value being the public key of black, and vice versa. If you're running the two on the same host, the following will work; if not, copy-pasting with the mouse is needed.

white$ BLACK="$(bazil -data-dir=path/to/datadir/of/black debug pubkey)"
black$ WHITE="$(bazil -data-dir=path/to/datadir/of/white debug pubkey)"

As is probably obvious from the debug in the command name, this is not the final UX for this.

Running the server

Bazil has a (per-user) server component that the command-line utilities communicate with. Let's start the server on both white and black.

white$ bazil server run &
bazil: Listening on [::]:34211
black$ bazil server run &
bazil: Listening on [::]:nnnnn

Making friends

We believe in the value of encryption. Bazil uses convergent encryption with sharing keys: anyone who knows the relevant sharing key can access the data.
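To make the idea of convergent encryption concrete, here is a minimal Go sketch. It is an illustration under our own assumptions, not necessarily the construction Bazil uses: the function names deriveKey and seal are hypothetical, and the choice of HMAC-SHA256 plus AES-GCM is ours. The point is only that the encryption key is derived from the sharing secret and the content itself, so the same data under the same sharing key always encrypts to the same bytes, which is what allows deduplication of encrypted data.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// deriveKey computes a per-content key from the sharing secret and the
// plaintext. Identical plaintext under the same secret yields the same key.
// (Hypothetical helper; not taken from the Bazil source.)
func deriveKey(secret, plaintext []byte) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write(plaintext)
	return mac.Sum(nil) // 32 bytes, usable as an AES-256 key
}

// seal encrypts plaintext deterministically. The all-zero nonce is tolerable
// here only because each derived key is used for exactly one plaintext.
func seal(secret, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(deriveKey(secret, plaintext))
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	return gcm.Seal(nil, nonce, plaintext, nil), nil
}

func main() {
	friends := []byte("example sharing secret, 32 bytes")
	other := []byte("a different sharing secret here!")

	a, _ := seal(friends, []byte("hello, world"))
	b, _ := seal(friends, []byte("hello, world"))
	c, _ := seal(other, []byte("hello, world"))

	fmt.Println(string(a) == string(b)) // true: same secret, same data
	fmt.Println(string(a) == string(c)) // false: different secret
}
```

Decrypting requires the derived key, so a real system has to record it somewhere that only holders of the sharing key can read; the sketch leaves that part out.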

The default installation sets up one sharing key, but let's make a new one for our shared files; it's just 32 bytes of random data. We'll name our new sharing key friends.

white$ dd if=/dev/urandom of=sekrit bs=32 count=1
white$ bazil sharing add friends <sekrit
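If /dev/urandom is not available where you are, the same 32 bytes can be produced with a few lines of Go. This is a small sketch of our own, not part of Bazil; the name newSharingKey is hypothetical. It writes the raw key to standard output, so redirect it to a file such as sekrit.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"os"
)

// keySize matches the 32 bytes of random data described above.
const keySize = 32

// newSharingKey returns keySize bytes from the operating system's
// cryptographically secure random source.
func newSharingKey() ([]byte, error) {
	key := make([]byte, keySize)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	return key, nil
}

func main() {
	key, err := newSharingKey()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Raw bytes, not hex: redirect stdout to a file.
	os.Stdout.Write(key)
}
```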

Let's create a volume using the new sharing key, and mount it.

white$ bazil volume create -sharing=friends myfiles
white$ mkdir mnt
white$ bazil volume mount myfiles mnt

We now have an encrypted, deduplicating, snapshottable, local file system. Let's share it with black, using the public key stored in $BLACK from earlier.

We introduce a new peer, identified by the public key stored in $BLACK. We tell white to allow black to access its local content-addressed storage, and the myfiles volume we just created.

white$ bazil peer add $BLACK
white$ bazil peer storage allow $BLACK local
white$ bazil peer volume allow $BLACK myfiles

Let's tell black to use the new volume. First, we introduce white as a new peer for black, and give the network location where the server on white is listening. The server prefers the port 34211 (bazil, do you see it?), but will use any free port. We saw the port in the server output earlier.

black$ bazil peer add $WHITE
black$ bazil peer location set $WHITE 192.0.2.42:34211

Later, we'll introduce more rendezvous mechanisms, including multicast DNS and an internet-wide lookup based on the public key, and mechanisms for working behind NATs.

black needs to know the sharing key from earlier. Copy the sekrit file from white to black through whatever means are appropriate, and then run

black$ bazil sharing add friends <sekrit
black$ bazil volume connect -sharing=friends $WHITE myfiles
black$ bazil volume storage add -sharing=friends myfiles peerkey:$WHITE
black$ mkdir mnt
black$ bazil volume mount myfiles mnt

We now have the same volume mounted on two machines.

A distributed filesystem

Let's make changes on white and observe them on black.

white$ echo hello, world >mnt/greeting
black$ bazil volume sync myfiles $WHITE
black$ ls mnt
black$ cat mnt/greeting
white$ echo hello, again >mnt/greeting
black$ bazil volume sync myfiles $WHITE
black$ cat mnt/greeting

Hey! It works!

Limitations

The sync implementation doesn't currently handle deletions or subdirectories.

There is currently no user interface to resolve conflicts, or to finish sync merges that were postponed because a file was still open.

At this stage in development, we will put no effort into compatibility of file formats or protocols.

Future

After the obvious missing functionality mentioned is done, there's plenty of work to be done on making the user experience of managing peers better. The steps above are very manual and discrete right now, as that is what's easiest to debug.

Once the common usage scenarios have been explored, more convenient mechanisms can be added on top of these low-level steps, e.g. bootstrapping a peer connection over ssh, and interacting with friends over IM with humans copy-pasting short messages.

To learn more about the why of Bazil, read the introductory blog post.

To understand the architecture of Bazil better, browse the documentation at https://bazil.org/doc/.

Bazil is still at an early stage in development, but the future looks really exciting. We'd love to have you participating.

posted on 2015-07-07, tagged as bazil


Introducing Bazil 2014-04-24

GopherCon is here, and it is time to reveal what Bazil is all about.

Bazil, also known as bazil.org/bazil, is a file system that lets your data reside where it is most convenient for it to reside.

Bazil is still under heavy development, but welcomes developers and curious power users. Here's a little teaser of what's coming.

Imagine you have

  • A laptop with a 256GB SSD
  • A desktop with a 3TB hard disk
  • A cloud server or storage service
    (virtual machine with expandable disks, S3, etc)
  • or
    just a cheap computer in a closet with two slow 4TB disks
  • or
    external USB disk you plug in once a week for backups

On the desktop, you naturally want to be able to use the whole 3TB disk. And you're not always using the desktop, even when you're home – the sofa is just so comfortable. You'd like to work with your files even when you're on the laptop.

First try: Let's sync to the Cloud*

So you install the currently fashionable large-corporate-backed cloud-sync solution.

A file sync based solution will try to copy all of your files from the desktop to the laptop – yet the laptop's smaller SSD just can't hold that much! You're forced to play games with picking-and-choosing what folders get synchronized, and just don't have the convenience of grabbing that 8-year-old wedding photo on a whim.

  • Desktop use is just ok: you need to keep adjusting what folders are synced and what not
  • Laptop use is miserable
  • Cloud storage of 3TB gets expensive: Dropbox or AWS will charge you $1000/year for the storage alone. That's about 30TB of hard drives.
    And most providers don't inspire confidence on the privacy of your files.
  • or
    The large corporations are not interested in supporting your server in a closet.
  • or
    These are the wrong corporations to make money off of you buying hard drives, so they have no interest in supporting that either. Why don't you rent more online space, you're easier to monetize that way.

To modernize an aphorism, you can't put ten terabytes of files on a 500 GB SSD. Syncing between very disproportionate systems is fundamentally a problematic design, and is best for a small hand-picked set of files, not as an actual storage solution.

Don't take this the wrong way; you really should have some sort of remote backups for important data, in case the building burns down. S3 RRS/Glacier and Google Cloud Storage DRA seem very promising for backup cold storage; we'll come back to that later.

Second try: Use a network file system

Rocking it old school? We're down with that.

A network file system like CIFS or NFS, or something like sshfs, would let you use the files from the desktop on the laptop – but your wifi will never be as fast as the laptop's local SSD, in either bandwidth or latency, so now all your file accesses are crawling, and you end up hunting for an ethernet cable whenever you need to transfer something bigger.

To speed things up, you end up copying often used files to the SSD. Now you have several copies of the same files, and no idea what was modified when, or whether you're looking at the last copy, or whether it's safe to delete to free up space on the cramped laptop.

A network file system will also require you to stay within wifi range. For travel, you're once again reduced to manually copying files around, and once again lose track of where the latest copy of each file is.

  • Desktop use is kinda sorta tolerable: you're never sure whether the file you are looking at is the latest copy

  • Laptop use is miserable: you're confused about which copies of your files are the right ones, the network file system is an umbilical tying you to your home network, and everything goes over the slow wifi all the time

  • Cloud storage is still expensive, but now you can use it as backup only and bypass the synchronization service providers: switching between clouds is easier, and cold storage and reduced availability is cheaper.

    However, this leaves you installing & configuring cloud backup software in addition to your network file system woes; not the simplest ordeal, and don't expect any kind of file history browsing / recovery integration for your network file system clients.

  • or

    You can choose to back up to your own disks – with the same caveats as above

  • or

    All of the bad parts of the computer in the closet, with the extra of needing to fiddle with the disks and remember things.

What Bazil does

Bazil separates knowledge of a file from the contents of the file, letting the laptop know about all of the files, without having to store the contents of the file.

With Bazil, the laptop SSD contents act as

  • a cache: file contents accessed recently can be stored temporarily, in case they are needed again
  • a buffer: new content is saved fast on the SSD, and transferred to the desktop / cloud / server in a closet in the background
  • a stash: the user can pin files for use when offline or just using a slower Internet connection
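The cache and stash roles can be sketched as a small local store where pinned entries are exempt from eviction. This is a toy model under our own assumptions, not Bazil's implementation: the type Local and its methods are hypothetical names, and eviction is plain least-recently-used. It also ignores the buffer role; a real implementation would have to keep new content around until it has been transferred elsewhere.

```go
package main

import "fmt"

// entry is one locally stored piece of file content.
type entry struct {
	data   []byte
	pinned bool // pinned entries form the "stash" and are never evicted
	used   int  // logical clock value of the last access
}

// Local models the laptop's SSD (hypothetical type, for illustration only).
type Local struct {
	entries map[string]*entry
	clock   int
}

func NewLocal() *Local { return &Local{entries: map[string]*entry{}} }

// Put stores content locally and marks it as just used.
func (l *Local) Put(key string, data []byte) {
	l.clock++
	l.entries[key] = &entry{data: data, used: l.clock}
}

// Get returns cached content and refreshes its last-used time.
func (l *Local) Get(key string) ([]byte, bool) {
	e, ok := l.entries[key]
	if !ok {
		return nil, false
	}
	l.clock++
	e.used = l.clock
	return e.data, true
}

// Pin marks content as needed offline.
func (l *Local) Pin(key string) {
	if e, ok := l.entries[key]; ok {
		e.pinned = true
	}
}

// EvictOne drops the least recently used unpinned entry and reports its key.
func (l *Local) EvictOne() (string, bool) {
	victim, found := "", false
	for k, e := range l.entries {
		if e.pinned {
			continue
		}
		if !found || e.used < l.entries[victim].used {
			victim, found = k, true
		}
	}
	if found {
		delete(l.entries, victim)
	}
	return victim, found
}

func main() {
	l := NewLocal()
	l.Put("photo", []byte("..."))
	l.Put("notes", []byte("..."))
	l.Put("video", []byte("..."))
	l.Pin("photo") // oldest entry, but pinned for travel
	l.Get("notes") // recently read, so it stays cached

	k, _ := l.EvictOne()
	fmt.Println(k) // video
}
```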

And because Bazil keeps track of changes, it can synchronize them between the different peers; no more confusion about what copy is the latest.

When you try to read a file whose contents are not stored locally, the data will be fetched from the desktop or the cloud/closet server, whichever happens to be the fastest. All the data is accessible even if it won't fit on the SSD.
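One way to picture "whichever happens to be the fastest" is to ask every place that might hold the data at the same time and take the first successful answer. The Go sketch below does that, with a local cache in front. It is an assumption-laden illustration, not Bazil's code: Source, fetchFastest, read and delayed are hypothetical names, and the sources are simulated with sleeps.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Source is a hypothetical place that may hold an object's contents,
// such as the desktop or a cloud server.
type Source func(key string) ([]byte, error)

var ErrNotFound = errors.New("not found")

// fetchFastest asks every source at once and returns the first success.
func fetchFastest(key string, sources []Source) ([]byte, error) {
	type result struct {
		data []byte
		err  error
	}
	// Buffered so that slower sources can finish without blocking.
	results := make(chan result, len(sources))
	for _, src := range sources {
		go func(src Source) {
			data, err := src(key)
			results <- result{data, err}
		}(src)
	}
	for range sources {
		if r := <-results; r.err == nil {
			return r.data, nil
		}
	}
	return nil, ErrNotFound
}

// read serves from the local cache when possible and fills it otherwise.
func read(key string, cache map[string][]byte, sources []Source) ([]byte, error) {
	if data, ok := cache[key]; ok {
		return data, nil
	}
	data, err := fetchFastest(key, sources)
	if err != nil {
		return nil, err
	}
	cache[key] = data
	return data, nil
}

// delayed simulates a source that answers after d; nil data means a miss.
func delayed(d time.Duration, data []byte) Source {
	return func(string) ([]byte, error) {
		time.Sleep(d)
		if data == nil {
			return nil, ErrNotFound
		}
		return data, nil
	}
}

func main() {
	cache := map[string][]byte{}
	sources := []Source{
		delayed(200*time.Millisecond, []byte("from cloud")),
		delayed(10*time.Millisecond, []byte("from desktop")),
	}
	data, err := read("greeting", cache, sources)
	fmt.Println(string(data), err) // the faster desktop normally wins
}
```

A second read of the same key is served from the cache without touching the network.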

You can pin files for travel, so you're no longer tied to your home network, or even any network connectivity.

Bazil is the archival solution, with the snapshot feature. Every Bazil peer can browse the earlier snapshots, making restoring files easy no matter what computer you're on. You don't have to manage both a network file system and a backup solution.

Bazil is the redundancy solution, with copies of file contents stored on multiple computers. The CAS stores immutable, write-once objects, so you can even mitigate software bugs by taking an extra copy of the history with just rsync, file system snapshots, or any other file copy tool. A snapshot is just an object, and refers to other objects; the objects contain everything needed to regain access to your files.
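A content-addressed store is simple enough to sketch in a few lines of Go. The following is a toy in-memory version under our own assumptions (SHA-256 keys, a newline-separated snapshot listing, and the hypothetical names Store, Entry, Put and Snapshot); Bazil's real object format is different and not shown here. It illustrates the two properties mentioned above: objects are written once under a key derived from their content, and a snapshot is just another object that refers to other objects by key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Store is a toy in-memory content-addressed store.
type Store map[string][]byte

// Put stores data under the hex SHA-256 of its contents and returns the key.
// Existing objects are never overwritten, so storing the same bytes twice
// costs nothing extra.
func (s Store) Put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	if _, ok := s[key]; !ok {
		s[key] = append([]byte(nil), data...)
	}
	return key
}

// Entry names one file in a snapshot and points at its content object.
type Entry struct {
	Name string
	Key  string
}

// Snapshot stores a listing object that refers to file objects by key.
func (s Store) Snapshot(entries []Entry) string {
	var lines []string
	for _, e := range entries {
		lines = append(lines, e.Key+" "+e.Name)
	}
	return s.Put([]byte(strings.Join(lines, "\n")))
}

func main() {
	s := Store{}
	k1 := s.Put([]byte("hello, world\n"))
	k2 := s.Put([]byte("hello, world\n")) // duplicate content, same key
	snap := s.Snapshot([]Entry{{Name: "greeting", Key: k1}})

	fmt.Println(k1 == k2) // true
	fmt.Println(len(s))   // 2: one file object, one snapshot object
	fmt.Println(len(snap)) // 64 hex characters
}
```

Because objects never change once written, copying the store with any file copy tool cannot observe a half-updated object, which is why a plain copy makes a usable extra backup.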

All Bazil file storage can be encrypted to guarantee your privacy, whether in the cloud, on your own computers, or on external hard drives. Encryption is on by default.

  • Desktop use is good: changes are synced at the first opportunity
  • Laptop use is good: all the data is accessible, often used files are cached, changes are synchronized, files can be made available offline
  • and
    You can mix-and-match cloud storage providers and servers in a closet as you please
  • and
    you can use external disks for extra space, with Bazil keeping track of what data is on which disk, even when they are unplugged, or even use them to avoid slow Internet transfers

Current status

Bazil is still under heavy development, and a lot of the functionality hinted at above is still not quite there. We welcome developers and curious power users.

See the documentation for more, and feel free to ask questions on the mailing list or Twitter @BazilFS.

The original gopher image was made by Renee French.

posted on 2014-04-24, tagged as bazil project announce


FUSE talk slides available 2013-06-10

Tv is talking about the Bazil FUSE library at the local Gopher meetup tonight. Check out the slides.

posted on 2013-06-10, tagged as fuse talk


Hello, World! 2013-04-01

Hi. This blog post establishes the Bazil.org project. This is not an April Fool's joke.

There are more ambitious things floating in the background, but many people have expressed interest in this, so here's an early release: a Go FUSE filesystem programming library.

This is based on Russ Cox's fuse library, as hosted at https://code.google.com/p/rsc/source/browse/#hg%2Ffuse

Here's how to get going:

go get bazil.org/fuse

The GitHub repository is at https://github.com/bazil/fuse

posted on 2013-04-01, tagged as fuse announce


Projects

Bazil is a distributed file system designed for single-person disconnected operation. It lets you share your files across all your computers, with or without cloud services.

FUSE is a programming library for writing file systems in userspace, in Go.