VolunteerGrid2 Philosophies
The following philosophies guided the decisions that resulted in the
VolunteerGrid2 Policies:
Why storage capacity variation should be small
by Shawn Willden Sat Jan 15, 2011
original message
At first glance, it doesn't appear that there's any problem with allowing
both large and small nodes in the same grid, as long as the owners of the
nodes "play fair", meaning the small node owner only consumes capacity less
than or equal to the capacity he provides (including consideration of
encoding expansion).
The problem arises because the large contributor also expects to be able to
consume a large amount of storage -- it's only fair, right? But what
happens is that all of the small nodes quickly get filled up leaving only
the large nodes with any capacity, and if there aren't enough of them to
achieve good dispersion, the grid is effectively full.
Consider an extreme example: 10 nodes, nine of which provide 10 GB each and
one of which provides 1 TB. The total storage in the grid is 1090 GB, but as soon
as 100 GB has been uploaded, all of the small servers are full. The 1 TB
server still has 990 GB available, but it's unusable by anyone who actually
wants the reliability benefits of distributing their data. So the true
capacity of this grid is only 100 GB, and the additional 990 GB on the large
node offers no value to the grid -- but the owner of that node may well
have been the one who filled the grid, believing it was fair for him to do
so.
In general, if H is the "servers-of-happiness" setting, the grid becomes
effectively full as soon as all but the largest H-1 servers are full. So to
determine the actual capacity of the grid, take the H-1 largest servers and
assume they provide the same amount of storage as the Hth-largest server,
then sum. Obviously, the "fullness" of a grid with wide capacity variation
will depend on what you choose for H.
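The rule above can be sketched in a few lines of Python. This is only an
illustration: the function name effective_capacity is made up for this
example, capacities are raw storage in GB, and encoding expansion is ignored.

```python
def effective_capacity(capacities_gb, h):
    """Usable raw capacity when every upload must reach h distinct servers.

    The h-1 largest servers are counted as if they were only as large as
    the h-th largest server; all other servers count in full.
    """
    if not 1 <= h <= len(capacities_gb):
        raise ValueError("h must be between 1 and the number of servers")
    ordered = sorted(capacities_gb, reverse=True)
    cap = ordered[h - 1]  # size of the h-th largest server
    # Servers smaller than the cap are unaffected by min().
    return sum(min(c, cap) for c in ordered)
```

With the ten-node example above (nine 10 GB nodes and one 1000 GB node), this
gives 100 GB for any H from 2 to 10, and the raw 1090 GB only when H is 1.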
For a while, volunteergrid #1 was in a state where it was "full" for anyone
with H>5, even though there were terabytes of free storage, and even though
there were nearly 20 servers in the grid. It has now been fixed, but I
don't have much confidence it will stay fixed, because some of the nodes
that became non-full don't actually have very much storage available.
A related point is that there is a lot of value in setting K, H and N to be
significantly larger values than the defaults. I'll touch more on that in
my post about why we should institute an uptime commitment, but it's
relevant here just because wide variation in capacity means you have to
start reducing H to less than optimal values, which forces you to reduce K
as well. Optimally, you really want to set H to be nearly S (the number of
nodes in the grid), so that's bad.
I can think of two ways to avoid this problem.
1. Allow nodes of any capacity, but institute a limit on how much any node
operator can upload to the grid, in addition to the "fairness" rule.
Specifically, compute total grid capacity as defined above (picking a
generous H, and I suggest H should be around 3/4 of S), divide that by the
number of nodes and specify that no one is allowed to consume more than that
amount of storage, no matter how much they provide. This will ensure the
grid never fills up before anyone has reached their fair share, but it's
kind of complicated, and it means that a few very small nodes will impose an
artificially-low limit on maximum usage.
2. Keep all nodes pretty close to the same capacity -- maybe limited to no
more than 3x between largest and smallest.
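A rough sketch of the option 1 calculation follows. The function name and the
truncation used to turn "around 3/4 of S" into an integer H are assumptions
made for this example, and the result is raw storage in GB with no allowance
for encoding expansion.

```python
def max_usage_per_operator(capacities_gb, h_fraction=0.75):
    """Per-operator usage limit under option 1.

    Picks H as roughly h_fraction of the node count, computes the
    effective grid capacity for that H, and divides it evenly.
    """
    s = len(capacities_gb)
    if s == 0:
        raise ValueError("need at least one node")
    h = max(1, int(h_fraction * s))  # assumed rounding rule: truncate
    ordered = sorted(capacities_gb, reverse=True)
    cap = ordered[h - 1]  # size of the h-th largest node
    usable = sum(min(c, cap) for c in ordered)
    return usable / s
```

For a hypothetical 20-node grid with fifteen 500 GB nodes and five 2000 GB
nodes, H works out to 15 and the per-operator limit to 500 GB, which shows how
the smaller nodes set the limit for everyone.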
Hmm. I started out typing this thinking I wanted to recommend option 2.
Now I'm thinking that maybe option 1 is better. It's not that complicated
to compute and it means we don't have to place as many artificial
restrictions on contributed nodes.
Obviously, both options will require some discussion/negotiation on the list
regarding minimum node capacities. If we end up with a max-usage value of
10 GB, for example, then this grid isn't useful to me (though I'll happily
contribute 10 GB anyway). Ideally, I want a grid with a max-usage value of
at least 500 GB, though that may not be achievable.
Why high availability is crucial
by Shawn Willden Sat Jan 15, 2011
original message
It may seem that the reason maintaining high node uptime is important is so
that files can be retrieved reliably, i.e. read-availability. In fact, the
bigger hurdle is maintaining write-availability. This is fairly obvious,
since to read you only need K servers and to write you need H servers and
usually H is significantly larger than K.
I think it's even more important than it appears, however, because I think
there's value in setting H very close to S (the number of servers in the
grid). If S=20 and H=18, then clearly it's crucial that availability of
individual servers be very high; otherwise the probability of more than two
servers being down at once is high, and the grid is then unavailable for
writes.
So, why would you want to set H very high, rather than just sticking with
the 3/7/10 parameters provided by default?
There are two reasons you might want to increase H. The first is to
increase read-reliability and the second is so that you can increase K and
reduce expansion while maintaining a certain level of read-reliability. For
purposes of determining the likelihood that a file will be available at some
point in the future, I ignore N. Setting H and N to different values is
basically saying "I'll accept one level of reliability, but if I happen to
get lucky I'll get a higher one". That's fine, but when determining what
parameters to choose, it's H and K that make the difference. In fact if S
happens to decline so that at the moment of your upload S=H, then any value
of N > H is a waste.
If you want to find out what kinds of reliability you can expect from
different parameters, there's a tool in the Tahoe source tree.
Unfortunately, I haven't done the work to make it available from the web
UI, but if you want you can use it like this:
1. Go to the tahoe/src directory.
2. Run python without any command-line arguments to start the python
interpreter.
3. Type "import allmydata.util.statistics as s" to import the statistics
module and give it a handy label (s).
4. Type "s.pr_file_loss([p]*H, K)", where "p" is the server reliability,
and H and K are the values you want to evaluate.
What value to use for p? Well, ideally it's the probability that the data
on the server will not become lost before your next repair cycle. To be
conservative, I just use the server availability target, which I'm
proposing is 0.95.
The value you get is an estimate of the likelihood that your file will be
lost before the next repair cycle. If you want to understand how it's
calculated and maybe argue with me about its validity, read my lossmodel
paper (in the docs dir). I think it's a very useful figure.
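For readers without a Tahoe source tree at hand, the simplest version of this
kind of estimate can be sketched standalone. The sketch assumes independent
servers that all share the same reliability p, and counts a file as lost when
fewer than K of its H shares survive. It is not the allmydata.util.statistics
code, the function name is made up for this example, and it has not been
checked against the real tool's output.

```python
from math import comb


def file_loss_probability(p, h, k):
    """Probability that fewer than k of h shares survive.

    Assumes one share per server, independent servers, and the same
    survival probability p for every server (a binomial model).
    """
    if not 1 <= k <= h:
        raise ValueError("need 1 <= k <= h")
    # Sum the probabilities of exactly 0, 1, ..., k-1 survivors.
    return sum(comb(h, i) * p**i * (1 - p) ** (h - i) for i in range(k))
```

For example, file_loss_probability(0.95, 7, 3) evaluates this model for the
default K=3, H=7 with the proposed 0.95 availability target.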
However, unless you're only storing one file, it's only part of the story.
Suppose you're going to store 10,000 files. On a sufficiently-large grid
(which volunteergrid2 will not be), you can model the survival or failure of
each file independently, which means the probability that all of your files
survive is "(1-s.pr_file_loss([p]*H, K))**10000". Since volunteergrid2 will
not be big enough for the independent-survival model to be accurate, the
real estimate would fall somewhere between that figure and
"1-s.pr_file_loss([p]*H, K)", which is the single-file survival probability.
To be conservative, I choose to pay attention to the lower probability, which
is the 10,000-file number.
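The two bounds described above are easy to compute once a single-file loss
estimate is available. In this sketch the loss probability is passed in as a
plain number (for instance, the value returned by s.pr_file_loss), and the
function name is made up for the example.

```python
def survival_bounds(single_file_loss, n_files):
    """Return (pessimistic, optimistic) probabilities that all files survive.

    pessimistic: every file fails independently of the others.
    optimistic:  the files behave like a single file.
    """
    single_survival = 1.0 - single_file_loss
    return single_survival**n_files, single_survival
```

With a single-file loss probability of one in a million and 10,000 files, the
pessimistic bound comes out at roughly 0.99, which shows how quickly a small
per-file risk adds up.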
Anyway, if you use that tool and spend some time playing with different
values of H and K, what you find is that if you increase H you can increase
K and reduce your expansion factor while maintaining your survival
probability. If you think about it, this makes intuitive sense, because
although you're decreasing the amount of redundancy, you're actually
increasing the number of servers that must fail in order for your data to
get lost. With 3/7, if five servers fail, your data is gone. With 7/15,
nine servers must fail. With 35/50, 16 must fail. Of course that's five
out of seven, nine out of 15 and 16 out of 50, but still, with relatively
high availability numbers, the odds of those failure rates are very close to
the same.
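The failure counts quoted above follow from a simple rule: a file is lost once
fewer than K of its H shares remain, so H-K+1 servers must fail. The sketch
below also computes the expansion factor, treating H as the number of shares
actually placed, as this post does. Both function names are made up for the
example.

```python
def failures_to_lose_file(k, h):
    """Number of server failures that leaves fewer than k of h shares."""
    return h - k + 1


def expansion_factor(k, h):
    """Stored bytes per byte of file data, counting h placed shares."""
    return h / k


for k, h in [(3, 7), (7, 15), (35, 50)]:
    print(k, h, failures_to_lose_file(k, h), round(expansion_factor(k, h), 2))
```

The loop prints the three parameter sets discussed above, so the trade-off
between expansion and the number of failures tolerated can be read side by
side.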
From a read-performance perspective there's also some value in increasing K,
because it will allow more parallelism of downloads -- at least in theory.
With the present Tahoe codebase that doesn't help as much as it should, but
it will be fixed eventually. (At present, you do download in parallel from
K servers, but all K downloads are limited to the speed of the slowest, so
your effective bandwidth is K*min(server_speeds). If that were fixed, it
would just be the sum of the bandwidth available to the K servers.)
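The difference between the two bandwidth figures can be illustrated with a
small sketch. The server speeds are hypothetical numbers in arbitrary units,
and the function names are made up for the example.

```python
def bandwidth_limited_by_slowest(server_speeds):
    """All K downloads run at the speed of the slowest server."""
    return len(server_speeds) * min(server_speeds)


def bandwidth_if_independent(server_speeds):
    """Each of the K downloads runs at its own server's speed."""
    return sum(server_speeds)


speeds = [10, 4, 6]  # hypothetical speeds for K=3 servers
print(bandwidth_limited_by_slowest(speeds), bandwidth_if_independent(speeds))
```

With one slow server in the set, the first figure is noticeably lower than the
second, and the gap grows as the speeds become more uneven.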
So, if we can take as a given that larger values of K and H are a good thing
(and I'm happy to go into more detail about why that is if anyone likes;
I've glossed over a lot here), then the best way to choose your parameters
is to, ideally, set H=S and then choose the largest K that gives you the
level of reliability you're looking for.
But if you set H=S, then even a single server being unavailable means that
the grid is unavailable for writes. So you want to set H a little smaller
than S. How much smaller? That depends on what level of server
availability you have, and what level of write-availability you require.
I'd like to have 99% write-availability. If we have a 95% individual server
availability and a grid of 20 servers, the probability that at least a given
number of servers is available at any given moment is:
20 servers: 35.8%
19 servers: 73.6%
18 servers: 92.5%
17 servers: 98.4%
16 servers: 99.7%
15 servers: 99.9%
Again, if anyone would like to understand the way I calculated those, just
ask.
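One calculation that gives these figures is a binomial tail: assume each of
the S servers is up independently with probability p, and add up the
probabilities of exactly n, n+1, ..., S servers being up. Whether this is
exactly how the table was produced is an assumption, and the function name is
made up for the example.

```python
from math import comb


def prob_at_least(n_up, s, p):
    """Probability that at least n_up of s independent servers are up."""
    return sum(
        comb(s, i) * p**i * (1 - p) ** (s - i) for i in range(n_up, s + 1)
    )


for n in range(20, 14, -1):
    print(f"{n} servers: {100 * prob_at_least(n, 20, 0.95):.2f}%")
```

Under this model the value for 15 servers is closer to 99.97%, which is
consistent with the 99.9% shown in the list.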
At 99.9% availability, if I can't write to the grid it's more likely because
my network connection is down than because there aren't enough servers to
satisfy H=15.
So, that's why I'd really like everyone to commit to trying to maintain 95+%
availability on individual servers. In practice if you have a situation
which takes your box down for a few days, it's not a huge deal, because more
than likely most of the nodes will have >95% availability, but what we don't
want is a situation (like we have over on volunteergrid1) where a server is
unavailable for weeks.
If you can't commit to keeping your node available nearly all the time, I
would rather you not be in the grid. Sorry if that seems harsh, but I
really want this to be a production grid that we can actually use with very
high confidence that it will always work, for both read and write.
Some thoughts on nodes, node owners and community
by Shawn Willden Tue Jan 18, 2011
original message
1. Each node shall provide no less than 500 GB of storage, and shall
consume no more than min(node_size, 1000 GB).
2. Each node shall maintain an uptime of at least 95% (eventually we may
even build some tools to monitor uptime).
3. Nodes shall not be co-located without consensus approval by the group.
4. Each node's nickname shall include the operator's e-mail address. The
recommended form is "<e-mail address>-<hostname>", though operators who
provide only one node may omit the hostname.
5. Each node shall generally be no more than two releases behind the
current Tahoe-LAFS version. Node operators are encouraged to delay 1-2
weeks before deploying a new release.
6. All node operators shall be nice to one another when addressing any
violations of the above rules.