
VolunteerGrid2 Philosophies

We have some philosophies that have guided the decisions that resulted in VolunteerGrid2 Policies:

Why storage capacity variation should be small

by Shawn Willden Sat Jan 15, 2011 original message

At first glance, it doesn't appear that there's any problem with allowing both large and small nodes in the same grid, as long as the owners of the nodes "play fair", meaning the small node owner only consumes capacity less than or equal to the capacity he provides (including consideration of encoding expansion).

The problem arises because the large contributor also expects to be able to consume a large amount of storage -- it's only fair, right? But what happens is that all of the small nodes quickly get filled up leaving only the large nodes with any capacity, and if there aren't enough of them to achieve good dispersion, the grid is effectively full.

Consider an extreme example: 10 nodes, nine of which provide 10 GB and one of which provides 1 TB. The total storage in the grid is 1090 GB, but as soon as 100 GB has been uploaded, all of the small servers are full. The 1 TB server still has 990 GB available, but it's unusable by anyone who actually wants the reliability benefits of distributing their data. So the true capacity of this grid is only 100 GB, and the additional 990 GB on the large node offers no value to the grid -- but the owner of that node may well have been the one who filled the grid, believing that it was fair for him to do so.

In general, if H is the "servers-of-happiness" setting, the grid becomes effectively full as soon as all but the largest H-1 servers are full. So to determine the actual capacity of the grid, take the H-1 largest servers and assume they provide the same amount of storage as the Hth-largest server, then sum. Obviously, the "fullness" of a grid with wide capacity variation will depend on what you choose for H.
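The rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the use of GB as the unit are assumptions, and the result is raw share storage, before any allowance for encoding expansion.

```python
def effective_capacity(capacities_gb, h):
    """Usable grid capacity when every upload needs h distinct servers.

    Hypothetical helper: capacities_gb is one number per server, h is the
    servers-of-happiness setting.
    """
    if not 1 <= h <= len(capacities_gb):
        raise ValueError("h must be between 1 and the number of servers")
    ranked = sorted(capacities_gb, reverse=True)
    cap = ranked[h - 1]  # capacity of the Hth-largest server
    # The H-1 largest servers count only as much as the Hth-largest one;
    # every smaller server counts in full.
    return sum(min(c, cap) for c in ranked)


# The extreme example from the text: nine 10 GB nodes and one 1 TB node.
grid = [10] * 9 + [1000]
print(effective_capacity(grid, 2))
```

With H=1 nothing is clamped and the function simply returns the nominal total, which is why the choice of H matters so much on a grid with wide capacity variation.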

For a while, volunteergrid #1 was in the state that it was "full" for anyone with H>5, even though there were terabytes of free storage, and even though there were nearly 20 servers in the grid. It has now been fixed, but I don't have much confidence it will stay fixed, because some of the nodes that became non-full don't actually have very much storage available.

A related point is that there is a lot of value in setting K, H and N to be significantly larger values than the defaults. I'll touch more on that in my post about why we should institute an uptime commitment, but it's relevant here just because wide variation in capacity means you have to start reducing H to less than optimal values, which forces you to reduce K as well. Optimally, you really want to set H to be nearly S (the number of nodes in the grid), so that's bad.

I can think of two ways to avoid this problem.

1. Allow nodes of any capacity, but institute a limit on how much any node operator can upload to the grid, in addition to the "fairness" rule. Specifically, compute total grid capacity as defined above (picking a generous H, and I suggest H should be around 3/4 of S), divide that by the number of nodes and specify that no one is allowed to consume more than that amount of storage, no matter how much they provide. This will ensure the grid never fills up before anyone has reached their fair share, but it's kind of complicated, and it means that a few very small nodes will impose an artificially-low limit on maximum usage.

2. Keep all nodes pretty close to the same capacity -- maybe limited to no more than 3x between largest and smallest.

Hmm. I started out typing this thinking I wanted to recommend option 2. Now I'm thinking that maybe option 1 is better. It's not that complicated to compute and it means we don't have to place as many artificial restrictions on contributed nodes.
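As a rough sketch of how the option 1 limit might be computed: the helper names below are hypothetical, and rounding 3/4 of S up to a whole number of servers is an assumption, since the text only says H "should be around 3/4 of S".

```python
import math


def effective_capacity(capacities_gb, h):
    """Sum of capacities with the H-1 largest clamped to the Hth-largest."""
    ranked = sorted(capacities_gb, reverse=True)
    cap = ranked[h - 1]
    return sum(min(c, cap) for c in ranked)


def max_usage_per_operator(capacities_gb, h_fraction=0.75):
    """Per-operator upload limit: effective capacity divided by node count.

    h_fraction=0.75 follows the suggestion that H be about 3/4 of S;
    rounding up with ceil is an assumption made for this sketch.
    """
    s = len(capacities_gb)
    h = max(1, math.ceil(h_fraction * s))
    return effective_capacity(capacities_gb, h) / s


# Nine 10 GB nodes and one 1 TB node: the small nodes set the limit.
print(max_usage_per_operator([10] * 9 + [1000]))
# Fifteen 500 GB nodes plus five 2 TB nodes: the extra 1.5 TB each is ignored.
print(max_usage_per_operator([500] * 15 + [2000] * 5))
```

The second call shows the drawback mentioned in option 1: the limit is driven by the smaller nodes, so large contributions beyond the Hth-largest capacity do not raise anyone's allowance.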

Obviously, both options will require some discussion/negotiation on the list regarding minimum node capacities. If we end up with a max-usage value of 10 GB, for example, then this grid isn't useful to me (though I'll happily contribute 10 GB anyway). Ideally, I want a grid with a max-usage value of at least 500 GB, though that may not be achievable.

Why high availability is crucial

by Shawn Willden Sat Jan 15, 2011 original message

It may seem that the reason maintaining high node uptime is important is so that files can be retrieved reliably, i.e. read-availability. In fact, the bigger hurdle is maintaining write-availability. This is fairly obvious, since to read you only need K servers and to write you need H servers and usually H is significantly larger than K.

I think it's even more important than it appears, however, because I think there's value in setting H very close to S (the number of servers in the grid). If S=20 and H=18, then clearly it's crucial that availability of individual servers be very high, otherwise the possibility of more than two servers being down at once is high, and the grid is then unavailable for writes.

So, why would you want to set H very high, rather than just sticking with the 3/7/10 parameters provided by default?

There are two reasons you might want to increase H. The first is to increase read-reliability and the second is so that you can increase K and reduce expansion while maintaining a certain level of read-reliability. For purposes of determining the likelihood that a file will be available at some point in the future, I ignore N. Setting H and N to different values is basically saying "I'll accept one level of reliability, but if I happen to get lucky I'll get a higher one". That's fine, but when determining what parameters to choose, it's H and K that make the difference. In fact if S happens to decline so that at the moment of your upload S=H, then any value of N > H is a waste.

If you want to find out what kinds of reliability you can expect from different parameters, there's a tool in the Tahoe source tree. Unfortunately, I haven't done the work to make it available from the web UI, but if you want you can use it like this:

1. Go to the tahoe/src directory.

2. Run python without any command-line arguments to start the Python interpreter.

3. Type "import allmydata.util.statistics as s" to import the statistics module and give it a handy label (s).

4. Type "s.pr_file_loss([p]*H, K)", where "p" is the server reliability, and H and K are the values you want to evaluate.

What value to use for p? Well, ideally it's the probability that the data on the server will not become lost before your next repair cycle. To be conservative, I just use the server availability target, which I'm proposing is 0.95.

The value you get is an estimate of the likelihood that your file will be lost before the next repair cycle. If you want to understand how it's calculated and maybe argue with me about its validity, read my lossmodel paper (in the docs dir). I think it's a very useful figure.

However, unless you're only storing one file, it's only part of the story. Suppose you're going to store 10,000 files. On a sufficiently-large grid (which volunteergrid2 will not be), you can model the survival or failure of each file independently, which means the probability that all of your files survive is "(1-s.pr_file_loss([p]*H, K))**10000". Since volunteergrid2 will not be big enough for the independent-survival model to be accurate, the real estimate would fall somewhere between that figure and "1-s.pr_file_loss([p]*H, K)", which is the single-file survival probability. To be conservative, I choose to pay attention to the lower probability, which is the 10,000-file number.
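For readers without a Tahoe source tree at hand, the same kind of estimate can be approximated with a plain binomial model. The sketch below is not the allmydata.util.statistics code and may differ from it; it assumes H identical servers that each keep their share independently with probability p, and the function names are made up for illustration.

```python
from math import comb


def file_loss(p, h, k):
    """Probability that fewer than k of h shares survive.

    Assumes each share survives independently with probability p.
    """
    return sum(comb(h, i) * p**i * (1 - p) ** (h - i) for i in range(k))


def all_files_survive(p, h, k, n_files):
    """Probability that every one of n_files survives, treating files
    as independent (the optimistic large-grid model from the text)."""
    return (1 - file_loss(p, h, k)) ** n_files


p = 0.95
for k, h in [(3, 7), (7, 15), (35, 50)]:
    print(k, h, file_loss(p, h, k), all_files_survive(p, h, k, 10000))
```

Comparing the two printed columns for a given K/H pair shows how much the many-file figure can differ from the single-file one, which is the reason for paying attention to the more conservative number.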

Anyway, if you use that tool and spend some time playing with different values of H and K, what you find is that if you increase H you can increase K and reduce your expansion factor while maintaining your survival probability. If you think about it, this makes intuitive sense, because although you're decreasing the amount of redundancy, you're actually increasing the number of servers that must fail in order for your data to get lost. With 3/7, if five servers fail, your data is gone. With 7/15, nine servers must fail. With 35/50, 16 must fail. Of course that's five out of seven, nine out of 15 and 16 out of 50, but still, with relatively high availability numbers, the odds of those failure rates are very close to the same.
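The failure counts quoted above follow from simple arithmetic: a file needs K of its H shares, so it is lost once H-K+1 servers have failed. A small sketch, with hypothetical function names and with expansion taken as H/K (that is, assuming N=H, in line with ignoring N above):

```python
def failures_to_lose(k, h):
    """Number of server failures at which fewer than k shares remain."""
    return h - k + 1


def expansion(k, h):
    """Storage expansion factor, assuming N = H shares are stored."""
    return h / k


for k, h in [(3, 7), (7, 15), (35, 50)]:
    print(f"{k}/{h}: lost after {failures_to_lose(k, h)} failures, "
          f"expansion {expansion(k, h):.2f}x")
```

The loop makes the trade-off visible: the larger parameter sets tolerate more failures in absolute terms while storing less redundant data per file.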

From a read-performance perspective there's also some value in increasing K, because it will allow more parallelism of downloads -- at least in theory. With the present Tahoe codebase that doesn't help as much as it should, but it will be fixed eventually. (At present, you do download in parallel from K servers, but all K downloads are limited to the speed of the slowest, so your effective bandwidth is K*min(server_speeds). If that were fixed, it would just be the sum of the bandwidth available to the K servers.)
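The two bandwidth figures in that parenthetical can be compared directly. This is only a toy model of the description above, not a measurement of Tahoe itself; the function names are hypothetical and server_speeds is assumed to list the K servers actually used for the download.

```python
def current_bandwidth(server_speeds):
    """All K downloads run at the pace of the slowest server."""
    return len(server_speeds) * min(server_speeds)


def ideal_bandwidth(server_speeds):
    """Each of the K servers contributes its full bandwidth."""
    return sum(server_speeds)


speeds = [1.0, 4.0, 10.0]  # made-up speeds for K=3 servers, any unit
print(current_bandwidth(speeds), ideal_bandwidth(speeds))
```

When all servers are equally fast the two figures coincide, so the gap only matters on grids where server speeds vary widely.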

So, if we can take as a given that larger values of K and H are a good thing (and I'm happy to go into more detail about why that is if anyone likes; I've glossed over a lot here), then the best way to choose your parameters is to, ideally, set H=S and then choose the largest K that gives you the level of reliability you're looking for.

But if you set H=S, then even a single server being unavailable means that the grid is unavailable for writes. So you want to set H a little smaller than S. How much smaller? That depends on what level of server availability you have, and what level of write-availability you require.

I'd like to have 99% write-availability. If we have a 95% individual server availability and a grid of 20 servers, the probability that at least a given number of servers is available at any given moment is:

20 servers: 35.8%
19 servers: 73.6%
18 servers: 92.5%
17 servers: 98.4%
16 servers: 99.7%
15 servers: 99.9%

Again, if anyone would like to understand the way I calculated those, just ask.
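One way to arrive at figures like these is a binomial tail sum, assuming every server is up independently with the same probability. Whether this is exactly the method used for the numbers above is an assumption; the sketch below agrees with them to within about 0.1 percentage point, and the function name is made up for illustration.

```python
from math import comb


def prob_at_least(m, s, p):
    """Probability that at least m of s servers are available, each
    being up independently with probability p."""
    return sum(comb(s, i) * p**i * (1 - p) ** (s - i) for i in range(m, s + 1))


# 20 servers at 95% individual availability, as in the text.
for m in range(20, 14, -1):
    print(f"{m} servers: {prob_at_least(m, 20, 0.95):.4f}")
```

Setting m to the chosen H gives the write-availability of the grid under this model, so the same function can be used to see how far below S the value of H has to sit for a given availability target.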

At 99.9% availability, if I can't write to the grid it's more likely because my network connection is down than because there aren't enough servers to satisfy H=15.

So, that's why I'd really like everyone to commit to trying to maintain 95+% availability on individual servers. In practice if you have a situation which takes your box down for a few days, it's not a huge deal, because more than likely most of the nodes will have >95% availability, but what we don't want is a situation (like we have over on volunteergrid1) where a server is unavailable for weeks.

If you can't commit to keeping your node available nearly all the time, I would rather that you're not in the grid. Sorry if that seems harsh, but I really want this to be a production grid that we can actually use with very high confidence that it will always work, for both read and write.

Some thoughts on nodes, node owners and community

by Shawn Willden Tue Jan 18, 2011 original message

1. Each node shall provide no less than 500 GB of storage, and shall consume no more than min(node_size, 1000 GB).

2. Each node shall maintain an uptime of at least 95% (eventually we may even build some tools to monitor uptime).

3. Nodes shall not be co-located without consensus approval by the group.

4. Each node's nickname shall include the operator's e-mail address. The recommended form is "<e-mail address>-<hostname>", though operators who provide only one node may omit the hostname.

5. Each node shall generally be no more than two releases behind the current Tahoe-LAFS version. Node operators are encouraged to delay 1-2 weeks before deploying a new release.

6. All node operators shall be nice to one another when addressing any violations of the above rules ;-)

Topic revision: r5 - 2011-04-27 - JodyHarris
 