
September 07, 2011



AB Periasamy

Good point. It is coming in 3.3 under the name "pro-active self-heal". 3.2 introduced a marker framework; we can now query the file system, asking exactly for the changes that affected a node during its down time. 3.3 also brings in server-to-server healing (faster) and granular byte-range healing, which is important for fast on-the-fly healing of VM images.

-AB Periasamy
CTO, Gluster.

Chip Salzenberg

Well, AB, that sounds excellent! I will watch for 3.3.

Jeff Darcy

While I have my own concerns about GlusterFS's replication and self-healing, I think phrases like "failed to take replication seriously" are misleading and counterproductive. At least GlusterFS has replication, and ways to force repair. Most alternatives such as Lustre or PVFS don't even get that far; their non-solution is to use RAID to handle disk failures and roll-your-own failover through shared storage to handle node failures. That's *real* failure to take replication seriously. On-demand repair might not be what you or I are used to, and it has shown itself not to be what users want, but it's better than some other options.

...which brings us to Cassandra, which also relies on read repair. In their case it makes a lot more sense due to their read:write ratio, the presence of full version vectors that allow the repair to be more accurate, the R+W>N consistency model, and other factors. When you're dealing with truly large amounts of data, and with network latencies as well, fsck-like scans of every object in the system are the truly insane choice - which is why both Cassandra and GlusterFS are tending toward Merkle-tree types of approaches. The Dynamo approaches on which Cassandra is based work, and do a pretty good job of incurring load when and where they should.

All that said, I still think the GlusterFS approach sacrifices more performance than necessary in the non-failure case. The way they do both replication and distribution is also IMO not flexible enough, foregoing opportunities for better load/capacity balancing and forcing new capacity to be added N bricks at a time. I hope to address some of these issues by providing a different kind of replication in HekaFS (which is based on GlusterFS) but it'll take a while. Meanwhile, I think GlusterFS is doing a fine job even though it's not as good as it could be.

P.S. For extra fun, consider how you'd do repair in an asynchronously distributed environment with pure-client-side encryption. It turns out that read repair is pretty much the only option.

Chip Salzenberg

GlusterFS replication is, without doubt, a joke. Supposedly offering replication without ensuring that when you need a complete and correct replica you will have it is nothing more than a joke. Unless it's bait-and-switch: "Sure we have replication. See, we write the data more than once! [usually] No problem!" Pick your explanation: incompetence or deception.

They claim 3.3 will be better. We'll see. They've hardly proven themselves trustworthy.

WRT Cassandra: If you had read my article and understood it, you'd understand that Cassandra's read repair still offers no guarantee that it worked because replication events triggered by the read repair are just as vulnerable to loss as the originals were. And you'll never know until it's too late.

Meanwhile, a Cassandra node repair is even *worse* than a typical fsck, because the *good* data have to be copied too. Have you experienced it yourself, on a machine at capacity (actually half capacity because Cassandra steals 50% of your disk space)? I have, which is why I know it's a joke.

Chip Salzenberg

PS - as long as the pure client-side encryption is -consistent- there is no need for read repair. Nodes can flood propagate the encrypted blobs just fine, without any need to understand them.

Jeff Darcy

In fail-stop cases, GlusterFS does have a complete and correct replica, and enough information to know which one that is. The only thing it won't have is N complete and current copies, until that state is checked and repair is done. No, it doesn't handle partition/split-brain situations, nor (unlike Cassandra) does it claim to. It's essentially the same guarantee as RAID-1, which would also fail if different data were written to each replica. That might not be ideal, but it's certainly sufficient for many. Also, GlusterFS's modular architecture allows a different kind of replication to be implemented without having to re-write every part of a distributed filesystem. In fact, that's something I'm working on. Want to help? Let us all know what other open-source distributed filesystem solves these problems better.
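The "enough information to know which one that is" claim can be made concrete. The sketch below is loosely modeled on the pending-counter scheme GlusterFS's AFR uses (each replica accuses its peers of missing writes until the write is acknowledged everywhere); every name in it is illustrative, not the real AFR implementation.

```python
# Sketch: in a fail-stop scenario, a replica pair can always identify the
# complete copy by keeping "pending" counters against peers. Illustrative
# model only -- not GlusterFS/AFR code.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = None
        self.pending = {}  # pending[other] > 0: writes may be missing there

    def mark_pending(self, other):
        self.pending[other] = self.pending.get(other, 0) + 1

    def clear_pending(self, other):
        self.pending[other] = 0

def replicated_write(replicas, value, alive):
    """Mark pending on all peers first, write, then clear pending only
    for peers that actually took the write."""
    live = [r for r in replicas if alive(r)]
    for r in live:
        for other in replicas:
            if other is not r:
                r.mark_pending(other.name)
    for r in live:
        r.data = value
    for r in live:
        for other in replicas:
            if other is not r and alive(other):
                r.clear_pending(other.name)

def good_copy(replicas):
    """A replica that no peer holds pending counts against is complete."""
    for r in replicas:
        accused = any(o.pending.get(r.name, 0) > 0
                      for o in replicas if o is not r)
        if not accused:
            return r
    return None  # mutual accusations: split-brain, no safe answer

a, b = Replica("a"), Replica("b")
replicated_write([a, b], "v1", alive=lambda r: True)    # both up
replicated_write([a, b], "v2", alive=lambda r: r is a)  # b fails mid-flight
print(good_copy([a, b]).name)  # a -- still complete and current
```

The `None` return is exactly the split-brain case conceded above: when each side accuses the other, no local information can pick a winner.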

Your PS about encrypted data etc. is only correct if you're always writing whole files, which is a tyro's assumption. If you want to support proper POSIX byte-oriented access, you have to be able to merge partially modified blocks. If servers don't have keys they can't do that themselves, and with async replication they can't count on calling out to a client to do it for them at the time a remote write is received. That leaves the option of having a client do the merge and write back the result *at their convenience* - i.e., read repair.

There are a lot of VFS/POSIX and distributed-system details and concerns involved here. It would be nice if people would make at least a token effort to understand those before slinging insults. At least GlusterFS and Cassandra haven't been in vapor stage for ten years and counting.

Chip Salzenberg

"No, it doesn't handle partition/split-brain situations"

Well, that's something agreed. Good.

"Your PS about encrypted data etc. is only correct if you're always writing whole files, which is a tyro's assumption."

You'll note I said "blobs", not "files". Blobs can be sub-file elements. If the design of GlusterFS doesn't allow for encrypted sub-file blobs, that doesn't harm my argument. Perhaps such a limitation was accepted for good reason.

In any case, the concept of read repair is not at fault here; demanding some client cooperation is acceptable, if not ideal. It's the implementation's requirement that the client Just Has To Know when to do it that is the flaw: lack of notice when guarantees are not being met. How is a client supposed to know when a read repair pass is required? Perhaps more tellingly, when can a client know for certain that *no* read repair passes are required because connectivity has been perfect?

"At least ... vapor stage ..."

Want to help? Oh, sorry, I slipped into Darcy mode there for a second. If you mean to bring in an ad hominem argument, you're not doing very well. You should know more about what you criticize (he wrote with deliberate sarcasm). Perl 5 is hardly vaporware. It's not even dead. :-)

Jeff Darcy

The changed definition of "blob" doesn't change anything, Chip. A filesystem has to handle the general case of overlapping writes at no more than byte granularity. Assuming that all writes will be to discrete "blobs" that do not exist in the filesystem standards or literature, but whose boundaries magically align with the filesystem's own, is functionally equivalent to assuming whole-file writes. Similarly, we seem to have gone from read repair being "ludicrous" without qualification to "not at fault" now. If you're going to base your flames on assumptions and claims so ill-justified that even you aren't willing to stand behind them, you don't get to act all aggrieved when people respond in a tone no worse than your own.

With regard to the issue of knowing when repair is done, I think we're really talking about two sorts of guarantees here: a short-term guarantee to the user, and a long-term guarantee to the operator. The guarantee in the first case is simply that if a write completes successfully in a system with N replicas then that data will be read subsequently despite N-1 failures in between. Both systems we're talking about satisfy that guarantee (in Cassandra's case as long as R+W>N) in different ways. There is no guarantee that there will be more than one copy of current data, that flow control (beyond that of the underlying transport) will be applied, and so on.
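The R+W>N guarantee mentioned above is just pigeonhole arithmetic: if every write reaches W of N replicas and every read consults R of them, R+W>N forces the read set to intersect the write set, so at least one replica consulted holds the latest write. A brute-force check makes this concrete (illustrative code, not any database's implementation):

```python
# Brute-force verification that R+W > N forces every read quorum to
# intersect every write quorum -- the basis of the consistency guarantee
# discussed above.
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """True iff every r-subset of n nodes intersects every w-subset."""
    nodes = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))

print(quorums_always_overlap(3, 2, 2))  # R+W=4 > N=3: always overlap
print(quorums_always_overlap(3, 1, 1))  # R+W=2 <= 3: stale reads possible
```

With W=1 (or R=1 and W<N) the sets can be disjoint, which is precisely the configuration where the short-term guarantee evaporates.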

In the operational long term, there should be a guarantee that the system will restore itself to N replicas in finite time when there are no new requests. It's not feasible to expect the same when new requests continue to come in, or to expect that the current replication state will be perfectly knowable at one place and time. There is no single queue as is implied by your "replication should be a queue" edict, because of propagation delays and the infeasibility of imposing a single global order or view on such a system. One cannot usefully reason about a distributed system while clinging to single-system assumptions about omniscient observers or omnipotent actors.

As it turns out, the non-deterministic nature of repair in GlusterFS *is* a flaw IMO. I've complained about it for years, and instead of just complaining I've actually posted code to address it. I think you're wrong to indict Cassandra on the same point, though. In addition to read repair to address short-term inconsistency, there is another mechanism which you don't even seem aware of to address long-term inconsistency. It's usually called "anti-entropy" and it's based on Merkle trees to discover which parts of the whole data set differ. Unlike your "fire and forget" assumption, it's autonomously kicked off when nodes rejoin, and not treated as complete until successful reconciliation is positively indicated. It's all right there in the Dynamo papers. Its implementation in Cassandra came rather late, and is unfortunately afflicted by some performance warts, but AFAIK it does work and provides the relevant guarantees.
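The Merkle-tree trick referred to above is worth spelling out: two replicas exchange range hashes and recurse only into subtrees whose hashes differ, so reconciliation cost scales with the size of the *difference* rather than the data set. A minimal sketch (illustrative only; Cassandra's real repair machinery is far more involved, and a real tree caches interior hashes rather than recomputing them):

```python
# Sketch of Merkle-tree anti-entropy: find the key ranges where two
# replicas disagree without comparing every item.
import hashlib

def h(*parts):
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def range_hash(items, lo, hi):
    """Hash of the sorted key/value pairs with keys in [lo, hi)."""
    keys = sorted(k for k in items if lo <= k < hi)
    return h(*(f"{k}={items[k]}" for k in keys))

def diff_ranges(a, b, lo, hi, leaf=4):
    """Return key ranges where replicas a and b disagree.
    Recomputes range hashes for brevity; a real tree caches them."""
    if range_hash(a, lo, hi) == range_hash(b, lo, hi):
        return []                      # identical subtree: skip entirely
    if hi - lo <= leaf:
        return [(lo, hi)]              # small enough: repair this range
    mid = (lo + hi) // 2
    return diff_ranges(a, b, lo, mid, leaf) + diff_ranges(a, b, mid, hi, leaf)

a = {k: "v" for k in range(100)}
b = dict(a)
b[37] = "stale"   # one divergent value
del b[90]         # one missing value
print(diff_ranges(a, b, 0, 100))  # [(37, 40), (90, 93)]
```

Only two small ranges out of a hundred keys need to be transferred, which is the whole point of preferring this to an fsck-style full scan.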

That brings us to the last point: what the hell do you expect here? That somebody will provide you with software, for free, that satisfies your every (as yet incompletely stated) need? That such software will work exactly as you think it should, even though your ideas about that are based on assumptions that the people actually implementing such systems know are invalid or inapplicable? It's constructive to point out that GlusterFS replication does not provide an adequate guarantee of autonomous convergence to a fully repaired state. Relying on an externally-driven full scan is not acceptable in a system of this type and scale. It's not constructive to make up new requirements that have never been seen before, make false claims about Cassandra, or keep changing your argument as points are refuted. As any experienced project leader should know, it's easier to make progress when people seek to understand what's there before they start demanding change. When people enter an unfamiliar field with guns blazing, requiring laborious education before they'll settle enough to engage in a productive dialogue, that only hinders progress. I don't do that to you on crappy-language internals, and you shouldn't be doing it to others on distributed-DB internals.

Chip Salzenberg

To clarify what certainly looked like a backtrack, and maybe was: I don't need client-side crypto so requirements relating to it were not on my mind. Read repair as a vital part of replication is ludicrous in the absence of crypto, but if it's to make crypto better, and you've set yourself an arbitrary-byte-range write requirement, it's not ludicrous ... just inconvenient (modulo the notification issues you acknowledge).

Also, I never changed my definition of blob; we started with different definitions and now have achieved mutual understanding (if not respect). If you say they're equivalent for the theory, I don't know enough to argue.

OK, enough conversational housekeeping.

"In the operational long term, there should be a guarantee that the system will restore itself to N replicas in finite time when there are no new requests." I can entirely agree with this. But "It's not feasible to expect the same when new requests continue to come in, or to expect that the current replication state will be perfectly knowable at one place and time" is I think a straw man. Knowing "replication is keeping up generally, with a queue length max of 5 sec" is plenty precise for operational use. Similarly "single global order" is a straw man; I meant local order, as in, if node A has things to send to node B, it can keep a queue of them so they get delivered, in order, eventually.

Your adjective "non-deterministic" hits the nail on the head, I think. Queues provide determinism, but it's the determinism I'm missing, not the queues.

It'd be neat if Cassandra's anti-entropy feature works entirely as you describe, but that doesn't mean my fire-and-forget description was inaccurate. Supposing a node entirely dies rather than simply being temporarily offline, a failed replication may have gone an arbitrary time without being noticed and retried, and now the node with the only good copy is gone.

As for why I'm doing this; I'm offended by Cassandra's marketing. Cassandra's boosters don't tell the truth, even if perchance they don't lie. Cassandra *isn't* ready for production; I wasted a lot of time on it, and I'm trying to save other people the same pain. Similarly I was considering GlusterFS for a project, and found it wanting for different but related reasons. Perhaps I should rename my blog "Mene Mene Tekel Parsin." Remember it was a king who got that message, so don't feel put down. Granted he wasn't a king for much longer.

PS: Calling Cassandra's performance issues with repair "warts" is very, very generous.

Jeff Darcy

__Knowing "replication is keeping up generally, with a queue length max of 5 sec" is plenty precise for operational use.__

Fair enough. That or something very close to it should generally be achievable.

__Similarly "single global order" is a straw man; I meant local order, as in, if node A has things to send to node B, it can keep a queue of them so they get delivered, in order, eventually.__

In *what* order? The answer is obvious for a single source replicating to a single destination, but it gets murkier for more complex replication scenarios. If the replication between A and B is bi-directional, then the processing of the queues in each direction can become entangled (depending on what ordering guarantees you make to clients). If A can't reach B, that shouldn't stop it from propagating updates to C even though it might not be possible to apply those updates until B comes back up so that proper global order can be determined. That's why I think "replication should be a queue" falls down. It shouldn't be *a* queue, it should be *many* queues with non-trivial relationships between them.
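The "many queues" point can be sketched in a few lines: each node keeps one FIFO per peer, so A-to-B ordering is preserved while an unreachable B stalls only its own queue and C keeps receiving updates. This is a toy model of the idea, not any real system's replication layer:

```python
# Sketch: per-peer replication queues. Local (per-destination) ordering is
# preserved; a partitioned peer blocks only its own queue.
from collections import deque

class Node:
    def __init__(self, name, peers):
        self.name = name
        self.log = []
        self.outbox = {p: deque() for p in peers}  # one FIFO per peer

    def write(self, update):
        self.log.append(update)
        for q in self.outbox.values():
            q.append(update)

    def flush(self, peer, reachable):
        """Drain the queue for one peer, in order, while it's reachable."""
        q = self.outbox[peer.name]
        while q and reachable(peer):
            peer.log.append(q.popleft())

a = Node("A", ["B", "C"])
b = Node("B", [])
c = Node("C", [])
a.write("u1"); a.write("u2")
a.flush(b, reachable=lambda p: False)  # B is partitioned away
a.flush(c, reachable=lambda p: True)   # C still gets everything, in order
print(c.log, b.log)  # ['u1', 'u2'] []
```

What this toy omits is exactly the "non-trivial relationships" part: bi-directional entanglement and the ordering constraints across queues that make the real problem hard.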

__Supposing a node entirely dies rather than simply being temporarily offline, a failed replication may have gone an arbitrary time without being noticed and retried, and now the node with the only good copy is gone.__

I assume "now" above means at the time of a *second* failure, even though you don't say so, because it would be absurd to expect data survival if you had originally written with W=1. Any production installation of Cassandra (or anything similar) should have an associated monitoring infrastructure capable of noticing that a node is down, and hinted handoff should provide adequate data protection against "only good copy is gone" until corrective action is taken. Are you saying that doesn't work, or that it's too burdensome and the whole end-to-end process should be entirely automatic, or something else?
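For readers unfamiliar with the term: hinted handoff means a live node stores a "hint" (the update plus its intended destination) when a replica is down at write time, and replays it when that replica returns, so the data isn't left single-copy awaiting manual repair. A minimal sketch of the concept, with illustrative names (not Cassandra's actual implementation):

```python
# Sketch of hinted handoff: writes destined for a down replica are stashed
# as hints and replayed when the replica rejoins.
from collections import defaultdict

class Cluster:
    def __init__(self, nodes):
        self.data = {n: {} for n in nodes}
        self.up = {n: True for n in nodes}
        self.hints = defaultdict(list)  # per-dead-node pending writes

    def write(self, key, value, replicas):
        for n in replicas:
            if self.up[n]:
                self.data[n][key] = value
            else:
                self.hints[n].append((key, value))  # stash, don't drop

    def node_rejoins(self, n):
        self.up[n] = True
        for key, value in self.hints.pop(n, []):
            self.data[n][key] = value  # replay missed writes

cl = Cluster(["n1", "n2", "n3"])
cl.up["n3"] = False
cl.write("k", "v", replicas=["n1", "n2", "n3"])
cl.node_rejoins("n3")
print(cl.data["n3"])  # {'k': 'v'} -- the missed write was delivered
```

Note the window this leaves open, which is the crux of the dispute above: hints live on the coordinator, so losing that node before replay still loses the extra copy.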

__I'm offended by Cassandra's marketing. Cassandra's boosters don't tell the truth__

There's certainly plenty of boosterism in this space. I like to consider jbellis a friend and there are many other Cassandra folks of whom I think highly, but there are a few current or former Cassandra boosters (especially the ones who switched to being Riak boosters) who seem a bit short on integrity. As much as they might have portrayed Cassandra in an unjustifiably positive light, though, it seems to me that you've portrayed it in an unfairly negative one. You've practically admitted that you didn't know about anti-entropy, for example, after having made statements that seem to assume read repair is the only mechanism for handling failure. Maybe that's an issue of presentation rather than actual knowledge, but it does cast doubt on the validity of your criticism.

Let's say, for the sake of argument, that both Cassandra and GlusterFS are fatally flawed. What else would you recommend that solves these same problems better? Why do you suppose everyone is getting this "wrong" according to your standards? Is it just barely possible that the hard problems they're solving require different models and approaches than the ones you had in your head?

Chip Salzenberg

"It shouldn't be *a* queue, it should be *many* queues with non-trivial relationships between them."

Fair enough. It's close enough to what I asked for, in my inchoate way.

"I assume "now" above means at the time of a *second* failure, even though you don't say so, because it would be absurd to expect data survival if you had originally written with W=1."

Use cases vary. In our case, we are underprovisioned enough by design that we do run W=1 (or its moral equivalent, single master) and watch our replication very carefully. Sometimes we lose data, but when it happens, we know which data and can replicate it from upstream in our overall process, simply by rewinding the upstream feed.

In this environment, knowing that replication is behind by (say) no more than a minute means that after a crash and master/slave swap, we need only rewind the upstream to a minute before the crash and we're all good. Uncertainty and arbitrarily delayed replication in the manner of Cassandra [and, I believe, GlusterFS?] defeat this strategy, and would require us to switch to a highly overprovisioned strategy that supports full traffic at W=2. We would rather not.

As for my memory of anti-entropy, I did know about it at one time; my memory had simply faded. The fact of its having to re-read and re-write the world made it operationally indistinguishable from a mass read repair, which is my excuse, such as it is. Anti-entropy's use of Merkle trees accomplishes a reduction of network traffic, but not a significant change in strategies and weaknesses.

As for what I would consider -not- fatally flawed: I think I've been clear. I want to know for sure how far back in time to go before I reach an assurance that the data written then are safe, and I want that number to be kept as low as possible at all times, hopefully without manual intervention.

Jeff Darcy

Yes, use cases do vary. Your use case is valid, but so are others, and it's impossible to optimize for every case at once. Dynamo and its descendants are explicitly at the AP vertex of the CAP-theorem space, which means that they prioritize availability - including write availability - over consistency during a partition. Bounded replication delay is a kind of consistency. Guaranteeing it would require that live nodes refuse writes when the queue limit has been reached, and that's inconsistent with an AP strategy. You happen to have an upstream data source that can be rewound, but that's not true for most users. It's not reasonable to expect that Cassandra or any other project will sacrifice applicability to other use cases for the sake of a rare one.

GlusterFS is also AP, though I'd have to say less consciously so. It's more like they fell into it. It also preserves write availability at the expense of consistency during a partition, which can result in unbounded replica skew. Worse, they have neither R/W/N controls over consistency nor Merkle trees to make full repair more efficient. (They do use something like Merkle trees in their "geosync" remote replication, but until now we've been talking about the "AFR" local kind so I'll stick to that context.) I've had customers tell me flat out that these issues make GlusterFS replication unusable, and I haven't disagreed. Fortunately, most of them realize that eschewing GlusterFS replication (solving that particular problem by other means) doesn't have to mean throwing the baby out with the bathwater.

All that said, even if *limits* on replication would mean sacrificing properties of these systems that others find desirable, I think asking for *visibility* into replication is reasonable. Maintaining information about which data items remain incompletely replicated, and even a partial order among them (within the limits of what vector clocks can do), wouldn't violate the "spirit" of what they're doing and wouldn't impose too much performance cost either. In fact, a system that maintains and relies on such information could be much more efficient than one that relies on going back and forth to maintain replication state on the objects themselves. That's why I've proposed just that as HekaFS's alternative to GlusterFS's current replication strategy, and you might notice that I even refer to this post in the proposal[1]. Once that information exists for use by the system itself, exposing it to users in some way should be trivial.
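The visibility idea in the last paragraph reduces to: track which writes remain incompletely replicated, and the age of the oldest one is exactly the "how far back must I rewind?" number Chip asked for earlier. A purely illustrative sketch of that bookkeeping (not HekaFS code):

```python
# Sketch: track incompletely replicated writes so replication lag is an
# observable quantity rather than a guess.

class ReplicationTracker:
    def __init__(self, n_replicas):
        self.n = n_replicas
        self.incomplete = {}  # write_id -> (start_time, acks so far)

    def write_started(self, write_id, now):
        self.incomplete[write_id] = (now, 1)  # local copy counts as one

    def replica_acked(self, write_id):
        ts, acks = self.incomplete[write_id]
        if acks + 1 >= self.n:
            del self.incomplete[write_id]     # fully replicated: forget it
        else:
            self.incomplete[write_id] = (ts, acks + 1)

    def max_lag(self, now):
        """Age in seconds of the oldest incompletely replicated write."""
        if not self.incomplete:
            return 0.0
        return now - min(ts for ts, _ in self.incomplete.values())

t = ReplicationTracker(n_replicas=2)
t.write_started("w1", now=100.0)
t.write_started("w2", now=103.0)
t.replica_acked("w2")        # w2 is safe; w1 is still single-copy
print(t.max_lag(now=105.0))  # 5.0 -- rewind five seconds to be safe
```

Exposing `max_lag` to operators is the "trivial" last step referred to above once the system keeps this state for its own purposes.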

[1] https://fedorahosted.org/pipermail/cloudfs-devel/2011-September/000157.html


You guys need to get girlfriends.

John Mark

Hey Chip - just letting you know that 3.3 did resolve much of this. Let me know if you'd like to give it a whirl - http://www.gluster.org/
