Cassandra is a tragedy.
Database replication should be a queue, or otherwise strictly tracked. If a datum is supposed to be replicated from X to Y and Z, and it hasn't reached Z yet, it should get there eventually. The database is allowed to fail to replicate at first, but it is not allowed to just give up.
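Here's a minimal sketch of the model I mean, in Python. Everything in it is hypothetical (the `ReplicationQueue` class, the `send_to` transport call); the point is only that a failed delivery stays owed until it succeeds:

```python
import time
from collections import deque

class ReplicationQueue:
    """Durable-intent replication: an entry stays queued until every
    target replica has acknowledged it. Failure delays delivery; it
    never cancels it."""

    def __init__(self, targets):
        self.targets = targets
        self.pending = deque()   # (datum, set of replicas still owed)

    def write(self, datum):
        # Record the obligation before acknowledging the write.
        self.pending.append((datum, set(self.targets)))

    def drain(self, send_to):
        # Retry forever; a replica that is down now gets the datum later.
        while self.pending:
            datum, owed = self.pending.popleft()
            for replica in list(owed):
                try:
                    send_to(replica, datum)   # hypothetical transport call
                    owed.discard(replica)
                except ConnectionError:
                    pass                      # still owed; try again next pass
            if owed:
                self.pending.append((datum, owed))
                time.sleep(1)                 # back off before retrying
```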
Cassandra views 1000 replication events from X to Z as a flock of birds who set off toward their destination. Most should arrive. If some don't ... well, no problem, you still have at least 950, right?
Also, there is absolutely no backpressure to throttle a writing process that would really like to write as fast as it can but no faster. This makes replication failure much harder to avoid.
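For contrast, backpressure doesn't have to be fancy. A bounded queue is the simplest version: when replication falls behind, the writer blocks instead of sprinting ahead. A sketch in Python, with names of my own invention:

```python
import queue

# A bounded queue is the simplest form of backpressure: when replication
# falls behind, put() blocks the writer instead of letting it run ahead
# of what the replicas can absorb.
replication_log = queue.Queue(maxsize=1000)

def writer(datum):
    replication_log.put(datum)      # blocks once 1000 items are outstanding

def replicator(send_to, replica):
    # Run in its own thread; drains as fast as the replica can take it.
    while True:
        datum = replication_log.get()
        send_to(replica, datum)     # hypothetical transport call
        replication_log.task_done()
```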
To deal with the consequences, Cassandra depends on read repair (which is ludicrous) and a manual process that amounts to read repair of everything all at once.
It's pathetic that the Cassandra guys won't fix these problems, or even acknowledge that they are problems.
Why don't you email the cassandra user or dev list with these questions?
Posted by: Jeremy Hanna | August 27, 2011 at 09:00 PM
Jeremy: You think they don't know? I've spoken to many of the devs in person. My company hired Riptano. It's not like this is a secret, it's their freaking DESIGN PHILOSOPHY: "Lose all the data you want--the user can always make more."
Posted by: Chip Salzenberg | August 28, 2011 at 05:42 PM
Your title isn't a correct generalization. I'm using Cassandra in production to serve >20k operations per second. Many others are also using it successfully.
Read repair does a good job of filling in data that was missed while node availability was diminished, and general repair does a good job of picking up whatever read repair hasn't.
Your statement about the manual process (I assume you mean "repair") amounting to a read repair of everything all at once is incorrect. Repair calculates a highly compact Merkle tree that's cheap to broadcast and compare, and only data identified as missing is relayed back to the node missing it.
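To make that concrete, here's a toy, flattened version of the idea in Python. It's nothing like Cassandra's actual implementation, just the shape of it: hash ranges of the key space on each node, compare the hashes, and stream only the ranges that differ:

```python
import hashlib

def bucket(key, buckets):
    """Deterministically assign a key to a range (toy stand-in for token ranges)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets

def leaf_hashes(data, buckets=8):
    """Hash each range of a node's data; these would be the tree's leaves."""
    hashes = []
    for i in range(buckets):
        chunk = sorted((k, v) for k, v in data.items() if bucket(k, buckets) == i)
        hashes.append(hashlib.sha256(repr(chunk).encode()).hexdigest())
    return hashes

def differing_ranges(a, b):
    """Only ranges whose hashes disagree need any data transferred at all."""
    return [i for i, (x, y) in enumerate(zip(leaf_hashes(a), leaf_hashes(b))) if x != y]

node_x = {"k1": "v1", "k2": "v2", "k3": "v3"}
node_z = {"k1": "v1", "k2": "v2"}          # k3 never arrived at Z
print(differing_ranges(node_x, node_z))    # one mismatched range; only it gets streamed
```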
There are lots of tunables (both in the cassandra server as well as in the client) that allow you to enforce or relax certain behaviors.
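For example, consistency level is tunable per operation on both reads and writes. Pick a write level W and a read level R such that W + R > RF (say, QUORUM for both at replication factor 3) and every read is guaranteed to overlap the most recent acknowledged write. The arithmetic, sketched in Python:

```python
def quorum(rf):
    """Smallest majority of RF replicas."""
    return rf // 2 + 1

def read_sees_latest_write(rf, w, r):
    # W + R > RF forces the read set and write set of replicas to overlap,
    # so a read at level R must touch at least one replica holding the
    # latest write acknowledged at level W.
    return w + r > rf

rf = 3
print(read_sees_latest_write(rf, quorum(rf), quorum(rf)))  # True: QUORUM/QUORUM
print(read_sees_latest_write(rf, 1, 1))                    # False: ONE/ONE may read stale
```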
Cassandra does have its pain points (tweaking till stable, ring reconfiguration, client backpressure as you've mentioned, overzealous disk IO in some cases), but in my experience data loss isn't an issue when all the settings are properly configured.
Posted by: Mina | September 28, 2011 at 06:43 AM
That people are using something doesn't mean that thing is fit for the use; cf. Windows.
I am curious how you know read repair is working. Seems to me that it could be working very badly and you might never know, unless you are so overprovisioned that write replication never fails you and nodes never die.
Full repair is conceptually identical to mass read repair, in that it compares what the nodes have and makes sure they end up sharing what any of them has. It doesn't require sending the full data, but that's not a relevant difference to me. It still requires READING the full data, and IOPS are the more precious resource.
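Back-of-the-envelope, with illustrative numbers of my own (not measurements): hashing every row to build the Merkle tree means scanning the node's entire data set, and that scan competes with live traffic for disk I/O:

```python
# Repair must READ everything in order to hash it, even when almost nothing
# needs to be streamed afterward. Illustrative figures only.
data_per_node_gb = 500
scan_budget_mb_s = 100   # disk bandwidth you can spare for repair

hours = data_per_node_gb * 1024 / scan_budget_mb_s / 3600
print(f"{hours:.1f} hours of pure reads per node, per repair")   # ~1.4
```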
Posted by: Chip Salzenberg | September 28, 2011 at 02:46 PM