This morning there was yet another comment thread on Hacker News about Yet Another outage involving MongoDB and data loss, this time by some company called “CleverTap”.
To summarize: the CleverTap engineering team noticed that the WiredTiger storage engine was faster than MMAPv1 for MongoDB. They decided to … “upgrade the following weekend” (that sentence alone made my eyes bulge).
According to the blog post, they upgraded from 2.6 to 3.0, while simultaneously changing storage engines from MMAPv1 to WiredTiger, while leaving zero secondaries or snapshot nodes with data on MMAPv1. All over the course of 3 days.
(They are also running sharded mongo, with a mere 300 ops/sec on each primary, which RAISES A LOT OF QUESTIONS but I already feel like I’m beating up on these kids so I won’t pursue that.)
(But seriously what the *hell* can you be doing to have such a low request rate, that you need to shard at an infinitesimal volume? Why did you specify it in req/min instead of req/sec? What is the breakdown of reads/writes? What is the lock percentage? What is the avg object size?? Are these like multi-MB documents???? Why did you pause all incoming traffic and process it after the upgrade? If the primary can’t take the extra load, why not rs.syncFrom() a secondary? If that doesn’t work, don’t you have other, bigger problems??)
Most bafflingly of all: why wait only a few minutes after electing a new WiredTiger primary for the first time ever, and then immediately DELETE your only known-good copies of the data on MMAPv1 and re-sync over them with WiredTiger?
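To make that last point concrete, here is a minimal sketch of the kind of pre-flight check I would want in front of any "wipe this node and re-sync it" step. Everything in it is hypothetical: the `Member` record, its fields, and `safe_to_wipe` are names I made up for illustration, not anything from MongoDB or from the CleverTap post. It only encodes the rule "never wipe a node if that leaves you with no healthy, fully synced copy on the old storage engine".

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical description of one replica set member."""
    host: str
    storage_engine: str  # e.g. "mmapv1" or "wiredTiger"
    state: str           # e.g. "PRIMARY", "SECONDARY", "RECOVERING"
    has_full_data: bool  # True once the node holds a complete copy


def safe_to_wipe(target_host, members, old_engine="mmapv1", min_old_copies=1):
    """Return True only if wiping target_host still leaves at least
    min_old_copies healthy, fully synced members on the old engine."""
    remaining = [
        m for m in members
        if m.host != target_host
        and m.storage_engine == old_engine
        and m.has_full_data
        and m.state in ("PRIMARY", "SECONDARY")
    ]
    return len(remaining) >= min_old_copies


# Illustrative replica set: one new WiredTiger primary, two MMAPv1 secondaries.
members = [
    Member("db-a", "wiredTiger", "PRIMARY", True),
    Member("db-b", "mmapv1", "SECONDARY", True),
    Member("db-c", "mmapv1", "SECONDARY", True),
]

print(safe_to_wipe("db-b", members))  # db-c would still hold an MMAPv1 copy

# After db-b has been converted, db-c is the last MMAPv1 copy.
members[1] = Member("db-b", "wiredTiger", "SECONDARY", True)
print(safe_to_wipe("db-c", members))
```

In a real setup the member list would come from whatever you already trust for replica set status, and the "how long has the new engine been primary" question deserves its own check; this sketch covers only the "keep one known-good copy" part.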
Okay. So here’s the thing: you are clearly a team of accidental DBAs. You are operations and software engineers who have found yourselves in charge of the data.
It’s cool. I am too! It’s a really neat and fun place to be in. DBAs and network admins are kind of the last remaining priesthoods in our industry.
There’s a lot of powerful and fun stuff to be done for generalists who pick up specialty knowledge in one of those areas, or specialists (like my neteng friend Leslie) who start bringing their skills back to the generalist side and merging the two.
(Oh Right, We Wrote A Book About This!!!)
My friend Laine and I are writing a book for people on the data side, called “Database Reliability Engineering”, which is pitched towards generalists who want to learn how to deal with data responsibly and effectively.
(Actually that’s a good point, I am supposed to be pitching this book, which is really