技术控

    今日:160| 主题:57868
收藏本版 (1)
最新软件应用技术尽在掌握

[其他] The Accidental DBA

[复制链接]
夏末聆听 投递于 2016-10-2 16:09:26
303 6
This morning there was yet another comment thread on hacker news about Yet Another outage involving MongoDB and data loss, this time by some company called “CleverTap”.
  Recap

  To summarize: the CleverTap engineering team noticed that the WiredTiger storage engine was faster than MMAPv1 for MongoDB.  They decided to … “upgrade the following weekend” (that sentence alone made my eyes bulge).
   According to the blog post , they upgraded from 2.6 to 3.0, while simultaneously changing storage engines from MMAPv1 to WiredTiger, while leaving zero secondaries snapshot nodes with data on MMAPv1.  All over the course of 3 days.
   (They are also running sharded mongo, with a mere 300 ops/sec on each primary, which RAISES A LOT OF QUESTIONS but I already feel like I’m beating up on these kids so I won’t pursue that.)
  Questions …

     (But seriously what the *hell* can you be doing to have such a low request rate, that you
   

The Accidental DBA

The Accidental DBA-1-技术控-involving,summarize,following,weekend,running
    need to shard at an infinitesimal volume?  Why did you specify it in req/min instead of req/sec?  What is the breakdown of reads/writes?  What is the lock percentage?  What is the avg object size??  Are these like multi-MB documents????  Why did you pause all incoming traffic and process it after the upgrade?  If the primary can’t take the extra load, why not rs.syncFrom() a secondary?   If that doesn’t work, don’t you have other, bigger problems??)
     Most bafflingly of all: why wait only a few minutes after electing a new WiredTiger primary for the  first time ever , and then immediately  DELETE your only known-good copies of the data on MMAPv1 and re-sync over them with WiredTiger?
  Accidental DBAs

  Okay.  So here’s the thing: you are clearly a team of accidental DBAs.  You are operations and software engineers who have found yourselves in charge of the data.
   It’s cool.  I am too!  It’s a really neat and fun place to be in.  DBAs and network admins are kind of the last remaining priesthoods in our industry.

The Accidental DBA

The Accidental DBA-2-技术控-involving,summarize,following,weekend,running

   There’s a lot of powerful and fun stuff to be done for generalists who pick up specialty knowledge in one of those areas, or specialists (like my neteng friend Leslie ) who start bringing their skills back to the generalist side and merging the two.
  (Oh Right, We Wrote A Book About This!!!)

   My friend Laine and I are writing a book for people on the data side, called “ Database Reliability Engineering “, which is pitched towards generalists who want to learn how to deal with data responsibly and effectively.
   (Actually that’s a good point, I am supposed to be pitching this book, which is really

The Accidental DBA

The Accidental DBA-3-技术控-involving,summarize,following,weekend,running
mostly Laine with a smidgen of me but it’s going to be awesome.  Consider this your pitch)
   So first, as an accidental DBA, you should definitely buy this book.  Second: stateful services require different skills.   It’s cool that you are running your own databases.  But reading post mortems like this where the conclusion is “MongoDB sucks” makes me fucking grind my teeth.
  Stop treating your databases like stateless services.

  There are lots of ways that MongoDB (and every other database on the planet) really sucks.  Mongo set themselves up for special rage by overpromising too much early on, and seeming tone deaf to criticism from real database engineers.
  But *I* can criticize Mongo all day long.  You children on hacker news who have never run it don’t get to.:stuck_out_tongue:  If you don’t know what the fuck you’re talking about, if you’re cargo culting other people’s years-old complaints, just shut up already.
   Managing stateful services like databases means that you need to be more paranoid than you did with stateless services.  With stateless services the best practices are to to roll early, roll fast, roll often, roll back.  When you’re dealing with state, you need to be careful .
  With stateful services you can’t play it fast and loose like that.  You’re going to have data loss, corruption, unpredictable results, catastrophic failures that you can’t simply roll back from.  Data loss can be ruinous to your company.  (This can also be true for stateless services that sit close to your data and mutate it a lot.)
  But that’s what makes it fun.  :)
  Be paranoid.

   When we were moving from MMAPv1 to RocksDB at Parse, we ran hybrid replica sets for 6-9 months.  We were paranoid.  It was justified!  We spent half a year capturing production workloads and replaying them, electing Rocks primaries and rolling back, and even then keeping snapsh

The Accidental DBA

The Accidental DBA-4-技术控-involving,summarize,following,weekend,running
ots and secondaries of both storage engines for *months*.
  This isn’t because MongoDB sucks.  It’s the nature of the game, it’s the difference between stateful and stateless services.
  Do you know that there was a total query engine rewrite in 2.6?  We spent months flushing out tons of crazy bugs.  Do you know about the index intersection changes?  We helped chase down bugs in those too.  (You’re welcome.)
   You can’t just go “dudes it’s faster” and jump off a cliff.  This shit is basic. Test real production workloads . Have a rollback plan.  (Not for *10 days* … try a month or two.)
  Lessons

   If CleverTap had run their plan past anyone experienced with data, they would have called out all of those completely predictable failures, and advised them to change their plan in the following ways.
  
       
  • Make one change at a time.  Do a major version upgrade  separately from the storage engine upgrade.   
  • Delay between each change.  Two weeks is absolutely minimal, any thing less is careless.  Let them bake.   
  • Storage engine changes are scary.  It takes years to gain confidence in a new way of laying bits down on disk.  (Whenever people bitch and moan about mongo, I remind

    The Accidental DBA

    The Accidental DBA-5-技术控-involving,summarize,following,weekend,running
    them that I’ve still lost WAY more data to MyISAM, InnoDB, and MySQL overall than Mongo.   
  • You can run lots and lots of replicas (up to 7 votes per replica set, even more nodes) per each replica set in Mongo.  This is a killer feature.  Why didn’t you use it?   
  • Keep backups around for months in the new storage engine *and* the old storage engine, just in case.  Have two hidden snapshot nodes.  The only cost is in dollars, which is fucking cheap compared to data or engineering time.  
  If you are a new accidental DBA, you have to make a point of learning things.  Go to conferences.  Read books.  Buy bottles of whiskey for your data friends and pick their brains.  Remember that they know things you do not.  Don’t blame the vendors when you fucked up.
  Network engineering is the same way, but mistakes tend to be a lot less … permanent.  You drop some packets..  like grains of sand. ^_^
  Remember that you’re in charge of keeping people’s data safe and secure.  You have much to learn.  Learn it.
  And get off my fucking lawn.  <3

  Some slides from a couple of relevant talks I’ve given on the subject:
  
       
  • Maslow’s Hierarchy of Needs for Databases : how to database as a startup   
  • Upgrading Databases : without losing your data, your perf or your mind  
   P.S.:“Stop treating your stateful services like stateless services” … this is a fact, but it’s not the aspiration.  DB folks should all be leaning in to the model of learning how to treat our stateful services with the same casual disregard as we do our stateless services.  This is hard, and it’s going to take some time, but it’s clearly where the world is heading and it’s definitely a good thing.  :)  The learning goes both ways!
   

The Accidental DBA

The Accidental DBA-6-技术控-involving,summarize,following,weekend,running



上一篇:ES proposal: Rest/Spread Properties
下一篇:Issue #225
。随便就好 投递于 2016-10-2 19:36:47
你拥有再大再多的水桶,也不如有一个水龙头。说明:”渠道很重要!
回复 支持 反对

使用道具 举报

xuchenmin123 投递于 2016-11-7 14:12:52
沙发抢不到,板凳也行啊!
回复 支持 反对

使用道具 举报

冒险家蜜蜜 投递于 2016-11-13 13:34:38
站位支持
回复 支持 反对

使用道具 举报

廖文熙 投递于 2016-11-14 16:08:36
很不错,谢谢夏末聆听的分享
回复 支持 反对

使用道具 举报

小清新 投递于 2016-11-15 09:05:49
顶顶更健康!
回复 支持 反对

使用道具 举报

wpctm 投递于 2016-11-18 17:15:48
幸好爱情不是一切,幸好一切都不是爱情。
回复 支持 反对

使用道具 举报

我要投稿

回页顶回复上一篇下一篇回列表
手机版/CoLaBug.com ( 粤ICP备05003221号 | 文网文[2010]257号 | 粤公网安备 44010402000842号 )

© 2001-2017 Comsenz Inc.

返回顶部 返回列表