A Dramatic Tour through Python’s Data Visualization Landscape (including ggplot
Posted by 南方北方是远方 on 2016-10-4 03:11:59
Why Even Try, Man?

  I recently came upon Brian Granger and Jake VanderPlas’s Altair, a promising young visualization library. Altair seems well-suited to addressing Python’s ggplot envy, and its tie-in with JavaScript’s Vega-Lite grammar means that as the latter develops new functionality (e.g., tooltips and zooming), Altair benefits — seemingly for free!
   Indeed, I was so impressed by Altair that the original thesis of my post was going to be: “Yo, use Altair.”
   But then I began ruminating on my own Pythonic visualization habits and, in a painful moment of self-reflection, realized I’m all over the place: I use a hodgepodge of tools and disjointed techniques depending on the task at hand (usually whichever library I first used to accomplish that task [1]).
   This is no good. As the old saying goes: “The unexamined plot is not worth exporting to a PNG.”
  Thus, I’m using my discovery of Altair as an opportunity to step back — to investigate how Python’s visualization options hang together. I hope this investigation proves helpful for you as well.
   How’s This Gonna Go?

   The conceit of this post will be: “You need to do Thing X. How would you do Thing X in matplotlib? pandas? Seaborn? ggplot? Altair?”   By doing many different Thing X’s, we’ll develop a reasonable list of pros, cons, and takeaways — or at least a whole bunch of code that might be somehow useful.
  (Warning: this all may happen in the form of a two-act play.)
   The Options (in ~Descending Order of Subjective Complexity)

   First, let’s welcome our friends [2]:
    matplotlib
  The 800-pound gorilla. Like most 800-pound gorillas, this one should probably be avoided unless you genuinely need its power, e.g., to make a really custom plot or produce a publication-ready graphic.
    pandas
   “Come for the DataFrames; stay for the plotting convenience functions that are arguably more pleasant than the matplotlib code they supplant.” — rejected pandas taglines
  (Bonus tidbit: the pandas team must include a few visualization nerds, as the library includes things like RadViz plots and Andrews Curves that I haven’t seen elsewhere.)
    Seaborn
  Seaborn has long been my go-to library for statistical visualization; it summarizes itself thusly:
  “If matplotlib ‘tries to make easy things easy and hard things possible,’ seaborn tries to make a well-defined set of hard things easy too”
    yhat’s ggplot
  A Python implementation of the grammar of graphics. This isn’t a “feature-for-feature port of ggplot2,” but there’s strong feature overlap. (And speaking as a part-time R user, the main geoms seem to be in place.)
    Altair
  The new guy, Altair is a “declarative statistical visualization library” with an exceedingly pleasant API.
  Wonderful. Now that our guests have arrived and checked their coats, let’s settle in for our very awkward dinner conversation. Our show is entitled…
   Little Shop of Python Visualization Libraries (starring all libraries as themselves)  

   ACT I: LINES AND DOTS

  (In Scene 1, we’ll be dealing with a tidy data set named “ts.” It consists of three columns: a “dt” column (for dates); a “value” column (for values); and a “kind” column, which has four unique levels: A, B, C, and D. Here’s a preview…)
               dt  kind      value
    0  2000-01-01     A   1.442521
    1  2000-01-02     A   1.981290
    2  2000-01-03     A   1.586494
    3  2000-01-04     A   1.378969
    4  2000-01-05     A  -0.277937

   Scene 1: How would you plot multiple time series on the same graph?

    matplotlib: Ha! Haha!  Beyond simple. While I  could and  would accomplish this task in any number of complex ways, I know your feeble brains would crumble under the weight of their ingenuity. Hence, I dumb it down, showing you two simple methods. In the first, I loop through your trumped-up matrix — I believe you peons call it a “Data” “Frame” — and subset it to the relevant time series. Next, I invoke my “plot” method and pass in the relevant columns from that subset.
    # MATPLOTLIB
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 1,
                           figsize=(7.5, 5))
    # one call to plot per level of "kind"
    for k in ts.kind.unique():
        tmp = ts[ts.kind == k]
        ax.plot(tmp.dt, tmp.value, label=k)
    ax.set(xlabel='Date',
           ylabel='Value',
           title='Random Timeseries')
    ax.legend(loc=2)
    fig.autofmt_xdate()
    MPL:   Next, I enlist this chump  (*motions to pandas*) , and have him pivot this “Data” “Frame” so that it looks like this…
    # in matplotlib-land, the notion of a tidy
    # dataframe matters not
    dfp = ts.pivot(index='dt', columns='kind', values='value')
    dfp.head()

    kind               A         B         C         D
    dt
    2000-01-01  1.442521  1.808741  0.437415  0.096980
    2000-01-02  1.981290  2.277020  0.706127 -1.523108
    2000-01-03  1.586494  3.474392  1.358063 -3.100735
    2000-01-04  1.378969  2.906132  0.262223 -2.660599
    2000-01-05 -0.277937  3.489553  0.796743 -3.417402

    MPL: By transforming the data into an index with four columns, one for each line I want to plot, I can do the whole thing in one fell swoop (i.e., a single call of my “plot” function).
    # MATPLOTLIB
    fig, ax = plt.subplots(1, 1,
                           figsize=(7.5, 5))
    # a single plot call draws one line per column of the pivoted frame
    ax.plot(dfp)
    ax.set(xlabel='Date',
           ylabel='Value',
           title='Random Timeseries')
    ax.legend(dfp.columns, loc=2)
    fig.autofmt_xdate()
      pandas (*looking timid*): That was great, Mat. Really great. Thanks for including me. I do the same thing — hopefully as good?      (*smiles weakly*)
   
    # PANDAS
    fig, ax = plt.subplots(1, 1,
                           figsize=(7.5, 5))
    # the DataFrame's own plot method, drawn onto the Axes created above
    dfp.plot(ax=ax)
    ax.set(xlabel='Date',
           ylabel='Value',
           title='Random Timeseries')
    ax.legend(loc=2)
    fig.autofmt_xdate()
   pandas: It looks exactly the same, so I just won’t show it.
    Seaborn (*smoking a cigarette and adjusting her beret*): Hmmm. Seems like an awful lot of data manipulation for a silly line graph. I mean, for loops and pivoting? This isn’t the 90’s or Microsoft Excel. I have this thing called a FacetGrid I picked up when I went abroad. You’ve probably never heard of it…
    # SEABORN
    import matplotlib.pyplot as plt
    import seaborn as sns

    # note: newer seaborn releases call the `size` argument `height`
    g = sns.FacetGrid(ts, hue='kind', size=5, aspect=1.5)
    g.map(plt.plot, 'dt', 'value').add_legend()
    g.ax.set(xlabel='Date',
             ylabel='Value',
             title='Random Timeseries')
    g.fig.autofmt_xdate()
    SB: See? You hand FacetGrid your un-manipulated tidy data. At that point, passing in “kind” to the “hue” parameter means you’ll plot four different lines — one for each level in the “kind” field. The way you actually realize these four different lines is by mapping my FacetGrid to this Philistine’s  (*motions to matplotlib*) plot function, and passing in “x” and “y” arguments. There are some things you need to keep in mind, obviously, like manually adding a legend, but nothing too challenging. Well, nothing too challenging for some of us…
    ggplot: Wow, neat! I do something similar, but I    do it like my big bro. Have you heard of him? He’s so coo–
   SB: Who invited the kid?
    GG: Check it out!
    # GGPLOT
    from ggplot import *  # yhat's ggplot package

    fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))
    g = ggplot(ts, aes(x='dt', y='value', color='kind')) + \
            geom_line(size=2.0) + \
            xlab('Date') + \
            ylab('Value') + \
            ggtitle('Random Timeseries')
    g
    GG (*picks up ggplot2 by Hadley Wickham and sounds out words*): Every plot is com... com... com-prised of data (e.g., “ts”), aesthetic mappings (e.g., “x”, “y”, “color”), and the geometric shapes that turn our data and aesthetic mappings into a real visualization (e.g., “geom_line”)!
    Altair: Yup, I do that, too.
    # ALTAIR
    from altair import Chart

    c = Chart(ts).mark_line().encode(
        x='dt',
        y='value',
        color='kind'
    )
    c
    ALT: You give my Chart class some data and tell it what kind of visualization you want: in this case, it’s “mark_line”. Next, you specify your aesthetic mappings: our x-axis needs to be “dt”; our y-axis needs to be “value”; and we want to split by kind, so we pass “kind” to “color.” Just like you, GG (*tousles GG’s hair*). Oh, and by the way, using the same color scheme y’all use isn’t a problem, either:
    # ALTAIR
    from altair import Chart, Color, Scale

    # cp corresponds to Seaborn's standard color palette
    c = Chart(ts).mark_line().encode(
        x='dt',
        y='value',
        color=Color('kind', scale=Scale(range=cp.as_hex()))
    )
    c
  *MPL stares in terrified wonder*
   Analyzing Scene 1

   Aside from matplotlib being a jerk [3], a few themes emerged:
  
       
  • In matplotlib and pandas, you must either make multiple calls to the “plot” function (e.g., once-per-for loop), or you must manipulate your data to make it optimally fit the plot function (e.g., pivoting). (That said, there’s another technique we’ll see in Scene 2.)  
  
       
  • (To be frank, I never used to think this was a big deal, but then I met people who use R. They looked at me aghast.)  
  
       
  • Conversely, ggplot and Altair implement similar “grammar of graphics”-approved ways to handle our simple case: you give their “main” function (“ggplot” in ggplot and “Chart” in Altair) a tidy data set. Next, you define a set of aesthetic mappings (x, y, and color) that explain how the data will map to our geoms (i.e., the visual marks that do the hard work of conveying information to the reader). Once you actually invoke said geom (“geom_line” in ggplot and “mark_line” in Altair), the data and aesthetic mappings are transformed into visual ticks that a human can understand, and thus, an angel gets its wings.
  
       
  • Intellectually, you can — and probably should — view Seaborn’s FacetGrid through the same lens; however, it’s not 100% identical. FacetGrid needs a hue argument  upfront —  alongside your data — but wants the x and y arguments  later . At that point, your mapping isn’t an aesthetic one, but a functional one: for each “hue” in your data set, you’re simply calling matplotlib’s plot function using “dt” and “value” as its x and y arguments. The for loop is simply hidden from you.  
  
       
  • That said, even though the aesthetic maps happen in two separate steps, I prefer the aesthetic mapping mindset to the functional mindset (at least when it comes to plotting).  
  Data Aside
  (In Scenes 2-4, we’ll be dealing with the famous “iris” data set [though we refer to it as “df” in our code]. It consists of four numeric columns corresponding to various measurements, and a categorical column corresponding to one of three species of iris. Here’s a preview…)
       petalLength  petalWidth  sepalLength  sepalWidth  species
    0          1.4         0.2          5.1         3.5   setosa
    1          1.4         0.2          4.9         3.0   setosa
    2          1.3         0.2          4.7         3.2   setosa
    3          1.5         0.2          4.6         3.1   setosa
    4          1.4         0.2          5.0         3.6   setosa

   Scene 2: How would you make a scatter plot?

    MPL (*looking shaken*): I mean, you could do the for loop thing again. Of course. And that would be fine. Of course. See?  (*lowers voice to a whisper*) Just remember to set the color argument explicitly or else the dots will all be blue…
    # MATPLOTLIB
    fig, ax = plt.subplots(1, 1, figsize=(7.5, 7.5))
    # cp is the color palette; pick one color per species by position
    for i, s in enumerate(df.species.unique()):
        tmp = df[df.species == s]
        ax.scatter(tmp.petalLength, tmp.petalWidth,
                   label=s, color=cp[i])
    ax.set(xlabel='Petal Length',
           ylabel='Petal Width',
           title='Petal Width v. Length -- by Species')
    ax.legend(loc=2)
    MPL: But, uh,  (*feigning confidence*) I have a    better   way! Look at this:
    # MATPLOTLIB
    fig, ax = plt.subplots(1, 1, figsize=(7.5, 7.5))
    def scatter(group):
        # called once per species by groupby().apply()
        plt.plot(group['petalLength'],
                 group['petalWidth'],
                 'o', label=group.name)
    df.groupby('species').apply(scatter)
    ax.set(xlabel='Petal Length',
           ylabel='Petal Width',
           title='Petal Width v. Length -- by Species')
    ax.legend(loc=2)
    MPL: Here, I define a function named “scatter.” It will take groups from a pandas groupby object and plot petal length on the x-axis and petal width on the y-axis. Once per group! Powerful!
    P: Wonderful, Mat! Wonderful! Essentially what I would have done, so I will sit this one out.
    SB (*grinning*): No pivoting this time?
    P: Well, in this case, pivoting is complex. We can’t have a common index like we could with our time series data set, and so —
   MPL: SHHHHH! WE DON’T HAVE TO EXPLAIN OURSELVES TO HER.
    SB: Whatever. Anyway, in my mind, this problem is the same as the last one. Build another FacetGrid but borrow plt.scatter rather than plt.plot.
    GG: Yes! Yes! Same! You just gotta swap out geom_line for geom_point!
    ALT (*looking bemused*): Yup — just swap our mark_line for mark_point.
   Analyzing Scene 2

  
       
  • Here, the complications that emerge from adapting your data to your visualization method become more clear. While the pandas pivoting trick was extremely convenient for time series, it doesn’t translate so well to this case.  
  
       
  • To be fair, the “group by” method is somewhat generalizable, and the “for loop” method is very generalizable; however, they require more custom logic, and custom logic either introduces room for error or necessitates reinventing a wheel that Seaborn has already made for you.  
  
       
  • Conversely, Seaborn, ggplot, and Altair all realize that scatter plots are in many ways line plots without the assumptions (however innocuous those assumptions may be). As such, our code from Scene 1 can largely be reused, but with a new geom (geom_point/mark_point in the case of ggplot/Altair) or a new method (plt.scatter in the case of Seaborn). At this juncture, none of these options seems to emerge as particularly more convenient than the other, though I love Altair’s elegant simplicity.
   Scene 3: How would you facet your scatter plot?

    MPL: Well, uh, once you’ve mastered the for loop — as I have, obviously — this is a simple adjustment to my earlier example. Rather than build a single Axes using my subplots method, I build three. Next, I loop through as before, but in the same way I subset my data, I subset to the relevant Axes object.
   (*confidence returning*) AND I WOULD CHALLENGE ANY AMONG YOU TO COME UP WITH AN EASIER WAY! (*raises arms, nearly hitting pandas in the process*)
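  The loop matplotlib describes might look roughly like this. It is a sketch under assumptions: the rows are illustrative stand-ins shaped like the iris preview, and the figure size is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in rows shaped like the iris preview (not the full data set).
df = pd.DataFrame({
    "petalLength": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# Three Axes instead of one; subset the Axes in step with the data.
species = list(df.species.unique())
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, s in zip(axes, species):
    tmp = df[df.species == s]
    ax.scatter(tmp.petalLength, tmp.petalWidth)
    ax.set(xlabel="Petal Length", ylabel="Petal Width", title=s)
fig.tight_layout()
```

  Each panel autoscales to its own subset, so the three x- and y-axes do not line up.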
  *SB shares a look with ALT, who starts laughing; GG starts laughing to appear in on the joke*
    MPL: What is it?!
    Altair: Check your x- and y-axes, man. All your plots have different limits.
    MPL (*goes red*): Ah, yes, of course.  A TEST TO ENSURE YOU WERE PAYING ATTENTION. You can, uh, manually set the limits for each plot.
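  One way to set the limits by hand, sketched under the same assumptions as before (illustrative iris-shaped rows; the 0.5 padding is an arbitrary choice):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in rows shaped like the iris preview (not the full data set).
df = pd.DataFrame({
    "petalLength": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petalWidth": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# Compute one set of limits from the whole frame, then reuse it on every panel.
xlim = (float(df.petalLength.min()) - 0.5, float(df.petalLength.max()) + 0.5)
ylim = (float(df.petalWidth.min()) - 0.5, float(df.petalWidth.max()) + 0.5)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, s in zip(axes, df.species.unique()):
    tmp = df[df.species == s]
    ax.scatter(tmp.petalLength, tmp.petalWidth)
    ax.set(xlim=xlim, ylim=ylim,
           xlabel="Petal Length", ylabel="Petal Width", title=s)
fig.tight_layout()
```

  Passing sharex=True and sharey=True to plt.subplots is another way to get matching axes without computing the limits yourself.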
    P (*sighs*): I would do the same. Pass.
    SB: Adapting FacetGrid to this case is simple. In the same way we have a “hue” argument, we can simply add a “col” (i.e., column) argument. This tells FacetGrid to not only assign each species a unique color, but also to assign each species a unique subplot, arranged column-wise. (We could have arranged them row-wise by passing in a “row” argument rather than a “col” argument.)
    GG: Oooo, this is different from how I do it. (*again picks up ggplot2 and starts sounding out words*) See, faceting and aesthetic mapping are two fundamentally different steps, and we don’t want to in-ad-vert-ent-ly conflate the two. As such, we need to take our code from before but add a “facet_grid” layer that explicitly says to facet by species. (*shuts book happily*) At least, that’s what my big bro says! Have you heard of him, by the way? He’s so cool– [4]
    ALT : I take a more Seaborn-esque approach here. Specifically, I just add a column argument to the encode function. That said, I’m doing a couple of new things here, too: (A) While the column parameter could accept a simple string argument, I actually use a Column object instead — this lets me set a title; (B) I use my configure_cell method, since without it, the subplots would have been way too big.
   Analyzing Scene 3

  
       
  • matplotlib made a really good point: in this case, his code to facet by species is nearly identical to what we saw above; assuming you can wrap your head around the previous for loops, you can wrap your head around this one. However, I didn’t ask him to do anything more complicated — say, a 2 x 3 grid. In that case, he might have had to do something like this:  
  
       
  • To use the formal visualization expression: Yeesh. Meanwhile, in Altair, this would have been wonderfully simple:  
  
       
  • Just one more argument to the “encode” function than we had above!  
  
       
  • Hopefully, the advantages of having faceting built into your visualization library’s framework are clear.  
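  For a sense of the 2 x 3 grid mentioned in the analysis above, here is one way it might be done in matplotlib. This is a sketch, not the original listing: the rows are illustrative, and “batch” is a hypothetical second faceting column invented so there is something to put on the rows.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows; `batch` is a hypothetical second faceting variable.
df = pd.DataFrame({
    "petalLength": [1.4, 1.3, 1.5, 1.4, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6],
    "petalWidth": [0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 2.5, 1.9, 2.1, 1.8],
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
    "batch": ["A", "A", "B", "B"] * 3,
})

row_levels = sorted(df.batch.unique())
col_levels = list(df.species.unique())
fig, axes = plt.subplots(len(row_levels), len(col_levels),
                         figsize=(15, 10), sharex=True, sharey=True)

# Two nested loops: one over row facets, one over column facets.
for i, r in enumerate(row_levels):
    for j, c in enumerate(col_levels):
        tmp = df[(df.batch == r) & (df.species == c)]
        axes[i, j].scatter(tmp.petalLength, tmp.petalWidth)
        axes[i, j].set_title("{} | {}".format(r, c))
fig.tight_layout()
```

  All of the bookkeeping (which subset goes on which Axes, what each title says) is yours to maintain, which is the overhead a built-in faceting argument removes.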
   ACT II: DISTRIBUTIONS AND BARS

   Scene 4: How would you visualize distributions?

    MPL (*confidence visibly shaken*): Well, if we wanted a boxplot — do we want a boxplot? — I have a way of doing it. It’s stupid; you’d hate it. But I pass an array of arrays to my boxplot method, and this produces a boxplot for each subarray. You’ll need to manually label the x-ticks yourself.
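  The array-of-arrays approach might look like this; the rows are illustrative stand-ins shaped like the iris preview, and the labels and figure size are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in rows shaped like the iris preview (not the full data set).
df = pd.DataFrame({
    "petalWidth": [0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 2.5, 1.9, 2.1, 1.8],
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
})

# One sub-array of petal widths per species.
species = list(df.species.unique())
data = [df.loc[df.species == s, "petalWidth"].values for s in species]

fig, ax = plt.subplots(1, 1, figsize=(10, 10))
bp = ax.boxplot(data)
# boxplot only numbers the boxes 1..n, so the x-tick labels are set by hand.
ax.set_xticklabels(species)
ax.set(xlabel="Species", ylabel="Petal Width",
       title="Distribution of Petal Width by Species")
```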
    MPL: And if we wanted a histogram — do we want a histogram? — I have a method for that, too, which you can produce using either the for loop or group by methods from before.
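  A sketch of the for loop flavor, assuming the same illustrative iris-shaped rows (the alpha value and legend position are arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in rows shaped like the iris preview (not the full data set).
df = pd.DataFrame({
    "petalWidth": [0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 2.5, 1.9, 2.1, 1.8],
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
})

fig, ax = plt.subplots(1, 1, figsize=(10, 10))
# One call to hist per species, all drawn onto the same Axes.
for s in df.species.unique():
    tmp = df[df.species == s]
    ax.hist(tmp.petalWidth, label=s, alpha=0.8)
ax.set(xlabel="Petal Width", ylabel="Frequency",
       title="Distribution of Petal Width by Species")
ax.legend(loc=1)
```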
    P (*looking uncharacteristically proud*): Ha! Hahahaha! This is my moment! You all thought I was nothing but matplotlib’s patsy, and although I’ve so far been nothing but a wrapper around his plot method, I possess special functions for both boxplots   and histograms — these make visualizing distributions a snap. You only need two things: (A) The column name by which you’d like to stratify; and (B) The column name for which you’d like distributions. These go to the “by” and “column” parameters, respectively, resulting in instant plots!
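  Here is roughly how the “by” and “column” parameters fit together, sketched with illustrative iris-shaped rows (figure sizes and the 1 x 3 layout are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in rows shaped like the iris preview (not the full data set).
df = pd.DataFrame({
    "petalWidth": [0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 2.5, 1.9, 2.1, 1.8],
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
})

# Boxplot: `by` is the column to stratify on, `column` the one to summarize.
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
df.boxplot(column="petalWidth", by="species", ax=ax)

# Histogram: same two parameters; pandas makes one panel per species.
hist_axes = df.hist(column="petalWidth", by="species",
                    figsize=(15, 5), layout=(1, 3))
```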
   *GG and ALT high five and congratulate P; shouts of “awesome!”, “way to be!”, “let’s go!” audible*

    SB (*feigning enthusiasm*): Wooooow. Greeeeat. Meanwhile, in my world, distributions are exceedingly important, so I maintain special methods for them. For example, my boxplot method needs an x argument, a y argument, and data, resulting in this:
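  A minimal sketch of that x, y, and data call, using illustrative iris-shaped rows (the title and figure size are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in rows shaped like the iris preview (not the full data set).
df = pd.DataFrame({
    "petalWidth": [0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 2.5, 1.9, 2.1, 1.8],
    "species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
})

fig, ax = plt.subplots(1, 1, figsize=(10, 10))
# x is the grouping column, y is the numeric column, data is the tidy frame.
box_ax = sns.boxplot(x="species", y="petalWidth", data=df, ax=ax)
ax.set(title="Distribution of Petal Width by Species")
```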
    SB: Which, I mean, some people have told me is beautiful… but whatever. I  also have a special distribution method named “distplot” that goes beyond histograms  (*looks at pandas haughtily*) . You can use it for histograms, KDEs, and rugplots — even plotting them simultaneously. For example, by combining this method with FacetGrid, I can produce a histo-rugplot for every species of iris:
    SB: But again… whatever.
    GG: THESE ARE BOTH JUST NEW GEOMS! GEOM_BOXPLOT FOR BOXPLOTS AND GEOM_HISTOGRAM FOR HISTOGRAMS! JUST SWAP THEM IN!  (*starts running around the dinner table*)
    ALT (*looking steely-eyed and confident*): I… I have a confession…
  *silence falls — GG stops running and lets plate fall to the floor*
    ALT: (*breathing deeply*) I… I… I can’t do boxplots. Never really learned how, but I trust the JavaScript grammar out of which I grew has a good reason for this. I can make a mean histogram, though…
    ALT: The code may look weird at first glance, but don’t be alarmed. All we’re saying here is: “Hey, histograms are effectively bar charts.” Their x-axes correspond to bins, which we can define with my Bin class; meanwhile, their y-axes correspond to the number of items in the data set which fall into those bins, which we can explain using a SQL-esque “count(*)” as our argument for y.
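  The “histograms are effectively bar charts” idea can be spelled out by hand. To be clear, the sketch below is not Altair code: it swaps in pandas’ cut function for the binning step and a plain count per bin for “count(*)”, then draws ordinary matplotlib bars. The values and bin edges are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative petal widths (not the full iris data set).
values = pd.Series([0.2, 0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3,
                    2.5, 1.9, 2.1, 1.8], name="petalWidth")

# Step 1: the x-axis is a set of bins.
binned = pd.cut(values, bins=[0.0, 0.5, 1.0, 1.5, 2.0, 2.5])

# Step 2: the y-axis is the number of items that fall into each bin.
counts = binned.value_counts().sort_index()

# Step 3: draw the counts as an ordinary bar chart.
fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))
positions = list(range(len(counts)))
ax.bar(positions, counts.values)
ax.set_xticks(positions)
ax.set_xticklabels(counts.index.astype(str))
ax.set(xlabel="Petal Width (binned)", ylabel="Count")
```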
   Analyzing Scene 4

  
       
  • In my work, I actually find pandas’ convenience functions very convenient; however, I’ll admit that there’s some cognitive overhead in remembering that pandas has implemented a “by” parameter for boxplots and histograms but not for lines.  
  
       
  • I separate Act 1 from Act 2 for a few reasons, and a big one is this: Act 2 is when using matplotlib gets particularly hairy. Remembering a totally separate interface when you want a boxplot, for example, doesn’t work for me (and just wait until we get to bar charts!).  
  
       
  • Speaking of Act 1 v. Act 2, a fun story: I actually came to Seaborn from matplotlib/pandas for its rich set of “proprietary” visualization functions (e.g., distplot, violin plots, regression plots, etc.). While I later learned to love FacetGrid, I maintain that it’s these Act 2 functions which are Seaborn’s killer app. They’ll keep me a Seaborn fan as long as I plot.  
  
       
  • These examples are really when you begin to grok the power of ggplot’s geom system. Using mostly the same code (and more importantly, mostly the same thought process), we create a wildly different graph. We do this not by calling an entirely separate function, but by changing how our aesthetic mappings get presented to the viewer, i.e., by swapping out one geom for another.  
  
       
  • Similarly, even in the world of Act 2, Altair’s API remains remarkably consistent. Even for what feels like a different operation, Altair’s API is simple, elegant, and expressive.  
  Data Aside
  (In the final scene, we’ll be dealing with “titanic,” another famous tidy dataset [although again, we refer to it as “df” in our code]. Here’s a preview…)
       survived  pclass     sex   age     fare  class
    0         0       3    male  22.0   7.2500  Third
    1         1       1  female  38.0  71.2833  First
    2         1       3  female  26.0   7.9250  Third
    3         1       1  female  35.0  53.1000  First
    4         0       3    male  35.0   8.0500  Third

  In this example, we’ll be interested in looking at the average fare paid by class and by whether or not somebody survived. Obviously, you could do this in pandas…
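  A sketch of that pandas summary, run on just the five preview rows shown above. With so few rows, the averages will not match figures computed from the full titanic data set; only the shape of the result (one mean fare per survived/pclass pair) is the point.

```python
import pandas as pd

# The five preview rows from the titanic table above.
df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "class": ["Third", "First", "Third", "First", "Third"],
})

# Average fare for every (survived, pclass) combination present in the data.
dfg = df.groupby(["survived", "pclass"]).agg({"fare": "mean"})
print(dfg)
```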
                         fare
    survived pclass
    0        1      64.684008
             2      19.412328
             3      13.669364
    1        1      95.608029
             2      22.055700
             3      13.694887

  …but what fun is that? This is a post on visualization, so let’s do it in the form of a bar chart!
   Scene 5: How would you create a bar chart?

    MPL (*looking grim*): No comment.
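  For a sense of why MPL has nothing to say, here is one way the manual version can go. It is a sketch under my own assumptions (a made-up stand-in data frame, a bar width of 0.4, my own labels), not necessarily the listing MPL had in mind.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small made-up stand-in for the titanic data frame (not the real data).
df = pd.DataFrame({
    "class": ["First"] * 4 + ["Second"] * 4 + ["Third"] * 4,
    "survived": [0, 0, 1, 1] * 3,
    "fare": [30.0, 26.55, 71.2833, 53.1,
             10.5, 13.0, 26.0, 21.0,
             7.25, 8.05, 7.925, 7.75],
})

# matplotlib needs the bar heights computed up front: rows = class, columns = survived.
means = df.groupby(["class", "survived"])["fare"].mean().unstack("survived")

fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))
width = 0.4
positions = np.arange(len(means.index))

# One ax.bar call per "survived" value, each shifted sideways by hand.
for i, survived in enumerate(means.columns):
    ax.bar(positions + i * width, means[survived].to_numpy(),
           width=width, label=f"survived={survived}")

# Center the tick labels under each pair of bars.
ax.set_xticks(positions + width / 2)
ax.set_xticklabels(means.index)
ax.set(xlabel="Class", ylabel="Mean fare",
       title="Mean Fare by Class and Survival")
ax.legend(loc=2)
```

  All of the positioning, tick placement, and labeling is done by hand, which is the bookkeeping the other libraries take care of for you.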

  *everyone else shakes their head*
    P: I need to do some data manipulation first (namely, a group by and a pivot), but once I do, I have a really cool bar chart method, and it’s much simpler than that mess above! Wow, I’m feeling so much more confident. Who knew all I had to do was put someone else down!? 5
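  A hedged sketch of what P describes: group by, pivot, then call the bar chart method. Here unstack plays the role of the pivot, which is my assumption about the intended step, and the data frame is a small made-up stand-in for titanic.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Small made-up stand-in for the titanic data frame (not the real data).
df = pd.DataFrame({
    "class": ["First"] * 4 + ["Second"] * 4 + ["Third"] * 4,
    "survived": [0, 0, 1, 1] * 3,
    "fare": [30.0, 26.55, 71.2833, 53.1,
             10.5, 13.0, 26.0, 21.0,
             7.25, 8.05, 7.925, 7.75],
})

# Group by, then pivot "survived" into columns so each class gets a pair of bars.
pivoted = df.groupby(["class", "survived"])["fare"].mean().unstack("survived")

fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))
pivoted.plot.bar(ax=ax, rot=0)  # one group of bars per row, one bar per column
ax.set(xlabel="Class", ylabel="Mean fare",
       title="Mean Fare by Class and Survival")
```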

    SB: Again, I happen to think tasks such as this are extremely important. As such, I implement a special function named “factorplot” to help out:
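  A hedged sketch of such a call in a current Seaborn, where factorplot has since been renamed catplot. The data frame is a small made-up stand-in for titanic, and the axis labels are my own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Small made-up stand-in for the titanic data frame (not the real data).
df = pd.DataFrame({
    "class": ["First"] * 4 + ["Second"] * 4 + ["Third"] * 4,
    "survived": [0, 0, 1, 1] * 3,
    "fare": [30.0, 26.55, 71.2833, 53.1,
             10.5, 13.0, 26.0, 21.0,
             7.25, 8.05, 7.925, 7.75],
})

# Pass the raw frame: x and hue are the grouping fields, y is the field to summarize.
# kind="bar" asks for bars; the default summary is the mean.
g = sns.catplot(data=df, x="class", y="fare", hue="survived", kind="bar")
g.set_axis_labels("Class", "Mean fare")
```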

    SB: As ever, you pass in your un-manipulated data frame. Next, you explain what you would like to group by; in this case, it’s “class” and “survived,” so these become our “x” and “hue” arguments. Next, you explain what numeric field you would like summaries for; in this case, it’s “fare,” so this becomes our “y” argument. The default summary statistic is mean, but factorplot possesses a parameter named “estimator,” where you can specify any function you want, e.g., sum, standard deviation, median, etc. The function you choose will determine the height of each bar.
   Of course, there are many ways to visualize this information, only one of which is a bar. As such, I also have a “kind” parameter where you can specify different visualizations.
   Finally, some of us still care about statistical certainty, so by default, I bootstrap you some error bars so you can see if the differences in average fare between classes and survivorship are meaningful.
   (*under her breath*) Would like to see any of you top that…
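  To make the bootstrapping idea concrete, here is a rough NumPy sketch of a percentile bootstrap for one group’s mean. The fares, the 1,000 resamples, and the 95% level are my assumptions for illustration; Seaborn’s own resampling details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up fares for a single group (e.g. one class/survival combination).
fares = np.array([7.25, 7.925, 8.05, 13.0, 10.5, 26.0, 8.6625, 21.075])

# Resample with replacement many times and record the mean of each resample.
n_boot = 1000
boot_means = np.array([
    rng.choice(fares, size=fares.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the resampled means gives a percentile confidence interval,
# which is what an error bar on top of the bar would span.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={fares.mean():.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

  If the intervals for two bars barely overlap, the difference between the groups is more likely to be meaningful than noise.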
  *ggplot2 pulls up in his Lamborghini and walks through the door*
    ggplot2: Hey, have y’all see–
    GG: HEY BRO.
    GG2: Hey, little man. We gotta go.
    GG: Wait, one sec — I gotta make this bar plot real quick, but I’m having a hard time. How would you do it?
    GG2 (*reading instructions*) : Ah, like this:

    GG2: See? You define your aesthetic mappings like we always talk about, but you need to turn your “y” mapping into average fare. To do so, I get my pal “stat_summary_bin” to do that for me by passing in “mean” to his “fun.y” parameter.
    GG (*eyes wide in amazement*): Oh, whoa… I don’t think I have stat_summary yet. I guess — pandas, could you help me out?
    P: Uh, sure.
    GG: Weeeee!

    GG2: Huh, not exactly grammar of graphics-approved, but I guess so long as Hadley doesn’t find out it seems to work fine… In particular, you shouldn’t have to summarize your data in advance of your visualization. I’m also confused by what “weight” means in this context…
    GG: Well, my bar geom seems to default to simple counts, so without a “weight,” all the bars would have had a height of one.
    GG2: Ah, I see… Let’s talk about that later.
  *GG and GG2 say their goodbyes and leave the dinner party*
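  GG’s point about “weight” can be checked in pandas alone. This is a sketch with a small made-up stand-in for titanic, and it only mimics the counting logic; it does not call ggplot.

```python
import pandas as pd

# Small made-up stand-in for the titanic data frame (not the real data).
df = pd.DataFrame({
    "class": ["First"] * 4 + ["Second"] * 4 + ["Third"] * 4,
    "survived": [0, 0, 1, 1] * 3,
    "fare": [30.0, 26.55, 71.2833, 53.1,
             10.5, 13.0, 26.0, 21.0,
             7.25, 8.05, 7.925, 7.75],
})

# Summarize ahead of time: one row per (class, survived) pair holding the mean fare.
dfg = df.groupby(["class", "survived"], as_index=False)["fare"].mean()

# A count-based bar sees exactly one row per group, so every bar would have height 1.
counts = dfg.groupby(["class", "survived"]).size()

# Weighting each row by its fare turns that "count" into the mean fare instead.
weighted = dfg.groupby(["class", "survived"])["fare"].sum()

print(counts)
print(weighted)
```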
    ALT: Ah, now this is my bread-and-butter. It’s really simple.

    ALT: I’m hoping all the arguments are intuitive by this point: I want to plot mean fare by survivorship — faceted by class. This directly translates into “survived” as the x argument; “mean(fare)” as the y argument; and “class” as the column argument. (I specify the color argument for some pizazz.)
   That said, a couple of new things are happening here. Notice how I append “:N” to the “survived” string in the x and color arguments. This is a note to myself which says, “This is a nominal variable.” I need to put this here because survived looks like a quantitative variable, and a quantitative variable would lead to a slightly uglier visualization of this plot. Don’t be alarmed: this has been happening the whole time, just implicitly. For example, in the time series plots above, if I hadn’t known “dt” was a temporal variable, I would have assumed it was a nominal variable, which… would have been awkward (at least until I appended “:T” to clear things up).
  Separately, I invoke my configure_facet protocol to make my three subplots look more unified.
   Analyzing Scene 5

  
       
  • Don’t overthink this one: I’m never making a bar chart in matplotlib again. Conversely, whenever I need summary statistics and error bars, I will always and forever turn to Seaborn.  
  
       
  • (It’s potentially unfair I chose an example that seems tailor-made to one of Seaborn’s functions, but it comes up a lot in my work, and hey, I’m writing the blog post here.)  
  
       
  • I don’t find either the pandas approach or the ggplot approach particularly offensive.  
  
       
  • However, in the pandas case, knowing you must group by and pivot (all in service of a simple bar chart) seems a bit silly.
  
       
  • Similarly, I do think this is the main hole I’ve found in yhat’s ggplot — having a “stat_summary” equivalent would go a long way toward making this thing wonderfully full-featured.  
  
       
  • Meanwhile, Altair continues to impress! I was struck by how intuitive the code was for this example. Even if you’d never seen Altair before, I imagine someone could intuit what was happening. It’s this type of 1:1:1 mapping between thinking, code, and visualization that is my favorite thing about the library.
   Final Thoughts

  You know, sometimes I think it’s important to just be grateful: we have a ton of great visualization options, and I enjoyed digging into all of them!
  (Yes, this is a cop-out.)
  Although I was a bit hard on matplotlib, it was all in good fun: the fine-grained aesthetic control he gives you is essential. I didn’t touch on this, but in almost every non-Altair example, I used matplotlib to customize our final graph.
   Meanwhile, pivoting plus pandas works wonders for time series plots. Given how good pandas’ time series support is more broadly, this is something I’ll continue to leverage. Moreover, the next time I need a RadViz plot, I’ll know where to go.
  If you want to do anything more stats-y, use Seaborn (she really did pick up a ton of cool things when she went abroad). Learn her API (factorplot, regplot, distplot, et al.) and love it. It will be worth the time investment. As for faceting, I find FacetGrid to be a very useful partner in crime; however, if I hadn’t worked with Seaborn for so long, I think I would probably prefer the ggplot or Altair versions.
  I’ve long loved ggplot2, and for the most part came away impressed by how well Python’s ggplot managed to hang in example-for-example. This is a project I will definitely continue to monitor. (More selfishly, I hope it prevents my R-centric coworkers from making fun of me.)
  Finally, if the thing you want to do is implemented in Altair (sorry, boxplot jockeys), it boasts an amazingly simple and pleasant API. Use it! If you need additional motivation, consider the following: one exciting thing about Altair — other than forthcoming improvements to its underlying Vega-Lite grammar — is that it technically isn’t a visualization library. It emits Vega-Lite approved JSON blobs, which — in notebooks — get lovingly rendered by IPython Vega.
  Why is this exciting? Well, under the hood, all of our visualizations looked like this:
   


   Granted, that doesn’t look exciting, but think about the implication: if other libraries were interested, they could also develop ways to turn these Vega-Lite JSON blobs into visualizations. That would mean you could do the basics in Altair and then drop down to matplotlib for more control.
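  For a rough idea of what such a blob contains, here is a hand-written sketch of a Vega-Lite-style spec for the Scene 5 bar chart, built as a plain Python dict. The data values are made up, and the exact JSON that Altair emits will differ in its details.

```python
import json

# Hand-written Vega-Lite-style spec: mean fare by survival, one column per class.
spec = {
    "data": {"values": [
        {"survived": 0, "class": "Third", "fare": 7.25},
        {"survived": 1, "class": "First", "fare": 71.2833},
        {"survived": 1, "class": "Third", "fare": 7.925},
        {"survived": 1, "class": "First", "fare": 53.1},
        {"survived": 0, "class": "Third", "fare": 8.05},
    ]},
    "mark": "bar",
    "encoding": {
        # "nominal" is the spelled-out form of the ":N" suffix.
        "x": {"field": "survived", "type": "nominal"},
        # The aggregate is declared in the spec; the renderer computes the mean.
        "y": {"aggregate": "mean", "field": "fare", "type": "quantitative"},
        "color": {"field": "survived", "type": "nominal"},
        "column": {"field": "class", "type": "nominal"},
    },
}

# The whole chart is just this serializable description.
blob = json.dumps(spec, indent=2)
print(blob)
```

  Because the chart is only data at this point, any renderer that understands the grammar could in principle draw it.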
  I am already salivating about the possibilities.
  All of that said, some parting words: visualization in Python is larger than any single man, woman, or Loch Ness Monster. Thus, you should take everything I said above — code and opinions alike — with a grain of salt. Remember: everything on the internet amounts to lies, damned lies, and statistics.
  I hope you enjoyed this far nerdier version of Mad Hatter’s Tea Party, and that you learned some things you can take to your own work.
   As always, code is available.
   Notes

   1 Strictly speaking, this story isn’t true. I’ve almost always used Seaborn if I could, dropping down to matplotlib when I needed the customizability. That said, I find this premise to be a more compelling set-up, plus we’re living in a post-truth society anyway.
   2 Right off the bat, you’re mad at me, so allow me to explain: I love bokeh and plotly, and indeed, one of my favorite things to do before sending out an analysis is getting “free interactivity” by passing my figures to the relevant bokeh/plotly functions; however, I’m not familiar enough with either to do anything more sophisticated. (And let’s be honest — this post is long enough.)
   3 Please note: this is all in good fun. I am rendering no judgments on any library with my amateur anthropomorphism. I’m sure matplotlib is very charming in real life.
   4 To be frank, I’m not totally sure if faceting is handled separately for ideological purity or if it’s simply a practical concern. While my ggplot character claims it’s the former (his understanding is based on a hasty reading of this paper), it may be that ggplot2 has such rich faceting support that, practically speaking, it needs to happen as a separate step. If my characterization offends any grammar of graphics disciples, please let me know and I’ll find a new bit.
   5 Absolutely not the moral of this story


