What do you really need to know to be a data scientist? Super Data Science and Talk Python …

General Technology 2017-04-09

Previously I discussed the Super Data Science podcast and credit modeling in terms of the modeling strategy and models used. The discussion also covered data science in general, and one part of the conversation struck me as well worth discussing in more detail. Later in the podcast, Kirill describes a trap that data scientists, or aspiring data scientists, fall into:

Kirill: "I think there’s a level of acumen that people should have, especially going into data science role. And then if you’re a manager you might take a step back from that. You might not need that much detail…If you’re doing the algorithms, that acumen might be enough. You don’t need to know the nitty-gritty mathematical academic formulas to everything about support vector machines or Kernels and stuff like that to apply it properly and get results. On the other hand, if you find that you do need that stuff you can go and spend some additional time learning. A lot of people fall into the trap. They try to learn everything in a lot of depth, whereas I think the space of data science is so broad you can’t just learn everything in huge depths. It’s better to learn everything to an acceptable level of acumen and then deepen your knowledge in the spaces that you need."

Greg: "if you don’t want to get into that detail, I totally get it. You can be totally fine without it. I have never once in my career had somebody ask me what are the formulas behind the algorithm….there’s a lot of jobs out there for people that don’t know them."

I admit I used to fall into this trap. In fact, this blog is a direct result. Early in my career I had the mindset that if you can't prove it, you can't use it. I didn't feel confident about an algorithm or method until I understood it 'on paper' and could at least code my own version in SAS IML or R. A number of posts here were based on this work and mindset. Then a very well-known and accomplished developer and computational scientist who frequently helped me gave me some good advice: with this mindset you might never get any work done, or only a fraction of it.

There is a lot of discussion on LinkedIn and in the so-called data science community about real or fake data scientists (lots of haters out there). Against that backdrop, in the Talk Python to Me podcast author Joel Grus (of Data Science from Scratch) provides what I think is the most honest discussion of what data science is and what data scientists do:

"there are just as many jobs called data science as there are data scientists"

That is something of a paraphrase, somewhat out of context, and yes, very general. It almost defines a word using the word itself. But it is very, very TRUE. That is because the field is largely undefined. To attempt to define it is futile and, I think, would be the antithesis of data science itself. I will warn, though, that there are plenty of data science haters out there who would quibble with what Greg and Joel have said above.

These are people who want to impose something stricter, some minimum threshold. The common thread is a fear that a poser or fake data scientist will fool a company into hiring them, or will incompetently point and click their way through an analysis without knowing what is going on, all while calling themselves a data scientist. While I understand that concern, it's one extreme. It can easily morph into a straw man argument for a more political agenda at the other extreme: some laundry list of minimal requirements to be a real data scientist (think big data, degrees and the like).

I call them haters because if you hate something you typically want to destroy it, and although it may be unintentional, I think imposing more structure on the field is tantamount to destroying it. From its inception, data science was all about disruption. As the Johns Hopkins applied economics program description puts it:

“Economic analysis is no longer relegated to academicians and a small number of PhD-trained specialists. Instead, economics has become an increasingly ubiquitous as well as rapidly changing line of inquiry that requires people who are skilled in analyzing and interpreting economic data, and then using it to effect decisions … Advances in computing and the greater availability of timely data through the Internet have created an arena which demands skilled statistical analysis, guided by economic reasoning and modeling.”

This parallels data science. Suddenly you no longer need a PhD in statistics, a software engineering background, or an academic's level of acumen to create value-added analysis (although those are all excellent backgrounds for doing some advanced work in data science, no doubt). It's that basic combination of subject matter expertise, some knowledge of statistics and machine learning, and the ability to write code or use software to solve problems. That's it. It's disruptive and the haters hate it. They simultaneously embrace the disruption and want to rein it in and fence out the competition. I hate it for the haters, but you don't need to be able to code your own estimators or train a neural net from scratch to use them. And there is probably as much or more value-creating professional space out there for someone who can clean a data set and provide a set of cross tabs as there is for someone who knows how to set up a Hadoop cluster.

Content by: Econometric Sense (source link).

