
Exploratory Data Analysis (EDA) with PySpark on Databricks


bye-bye, Pandas…

Photo by chuttersnap on Unsplash

EDA with Spark means saying bye-bye to Pandas. Because the data is so large, every calculation must be parallelized, so pyspark.sql.functions, rather than Pandas, is the right tool for the job. Changing old data-wrangling habits is certainly a struggle. I hope this post gives you a jump start on performing EDA with Spark.

There are two kinds of variables, continuous and categorical. Each of them has different EDA requirements:

Continuous variables EDA list:

  • missing values
  • summary statistics: mean, min, max, stddev, quantiles
  • binning & distribution
  • correlation

Categorical variables EDA list:

  • missing values
  • frequency table

I will also show how to generate charts on Databricks without any plot libraries like seaborn or matplotlib.
