Introduction to Prinicipal Component Analysis: Dimensionality Reduction Made Easy

综合技术 2018-12-08 阅读原文

BigML’s upcoming release on Thursday, December 20, 2018 , will be presenting our latest resource to the platform: Principal Component Analysis (PCA). In this post, we’ll do a quick introduction to PCA before we move on to the remainder of our  series of 6 blog posts (including this one) to give you a detailed perspective of what’s behind the new capabilities. Today’s post explains the basic concepts that will be followed by an  example use case . Then, there will be three more blog posts focused on how to use PCA through the  BigML DashboardAPI , and  WhizzML for automation . Finally, we will complete this series of posts with a technical view of  how PCAs work behind the scenes .

Understanding Principal Component Analysis

Many datasets in fields as varied as bioinformatics, quantitative finance, portfolio analysis or signal processing can contain an extremely large number of variables, that may be highly correlated, resulting in sub-optimal Machine Learning performance. Principal component analysis (PCA) is one technique that can be used to transform such a dataset in order to obtain uncorrelated features or as a first step in dimensionality reduction

Because PCA transforms the variables in a dataset without accounting for a target variable, it can be considered an unsupervised Machine Learning method suitable for exploratory data analysis of complex datasets. However, when used towards dimensionality reduction, it also helps reduce supervised model overfitting, as there remain fewer relationships to consider between variables after the process. To do this, the principal components yielded by a PCA transformation are typically ordered by the amount of variance each explains in the original dataset. The practitioner can decide how many of the new component features can be eliminated from a dataset while preserving most of the original information contained in it.

Even though they are all grouped under the same umbrella term (PCA), under the hood, BigML’s implementation incorporates multiple factor analysis techniques, rather than only the standard PCA implementation. Specifically,

  • Principal Component Analysis (PCA)BigML utilizes this option if the input dataset contains only numerical data.
  • Multiple Correspondence Analysis (MCA) : t his option is available if the input dataset contains only categorical data.
  • Factorial Analysis of Mixed Data (FAMD)in case the input dataset contains both numeric and categorical fields this option is also available.

In the case of items and text fields, data is processed using a bag-of-words approach allowing PCA to be applied. Because of this nuanced approach, BigML can handle categorical, text, and items fields in addition to numerical data in an automatic fashion that does not require manual intervention by the end user.

Want to know more about PCA?

If you would like to learn more about Principal Component Analysis and see it in action on the BigML Dashboard, please reserve your spot for our upcoming release webinar on Thursday, December 20, 2018 . Attendance is FREE of charge, but space is limited so register soon!

责编内容 【阅读原文】。感谢您的支持!


UFLDL教程笔记(一) UFLDL教程 作为深度学习入门非常好的一个教程,为了防止自己看过忘过,所以写下笔记,笔记只记录了自己对于每个知识点的直观理解,某些地方加了一些使自己更容易理解的例子,想要深入了解建议自己刷一遍此教程。 一、稀疏自编码器 神经网...
【火炉炼AI】机器学习053-数据降维绝招-PCA和核PCA... (本文所使用的Python库和版本号: Python 3.6, Numpy 1.14, scikit-learn 0.19, matplotlib 2.2 ) 主成分分析(Principal Component Analysis, PC...
机器学习实战之主成分分析(PCA) 如果人类适应了三维,去掉一个维度,进入了二维世界,那么人类就会因为缺少了原来所适应的一个维度,而无法生存。 ——《三体》 在许多科幻电影中,有许多降维的例子。在《十万个冷笑话2》(可能只有萌新看过)中,大bos...
PCA revisited: using principal components for clas... This is a short post following the previous one (PCA revisited). In this post I’m going to apply PCA to a toy probl...
特征工程全过程 1 特征工程是什么? 有这么一句话在业界广泛流传:数据和特征决定了机器学习的上限,而模型和算法只是逼近这个上限而已。那特征工程到底是什么呢?顾名思义,其本质是一项工程活动,目的是最大限度地从原始数据中提取特征以供算法和模型使用。通过...