
My Personal Holy Trinity for Machine Learning Reproducibility



Short and direct:

MLflow

Why do I use it?

(a.k.a. what was my pain?)

One of the most painful situations I faced was spending a huge amount of time writing code for hyperparameter search and tracking the whole experimental setup. With MLflow, the only thing I need to invest time in now is pre-processing the data and choosing the algorithm to train; model serialization, data serialization, and packaging are all handled by MLflow. A great advantage is that the best model can easily be deployed behind a REST API, instead of through a customized Flask script.
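As a rough sketch of what that packaging looks like (the project name, entry point, and hyperparameter below are illustrative, not from this post), an MLproject file declares the environment and the tunable parameters in one place:

```yaml
# MLproject -- hypothetical example of MLflow's project packaging format
name: churn_experiment
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"
```

`mlflow run .` then executes the entry point and records the run, and a logged model can be exposed as a REST endpoint with `mlflow models serve -m <model-uri>`, which is what replaces the hand-rolled Flask script.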

Caveats: I really love Databricks, but I think they sometimes move so fast in their development that it can cause problems, especially if you rely on a very stable version and a sudden migration (e.g. RDD to DataFrame) costs you a lot of work because you have to rewrite things.


Pachyderm

Why do I use it?

(a.k.a. what was my pain?)

Data pre-processing can sometimes be very annoying, and there are a lot of new tools that overpromise to solve it but are, in reality, just over-engineered stuff with good marketing (see this classic talk by Daniel Molnar to understand what I’m talking about (minute 15:48)).

My main wish over the last 5 years has been to package all my dirty SQL scripts in a single place, execute them with decent version control using Kubernetes and Docker, and throw every ETL built in Jenkins in the trash (a.k.a. embrace the dirty, cold, and complex reality of ETL). Nothing less, nothing more.

So, with Pachyderm I can do that.
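As an illustration (the repo names, image, and script path below are hypothetical), a Pachyderm pipeline spec wraps one of those SQL scripts in a Docker container, with versioned input and output repos:

```json
{
  "pipeline": { "name": "clean-orders" },
  "input": { "pfs": { "repo": "raw-orders", "glob": "/*" } },
  "transform": {
    "image": "my-registry/etl-image:1.0",
    "cmd": ["sh", "-c", "psql -f /pfs/raw-orders/clean.sql > /pfs/out/orders.csv"]
  }
}
```

Created with `pachctl create pipeline -f pipeline.json`, the pipeline re-runs on every commit to the input repo, so each output is tied to an exact version of the data and the script.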

Caveats: it’s necessary to say that you’ll need to know Docker and embrace all the problems that come with it, and the bug list can be a little frightening.


DVC

Why do I use it?

(a.k.a. what was my pain?)

MLflow can serialize data and models, but DVC takes this reproducibility to another level. With fewer than 15 git-like bash commands you can version your data, code, and models. You can put the entire ML pipeline in a single place and roll back to any point in time. In terms of reproducibility, I think this is the best all-round tool.
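A minimal sketch of that workflow (the file and stage names are illustrative, and the `dvc run` syntax assumes a DVC 1.x-era CLI; newer releases spell it `dvc stage add` followed by `dvc repro`):

```bash
git init && dvc init
dvc add data.csv                       # put the raw data under DVC version control
git add data.csv.dvc .gitignore
dvc run -n train -d train.py -d data.csv \
        -o model.pkl python train.py   # record the training stage and its outputs
git add dvc.yaml dvc.lock
git commit -m "first experiment"

# roll back code, data, and model together to any point in time:
git checkout <old-revision> && dvc checkout
```

Because the `.dvc` and `dvc.lock` files live in git while the heavy artifacts live in DVC’s cache, checking out an old commit plus `dvc checkout` restores the whole pipeline state.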

Caveats: in comparison with MLflow, navigating the experiments here is a little trickier and demands some time to get used to.

