Real-time machine learning with TensorFlow, Kafka, and MemSQL

Storage Architecture, 2017-11-15

TensorFlow has emerged as one of the leading machine learning libraries, and when combined with an operational database, it provides the foundation for quickly building sophisticated machine learning workflows.

In this post, we will explore a machine learning workflow using a speed dating dataset. The overall objective of this demonstration is to compare the machine-suggested matches with those a person might choose directly from looking at different people’s profiles. The dataset comes from a speed dating experiment on Kaggle.

As part of the workflow, we will detail how you can use MemSQL Pipelines to stream data from Kafka in real time into the database. Upon ingesting the data, we will incorporate TensorFlow to train and classify data simultaneously using some of the built-in TensorFlow algorithms. Finally, we’ll see how well the machine determines matches.

This overall architecture provides a template for creating more complex machine learning workflows with new datasets.

A machine learning workflow with TensorFlow

Our architecture consists of training and classification data streamed through Kafka and stored in a persistent, queryable database. In this case we will use MemSQL and take advantage of the Pipelines function to run TensorFlow operations on the stream before persisting it to the database.


On the Kafka side, we set up two Kafka topics, Classification and Training. Raw training and classification data is streamed from these Kafka topics into our MemSQL Pipeline.

On the database side, we create a database called speed_dating_matches, and within that database we create two tables, dating_training and dating_results.

  • dating_training is a single-row table where we place the output of the training evaluation to show training in action
  • dating_results is a table containing all of the data about a potential date as well as whether it was determined that this date is a match
    • isMatch = 1 means the date was a match
    • isMatch = 0 means the date was not a match
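The two tables described above can be sketched as follows. This is a minimal illustration, not the schema used in the demonstration: an in-memory SQLite database stands in for MemSQL so the example runs anywhere, and every column name except isMatch is an assumption made for illustration.

```python
import sqlite3

# SQLite stands in for MemSQL here. All column names other than isMatch
# are hypothetical; the article does not list the real schema.
SCHEMA = """
CREATE TABLE dating_training (
    accuracy REAL,   -- hypothetical: latest training evaluation metric
    loss     REAL    -- hypothetical
);
CREATE TABLE dating_results (
    person_id      INTEGER,  -- hypothetical
    attractiveness REAL,     -- hypothetical
    fun            REAL,     -- hypothetical
    hometown       TEXT,     -- hypothetical
    isMatch        INTEGER CHECK (isMatch IN (0, 1))
);
"""


def create_tables(conn):
    """Create both tables in the given connection."""
    conn.executescript(SCHEMA)


def record_training_eval(conn, accuracy, loss):
    """Keep dating_training a single-row table by replacing the old row."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM dating_training")
        conn.execute(
            "INSERT INTO dating_training (accuracy, loss) VALUES (?, ?)",
            (accuracy, loss),
        )
```

Replacing the row on every evaluation is one simple way to keep the training table at a single row; an upsert on a fixed key would work equally well.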

Next we will create two Pipelines, speed_dating_training and speed_dating_results, which stream in the data from the Kafka topics, train or classify using that data, and place the final result in the corresponding table.
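The classification step that sits between the stream and the table can be sketched as a small transform. This is only an illustration under stated assumptions: it assumes one CSV record per line, which may not match how a MemSQL Pipelines transform actually receives Kafka records, and `stub_classifier` is a hypothetical placeholder for the trained TensorFlow model.

```python
import csv
import io


def stub_classifier(row):
    """Hypothetical stand-in for a trained model: always predicts no match."""
    return 0


def transform(lines, classifier=stub_classifier):
    """Yield one output CSV line per input line, with the prediction appended.

    `lines` is any iterable of CSV text lines (assumed record framing).
    """
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        out = io.StringIO()
        csv.writer(out, lineterminator="\n").writerow(row + [classifier(row)])
        yield out.getvalue()
```

Run as a script, such a transform would typically read from standard input and write to standard output, for example with `sys.stdout.writelines(transform(sys.stdin))`.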

Applying machine learning to predict matches

The speed dating information includes assigning 100 priority points across six traits: attractiveness, intelligence, fun, shared interests, sincerity, and ambition.
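As a small illustration of the 100-point allocation, the sketch below rescales a person's trait priorities so that the six values sum to 100. Whether the dataset needs this kind of cleanup is an assumption; the trait keys and the function name are hypothetical.

```python
# The six traits named in the article; the key spellings are hypothetical.
TRAITS = (
    "attractiveness",
    "intelligence",
    "fun",
    "shared_interests",
    "sincerity",
    "ambition",
)


def normalize_points(points, total=100.0):
    """Rescale a trait -> points mapping so the six traits sum to `total`."""
    missing = [t for t in TRAITS if t not in points]
    if missing:
        raise ValueError("missing traits: " + ", ".join(missing))
    raw_sum = sum(points[t] for t in TRAITS)
    if raw_sum <= 0:
        raise ValueError("trait points must sum to a positive number")
    return {t: points[t] * total / raw_sum for t in TRAITS}
```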

It also includes biographical and interest information on hometown, study interests (data was collected from college students), and hobbies such as movies, yoga, travel, and video games.

The training data is a set of dates whose outcome (match or no match) is already known, and the classification data is the set of dates for which we want to predict the likelihood of a match. With this information, we can look at who matched in the training data, and use our own answers to the questions to see whom we might match with.

From there, we can ask more detailed questions, such as what the average person looks for in terms of dating attributes and interests, and how the average person differs from the people I match with.

We can also query across the entire dataset or query a subset of the dataset that was determined to be a match.
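A query of this kind might look like the sketch below, which averages one attribute over the whole table or only over rows where isMatch = 1. SQLite again stands in for MemSQL, the `fun` column is a hypothetical name, and the three rows are invented purely to make the example runnable.

```python
import sqlite3


def average_fun_score(conn, matches_only=False):
    """Average 'fun' priority over all rows, or only rows with isMatch = 1."""
    sql = "SELECT AVG(fun) FROM dating_results"
    if matches_only:
        sql += " WHERE isMatch = 1"
    return conn.execute(sql).fetchone()[0]


# Invented sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dating_results (fun REAL, isMatch INTEGER)")
conn.executemany(
    "INSERT INTO dating_results VALUES (?, ?)",
    [(10.0, 0), (20.0, 1), (30.0, 1)],
)
```

Comparing `average_fun_score(conn)` with `average_fun_score(conn, matches_only=True)` gives the difference between the average person and the matched subset for that attribute.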

Using built-in TensorFlow models

TensorFlow comes with a number of built-in models to choose from, including:

  • DNNClassifier
  • DNNRegressor
  • DNNLinearCombinedClassifier
  • DNNLinearCombinedRegressor
  • LinearClassifier
  • LinearRegressor

We will choose the linear classifier for the purpose of this demonstration, and base our model inputs on a combination of the following data types.

  • CSV field names. The CSV field names are the names that will be used when reading your CSV into Pandas dataframes.
  • TensorFlow categorical feature columns. Categorical feature columns hold values drawn from a set of categories rather than from a numeric scale. Features like country of residence, occupation, or alma mater are all examples of categorical feature columns. One of the great features of TensorFlow is that you do not need to know how many distinct values you will have for a given category, and it will handle creating sparse vectors for you. See the “Base Categorical Feature Columns” section of the TensorFlow Linear Model Tutorial in the TensorFlow documentation.
  • TensorFlow continuous feature columns. Continuous features are anything that can be represented by a number. Features like age, salary, and maximum running speed are all examples of things that could be represented using a continuous feature column. For more information, see the “Base Continuous Feature Columns” section of the TensorFlow Linear Model Tutorial.
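The difference between the two column types can be illustrated without TensorFlow. The sketch below hashes each categorical value into a fixed number of buckets, so the set of distinct values need not be known in advance, and passes continuous values through as floats. It shows the general idea only; TensorFlow's feature columns use their own hashing and sparse tensor types, and the function and field names here are hypothetical.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a string category to a stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def encode_person(person, categorical, continuous, num_buckets=1000):
    """Split a record into sparse (hashed) and dense (numeric) features.

    Returns (sparse, dense): sparse maps each categorical field name to the
    index of its single non-zero bucket, dense is a list of floats.
    """
    sparse = {
        name: hash_bucket(str(person[name]), num_buckets) for name in categorical
    }
    dense = [float(person[name]) for name in continuous]
    return sparse, dense
```

Storing only the index of the non-zero bucket is what makes the representation sparse: a 1,000-bucket column costs one integer per record rather than 1,000 values.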

Putting training and classification data to work

In this example, people in the speed dating dataset are represented as a vector composed of how they ranked traits, completed biographical info, and listed interests.

Training data is represented as that vector followed by a final value of 0 or 1, indicating no match or match.

Classification data is passed through as the vector alone, and the outcome is a 0 or 1 indicating a predicted match.

The training data is passed through to train the linear classifier model, and the classification data is passed through the trained TensorFlow model to output a 0 or 1 based on the likelihood of a match.
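To make the train-then-classify flow concrete, here is a minimal sketch of a linear classifier written in plain Python. It is a logistic regression trained by stochastic gradient descent, used as a stand-in for TensorFlow's LinearClassifier rather than a reproduction of it. The two numeric features and the four training rows are invented for illustration; each training row ends with its 0 or 1 label.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(rows, epochs=500, lr=0.5):
    """Fit weights and bias; each row is features followed by a 0/1 label."""
    weights = [0.0] * (len(rows[0]) - 1)
    bias = 0.0
    for _ in range(epochs):
        for *features, label in rows:
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            err = p - label  # gradient of the log loss w.r.t. the score
            weights = [w - lr * err * x for w, x in zip(weights, features)]
            bias -= lr * err
    return weights, bias


def classify(model, features):
    """Return 1 for a predicted match, 0 otherwise."""
    weights, bias = model
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


# Invented training rows: two scaled features, then the known outcome.
TRAINING_ROWS = [
    [0.9, 0.8, 1],
    [0.1, 0.2, 0],
    [0.8, 0.9, 1],
    [0.2, 0.1, 0],
]
model = train(TRAINING_ROWS)
```

Calling `classify(model, [0.95, 0.9])` then plays the role of the classification path: an unlabeled vector goes in and a 0 or 1 comes out.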


Predicting love with TensorFlow and MemSQL

With this infrastructure in place we can add our own information into the mix. In this case we can feed dating information for an individual into the classification workflow and predict the likelihood of a match. To assess the validity, one could then look at the matches to see if they are representative of what one might have chosen directly.

The overall architecture provides a number of advantages. It supports simple streaming of new data through Kafka, draws on out-of-the-box TensorFlow models, and persists data in a format that can be easily queried with SQL. Fundamentally, it provides the ability to stream data into MemSQL and classify simultaneously. For more on this, see the TensorFlow documentation on serving a TensorFlow model.

If you would like to see a demonstration of this application in action, feel free to check out this 10-minute video, “Real-time Machine Learning with TensorFlow, Kafka, and MemSQL,” from Strata Data Conference New York 2017.

Gary Orenstein leads marketing strategy, growth, communications, and customer engagement at MemSQL. Prior to MemSQL, Gary was the Chief Marketing Officer at Fusion-io where he led global marketing activities. He holds a bachelor’s degree from Dartmouth College and a master's in business administration from The Wharton School at the University of Pennsylvania.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to .


Originally published by InfoWorld (source link).

