
Read and import file with Pandas to MySQL database


Pandas is an open-source data analysis library for Python. I am going to use it to read a large file, around 7 GB in size, extract and filter the data, and save the result to a MySQL database.

Of course, the pandas library alone is not enough: we also need to establish a MySQL connection from Python and then insert the filtered data into the database.

Required packages

Let’s start

First of all, we need to import the required libraries: pandas, mysql-connector, and sqlalchemy.

import pandas as pd
import mysql.connector
from sqlalchemy import types, create_engine

I am not going into detail about create_engine or types here. If you want to know more, please refer to the SQLAlchemy documentation.
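As a quick illustration of what the types import is for: to_sql() accepts a dtype mapping of column names to SQLAlchemy types, which overrides the column types pandas would otherwise pick. The sketch below is hypothetical (the "cities" table and its columns are made up) and uses an in-memory SQLite engine instead of MySQL so it runs without a database server:

```python
import pandas as pd
from sqlalchemy import types, create_engine

# In-memory SQLite stands in for MySQL so this snippet runs anywhere
engine = create_engine("sqlite://", echo=False)

df = pd.DataFrame({"city": ["Berlin", "Oslo"], "population": [3645000, 697010]})

# dtype maps column names to SQLAlchemy types, overriding pandas' defaults
df.to_sql(
    "cities",
    con=engine,
    index=False,
    dtype={"city": types.String(64), "population": types.Integer()},
)

back = pd.read_sql("SELECT * FROM cities", con=engine)
print(len(back))  # 2
```

With a real MySQL engine the dtype mapping works the same way; only the connection URL changes.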

Setup MySQL connection with Python

Now let's establish a database connection so that we can insert the filtered data directly into the database.

# MySQL Connection
MYSQL_USER 		= 'root'
MYSQL_PASSWORD 	= 'root'
MYSQL_HOST_IP 	= '127.0.0.1'
MYSQL_PORT		= 3306
MYSQL_DATABASE	= 'database_name'

# str() is needed because MYSQL_PORT is an integer
engine = create_engine('mysql+mysqlconnector://'+MYSQL_USER+':'+MYSQL_PASSWORD+'@'+MYSQL_HOST_IP+':'+str(MYSQL_PORT)+'/'+MYSQL_DATABASE, echo=False)
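Note that MYSQL_PORT is an integer, so plain + concatenation with the surrounding strings would raise a TypeError unless the port is wrapped in str(). An f-string handles the conversion automatically; here is a minimal sketch using the article's example credentials:

```python
# Example credentials, matching the article's placeholder values
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_HOST_IP = '127.0.0.1'
MYSQL_PORT = 3306  # an int: '+' concatenation with strings would raise TypeError
MYSQL_DATABASE = 'database_name'

# An f-string converts the integer port to text automatically
url = (f"mysql+mysqlconnector://{MYSQL_USER}:{MYSQL_PASSWORD}"
       f"@{MYSQL_HOST_IP}:{MYSQL_PORT}/{MYSQL_DATABASE}")
print(url)  # mysql+mysqlconnector://root:root@127.0.0.1:3306/database_name
```

The resulting URL can be passed straight to create_engine().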

Read large file with Pandas

A slightly challenging part is reading the large file and inserting the data into the database. We can't do it all at once, because loading billions of records in a single pass would take too much time and memory.

To handle this situation, we have to read the file in chunks and insert each chunk as it is read.

pd.read_csv("large_file.txt", usecols=['column_one', 'column_two'], sep=" ")

Let's understand the above code first. I have used the read_csv() method to read the TXT file. Now let's apply the chunksize parameter to read_csv() and insert the data with the to_sql() method.

chunksize = 500

idx = 1
for df in pd.read_csv("large_file.txt", chunksize=chunksize, usecols=['column_one', 'column_two'], sep=" "):

	if idx == 1: 
		exists = 'replace'
	else:
		exists = 'append'


	df.to_sql(name='database_table_name', con=engine, if_exists=exists, index=False, chunksize=chunksize)


	print(str(chunksize * idx) + " Processed")
	idx = idx+1
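The loop above can be exercised end-to-end without a MySQL server. The sketch below substitutes an in-memory SQLite engine for the MySQL one and a small in-memory "file" for large_file.txt (both substitutions are mine, for illustration); the replace/append logic and the chunked to_sql() calls are otherwise the same:

```python
import io
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the MySQL engine so the loop runs anywhere
engine = create_engine("sqlite://", echo=False)

# A small space-separated "file" in memory stands in for large_file.txt
data = io.StringIO(
    "column_one column_two\n" + "\n".join(f"a{i} {i}" for i in range(10))
)

chunksize = 4
sizes = []
idx = 1
for df in pd.read_csv(data, chunksize=chunksize, sep=" "):
    # The first chunk replaces any existing table; later chunks append
    exists = "replace" if idx == 1 else "append"
    df.to_sql(name="demo_table", con=engine, if_exists=exists, index=False)
    sizes.append(len(df))
    idx += 1

print(sizes)  # [4, 4, 2] -- 10 rows read in chunks of 4
total = pd.read_sql("SELECT COUNT(*) AS n FROM demo_table", con=engine)["n"][0]
print(total)  # 10
```

Because if_exists='replace' fires only on the first chunk, re-running the script starts the table fresh instead of duplicating rows.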

Complete Code

Please find the complete code below.

import pandas as pd
import mysql.connector
from sqlalchemy import types, create_engine


# MySQL Connection
MYSQL_USER 		= 'root'
MYSQL_PASSWORD 	= 'root'
MYSQL_HOST_IP 	= '127.0.0.1'
MYSQL_PORT		= 3306
MYSQL_DATABASE	= 'geolocation'

# str() is needed because MYSQL_PORT is an integer
engine = create_engine('mysql+mysqlconnector://'+MYSQL_USER+':'+MYSQL_PASSWORD+'@'+MYSQL_HOST_IP+':'+str(MYSQL_PORT)+'/'+MYSQL_DATABASE, echo=False)


chunksize = 500

idx = 1
for df in pd.read_csv("large_file.txt", chunksize=chunksize, usecols=['column_one', 'column_two'], sep=" "):

	if idx == 1: 
		exists = 'replace'
	else:
		exists = 'append'


	df.to_sql(name='database_table_name', con=engine, if_exists=exists, index=False, chunksize=chunksize)


	print(str(chunksize * idx) + " Processed")
	idx = idx+1

This is a basic approach to handling a large file with the pandas library and inserting the data directly into a MySQL database. Of course, the code could be improved further. Please feel free to add your thoughts in the comment section.
