Fixing Python SOAP handling, by using XML and JSONB fields in PostgreSQL

综合技术 2016-04-30

Synchronization of huge amount of data between two systems can sometimes be problematic, especially if one of the systems is accessed through a slow API.

The service we had to integrate with was exposed through SOAP API. The API itself was slow and it had ~4,5 seconds time-lag per 10 records, but there was a bigger problem - Python SOAP libraries/implementations are really, really slow, f.e. +20 sec per request to process them.

So in a case of 500,000 records, it would take approximately one week to get all of the data. So, we had to find a way to boost the synchronization time.

Here are all the details about what we had done to find a suitable solution for this problem.

Attempt 1: Finding a suitable Python library

The first idea as mentioned earlier - was to find a suitable Python library. We found the suds library that allows SOAP to Python objects transformation. Unfortunately, suds library’s last update was from five years ago. After a thorough search, we found a suds-jurko , a fork of the original suds. In theory, module did a query to the server and got the objects. Then we would have to map the received objects to our database. However, the matter of slow synchronization lasted. In practice, the module needed 20 seconds to parse a single API response . Add 4,5 seconds from servers application and you get almost 25 seconds per 10 records. We wanted to faster synchronization, so we abandoned the idea of using suds library.

Attempt 2: Implementing a Java wrapper

The second approach was in some way similar to the first – we could utilise some Java tools and generate whole POJO client. In fact InteliJ IDEA was very helpful with that - it tool 20 minutes to generate simple POC. But it would add another layer of abstraction and technology to the system. So finally we abandoned this approach.

Attempt 3: Dump all XML responses and work with them locally

Finally, we decided to store all API responses (raw XMLs) the way SOAP api delivered them and then think later about import method.

The data dump took roughly 35 hours; for some reason,some of the data had two endpoints on the SOAP server. Finally, we had all raw XML files in our database (PostgreSQL with XML native types). All we had to do is to come up with some kind of mapping and an importer.

The solution

  1. We used Python xmltodict library to transform XML fields into a Python dictionary.

  2. Then, the newly formed Python dictionary would be stored JSONB field in Postgres.

Seems we hit a bullseye. It turns out PostgreSQL is quite good at filtering over nested JSONB fields:

Before index:

syncDB> select count(*) from core_jbjorder where data -> 'client' -> 'delivery_address' ->> 'city' = 'Szczecin';  
| count  |
| 198762 |
Time: 3.279s

After index:

syncDB> create index on jbjorder ((data -> 'client' -> 'delivery_address' ->> 'city'));  
Time: 3.910s  
syncDB> select count(*) from core_jbjorder where data -> 'client' -> 'delivery_address' ->> 'city' = 'Szczecin';  
| count  |
| 198762 |
Time: 0.018s

The use of JSONB allowed us to gradually transfer our data from document-like JSONB field into old, good relational schema.

Lesson learned

Splitting responsibility for data fetching and transformation was a good decision. First was to store all of the raw XMLs to save time before we even came up with import implementation - it is always easier to work with data locally. The second was the application of the JSONB field in PostgreSQL.

So general cue in tackling such problems would be to divide it into smaller problems and attacking them one by one. These way you can focus on simple solutions and engage more team members in creative process of problem solving!

责编内容by:TEONITE (源链)。感谢您的支持!


Encrypting and Decrypting with python and nodejs I'm trying to encrypt some content in Python and decrypt it in a nodejs applicat...
python3:logging模块 输出日志到文件 python自动化测试脚本运行后,想要将日志保存到某个特定文件,使用python的logging模块实现 参考代码: import loggin...
Python机器学习算法面试题,唯一的缺点就是资料太充足,史上最全... P(A∩B)=P(A) P(B|A)=P(B) P(A|B) 所以有:P(A|B)=P(B|A)*P(A)/P(B) 对于给出的待分类项,求解在此项...
Code Phonology If you have been following my posts and talks the last few months, you will hav...
REST API vs SOAP Web Services Management Back in the day Web Services were the de facto standard for accessing "systems ...