Automating bulk data imports

2015-01-15

Parse is a great backend-as-a-service (BaaS) product. It removes much of the hassle involved in backend devops with its web hosting service, SDKs for all the major mobile platforms, and a generous free tier. Parse does have its share of flaws, including various reliability issues (which seem to be getting rarer), and limitations on what you can do (which is a reasonable price to pay for working within a sandboxed environment). One such limitation is the lack of APIs to perform bulk data imports. This post introduces my workaround for this limitation (tl;dr: it’s a PhantomJS script).

I use Parse for two of my projects: BCRecommender and Price Dingo. In both cases, some of the data is generated outside Parse by a Python backend. Doing all the data processing within Parse is not a viable option, so a solution for importing this data into Parse is required.

My original solution for data import was using the Parse REST API via ParsePy. The problem with this solution is that Parse billing is done on a requests/second basis. The free tier includes 30 requests/second, so importing BCRecommender’s roughly one million objects takes about nine hours when operating at maximum capacity. However, operating at maximum capacity causes other client requests to be dropped (i.e., real users suffer). Hence, some sort of rate limiting is required, which makes the sync process take even longer.
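The nine-hour figure follows directly from the rate limit. Here is a minimal sketch of the arithmetic, assuming one object per request. The function name and the `reserved_fraction` parameter (the share of capacity left for real users) are illustrative assumptions, not part of any Parse API:

```python
def import_hours(num_objects, requests_per_second, reserved_fraction=0.0):
    """Estimate the hours needed to import num_objects, one object per request.

    reserved_fraction is a hypothetical knob: the share of the request
    budget left untouched so that real users are not throttled.
    """
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0, 1)")
    usable_rate = requests_per_second * (1.0 - reserved_fraction)
    return num_objects / usable_rate / 3600.0


# One million objects at the free tier's 30 requests/second.
full_speed = import_hours(1_000_000, 30)
# The same import when half of the budget is left for real users.
half_speed = import_hours(1_000_000, 30, reserved_fraction=0.5)
print(f"{full_speed:.1f} hours at full capacity, {half_speed:.1f} hours at half")
# prints "9.3 hours at full capacity, 18.5 hours at half"
```

Any realistic rate limit therefore pushes a full sync well beyond nine hours.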

I thought that using batch requests would speed up the process, but it actually slowed it down! This is because batch requests are billed according to the number of sub-requests, so making even one successful batch request per second with the maximum number of sub-requests (50) causes more requests to be dropped. I implemented some code to retry failed requests, but the whole process was just too brittle.
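To make the batch accounting concrete, here is a small sketch using only the numbers quoted above (a limit of 30 requests/second and at most 50 sub-requests per batch). The function and constant names are illustrative:

```python
FREE_TIER_LIMIT = 30    # requests/second on the free tier
MAX_SUB_REQUESTS = 50   # maximum sub-requests in one batch request


def billed_requests(num_batches, sub_requests_per_batch):
    """A batch is billed as the sum of its sub-requests."""
    return num_batches * sub_requests_per_batch


def max_batches_per_second(rate_limit, sub_requests_per_batch):
    """Highest batch rate that stays within the per-second budget."""
    return rate_limit / sub_requests_per_batch


one_full_batch = billed_requests(1, MAX_SUB_REQUESTS)
over_budget = one_full_batch > FREE_TIER_LIMIT
sustainable = max_batches_per_second(FREE_TIER_LIMIT, MAX_SUB_REQUESTS)

print(one_full_batch, over_budget)     # prints "50 True"
print(sustainable * MAX_SUB_REQUESTS)  # objects/second: still 30.0
```

A single full batch already exceeds one second's budget, and staying within the limit yields the same 30 objects per second as unbatched requests, so batching gives no throughput gain.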

A few months ago I discovered that Parse supports bulk data import via the web interface (with no API support). This feature comes with the caveat that existing collections can’t be updated: a new collection must be created. This is actually a good thing, as it essentially makes the collections immutable. And immutability makes many things easier.

BCRecommender data gets updated once a month, so I was happy with manually importing the data via the web interface. As a price comparison engine, Price Dingo’s data changes more frequently, so manual updates are out of the question. For Price Dingo to be hosted on Parse, I had to find a way to automate bulk imports. Some people suggest emulating the requests made by the web interface, but this requires relying on hardcoded cookie and CSRF token data, which may change at any time. A more robust solution would be to script the manual actions, but how? PhantomJS, that’s how.

I ended up implementing a PhantomJS script that logs in as the user and uploads a dump to a given collection. This script is available on GitHub Gist. To run it, simply install PhantomJS and run:

$ phantomjs --ssl-protocol any import-parse-class.js   

See the script’s source for a detailed explanation of the command-line arguments.

It is worth noting that the script doesn’t do any post-upload verification on the collection. This is done by an extra bit of Python code that verifies that the collection has the expected number of objects, and tries to query the collection sorted by all the keys that are supposed to be indexed (for large collections, it takes Parse a while to index all the fields, which may result in timeouts). Once these conditions are fulfilled, the Parse hosting code is updated to point to the new collection. For security, I added a bot user that has access only to the Parse app that it needs to update. Unlike the root user, this bot user can’t delete the app. As the config file contains the bot’s password, it should be encrypted and stored in a safe place (like the Parse master key).
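The verification step could look roughly like the sketch below. This is not the actual verification code, only one way to structure the checks described above. `count_objects` and `query_sorted_by` are hypothetical callables that the caller would implement on top of whatever Parse client is in use, and the sketch assumes they raise `TimeoutError` while the indexes are still being built:

```python
import time


def wait_until_ready(count_objects, query_sorted_by, expected_count,
                     indexed_keys, max_attempts=10, delay_seconds=60.0,
                     sleep=time.sleep):
    """Return True once the collection has the expected number of objects
    and a sorted query succeeds for every indexed key.

    Returns False if the checks still fail after max_attempts tries.
    count_objects() and query_sorted_by(key) are hypothetical callables
    supplied by the caller; they are assumed to raise TimeoutError when
    Parse has not finished indexing.
    """
    for attempt in range(max_attempts):
        try:
            if count_objects() == expected_count:
                for key in indexed_keys:
                    query_sorted_by(key)
                return True
        except TimeoutError:
            pass  # Indexing is probably still in progress; try again later.
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return False


# Example with stand-in callables: sorting by "price" times out once.
calls = {"price": 0}


def fake_count():
    return 3


def fake_query(key):
    if key == "price":
        calls["price"] += 1
        if calls["price"] == 1:
            raise TimeoutError("index not ready")


ready = wait_until_ready(fake_count, fake_query, expected_count=3,
                         indexed_keys=["name", "price"],
                         sleep=lambda seconds: None)
print(ready)  # prints "True" (the second attempt succeeds)
```

Only when this returns `True` would the hosting code be switched to the new collection. Passing `sleep` as a parameter keeps the function easy to test without real delays.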

That’s it! I hope that other people will find this solution useful. Any suggestions/comments/issues are very welcome.

Image source: Parse Blog

Original author: Yanir Seroussi.

