Web Scraping In JavaScript With NanoPipe

With the ever-increasing amount of content on the Web, even if you are focused on just a few select sites, web scraping can require a lot of processing power. As a result, it needs to take advantage of parallel or asynchronous tasks. And, by its nature, web scraping generally uses asynchronous processing just to get its inputs via fetch. Add to this the complexity of parsing out the data you want, and things can get gnarly fast as you descend into callback caves, Promise purgatory, or hard-to-debug generator outages. The tiny (less than 450 bytes gzipped) NanoPipe library can help.

NanoPipe lets you create asynchronous chainable functions for use in processing pipelines in three easy steps while minimizing the direct use of Promises, callbacks, or async/await.

  1. Define your functions and declare them to NanoPipe.
  2. Create a pipeline using your functions.
  3. Feed input to your pipeline.

Below we walk you through creating a simple scraping pipeline. When you are done, you will be able to use it like this:

const scraper = NanoPipe()
  .getUrl()
  .toDOM()
  .logTitle()
  .splitAnalysis();
scraper.pipe(["https://www.cnn.com","https://www.msn.com"]);
scraper.pipe(["https://www.foxnews.com"]);

Set-up

For our web scraper we will need NanoPipe, so run:

npm install nano-pipe

We will also need functions to get the content at URLs, convert the content into DOMs, provide us with processing feedback, actually analyze the content, and save the results of the analysis.

There are two great NodeJS libraries that will make our life simpler:

  1. node-fetch
  2. jsdom

Both of these are dev dependencies of NanoPipe itself, because the example built here also lives in NanoPipe's examples directory. For your own project, install them with npm install node-fetch jsdom.
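
If you have not used these libraries before, here is roughly how they fit together on their own (a standalone sketch, not part of the scraper; example.com is just a placeholder URL):

const fetch = require("node-fetch"),
  JSDOM = require("jsdom").JSDOM;
// fetch the page, read it as text, then parse it into a DOM
fetch("https://example.com")
  .then(response => response.text())
  .then(html => console.log(new JSDOM(html).window.document.title));

The pipeline we build below wraps these same calls as small reusable steps.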

For the example we will also mock up two data stores. You could use Redis, Mongo, KVVS, or some other store.

To get started, create a file scrape.js with the following contents:

const NanoPipe = require("nano-pipe"), // use require("../index.js") if running from NanoPipe's examples directory
  fetch = require("node-fetch"),
  JSDOM = require("jsdom").JSDOM;
const db1 = {
    put(key,value) {
      console.log(`Saving ${value} under ${key} in db1`);
    }
  },
  db2 = {
    put(key,value) {
      console.log(`Saving ${value} under ${key} in db2`);
    }
  };

Defining and Declaring Functions

Add a function to get the text at a URL:

async function getUrl() {
  const response = await fetch(this);
  return response.text();
}

Note that getUrl does not take an argument. It gets its value from the pipeline in which it is used, and that value is always bound to this.
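
Because the pipeline value arrives as this, you can also exercise a pipeable function outside any pipeline with Function.prototype.call, which is handy for ad hoc testing (the URL here is just a placeholder):

// `this` inside getUrl becomes whatever value call() binds
getUrl.call("https://example.com")
  .then(text => console.log(`fetched ${text.length} characters`));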

Add a function to convert text into a DOM:

function toDOM() {
  return new JSDOM(this);
}
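
If jsdom is new to you, its JSDOM constructor parses an HTML string into a browser-like DOM. A quick illustration, separate from the scraper:

const dom = new JSDOM("<title>Hello</title><p>Hi</p>");
console.log(dom.window.document.title); // "Hello"
console.log(dom.window.document.body.textContent); // "Hi"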

Add a function we can use to see what is happening. Note that it returns this to keep passing data down the pipeline:

function logTitle() {
  console.log(this.window.document.title);
  return this;
}

Assume we will be processing a lot of URLs with a lot of associated data, and that we want to put info about the heads of pages in one data store and the bodies in another. Create a function that splits the processing into two pipes and saves each analysis. Note how arguments can be passed to a pipeable function, as with the save function defined later:

function splitAnalysis() {
  NanoPipe()
   .analyzeHead()
   .save(db1)
   .pipe([this.window.document.head]);
  NanoPipe()
   .analyzeBody()
   .save(db2)
   .pipe([this.window.document.body]);
}

Define the functions for analyzing the head and the body. For our example we just simulate an asynchronous process by capturing the length in a resolved Promise. NanoPipe automatically handles functions that return Promises.

function analyzeHead() {
  // this is the head element piped by splitAnalysis; the analysis
  // could instead invoke processing on another server or thread
  return Promise.resolve({title:this.ownerDocument.title,
    length:this.innerHTML.length});
}
function analyzeBody() {
  // this is the body element piped by splitAnalysis
  return Promise.resolve({title:this.ownerDocument.title,
    length:this.innerHTML.length});
}
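
Since NanoPipe handles returned Promises, either function could equally be written with async/await; an async function implicitly returns a Promise, so this version of analyzeHead is equivalent:

async function analyzeHead() {
  // an async function's return value is wrapped in a Promise for us
  return {title:this.ownerDocument.title, length:this.innerHTML.length};
}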

Finally, create a function to save the results. The db argument is supplied where the pipeline is defined, while this is the analysis result flowing through the pipe:

function save(db) {
  db.put(this.title,this.length);
}
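
To see the argument passing at work: in a pipeline step like .save(db1), db1 arrives as the db parameter, while each analysis result arrives as this. You can mimic the binding outside a pipeline with call (illustrative only):

save.call({title:"Example", length:42}, db1);
// logs: Saving 42 under Example in db1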

We can now declare all the functions to NanoPipe:

NanoPipe
  .pipeable(getUrl)
  .pipeable(toDOM)
  .pipeable(logTitle)
  .pipeable(splitAnalysis)
  .pipeable(analyzeHead)
  .pipeable(analyzeBody)
  .pipeable(save);

Defining A Pipe

To define a pipe, just call NanoPipe and chain together the calls you want!

const scraper = NanoPipe()
  .getUrl()
  .toDOM()
  .logTitle()
  .splitAnalysis();

Use The Pipe

Now you can use the pipe multiple times. Add the below lines at the end of your file:

scraper.pipe(["https://www.cnn.com","https://www.msn.com"]);
scraper.pipe(["https://www.foxnews.com"]);

Save the file and, using Node.js v9 or v10, run:

node --harmony scrape.js

Note that the results may not be logged in the order the URLs were piped, evidence that asynchronous processing is happening.
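
For example, a run might interleave its output roughly like this, where the bracketed placeholders stand in for real titles and lengths:

[msn.com title]
[cnn.com title]
[foxnews.com title]
Saving [head length] under [msn.com title] in db1
Saving [body length] under [msn.com title] in db2
...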

Closing Comments

If you are wondering how NanoPipe works, watch for a follow-up article on the wonders of async generators.

They are not quite like riding a half-pipe, but I hope you have fun with NanoPipes.

If you liked this article, don’t forget to give it a clap!
