Making a web crawler: how to extract the images from a URL

I want to make a web crawler that extracts the title, description, keywords, and images from any given URL. After extraction I want to save them in a database. My code does not work for images. Any help will be appreciated.

    var $ = cheerio.load(html);
    var title = $('head title').text();
    var keywords = $('head meta[name=keywords]').attr('content');
    var desc = $('head meta[name=description]').attr('content');
    var links = $('a');
    // The line in question: it reads a "content" attribute from the first <img>.
    var img = $('img').attr('content');
    console.log('Crawling "%s" | %s', title, this.url);
    // Build one INSERT statement per usable link and run them through the connection.
    async.map(links.map(function () {
        var href = $(this).attr('href');
        // Skip empty links, self-links, in-page anchors, and links that point at images.
        if (href && href != self._url && !(/^#(\w)+/.test(href)) && !util.imageRegexp.test(href)) {
            if (util.isExternal(href)) {
                return 'INSERT INTO `queue` SET `id` = \'' + util.id() + '\', `url` = ' + self.conn.escape(href) + ', `from` = ' + self.conn.escape(from);
                console.log("self.conn.escape" + self.conn.escape); // never runs: it follows the return
            }
            else {
                return 'INSERT INTO `queue` SET `id` = \'' + util.id() + '\', `url` = ' + self.conn.escape(util.resolveRelativeURL(href, self._url)) + ', `from` = ' + self.conn.escape(from);
            }
        }
        return false;
    }).filter(function (el) {
        return !!el;
    }), this.conn.query.bind(this.conn), function (e, result) {
        if (e) {
            console.log('Error writing queue.');
            console.log(e);
        }
    });
    this.conn.query('INSERT INTO `websites` SET ?', {
        id: util.id(),
        url: this.url,
        from: from,
        title: title,
        keywords: keywords || '',
        img: img || '',
        desc: desc || ''
    }

Problem courtesy of: ana

Solution

If by $('img').attr('content') you want to download the image itself as a file, that won't work, as the image data itself is a separate resource from the HTML, which simply identifies the image's URL. So you'll need to make an HTTP GET request for the image by its src attribute value and save that as a file. Node's core http client library will work, as will npm modules such as request or superagent.

Solution courtesy of: Peter Lyons
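
The approach in the answer can be sketched roughly as follows, using only Node core modules. This is a minimal sketch, not the asker's or answerer's code: the helper names resolveImageUrls and downloadImage are hypothetical, and it assumes the src values have already been collected from the page (with cheerio, something like $('img').map(function () { return $(this).attr('src'); }).get()).

```javascript
const fs = require('fs');
const http = require('http');
const https = require('https');

// Hypothetical helper: resolve raw src attribute values against the page URL,
// dropping empty values, duplicates, and non-HTTP schemes such as data: URIs.
function resolveImageUrls(srcValues, pageUrl) {
  const seen = new Set();
  for (const src of srcValues) {
    if (!src) continue;
    let resolved;
    try {
      resolved = new URL(src, pageUrl);
    } catch (err) {
      continue; // skip values that cannot be parsed as a URL
    }
    if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
      seen.add(resolved.href);
    }
  }
  return Array.from(seen);
}

// Hypothetical helper: fetch one image with an HTTP GET and stream the body
// to destPath. Redirects are not followed in this sketch.
function downloadImage(imageUrl, destPath) {
  return new Promise((resolve, reject) => {
    const client = imageUrl.startsWith('https:') ? https : http;
    client.get(imageUrl, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status ' + res.statusCode + ' for ' + imageUrl));
        return;
      }
      const file = fs.createWriteStream(destPath);
      res.pipe(file);
      file.on('finish', () => resolve(destPath));
      file.on('error', reject);
    }).on('error', reject);
  });
}
```

If only the image locations need to go into the database, the array returned by resolveImageUrls could be stored directly (for example joined into the img column) and the download step skipped. Downloading is only needed when the image files themselves should be kept.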
