
Top Libraries for Web Scraper Development


Python

Scrapy

An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.

Scrapy project web site: http://scrapy.org/

Scraping Web Pages with Scrapy
Installing Scrapy On Windows
Type: Framework
First release date: 2008
Issues count: 221
License: BSD License
Programming language: Python
Current version: 1
Last release date: 2015
Open source: Yes


BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

BeautifulSoup project web site: http://www.crummy.com/software/BeautifulSoup/

Scrape Websites with Python + Beautiful Soup 4 + Requests -- Coding with Python
Python BeautifulSoup Web Scrape
Parse HTML Page using BeautifulSoup Python
Parsing With Python's BeautifulSoup
Last release date: 2015
Open source: Yes
Type: Library
First release date: 2004
Issues count: 58
License: MIT License
Programming language: Python
Current version: 4.4.1
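
As an illustration of how parsing with Beautiful Soup usually looks, the sketch below extracts a heading and a list of links from an inline HTML string. The sample markup is invented for this example, and the standard library's `html.parser` backend is chosen only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented sample page (assumption); a real scraper would download this.
sample_html = """
<html><body>
  <h1>Libraries</h1>
  <ul>
    <li><a href="http://scrapy.org/">Scrapy</a></li>
    <li><a href="https://github.com/html5lib/html5lib-python">html5lib</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Attribute-style access returns the first matching tag.
heading = soup.h1.get_text(strip=True)

# find_all returns every <a> tag; indexing a tag reads an attribute.
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```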


mechanize (Python)

Stateful programmatic web browsing in Python, modeled after Andy Lester’s Perl module WWW::Mechanize.

mechanize (Python) project web site: https://github.com/jjlee/mechanize/

https://youtube.com/watch?v=p4dOPXWaeLI

A quick and simple introduction to Mechanize
Python 2 Install Mechanize
First release date: 2010
Issues count: 60
License: BSD-style License
Programming language: Python
Current version: 0.2.5
Last release date: 2011
Open source: Yes
Type: Library


Requests (Python)

Python HTTP Requests for Humans

Requests (Python) project web site: https://github.com/kennethreitz/requests/

Python Requests - 1. Analyzing and Scraping with Python Requests
Python Tutorial - Web Login Using Requests Module
Requests Module in Python
Current version: 2.9.1
Last release date: 2015
Open source: Yes
Programming language: Python
First release date: 2011
Issues count: 70
License: Apache 2 License
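
A short sketch of the Requests API follows. It builds a GET request with query parameters and a custom header, then prepares it without sending, so the example runs offline. The URL and the User-Agent string are placeholders chosen for illustration.

```python
import requests

# Build the request without sending it, so the sketch needs no network.
request = requests.Request(
    "GET",
    "http://example.com/search",  # placeholder URL (assumption)
    params={"q": "web scraping", "page": 2},
    headers={"User-Agent": "example-scraper/0.1"},  # hypothetical agent
)
prepared = request.prepare()

# The query parameters are URL-encoded and appended for us.
print(prepared.method, prepared.url)

# Sending it would look roughly like this (requires network access):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     response.raise_for_status()
#     html = response.text
```

For one-off downloads, `requests.get(url, params=..., timeout=...)` is the shorter form of the same steps.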

html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

html5lib project web site: https://github.com/html5lib/html5lib-python

https://youtube.com/watch?v=dWlhrL1l3QU

Type: Library
First release date: 2013
Issues count: 56
License: MIT License
Programming language: Python
Current version: 1.0b8
Last release date: 2015
Open source: Yes
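
Because html5lib follows the browser parsing rules, it can recover a sensible tree from markup that is not well formed. The sketch below shows this on an invented fragment with unclosed tags; the fragment and the variable names are assumptions for illustration.

```python
import html5lib

# Malformed markup: unclosed <p> and <b> tags, no <html> or <body> wrapper.
broken_html = "<p>first<p>second <b>bold"

# namespaceHTMLElements=False keeps tag names plain ("p" rather than a
# namespaced name), which makes the ElementTree lookups below simpler.
document = html5lib.parse(
    broken_html, treebuilder="etree", namespaceHTMLElements=False
)

# The parser closes the first <p> when the second one starts, as a
# browser would, so two separate paragraphs come back.
paragraphs = ["".join(p.itertext()) for p in document.iter("p")]
print(paragraphs)
```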


urllib2

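urllib2 is an extensible library for opening URLs. It is part of the Python 2 standard library; Python 3 splits it into urllib.request and urllib.error.

urllib2 project web site: https://docs.python.org/2/library/urllib2.html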

Python 2.7 Tutorial Pt 13 Website Scraping
[Python] Extract informations out of a website using urllib2
First release date: 1990
Open source: Yes
Programming language: Python
Current version: Stable
Last release date: 2015
License: Python Software Foundation License
Type: Library
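
urllib2 itself exists only in Python 2, so the sketch below uses its Python 3 successor, `urllib.request`, which offers the same `Request` and `urlopen` calls. The `TitleParser` class and the `fetch_title` helper are hypothetical names introduced for this example, and a `data:` URL stands in for a real page so the code runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def fetch_title(url):
    """Open a URL with a custom User-Agent and return the page title."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# A data: URL replaces a real web page here (assumption for the demo).
page = "<html><head><title>Hello, scraper</title></head><body></body></html>"
title = fetch_title("data:text/html," + quote(page))
print(title)
```

Passing an ordinary `http://` or `https://` address to `fetch_title` would perform a real download with the same code path.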


PHP

Requests (PHP)

Requests for PHP is a humble HTTP request library. It simplifies how you interact with other sites and takes away all your worries.

Requests (PHP) project web site: https://github.com/rmccue/Requests

Type: Library
First release date: 2012
Issues count: 29
License: ISC License
Programming language: PHP
Current version: 1.6.1
Last release date: 2015
Open source: Yes

Buzz

Buzz is a lightweight PHP 5.3 library for issuing HTTP requests.

Buzz project web site: https://github.com/kriswallsmith/Buzz

Type: Library
First release date: 2010
Issues count: 44
License: MIT License
Programming language: PHP
Current version: 0.15
Last release date: 2015
Open source: Yes

Guzzle

Guzzle is a PHP HTTP client that makes it easy to send HTTP requests and integrate with web services.

Guzzle project web site: https://github.com/guzzle/guzzle

Programming language: PHP
Current version: 6.1.1
License: MIT License
Type: Library
Open source: Yes

Goutte

Goutte is a web scraping library. It provides a nice API to crawl websites and extract data from the HTML/XML responses.

Goutte project web site: https://github.com/FriendsOfPHP/Goutte

First release date: 2012
Issues count: 40
License: MIT License
Programming language: PHP
Current version: 3.1.0
Last release date: 2015
Open source: Yes
Type: Library


Ruby

data_miner

Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.

data_miner project web site: https://github.com/seamusabshere/data_miner

Type: Library
First release date: 2009
Issues count: 8
License: MIT License
Programming language: Ruby
Current version: 3.0.0
Last release date: 2014
Open source: Yes

pismo

pismo – Web page content analysis and metadata extraction

pismo project web site: https://github.com/peterc/pismo

Issues count: 11
License: MIT License
Programming language: Ruby
Current version: 0.7.4
Last release date: 2013
Open source: Yes
Type: Library
First release date: 2010

Nokogiri

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support.

Nokogiri project web site: https://github.com/sparklemotion/nokogiri

Last release date: 2015
Open source: Yes
Type: Library
First release date: 2008
Issues count: 180
License: MIT License
Programming language: Ruby
Current version: 1.6.8.rc1