
Top Libraries for Web Scraper Development




Scrapy

An open source and collaborative framework for extracting the data you need from websites.

Scrapy project web site:

Type Framework
First release date 2008
Issues count 221
License BSD License
Programming language Python
Current version 1
Last release date 2015
Open source Yes



BeautifulSoup

A Python library for pulling data out of HTML and XML files.

BeautifulSoup project web site:

Last release date 2015
Open source Yes
Type Library
First release date 2004
Issues count 58
License BSD License
Programming language Python
Current version 4.4.1
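
To make the description concrete, here is a minimal sketch of extracting links with BeautifulSoup. It assumes the `bs4` package is installed; the HTML snippet, the `item` class name, and the `extract_links` helper are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
HTML = """
<html><body>
  <h1>Product list</h1>
  <ul>
    <li class="item"><a href="/p/1">Widget</a></li>
    <li class="item"><a href="/p/2">Gadget</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every link inside an <li class="item">."""
    # "html.parser" is the parser from the standard library, so no
    # additional parser package is needed for this sketch.
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"]) for a in soup.select("li.item a")]


links = extract_links(HTML)
print(links)
```

In a real scraper the HTML would come from an HTTP client rather than a string literal, and the CSS selector would be adapted to the target page.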


mechanize (Python)

Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize.

mechanize (Python) project web site:

First release date 2010
Issues count 60
License BSD-style License
Programming language Python
Current version 0.2.5
Last release date 2011
Open source Yes
Type Library


Requests (Python)

Python HTTP Requests for Humans

Requests (Python) project web site:

Current version 2.9.1
Last release date 2015
Open source Yes
Programming language Python
First release date 2011
Issues count 70
License Apache 2 License
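
As an illustration of the "for Humans" API, the sketch below builds a GET request with query parameters and a custom header, without sending it. It assumes the `requests` package is installed; the URL, the User-Agent string, and the `build_request` helper are made up for the example.

```python
import requests


def build_request(url, query):
    """Prepare (but do not send) a GET request with query parameters."""
    req = requests.Request(
        "GET",
        url,
        params=query,  # encoded into the query string by Requests
        headers={"User-Agent": "example-scraper/0.1"},  # hypothetical agent name
    )
    return req.prepare()


prepared = build_request("http://example.com/search", {"q": "web scraping", "page": 2})
print(prepared.method, prepared.url)
```

To actually perform the request, the prepared object can be passed to `requests.Session().send(prepared, timeout=10)`. For simple cases `requests.get(url, params=..., timeout=10)` does the same in one call.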


html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

html5lib project web site:

Type Library
First release date 2013
Issues count 56
License MIT License
Programming language Python
Current version 1.0b8
Last release date 2015
Open source Yes
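
The practical benefit of following the browser parsing rules is that malformed markup gets repaired the same way a browser would repair it. A small sketch, assuming the `html5lib` package is installed; the broken input string is invented for the example.

```python
import html5lib

# Deliberately malformed input: no <html>/<body>, unclosed <p> tags.
BROKEN = "<p>one<p>two"

# With the default etree tree builder, parse() returns the root <html>
# element. namespaceHTMLElements=False keeps tag names plain ("p"
# instead of "{http://www.w3.org/1999/xhtml}p").
doc = html5lib.parse(BROKEN, namespaceHTMLElements=False)

# The parser closes the first <p> implicitly, as a browser would.
paragraphs = [p.text for p in doc.iter("p")]
print(doc.tag, paragraphs)
```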



urllib2 is an extensible library for opening URLs.

urllib2 project web site:

First release date 1990
Open source Yes
Programming language Python
Current version Stable
Last release date 2015
License Python Software Foundation License
Type Library
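
urllib2 is a Python 2 module; in Python 3 its functionality lives in `urllib.request` and `urllib.error`. The sketch below uses the Python 3 names and only constructs a request object, so it runs without network access. The URL, the User-Agent string, and the `make_request` helper are assumptions made for the example.

```python
from urllib.request import Request


def make_request(url, user_agent):
    """Build a request object carrying a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = make_request("http://example.com/", "example-scraper/0.1")

# Header names are stored capitalized, hence "User-agent" on lookup.
print(req.get_method(), req.full_url, req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen(req, timeout=10)` would perform the actual fetch and return a response object whose `read()` method yields the body as bytes.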



Requests (PHP)

Requests for PHP is a humble HTTP request library. It simplifies how you interact with other sites and takes away all your worries.

Requests (PHP) project web site:

Type Library
First release date 2012
Issues count 29
License ISC License
Programming language PHP
Current version 1.6.1
Last release date 2015
Open source Yes


Buzz is a lightweight PHP 5.3 library for issuing HTTP requests.

Buzz project web site:

Type Library
First release date 2010
Issues count 44
License MIT License
Programming language PHP
Current version 0.15
Last release date 2015
Open source Yes


Guzzle is a PHP HTTP client that makes it easy to send HTTP requests.

guzzle project web site:

Programming language PHP
Current version 6.1.1
License MIT License
Type Library
Open source Yes


Goutte is a web scraping library. It provides a nice API to crawl websites and extract data from the HTML/XML responses.

Goutte project web site:

First release date 2012
Issues count 40
License MIT License
Programming language PHP
Current version 3.1.0
Last release date 2015
Open source Yes
Type Library




data_miner downloads, unpacks from a ZIP/TAR/GZ/BZ2 archive, parses, corrects, converts units and imports Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. It uses the RemoteTable gem internally.

data_miner project web site:

Type Library
First release date 2009
Issues count 8
License MIT License
Programming language Ruby
Current version 3.0.0
Last release date 2014
Open source Yes


pismo – Web page content analysis and metadata extraction

pismo project web site:

Issues count 11
License MIT License
Programming language Ruby
Current version 0.7.4
Last release date 2013
Open source Yes
Type Library
First release date 2010


Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support.

Nokogiri project web site:

Last release date 2015
Open source Yes
Type Library
First release date 2008
Issues count 180
License MIT License
Programming language Ruby
Current version 1.6.8.rc1