Top Libraries for Web Scraper Development



Python

Scrapy

An open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

Scrapy project web site: http://scrapy.org/

Type: Framework
Programming language: Python
License: BSD License
Open source: Yes
First release date: 2008
Last release date: 2015
Current version: 1
Issues count: 221

 

BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

BeautifulSoup project web site: http://www.crummy.com/software/BeautifulSoup/




Type: Library
Programming language: Python
License: BSD License
Open source: Yes
First release date: 2004
Last release date: 2015
Current version: 4.4.1
Issues count: 58
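
As a rough sketch of what parsing with BeautifulSoup can look like, the snippet below pulls link text and targets out of an inline HTML string. The sample markup and the `extract_links` helper are made up for illustration. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend rather than an external parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live page.
SAMPLE_HTML = """
<html><body>
  <h1>Libraries</h1>
  <a href="http://scrapy.org/">Scrapy</a>
  <a href="https://github.com/html5lib/html5lib-python">html5lib</a>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(text, href)
```

In a real scraper the HTML string would normally come from an HTTP client rather than a constant.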

 

mechanize (Python)

Stateful programmatic web browsing in Python, modeled after Andy Lester’s Perl module WWW::Mechanize.

mechanize (Python) project web site: https://github.com/jjlee/mechanize/



Type: Library
Programming language: Python
License: BSD-style License
Open source: Yes
First release date: 2010
Last release date: 2011
Current version: 0.2.5
Issues count: 60

 

Requests (Python)

Python HTTP Requests for Humans

Requests (Python) project web site: https://github.com/kennethreitz/requests/



Programming language: Python
License: Apache 2 License
Open source: Yes
First release date: 2011
Last release date: 2015
Current version: 2.9.1
Issues count: 70
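
To give a feel for the Requests API, here is a small sketch that builds a GET request and inspects it without sending anything over the network. The endpoint, query parameters, and User-Agent string are placeholders chosen for illustration, and the `requests` package is assumed to be installed.

```python
import requests

# Hypothetical endpoint and query parameters, chosen only for illustration.
request = requests.Request(
    "GET",
    "http://example.com/search",
    params={"q": "scraping", "page": 2},
    headers={"User-Agent": "example-scraper/0.1"},
)

# prepare() encodes the parameters into the URL but does not send anything.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
```

A prepared request like this can later be sent with `requests.Session().send(prepared)`. For simple cases, `requests.get(url, params=...)` does the building and sending in one call.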

html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

html5lib project web site: https://github.com/html5lib/html5lib-python

Type: Library
Programming language: Python
License: MIT License
Open source: Yes
First release date: 2013
Last release date: 2015
Current version: 1.0b8
Issues count: 56

 

urllib2

urllib2 is an extensible library for opening URLs.

urllib2 project web site: https://docs.python.org/2/library/urllib2.html


Type: Library
Programming language: Python
License: Python Software Foundation License
Open source: Yes
First release date: 1990
Last release date: 2015
Current version: Stable
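
urllib2 is a Python 2 module. In Python 3 its functionality was split into `urllib.request` and `urllib.error`. The sketch below assumes Python 3 and therefore uses `urllib.request`. The `fetch` helper and the sample `data:` URL are illustrative only; the `data:` URL is there so the snippet runs without network access.

```python
import urllib.request

# A data: URL keeps the sketch runnable offline; swap in an http(s) URL
# to fetch a real page. The payload is the percent-encoded "<h1>Hello</h1>".
SAMPLE_URL = "data:text/html,%3Ch1%3EHello%3C%2Fh1%3E"


def fetch(url):
    """Open the URL and return the response body decoded as UTF-8."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(fetch(SAMPLE_URL))
```

On Python 2 the equivalent call would be `urllib2.urlopen(url)`, which does not support the `with` statement directly.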

 

PHP

Requests (PHP)

Requests for PHP is a humble HTTP request library. It simplifies how you interact with other sites and takes away all your worries.

Requests (PHP) project web site: https://github.com/rmccue/Requests

Type: Library
Programming language: PHP
License: ISC License
Open source: Yes
First release date: 2012
Last release date: 2015
Current version: 1.6.1
Issues count: 29

Buzz

Buzz is a lightweight PHP 5.3 library for issuing HTTP requests.

Buzz project web site: https://github.com/kriswallsmith/Buzz

Type: Library
Programming language: PHP
License: MIT License
Open source: Yes
First release date: 2010
Last release date: 2015
Current version: 0.15
Issues count: 44

Guzzle

Guzzle is a PHP HTTP client that makes it easy to send HTTP requests and integrate with web services.

Guzzle project web site: https://github.com/guzzle/guzzle

Type: Library
Programming language: PHP
License: MIT License
Open source: Yes
Current version: 6.1.1

Goutte

Goutte is a web scraping library. It provides a nice API to crawl websites and extract data from the HTML/XML responses.

Goutte project web site: https://github.com/FriendsOfPHP/Goutte

Type: Library
Programming language: PHP
License: MIT License
Open source: Yes
First release date: 2012
Last release date: 2015
Current version: 3.1.0
Issues count: 40

 

Ruby

data_miner

Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units, and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. It uses the RemoteTable gem internally.

data_miner project web site: https://github.com/seamusabshere/data_miner

Type: Library
Programming language: Ruby
License: MIT License
Open source: Yes
First release date: 2009
Last release date: 2014
Current version: 3.0.0
Issues count: 8

pismo

pismo – Web page content analysis and metadata extraction

pismo project web site: https://github.com/peterc/pismo

Type: Library
Programming language: Ruby
License: MIT License
Open source: Yes
First release date: 2010
Last release date: 2013
Current version: 0.7.4
Issues count: 11

Nokogiri

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support

Nokogiri project web site: https://github.com/sparklemotion/nokogiri

Type: Library
Programming language: Ruby
License: MIT License
Open source: Yes
First release date: 2008
Last release date: 2015
Current version: 1.6.8.rc1
Issues count: 180


