Top Libraries for Web Scraper Development


Python

Scrapy

An open source and collaborative framework for extracting the data you need from websites.

Scrapy project web site: http://scrapy.org/

Type Framework
First release date 2008
Issues count 221
License BSD License
Programming language Python
Current version 1
Last release date 2015
Open source Yes

 

BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

BeautifulSoup project web site: http://www.crummy.com/software/BeautifulSoup/

Last release date 2015
Open source Yes
Type Library
First release date 2004
Issues count 58
License BSD License
Programming language Python
Current version 4.4.1
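
A minimal sketch of typical usage, assuming the `bs4` package is installed and using Python's built-in `html.parser` backend. The sample markup and the `extract_links` helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline sample markup so the sketch runs without network access.
HTML = """
<html><head><title>Sample page</title></head>
<body>
  <a href="/docs">Docs</a>
  <a href="https://example.com/blog">Blog</a>
  <a>No link</a>
</body></html>
"""


def extract_links(html):
    """Return (title, [(text, href), ...]) for an HTML string."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib-backed parser
    title = soup.title.string if soup.title else None
    # href=True keeps only anchors that actually carry an href attribute.
    links = [(a.get_text(strip=True), a["href"])
             for a in soup.find_all("a", href=True)]
    return title, links


title, links = extract_links(HTML)
print(title)
print(links)
```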

 

mechanize (Python)

Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize.

mechanize (Python) project web site: https://github.com/jjlee/mechanize/

First release date 2010
Issues count 60
License BSD-style License
Programming language Python
Current version 0.2.5
Last release date 2011
Open source Yes
Type Library

 

Requests (Python)

Python HTTP Requests for Humans

Requests (Python) project web site: https://github.com/kennethreitz/requests/

Current version 2.9.1
Last release date 2015
Open source Yes
Programming language Python
First release date 2011
Issues count 70
License Apache 2 License
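
The sketch below shows one way a scraper might build and send a GET request with Requests. The helper names, the User-Agent string, and the example.com URL are assumptions for illustration; only the request preparation runs here, so no network access is needed.

```python
import requests


def build_search_request(base_url, query, page=1):
    """Prepare (but do not send) a GET request with query parameters."""
    request = requests.Request(
        "GET",
        base_url,
        params={"q": query, "page": page},
        headers={"User-Agent": "example-scraper/0.1"},  # assumed UA string
    )
    return request.prepare()


def fetch(prepared, timeout=10):
    """Send a prepared request and return the response body as text."""
    with requests.Session() as session:
        response = session.send(prepared, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx responses
        return response.text


prepared = build_search_request("https://example.com/search", "scrapy", page=2)
print(prepared.url)
```

For simple cases, `requests.get(url, params=..., timeout=...)` does the same work in one call; preparing the request separately makes the final URL easy to inspect before anything is sent.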

html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.

html5lib project web site: https://github.com/html5lib/html5lib-python

Type Library
First release date 2013
Issues count 56
License MIT License
Programming language Python
Current version 1.0b8
Last release date 2015
Open source Yes
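
Because html5lib follows the browser parsing rules, it can recover a usable tree from markup with missing tags. The sketch below assumes the `html5lib` package is installed; the sample markup and the `paragraphs` helper are invented for illustration.

```python
import html5lib

# Deliberately sloppy markup: no <html>/<body>, and unclosed <p> tags.
BROKEN_HTML = "<title>Demo</title><p>First<p>Second"


def paragraphs(markup):
    """Parse markup the way a browser would and return each <p>'s text."""
    # namespaceHTMLElements=False keeps tag names plain ("p", not "{ns}p").
    root = html5lib.parse(markup, treebuilder="etree",
                          namespaceHTMLElements=False)
    return ["".join(p.itertext()) for p in root.iter("p")]


print(paragraphs(BROKEN_HTML))
```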

 

urllib2

urllib2 is an extensible library for opening URLs.

urllib2 project web site: https://docs.python.org/2/library/urllib2.html

First release date 1990
Open source Yes
Programming language Python
Current version Stable
Last release date 2015
License Python Software Foundation License
Type Library
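
urllib2 is a Python 2 module; in Python 3 its functionality lives in `urllib.request`. The sketch below uses the Python 3 module and is only an illustration: the `fetch_text` helper and the User-Agent string are assumptions, and a `data:` URL stands in for a real page so the example runs offline.

```python
from urllib.request import Request, urlopen


def fetch_text(url, user_agent="example-scraper/0.1", timeout=10):
    """Open a URL with a custom User-Agent and return the decoded body."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the response declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# A data: URL keeps the sketch self-contained; swap in an http(s) URL to scrape.
print(fetch_text("data:text/plain,hello%20world"))
```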

 

PHP

Requests (PHP)

Requests for PHP is a humble HTTP request library. It simplifies how you interact with other sites and takes away all your worries.

Requests (PHP) project web site: https://github.com/rmccue/Requests

Type Library
First release date 2012
Issues count 29
License ISC License
Programming language PHP
Current version 1.6.1
Last release date 2015
Open source Yes

Buzz

Buzz is a lightweight PHP 5.3 library for issuing HTTP requests.

Buzz project web site: https://github.com/kriswallsmith/Buzz

Type Library
First release date 2010
Issues count 44
License MIT License
Programming language PHP
Current version 0.15
Last release date 2015
Open source Yes

Guzzle

Guzzle is a PHP HTTP client that makes it easy to send HTTP requests and integrate with web services.

Guzzle project web site: https://github.com/guzzle/guzzle

Programming language PHP
Current version 6.1.1
License MIT License
Type Library
Open source Yes

Goutte

Goutte is a web scraping library. It provides a nice API to crawl websites and extract data from the HTML/XML responses.

Goutte project web site: https://github.com/FriendsOfPHP/Goutte

First release date 2012
Issues count 40
License MIT License
Programming language PHP
Current version 3.1.0
Last release date 2015
Open source Yes
Type Library

 

Ruby

data_miner

Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.

data_miner project web site: https://github.com/seamusabshere/data_miner

Type Library
First release date 2009
Issues count 8
License MIT License
Programming language Ruby
Current version 3.0.0
Last release date 2014
Open source Yes

pismo

pismo – Web page content analysis and metadata extraction

pismo project web site: https://github.com/peterc/pismo

Issues count 11
License MIT License
Programming language Ruby
Current version 0.7.4
Last release date 2013
Open source Yes
Type Library
First release date 2010

Nokogiri

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support.

Nokogiri project web site: https://github.com/sparklemotion/nokogiri

Last release date 2015
Open source Yes
Type Library
First release date 2008
Issues count 180
License MIT License
Programming language Ruby
Current version 1.6.8.rc1

 

.NET

Fizzler

A .NET library to select items from a node tree based on a CSS selector. The default implementation is based on HTMLAgilityPack and selects from HTML documents. There are over 140 unit tests, which are based on the jQuery selector engine tests.

Fizzler project web site: https://code.google.com/p/fizzler/

Last release date 2009
License GNU Lesser GPL
Type Library
First release date 2009
Open source Yes
Programming language .NET
Current version 0.9

AngleSharp

AngleSharp is a web scraping library for C#/.NET. It supports extracting data from HTML5, MathML, SVG, and CSS, and exposes the result as a DOM.

AngleSharp project web site: https://github.com/AngleSharp/AngleSharp

Example source code: https://github.com/mydataprovider/.NET-AngleSharp

Type Library
First release date 2013
Issues count 28
License MIT License
Programming language .NET
Current version 0.9.0
Last release date 2015
Open source Yes

Html Agility Pack

HtmlAgilityPack is currently a leading library for web scraping in .NET/C#.

It is a .NET HTML parser that builds a read/write DOM and supports XPath. It is a .NET code library that allows you to parse HTML files.

Html Agility Pack project web site: https://htmlagilitypack.codeplex.com/

Our company uses Html Agility Pack for web scraping and price monitoring.

Example project: https://github.com/mydataprovider/.NET-HtmlAgilityPack-web-scraping

Issues count 245
License Microsoft Public License (Ms-PL)
Programming language .NET
Current version 1.4.6
Last release date 2012
Open source Yes
Type Library
First release date 2006

 

CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

CsQuery project web site: https://github.com/jamietre/CsQuery

Note that CsQuery is no longer actively maintained, so we do not recommend using it for web scraping.

If you want to try the library, get the DLL via NuGet: compiling the library from the GitHub sources produces compilation errors.

Issues count 65
License MIT License
Programming language .NET
Current version Stable
Last release date 2015
Open source Yes
Type Library
First release date 2011