Web Scraping for Journalists

Scraping means getting a computer to harvest information from multiple websites, which lets journalists collect large amounts of data. It is one of the most effective ways for journalists to get to a story first and to find exclusives that nobody else has. It is also a great tool for reporters who know how to code, as more and more public institutions now publish their data on their websites. So what does web scraping for journalists involve?

The Legality

There are, however, questions about what data a journalist can access without breaking the law or appearing to hack. The line here is very thin, and almost all journalists are guided by a code of ethics. Nor is it safe to assume that data is public simply because an institution has published it on its website.

Government servers host private information about citizens, and accessing it would violate privacy laws. What separates scraping from hacking is respect for the law: protected data should not be pried into.

If data isn’t available to the public, then it isn’t available to journalists either. Even in such a cut-throat career, where breaking the story that nobody else has is everything, respect for the law still applies.

Web Scraping Tools for Journalists

A few web scraping tools are particularly well suited to journalists.

Scraper

Scraper is a free Chrome extension. It is easy to use when you need to extract plain data from a web page. After you install the extension in your browser, highlight the data on the page that you want to scrape, right-click, and choose the "Scrape similar" option. A window will pop up with the data on the page that is similar to what you highlighted.

Scraper works best for plain text extraction. It cannot scrape images or complicated objects, and it does not harvest large volumes of text, but it is easy to use and well suited to beginners. The tool uses XPath to determine what information to scrape, so some coding knowledge makes it easier to find your way around.
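To give a feel for what an XPath expression does, here is a minimal Python sketch. It is not how the Scraper extension itself works; it only illustrates the idea of "select every element similar to this one". The sample page, the table id `spending`, and the helper name `extract_column` are all made up for illustration. Python's built-in `xml.etree.ElementTree` supports only a subset of XPath and only well-formed markup, so real web pages usually need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (invented data, well-formed so ElementTree can parse it).
PAGE = """
<html><body>
  <table id="spending">
    <tr><th>Department</th><th>Budget</th></tr>
    <tr><td>Health</td><td>120</td></tr>
    <tr><td>Education</td><td>95</td></tr>
  </table>
</body></html>
"""


def extract_column(page, xpath):
    """Return the text of every element matched by the XPath expression."""
    root = ET.fromstring(page.strip())
    return [element.text for element in root.findall(xpath)]


# "Every first cell of every row in the spending table".
departments = extract_column(PAGE, ".//table[@id='spending']/tr/td[1]")
# "Every second cell of every row in the spending table".
budgets = extract_column(PAGE, ".//table[@id='spending']/tr/td[2]")

print(departments)
print(budgets)
```

The header row is skipped automatically because its cells are `th` elements, which the `td` expressions do not match.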

OutWit Hub

OutWit Hub is another web scraping tool you can get for free. It is a Firefox extension that both beginners and experts can use easily. With it, you can scrape images, documents, and PDFs.

After scraping, the tool presents the results visually, which makes it easier for non-coders to understand the data returned. The extracted data can be exported in different formats, while images and documents are saved to the hard drive.

ScraperWiki

The ScraperWiki platform has been updated recently. It used to let experienced coders run their own code in the browser. More recently, the platform has moved to custom or pre-made tools that work best for beginners.

BeautifulSoup

BeautifulSoup is quite different from the options above because it requires more coding knowledge. Despite this, it is easy to use, and you do not need to write much code to extract data from the web.

BeautifulSoup does not download pages itself. Once another library has fetched the page from a URL, BeautifulSoup parses the HTML with no hassle. If you are looking for a tool that lets you write your own code to extract exactly what you need, this is the tool for you.
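As a rough illustration of how little code this takes, the sketch below pulls every link out of a page with BeautifulSoup. The HTML is an invented sample held in a string so the example runs without a network connection, and the function name `extract_links` is hypothetical. In practice the HTML would come from a page you are permitted to access, downloaded with a separate library such as Python's built-in `urllib.request`.

```python
from bs4 import BeautifulSoup

# Invented sample page; a real script would download this HTML first.
HTML = """
<html><body>
  <h1>Council press releases</h1>
  <ul>
    <li><a href="/releases/budget">Budget published</a></li>
    <li><a href="/releases/road-closures">Road closures announced</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (link text, href) pairs for every anchor that has an href."""
    # "html.parser" is Python's built-in parser, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [
        (anchor.get_text(strip=True), anchor["href"])
        for anchor in soup.find_all("a", href=True)
    ]


links = extract_links(HTML)
for text, href in links:
    print(text, href)
```

From here the pairs can be written to a spreadsheet or filtered for keywords, which is usually where the reporting starts.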

Scrapy

Scrapy is similar to BeautifulSoup in that you write your own code to extract the data you want. However, Scrapy is more robust than BeautifulSoup and can act as a full web scraping framework. Scrapy is a Python package and is installed via pip.

Coding

There are quite a number of skills journalists must master, and coding is one of them. It helps a journalist stay ahead of the pack, and it is an inexpensive way to become more computer savvy. There are loads of free tutorials available online that you can use to learn how to scrape data. All that’s required is self-confidence! With today’s technology, web scraping has become easier for journalists, and almost anybody can learn to do it.