Knight Center
Knight Center


Unraveling data scraping: Understanding how to scrape data can facilitate journalists' work

Ever heard of "data scraping?" The term may seem new, but programmers have been using this technique for quite a while, and now it is attracting the attention of journalists who need to access and organize data for investigative reporting.

Scraping is a way of retrieving data from websites and placing it in a simple and flexible format so it can be cross-analyzed more easily. Many times the information necessary to support a story is available, however it is found in websites that are hard to navigate or in a data base that is hard to use. To automatically collect and display this information, reporters need to turn to computer programs known as "scrapers."

Even though it may seem like a "geek" thing, journalists don't need to take advanced courses in programming or know complicated language in order to scrape data. According to hacker Pedro Markun, who worked on several data scraping projects for the House of Digital Culture in Sao Paulo, the level of knowledge necessary to use this technique is "very basic."

“Scrapers are programs easy to handle. The big challenge and constant exercise is to find a pattern in the web pages' data - some pages are very simple, others are a never-ending headache," said Markun in an interview with the Knight Center for Journalism in the Americas.

Markun has a public profile on the website Scraperwiki, that allows you to create your own data scraper online, or to access those written by others.

Like Scraperwiki, other online tools exist to facilitate data scraping, such as Mozenda, a simple interface software that automates most of the work, and Screen Scraper, a more complex tool that works with several programming languages to extract data from the Web. Another similar useful software is Firebug for Firefox.

Likewise, Google offers the the program Google Refine for manipulating confusing data and converting it into more manageable formats.

Journalists also can download for free Ruby, a simple and efficient programming language, that can be run on Nokogirito do scrapings on documents and websites.

Data is not always available in open formats or easy to scrape. Scanned documents, for example, need to be converted to virtual text. To do this, there is a function that can be found in Tesseract, an OCR (Optic Character Recognizer) tool of Google that "reads" scanned texts and converts them to virtual texts.

Information and guidelines about the use of these tools are available on websites such as Propublica, which offers several articles and tutorials on scraping tools for journalism. YouTube videos also can prove a helpful source.

Even if you have adopted the hacker philosophy, and reading tutorials or working hands-on tends to be your way of learning, you may encounter some doubts or difficulties when using these tools. If this is the case, a good option is to get in contact with more experienced programmers via discussion groups such as Thackday and Scraperwiki Community, which offer both free and paid-for alternatives to find someone to help do a scraping.

While navigating databases might be old school for some journalists, better understanding how to retrieve and organize data has gained in importance as we've entered an age of information overload, making taking advatage of such data-scraping tips all the more worthwhile.


Subscribe to our weekly newsletter "Journalism in the Americas"

Boletim Semanal (Português)
Boletín Semanal (Español)
Weekly Newsletter (English)
Marketing by ActiveCampaign