Unraveling data scraping: Understanding how to scrape data can facilitate journalists' work
Ever heard of "data scraping?" The term may seem new, but programmers have been using this technique for quite a while, and now it is attracting the attention of journalists who need to access and organize data for investigative reporting.
Scraping is a way of retrieving data from websites and placing it in a simple, flexible format so it can be cross-analyzed more easily. Often the information needed to support a story is available, but it sits on websites that are hard to navigate or in a database that is hard to use. To collect and display this information automatically, reporters need to turn to computer programs known as "scrapers."
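As a rough illustration of what a scraper does, the Python sketch below pulls the rows out of an HTML table and rewrites them as CSV, a format any spreadsheet can open. The sample page, the names and the figures in it are invented for the example, and a real scraper would first download the page from the website instead of reading it from a string.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Official</th><th>Expense</th><th>Amount</th></tr>
  <tr><td>A. Silva</td><td>Travel</td><td>1200.50</td></tr>
  <tr><td>B. Souza</td><td>Fuel</td><td>310.00</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, if any
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Turn a list of rows into CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

The pattern here is simple: every row is a `tr` tag and every value is a `td` or `th` tag. On messier pages, most of the work goes into finding an equivalent pattern.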
Even though it may seem like a "geek" thing, journalists don't need to take advanced programming courses or learn a complicated programming language in order to scrape data. According to hacker Pedro Markun, who worked on several data scraping projects for the House of Digital Culture in São Paulo, the level of knowledge necessary to use this technique is "very basic."
"Scrapers are programs easy to handle. The big challenge and constant exercise is to find a pattern in the web pages' data - some pages are very simple, others are a never-ending headache," said Markun in an interview with the Knight Center for Journalism in the Americas.
Besides ScraperWiki, other online tools exist to facilitate data scraping, such as Mozenda, software with a simple interface that automates most of the work, and Screen Scraper, a more complex tool that works with several programming languages to extract data from the Web. Another useful tool is Firebug for Firefox.
Likewise, Google offers the program Google Refine for manipulating confusing data and converting it into more manageable formats.
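Google Refine is a point-and-click tool, but the kind of cleanup it is used for can be sketched in a few lines of Python. The example below groups differently spelled versions of the same name under one normalized key, so they can be reviewed and merged. The `fingerprint` and `cluster` helpers and the sample names are made up for illustration; this is not Refine's own code.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key so that spelling variants of a name collide."""
    # Strip accents: decompose each character, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, and split into words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = plain.lower().translate(table).split()
    # Sort and de-duplicate the words so their order does not matter.
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group the original values by their normalized key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


if __name__ == "__main__":
    messy = ["São Paulo", "SAO PAULO", "Paulo, São", "Rio de Janeiro", "Rio De Janeiro."]
    for key, variants in cluster(messy).items():
        print(key, "<-", variants)
```

A key-based grouping like this only catches variants that differ in case, accents, punctuation or word order. Misspellings would need a fuzzier comparison, and a person should still review each group before merging.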
Data is not always available in open formats or easy to scrape. Scanned documents, for example, need to be converted to machine-readable text. This can be done with Tesseract, an open-source OCR (optical character recognition) tool backed by Google that "reads" scanned pages and converts them into digital text.
Information and guidelines about the use of these tools are available on websites such as ProPublica, which offers several articles and tutorials on scraping tools for journalism. YouTube videos can also be a helpful source.
Even if you have adopted the hacker philosophy and prefer to learn by reading tutorials or working hands-on, you may run into doubts or difficulties when using these tools. If so, a good option is to get in contact with more experienced programmers via discussion groups such as Thackday and the ScraperWiki Community, which offer both free and paid options for finding someone to help with a scraping job.
While navigating databases might be old school for some journalists, understanding how to retrieve and organize data has gained importance as we have entered an age of information overload, which makes taking advantage of such data-scraping tips all the more worthwhile.