Knight Center


Five tools to extract "locked" data in PDFs

Journalists and researchers are used to encountering--more often than they would like--"locked" data in Adobe Acrobat PDF files. The format is a nightmare for someone who wants to manipulate or reference large quantities of information because it functions like an image file and is not readable by many computer programs. 

Extracting data from PDFs for open use is not a simple task, as ProPublica reporter Jeremy B. Merrill, one of the contributors to the "Dollars for Docs" project, can attest. The Knight Center for Journalism in the Americas asked programmers and specialists in data journalism, including the ex-editor of the Guardian Datablog, Simon Rogers, for their recommendations and identified some free tools to facilitate the conversion from PDFs to an open format, like CSV tables.

Remember, no converter is perfect. PDFs can hold scanned information (which requires another kind of conversion, like OCR), complex tables (with columns or rows that span multiple cells) or tables without graphic lines; in short, distinct patterns that hinder the correct formatting of the converted file.

Rogers recommended always checking whether the conversion changed the structure of the document in ways that could invalidate the information gathered. According to the journalist, the best way to do this is to randomly check the converted data against the original. And don't be fooled: an automatic conversion will almost always require some data cleanup, especially for tables.
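The random check Rogers describes could be sketched roughly as follows, assuming the converted file is a CSV with a header row. The function name `sample_rows` and the sample data are hypothetical, made up for illustration; the idea is simply to pull a few random rows, with their line numbers, and compare them by eye against the original PDF.

```python
import csv
import io
import random


def sample_rows(csv_text, k=5, seed=None):
    """Return the header and up to k randomly chosen data rows as (line_number, row) pairs."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [], []
    header, data = rows[0], rows[1:]
    rng = random.Random(seed)
    # Number rows from 2 so they match the CSV line (line 1 is the header).
    # This assumes no field contains an embedded line break.
    numbered = list(enumerate(data, start=2))
    picked = rng.sample(numbered, min(k, len(numbered)))
    return header, sorted(picked)


# Hypothetical converted data, standing in for the contents of a real file.
converted = "name,amount\nAlice,100\nBob,250\nCarla,75\nDavi,310\n"
header, picked = sample_rows(converted, k=2, seed=1)
print(header)
for line_number, row in picked:
    print(line_number, row)
```

A handful of rows will not prove the conversion is clean, but a mismatch in even one of them is a strong sign that the whole table needs a closer look.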

1. Cometdocs

In just a few clicks, you can transform your PDF into Excel (XLS), ODS, TXT or up to 50 other formats within minutes. Cometdocs does not require a login, but having an account gives you access to other functions, like storage and direct download of the converted file.

Upload the file you want to convert (up to 100MB, a reasonable size), select the format and be sure to include your email. Soon, the converted file will arrive in your inbox. You can also share files anonymously. Click here to see how.

2. Zamzar

The interface is just as simple as that of Cometdocs. Just upload the file and get a new version in your email. But there is one warning: if you convert multi-page PDFs into spreadsheets, the information from each page shows up in a different table, making the cleanup job even harder.
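One way to handle that cleanup is sketched below, assuming each page's table has already been loaded as a list of rows and that every page repeats the same header row. The function name `merge_page_tables` and the sample pages are hypothetical. The sketch keeps the first header, drops the repeats and blank rows, and stacks the rest into one table.

```python
def merge_page_tables(pages):
    """Combine per-page tables (lists of rows) into one table with a single header."""
    merged = []
    header = None
    for page in pages:
        for row in page:
            if not any(cell.strip() for cell in row):
                continue  # skip fully blank rows
            if header is None:
                header = row  # first non-blank row is taken as the header
                merged.append(row)
            elif row == header:
                continue  # header repeated at the top of a later page
            else:
                merged.append(row)
    return merged


# Hypothetical tables, one per PDF page.
page_1 = [["name", "amount"], ["Alice", "100"], ["Bob", "250"]]
page_2 = [["name", "amount"], ["", ""], ["Carla", "75"]]
for row in merge_page_tables([page_1, page_2]):
    print(row)
```

If the pages do not share a header, or the columns shift from page to page, this approach will not be enough and the tables need to be aligned by hand first.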

3. Nitro PDF to Excel

This one is Rogers' tip for converting PDFs into Excel spreadsheets. While it is a paid service, it offers several free options. It works the same way as Zamzar or Cometdocs: just upload the file and the desired format arrives in your inbox. The advantage here is that the service specializes in converting to Excel files.

4. PDFtoText

PDFtoText is free, open source, and does a great, quick job working with well-defined tables. However, it doesn't do a great job with documents that have complex layouts and multiple headers. Journalist Jeff Porter, of Investigative Reporters and Editors, wrote detailed instructions for how to use the application.
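As a rough sketch of how this can work on a well-defined table, the example below assumes that "PDFtoText" refers to the `pdftotext` command-line utility distributed with Xpdf and Poppler, and that it is installed and on the PATH. Its `-layout` option keeps the physical layout of the page, so columns tend to be separated by runs of spaces. The helper names `pdf_to_layout_text` and `split_columns` are hypothetical, and splitting on two or more spaces is a heuristic, not something the tool itself provides.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext in layout mode and return its text output.

    Assumes the pdftotext binary is installed; "-" sends the output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def split_columns(line):
    """Split one line of layout-preserved text on runs of two or more spaces."""
    return re.split(r"\s{2,}", line.strip())


# A made-up line, shaped like layout-mode output for a three-column table.
print(split_columns("Alice Smith     100    2013"))
```

The heuristic breaks down exactly where the article says the tool does: cells that are empty, wrap onto a second line or sit under multiple headers will not split cleanly.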

5. Tabula

Tabula, created by a group of journalists and developers at ProPublica and the Knight-Mozilla Fellowship, is a free, open source application that allows users to upload their files and select the tables from the PDF they want to extract into CSV files (check out a demo here). It does a good job even with tables that lack clear definition. The downside to this software is that it requires some programming knowledge from the user (installation is manual). But its developers promise changes that should simplify its use, so it's worth keeping Tabula on your short list of digital tools for computer-assisted reporting.

*Brazil's Sunshine Law

True, it's not technically a tool, but it's a great way to obtain public documents, especially in countries whose sunshine law requires that information be made public in a "format legible by computers," like in Brazil. Making access to information requests also helps encourage the public entities that maintain this information to make it available in non-restrictive formats.

