Knight Center

JOURNALISM IN THE AMERICAS Blog

Five tools to extract "locked" data in PDFs



Journalists and researchers are used to encountering--more often than they would like--"locked" data in Adobe Acrobat PDF files. The format is a nightmare for someone who wants to manipulate or reference large quantities of information because it functions like an image file and is not readable by many computer programs. 

Extracting data from PDFs for open use is not a simple task, as ProPublica reporter Jeremy B. Merrill, one of the contributors to the "Dollars for Docs" project, can attest. The Knight Center for Journalism in the Americas asked programmers and specialists in data journalism, including the former editor of the Guardian Datablog, Simon Rogers, for their recommendations and identified some free tools to facilitate the conversion from PDFs to an open format, like CSV tables.

Remember, no converter is perfect. PDFs can hold scanned pages (which require another kind of conversion, such as OCR), complex tables (with columns or rows that span multiple cells) or tables without ruling lines; in short, distinct patterns that hinder the correct formatting of the converted file.

Rogers recommended always checking whether the conversion changed the structure of the document in ways that could invalidate the information gathered. According to the journalist, the best way to do this is to randomly check the converted data to see if it's different from the original. And don't be fooled: there will almost always be a need to clean up the data when using an automatic conversion, especially for tables.
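The random check described above can be partly scripted. What follows is a minimal Python sketch, not a method Rogers described: it picks a reproducible random sample of rows from a converted CSV so that each one can be compared by eye against the original PDF. The function name sample_rows_for_review and the sample data are hypothetical.

```python
import csv
import io
import random


def sample_rows_for_review(csv_text, k=5, seed=None):
    """Return (header, sample), where sample is a list of (row_number, row).

    row_number counts data rows from 1 (the header is not counted), so a
    reviewer can find the same row in the original PDF table.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [], []
    header, data = rows[0], rows[1:]
    rng = random.Random(seed)   # a fixed seed makes the sample repeatable
    k = min(k, len(data))       # never ask for more rows than exist
    picked = sorted(rng.sample(range(len(data)), k))
    return header, [(i + 1, data[i]) for i in picked]


# Hypothetical converted data, standing in for a real converter's output.
converted = "name,amount\nAlice,100\nBob,250\nCarla,75\nDiego,310\n"
header, sample = sample_rows_for_review(converted, k=2, seed=1)
print(header)
for number, row in sample:
    print(number, row)
```

A sample like this only flags problems in the rows it happens to pick, so it complements rather than replaces checking totals and row counts against the source document.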

1. Cometdocs

In just a few clicks, Cometdocs transforms your PDF into Excel (XLS), ODS, TXT or up to 50 other formats within minutes. Cometdocs does not require a login, but having an account gives you access to other functions, like storage and direct download of the converted file.

Upload the file you want to convert (up to 100MB, a reasonable size), select the format and be sure to include your email. Soon, the converted file will arrive in your inbox. You can also share files anonymously.

2. Zamzar

The interface is just as simple as Cometdocs. Just upload the file and get a new version in your email. But there is one warning: if you convert multi-page PDFs into spreadsheets, the information from each page shows up in a separate table, making the cleanup job even harder.
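If each page does come back as its own table, the pieces can be stitched together again. The following Python sketch is an illustration under stated assumptions, not part of Zamzar: it assumes every page table has already been read into a list of rows (each row a list of strings) and that the header row, when repeated, is identical on every page. The function name merge_page_tables is hypothetical.

```python
def merge_page_tables(pages):
    """Combine per-page tables (lists of rows) into a single table.

    The first row of the first non-empty page is treated as the header.
    Any later page that starts with an identical row has that row dropped.
    """
    merged = []
    header = None
    for page in pages:
        # Skip rows whose cells are all empty or whitespace.
        rows = [row for row in page if any(cell.strip() for cell in row)]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
            rows = rows[1:]
        elif rows[0] == header:
            rows = rows[1:]  # repeated header on a later page
        merged.extend(rows)
    return merged


# Hypothetical three-page result: the header repeats on page 2 but not page 3.
pages = [
    [["name", "amount"], ["Alice", "100"]],
    [["name", "amount"], ["Bob", "250"]],
    [["Carla", "75"]],
]
for row in merge_page_tables(pages):
    print(row)
```

Headers that differ slightly between pages (extra spaces, a changed label) will not be detected by the exact comparison used here, so the merged table still deserves a manual look.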

3. Nitro PDF to Excel

This one is Rogers' tip for converting PDFs into Excel spreadsheets. While it is a paid service, it offers several free options. It works the same way as Zamzar or Cometdocs: just upload the file and the converted version arrives in your inbox in the desired format. The advantage here is that the service specializes in converting to Excel files.

4. PDFtoText

PDFtoText is free, open source, and does a great, quick job working with well-defined tables. However, it doesn't do a great job with documents that have complex layouts and multiple headers. Journalist Jeff Porter of Investigative Reporters and Editors wrote detailed instructions for how to use the application.
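For well-defined tables, the text output can often be turned into CSV with very little code. The Python sketch below is an illustration, not taken from Porter's instructions. It assumes the text was produced with the layout preserved (for example with the command-line pdftotext tool and its -layout option), that columns are separated by at least two spaces, and that no cell is empty. The function name layout_text_to_csv and the sample text are hypothetical.

```python
import csv
import io
import re

# Assumption: two or more whitespace characters mark a column boundary,
# while a single space is part of a cell's value (e.g. "Sao Paulo").
COLUMN_GAP = re.compile(r"\s{2,}")


def layout_text_to_csv(text):
    """Convert layout-preserved text into a CSV string, one row per line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        writer.writerow(COLUMN_GAP.split(line))
    return out.getvalue()


# Hypothetical sample resembling layout-preserved text output.
sample = (
    "Name            City          Amount\n"
    "Ana Souza       Sao Paulo        120\n"
    "\n"
    "John Smith      Austin            85\n"
)
print(layout_text_to_csv(sample))
```

An empty cell would shift the remaining values one column to the left with this approach, which is one reason complex layouts need either fixed column positions or manual cleanup.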

5. Tabula

Tabula, created by a group of journalists and developers at ProPublica and the Knight-Mozilla Fellowship, is a free, open-source application that allows users to upload their files and select the tables from the PDF they want to extract into CSV files. It does a good job even with tables that lack clear definition. The downside to this software is that it requires some programming knowledge, because installation is manual. But its developers promise changes that should simplify its use, so it's worth keeping Tabula on your short list of digital tools for computer-assisted reporting.

*Brazil's Sunshine Law

True, it's not technically a tool, but it's a great way to obtain public documents, especially in countries whose sunshine law requires that information be made public in a "format legible by computers," as in Brazil. Filing access to information requests also helps encourage the public entities that maintain this information to make it available in non-restrictive formats.



2 comments

 
Prasad Gurla wrote 3 years 33 weeks ago

Nice list.

Thank you for the links.

 
Guest wrote 4 years 9 weeks ago

IntelliGet

You forgot to mention IntelliGet and Monarch Pro. These are perhaps the best tools to extract data from PDF
