16 January 2015

Free data journalism tools - data extracting and cleaning

Extracting data from PDF

There are plenty of people who strongly believe that the best place for their data is a PDF: statistical offices, international organizations, everyone does it. There is only one reason behind it: pleasure. Yes, they find it pleasurable to imagine people trying to extract their data from PDF.

OK, the reality is a bit different: the format was created to present documents in a manner independent of applications and operating systems.


Tabula is an open source tool for freeing data tables locked inside PDF files. It lets you extract such data into a CSV file or Excel spreadsheet.

It's worth remembering that PDFs are tricky beasts, so no tool works in 100% of cases.

Extracting data from websites


Import.io allows you to structure the data you find on webpages into rows and columns, using simple point-and-click technology. There are three tools you can use: extractors, crawlers and connectors. The input data dictates which tool to select.
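Import.io does this structuring point and click, but for intuition, here is a minimal sketch of what turning a webpage table into rows and columns involves, using only Python's standard library (the sample HTML is made up for illustration):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of <td>/<th> cells into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = []        # cells of the row being built
        self._cell = []       # text chunks of the cell being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

html = ("<table><tr><th>Product</th><th>Price</th></tr>"
        "<tr><td>Butter</td><td>2.49</td></tr></table>")
parser = TableExtractor()
parser.feed(html)
print(parser.rows)  # [['Product', 'Price'], ['Butter', '2.49']]
```

A real page would be fetched first and is usually much messier, which is exactly why point-and-click tools are attractive.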

Input can be:

1. data from a table //then you have to use extractors//
At first glance this case looks a bit odd, since copy and paste seems quicker, but by building an extractor you can generate code to place in a Google Spreadsheet, so your table is easier to update when the source is updated.

2. data extracted from every page of a website whose pages match a pattern you map //crawlers//

Swarovski's website.
If you want product details for every product that a selected online shop offers, you build an extractor for one product page and then automatically get the data for all the other products, since they share the same page layout. Just like in the picture above, you select what is important to you: name of the product, price, size, description.

3. data connected to a search box //connectors//

Walmart's search box filled with butter :). You can get all the data about this specific product.
You can download all data matching the conditions placed in a search box.
After building a connector, you can change the conditions and rerun the search in it. In our butter example, you can replace butter with bread and pull all the bread data from the Walmart page into a table.
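A connector is essentially a parameterized search request: the query term is the only thing that changes between runs. A minimal Python sketch of that idea (the search URL and `q` parameter here are hypothetical, not Walmart's or Import.io's real interface):

```python
from urllib.parse import urlencode

def search_url(query, base="https://www.walmart.com/search"):
    """Build a search URL for a query term.

    The endpoint and parameter name are made up for illustration;
    a real connector is built point and click in Import.io.
    """
    return f"{base}?{urlencode({'q': query})}"

print(search_url("butter"))  # https://www.walmart.com/search?q=butter
print(search_url("bread"))   # same search, different term
```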

A good tutorial covering extractors, crawlers and connectors => Get started with Import.io

I learned about Import.io from Andy's blog, VizWiz.

Cleaning data

Data Wrangler

It's an interactive tool for data cleaning and transformation, with an easy-to-understand interface.

A few things that can be done in Data Wrangler:
- mass deleting empty rows and columns
- mass deleting rows containing selected text
- reshaping data, e.g. transforming data from a cross tab
- filling empty cells
- splitting columns.

Check Data Wrangler's website, which contains a tutorial and sample data to play with. The tool is easy to learn.

Data Wrangler in action: transformation to a cross tab.


OpenRefine is a tool for working with dirty data: it cleans it and transforms it from one format to another.

Some basic and useful functionality can be learned in just a few minutes, but if you want to get the best out of OpenRefine, you have to invest your precious time in reading the documentation and online tutorials.
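As a taste of the format-to-format transformations OpenRefine handles, here is a plain-Python stdlib sketch (not OpenRefine itself) converting CSV rows into JSON records, on made-up sample data:

```python
import csv
import io
import json

csv_text = "name,price\nbutter,2.49\nbread,1.99\n"

# Read the CSV into a list of dicts keyed by the header row
records = list(csv.DictReader(io.StringIO(csv_text)))

# Serialize the same data as JSON
print(json.dumps(records, indent=2))
```

In OpenRefine this kind of conversion is a matter of choosing an export format, with no code at all.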

Refine opens in a browser; I use Google Chrome, so it looks like this:
