Extracting data from PDF
Plenty of people strongly believe that the best place for their data is a PDF. Everyone does it: statistical offices, international organizations. There is only one reason behind it: their pleasure. Yes, they find it pleasurable to imagine people trying to extract their data from PDFs.
OK, the reality is a bit different: the format was created to present documents in a manner independent of applications and operating systems.
Tabula
Tabula is an open source tool for freeing data tables locked inside PDF files. It lets you extract such data into a CSV file or an Excel spreadsheet. It's worth remembering that PDFs are tricky beasts, so no tool works in 100% of cases.
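To give a feel for why PDFs are tricky, here is a minimal Python sketch. It is not how Tabula works internally. It assumes the text has already been pulled out of the PDF and that columns are separated by two or more spaces; the sample text and the names `split_columns` and `text_to_csv` are made up for illustration.

```python
import csv
import io
import re

# Made-up sample of text as it might come out of a PDF: the columns are
# only separated by runs of spaces, there is no real table structure.
PDF_TEXT = """\
Region       Year   Value
North East   2013   120
South        2013   95
"""


def split_columns(line):
    """Split one line into cells on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", line.strip())


def text_to_csv(text):
    """Turn whitespace-aligned text into CSV, skipping blank lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.strip():
            writer.writerow(split_columns(line))
    return out.getvalue()


print(text_to_csv(PDF_TEXT))
```

A cell such as "North East" survives because it contains only a single space. Real PDFs break this assumption all the time (wrapped cells, merged headers), which is why a dedicated tool is usually the better choice.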
Extracting data from websites
Import.io
Import.io allows you to structure the data you find on web pages into rows and columns, using simple point-and-click technology. There are three tools you can use: extractors, crawlers and connectors. The input data dictates which tool you select. The input can be:
1. data from a table (then you have to use extractors)
Example: WSJ.com
2. data extracted from every page of a website whose pages match the pattern you mapped (crawlers)
Example: Swarovski's website.
3. data connected to a search box (connectors)
Example: Walmart's search box filled with butter :). You can get all the data about this specific product.
After building a connector, you can change the conditions and run a new search in it. In our butter example, you can replace butter with bread and pull all the bread data from the Walmart page into a table.
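If you are curious what turning a web page table into rows and columns looks like in code, here is a rough Python sketch using only the standard library. It is not Import.io's method, just an illustration of the idea; the `TableExtractor` class, the `extract_table` helper and the sample HTML with its prices are all hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up page fragment, not real Walmart data.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Butter</td><td>2.50</td></tr>
  <tr><td>Bread</td><td>1.20</td></tr>
</table>
"""


def extract_table(html):
    """Return the table cells found in html as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


for row in extract_table(SAMPLE_HTML):
    print(row)
```

This only handles a single, simple table. Nested tables, pagination and search forms are exactly the cases where a point-and-click tool saves time.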
A good tutorial covering the use of extractors, crawlers and connectors: Get started with Import.io
I learned about Import.io from Andy's blog, VizWiz.
Cleaning data
Data Wrangler
It's an interactive tool for data cleaning and transformation, with an easy-to-understand interface. A few things that can be done in Data Wrangler:
- mass deleting empty rows and columns
- mass deleting rows containing selected text
- reshaping data, e.g. transforming data from a cross tab
- filling empty cells
- splitting columns.
Check Data Wrangler's website, which contains a tutorial and sample data to play with. The tool is easy to learn.
Data Wrangler in action: transformation to a cross tab.
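The operations listed above are simple enough to sketch in plain Python. This is only an illustration of what each step does, not Data Wrangler's own code; all function names and the sample rows are made up, and the data is assumed to be a list of rows of strings.

```python
def drop_empty_rows(rows):
    """Remove rows in which every cell is empty or whitespace."""
    return [r for r in rows if any(c.strip() for c in r)]


def drop_rows_containing(rows, text):
    """Remove rows in which any cell contains the given text."""
    return [r for r in rows if not any(text in c for c in r)]


def fill_down(rows, col):
    """Fill empty cells in one column with the last value seen above."""
    filled, last = [], ""
    for r in rows:
        r = list(r)  # copy, so the input rows are left untouched
        if r[col].strip():
            last = r[col]
        else:
            r[col] = last
        filled.append(r)
    return filled


def split_column(rows, col, sep):
    """Split one column into two at the first occurrence of sep."""
    result = []
    for r in rows:
        left, _, right = r[col].partition(sep)
        result.append(r[:col] + [left.strip(), right.strip()] + r[col + 1:])
    return result


def unpivot(header, rows):
    """Turn a cross tab (one column per category) into long rows."""
    categories = header[1:]
    return [[r[0], cat, val] for r in rows for cat, val in zip(categories, r[1:])]


# Made-up messy input: a blank row, a total row, a missing region.
rows = [
    ["North", "Alpha, A1", "10"],
    ["", "", ""],
    ["", "Beta, B2", "20"],
    ["Total", "", "30"],
    ["South", "Gamma, G3", "5"],
]

cleaned = drop_empty_rows(rows)
cleaned = drop_rows_containing(cleaned, "Total")
cleaned = fill_down(cleaned, 0)
cleaned = split_column(cleaned, 1, ",")
for row in cleaned:
    print(row)

print(unpivot(["Region", "2013", "2014"], [["North", "1", "2"], ["South", "3", "4"]]))
```

The order matters: deleting the blank and total rows first means the fill-down step does not copy a region name into rows that were going to be thrown away anyway.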
OpenRefine
It's a tool for working with dirty data: it cleans it and transforms it from one format to another. Some basic and useful functionality can be learned in just a few minutes, but if you want to get the best out of OpenRefine, you have to invest your precious time in reading the documentation and online tutorials.
Refine opens in a browser. I use Google Chrome.