11 May 2018

Talend Data Preparation – Tutorial – Part 1

Talend offers a software for cleaning and formatting data. The software is very intuitive. I’m using free Desktop version, so I can’t get wild with data sources. There are only two options: Excel and CSV. 

Surprisingly let’s import the dataset at the beginning…
Talend includes some datasets to play with.

I’ll use Customer Contact Data. 

 ….but you are free to use your own dataset.
  You can give a nice easy name to your preparation.

I see that some emails are not valid, so I select the column email by clicking on the header. My dataset has only 982 rows.

I don’t have any superpowers, I knew that some data doesn’t match the cell format as under each column is a Data Quality bar. 
Green – data matches the format
White – there is nothing (NULLs)
Orange –  the cells include data but it doesn't match the format.

Some information about quality of the selected column is also present in Data Profiling Tab.

There is possibility to delete the cells with invalid values or even delete the entire rows consisting problematic cells. Before using such drastic steps, it is a must to investigate the details.

There was comma instead of dot in an email, so I’ve corrected that manually by double clicking the cell, I’ve made a tick next to apply to all cells with this value. It’s useful when a value appears multiple times. Here it plays no role lol.

After correcting the comma I’ve only 5 incorrect values (at the beginning of the process I had 6!) and I will clear the cells as there is no chance to guess the correct addresses.

Number of nulls has increased. To delete the orange filter click at the cross.

I see more problems, there is an empty space after Frank. 

What if Frank with his empty space was in the row number 900? Should I devote an hour to search for Frank? There are other places, where I can find him easier.
I’m checking Patterns in Data Profiling tab…
No, he is not here.

Frank, you are surrounded

I’ve discovered more problems, there are two names starting with a small letter.

The problem with empty spaces is solved by selecting function Remove trailing and leading characters (whitespaces) and the cure for the second problem is Change to title case.

Mask data function gives possibility to hide some sensitive information.

The column job_title consists too many details about positions, Find and group similar text function is a good choice in this situation. 

Talend can advise but the final decision belongs to you, if you don’t want to accept entire group suggested by Talend, you have to delete a tick in the first column.

If you want to keep a suggested group but delete some members from it, untick members in the second column.
To change names of suggested groups double click them.
Talend automatically saves all activities in this side bar on the left.
Function order can be changed by pulling the functions up or down. You can delete steps by clicking on the rubbish bin next to an unwanted function.

Exporting button is in the right upper corner. All the changes don't affect data in your data source, you have to export data to be able to analyse your new clean dataset.

No comments:

Post a Comment