Saturday, November 3, 2018

Big Data - we forgot to clean our data

Recently I have been working with lots of data coming from various business areas such as maintenance and financial transactions, and I have noticed an interesting way of thinking among many top managers and young leaders who don't have enough experience handling data from the source all the way up to visualization.

The first thing that comes to their mind is that it can be done in a few hours (some of them think it happens in the blink of an eye or in a split second). The usual question is "When can we see the report or chart? Can we see it tomorrow?" and the worst I get is "I want it to be ready today, before noon".

My first reaction was WTF!! (but I won't say it out loud). I normally negotiate with them, as most of them don't know the process or the data they are requesting. Most of them are easily amazed by the superb presentations given by the visualization tools' marketing teams. They think everything can be done as easily as those marketing people say. They forget that in a business presentation, the data set used by those marketing people has been prepared and cleaned before being loaded into the tool. For example, Tableau's presentations will always use Sales data for their sample.

It is true that many visualization tools nowadays are capable of processing and displaying any kind of data. With certain skills, you can massage and clean your data on the fly. I've done that and I know it can be done. BUT... it surely comes at a cost which, from my perspective, brings no benefit at all.

Why do I say so?

  1. You cannot guarantee that the data you read is 100% clean. You might need to do lots of conversions, data massaging, replacements and calculations. This will definitely cost additional processing power and time every time the report is populated. I've come across a lot of data that is either irrelevant, unclean (characters in a column that is supposed to be numeric, dates defined as strings, etc.), or contains null values.
  2. You may need to perform lots of table joins or unions, which can make your report server or tool resource hungry.
  3. You need to understand the data too. Each column, and how it should be visualized, must be understood before you can present it correctly.
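
To give a feel for the kind of cleaning the first point describes, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the column names `amount` and `posted`, the sample rows, the two date formats, and the helper names `parse_amount`, `parse_date` and `clean_rows`. It is not how any particular visualization tool does it, just one way to clean the data once, before the report reads it.

```python
from datetime import datetime

# Hypothetical raw rows, as they might arrive from a source system.
RAW_ROWS = [
    {"amount": "1200.50", "posted": "2018-11-01"},
    {"amount": "N/A", "posted": "2018-11-02"},   # characters in a numeric column
    {"amount": "300", "posted": "02/11/2018"},   # date stored as a string in another format
    {"amount": None, "posted": "2018-11-03"},    # null value
]

# Assumed date formats; adjust to whatever your own source actually uses.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_amount(value):
    """Return the value as a float, or None if it is missing or not numeric."""
    if value is None:
        return None
    try:
        return float(str(value).strip())
    except ValueError:
        return None


def parse_date(value):
    """Return a date for the first matching format, or None if nothing matches."""
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_rows(rows):
    """Split rows into typed, usable rows and rows that need a human to look at them."""
    clean, rejected = [], []
    for row in rows:
        amount = parse_amount(row.get("amount"))
        posted = parse_date(row.get("posted"))
        if amount is None or posted is None:
            rejected.append(row)
        else:
            clean.append({"amount": amount, "posted": posted})
    return clean, rejected


clean, rejected = clean_rows(RAW_ROWS)
print(f"{len(clean)} clean rows, {len(rejected)} rejected")
```

Keeping the rejected rows instead of silently dropping them matters here: those are the rows you will have to go back to the data owner about, and that conversation is exactly what does not fit into "ready today, before noon".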

