Data preparation is the process of cleaning raw data before it is processed and analyzed. It may entail reformatting data sets, making corrections, and combining data sets to enrich them.
What is data preparation?
Data preparation supports functions such as data warehousing and business intelligence. Demand for it is also driven by business users who want to analyze data without technical proficiency in tools such as SQL or Python.
The 5 D’s of data preparation (discover, detain, distill, document, and deliver) can help you understand the mechanics of the process. In order, they are:
With vast amounts of data at your disposal, how do you determine which data sets are best suited to your specific needs? Efficient data discovery depends on how well you’ve maintained a comprehensive data catalog, including high-level statistics about the data’s quality.
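As a rough illustration of the "high-level statistics" a catalog entry might hold, the sketch below profiles a small data set by counting rows, missing values, and distinct values per column. The `profile` function and the sample `orders` records are hypothetical and exist only for this example; a real catalog would typically track more than this.

```python
def profile(rows):
    """Return per-column quality statistics for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    stats = {"row_count": len(rows), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v not in (None, "")]
        stats["columns"][col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return stats


# Hypothetical sample records, used only for illustration.
orders = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "", "amount": 35.5},
    {"order_id": 3, "country": "DE", "amount": None},
]

catalog_entry = profile(orders)
print(catalog_entry)
```

Statistics like these make it easier to compare candidate data sets before committing to one.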
After the discovery process, it’s important to detain the selected data. This means holding it in a temporary staging area backed by managed storage, such as a big data repository. Detaining your data properly helps protect it from unintended manipulation and loss of integrity.
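One way to picture the staging step is to copy a file into a staging directory and confirm with a checksum that the copy is byte-identical. This is a minimal sketch, assuming file-based storage; the `stage` and `sha256_of` helpers are hypothetical names, and temporary directories stand in for a real managed repository.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stage(source, staging_dir):
    """Copy source into staging_dir and verify the copy is byte-identical."""
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    target = staging_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"checksum mismatch while staging {source}")
    return target, sha256_of(target)


# Temporary directories stand in for real raw and staging storage.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "orders.csv"
    raw.parent.mkdir()
    raw.write_text("order_id,amount\n1,20.0\n")
    staged_path, digest = stage(raw, Path(tmp) / "staging")
    staged_ok = staged_path.exists() and digest == sha256_of(raw)
    print(staged_path.name, staged_ok)
```

Storing the digest alongside the staged file lets later stages detect whether the data changed after it was detained.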
Distilling your data makes it fit for its intended use. Distillation means refining the data so that it is optimized for standard reporting and querying, which in turn supports better analytics and predictive outcomes.
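What refinement looks like depends on the data, but a minimal sketch might trim and normalize text, coerce numeric fields, drop records that cannot be parsed, and remove duplicates. The `distill` function, the field names, and the sample rows below are assumptions made for illustration.

```python
def distill(rows):
    """Normalize text, coerce amounts to numbers, and drop duplicate orders."""
    seen = set()
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # Skip records whose amount cannot be parsed.
        record = {
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),
            "amount": round(amount, 2),
        }
        if record["order_id"] in seen:
            continue  # Keep only the first occurrence of each order.
        seen.add(record["order_id"])
        cleaned.append(record)
    return cleaned


# Hypothetical raw rows with inconsistent formatting and a duplicate.
raw_rows = [
    {"order_id": "1", "country": " de ", "amount": "20"},
    {"order_id": "1", "country": "DE", "amount": "20.00"},
    {"order_id": "2", "country": "fr", "amount": "n/a"},
    {"order_id": "3", "country": "us", "amount": "35.456"},
]

distilled = distill(raw_rows)
print(distilled)
```

Whether to drop or repair an unparseable record is a policy decision; this sketch drops it for simplicity.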
Documentation is the process of recording the steps you used to discover, detain, and distill your data, so that the preparation can be understood and repeated later.
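Documentation can take many forms. As one possible sketch, the hypothetical `PreparationLog` class below records a timestamped entry for each preparation step and exports the result as JSON. The stage names come from the text above; the descriptions and detail values are made up for the example.

```python
import json
from datetime import datetime, timezone


class PreparationLog:
    """Collects one entry per preparation step so the run can be reviewed."""

    def __init__(self):
        self.entries = []

    def record(self, stage, description, **details):
        """Append a timestamped entry describing one step."""
        self.entries.append({
            "stage": stage,
            "description": description,
            "details": details,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self):
        """Serialize all entries as indented JSON text."""
        return json.dumps(self.entries, indent=2)


log = PreparationLog()
log.record("discover", "Selected orders data set from catalog", source="orders.csv")
log.record("detain", "Copied file to staging area", checksum_verified=True)
log.record("distill", "Dropped duplicates and unparseable amounts", rows_in=4, rows_out=2)
print(log.to_json())
```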
At this final stage of data preparation, the key goal is to structure the data in a format that the end user or downstream process can easily consume. Delivery should also adhere to data governance policies, for example to avoid exposing sensitive information.
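To make the delivery step concrete, the sketch below serializes prepared records as CSV while masking a column that a governance policy marks as sensitive. The `SENSITIVE_FIELDS` set, the `mask` rule, and the sample record are assumptions for illustration; an actual policy would define which fields to protect and how.

```python
import csv
import io

# Assumption: the governance policy lists these columns as sensitive.
SENSITIVE_FIELDS = {"email"}


def mask(value):
    """Keep the first character and replace the rest with asterisks."""
    text = str(value)
    return text[:1] + "*" * (len(text) - 1)


def deliver_csv(rows, fieldnames):
    """Serialize rows as CSV text, masking any sensitive columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        safe = {
            name: mask(row[name]) if name in SENSITIVE_FIELDS else row[name]
            for name in fieldnames
        }
        writer.writerow(safe)
    return buffer.getvalue()


# Hypothetical prepared record containing a sensitive field.
prepared = [{"order_id": 1, "email": "ana@example.com", "amount": 20.0}]
output = deliver_csv(prepared, ["order_id", "email", "amount"])
print(output)
```

Masking at the point of delivery keeps the staged and distilled data intact while limiting what each consumer can see.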