

According to a recent report from AI research and advisory firm Cognilytica, over 80% of the time spent on AI projects goes to dealing with and wrangling data. Even more important, and perhaps surprising, is how human-intensive much of this data preparation work is. In order for supervised forms of machine learning to work, especially the multi-layered deep learning neural network approaches, they must be fed large volumes of examples of correct data that are appropriately annotated, or “labeled”, with the desired output result. For example, if you’re trying to get your machine learning algorithm to correctly identify cats inside of images, you need to feed that algorithm thousands of images of cats, appropriately labeled as cats, with the images not having any extraneous or incorrect data that will throw the algorithm off as you build the model. (Disclosure: I’m a principal analyst with Cognilytica)

Data Preparation: More than Just Data Cleaning

According to Cognilytica’s report, there are many steps required to get data into the right “shape” so that it works for machine learning projects:

Enhancing and augmenting data - Sometimes you need extra data to make the machine learning model work, such as calculated fields or additional sourced data to get more from existing data sets. If you don’t have enough image data, you can actually “multiply” it by simply flipping or rotating images while keeping their data formats consistent.

Reduce noise - Images, text, and data can have “noise”, which is extraneous information or pixels that don’t really help with the machine learning project. Data preparation activities will clear those up.
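To make the idea of “multiplying” image data by flipping and rotating more concrete, here is a minimal sketch in plain Python. It is an illustration only, not a method taken from the Cognilytica report: the nested-list image representation and the function names `flip_horizontal`, `rotate_90_clockwise`, and `augment` are assumptions made for this example. A real project would more likely use an image library.

```python
from typing import List

# Hypothetical toy representation: an image is a list of rows of grayscale pixel values.
Image = List[List[int]]


def flip_horizontal(image: Image) -> Image:
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate the image 90 degrees clockwise: reverse the row order, then transpose."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image: Image) -> List[Image]:
    """Return the original image, a mirrored copy, and three rotated copies.

    Every variant uses the same nested-list format as the input, so the
    data format stays consistent across the enlarged set.
    """
    variants = [image, flip_horizontal(image)]
    rotated = image
    for _ in range(3):  # 90, 180, and 270 degrees
        rotated = rotate_90_clockwise(rotated)
        variants.append(rotated)
    return variants


# A tiny 2x3 "image" stands in for a real photo.
image = [[1, 2, 3],
         [4, 5, 6]]
variants = augment(image)
```

One labeled image becomes five training examples here, all of which keep the original label. Note that the 90 and 270 degree rotations of a non-square image swap its width and height, so a pipeline that expects a fixed input size would need to resize or stick to flips and 180 degree rotations.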
