Improving Dataset Creation for Machine Learning
The greater the amount of high-quality data you have, the better your machine learning model will be. This post looks at the challenges (and solutions) involved in creating and organizing a dataset when you’re starting from scratch.
Over the past decade, one of the hottest areas of innovation has been applying machine learning to real-world applications. Google and Amazon now use machine learning models to transcribe speech to text, interpret language, and analyze images and videos. However, training the state-of-the-art deep neural networks behind these applications to a high degree of accuracy requires very large annotated datasets.
It’s no surprise, then, that the more high-quality data you have, the better your model is likely to be. But how can you build a business when the dataset for your application doesn’t exist? This post looks at the challenges involved in creating and organizing a dataset, then offers some solutions.
Problem #1: New datasets are expensive
Creating a new dataset can be costly. You need both the data (images, sensor value sequences, etc.) and the ground truth (labels). Labelling can be time consuming, but when it doesn’t require specialist skill it can be outsourced. In other cases, you need the result of a lab test or the opinion of a specialist to obtain the label, and this process can quickly become expensive, especially when you are working towards a minimum viable product.
Problem #2: Dataset quality is a challenge
Dataset quality is a major challenge in machine learning. A dataset with mislabelled data will yield poor classification results. Your dataset should accurately represent the reality you want to describe. To create your dataset, you want to take a relevant statistical sample of your target population. This sample should be independent and identically distributed (i.i.d.), even if the population isn’t.
For example, if you want to categorize images of dogs into different breeds, you’ll need a similar number of images for each breed. Beyond that, every piece of data collected should meet a minimum level of quality: it needs to contain the subject (signal) and have a minimal amount of “noise”. Noise refers to irrelevant information that can mask the signal you’re trying to detect, such as a blurred image or any number of other anomalies in the measured values. For example, when looking for dental anomalies in a patient image, each image should include the teeth, be in focus, and be properly magnified.
Problem #3: Keeping the dataset organized
Another challenge is keeping the dataset organized by using relevant metadata. How do you link the input data with the labels, especially when there can be a time delay between when the data is collected and when the label is applied (e.g. if the label is provided by a lab test several days later)?
Metadata such as camera settings, firmware versions, or sensor settings are key features to associate with the data. These settings can dramatically affect the nature of the data collected, which at some point will then need to be retrieved. Being able to efficiently create filters using metadata can save major headaches during the model creation process.
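As a minimal sketch of how labels that arrive late can be linked back to the data and then filtered by metadata, the example below keys every record on a shared sample ID. The `Sample` class, the field names, and the firmware and gain values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    sample_id: str                 # key shared by the data record and its label
    path: str                      # where the raw data is stored
    metadata: dict = field(default_factory=dict)
    label: Optional[str] = None    # filled in later, e.g. when a lab result arrives


def attach_labels(samples, labels_by_id):
    """Attach late-arriving labels to samples, matching on sample_id."""
    for s in samples:
        if s.sample_id in labels_by_id:
            s.label = labels_by_id[s.sample_id]
    return samples


def filter_by_metadata(samples, **criteria):
    """Return samples whose metadata matches every key/value in criteria."""
    return [
        s for s in samples
        if all(s.metadata.get(k) == v for k, v in criteria.items())
    ]


# Hypothetical records collected on two firmware versions.
samples = [
    Sample("s001", "raw/s001.bin", {"firmware": "1.2.0", "gain": 4}),
    Sample("s002", "raw/s002.bin", {"firmware": "1.3.0", "gain": 8}),
    Sample("s003", "raw/s003.bin", {"firmware": "1.2.0", "gain": 4}),
]

# Labels arrive days later, keyed by the same sample_id.
attach_labels(samples, {"s001": "healthy", "s003": "unhealthy"})

# Keep only labelled samples recorded with firmware 1.2.0.
labelled_fw_120 = [
    s for s in filter_by_metadata(samples, firmware="1.2.0") if s.label
]
print([s.sample_id for s in labelled_fw_120])
```

Storing the ID at collection time means the label can be attached whenever it becomes available, without relying on file names or collection order.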
With large datasets, storing and managing all these data streams can become overwhelming. For example, on a client project MistyWest built a model using data collected from a device in January, and it reached a classification accuracy of 97% when tested. More data was added two months later, and the accuracy dropped to 78%. We then had to work out what had caused the drop. Because the dataset was well organized, we could analyze the metadata to see whether there was a logical reason for the change. In this case, the culprit was a configuration change made to the sensor, so we excluded the data from the second batch and reverted the sensor setting.
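One way to run this kind of investigation is to break accuracy down by a metadata field and look for a group that stands out. The sketch below uses made-up records and a hypothetical `sensor_config` key. It is not the data or the tooling from the project described above.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Compute classification accuracy separately for each value of a metadata key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = r["metadata"][key]
        total[group] += 1
        correct[group] += int(r["predicted"] == r["label"])
    return {g: correct[g] / total[g] for g in total}


# Made-up evaluation records; "sensor_config" is a hypothetical metadata field.
records = [
    {"label": "a", "predicted": "a", "metadata": {"sensor_config": "v1"}},
    {"label": "b", "predicted": "b", "metadata": {"sensor_config": "v1"}},
    {"label": "a", "predicted": "b", "metadata": {"sensor_config": "v2"}},
    {"label": "b", "predicted": "b", "metadata": {"sensor_config": "v2"}},
]

per_config = accuracy_by_group(records, "sensor_config")
print(per_config)  # {'v1': 1.0, 'v2': 0.5}
```

A large gap between groups does not prove the metadata field is the cause, but it narrows down where to look first.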
Problem #4: Obtaining a balanced dataset
It can be difficult to create a balanced dataset, especially in medical applications, as it requires that you obtain similar amounts of healthy and unhealthy sample data. In many cases, you will have many more healthy subjects than unhealthy ones, and with an unbalanced dataset your model will be biased towards whichever class you have more of. Data augmentation and specialized techniques for unbalanced datasets can be used in these cases, but nothing replaces good input data. When unhealthy samples occur infrequently, it is therefore critical that you don’t miss them.
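One of the simpler techniques for unbalanced datasets is to weight each class inversely to its frequency, so that errors on the rare class count for more during training. The sketch below uses the common heuristic weight = N / (K × n_c), where N is the total number of samples, K the number of classes, and n_c the number of samples in class c. The 90/10 split is an invented example, and whether weighting is the right choice depends on the model and the application.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Invented example: 90 healthy samples and 10 unhealthy ones.
labels = ["healthy"] * 90 + ["unhealthy"] * 10
weights = balanced_class_weights(labels)
print(weights)  # healthy is weighted about 0.56, unhealthy 5.0
```

With these weights, each class contributes the same total weight to the training loss, even though the unhealthy class has nine times fewer samples.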
Solution #1: Bound the problem
The key to creating a cost-effective dataset is to bound the problem for the use case. By introducing constraints, such as image quality and orientation, you can drastically limit the amount of training data needed to create an effective model. Ensuring that the incoming data is high quality also makes it easier to train your model and validate its use.
It’s like creating a playground. Start by asking whether, in a simple, controlled environment, the data contains enough information to build an adequate model. If the answer is “yes”, the playground can be made more complex by reducing the number of constraints applied.
Solution #2: Build tools to automate dataset creation
You can build tools that collect data faster and ease logistical challenges. This usually consists of a system that:
- Collects the input data (from a video or sensor stream, for example)
- Creates an inventory of the input data
- Allows easy access to the data to execute exploratory data analysis tasks (do the basics: statistical analysis, visualize your data and get a feel for it).
- Provides easy and convenient access for labelling.
The method of labelling the data depends on the use case – it could include additional sensors (i.e., the ones you’re trying to replace), or a system for user input. An automated system can also pull relevant metadata and store it alongside the data. The availability of metadata can help to:
- Enable or simplify anomaly detection
- Explain variability in the data
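The inventory step described above could be sketched as follows: walk a data directory, record one entry per file, and store the acquisition metadata alongside it. The record fields, the file names, and the metadata keys are hypothetical. A real system would likely pull the metadata from the device rather than pass it in by hand.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_inventory(data_dir, metadata):
    """Create one inventory record per file found under data_dir."""
    records = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        content = path.read_bytes()
        records.append({
            "path": str(path.relative_to(data_dir)),
            "size_bytes": len(content),
            "sha256": hashlib.sha256(content).hexdigest(),  # detects duplicates and corruption
            "metadata": dict(metadata),                     # e.g. firmware or sensor settings
            "label": None,                                  # filled in during labelling
        })
    return records


def write_inventory(records, out_path):
    """Write the inventory as JSON Lines, one record per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Demonstration on a temporary directory holding two fake data files.
with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as out_dir:
    (Path(data_dir) / "a.bin").write_bytes(b"\x00\x01\x02")
    (Path(data_dir) / "b.bin").write_bytes(b"\x03\x04")
    inventory = build_inventory(data_dir, {"firmware": "1.2.0"})
    write_inventory(inventory, Path(out_dir) / "inventory.jsonl")
    written_lines = (Path(out_dir) / "inventory.jsonl").read_text(encoding="utf-8").splitlines()

print(len(inventory), "files inventoried")
```

A flat, line-oriented file like this is easy to load for exploratory analysis and to update once labels arrive.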
Solution #3: Instantaneous feedback
How do you ensure that you’re capturing quality data, especially when a rare sample is encountered? One answer is to check that every collected sample is of high quality at the moment it is captured. In some cases this can be achieved with a lightweight quality analysis, either directly on the device or through a low-latency, cloud-based assessment. This fast feedback loop can trigger a re-acquisition of the sample.
For example, with sensor data, the analysis could be targeted at assessing the noise level in the data. For an image, it could be assessing the focus or exposure of the images. If the data doesn’t pass this initial assessment, a prompt can be created to reacquire the data before moving on to the next data sample.
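A lightweight check of this kind might look like the sketch below. It estimates sensor noise from the spread of successive differences and scores image focus with the variance of a discrete Laplacian, a commonly used sharpness measure. The function names and both thresholds are assumptions for illustration. Suitable measures and limits depend on the sensor, the signal, and the use case.

```python
import statistics


def noise_estimate(signal):
    """Estimate noise as the std of successive differences divided by sqrt(2).

    Assumes the underlying signal changes slowly between consecutive samples.
    """
    if len(signal) < 2:
        raise ValueError("need at least two samples")
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    return statistics.pstdev(diffs) / 2 ** 0.5


def laplacian_variance(image):
    """Focus score: variance of the 4-neighbour Laplacian over interior pixels.

    `image` is a 2D list of grayscale values; blurred images score low.
    """
    if len(image) < 3 or len(image[0]) < 3:
        raise ValueError("image must be at least 3x3")
    values = [
        image[y - 1][x] + image[y + 1][x] + image[y][x - 1] + image[y][x + 1]
        - 4 * image[y][x]
        for y in range(1, len(image) - 1)
        for x in range(1, len(image[0]) - 1)
    ]
    return statistics.pvariance(values)


def needs_reacquisition(signal=None, image=None, max_noise=0.05, min_focus=100.0):
    """Return True if any supplied data fails its check (thresholds are hypothetical)."""
    if signal is not None and noise_estimate(signal) > max_noise:
        return True
    if image is not None and laplacian_variance(image) < min_focus:
        return True
    return False


smooth = [0.0, 0.1, 0.2, 0.3, 0.4]          # clean ramp
noisy = [0.0, 1.0, -1.0, 1.2, -0.8]         # large sample-to-sample jumps
sharp = [[255 * ((x + y) % 2) for x in range(4)] for y in range(4)]  # checkerboard
flat = [[128] * 4 for _ in range(4)]        # no detail, as in a badly blurred image

print(needs_reacquisition(signal=smooth))   # False
print(needs_reacquisition(signal=noisy))    # True
print(needs_reacquisition(image=sharp))     # False
print(needs_reacquisition(image=flat))      # True
```

Both checks are cheap enough to run before the next sample is taken, which is what makes the immediate prompt to reacquire practical.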
Conclusion
Building a dataset can be a challenging process. You must keep data quality and organization in mind, while not spending more than you can afford. Ensuring you understand your business problem will help guide you towards the best path for your machine learning dataset.
This article was originally written by Kristina Pearkes, former Firmware Engineer at MistyWest, in collaboration with Andreas Putz, former Computer Scientist at MistyWest, for the MistyWest blog.