Imagine you decide to follow Apple’s idea and show users targeted ads for stores near their current location. Shops are eager to collaborate with you, users are happy to use your app, and you’re basking in financial success. In theory. For some reason, things aren’t going as planned, and your solution keeps encouraging some old lady to explore a seemingly nearby bakery, unaware that she had strolled past it a few miles ago.
But why is this happening, when the GPS signal from her phone is clear and uninterrupted? Well, it all boils down to how your solution interprets that seemingly accurate GPS data.
In reality, satellites provide noisy raw readings that often jitter erratically around the true trajectory. And while a human eye might be able to discern the overall movement trend and identify points A and B, an ML algorithm is unlikely to make sense of it without specific clarifications.
But the good news is that there is a perfect middleman between the chaos of humanity and the brilliance of AI — data preparation.
So, circling back to the lost and confused grannies, all it takes to ensure they receive accurate ads for nearby stores and don't miss any freshly baked rolls is to leverage a Kalman filter. It smooths chaotic GPS location data into a consistent estimate of the true trajectory, letting the model clearly see where and how the person is moving.
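To make that concrete, here is a minimal sketch of what such smoothing could look like in plain Python with NumPy: a constant-velocity Kalman filter applied to one noisy GPS coordinate. The motion model, noise settings, and the sample track are illustrative assumptions, not production tuning.

```python
# A minimal Kalman-filter sketch: smooth a noisy 1-D position series
# (e.g., GPS latitude) with a constant-velocity model. Noise values are
# illustrative assumptions.
import numpy as np

def kalman_smooth(observations, dt=1.0, process_var=1e-4, obs_var=1e-3):
    """Filter a 1-D series of noisy positions; returns smoothed positions."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
    H = np.array([[1.0, 0.0]])              # we only observe position
    Q = process_var * np.eye(2)             # process noise covariance
    R = np.array([[obs_var]])               # measurement noise covariance
    x = np.array([observations[0], 0.0])    # initial state: position, velocity
    P = np.eye(2)                           # initial state covariance
    smoothed = []
    for z in observations:
        # predict the next state
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new GPS fix
        y = z - H @ x                        # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        smoothed.append(x[0])
    return np.array(smoothed)

# Noisy latitude track jittering around a straight walk
true_lat = np.linspace(53.90, 53.91, 50)
noisy_lat = true_lat + np.random.normal(0, 0.0005, size=50)
print(kalman_smooth(noisy_lat)[:5])
```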
If we look at the world through the eyes of ML algorithms, we’ll understand that over 90% of the information surrounding us — whether it’s sounds, images, or even texts — is essentially an indecipherable jumble.
Data preparation, on the other hand, is what makes this data comprehensible. The process involves structuring unstructured data, cleaning it of defects and errors, filling in missing elements, and enriching what is already there to improve the algorithm's accuracy and its ability to predict a certain outcome.
At its core, effective data preparation is the trampoline that has catapulted machine learning into popularity in recent years.
Precision in tracking an individual's location, pinpointing available parking spaces, or forecasting treatment outcomes for patients with ML algorithms became possible only because these models were supplied with data that humans had appropriately processed and annotated.
Regardless of the volume of data, if you can't present it in a form the algorithm can interpret, the machine will be nearly useless or, in some cases, even harmful.
Let's imagine the idea with the GPS-driven store advertising worked brilliantly, and now the owner of a leading airline company seeks your expertise in ensuring their brand safety. The task at hand is as follows: amidst the diversity of websites on the Internet, this airline, as an advertiser, doesn't want its ads to appear on sites related to terrorism or those mentioning aviation disasters. To achieve this with ML algorithms, you begin the data preparation process.
You, naturally, start with data engineering — all aspects of collecting relevant information from diverse sources. In the context of our brand reputation case, the main source is the Internet. This process entails gathering content from a variety of websites and online platforms, including extracting text, media, metadata, and other relevant details to ensure accurate filtration in the future.
Your next logical step, data labeling, kicks in once you've stored the data. This step requires human input: annotators assess the data and attach relevant, informative labels that provide context for the machine learning model. The process can also be facilitated by tools such as Amazon SageMaker Ground Truth.
To bring the data preparation to a close, you opt for data cleaning for ML algorithms: purifying the text and getting the data ready for machine learning model training. You remove less significant words, such as articles, prepositions, and anything else that could distort the overall picture.
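As a rough illustration, a cleaning step of this kind could look like the sketch below. The stopword list and the sample page snippets are made up; a real pipeline would typically rely on a fuller list (e.g., from NLTK or spaCy) and language-specific rules.

```python
# A minimal text-cleaning sketch: lowercase, strip punctuation, and drop
# low-signal words. The stopword set and sample texts are illustrative.
import re

STOPWORDS = {"a", "an", "the", "of", "on", "in", "at", "to", "for", "and", "or"}

def clean_text(text: str) -> str:
    """Lowercase, keep word tokens, and remove stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

pages = [
    "Breaking news on the aviation disaster at the airport",
    "The best deals on flights to sunny destinations",
]
print([clean_text(p) for p in pages])
# ['breaking news aviation disaster airport', 'best deals flights sunny destinations']
```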
By adhering to these steps, you make success all but inevitable and secure the airline's reputation for many years to come.
What's next? None other than an even greater project!
Given your machine learning skills, and considering that Will Smith is too old for all this, you're now entrusted with an exciting top-secret intergalactic mission: to decipher messages from extraterrestrial civilizations with a unique way of presenting information, and to use them to tackle loads of earthly challenges. So far, the broad-strokes understanding our human translators have of these messages isn't enough to train any machine learning model and, consequently, to benefit humanity.
But since going solo in this grand mission is hardly an option, team up with the fantastic Ivan Zolotov, an experienced go-between for Earth and Space (disguised as Oxagile’s data engineer with over 10 years of expertise and a stellar track record in machine learning projects). He will guide you step by step through the data preparation process, helping you navigate potential pitfalls and ensuring you achieve the most accurate results and predictions with machine learning based on any, even extraterrestrial data.
Since there are numerous quirks in input data that can prevent ML models from generating accurate results, let's pay some special attention to the first step in any data workflow: data exploration.
To easily weed out any irrelevant elements and make sure all your data is coherent, I suggest ticking all the boxes of this comprehensive checklist during the data exploration phase:
What does matching data granularity mean? If we want to predict the weather on Jupiter every hour, then our training data should also be recorded hourly. Attempting to predict hourly weather from daily data would not yield meaningful results.
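With pandas, for instance, you could quickly check and adjust the granularity of a time series along these lines; the synthetic Jupiter temperatures below are purely illustrative.

```python
# A minimal sketch of matching data granularity: if the target is hourly,
# the training series should be hourly too.
import numpy as np
import pandas as pd

hourly = pd.DataFrame(
    {"temperature": np.random.normal(-110, 5, size=72)},
    index=pd.date_range("2024-01-01", periods=72, freq="h"),
)

daily = hourly.resample("D").mean()   # aggregating down to daily granularity
print(daily.shape)                    # (3, 1): too coarse for hourly predictions
print(hourly.shape)                   # (72, 1): matches the hourly target
```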
There’s no one-size-fits-all measure determining when data is truly sufficient. However, let me share with you one common practice.
What does having enough data mean? Say we're predicting the lifespan on Pluto based on some random data. Our dataset includes each being's race, number of tentacles, weight, and height. Each of these properties is a feature. So, the number of training examples we need should be at least the number of features in our input data multiplied by 10.
If we're predicting a number based on previous observations rather than a category, the multiplier should be 50 instead of 10.
It is worth noting that these figures are more of an empirical observation, and in each specific case, these numbers may vary.
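Expressed as a tiny helper, the rule of thumb might look like this; it is purely an empirical heuristic, as noted above.

```python
# A quick sketch of the "rows per feature" rule of thumb.
def rough_min_samples(n_features: int, predicting_a_number: bool = False) -> int:
    """~10 examples per feature for categories, ~50 per feature for numbers."""
    return n_features * (50 if predicting_a_number else 10)

# Four features: race, tentacles, weight, height
print(rough_min_samples(4))        # 40 examples for a categorical target
print(rough_min_samples(4, True))  # 200 examples for a numeric target
```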
Check for missing data, which can occur for various reasons such as sensor malfunctions, voltage spikes, or gaps in data collection.
It’s also crucial to distinguish between absent values due to technical issues (e.g., sensor malfunction) and those that are genuinely inapplicable to a given object.
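A quick missing-value audit with pandas might look like the following sketch; the tiny sensor dataset is made up for illustration.

```python
# A minimal missing-value audit with pandas on a made-up sensor dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sensor_id": [1, 1, 2, 2, 3],
    "voltage":   [3.3, np.nan, 3.1, np.nan, 3.2],    # NaN: sensor dropout
    "tentacles": [4, np.nan, np.nan, 8, 6],          # NaN: missing or not applicable?
})

print(df.isna().sum())                # missing values per column
print(df.isna().mean().round(2))      # share of missing values per column

# Rows where a technical gap (voltage) coincides with another gap
print(df[df["voltage"].isna() & df["tentacles"].isna()])
```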
As with missing data, it's important to determine whether the duplicates you find are genuine records or the result of a data error, such as a sensor briefly re-sending the same values.
There’s no specific number defining an acceptable or unacceptable number of duplicates, but this aspect still requires attention.
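Here is a minimal sketch of such a duplicate audit with pandas, again on made-up sensor logs.

```python
# A minimal duplicate audit with pandas on a made-up sensor log.
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["10:00", "10:01", "10:01", "10:02", "10:02"],
    "sensor_id": [7, 7, 7, 7, 7],
    "reading":   [0.52, 0.54, 0.54, 0.55, 0.91],
})

# Fully identical rows: possibly a sensor re-sending the same packet
dupes = df[df.duplicated(keep=False)]
print(dupes)

# Rows that share a timestamp and sensor but may carry different readings:
# worth inspecting before deciding whether to drop or keep them
conflicts = df[df.duplicated(subset=["timestamp", "sensor_id"], keep=False)]
print(conflicts)
```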
What does category consistency mean? Imagine that, following a survey, we identified three professions: space pilot, galactic stylist, and interstellar psychologist. However, while some respondents specified their specialty as "interstellar psychologist", others simply wrote "space shrink". So, it's crucial to make sure that individuals belonging to the same category, but expressing it differently, still end up in the same class.
To grasp and tackle this issue, let's take a preparatory step. If we're not using cloud tools and are working with Python, we'll need to write a small program that builds a chart or histogram of the distinct values in a specific field. If, for instance, we notice that in a particular professional field we have 100 "interstellar psychologists" and 2 "space shrinks", it becomes clear that they essentially mean the same thing and should be merged.
At least some knowledge of the subject area might come in handy here.
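In pandas, spotting and merging such inconsistent labels could look roughly like this; the survey counts and the alias mapping are illustrative assumptions.

```python
# A minimal sketch of finding and merging inconsistent category labels.
import pandas as pd

survey = pd.DataFrame({
    "profession": ["interstellar psychologist"] * 100
                  + ["space shrink"] * 2
                  + ["space pilot"] * 40
                  + ["galactic stylist"] * 15
})

# The frequency table (or a bar chart of it) exposes the odd-looking variants
print(survey["profession"].value_counts())

# Merge variants that mean the same thing
aliases = {"space shrink": "interstellar psychologist"}
survey["profession"] = survey["profession"].replace(aliases)
print(survey["profession"].value_counts())
```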
What does a representative sample mean? Consider a scenario where we aim to forecast the demand for sunblock cream among space residents. The challenge arises from the fact that our training data over-represents Neptune residents, who experience an average temperature of minus 214 °C. This skewed dataset leads our model to predict that selling sunblock in space would be a fiasco. However, this has nothing to do with reality, since our input data is simply unrepresentative.
What does data leakage mean? Some time ago, our machine learning model tried to recognize aliens and their conversations in audio files. To train our ML algorithm, we loaded various files: in some, the aliens were silent, and in others they were talking. Everything seemed fine during controlled predictions. But later, in real-life conditions, we realized that the files with conversations were always the same length (5 seconds), and the model had gotten used to it, assuming that's how long aliens actually speak. It simply didn't recognize any longer speeches as alien talk.
What conclusion can be drawn from this? If the prediction result heavily relies on just one property, it’s a clear indication that we might be dealing with data leakage.
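One quick way to surface this red flag is to fit a throwaway baseline model and look at its feature importances. The sketch below mirrors the audio example with a synthetic "clip length" leak; the data and the choice of a scikit-learn random forest are illustrative assumptions.

```python
# A minimal leakage check: does one feature dominate a quick baseline model?
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500
# Clips with alien speech were always exactly 5 seconds long (the leak);
# loudness is unrelated noise.
clip_length = np.where(rng.random(n) < 0.5, 5.0, rng.uniform(1, 10, n))
loudness = rng.normal(0, 1, n)
y = (clip_length == 5.0).astype(int)   # label accidentally encoded in length

X = np.column_stack([clip_length, loudness])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

for name, importance in zip(["clip_length", "loudness"], model.feature_importances_):
    print(f"{name}: {importance:.2f}")  # ~1.0 vs ~0.0 is a leakage red flag
```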
Another thing to check is the storage format of your data. For instance, while CSV files are widely supported by many tools, for complex models that prioritize performance and data volume, alternative formats from the HDF family (such as HDF5), as well as TFRecord, may prove more efficient. Benchmarks indicate that the training time for a model using TFRecord files can be as low as 3% of that of a model trained on CSV files, a difference of 30 to 50 times.
This is particularly relevant when you use some cloud infrastructure for model training, where the time spent on training translates into additional expenses.
Libraries for working with all these formats are readily available, so there's no need to write anything from scratch. The key is simply to choose the right one so that model training happens much faster, especially when dealing with a large volume of data.
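For example, writing and reading a TFRecord file with TensorFlow could look like this minimal sketch; the feature names and values are made up.

```python
# A minimal sketch of writing and reading a TFRecord file with TensorFlow.
import tensorflow as tf

def make_example(lat: float, lon: float, label: int) -> tf.train.Example:
    """Pack one record into the protobuf format TFRecord files store."""
    return tf.train.Example(features=tf.train.Features(feature={
        "lat":   tf.train.Feature(float_list=tf.train.FloatList(value=[lat])),
        "lon":   tf.train.Feature(float_list=tf.train.FloatList(value=[lon])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

with tf.io.TFRecordWriter("track.tfrecord") as writer:
    for lat, lon, label in [(53.90, 27.55, 0), (53.91, 27.56, 1)]:
        writer.write(make_example(lat, lon, label).SerializeToString())

feature_spec = {
    "lat":   tf.io.FixedLenFeature([], tf.float32),
    "lon":   tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
dataset = tf.data.TFRecordDataset("track.tfrecord").map(
    lambda record: tf.io.parse_single_example(record, feature_spec)
)
for row in dataset:
    print(row["lat"].numpy(), row["label"].numpy())
```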
As for where the data comes from, you can use open-source data, but this brings us back to point 1, where you need to ensure that the raw data you have will allow you to predict what you need.
What does it mean in practice? You have your own data collected from sensors, the Internet of Things, or a smart home on Mars. However, it might be worthwhile to also include external inputs, such as a weather forecast, to predict how much electricity might be needed tomorrow for protection against a meteor shower.
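A simple way to bring in such an external source is a timestamp join, for instance with pandas; the consumption readings and the meteor-risk forecast below are hypothetical.

```python
# A minimal sketch of enriching your own readings with an external forecast.
import pandas as pd

consumption = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=4, freq="h"),
    "kwh": [12.1, 13.4, 15.0, 14.2],
})
forecast = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=4, freq="h"),
    "meteor_risk": [0.1, 0.2, 0.7, 0.9],   # external forecast, made up here
})

enriched = consumption.merge(forecast, on="timestamp", how="left")
print(enriched)
```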
Great, all the checkboxes are ticked, and we’ve got a batch of spot-on data. Now, let’s work our technical magic to make sure we don’t miss a single drop of valuable insight from this information.
Today, there are many accessible frameworks for creating charts, like Matplotlib (a fairly common Python library), which can generate visuals for your data and let you spot any critical aspects at a glance. The best part is that such tools are usually not just for tech superheroes: they're incredibly user-friendly and largely automated.
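For example, a quick exploratory histogram with Matplotlib takes only a few lines; the sample distribution below is made up.

```python
# A minimal exploratory chart with Matplotlib on a hypothetical survey field.
import matplotlib.pyplot as plt
import numpy as np

heights = np.random.normal(175, 10, size=1_000)   # made-up height measurements

plt.hist(heights, bins=30)
plt.title("Distribution of height in the sample")
plt.xlabel("Height, cm")
plt.ylabel("Count")
plt.show()
```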
It’s essential to check that the sample is representative. If the dataset represents a snapshot of the space population, to make predictions, we need to ensure that the data in this sample corresponds to the entire space population, not just one state in America, for instance.
a) ML models cannot handle text directly; therefore, our categorical data (such as the names of the intergalactic professions we've mentioned) needs to be represented as numbers. There are several ways to do this: the most popular is so-called one-hot encoding, and another is ordinal encoding.
b) To enhance accuracy, it is recommended to perform scaling or normalization, transforming numerical values into a different range so that they are all of a consistent order. There are various methods for doing this, such as scaling to a range, which brings all numerical values into a fixed interval, for example, from 0 to 1 (a short sketch of both encoding and scaling follows below).
If we have properties like height and annual income, both are numerical, but height tops out at around 200 centimeters (for humans, at least), while income can range from hundreds to millions depending on the currency and exchange rates in your country or on your planet. Although both are numerical, their orders of magnitude differ.
So in this case, 0 may represent the smallest value of height, and 1 — the largest possible value.
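Here is a minimal sketch of both steps with pandas and scikit-learn: one-hot and ordinal encoding for a categorical field, and min-max scaling for numerical ones. The tiny dataset of intergalactic professions is, of course, made up.

```python
# A minimal sketch of encoding categorical text and scaling numerical columns.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

df = pd.DataFrame({
    "profession": ["space pilot", "galactic stylist", "interstellar psychologist"],
    "height_cm":  [182.0, 165.0, 2.0],
    "income":     [45_000.0, 1_200_000.0, 300.0],
})

# a) Categorical text -> numbers
one_hot = pd.get_dummies(df["profession"], prefix="profession", dtype=int)
ordinal = OrdinalEncoder().fit_transform(df[["profession"]])
print(one_hot)   # one 0/1 column per profession
print(ordinal)   # a single column of integer codes

# b) Numerical columns -> a common 0-to-1 range
scaled = MinMaxScaler().fit_transform(df[["height_cm", "income"]])
print(scaled.round(3))   # 0 is the smallest observed value, 1 the largest
```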
Obviously, there is no arguing with the necessity of preprocessing data, both for training and for real-world use of machine learning, if you want to obtain accurate predictions.
If we don’t catch certain patterns, overlook some details, or let our machine learning model focus too much on the wrong parameters, we might miss the existence of extraterrestrial civilizations or, even worse, lose out on valuable opportunities and profits from the raw data we have.
At the same time, there's no reason to fret about the time-consuming and intricate nature of these processes. Thankfully, there are services like Amazon SageMaker that streamline all the steps into a cohesive tool, making the entire experience much smoother.
In the interest of time, we won't go into more detail here. However, our team of data engineers has a wealth of stories drawn from hands-on development experience and is eager to share its insights. If you want to hear more and explore the potential ways ML solutions can elevate your business, Oxagile's here for that.