“How much data do I need to power my AI model?”
It seems like every day we hear more about the promise and potential of Artificial Intelligence, and for good reason. AI is driving more technology by the moment, and the trend is certain to continue.
At the same time, most of the amazing stories we hear about AI come from applications where the datasets are huge. In these applications, the model (which parses through the data to “learn”) uses data to predict the results. The question then becomes: how large a dataset do you need to produce accurate results? Not surprisingly, the answer is “It depends.”
Key Factors That Determine How Much Data AI Needs
Though all artificial intelligence involves both a model and training data, models vary significantly depending on their intended use. Data scientists broadly consider three factors that govern how much data a model requires:
Complexity of the Model
The more parameters a model has, the more data is required to train it. For example, an AI model tasked with identifying the manufacturer of a car from a given image would have a given set of parameters to do so, such as the badge on the hood or the shape of the vehicle. With these relatively finite parameters, the model could make a prediction. Yet if the model also had to suggest the average selling price of the vehicle in addition to the manufacturer, it would necessarily become more complex. Depending on how robust the result needs to be, the model could require data on age and condition, and perhaps even on supply and demand and on regional differences due to local economies. Thus, the more complex the model, the more data it requires.
Diversity of Input
The AI model needs to understand a variety of inputs. In the case of a chatbot, for example, models might need to account for different languages. Moreover, they might need to provide different levels of readability, depending on the education level of the user. Further refinements might consider informal, formal, and even pop culture references. Creating additional diversity has the potential to increase the effectiveness of the chatbot, but requires more data to do so accurately.
Error Tolerance
The objective of the model also impacts data quantity. For example, a 10% error rate may be acceptable for a model that predicts weather patterns, but unacceptable when detecting patients who are at risk of developing lung cancer. In other words, the lower your tolerance for error, the more data you’ll need for reliable results.
The Type of Model Dictates How Much Data You Need for AI
The type of model being used also affects how much data is required to generate accurate responses. Here are common examples of different models and the dataset sizes typically recommended for them.
Regression Analysis
Regression analysis is commonly used to predict a result (the dependent variable) based on the factors that are believed to impact that result (the independent variables). For regression analysis, data scientists first consider the number of independent variables relative to the number of observations within the dataset. Estimates range from as few as five observations per independent variable to twenty or more.
As an applicable example, Facebook relies on many independent variables to determine which content users are most likely to engage with. Research indicates that Facebook “knows” roughly 200 different factors (the independent variables, in this case) about its users, though not all come into play at any given time. For instance, one’s interest in sports teams may not correlate at all with one’s proclivity to like a cat video, though if the data were to show a correlation, it might. The given variables would be weighed against the number of “observations” (how many people share these same independent variables) to predict a likelihood of engagement.
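To make the rule of thumb concrete, here is a minimal Python sketch that multiplies the number of independent variables by the five-to-twenty observations per variable mentioned above. The function name observations_needed and its parameters are hypothetical, chosen only for illustration, and the result is a rough planning range rather than a guarantee of accuracy.

```python
def observations_needed(num_variables, per_variable_low=5, per_variable_high=20):
    """Return a rough (low, high) range of observations for a regression.

    Assumes the rule of thumb from the text: 5 to 20 observations
    per independent variable. The defaults are illustrative only.
    """
    if num_variables < 1:
        raise ValueError("num_variables must be at least 1")
    return num_variables * per_variable_low, num_variables * per_variable_high


# Using the roughly 200 factors from the Facebook example as input.
low, high = observations_needed(200)
print(f"200 variables: roughly {low:,} to {high:,} observations")
```

With 200 variables, the rule of thumb suggests somewhere between 1,000 and 4,000 observations as a starting point, and the twenty-or-more end of the range has no fixed upper bound.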
Time Series Analysis
In time series models, data scientists attempt to predict future results based on events in the past. Here again, recommendations vary. The simplest way to think about time-series data is that you need at least as much history as the distance you want to predict into the future. Said differently, revenue projections for a year from now for a company that is less than a year old will be far less accurate than those for a company with multiple years of data.
However, if one is only attempting to predict average sales for a given day of the week (and the business is not subject to seasonality), even a few weeks’ worth of data could be useful enough to begin to train the model.
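Both ideas can be sketched in a few lines of Python. The first function encodes the rule of thumb that the history should span at least the forecast horizon. The second averages sales by day of the week from a short history. The function names and the sales figures are hypothetical and exist only to illustrate the reasoning; a real forecast would use a proper time series model.

```python
from collections import defaultdict
from datetime import date, timedelta


def has_enough_history(history_days, horizon_days):
    """Rule of thumb: history should span at least the forecast horizon."""
    return history_days >= horizon_days


def average_by_weekday(daily_sales):
    """Average sales per weekday (0 = Monday) from (date, amount) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, amount in daily_sales:
        totals[day.weekday()] += amount
        counts[day.weekday()] += 1
    return {wd: totals[wd] / counts[wd] for wd in sorted(totals)}


# An eight-month-old company forecasting one year ahead falls short.
print(has_enough_history(history_days=240, horizon_days=365))

# Made-up data: three weeks of daily sales, starting on a Monday.
start = date(2024, 1, 1)
sales = [(start + timedelta(days=i), 100.0 + 10 * (i % 7)) for i in range(21)]
print(average_by_weekday(sales))
```

Three weeks give only three samples per weekday, which is enough to start with but far too few to detect seasonality.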
Image Classification
Image classification models often rely on datasets of thousands of images. As described above, much of this has to do with the required error tolerance, since many applications of image classification demand a high degree of accuracy.
Likewise, models relying on text analysis are also quite data-dependent and resource-intensive.
The combination of the type of model and the factors required within the model suggests that current artificial intelligence often requires massive amounts of data to be useful. Fortunately, this may not be the case in the future.
The Future of Artificial Intelligence: Mimicking Human Reasoning
While accurate results via machine learning require more data, pressures related to privacy are increasingly throttling both the use of and access to this data. There’s no greater example of this than changes within the digital advertising space, where companies like Facebook and Google are now feeling massive pressure to both collect and use less information about us.
Concerns about the privacy of data coupled with the inherent weaknesses of models without sufficient or reliable data are giving way to new sorts of artificial intelligence that researchers explain will more accurately reflect how true human intelligence works.
These so-called top-down approaches claim to be faster and more flexible, and to require fewer data points, than their bottom-up counterparts. Moreover, early tests from this new age of artificial intelligence have at times been astonishing, with far more accurate results being generated with far less data.
While artificial intelligence that is more conducive to privacy and less reliant on massive datasets will benefit everyone, the most dramatic impacts might be found in applications where big data simply hasn’t existed previously. For every enterprise organization with tens of thousands of users, there are hundreds of small to medium-sized companies that have been unable to take advantage of AI, simply because existing models are generally ineffective with small data.
Ironically, then, the biggest thing in artificial intelligence over the next couple of years could well be the rise of small data.