Big data describes the large volumes of data, both structured and unstructured, that your company generates every single day. Analysts at Gartner estimate that more than 80 percent of enterprise data is unstructured. That means it can be text files from IT logs, emails from customer support, direct Twitter messages from customers, and employee complaints to your HR department. Diverse and scattered data sources like these exist in almost every enterprise.
A big data strategy, on the other hand, is a glorified term for how you’ll collect, store, document, and manage your data, and how you’ll make it accessible to the rest of the company. When companies don’t have a good data strategy, they spend enormous amounts of time just getting their data into a usable form when needed.
A big data strategy involves planning around how you collect, store, document, and manage your data, and how you make it accessible to the rest of the company.
But, you may be wondering, what’s this “Big Data” got to do with AI?
Modern AI applications thrive on data. Depending on the problem, it can be your very own structured or unstructured data.
In fact, according to IBM’s CEO, Arvind Krishna, data-related challenges are the top reason IBM clients have halted or canceled AI projects. Forrester Research also reports that data quality is among the biggest AI project challenges. This goes to show how critical data, or rather big data, is for AI.
A Support Ticket Routing Example
Let’s take a machine learning model that automatically routes support tickets to the appropriate support agents. In order to build this model, you’d need a large volume of historical support tickets and the corresponding routing. Historical here means all the old, resolved tickets.
This historical routing data is then used to automatically learn patterns so that the machine learning model can make predictions on new incoming tickets.
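To make the idea of learning from historical routing concrete, here is a minimal sketch in plain Python. It assumes a simple word-count (naive Bayes) approach, which is only one of many ways such a model could be built. The tickets, the team names, and the `train` and `route` helpers are all made up for illustration.

```python
from collections import Counter, defaultdict
import math

# Hypothetical, hand-made examples standing in for historical tickets:
# (ticket text, team that resolved it)
HISTORICAL_TICKETS = [
    ("cannot log in password reset not working", "accounts"),
    ("forgot password and locked out of account", "accounts"),
    ("charged twice on my invoice this month", "billing"),
    ("refund for duplicate charge on invoice", "billing"),
    ("app crashes when uploading a file", "technical"),
    ("error message when uploading large file", "technical"),
]


def train(tickets):
    """Count how often each word appears in tickets routed to each team."""
    word_counts = defaultdict(Counter)
    team_counts = Counter()
    for text, team in tickets:
        team_counts[team] += 1
        word_counts[team].update(text.lower().split())
    return word_counts, team_counts


def route(text, word_counts, team_counts):
    """Pick the team with the highest naive Bayes score (add-one smoothing)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_tickets = sum(team_counts.values())
    best_team, best_score = None, -math.inf
    for team, n_tickets in team_counts.items():
        # Start from how common the team is in the historical data.
        score = math.log(n_tickets / total_tickets)
        total_words = sum(word_counts[team].values())
        for word in text.lower().split():
            count = word_counts[team][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_team, best_score = team, score
    return best_team


word_counts, team_counts = train(HISTORICAL_TICKETS)
predicted_team = route("I was charged twice, please refund", word_counts, team_counts)
print(predicted_team)
```

With only six toy tickets the sketch can already route a new billing complaint, but a real model would need the large volume of resolved tickets described above, plus proper text cleaning and evaluation.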
If this data is not stored or is not accessible to your data scientists, then you’ll have to rely on external data sources, which may not be ideal. For AI applications, you need not just any data but good data.
Alternatively, you can continue routing manually until you’re able to perform some intentional data collection. Unfortunately, this will set you back six months to a year, depending on the volume of incoming data.
This problem happens all the time!
“And so you run out of patience along the way, because you spend your first year just collecting and cleansing the data…And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it.”
It happens because companies generally don’t have a data strategy, let alone a good one. Data is acquired and used on an ad hoc basis and for very specific purposes.
A recent survey of C-level executives representing companies like Ford Motor and Johnson & Johnson showed that over 50 percent of the companies were NOT treating their data as a business asset at all. What’s more interesting is that the leaders admit that technology isn’t the problem. People and processes are.
The Side-Effects of Not Having a Data Strategy
When it comes to AI development, there are typically three data problems that companies struggle with:
- Lack of data: the data needed to train a model is non-existent
- Incomplete data: only part of the data is available or stored
- Limited or no access to data: the data is stored in a location that is inaccessible to the company at large. This can be due to security reasons or general infrastructure issues.
These problems tend to happen due to the lack of planning around how data will be collected, stored, cleaned, and made accessible to the rest of the company—i.e., your data strategy. These data problems may not completely halt your AI initiatives but can have a negative impact on your business. Let’s see why and how.
#1: Hampers exploratory analysis
Exploratory analysis of data can help determine what’s possible and what’s not with AI.
There are two ways companies start AI projects. The first is to simply explore the data and determine what’s possible with it. This approach works if it ties in with a big pain point in your company. Otherwise, you’ll be doing AI for the sake of AI, a topic I’ll cover in a separate post.
The second way is to start with a pain point and then determine whether AI is the right approach. At that point, your data scientists will have to determine whether the company data can support the initiative.
Either way, you need to have access to data and access to the right data to determine feasibility. A broken or non-existent data infrastructure will cripple this.
You need to have access to data and access to the right data to determine feasibility of AI projects
Exploratory analysis will also help surface potential issues in your data, such as imbalance and sparsity, before you start a well-formed project. For more context, the steps in the figure below are some of the tasks that data scientists perform during exploratory data analysis.
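As a rough illustration of two such checks, the sketch below measures label imbalance and field sparsity on a handful of made-up ticket records. The records, the field names, and the `label_shares` and `missing_rates` helpers are assumptions for the example, not a prescribed workflow.

```python
from collections import Counter

# Hypothetical ticket records; None marks a value that was never captured.
tickets = [
    {"category": "login issue", "product": "web", "region": None},
    {"category": "login issue", "product": "web", "region": "EU"},
    {"category": "login issue", "product": None, "region": None},
    {"category": "login issue", "product": "mobile", "region": None},
    {"category": "billing", "product": "web", "region": "US"},
]


def label_shares(records, label_field):
    """Fraction of records per label; a lopsided result signals imbalance."""
    counts = Counter(r[label_field] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def missing_rates(records):
    """Fraction of records where each field is empty; high values signal sparsity."""
    fields = records[0].keys()
    return {
        field: sum(r[field] is None for r in records) / len(records)
        for field in fields
    }


shares = label_shares(tickets, "category")
missing = missing_rates(tickets)
print(shares)
print(missing)
```

In this toy data, four of the five tickets share one category and the region field is mostly empty. Both are the kind of finding you would want before committing to a project.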
#2: Stale predictions & recommendations
When companies don’t have a centralized datastore with fresh data, they work around this by acquiring one-time data dumps. This is acceptable for development but can be harmful in practice.
That’s because the data used for development may not be reflective of the current reality.
For example, if you develop a product recommendation engine trained on customer data from 2018 to make recommendations in 2020, you may be in for a surprise. Customers may shun your recommendations because you’ve lost touch with their current tastes.
Due to COVID-19, customers may be more cost-sensitive or may prefer products containing disinfecting properties. If you recommend only high-end products or natural products, then customers will assume that you’ve lost touch with their taste and start ignoring the recommendations. This is often referred to as “stale recommendations” or “stale predictions”.
Stale predictions refer to predictions “learned” from outdated historical data or data that does not reflect current reality
Having access to fresh data allows models to be retrained periodically to ensure the output of models remains of high quality. Models typically need to adapt because:
- Customer behavior can change over time (think pre- and post- COVID-19)
- Underlying data distributions can change over time (when you were a startup vs. now)
- Governmental rules and policies on data use may change
- Societal norms may change, requiring limited use of certain information (think
A data strategy works towards preventing staleness by ensuring that your data, old or new, can always be accessed and is ready for retraining models.
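One simple way to notice that a model may be going stale is to compare the mix of categories in the training data against recent data. The sketch below uses total variation distance for that comparison. The purchase categories, the `RETRAIN_THRESHOLD` value, and the helper names are hypothetical, and real drift monitoring usually looks at many more signals.

```python
from collections import Counter


def category_shares(values):
    """Fraction of observations that fall into each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


def total_variation_distance(old_values, new_values):
    """0.0 means identical category mixes; 1.0 means no overlap at all."""
    old, new = category_shares(old_values), category_shares(new_values)
    categories = set(old) | set(new)
    return 0.5 * sum(abs(old.get(c, 0.0) - new.get(c, 0.0)) for c in categories)


# Hypothetical purchase categories in the training dump vs. recent orders.
training_purchases = ["premium"] * 6 + ["natural"] * 3 + ["budget"] * 1
recent_purchases = (
    ["premium"] * 2 + ["natural"] * 2 + ["budget"] * 4 + ["disinfecting"] * 2
)

RETRAIN_THRESHOLD = 0.2  # assumed cut-off; tune per application

drift = total_variation_distance(training_purchases, recent_purchases)
needs_retraining = drift > RETRAIN_THRESHOLD
print(f"drift={drift:.2f}, retrain={needs_retraining}")
```

A check like this is only possible when both the old training data and the fresh data are stored and accessible, which is exactly what the data strategy is meant to guarantee.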
#3: Low quality models
Low-quality models, in other words, models with low accuracy, can make gross mistakes on prediction or recommendation tasks. For example, categorizing a support ticket as pertaining to a “login issue” when in fact it’s related to “fraudulent account access” can have disastrous consequences. This is especially true if the issue is time-sensitive and relates to the health and safety of people.
In 2013, IBM partnered with The University of Texas MD Anderson Cancer Center to develop a new “Oncology Expert Advisor” system, a clinical decision support technology powered by IBM Watson. Unfortunately, Watson was giving incorrect and downright dangerous cancer treatment advice. Reports state that the problem happened because the AI was trained on a small number of hypothetical cancer patient cases rather than real patient data, which resulted in inaccurate recommendations.
This is clearly a problem of data quality.
And, data quality issues can be introduced by a broken data infrastructure when:
- Data is not centralized
- You have access to only a subset of the data
- The volume of data is small
Each of these prevents data scientists and models from getting an accurate, holistic view of things.
Using the support ticket example, if your machine learning model is trained on data from a single satellite office that deals primarily with “login issues”, its knowledge of all other types of support issues is limited. The end result is often a model that looks good on paper but is useless in practice.
Data warehousing and integration of your diverse data sources can minimize this by bringing completeness to your data. It also ensures that your data is more easily accessible throughout the company.
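The satellite office problem can be checked directly once ticket data from every office sits in one place. The sketch below lists the ticket categories that exist somewhere in the company but never appear in the training set. The office names, the ticket counts, and the `uncovered_categories` helper are invented for illustration.

```python
# Hypothetical ticket categories per office.
tickets_by_office = {
    "satellite": ["login issue"] * 8 + ["billing"] * 1,
    "headquarters": (
        ["fraudulent account access"] * 3 + ["billing"] * 4 + ["login issue"] * 2
    ),
}


def uncovered_categories(training_offices, all_offices):
    """Categories seen somewhere in the company but absent from the training set."""
    seen_in_training = {c for o in training_offices for c in all_offices[o]}
    seen_anywhere = {c for cats in all_offices.values() for c in cats}
    return sorted(seen_anywhere - seen_in_training)


# Training on one office only vs. training on the integrated data.
gaps_single = uncovered_categories(["satellite"], tickets_by_office)
gaps_combined = uncovered_categories(list(tickets_by_office), tickets_by_office)
print(gaps_single)
print(gaps_combined)
```

A model trained only on the satellite office would never have seen a “fraudulent account access” ticket, so it could not route one correctly no matter how well it scores on its own data.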
#4: Brings bias to life
A broken data setup can introduce bias in your AI applications.
Let’s take facial recognition for example.
With facial recognition, you can identify or verify the identity of an individual using their face. A report released by NIST revealed that top facial recognition algorithms suffer from bias along several lines including race, gender, and age. For example, some of the facial recognition systems misidentify Asian- and African-Americans far more often than Caucasians.
The usual cause of such bias is the underlying data! It most probably lacked representation.
An MIT study found that a popular dataset used to train facial recognition systems was estimated to be ~78 percent male and ~84 percent white, with very little representation of women and other races. This explains why many facial recognition systems have an ingrained bias.
When data scientists have access to limited or incomplete data, which is not a reflection of reality, it becomes difficult to ensure sufficient representation. This results in the data source itself becoming biased or skewed in a technical sense. And this effect is perpetuated through your machine learning models.
Facial recognition algorithms are no different. The algorithms “learn” to identify a face after being shown millions of pictures of human faces. However, if the faces used to train the algorithm are predominantly white men, the system will have a harder time recognizing anyone who doesn’t fit.
This is dangerous!
By ensuring that ALL your data is centralized and tightly integrated, you can make your data more complete and more representative of your customers, employees, products, and services. While this does not completely eliminate bias, it reduces the possibility of it happening.
Point of caution: If you happen to serve a niche audience intentionally or unintentionally, your data may be inherently skewed. Depending on the application, you may need additional strategies to eliminate potential bias.
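A basic representation check compares each group’s share of the training data with the share you would expect in the population you serve. The sketch below flags groups that fall short by more than a chosen tolerance. The group labels, the reference shares, the `tolerance` default, and the `underrepresented_groups` helper are all assumptions for illustration, and deciding on the right reference population is itself a judgment call.

```python
from collections import Counter


def underrepresented_groups(samples, reference_shares, tolerance=0.10):
    """Return groups whose share in `samples` falls short of the reference
    share by more than `tolerance` (absolute difference)."""
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, expected in reference_shares.items():
        actual = counts[group] / total
        if expected - actual > tolerance:
            gaps[group] = round(expected - actual, 2)
    return gaps


# Hypothetical group labels attached to 100 training examples.
training_groups = ["group_a"] * 80 + ["group_b"] * 15 + ["group_c"] * 5

# Assumed shares of each group in the population the system will serve.
reference = {"group_a": 0.5, "group_b": 0.3, "group_c": 0.2}

gaps = underrepresented_groups(training_groups, reference)
print(gaps)
```

A check like this does not remove bias by itself. It only makes the skew visible early enough to collect more data or rebalance before a model is trained.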
#5: Causes significant delays in AI initiatives
Finally, not having data to work with, or not having access to it, can be a permanent setback for companies looking to adopt AI.
Every project you start may require jumping through hoops to get data just to assess the feasibility of the project. As you saw in the case of IBM, projects were canceled or stalled partly due to the lack of data. The problem gets worse when you’ve already hired talented data scientists, only to realize that they’re unable to start projects or drive planned projects forward because of data issues.
If you’re looking to become more efficient and competitive in your industry, AI adoption is important. But a data strategy is even more critical: it’s not just the foundation for AI, it’s also the foundation for all analytics and reporting capabilities in your organization.
Where do you think your data strategy is headed?
Not having a big data strategy can become costly in the long run. If you’re not treating your data as a business asset, you’re missing out on the opportunity to make good data-driven decisions and introduce automation with AI.
Projects may be indefinitely delayed, you may be making lousy, out-of-touch predictions or you may be inadvertently introducing bias in your algorithms. All of this can have a negative impact on your customers and your business at large.
If you don’t have a good data infrastructure in your company, the best place to start is to determine the gaps. This needs to be done in collaboration with a data warehousing or a data engineering team.
Some starting questions to answer as you’re making plans to improve your data collection and management capabilities may include:
- What types of data are we currently collecting?
- Is the granularity of data collection sufficient?
- Is the data scattered across various locations or somehow centralized?
- Are these known data sources accessible across the company? If not, why?
- What’s the most cost-effective way to make what we already have more centralized?
- Are we just storing the raw data or are we making it more usable?
A point worth making is that you can still embark on AI initiatives without a data strategy. However, they’ll be one-off projects, and you may end up with some of the problems outlined above. You can always start AI initiatives while also investing in your data strategy.