By Senthil Kumar
AI models can benefit as much from soft data, such as personal anecdotes, as they do from hard data.
It’s well known in data science circles that the more diverse your set of training data, the more accurate your model will be. This includes structured, unstructured, and semistructured data. However, not all data is treated equally, especially unstructured data. Soft data such as collective memory and personal anecdotes can be challenging to access, but they can help build better decision-making systems.
A well-designed decision-making system may leverage multiple models, also known as a model ensemble, to optimize different efficiencies (at Slate, for example, we optimize construction schedules, cost, and resource utilization, among others). Instead of a single AI model that analyzes various features from the data sets and formulates recommendations, we can set forth multiple smart learning agents, each of which looks at particular attributes and patterns in the data to come up with a model score or an outcome. One model may be incentivized to optimize the schedule while factoring in supply chain constraints (and formulate recommendations), while another may be incentivized to evaluate contractor performance.
We can think of each AI model (or smart agent) as being an “expert” in its chosen domain. A supervisory algorithm (a supervisory model) then takes the best outcomes from each of these experts and formulates a recommendation based on the human intent or desired objective.
As the supervisory model evaluates the efficiency outcomes, it also feeds back to the appropriate models what the gating parameters ought to be and what calibrations should be made to feature weights and relevance. This feedback is based on a combinatorial analysis of outputs from this and other models, and on the collective intent. It helps each model’s learning process draw on its own experience as well as the experience of the others in the model collection.
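To make the idea concrete, here is a minimal sketch of an expert ensemble with a supervisory model, written in Python. It is illustrative only and is not Slate’s implementation; the experts, candidate plans, and weight-calibration rule are invented assumptions.

```python
# A minimal sketch (not Slate's actual system) of an "expert" ensemble with a
# supervisory model. Each expert scores a candidate plan against its own
# objective; the supervisor blends the scores according to the stated intent
# and nudges the blend weights based on observed outcomes.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Expert:
    name: str
    score: Callable[[dict], float]  # higher is better for this expert's objective


class Supervisor:
    def __init__(self, experts: list[Expert], intent_weights: dict[str, float]):
        self.experts = experts
        self.weights = dict(intent_weights)  # gating parameters per expert

    def recommend(self, candidate_plans: list[dict]) -> dict:
        # Pick the plan with the best intent-weighted combination of expert scores.
        def combined(plan: dict) -> float:
            return sum(self.weights[e.name] * e.score(plan) for e in self.experts)
        return max(candidate_plans, key=combined)

    def calibrate(self, expert_name: str, observed_error: float, lr: float = 0.1) -> None:
        # Feedback loop: shrink the weight of an expert whose advice correlated
        # with poor outcomes, then renormalize the gating parameters.
        self.weights[expert_name] *= (1.0 - lr * observed_error)
        total = sum(self.weights.values())
        self.weights = {k: v / total for k, v in self.weights.items()}


# Hypothetical experts and plans, for illustration only.
experts = [
    Expert("schedule", lambda p: -p["days_late"]),
    Expert("cost", lambda p: -p["cost_overrun"]),
]
supervisor = Supervisor(experts, {"schedule": 0.6, "cost": 0.4})
plans = [{"days_late": 3, "cost_overrun": 0.05}, {"days_late": 7, "cost_overrun": 0.01}]
best = supervisor.recommend(plans)
```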
This mode of model training can be termed “collective co-evolution.” There is also a notion of “competitive co-evolution,” with each AI smart agent competing with the others to meet the same objective. Together, I term these “collective memory” and “collective model intelligence” because the models share their experiences with other models and together evolve and formulate appropriate decision recommendations. This collective model intellect helps improve decision-making.
Anecdotal evidence plays just as strong a role, especially in cases where the source data involves human judgement. Take this example from the construction industry. The industry is poised for a significant transformation thanks to various technologies, but there is still a lot of intellectual property that resides within human interactions, anecdotal notes in documents, and other artifacts about construction methodologies and processes.
One source of model training is the “Lessons Learned” notes typically logged after major construction projects are complete (or when they reach major milestones). This source records what worked well and what did not in a construction project. Much of this is qualitative data, which an efficient AI system can still learn from. Anecdotal evidence is also made part of the training data set because much of the human intellect in construction is considered unstructured. Human supervision of model outcomes during the training phase helps orient the models toward meaningful insights.
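As a hedged illustration of learning from such qualitative notes, the sketch below vectorizes free-text lessons-learned entries and trains a simple classifier that separates what worked from what did not. The notes and labels are invented, and the pipeline is a deliberately basic baseline rather than a production approach.

```python
# A minimal sketch of learning from qualitative "Lessons Learned" notes:
# vectorize the free text and train a simple classifier that flags whether a
# note describes something that worked well or a problem to avoid.
# The notes and labels below are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "Prefabricating the MEP racks off-site cut two weeks from the schedule.",
    "Late steel deliveries forced crews to resequence work and idle for days.",
    "Daily coordination huddles with the electrical sub avoided rework.",
    "Change orders were logged inconsistently, causing billing disputes.",
]
labels = ["worked", "problem", "worked", "problem"]

# TF-IDF features plus logistic regression: a deliberately simple baseline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, labels)

# Classify a new anecdotal note; a human reviewer validates the prediction
# before it is folded back into the training set.
print(model.predict(["Pour sequencing conflicts delayed the slab by a week."]))
```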
Decoding Dark Data
With the advent of new technologies and sophisticated machinery, it is often said that we have produced more data in the last two years than in all of prior human history. Organizations are good at generating and collecting data in varied forms, but the data may be unstructured, remain inaccessible or forgotten, or go unused as a source of relevant insights. Thus, darkness envelops this data: a potential gold mine of insights remains dark because the data is forgotten or unused.
Smart AI model training will harness and leverage data from a multitude of sources, both structured and unstructured, and tap into this resident dark data. Some of the technologies and techniques used for insight mining and model training with dark data include:
- NoSQL databases for processing large volumes of unstructured data
- Data mesh to unify, normalize, and federate data envelopes
- Graph databases to identify relationships among data (see the sketch after this list)
- Techniques such as natural language processing for classification and correlation of data and for organizing it by intent, and cognitive computing for deciphering intent
- GPUs (graphics processing units) and TPUs (tensor processing units) to perform advanced, parallel processing
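Following up on the graph-database item above, here is a minimal sketch that uses the open source networkx library (standing in for a full graph database) to expose relationships buried in project documents. The entities and edges are invented for illustration.

```python
# A minimal sketch of surfacing relationships hidden in dark data. networkx
# stands in for a graph database here; the nodes and edges are invented.

import networkx as nx

G = nx.Graph()

# Entities extracted (e.g., via NLP) from project documents and logs.
G.add_edge("Project Alpha", "Contractor Jones", relation="hired")
G.add_edge("Project Alpha", "Lessons Learned 2021-07", relation="documented_in")
G.add_edge("Contractor Jones", "Project Beta", relation="hired")
G.add_edge("Project Beta", "Schedule Overrun Q3", relation="experienced")

# How is a past overrun connected to the document nobody reads?
path = nx.shortest_path(G, "Lessons Learned 2021-07", "Schedule Overrun Q3")
print(" -> ".join(path))
```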
We must be cognizant of the fact that models with self-learning capabilities will learn from both good and bad data. If there is no governing process to validate what the models learn, this can introduce unintentional biases and negative connotations. Simulations with multiple scenario analyses and human-in-the-loop validation during the training process will help introduce discipline into how the machines learn from and interpret this dark data.
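One way to picture that governing process is as a gate between what a model proposes to learn and what actually enters its training set. The sketch below is purely illustrative; run_scenarios and request_human_review are hypothetical placeholders, not functions from any particular library.

```python
# A minimal sketch of a human-in-the-loop gate on self-learning. Candidate
# training examples mined from dark data are only accepted after scenario
# checks pass or a reviewer signs off. The helper functions are hypothetical.

def run_scenarios(example: dict) -> list[str]:
    """Replay the example through predefined what-if checks and return the
    names of any checks it fails (an empty list means it passed)."""
    failures = []
    if example.get("source_quality", 1.0) < 0.5:
        failures.append("low_confidence_source")
    return failures


def request_human_review(example: dict, failures: list[str]) -> bool:
    """Placeholder for a review queue where a person approves or rejects."""
    print(f"Needs review: {example['id']} failed {failures}")
    return False  # conservative default: do not learn from it


def governed_update(candidates: list[dict], training_set: list[dict]) -> None:
    for example in candidates:
        failures = run_scenarios(example)
        if not failures or request_human_review(example, failures):
            training_set.append(example)  # only validated examples are learned
```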
Overcoming Model Bias
There can be biases in machine learning, some of them intentional biases introduced by human interactors and some the effect of incorrect training and validation of the model. Several issues that lead to bias may also lie in the data set fed to the machine learning models. When we lean toward approaches such as unsupervised learning and pass in a corpus of data that may be intentionally or unintentionally biased, the learning system may carry those biases forward.
Biases can be addressed by a combination of algorithmic choices, supervision, ensuring fairness in the training data set, diversification of training sets, diversification of labeling (where human efforts at labeling are needed), conscious injection of explainability (building the model so its output includes reasons why certain decisions were made by the model), and using techniques such as counterfactual fairness, among others.
If the model’s learning and execution design accounts for the explainability of AI decision-making, that explainability can be used to validate the model and look for biased learning. Injecting scenarios through the training data sets and observing the model outputs for bias is a beneficial technique as well.
Counterfactual reasoning and counterfactual fairness explore outcomes that did not actually occur but could have occurred under different conditions. This can be a very useful technique for spotting biased outputs during model training. It checks that a model’s outputs remain the same in a counterfactual universe in which bias-inducing attributes such as race or gender are changed.
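A rough way to operationalize that check is to flip a protected attribute on each record and measure how often the model’s prediction changes. The sketch below is an assumption-laden illustration: the DataFrame columns, the binary encoding of the attribute, and the fitted model are placeholders, and a full counterfactual-fairness method would also model how other features depend on the protected attribute.

```python
# A minimal counterfactual check: flip a protected attribute and measure how
# often the model's prediction changes. A nonzero rate flags possible bias.
# The DataFrame columns and the fitted `model` are assumed for illustration.

import pandas as pd


def counterfactual_flip_rate(model, X: pd.DataFrame, attr: str = "gender") -> float:
    X_cf = X.copy()
    # Swap a binary protected attribute (illustrative; real attributes are richer).
    X_cf[attr] = 1 - X_cf[attr]

    original = model.predict(X)
    counterfactual = model.predict(X_cf)
    return float((original != counterfactual).mean())


# Usage (assuming `model` is already trained on columns including "gender"):
# rate = counterfactual_flip_rate(model, X_validation)
# print(f"Predictions change for {rate:.1%} of records when gender is flipped.")
```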
About the Author
Senthil Kumar is the CTO and head of AI at Slate Technologies where he heads a global technology organization focused on delivering the most modern software and technology approaches to leverage data across building production. You can reach the author via email or LinkedIn.
Source: TDWI