So, I ended up at Web Summit last week, and manned the pAI-OS stand with Sam Johnston on the Wednesday. Lots of great discussions on the critical need for open source personal AI. We had a great stand right next to IBM, who kindly pointed to the future of AI being open…
However, my abiding memory from the event was this line from one of the keynote speakers: ‘of course, it’s all about the data’. The context for that was the vast range of ‘AI’-based propositions on show, all pointing to the wonderful things their app or service will do. And none of them talking about the underlying data, or the data on which the models are to be applied.
Many of these AIs were, and are, great ideas. But to be honest, my spidey-sense for data quality problems was in overdrive much of the week. Why, you might ask?
It’s quite simple really: I have spent much of the last 25 years down in the drains of large organisations’ CRM systems, looking to understand why the promise of the system is not being matched by the reality. I’ve often used a method first created by my old friend Don Murray; he and I have run it many times worldwide, across most sectors, in both B2C and B2B modes. It builds a visual data map with traffic-light colours to show areas of strength, mediocrity and weakness. Under the hood there are 10 components of data quality factored in, and both qualitative and quantitative analysis.
One of the outputs from the approach is shown below. Green is good, amber is ‘has issues’, and red is ‘does not exist, or is badly broken’.
The one I show below is very typical of a large organisation managing millions of individual records. In other words, it is very much the norm for organisations managing large databases of personal data to have very poor data quality across much of the data-set their customer management processes depend on. And this is not a function of not trying hard enough, or of not investing enough. Research suggests some $2.0bn per year is being spent on data quality tools, with a 15.5% growth rate forecast for the foreseeable future. I think that is a drop in the ocean, as in my experience most improvement activity is internal spend.
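To make the traffic-light mechanics concrete, here is a minimal sketch in Python. The component shown (completeness) and the score thresholds are illustrative assumptions of mine, not the actual 10 components or scoring rules of Don’s method:

```python
# Illustrative sketch of a traffic-light data quality check.
# 'completeness' is one commonly cited data quality dimension;
# it stands in for whichever of the 10 components is being scored.

RED, AMBER, GREEN = "red", "amber", "green"

def rag_status(score: float) -> str:
    """Map a 0-1 quality score to a traffic-light colour (thresholds assumed)."""
    if score < 0.4:
        return RED      # does not exist, or is badly broken
    if score < 0.75:
        return AMBER    # has issues
    return GREEN        # good

def completeness(values: list) -> float:
    """Fraction of records where the field is actually populated."""
    if not values:
        return 0.0
    return sum(1 for v in values if v not in (None, "")) / len(values)

# Toy example: score one field across a four-row customer table.
emails = ["a@example.com", "", None, "b@example.com"]
score = completeness(emails)
print(f"email completeness = {score:.2f} -> {rag_status(score)}")
# email completeness = 0.50 -> amber
```

Run the same kind of check per component, per field, and you get the coloured grid: a map of where the database is strong, mediocre or broken.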
The real issue is that where personal data is concerned, in both B2C and B2B contexts, there is a major architectural problem with the current approach. The problems are: 1) the current model manages personal data in silos, i.e. each organisation attempts to maintain its own record; 2) data protection compliance blockers exist that prevent, or at least restrict, the ability to fix that silo problem with intermediaries (e.g. big tech). Those data management experts amongst you will know that the only real solution to data quality issues is to use the ‘golden source’ (master data, the system of record) for each data type. And the problem this causes around personal data is that no one organisation can be the master data provider for an individual’s record. Some organisations can offer proxies for that, e.g. a bank or a government department; but none can hold the master record. Only the individual can be the technical and logical master of their data; and technical mastery only comes with the right tools being in place.
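A toy illustration of the silo problem, with every organisation, name and value invented: three organisations each hold their own copy of ‘the same’ person, the copies drift apart, and no copy can claim to be the master:

```python
# Sketch: why silos can't yield a master record. Each organisation
# holds its own copy of the same person; the copies disagree and
# none is authoritative. All data here is made up.

silos = {
    "retailer_crm":  {"name": "Jon Smith",  "email": "jon@old-isp.com",  "city": "Leeds"},
    "bank_kyc":      {"name": "John Smith", "email": "j.smith@mail.com", "city": "Leeds"},
    "telco_billing": {"name": "J. Smith",   "email": "jon@old-isp.com",  "city": "York"},
}

# For each field, count distinct values across silos. More than one
# distinct value means the silos disagree, and no silo can say which
# version is right: only the individual can.
fields = {f for record in silos.values() for f in record}
for field in sorted(fields):
    values = {record[field] for record in silos.values()}
    status = "consistent" if len(values) == 1 else f"CONFLICT ({len(values)} versions)"
    print(f"{field:6s}: {status}")
# city  : CONFLICT (2 versions)
# email : CONFLICT (2 versions)
# name  : CONFLICT (3 versions)
```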
So what does this mean for those many emerging AI propositions? And particularly for AI that requires personal data in model training (customers, citizens, users, patients, employees, etc.)?
Well, my first thought, given the number and variety of AI propositions out there claiming to run on personal data, is that those behind them must have found some magic data quality fairies to fix all those issues. If so, can you let me know where to find them? If, however, that is not the case, then those propositions that are not coming from the personal AI perspective are effectively going to be running on fuel with those significant inbuilt data quality issues. And if the input data is weak, then the models trained on it will be weak. There is no fix for that.
That’s what the speaker I mentioned at the start was getting at. No amount of fancy AI, ML or similar will compensate for poor data inputs. And unless you are running on master/reference data, you’d best assume much poorer data quality than you’d ideally have.
Conversely, genuinely Personal AI, running on human-centric data, has a bright future, because it does run on the golden source data.