Extraordinary intelligence, inexplicable mistakes, and the accuracy AI applications rarely disclose
For quite some time, I have been using different Artificial Intelligence systems and AI agents extensively in my daily professional work.
I am not referring to occasional questions submitted to a chatbot or to brief experiments designed to test what the technology can do. I mean real, systematic use across research, complex analysis, solution design, document processing, writing, software development, technical support, and the evaluation of business decisions.
Through this daily interaction, I have seen Artificial Intelligence perform tasks that, only a few years ago, would have seemed almost unimaginable.
I have seen an agent combine information from different sources, identify subtle contradictions, suggest meaningful improvements to a system architecture, and analyse a problem in a way that appears to demonstrate an exceptionally high level of reasoning.
Yet I have also seen the same agent, only a few minutes later, ignore a clear instruction, repeat a mistake that had just been corrected, or reach a confident conclusion that was not supported by the available evidence.
This contradiction may be one of the defining characteristics of the current era of AI agents.
One side resembles Dr Jekyll: capable, creative, analytical, and sometimes genuinely impressive.
The other resembles Mr Hyde: unpredictable, inconsistent, and occasionally so irrational that it is difficult to believe the behaviour came from the same system.
These are not two different technologies.
They are two faces of the same technology.
The more I use them, the more impressed I become — and the more careful I am
Extensive use of AI agents has led me neither to unquestioning optimism nor to complete rejection.
It has led me to a more complicated conclusion.
The more I use them, the more impressed I become by what they can do. At the same time, the less willing I am to assume that they will do it correctly, consistently, and without human supervision.
This is difficult to understand through a short demonstration.
In a controlled demo, the system is usually given a well-structured prompt, carefully selected data, and a task designed to produce an impressive result. The audience sees the product at its best.
Daily work is different.
Real data is often incomplete. Documents come in different formats. Requirements change. Instructions can be complex. Exceptions, contradictions, and missing information are common. An agent must preserve context, identify what matters, use its tools correctly, and recognise when it does not know something.
That is when its other side begins to appear.
It may analyse an extremely complex technical problem correctly and then miss an obvious detail. It may follow ten instructions and violate the eleventh, even though that one was the most important. It may identify a difficult logical flaw while failing to notice that it is using the wrong date, the wrong file, or data from a different case.
It may even recognise that it lacks sufficient information and, instead of stopping, fill the gaps with something that merely sounds plausible.
The result often does not appear careless or uncertain. On the contrary, it may be polished, complete, and highly convincing.
That is precisely what makes it dangerous.
The problem is not that agents are unintelligent
The easy criticism would be to say that Artificial Intelligence is not genuinely intelligent.
I do not believe that statement accurately describes what we are seeing.
Modern AI agents possess real and meaningful capabilities. They can accelerate processes, support complex analysis, propose alternative solutions, and act as highly productive collaborators.
The fundamental problem is not the absence of capability.
It is the absence of consistency.
Agents are not simply more or less intelligent. The quality of their behaviour can vary dramatically. Their performance depends on context, task formulation, data quality, memory management, available tools, and the way the entire process around them has been designed.
With humans, we generally expect knowledge and experience to have some continuity.
An experienced engineer may make a mistake. We do not normally expect that engineer to suddenly lose their understanding of a basic professional principle and then return, moments later, to an exceptionally high level of performance.
With AI agents, this discontinuity is common.
Exceptional performance on one task offers no guarantee of competence on the next.
This creates a strange paradox: their impressive capabilities encourage us to trust them, while their inconsistency requires exactly the opposite — continuous verification.
From answers to actions
The issue becomes much more serious when we move from conversational systems to actual agents.
A chatbot answers a question. It may provide incorrect information, and the user can reject or correct it.
An agent, however, is not necessarily limited to producing text.
It may search for information, read files, call APIs, modify code, update data, send messages, or execute an entire sequence of actions.
In other words, it can turn a mistaken judgment into a real action.
Mr Hyde is no longer limited to producing a poor answer. He may gain access to our processes, our data, and our systems.
This is why the evaluation of an agent cannot be reduced to the question:
“How intelligent is it?”
The more important question is:
“How reliably does it behave when it encounters something it does not understand, when the data is ambiguous, or when its initial assumption is wrong?”
The business value of a system is not determined only by the best answer it can produce. It is also determined by the worst behaviour it can display without being detected in time.
A system that produces impressive results in most cases may be extremely useful as an assistant. That does not automatically make it safe as an autonomous operator, especially when a small failure rate can lead to financial damage, data loss, or incorrect decisions.
The promise made by the market
At the same time, more and more companies are incorporating AI into their applications and services.
Artificial Intelligence has become a central marketing proposition.
Applications “understand” our documents, “analyse” our data, “automate” our work, “respond” to our customers, and “make decisions” on our behalf.
The problem is not necessarily that these promises are false.
Many of these systems can indeed perform the tasks being advertised.
The critical question is how often they perform them correctly.
That is where customer communication usually becomes much less precise.
Companies describe the capabilities of their products in detail, yet rarely explain with the same clarity how often the system fails, under what conditions it has been evaluated, what types of errors it makes, and how serious the consequences of those errors may be.
They rarely explain when human review is mandatory, how low-confidence results should be treated, or who is responsible when the application produces a wrong decision or performs an incorrect action.
These are not minor technical details.
They are part of the product’s actual operational value.
What does “95% accuracy” really mean?
Even when a company publishes an accuracy rate, the number alone is not enough.
“95% accuracy” sounds impressive. But what does it actually mean?
It may mean that the system correctly identified 95% of specific fields in a controlled document dataset.
It does not necessarily mean that it completed 95% of real-world tasks correctly.
A document may contain one hundred fields. The system may extract ninety-five of them correctly and fail on five. But if one of those five is the amount, the date, the bank account number, or a critical contractual term, the overall result may be useless or even dangerous.
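To make the arithmetic concrete, here is a minimal sketch in Python. It rests on a simplifying assumption that is mine, not a measured property of any product: every field is extracted correctly with the same probability, independently of the others. The function name and the choice of four critical fields are purely illustrative.

```python
def prob_all_correct(field_accuracy: float, n_fields: int) -> float:
    """Probability that all n_fields are right, ASSUMING independent, equal error rates."""
    return field_accuracy ** n_fields


# Hypothetical document: 100 fields, 4 of them critical
# (amount, date, bank account number, key contractual term).
whole_document = prob_all_correct(0.95, 100)
critical_only = prob_all_correct(0.95, 4)

print(f"Whole document error-free:        {whole_document:.2%}")
print(f"All four critical fields correct: {critical_only:.2%}")
```

Under these assumptions, fewer than one document in a hundred comes out entirely error-free, and roughly one in five contains an error in at least one critical field. Real errors are rarely independent, so the exact figures will differ, but the gap between field-level and task-level accuracy remains.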
What matters, therefore, is not only how many mistakes a system makes.
It also matters which mistakes it makes.
Not all errors carry the same weight.
A mistake in a creative suggestion may be corrected easily. A mistake in a financial transaction, a legal document, medical information, or a deletion command may cause serious harm.
Accuracy is not a single, universal number. It depends on the task, the data, the context, and the cost of failure.
That is why a general percentage, presented without explanation, may create more confusion than clarity.
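One way to see why a single percentage hides so much is to weight each error by what it costs. The sketch below compares two hypothetical systems that would both be advertised as "95% accurate". The field names, error shares, and cost figures are invented for illustration and do not come from any real evaluation.

```python
# Hypothetical cost of a single error, by field type (arbitrary units).
COST_PER_ERROR = {"free_text_note": 1.0, "payment_amount": 500.0}


def expected_cost_per_field(error_share: dict) -> float:
    """Sum of (share of fields wrong x cost of that error) over field types."""
    return sum(share * COST_PER_ERROR[name] for name, share in error_share.items())


# Both systems get 5% of all extracted fields wrong; only the mix of errors differs.
system_a = {"free_text_note": 0.05, "payment_amount": 0.00}
system_b = {"free_text_note": 0.04, "payment_amount": 0.01}

print(expected_cost_per_field(system_a))
print(expected_cost_per_field(system_b))
```

With these made-up numbers, the second system is roughly a hundred times more expensive to operate, although its headline accuracy is identical.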
From impressive demos to measurable reliability
The AI market is still in a period in which an impressive demonstration often carries more weight than a systematic evaluation.
A successful demo may go viral. A detailed failure report is unlikely to attract the same attention.
However, as agents gain greater autonomy and become integrated into critical processes, this cannot continue.
Companies should not communicate only what a system can do. They should also communicate its limitations.
A serious AI application should be accompanied by a clear reliability statement. Not merely a vague disclaimer that “AI can make mistakes,” but specific information.
In which cases has the system been tested?
What types of data were used?
How was success measured?
Which error categories have been identified?
When does the system stop and request human intervention?
Can its output be verified?
Are its actions logged?
Can an incorrect action be reversed?
These are the real questions of the AI agent era.
We do not need only better models. We need better systems around those models: controls, restrictions, confirmation steps, rollback mechanisms, decision logs, and clearly defined limits of autonomy.
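As a rough illustration of what such a system around the model might look like, here is a minimal sketch in Python. It is not a reference to any existing framework: the `Action` and `GuardedExecutor` names, the allow-list, and the `confirm` callback are all hypothetical, chosen only to show limits of autonomy, a confirmation step, a decision log, and rollback working together.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Action:
    name: str
    execute: Callable[[], object]
    undo: Optional[Callable[[], object]] = None  # how to reverse it, if that is possible
    high_risk: bool = False                      # needs confirmation before running


@dataclass
class GuardedExecutor:
    allowed: Set[str]                  # limit of autonomy: actions the agent may take
    confirm: Callable[[Action], bool]  # confirmation step (a human reviewer, in practice)
    log: List[str] = field(default_factory=list)         # decision log
    history: List[Action] = field(default_factory=list)  # executed actions, for rollback

    def run(self, action: Action) -> bool:
        if action.name not in self.allowed:
            self.log.append(f"BLOCKED {action.name}: outside autonomy limits")
            return False
        if action.high_risk and not self.confirm(action):
            self.log.append(f"DECLINED {action.name}: not confirmed")
            return False
        action.execute()
        self.history.append(action)
        self.log.append(f"EXECUTED {action.name}")
        return True

    def rollback(self) -> None:
        """Undo executed actions in reverse order, where an undo exists."""
        while self.history:
            action = self.history.pop()
            if action.undo is not None:
                action.undo()
                self.log.append(f"REVERTED {action.name}")
            else:
                self.log.append(f"CANNOT REVERT {action.name}")


# Usage: the agent may append records, but nothing else.
records = ["a", "b"]
executor = GuardedExecutor(allowed={"append_record"}, confirm=lambda action: False)
executor.run(Action("append_record", lambda: records.append("c"), undo=lambda: records.pop()))
executor.run(Action("delete_all", records.clear, high_risk=True))  # blocked, never runs
executor.rollback()
print(executor.log)
```

The point of the design is that none of these safeguards depends on the model behaving well. The allow-list, the confirmation step, and the log sit outside the agent, so they still hold on the occasions when its judgment does not.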
They do not need to be perfect. We need to be honest
This criticism is not an argument against Artificial Intelligence.
Quite the opposite.
Precisely because I have used AI agents extensively and have seen their genuine value, I believe we must also treat their weaknesses seriously.
Agents can meaningfully change the way we work. They can increase productivity, free up time, and allow smaller teams to manage far more complex tasks.
They do not need to be infallible to be useful.
But we do need to know when they are reliable, when they require review, and when they should not be allowed to act on their own.
Today, many businesses purchase AI applications largely on the basis of capability promises.
In the future, they will need to evaluate them on the basis of measurable reliability.
Not only according to what they achieve in the best-case scenario, but also according to how often they fail, how they fail, and what mechanisms they include to limit the consequences of failure.
The two faces will continue to coexist
The Dr Jekyll and Mr Hyde of AI agents will not disappear any time soon.
We will continue to see systems that produce exceptional results one moment and make seemingly inexplicable mistakes the next.
The goal is not to deny either side.
Nor is it to choose between excitement and fear.
The goal is to design the way we use them on the understanding that both sides exist.
We should take advantage of their remarkable capabilities without turning capability into blind trust.
We should give them tools, but not unlimited authority.
We should measure their performance under real-world conditions, not only through carefully prepared demonstrations.
Above all, we should demand that companies show the same transparency about failures that they show when presenting successes.
This is something I also try to apply myself, both personally and through the companies in which I am involved.
It is not always easy.
But I believe this honesty is essential. I am not asking companies to stop talking about what Artificial Intelligence can do. I am asking them — and I am trying to follow the same principle in my own professional practice — to speak just as clearly about how often, under what conditions, and with what consequences it may fail to do it.
Because intelligence is impressive.
Consistency, however, is what creates trust.


