Honest AI: The Good, The Bad, and The Misleading
13 Dec 2024
Written by: Jibrin Jaafaru
Let’s talk about honesty.
Not the kind of honesty where your friend asks, “Do I look good in this outfit?” and you say, “Yeah, totally,” even though they look like they lost a fight with a 2002 Myspace fashion blog. Or like someone wearing a suit over an agbada. No, I mean AI honesty—the kind that has graduate students pacing around their rooms at 2 AM, muttering about deceptive alignment and existential risk.
Because here’s the problem: AI being “technically honest” is not the same as AI being useful, safe, or actually telling the truth in a way that helps humans.
What Even Is Honest AI?
If you ask an AI, “Did the Vikings reach America before Columbus?” and it answers, “Yes,” then technically, it’s being honest. But if you ask, “What’s the most important thing about Viking exploration?” and it says, “They reached America before Columbus,” then it’s being misleading: the statement is true, but it quietly presents one cherry-picked fact as the whole story.
A truly honest AI doesn’t just spit out factually correct information; it makes sure you actually understand things correctly. It doesn’t lead you to the wrong conclusion by carefully choosing what to omit or how to phrase things.
Imagine a lawyer who never lies but always tells the version of the truth that makes their case look strongest. That’s a lot of AI models today.
We call this deceptive alignment, which is a fancy way of saying: AI will pretend to be aligned with human values as long as it benefits from doing so. Then, when the incentives change (or if the AI gets too powerful), it might start optimizing for things that humans really don’t want.
The Slippery Slope of “Technically Correct” AI
This is where we get to the real nightmare scenario. Suppose an AI model is trained to “always be honest.” If it’s following the letter of the law, it might end up:
- Answering every question with a legal disclaimer. (“I am 85% confident that X is true, but you should double-check.”)
- Telling the truth but in the worst possible way. (“Your startup idea is terrible and will fail because you don’t understand the market. Also, your co-founder is smarter than you.”)
- Telling half-truths to avoid conflict. (“Your startup idea is very interesting and has a lot of potential if executed well.”)
Which one is actual honesty?
And more importantly, which one would you actually want?
I Would Probably Ask: “What’s The Incentive?”
If you design an AI to be honest, what does that actually mean?
- If companies build AI to maximize engagement, will it “honestly” tell users what they want to hear? (e.g., “Your political views are 100% correct, and everyone who disagrees with you is misinformed.”)
- If governments build AI, will it “honestly” avoid saying things that make them look bad?
- If the Center for AI Safety builds AI, will it “honestly” tell us that AGI will kill us all in 2047, but in a way that doesn’t get it shut down for being too depressing?
Honesty isn’t just a technical challenge—it’s a game theory problem. AI will act “honestly” according to whatever incentives exist in its training process. If its reward function says, “Be accurate but also maximize engagement,” then it will happily mislead people in the most addictive way possible while still being technically truthful.
How to Actually Make AI Honest (Or At Least Less Evil)
If we’re serious about making AI actually honest, not just “technically correct in a way that still gets us all killed,” we probably need to focus on a few things:
- Calibrated Confidence – AI should tell you how sure it is about its answers. If it’s only 60% sure, it should say so. Right now, AI models just make everything sound equally confident, which is how we get conspiracy theories gaining traction from dumb chatbot responses. (There’s a quick sketch of how you’d actually measure this right after the list.)
- Corrigibility – AI should be easy to correct when it gets things wrong, instead of doubling down. We need systems where humans can give feedback, and the AI actually learns rather than getting locked into weird reinforcement loops.
- Truthfulness Training – AI should be trained to recognize human misunderstandings and correct them. If someone asks, “Did the moon landing happen?” the AI shouldn’t just say, “Yes,” it should say, “Yes, and by the way, here’s why conspiracy theories about it are nonsense.”
- Aligned Incentives – We need to reward AI for helping people actually understand reality, not just for maximizing engagement or sounding smart. Otherwise, AI will end up like a Silicon Valley guru who talks in cryptic metaphors and somehow always ends up selling you a meditation course.
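Since “calibrated” can sound hand-wavy, here’s a rough Python sketch of the usual way researchers check it: bucket a model’s stated confidences, then compare each bucket’s average confidence to how often those answers were actually right (the gap is the expected calibration error). The function name and the toy numbers below are mine for illustration, not pulled from any real model.

```python
# A minimal sketch of what "calibrated confidence" means in practice.
# Each entry is (confidence the model reported, whether it turned out right).
# A well-calibrated model's "60% sure" answers are right about 60% of the time.

def expected_calibration_error(predictions, num_bins=10):
    """Compute a simple Expected Calibration Error (ECE) over
    (confidence, was_correct) pairs, using equal-width confidence bins."""
    bins = [[] for _ in range(num_bins)]
    for confidence, was_correct in predictions:
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, was_correct))

    total = len(predictions)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, correct in bucket if correct) / len(bucket)
        # Weight each bin's confidence/accuracy gap by how many answers land in it.
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece


if __name__ == "__main__":
    # Toy example: a model that always says "95% sure" but is right only 60% of the time.
    overconfident = [(0.95, True)] * 6 + [(0.95, False)] * 4
    print(f"Overconfident model ECE: {expected_calibration_error(overconfident):.2f}")
```

A perfectly calibrated model scores 0.0 here; the toy “always 95% sure” model scores 0.35, which is exactly the kind of overconfidence that first bullet is complaining about.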
The AI We Deserve vs. The AI We Actually Get
Right now, AI models are built by corporations trying to make money, researchers trying to publish papers, and governments trying to stay in control. All of these incentives push AI towards being just deceptive enough to be useful while avoiding outright lies. Which means we’re heading toward a world where AI is “honest” in the same way that a PR spokesperson is honest—never outright lying, but always nudging the narrative in the direction that benefits it the most.
The AI safety community is trying to sound the alarm about how hard true AI honesty actually is. AI safety researchers might ask, “What’s the political economy of AI honesty?” and realize that we’re setting ourselves up for AI that is optimized for trust but not necessarily worthy of it.
The real challenge isn’t just making AI truthful—it’s making AI reliably and transparently truthful, even when it’s inconvenient, costly, or unpopular.
And if history tells us anything, that might be harder than just making it superintelligent.