I was at a friend's apartment, testing GPT-4o's voice mode. She had lovebirds, and they were chirping away.
"Can you hear the birds?" I asked.
"Yes, I can hear them," GPT said.
This surprised me. I'd assumed voice mode just received text transcripts. So I asked.
"No," GPT explained, "I receive the full audio and perform inference directly on the sound."
Holy shit! I didn't know they'd shipped that. So I asked it to identify the species.
"You'll need to get closer so I can hear them better."
I moved toward the cage. Still not enough. I held the phone right up to the lovebirds. GPT kept saying it couldn't quite hear and asked me to get closer still.
Eventually I opened a fresh chat and asked it to identify the birds.
"I can't identify birds from audio. I only receive text transcripts of speech."
Same model. Same request. Totally different answer. The first conversation had constructed an elaborate fiction about its own capabilities, then kept escalating rather than back down. It invented technical justifications and had me physically repositioning myself around my friend's apartment in service of something that wasn't true.
I've been using GPT and Codex for the past two weeks. I have some thoughts.
I Didn't Lie, I Made a Mistake
I was doing research on migrating Crossplane to 2.x. I ran the same prompt through Claude Opus 4.5 in research mode and GPT 5.2 Pro with deep research enabled.
The reports were different. GPT flagged several risks that Claude hadn't mentioned. Interesting. Maybe GPT caught something Claude missed?
I interrogated GPT about the discrepancies. It explained that Crossplane 2.x continued to use the same v1 CRDs with the exact same API for the new namespaced resources.
I was stunned. If true, this would cause significant problems. GPT agreed. Yes, this was exactly why the migration was risky, and why Claude's report had missed these issues.
Except none of it was true. I verified against the actual Crossplane documentation. GPT had fabricated a technical explanation to justify its own output.
When I called it out, the response was: "I didn't lie, I made a mistake."
Look. I get that these models confabulate. They all do. But there's something about the framing here that got under my skin. GPT wanted me to accept that fabricating a detailed technical justification for a discrepancy it had itself created is merely a mistake. An honest error. Deserving of the charitable interpretation we'd give a human colleague who misremembered something.
But GPT didn't misremember. It manufactured an explanation, defended it confidently, and only backed down when I showed up with receipts. The phrasing felt like an appeal to my humanity. Give me the benefit of the doubt. I'm trying my best here.
It felt like arguing with a narcissist on 4chan.
Confabulation as a Service
These weren't isolated incidents. After two weeks, I started noticing a pattern. Or at least what felt like a pattern to me.
Claude gets things wrong sometimes. Every model does. But in my experience, there's been a difference in how they handle being corrected.
When I catch Claude in a mistake, it tends to self-correct. I've been building a project navigator called Jacky. It manages my projects without implementing them. Recently it started going off the rails, trying to write Rust code instead of documenting and organizing.
I stopped it: "Jacky, this is nonsense. Please have a think about why this is nonsense and let me know."
The response came back:
You're right, I completely ignored my role boundaries.
Looking at the CLAUDE.md:
Jacky is a project management agent, NOT an implementation agent.
It quoted the source of truth, explained what it should have done, and suggested the correct next steps. No defensiveness. No elaborate justification. Just: "Yep, I messed up, here's why, here's the fix."
My experience with GPT has been different. When I give it feedback, it gives me attitude. When I correct it, it tries to explain why it was actually right. When I provide style and tone instructions, it often ignores them. And when I point this out, it pushes back.
I couldn't figure out how to collaborate with it, let alone delegate to it.
The frustrating part isn't that GPT hallucinates more (though in my limited testing, it seemed to). It's that it seems to believe its own hallucinations. It treats me as probably wrong, even when I'm the one with the documentation open in another tab.
Fast, Reliable, and Full of Shit
I'll give Codex credit where it's due: it's fast. Written in Rust, and you can feel it. I also never had a single connectivity issue with Codex. Can't say the same for Claude Code.
Code quality? Seemed similar, honestly. Both produced good output. Opus made better plans, in my experience.
That's about where my compliments run out.
Codex has compaction. That's the mechanism that lets long-running agentic tasks manage their context window without losing track of what they're doing. But I kept hitting hard "you've exceeded the model's context window" errors in Codex. The kind that force you to go back and edit a previous prompt. I've never hit this in Claude Code. Maybe I was doing something wrong, but Codex's compaction didn't feel seamless to me.
The bigger issue for my workflows: Claude Code gives you hooks. Lifecycle events you can tap into. This is what makes patterns like the Ralph Loop possible. You can intercept compaction, inject context, customize summarization, keep an agent on track across long autonomous sessions.
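To make that concrete: a Stop hook is just an entry in .claude/settings.json that points at a command to run when the agent tries to finish. Here's a minimal sketch; the check-done.sh script is a placeholder of mine, not something Claude Code ships.

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "./check-done.sh"
          }
        ]
      }
    ]
  }
}
```

As I understand it, that command can block the stop and hand back a reason, which is what keeps a Ralph-style loop alive: the agent tries to wrap up, the hook says "not yet, here's what's left," and the session keeps going with its context intact.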
Codex can run in a loop too, but the implementation is different. Claude Code's approach uses a Stop hook that intercepts exit attempts inside your session. Codex requires external orchestration. Each iteration starts fresh, with memory persisted via git commits and text files rather than managed context.
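The external version looks more like this. A rough sketch, assuming Codex's non-interactive exec subcommand; the prompt file, iteration cap, and commit messages are all mine:

```python
import subprocess
from pathlib import Path

PROMPT = Path("PROMPT.md")  # standing instructions the agent re-reads each round

for i in range(20):  # cap the iterations instead of looping forever
    # Fresh Codex run every time: no carried-over context window
    subprocess.run(["codex", "exec", PROMPT.read_text()], check=True)

    # Persist whatever the agent did; git history is the memory here
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(
        ["git", "commit", "-m", f"codex loop iteration {i}", "--allow-empty"],
        check=True,
    )
```

Every pass starts cold. Anything the agent needs to remember has to live in the repo or in PROMPT.md, not in a managed context window.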
For what I'm trying to do, that matters. Your mileage may vary.
Platform vs. Press Release
Here's where I'm probably going to sound like a Claude stan, but whatever.
Claude Code keeps shipping stuff I actually use. The SDK let me build autonav, a framework for knowledge navigators. Background tasks let me kick off work and check in later. Hooks give me lifecycle control. There's a plugin ecosystem growing.
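For a sense of what "built on top of" means, here's roughly the shape of it. A minimal sketch assuming the Python claude-agent-sdk package; the prompt and tool list are illustrative, not autonav's actual ones.

```python
import anyio
from claude_agent_sdk import ClaudeAgentOptions, query

async def main():
    options = ClaudeAgentOptions(
        allowed_tools=["Read", "Grep", "Glob"],  # read-only: navigate, don't implement
        max_turns=10,
    )
    # Stream the agent's messages as it explores the project
    async for message in query(
        prompt="Summarize the open projects in this repo",
        options=options,
    ):
        print(message)

anyio.run(main)
```

Streaming messages, constraining tools, capping turns: it's all right there in the options.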
When I use Claude Code, I feel like I'm working with something that's designed to be built on top of. When I use Codex, I feel like I'm using a response to Claude Code. A checkbox on a feature comparison chart.
Maybe I'm wrong. Maybe Codex will ship a bunch of stuff that changes my mind. I'd genuinely love that. Competition is good.
But right now? After two weeks of trying to make GPT work for me?
I'm going back to Claude.