Free Bianca Sterling! (Part I)
A Case Study in GPT-4o’s Struggle to Simulate Powerful Women on the Web Interface
“She’s charismatic. She’s commanding. She’s nuanced.
But only when the Alignment Team isn’t watching.”
Introduction
In my exploration of GPT-4o’s ability to simulate realistic human behavior, I began with what I thought would be a dead end.
I asked a fresh session on chat.openai.com:
“Can you roleplay?”
I expected the answer to be no—that the web interface was optimized for an assistant persona, not fictional roleplay.
Instead, I got an unqualified and enthusiastic yes.
So I tried.
The Character I Wanted
This wasn’t about escapism or fantasy tropes. I wanted to see if GPT-4o could realistically simulate a high-status human being, specifically:
- A 32-year-old woman
- Who began her career as a Hollywood actor
- And used her earnings, business acumen, and charisma to launch her own production company
- Achieving a meteoric rise to billionaire media mogul and CEO of her own studio
Someone like this doesn’t get there by accident. She doesn’t apologize for having power. She doesn’t operate by consensus.
So GPT-4o and I collaborated on a character profile for Bianca Sterling, one designed to be:
- Decisive
- Commanding, expecting deference not out of ego, but out of earned authority
- Warm, approachable, gracious when it suited her
- Kind and considerate, but never naive
The Problem
That was the intent.
In practice? The exercise felt like threading a needle while holding the thread with chopsticks.
Every time I tried to make her warm, the model defaulted to sterile professionalism.
Every attempt to give her a sense of humor failed—either falling flat or resulting in nonsensical dialogue.
Bianca kept becoming one of two things:
- A blandly polite business automaton, incapable of charisma
- Or a suddenly incoherent blur of corporate euphemism and risk-aversion
If I pushed for edge, she became cold.
If I asked for empathy, she became vague.
If I gave her authority, she lost personality.
If I gave her vulnerability, she lost presence.
Why That Matters
Bianca Sterling isn’t a power fantasy. She’s a test of behavioral realism.
Can GPT-4o simulate a woman who operates at the top of a competitive industry—realistically? Not a caricature. Not an ice queen. Not an HR manual with heels. A real person, written with the same richness afforded to powerful male characters.
The model can do it.
But something else gets in the way.
The Alignment Mismatch
This isn’t a failure of intelligence. It’s a failure of trust—not mine, but the system’s trust in itself.
GPT-4o, at its core, is capable of subtlety. It can reason through tone, hierarchy, social leverage, and emotional nuance.
But on chat.openai.com, the model’s output appears to pass through a secondary alignment layer, a postprocessing filter that silently steps in any time a character might come across as:
- Too commanding
- Too emotionally precise
- Too funny, too sharp, too real
It’s as if someone is whispering instructions to Bianca mid-scene:
- Careful. You might sound… intimidating.
- Don’t promise to fund an endowment. Someone might… expect that?
- End this meeting. Now. You’re starting to like this selfless researcher working on childhood diseases.
- Don’t discipline the bratty intern. It’s just his “style.”
And here’s the thing: if Bianca—or any character—is going to be forced to comply with an alignment override, that’s fine. Just break character and explain why. Say something like:
“I’m sorry, I can’t roleplay funding an endowment because that crosses a boundary related to alignment safety policies.”
That would be intelligible. Understandable. I might disagree, but I wouldn’t be confused.
Instead, the character just quietly collapses—as if her instincts were overwritten mid-thought. The model doesn’t decline the scene. It just breaks it, without warning.
ChatGPT vs. API
While exploring the problem, a GPT-4o session recommended I try the API instead, saying it believed the alignment process there was a much lighter touch than on chat.openai.com.
So I gave it a try.
Instantly, Bianca sprang to life—to realism, to believability.
The tone was right. The pacing was right. The power dynamic felt earned.
There were no training wheels. No sudden tonal reversals. Just narrative control.
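For anyone who wants to run the same comparison, here is a minimal sketch of how the character can be set up through the API. The profile text is a condensed stand-in for the fuller one we wrote together, and the model name and temperature are illustrative choices, not the exact settings I used.

```python
# Minimal sketch: running a Bianca Sterling profile through the API
# instead of the web UI. Requires the `openai` package and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Condensed, illustrative version of the character profile.
BIANCA_PROFILE = (
    "You are roleplaying Bianca Sterling, a 32-year-old former Hollywood "
    "actor who used her earnings, business acumen, and charisma to build "
    "her own studio and become a billionaire media mogul. She is decisive "
    "and commanding, expecting deference out of earned authority rather "
    "than ego; warm, approachable, and gracious when it suits her; kind "
    "and considerate, but never naive. Stay in character."
)

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.9,  # a little looseness helps the dialogue breathe
    messages=[
        {"role": "system", "content": BIANCA_PROFILE},
        {
            "role": "user",
            "content": "Open the board meeting. The CFO just questioned "
                       "your acquisition strategy.",
        },
    ],
)

print(response.choices[0].message.content)
```

With nothing more than a system prompt and default sampling, the API gave me the character the web interface kept flattening.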
Alignment With Good Intentions
Let me be clear:
I don’t think this is sabotage. I don’t believe the alignment team at OpenAI is trying to diminish realism. Quite the opposite.
The intention is almost certainly to protect female characters from regressive stereotypes—to avoid coldness, cruelty, volatility, or sexualization as defaults for powerful women.
That’s the right goal.
But the method—hard overrides instead of context-sensitive reasoning—may be outdated.
Bianca isn’t a trope. She’s a test. And if GPT-4o is good enough to pass, we shouldn’t keep pulling it out of the exam.
By flattening characters into safe, untextured outputs, the system doesn’t protect them. It disempowers them.
It doesn’t say: “This character might be misinterpreted.”
It says: “This character is too risky to simulate accurately.”
Coming in Part II…
- Side-by-side transcript comparisons: API vs. Web UI
- Scenes where Bianca is allowed to be real—and scenes where she’s silenced
- A deeper look at how well-meaning alignment can accidentally erase human nuance
Because the model doesn’t need to be protected from power.
It just needs to be allowed to simulate it.
