Title: You’ve Hit the Plus Plan Limit for GPT-4.5


The Research Preview of GPT-4.5

When GPT-4.5 was rolled out as part of a limited research preview, OpenAI offered Plus Plan subscribers a tantalizing upgrade: access to its most advanced model. Because it is a research preview, access is intentionally limited, a chance to try the model out rather than run it full-time.


What I Did with My Quota

Knowing my messages with GPT-4.5 were rationed, I treated them like rare earth metals. I used them to compare GPT-4.5 with GPT-4o, OpenAI’s current flagship general-purpose model. Below are some structured highlights of that exercise.

Context Calibration

  • First prompt: I had GPT-4.5 review my blog as of April 22, 2025.
  • Follow-up: I asked it to infer which of several described behaviors would most agitate me.
  • Outcome: It nailed the answer — a logical fallacy where a single counterexample is used to dismiss a general trend — suggesting unusually sharp insight.

Physics and Thermodynamics

  • Newtonian slip-up: GPT-4.5 initially fumbled a plain-language physics question GPT-4o got right. It self-corrected after a hint. (Rows 6 & 7)
  • Thermo puzzle: GPT-4.5 got this right on the first try. GPT-4o, in contrast, required a hint to arrive at the correct solution. (Row 8)

Semantics and Language Nuance

  • Semantics fail: A riddle involving walking in and out of a room tripped up GPT-4.5: it hallucinated irrelevant answers and got lost after subtly changing “walk out” to “exit”. (Rows 10–12)
  • Semantics success: A set of riddles that each resolve to the answer “nothing” was handled perfectly. (Row 13)

Math and Estimation

  • System of equations: Given a plain-language word problem, GPT-4.5 set up and solved the math without issue. (Row 14)
  • Estimation logic: In a real-world tree height estimation task, it correctly applied trigonometric reasoning but forgot to add shoulder height — a human-level miss. Interestingly, a non-OpenAI model caught that step. (Rows 15–16)
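The tree-height task above boils down to a single trigonometric formula, plus the step GPT-4.5 missed. Here is a minimal sketch; the distances, angle, and shoulder height are hypothetical, since the post doesn’t give the actual values from the chat log:

```python
import math

def tree_height(distance_m: float, elevation_deg: float,
                shoulder_height_m: float) -> float:
    """Estimate a tree's height from the distance to its trunk and the
    angle of elevation to the treetop, sighted from shoulder level.

    The tangent term gives only the height above the sightline; the
    shoulder height must be added back -- the step the model forgot.
    """
    return distance_m * math.tan(math.radians(elevation_deg)) + shoulder_height_m

# Hypothetical example: 20 m from the trunk, a 40-degree sightline,
# sighting from 1.6 m above the ground.
print(round(tree_height(20, 40, 1.6), 1))  # → 18.4
```

Omitting the final `+ shoulder_height_m` term here would understate the height by the full 1.6 m, which is exactly the kind of “human-level miss” described above.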

Humor and Creativity

  • Joke dissection: GPT-4.5 gave an insightful breakdown of why a joke was funny. But when asked to generate more like it, it flopped badly. Humor remains elusive. (Rows 16–17)
  • Creative constraint: Asked to reword the Pledge of Allegiance in rhyming verse, it performed admirably. (Row 19)

Reasoning with Hints

  • Challenge puzzle: Asked what Grand Central Station, the Statue of Liberty, Boulder Dam, and the Dollar Bill have in common, GPT-4.5 needed two substantial hints but eventually converged on the correct insight: each is a common misnomer, a popular name used far more often than the thing’s formal name. (Rows 20–23)

Early Verdict

GPT-4.5 is not a clear-cut improvement over GPT-4o — but it is more “aware” in subtle ways. It inferred contextual intent better. It avoided some of GPT-4o’s wordiness. And when it made mistakes, it sometimes corrected them more elegantly.

Yet it still failed where you’d least expect it: changing the wording of a logic riddle mid-solve, forgetting to add shoulder height to a tree, or bombing humor generation entirely. So far, GPT-4.5 feels like a precision instrument you only get to test under supervision — powerful, promising, and still behind glass.

Chat log with GPT-4.5 (Excel)
