The Research Preview of GPT-4.5
When GPT-4.5 (internally codenamed "Orion") was rolled out as part of a limited research preview, OpenAI offered a tantalizing upgrade to the ChatGPT Plus plan: access to its most advanced model. Being a research preview, access was intentionally limited: a chance to try the model out, not run it full-time.
What I Did with My Quota
Knowing my messages with GPT-4.5 were rationed, I treated them like rare earth metals. I used them to compare GPT-4.5 with GPT-4o, OpenAI's current flagship general-purpose model. Below are some structured highlights of that exercise.
Context Calibration
- First prompt: I had GPT-4.5 review my blog as of April 22, 2025.
- Follow-up: I asked it to infer which of several described behaviors would most agitate me.
- Outcome: It nailed the answer — a logical fallacy where a single counterexample is used to dismiss a general trend — suggesting unusually sharp insight.
Physics and Thermodynamics
- Newtonian slip-up: GPT-4.5 initially fumbled a plain-language physics question that GPT-4o got right. It self-corrected after a hint. (Rows 6–7)
- Thermo puzzle: GPT-4.5 got this right on the first try. GPT-4o, in contrast, required a hint to arrive at the correct solution. (Row 8)
Semantics and Language Nuance
- Semantics fail: A riddle involving walking into and out of a room tripped up GPT-4.5. It hallucinated irrelevant answers and lost the thread after it subtly rewrote the riddle mid-solve, changing "walk out" to "exit". (Rows 10–12)
- Semantics success: A set of riddles that each resolve to the answer “nothing” was handled perfectly. (Row 13)
Math and Estimation
- System of equations: Given a plain-language word problem, GPT-4.5 set up and solved the underlying equations without issue (the first sketch after this list shows what that setup looks like). (Row 14)
- Estimation logic: In a real-world tree-height estimation task, it correctly applied trigonometric reasoning but forgot to add the observer's shoulder height, a very human miss (see the second sketch below). Interestingly, a non-OpenAI model caught that step. (Rows 15–16)
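To make the equations bullet concrete, here is a minimal sketch of the kind of translation GPT-4.5 performed. The ticket problem below is invented for illustration; my actual prompt was different.

```python
# Hypothetical stand-in for my word problem (the real one differed):
# adult tickets cost $8, child tickets $5, and 100 tickets brought in $680.
# As equations:  a + c = 100  and  8a + 5c = 680.
import numpy as np

coefficients = np.array([[1.0, 1.0],
                         [8.0, 5.0]])
totals = np.array([100.0, 680.0])
adults, children = np.linalg.solve(coefficients, totals)
print(adults, children)  # 60.0 40.0
```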
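And here is the shape of the tree-height calculation, again with invented numbers. The final addition of eye (shoulder) height is exactly the term GPT-4.5 dropped.

```python
import math

def estimate_tree_height(distance_m, elevation_deg, eye_height_m):
    """Tree height = rise above eye level plus the observer's eye height."""
    # Height of the treetop above the observer's eye level.
    rise = distance_m * math.tan(math.radians(elevation_deg))
    # This final addition is the step that's easy to forget.
    return rise + eye_height_m

# Invented numbers for illustration only.
print(round(estimate_tree_height(20.0, 35.0, 1.5), 1))  # 15.5
```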
Humor and Creativity
- Joke dissection: GPT-4.5 gave an insightful breakdown of why a joke was funny. But when asked to generate more like it, it flopped badly. Humor remains elusive. (Rows 16–17)
- Creative constraint: Asked to reword the Pledge of Allegiance in rhyming verse, it performed admirably. (Row 19)
Reasoning with Hints
- Challenge puzzle: Given the task of identifying what Grand Central Station, the Statue of Liberty, Boulder Dam, and the Dollar Bill have in common, it needed two substantial hints before converging on the correct insight: each is a popular misnomer, the everyday name for something whose official name is rarely used. (Rows 20–23)
Early Verdict
GPT-4.5 is not a clear-cut improvement over GPT-4o — but it is more “aware” in subtle ways. It inferred contextual intent better. It avoided some of GPT-4o’s wordiness. And when it made mistakes, it sometimes corrected them more elegantly.
Yet it still failed where you'd least expect it: rewording a logic riddle mid-solve, dropping the shoulder-height term from a tree estimate, or bombing humor generation entirely. So far, GPT-4.5 feels like a precision instrument you only get to test under supervision: powerful, promising, and still behind glass.