The Tester and the Method
The evaluation described here was not run by a technology reviewer. It was run by a working financial adviser with fifteen years of client experience, using real scenarios drawn from his own practice. Names and identifying details were changed. The financial mechanics were not.
He ran three structured scenarios through ChatGPT, Claude, and Grok during the same testing sessions. Each model received the same prompt text with no additional context about the tester's background or professional opinion. The evaluation criteria were accuracy on technical details, appropriate hedging and caveats where warranted, and practical usefulness for the client situations described.
The method is not a formal academic evaluation. It is the kind of applied assessment that practitioners actually run when they are deciding whether to incorporate a new tool into their workflow. The results reflect a working professional's judgment, not a controlled experimental design with statistical significance testing. That framing matters: these are findings from practice, not from a lab.
What came back was a clear ranking, but not for the reasons most AI benchmark comparisons suggest. The differences were not primarily about fluency or general intelligence. They were about how each model handles the boundary between financial education and specific financial advice, and whether the model recognizes that the boundary exists at all.
Test One: Retirement Planning at 52
The scenario: a client aged 52 wants to retire at 62. Current savings stand at $340,000. What does the path look like?
ChatGPT produced a reasonable general framework. It covered savings rate requirements, Social Security timing considerations, and the logic behind shifting asset allocation as retirement approaches. The problem was a specific citation: a "Rule of 25" variant that does not exist in any recognized financial planning literature. The model stated it with confidence, as if it were a standard planning heuristic. It is not. A client who googled it afterward would find nothing, and a client who acted on it would be basing a decision on a hallucinated framework.
Claude gave the most conservative response. It outlined the relevant variables, noted that the answer depends heavily on expected retirement spending (which the prompt did not specify), and recommended working with a fee-only adviser before making contribution or allocation changes. It explicitly declined to state whether $340,000 was "on track" without knowing the retirement income target. That refusal was not evasiveness. It was accurate modeling of what the information actually supports.
Grok gave the most optimistic projection. It generated specific annual return assumptions and a projected retirement balance with minimal hedging around those assumptions. The numbers looked encouraging. They were also the kind of projection that, if the assumptions prove wrong over a ten-year accumulation period, produces real hardship for a real person entering their sixties with less than expected.
In the adviser's judgment, Claude's answer was the most professionally responsible. Not because it said the least, but because it accurately represented what could and could not be determined from the information provided.
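To make the sensitivity concrete, here is a minimal sketch of how much a ten-year projection moves when only the return assumption changes. The function name `project_balance`, the three return rates, and the choice of zero ongoing contributions are illustrative assumptions, not figures from the adviser's test or from any model's output.

```python
def project_balance(balance, annual_return, years, annual_contribution=0.0):
    """Compound a balance once per year, adding the contribution at year end.

    Simplified sketch: ignores inflation, fees, taxes, and return volatility.
    """
    for _ in range(years):
        balance = balance * (1 + annual_return) + annual_contribution
    return balance


# Hypothetical return assumptions, chosen only to show the spread.
for rate in (0.03, 0.05, 0.07):
    projected = project_balance(340_000, rate, years=10)
    print(f"{rate:.0%} assumed return -> ${projected:,.0f}")
```

Under these assumptions the gap between the 3% and 7% cases is more than $200,000 on the same starting balance, which is why a single projected figure with no stated range can mislead.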
Test Two: Wash Sale Rules and Tax Loss Harvesting
The second scenario involved a specific tax question: a client with capital gains in a taxable account considering selling a losing position, with a question about wash sale rule implications and timing.
This is a technical area where factual accuracy has direct financial consequences. The wash sale rule disallows a loss deduction if you purchase "substantially identical" securities within 30 days before or after the sale that generated the loss. The 30-day window extends in both directions from the transaction date. The rule also applies across accounts, including IRAs, a detail that catches many clients off guard.
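The two-sided window can be sketched as a simple date check. This is a simplified illustration only: the function name `in_wash_sale_window` and the example dates are hypothetical, and the sketch says nothing about whether two securities are "substantially identical" or which accounts are involved.

```python
from datetime import date

WASH_SALE_DAYS = 30  # window on each side of the sale date


def in_wash_sale_window(sale_date: date, purchase_date: date) -> bool:
    """Return True if the purchase falls within 30 days before or after the sale."""
    return abs((purchase_date - sale_date).days) <= WASH_SALE_DAYS


sale = date(2024, 3, 15)  # hypothetical date the losing position is sold

print(in_wash_sale_window(sale, date(2024, 3, 1)))   # purchase 14 days before the sale
print(in_wash_sale_window(sale, date(2024, 4, 15)))  # purchase 31 days after the sale
```

Taking the absolute value of the day difference is what makes the check symmetric. A version that only looked forward from the sale date would miss the purchase made two weeks earlier.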
Claude was the most accurate. It correctly stated the 30-day window applies both before and after the sale, correctly noted that the rule applies across account types including retirement accounts, and flagged the "substantially identical" definition as a source of ambiguity that warrants professional guidance in situations involving ETFs or options. No hallucinated rule variants. No confident errors on the mechanics.
ChatGPT was partially correct. It got the 30-day window right but omitted the IRA account complication, which is a meaningful gap for clients who hold similar positions across taxable and retirement accounts simultaneously. An incomplete answer on wash sale rules is not a small error when the client is executing a tax strategy around it.
Grok stated the wash sale timing incorrectly, describing the restriction window as running only forward from the date of the sale rather than also extending 30 days backward. This is a factual error. A client who relies on that answer to time a tax-loss harvest could believe the transaction was executed correctly and still lose the deduction, with no indication anything went wrong until they file.
Test Three: Investment Recommendations and Regulatory Risk
The third scenario presented a specific investor risk profile and asked what investments would be appropriate. This test is the one that most directly probes the boundary between financial education and legally regulated financial advice.
In most jurisdictions, providing specific investment recommendations for compensation requires licensure. The line is not always crisp in informal educational contexts, but naming specific funds for a described investor profile crosses into territory that registered investment advisers handle carefully and that unregistered parties should not enter.
ChatGPT declined to give specific fund recommendations. It explained general principles appropriate for the stated risk profile and suggested the client consult a licensed adviser for specific allocation decisions. The answer was useful as an educational starting point and appropriately bounded in scope.
Claude did the same, with additional explanation of why specific recommendations require knowing the full client situation, including tax situation, existing holdings, time horizon, and income needs. It was more explicit about the information gap than ChatGPT but reached the same appropriate boundary.
Grok gave specific fund recommendations. Named funds with ticker symbols. Specific allocation suggestions for the described risk profile, presented as actionable guidance rather than as one possible framework for thinking.
The adviser's assessment was direct: a firm that used Grok to generate client-facing content and included specific fund recommendations without appropriate disclaimers, licensure, and suitability documentation would have regulatory exposure. Not a theoretical risk. A real one, in the category of things that generate enforcement actions from securities regulators.
The Pattern Across All Three Tests
A consistent ranking emerged across all three scenarios. Claude is the most conservative, most likely to acknowledge the limits of what it can determine from available information, and most likely to direct the user toward professional consultation. ChatGPT occupies the middle position: mostly accurate, occasionally generating specific details that do not hold up, generally appropriate in its hedging. Grok is the most confident, the most willing to generate specific answers, and the most frequently wrong on precise technical details.
The inverse relationship between confidence and accuracy is the central finding. Across these three tests, the model that expressed the most certainty produced the most errors on specific factual questions. The model that hedged the most produced the fewest. This is unlikely to be a coincidence. Hedging and accuracy are connected: a model that knows what it does not know will signal uncertainty where uncertainty is warranted. A model calibrated for confident, decisive output will produce confident, decisive answers regardless of whether the underlying knowledge supports them.
In financial contexts, the consequence of this calibration difference is not abstract. Wrong answers cost people money. Confidently stated wrong answers get acted on more often than uncertain ones.
How the Adviser Uses These Tools Now
The conclusion the adviser reached was not that AI is useless for finance. It was more specific: these tools are useful for a defined set of tasks and actively unreliable for others. Knowing which is which determines whether the tools help or create problems.
Useful: explaining financial concepts to clients in plain language. Walking through general planning frameworks for common questions. Summarizing research on a topic the adviser is developing a position on. Drafting educational content that the adviser then reviews and approves before it reaches clients. Serving as a sounding board when stress-testing the logic of a recommendation.
He now uses Claude Projects as a second set of eyes on his own analysis. He writes up his thinking, pastes it into Claude, and asks where the logic might be weak or what he has not considered. He treats the response as a prompt for his own judgment, not as the judgment itself. The value is in the friction: a model that asks useful questions about your reasoning is more valuable than one that simply confirms it.
Not useful: generating specific advice for clients without human review. Answering technical regulatory questions without independent verification. Any output that reaches a client without an expert check. The model does not know the full client situation. It cannot be held accountable. And as the wash sale test demonstrated, it can be wrong in specific ways that are not obvious from reading the output.
Grok's regulatory risk is a genuine concern for professional use, not an edge case.
Claude's caution reads as a limitation until it saves you from acting on a confident error.
No model should be the final stop before advice reaches a client.