BC
September 26, 2025
This analysis matches my experience testing different models in my local setup. The paper’s point about “confident guessing” is especially clear when comparing model outputs: I’ve seen that smaller models (7B-13B) on my RTX 3060 often make up detailed technical answers instead of admitting uncertainty, while larger models, like the 70B variants I can run on my RTX 3090, tend to be more honest about knowledge gaps.
The “singleton rate” concept is intriguing from a local testing standpoint: a fact that appears only once in the training data gives the model little basis for reliable recall, so the singleton rate effectively sets a floor on hallucination. When I benchmark models on technical documents or niche topics, those trained on broader but shallower datasets tend to hallucinate more than those with deeper domain coverage.
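To make that concrete, here is a minimal sketch of how one might estimate a singleton rate over a set of extracted facts; the fact representation, the extraction step, and the example data are my own assumptions, not something from the paper.

```python
from collections import Counter

def singleton_rate(facts):
    """Fraction of distinct facts that appear exactly once in the sample.

    Per the paper's argument, this fraction is a rough lower bound on how
    often a model trained on that corpus will end up guessing when asked
    about those facts.
    """
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts) if counts else 0.0

# Hypothetical facts pulled from a small technical corpus (illustration only).
facts = [
    "gpu:rtx3060:vram=12GB",
    "gpu:rtx3090:vram=24GB",
    "gpu:rtx3090:vram=24GB",   # repeated fact
    "driver:535:cuda=12.2",    # appears once -> singleton
]
print(f"singleton rate: {singleton_rate(facts):.2f}")  # 0.67
```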
What stands out most is the evaluation problem: current benchmarking often rewards models that never say “I don’t know,” because a confident guess can score points while an honest abstention scores nothing. In my tests across different hardware, I’ve started including “uncertainty calibration” as a metric alongside standard performance measures. Models that properly express uncertainty often perform better in real-world use despite scoring lower on traditional benchmarks.
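As a rough illustration of what I mean by an uncertainty-calibration score, here is a minimal sketch; the abstention string, the penalty weight, and the example answers are my own assumptions rather than any standard benchmark.

```python
def calibrated_score(answers, wrong_penalty=1.0):
    """Score a list of (prediction, is_correct) pairs.

    Unlike plain accuracy, a confident wrong answer costs points, while an
    explicit "I don't know" scores zero instead of negative. The penalty
    weight is an arbitrary choice for this sketch.
    """
    score = 0.0
    for prediction, is_correct in answers:
        if prediction == "I don't know":
            continue              # abstention: neither rewarded nor punished
        score += 1.0 if is_correct else -wrong_penalty
    return score / len(answers) if answers else 0.0

# Hypothetical run: two correct answers, one abstention, one confident miss.
answers = [
    ("CUDA 12.2", True),
    ("24 GB", True),
    ("I don't know", False),
    ("PCIe 5.0", False),   # confident but wrong
]
print(f"calibrated score: {calibrated_score(answers):.2f}")  # 0.25
```

Scored this way, a model that abstains on the question it would have missed outranks one that guesses confidently and gets it wrong, which matches what I see in practice.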
The paper highlights why testing different models is important: no single evaluation fully captures a model’s reliability.