BC
October 3, 2025
The likelihood ratio approach is sound, but choosing the smoothing constant (α=0.5) is arbitrary and significantly affects results. Different smoothing values would yield different rankings, especially for rare patterns. The article neither justifies why 0.5 was selected nor shows sensitivity analysis across various α values, which is important when claiming these are “the most distinctive” patterns.
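To make the sensitivity concern concrete, here is a minimal sketch of what such a check could look like. It assumes plain additive smoothing of the form (count + α) / (total + α·V), which may not be the exact formula the article uses, and the pattern names, counts, and vocabulary size are all made-up toy values rather than figures from the article.

```python
import math

def smoothed_log_lr(count_in, total_in, count_out, total_out, alpha, vocab_size):
    """Log likelihood ratio under additive smoothing (assumed form, not the article's)."""
    p_in = (count_in + alpha) / (total_in + alpha * vocab_size)
    p_out = (count_out + alpha) / (total_out + alpha * vocab_size)
    return math.log(p_in / p_out)

# Hypothetical toy counts: (occurrences in target language, occurrences elsewhere).
patterns = {
    "rare_unique": (3, 0),        # rare, but never seen in other languages
    "common_shared": (400, 100),  # frequent, and 4x more common in the target
}
TOTAL_IN, TOTAL_OUT, VOCAB = 10_000, 10_000, 1_000  # hypothetical totals

def rank(alpha):
    """Return pattern names ordered from most to least distinctive for this alpha."""
    scores = {
        name: smoothed_log_lr(c_in, TOTAL_IN, c_out, TOTAL_OUT, alpha, VOCAB)
        for name, (c_in, c_out) in patterns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

for alpha in (0.01, 0.5, 2.0):
    print(alpha, rank(alpha))
```

With these toy numbers the rare pattern ranks first at α=0.01 and α=0.5 but drops below the common one at α=2.0, which is the kind of rank flip a sensitivity table would expose.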
The dataset limitation is more substantial than acknowledged. Using only the top 5,000 words per language heavily biases results toward common grammatical morphemes and omits technical vocabulary where languages might exhibit different patterns. When testing similar statistical methods on programming language detection, I found that word frequency lists miss domain-specific character patterns that appear in specialized texts.
The infinite likelihood ratio problem for language-specific characters exposes a fundamental issue: statistical uniqueness does not equal perceptual distinctiveness. A character appearing once in 5,000 words isn’t a useful “fingerprint” for human recognition; however, the math treats it as infinitely distinctive before smoothing. The article conflates statistical measures with the visual recognition patterns humans actually use.
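A small sketch of the divergence described above, using an unsmoothed ratio of relative frequencies. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the article.

```python
import math

def unsmoothed_lr(count_in, total_in, count_out, total_out):
    """Raw ratio of relative frequencies; diverges when count_out is 0."""
    p_in = count_in / total_in
    p_out = count_out / total_out
    if p_out == 0:
        # Nothing to compare against: infinite if the pattern occurs at all.
        return math.inf if p_in > 0 else math.nan
    return p_in / p_out

# Hypothetical counts: a character seen once in the target list and never elsewhere,
# versus a pattern seen 400 times in the target list and 100 times elsewhere.
single_occurrence = unsmoothed_lr(1, 5_000, 0, 5_000)
frequent_pattern = unsmoothed_lr(400, 5_000, 100, 5_000)
print(single_occurrence, frequent_pattern)
```

The single occurrence scores as infinite while the pattern a reader would actually encounter hundreds of times scores around 4, so the raw statistic says nothing about how often a reader would see the cue.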
The low distinctiveness of English (max log LR of 2.79) is intriguing, but the explanation about loanwords is speculative. It could also relate to orthographic borrowing patterns, shared Germanic/Romance roots with other European languages, or simply that English character patterns are statistically less concentrated. Without testing this hypothesis, it remains a guess about causation.