Comment (1)

  1. BC
    October 3, 2025

    The likelihood ratio approach is sound, but the choice of smoothing constant (α = 0.5) is arbitrary and significantly affects the results. Different smoothing values would yield different rankings, especially for rare patterns. The article neither justifies why 0.5 was selected nor shows a sensitivity analysis across a range of α values, which matters when claiming these are “the most distinctive” patterns.

    The dataset limitation is more substantial than acknowledged. Using only the top 5,000 words per language heavily biases results toward common grammatical morphemes and omits technical vocabulary where languages might exhibit different patterns. When testing similar statistical methods on programming language detection, I found that word frequency lists miss domain-specific character patterns that appear in specialized texts.

    The infinite likelihood ratio problem for language-specific characters exposes a fundamental issue: statistical uniqueness does not equal perceptual distinctiveness. A character appearing once in 5,000 words isn’t a useful “fingerprint” for human recognition, yet if it appears in no other language, the math treats it as infinitely distinctive before smoothing. The article conflates statistical measures with the visual recognition patterns humans actually use.

    The low distinctiveness of English (max log LR of 2.79) is intriguing, but the explanation about loanwords is speculative. It could also relate to orthographic borrowing patterns, shared Germanic/Romance roots with other European languages, or simply that English character patterns are statistically less concentrated. Without testing this hypothesis, it remains a guess about causation.
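
The sensitivity concern raised in the comment can be illustrated with a minimal sketch. It assumes additive (Laplace-style) smoothing of the form (count + α) / (total + α·V), which may not match the article's exact formula. The pattern names, counts, totals, and vocabulary size are hypothetical values chosen for illustration, not figures from the article.

```python
import math

def log_lr(count_in, total_in, count_out, total_out, alpha, vocab_size):
    """Additively smoothed log likelihood ratio: target language vs. all others.

    Assumed form: p = (count + alpha) / (total + alpha * vocab_size).
    """
    p_in = (count_in + alpha) / (total_in + alpha * vocab_size)
    p_out = (count_out + alpha) / (total_out + alpha * vocab_size)
    if p_out == 0.0:
        # Unsmoothed case: a pattern never seen elsewhere is "infinitely" distinctive
        # (assumes the pattern occurs at least once in the target language).
        return math.inf
    return math.log(p_in / p_out)

# Hypothetical counts: (occurrences in target language, occurrences in other languages)
patterns = {
    "rare_unique": (1, 0),        # seen once, and only in the target language
    "common_shared": (400, 100),  # frequent in the target, present elsewhere
}
TOTAL_IN, TOTAL_OUT, VOCAB = 20_000, 80_000, 1_000  # hypothetical corpus sizes

def rank(alpha):
    """Return pattern names ordered from most to least distinctive at this alpha."""
    scores = {
        name: log_lr(c_in, TOTAL_IN, c_out, TOTAL_OUT, alpha, VOCAB)
        for name, (c_in, c_out) in patterns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# A small sensitivity sweep over the smoothing constant.
for alpha in (0.01, 0.1, 0.5, 1.0):
    print(alpha, rank(alpha))
```

With these made-up counts, the single-occurrence pattern ranks first for small α and drops below the common pattern at α = 0.5 and above, which is the kind of ranking instability a sensitivity analysis would surface. With α = 0, the same pattern receives an infinite score.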
