[cs/0412098] Automatic Meaning Discovery Using Google
Authors: Rudi Cilibrasi (CWI), Paul M. B. Vitanyi (CWI, University of Amsterdam, National ICT of Australia)
Comments: 31 pages, 10 figures; eliminated some typos. On pages 1-3, corrected Eq. (1) and the handcrafted horse-rider and by-with examples, now using 10-decimal precision
Subj-class: Computation and Language; Artificial Intelligence; Databases; Information Retrieval; Learning
ACM-class: I.2.4; I.2.7
We have found a method to automatically extract the meaning of words and phrases from the World Wide Web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The World Wide Web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We demonstrate positive correlations, evidencing an underlying semantic structure, in both numerical symbol notations and number-name words in a variety of natural languages and contexts. Next, we demonstrate the ability to distinguish between colors and numbers, and to distinguish between 17th-century Dutch painters; the ability to understand electrical terms, religious terms, and emergency incidents; we conduct a massive experiment in understanding WordNet categories; and finally we demonstrate the ability to do a simple automatic English-Spanish translation.
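
As a rough illustration of how page counts can be turned into a measure of semantic relatedness, the sketch below computes a distance of the form usually associated with this work, the normalized Google distance. The abstract itself does not state the formula, so the exact form used here is an assumption, as are the function name `ngd` and its parameters. The page counts in the usage example are illustrative placeholders for a "horse"/"rider" pair, not live query results.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Sketch of a normalized page-count distance (assumed form).

    fx, fy : number of pages containing term x, term y
    fxy    : number of pages containing both terms
    n      : total number of indexed pages (normalizing constant)
    """
    # If either term, or the pair, never occurs, treat the terms as unrelated.
    if min(fx, fy, fxy) <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    # Smaller values mean the terms co-occur more often than chance suggests.
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Illustrative placeholder counts (assumptions, not retrieved from any search engine).
horse, rider, both, total = 46_700_000, 12_200_000, 2_630_000, 8_058_044_651
print(f"distance(horse, rider) = {ngd(horse, rider, both, total):.3f}")
```

A value near 0 indicates terms that almost always appear together, while larger values indicate weaker association. Working in logarithms keeps the score largely insensitive to the overall size of the index.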