Orientamenti attuali della statistica linguistica [Current trends in statistical linguistics]


  • Alfredo Rizzi, Università di Roma “La Sapienza”




Quantitative methods in linguistics were, for the most part, developed during the twentieth century. Today there are three disciplines: mathematical linguistics, which grew out of Chomsky's research; computational linguistics, which developed alongside computer hardware, software, and artificial intelligence; and statistical linguistics, the subject of this paper, which has old traditions but profits from modern data analysis. The classical distinction between population and sample admits two different interpretations in statistical linguistics: that a part of a text is a sample of the whole text, and that every text is a sample of the language the researcher is studying. We can observe only parts of a language; every statistical analysis is carried out on data from texts that are samples of the language. In our research we investigated the entropy of the Italian language: we considered a sample of 200,000 letters from an Italian newspaper and found that the text can be treated as a Markov process of the eighth order, meaning that knowledge of the eight preceding characters gives as much information for predicting the next one as knowledge of more than eight characters. The author believes that data analysis will make it possible to characterize the essential elements of a language, the style of an author, a particular work, or the language of a particular group, and so to compare different languages. However, statisticians remain concerned that complex statistical methodologies might be used uncritically.
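
The notion of Markov order used in the abstract can be illustrated by estimating the conditional entropy of the next character given the k preceding ones, and observing where the estimate stops decreasing as k grows. The following is a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `conditional_entropy` is hypothetical, and the estimate is a simple plug-in estimate from character n-gram counts.

```python
from collections import Counter
from math import log2


def conditional_entropy(text: str, k: int) -> float:
    """Plug-in estimate of H(next char | k preceding chars), in bits per character.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if len(text) <= k:
        raise ValueError("text must be longer than the context length k")

    # Count every window of k context characters followed by one next character.
    joint = Counter(text[i:i + k + 1] for i in range(len(text) - k))

    # Derive context counts from the same windows, so that the conditional
    # frequencies for each context sum to one.
    context = Counter()
    for gram, n in joint.items():
        context[gram[:k]] += n

    total = sum(joint.values())
    entropy = 0.0
    for gram, n in joint.items():
        p_joint = n / total               # relative frequency of context + next char
        p_cond = n / context[gram[:k]]    # relative frequency of next char given context
        entropy -= p_joint * log2(p_cond)
    return entropy


if __name__ == "__main__":
    sample = "abababab"  # toy input; a real study would use a large corpus
    for order in range(3):
        print(order, conditional_entropy(sample, order))
```

With a sample of limited size, plug-in estimates of this kind tend to understate the entropy at high orders, because most long contexts are observed only a few times; an apparent plateau should therefore be read with caution.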

How to Cite

Rizzi, A. (1992). Orientamenti attuali della statistica linguistica. Statistica, 52(4), 487–505. https://doi.org/10.6092/issn.1973-2201/917