Tuesday, December 21, 2010

New Google Database Puts Centuries of Cultural Trends in Reach of Linguists

This pursuit of universal knowledge through Google-style search tools is interesting.
Never before in the history of the world (pardon the paraphrase) has it been possible to reach so easily the full body of concepts most used by civilizations and societies, thanks to the new "encyclopedias of universal knowledge" of the Google kind.

Word-Wide Web Launches
New Google Database Puts Centuries of Cultural Trends in Reach of Linguists
By ROBERT LEE HOTZ
The Wall Street Journal, December 17, 2010

(see the images at this link)

Language analysts, sifting through two centuries of words in the millions of books in Google Inc.'s growing digital library, found a new way to track the arc of fame, the effect of censorship, the spread of inventions and the explosive growth of new terms in the English-speaking world.

In research reported Thursday in the journal Science, the scientists at Harvard University, Massachusetts Institute of Technology, Google and the Encyclopedia Britannica unveiled a database of two billion words and phrases drawn from 5.2 million books in Google's digital library published during the past 200 years. With this tool, researchers can measure trends through the language authors used and the names of people they mentioned.

It's the first time scholars have used Google's controversial trove of digital books for academic research, and the result was opened to the public online Thursday.

Analyzing the computerized text, the researchers reported that they could measure the hardening rhetoric of nations facing off for war, by tracking increasing use of the word "enemy." They also could track changing tastes in food, noting the waning appetite for sausage, which peaks in the 1940s, and the advent of sushi, the mentions of which start to soar in the 1980s. They documented the decline of the word "God" in the modern era, which falls sharply from its peak in the 1840s.

"We can see patterns in space, time and cultural context, on a scale a million times greater than in the past," said Mark Liberman, a computational linguist at the University of Pennsylvania, who wasn't involved in the project. "Everywhere you focus these new instruments, you see interesting patterns."

The digital text also captured the evolving structure of a living language, and almost a half-million English words that have appeared since 1950, partly reflecting the growing number of technical terms, such as buckyball, netiquette and phytonutrient.

"It is just stunning," said noted cultural historian Robert Darnton, director of the Harvard University Library, who wasn't involved in the project and who has been critical of Google's effort to digitize the world's books. "They've come up with something that is going to make an enormous difference in our understanding of history and literature."

All told, about 129 million books have been published since the invention of the printing press. In 2004, Google software engineers began making electronic copies of them, and have about 15 million so far, comprising more than two trillion words in 400 languages.

"We realized we were sitting on this huge trove of data," said Google Books engineering manager Jon Orwant. "We want to let researchers slice and dice the data in ways that allow them to ask questions they could not ask before."

The online library project has been hobbled by lawsuits, copyright disputes and fears over the potential for the company to have an information monopoly. "There have been computational hurdles, scientific hurdles, organizational and legal hurdles," said mathematician Erez Lieberman Aiden at the Harvard Society of Fellows, who helped create the database.

To avoid copyright violations, the scientists are making available the vast catalog of frequency patterns of words and phrases, not the raw text of books. Google Labs posted freely downloadable data sets and a special viewer at http://ngrams.googlelabs.com on Thursday. These data sets consist of short phrases, up to five words long, with counts of how often they occurred in each year.

They currently include Chinese, English, French, German, Russian and Spanish books dating back to the year 1500—about 4% of all books published. The database doesn't include periodicals, which might reflect popular culture from a different vantage.
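
To make the idea of "short phrases with counts per year" concrete, here is a minimal sketch in Python. It is not the researchers' code and does not read the downloadable data sets; the function names, the whitespace tokenization, and the two-book toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, max_n=5):
    """Yield every run of 1..max_n consecutive tokens as a space-joined phrase."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_by_year(books, max_n=5):
    """Map year -> Counter of phrase occurrences.

    `books` is a hypothetical iterable of (year, text) pairs. Splitting on
    whitespace is a simplifying assumption, not the project's tokenizer.
    """
    counts = defaultdict(Counter)
    for year, text in books:
        counts[year].update(ngrams(text.split(), max_n))
    return counts


def relative_frequency(counts, phrase, year):
    """Occurrences of `phrase` in `year`, divided by the count of all
    phrases of the same length seen in that year."""
    n = len(phrase.split())
    total = sum(c for p, c in counts[year].items() if len(p.split()) == n)
    return counts[year][phrase] / total if total else 0.0


# Made-up two-book corpus, for illustration only.
toy_books = [
    (1950, "economic development and economic growth"),
    (1951, "the development of trade"),
]
toy_counts = count_by_year(toy_books)
```

Dividing by the yearly total of same-length phrases is one plausible way to compare years with very different numbers of books; a raw count alone would mostly track how much was published.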

By calculating how frequently famous personalities appear in Google's digitized texts, the Harvard researchers discovered that people these days become famous at a younger age than in previous eras and reach unprecedented peaks of notoriety. "The flip side is that people forget about you faster," said Harvard lead researcher J.B. Michel.

Measuring occurrences of prominent names, Mr. Michel and his colleagues found that Jimmy Carter leapt from obscurity around 1974, at the onset of his run for the U.S. presidency, to overtake Mickey Mouse, Marilyn Monroe and astronaut Neil Armstrong in published mentions. Once out of office, Mr. Carter began an equally sharp decline in mention. By contrast, the cartoon character, the astronaut and the movie star have continued their steady rise up the slope of fame.

In the same way, they identified instances of censorship by charting the abrupt disappearance of controversial figures from the written record.

Mentions of the popular Jewish artist Marc Chagall, for example, virtually disappear from German literature during the era of Nazi power between 1936 and 1944, when his work was banned, but not from English books of the same period.
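
One way such a disappearance could be flagged automatically is to compare a name's frequency in one language's books against a reference language, year by year. The sketch below is an assumption about how that might be done, not the method the researchers describe; the function name, the `floor_ratio` threshold, and the numbers are all invented for illustration.

```python
from statistics import median


def suppressed_years(target_freq, reference_freq, floor_ratio=0.2):
    """Flag years in which a name's frequency in the target corpus drops
    far below its typical level relative to a reference corpus.

    Both arguments are hypothetical dicts mapping year -> relative frequency
    of the same name. A year is flagged when its target/reference ratio is
    below `floor_ratio` times the median ratio over all shared years.
    """
    years = sorted(y for y in target_freq if reference_freq.get(y, 0) > 0)
    ratios = {y: target_freq[y] / reference_freq[y] for y in years}
    if not ratios:
        return []
    baseline = median(ratios.values())
    return [y for y in years if ratios[y] < floor_ratio * baseline]


# Invented numbers, purely to exercise the function; not measurements.
toy_target = {1901: 4e-7, 1902: 5e-7, 1903: 2e-8, 1904: 6e-7, 1905: 5e-7}
toy_reference = {1901: 4e-7, 1902: 5e-7, 1903: 5e-7, 1904: 6e-7, 1905: 5e-7}
flagged = suppressed_years(toy_target, toy_reference)
```

Using the median ratio as the baseline keeps a few suppressed years from dragging the baseline down with them, which a mean would tend to do.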

Other scholars are using the new database to chart social and emotional concepts over the past 200 years.

"Empathy has shot up since the 1940s," said Harvard University cognitive scientist and linguist Steven Pinker, who is experimenting with the data in his own research. "Will power, self-control and prudence have declined."

Write to Robert Lee Hotz at sciencejournal@wsj.com

Copyright 2010 Dow Jones & Company, Inc. All Rights Reserved

========

Test with "economic development":

[Books Ngram Viewer graph for the phrase "economic development"; Google Books search ranges: 1800-1941, 1942-1984, 1985-1989, 1990-1994, 1995-2000.]

Diplomatizzando
