All who are interested in computational linguistics will be cheering this Google Research announcement.
... we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more, resulting in a training corpus of one trillion words from public Web pages.

And to think it wasn't so long ago that the British National Corpus seemed large. At any rate, while this is obviously very good news, the true measure of the usefulness of all this data will lie in the quality of the annotation; given the sheer size of Google's corpus, I doubt the quality of the tagging will be such as to make the BNC or even the Brown Corpus irrelevant any time soon.

We believe that the entire research community can benefit from access to such massive amounts of data [...] We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
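The numbers in that last paragraph amount to a simple counting recipe: build a vocabulary of words seen at least 200 times, then publish every five-word sequence seen at least 40 times. Here is a minimal sketch of that scheme in Python, assuming an in-memory token list; the announcement doesn't say how Google treated the discarded rare words, so the <UNK> placeholder below is just one common convention, and the function name is mine:

    from collections import Counter

    MIN_WORD_COUNT = 200    # vocabulary cutoff from the announcement
    MIN_NGRAM_COUNT = 40    # five-gram cutoff from the announcement

    def count_five_grams(tokens):
        # Build the vocabulary: keep words seen at least 200 times.
        word_counts = Counter(tokens)
        vocab = {w for w, c in word_counts.items() if c >= MIN_WORD_COUNT}
        # Map rare words to a placeholder (an assumption; the announcement
        # doesn't describe how discarded words were handled).
        mapped = [w if w in vocab else "<UNK>" for w in tokens]
        # Count every five-word window and keep those seen at least 40 times.
        grams = Counter(zip(mapped, mapped[1:], mapped[2:], mapped[3:], mapped[4:]))
        return {g: c for g, c in grams.items() if c >= MIN_NGRAM_COUNT}

On a trillion words this would of course run as a distributed job rather than in memory, which is presumably why the datacenter infrastructure gets top billing in the announcement.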