Is English more efficient than Chinese after all?
via Language Log by Mark Liberman on April 27, 2008
[Executive summary: Who knows?]
This follows up on a series of earlier posts about the comparative efficiency, in terms of text size, of different languages (“One world, how many bytes?”, 8/5/2005; “Comparing communication efficiency across languages”, 4/4/2008; “Mailbag: comparative communication efficiency”, 4/5/2008). Hinrich Schütze wrote:
I’m not sure we have interacted since you taught your class at the 1991 linguistics institute in Santa Cruz — I fondly remember that class, which got me started in StatNLP.
I’m writing because I was intrigued by your posts on compression ratios of different languages.
As somebody else remarked, gzip can’t really be used to judge the informativeness of a piece of text. I did the following simple experiment.
I read the first 10^9 or so characters from the xml Wikipedia dump and wrote them to a file (which I called wiki). I wrote the same characters to a second file (wikispace), but inserted a space after each character. Then I compressed the two files. Here is what I got:
1012930723 wiki
2025861446 wikispace
314377664 wiki.gz
385264415 wikispace.gz
385264415/314377664 approx 1.225
The two files contain the same information, but gzip’s model does not handle this type of encoding well.
In this example we know what the generating process of the data was. In the case of Chinese and English we don’t. So I think that until there is a more persuasive argument we should stick with the null hypothesis: the two texts of a Chinese-English bitext are equally informative, but the processes transforming the information into text are different in that the output of one can be more efficiently compressed by gzip than the other. I don’t see how we can conclude anything about deep cultural differences.
Note that a word-based language model also would produce very different numbers for the two files.
Does this make sense or is there a flaw in this argument?
The flaw, clearly, was in *my* argument. I asserted that
modern compression techniques should be able to remove most of the obvious and simple reasons for differences in document size among translations in different languages, like different spacing or spelling conventions. If there are residual differences among languages, this either relates to redundancies that are not being modeled [of more complex kinds] or it reflects a different sort of difference between languages and cultures [such as differing habits of explicitness].
But Hinrich’s simple experiment shows that the first part of this assertion is simply false. At least, gzip compression can’t entirely remove even such a simple manipulation as the insertion of a space after every letter of the original. In principle, I believe, coders like gzip, based on accumulating a “dictionary” of previously-seen strings, should be asymptotically oblivious to such manipulations; but in the case at hand, we’re clearly a long way from the asymptote.
Hinrich’s note also prodded me to do something that I promised in one of the earlier posts, namely to try a better compression program on some Chinese/English comparisons. A few simple experiments of this type showed that I was even more wrong than Hinrich thought.
First, I replicated Hinrich’s experiment on English. I took the New York Times newswire for October of 2000 (from English Gigaword Third Edition, LDC2007T07). I created two derived versions: one by adding a space after each character of the original, as Hinrich did, and another by removing all spaces, tabs and newlines from the original.
I then compressed the three texts with gzip and with sbc, a compression program based on the Burrows-Wheeler Transform, which seems to be among the better recent text-file compressors. The results:
                            Original  Spaces added  Space, tab, nl removed
No compression            61,287,671   122,575,342              51,121,392
gzip -9                   21,467,564    26,678,868              19,329,166
gzip bpB (bits per byte)       2.802         1.741                   3.025
sbc -m3                   11,881,320    12,702,780              11,632,941
sbc bpB                        1.551         0.829                   1.820

This replicates Hinrich’s result: the spaces-added text is about 24% larger after gzip compression, and about 7% larger after sbc compression. Better compression is reducing the effect, but not eliminating it.
In the other direction, removing white space makes the original file about 17% smaller, and this difference is reduced but not eliminated by compression (10% less after gzip, 2.1% less after sbc).
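A minimal sketch of the three-way comparison just described, for readers who want to try it on their own text. It uses Python's standard gzip module at level 9 as a stand-in for the gzip -9 command, so the byte counts will not exactly match those of the gzip binary; sbc is not covered. The function names and the sample string are illustrative assumptions, not part of the original experiment.

```python
import gzip
import re


def spaces_added(text: str) -> str:
    """Insert one space after every character of the text."""
    return "".join(ch + " " for ch in text)


def whitespace_removed(text: str) -> str:
    """Remove all spaces, tabs and newlines from the text."""
    return re.sub(r"[ \t\n]", "", text)


def gzip_size(text: str) -> int:
    """Bytes after gzip at maximum compression (assumed analogue of gzip -9)."""
    return len(gzip.compress(text.encode("utf-8"), compresslevel=9))


def compare(text: str) -> dict:
    """Map each variant name to (uncompressed bytes, gzip-compressed bytes)."""
    variants = {
        "original": text,
        "spaces added": spaces_added(text),
        "whitespace removed": whitespace_removed(text),
    }
    return {
        name: (len(variant.encode("utf-8")), gzip_size(variant))
        for name, variant in variants.items()
    }


if __name__ == "__main__":
    # Hypothetical stand-in text; substitute a large real corpus for
    # meaningful numbers, since tiny inputs are dominated by gzip overhead.
    sample = "The two files contain the same information.\n" * 200
    for name, (raw, packed) in compare(sample).items():
        print(f"{name:20s} {raw:10d} {packed:10d}")
```

On a corpus of realistic size, dividing the compressed size of each variant by that of the original gives percentages comparable to the ones quoted above.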
Next, I thought I’d try a recently-released Chinese/English parallel text corpus, created by translating Chinese blogs into English (Chinese Blog Parallel Text, LDC2008T06). I processed the corpus to extract just the text sentences.
                Chinese    English  English/Chinese ratio
No compression  814,286  1,034,746                  1.271
gzip -9         362,565    366,322                  1.010
gzip bpB          3.562      2.832
sbc -m3         263,073    254,543                  0.968
sbc bpB           2.585      1.968

In the originals, the English translations are about 27% larger than the (UTF-8) Chinese originals, which is similar to the ratios seen before. However, even with gzip, the difference is essentially eliminated by compression. With sbc, the compressed English is actually slightly smaller than the compressed Chinese.
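The bpB and ratio figures above can be recomputed from the byte counts alone. The post does not spell out the formula; the sketch below assumes bpB means compressed bits per uncompressed byte (8 × compressed size / uncompressed size), which reproduces the blog-corpus figures. The helper names are illustrative.

```python
def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """Compressed bits spent per byte of the uncompressed text."""
    return 8 * compressed_bytes / original_bytes


def size_ratio(english_bytes: int, chinese_bytes: int) -> float:
    """English size divided by Chinese size."""
    return english_bytes / chinese_bytes


# Byte counts for the Chinese blog corpus, as (Chinese, English).
UNCOMPRESSED = (814_286, 1_034_746)
COMPRESSED = {
    "gzip -9": (362_565, 366_322),
    "sbc -m3": (263_073, 254_543),
}

if __name__ == "__main__":
    print(f"uncompressed ratio: {size_ratio(UNCOMPRESSED[1], UNCOMPRESSED[0]):.3f}")
    for method, (chinese, english) in COMPRESSED.items():
        print(
            f"{method}: ratio {size_ratio(english, chinese):.3f}, "
            f"Chinese bpB {bits_per_byte(chinese, UNCOMPRESSED[0]):.3f}, "
            f"English bpB {bits_per_byte(english, UNCOMPRESSED[1]):.3f}"
        )
```

Note that the ratio compares the two languages at a given stage, while bpB compares each language's compressed size against its own uncompressed size, so a language with the higher bpB can still have the smaller compressed file.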
So I went back and tried one of the corpora whose compressed size was discussed in my earlier post (Chinese English News Magazine Parallel Text, LDC2005T10). Again, I processed the corpus to extract only the (Big-5 encoded) Chinese or English text, eliminating formatting, alignment markers, etc. To my surprise, in this case, the English versions come out smaller under both gzip and sbc compression:
                   Chinese     English  English/Chinese ratio
No compression  37,399,738  54,336,642                  1.453
gzip -9         22,310,891  19,803,723                  0.888
gzip bpB              4.77       2.916
sbc -m3         16,708,712  12,458,354                  0.746
sbc bpB               3.57       1.834

This is the same corpus as the one called “Sinorama” in the table in my first post on this subject (“One world, how many bytes?”, 8/5/2005), where the English/Chinese ratio before compression was given as 1.95, and after gzip compression as 1.19.
(Why the difference? Well, the numbers in my 2005 post reflected the results of compressing the whole file hierarchy for each language, without any processing to distinguish the text from other things; and the Chinese files were encoded as Big5 characters, meaning that even the Latin-alphabet characters in the sgml formatting codes were 16 bits each.)
My conclusions:
1. Hinrich is right — current compression techniques, from a practical point of view, reduce but don’t eliminate the effects of superficial differences in orthographic practices.
2. It’s a good idea to be explicit and specific about the sources of experimental numbers, so that others can replicate (or fail to replicate) the process. So what I did to get the Chinese/English numbers is specified below, for those who care.
For the Sinorama corpus (LDC2005T10), in the data/Chinese and data/English directories, I extracted the text via this /bin/sh command:
    for f in *.sgm
    do
        egrep '^ […] //; s/<.seg> *$//'
    done >alltext
and then compressed (using gzip 1.3.3 with the -9 flag, and sbc 0.970r3 with the -m3 flag).
For the Chinese blog corpus (LDC2008T06), in the data/source and data/translation directories, I extracted the text via
    for f in *.tdf
    do
        gawk -F '\t' '{print $8}' $f
    done >alltext
and then compressed as above.
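For readers without gawk, the column extraction for the blog corpus can be sketched in Python. This assumes the .tdf files are tab-delimited with the sentence text in the eighth field, as the gawk command suggests; the function name is hypothetical. It works on raw bytes so that no character encoding has to be assumed.

```python
import sys
from typing import Iterable, Iterator


def extract_field(
    lines: Iterable[bytes], field: int = 8, sep: bytes = b"\t"
) -> Iterator[bytes]:
    """Yield the 1-based field from each delimited line.

    Lines with too few fields yield an empty value, which matches what
    gawk prints for a missing $8.
    """
    for line in lines:
        parts = line.rstrip(b"\n").split(sep)
        yield parts[field - 1] if len(parts) >= field else b""


if __name__ == "__main__":
    # Usage (paths are whatever .tdf files you pass on the command line):
    #   python extract.py *.tdf > alltext
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for text in extract_field(handle):
                sys.stdout.buffer.write(text + b"\n")
```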