A linguistically realistic solution for Twitter’s character-count dilemma

Twitter is testing a 280-character limit for English tweets, and people have been excited about this. The discussion naturally turns to linguistic differences. For example, according to this article in the Washington Post, the versions of Article 1 of the Universal Declaration of Human Rights in different languages have radically different character counts, as shown in the article’s chart. The English version contains 170 characters, while the Chinese version uses only 43.

Twitter’s algorithm counts one Chinese character the same as one letter in an English word. This of course restricts English more than Chinese. The Washington Post article summarizes this dilemma quite well, and I am going to quote it here (I’ll leave aside some of the linguistically inaccurate statements, since they are not really pertinent to the topic of this post):

The reason? Chinese characters — like those used in Japanese writing — are logograms, representing a full word. Characters in English and most other written languages, by contrast, represent sounds.

As far as your computer is concerned, when it’s displaying text, it doesn’t matter if a given character is a simple Roman letter like “I” or a relatively complex Chinese character like “我” (meaning me/I) — they’re each one character. So the word “elephant” takes up eight characters in English but just one character (象) in Chinese.

In other words, you can fit 140 elephants in a Chinese-language tweet but just 17 in an English one.

Therefore it really is just a matter of how characters are counted and what the equivalent writing units are across different languages. For example, if we convert the Chinese version of Article 1 of the Universal Declaration of Human Rights into a romanized form, such as Pinyin, we get far more than 43 characters, as shown here in comparison with the English version.

Chinese version in characters:

人人生而自由, 在尊严和权利上一律平等。他们赋有理性和良心, 并应以兄弟关系的精神相对待。

Chinese version in Pinyin:

Rén rén shēng ér zìyóu, zài zūnyán hé quánlì shàng yīlǜ píngděng. Tāmen fùyǒu lǐxìng hé liángxīn, bìng yīng yǐ xiōngdì guānxì de jīngshén xiāng duìdài.

English version:

All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
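The three versions above can be counted mechanically. The following is only a minimal sketch: it assumes that "one character" means one Unicode code point after NFC normalization, which is a simplification and may not match Twitter’s production counting in every detail. The function name `count_chars` and the dictionary `versions` are my own labels, not anything from Twitter.

```python
import unicodedata


def count_chars(text: str) -> int:
    """Count Unicode code points after NFC normalization.

    Assumption: this mirrors the "one character = one unit" rule
    discussed in this post; real tweet counting may differ in details.
    """
    return len(unicodedata.normalize("NFC", text))


# Article 1 of the UDHR, transcribed from the three versions above.
versions = {
    "Chinese characters": (
        "人人生而自由, 在尊严和权利上一律平等。"
        "他们赋有理性和良心, 并应以兄弟关系的精神相对待。"
    ),
    "Pinyin": (
        "Rén rén shēng ér zìyóu, zài zūnyán hé quánlì shàng yīlǜ píngděng. "
        "Tāmen fùyǒu lǐxìng hé liángxīn, bìng yīng yǐ xiōngdì guānxì "
        "de jīngshén xiāng duìdài."
    ),
    "English": (
        "All human beings are born free and equal in dignity and rights. "
        "They are endowed with reason and conscience and should act towards "
        "one another in a spirit of brotherhood."
    ),
}

counts = {name: count_chars(text) for name, text in versions.items()}

for name, n in counts.items():
    print(f"{name}: {n} characters")
```

Under this counting rule the English text comes out at 170 characters, matching the figure cited earlier. The exact count for the version in Chinese characters depends on punctuation and spacing conventions, so it may differ slightly from the figure quoted from the article.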

As we can see, the Chinese version in Pinyin is still a little more concise than the English one, which may reflect real linguistic differences: Chinese has tones and many monosyllabic words, while English has no tones and most of its words are probably not monosyllabic.

But the Chinese translation is somewhat archaic, resembling Classical Chinese, and is therefore more concise than spoken Modern Standard Chinese. If the Chinese version were rendered in a more modern style, it would easily be longer than the English version.

Therefore, although there are indeed linguistic differences in the economy of encoding information, the more obvious issue here is how to find a cross-linguistically comparable unit of writing. My solution is to convert the original text automatically into a romanized version whenever the language does not already use an alphabetic writing system, and then to count every language in terms of that romanized version. It will certainly take more processing on Twitter’s end to implement this method, but it is much better than increasing the limit to 280 characters for a few selected languages while leaving the limit unchanged for the others.
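To make the proposal concrete, here is a rough sketch of counting by romanized length. Everything in it is hypothetical: `TOY_PINYIN` is a two-entry lookup table invented for illustration, and `romanized_length` and `fits` are my own names. A real implementation would need a full romanization dictionary, word segmentation, and decisions about tone marks and spacing, none of which is attempted here.

```python
import unicodedata
from typing import Mapping

# Hypothetical, deliberately tiny lookup table for illustration only.
TOY_PINYIN = {"我": "wǒ", "象": "xiàng"}


def romanized_length(text: str, table: Mapping[str, str] = TOY_PINYIN) -> int:
    """Length of `text` after replacing each known character by its romanization.

    Characters missing from the table (Latin letters, digits, punctuation)
    are kept unchanged and therefore count as one unit each.
    """
    romanized = "".join(table.get(ch, ch) for ch in text)
    return len(unicodedata.normalize("NFC", romanized))


def fits(text: str, limit: int = 140) -> bool:
    """Check a tweet against a single limit shared by all languages."""
    return romanized_length(text) <= limit


print(romanized_length("象"))        # 5, the length of "xiàng"
print(romanized_length("elephant"))  # 8
print(fits("象" * 140))              # False: 700 romanized units
```

With this kind of count, 140 elephants no longer fit in a Chinese-language tweet, because each 象 is charged as the five letters of "xiàng" rather than as a single unit. The sketch joins syllables without spaces, which is one of several design choices a real system would have to settle.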

For one thing, it is highly debatable which languages need a change in the character-count limit and which do not. It would probably take much more study of each language to measure the average length needed to express the same idea, then normalize these differences against a common reference language, and only then decide the character-count limit for each language. But of course this is not the kind of solution that any company wants to implement, because it goes well beyond its normal range of operations.

Another reason is that the rationale for having a 140-character restriction is to enhance the quality of tweets. We are in an age of information overload. Having shorter but more meaningful tweets can definitely help us be more efficient. I’d rather read fewer but better tweets than read several long tweets that just keep rambling on and on.