Twitter needs manual language selection

Lots of Twitterers speak languages that are not English. For people who read tweets that are not in English, it is important that these tweets are marked as such. I feel Twitter needs a feature for this.

It would be nice if, when writing a tweet, we could manually select which language the tweet is in, and Twitter would then use that information to set the appropriate `lang` attribute on our content:

[Screenshot: tweet creation widget with a “set language” button; the tweet says, in Dutch, that I would like to share my opinion on using class names for everything. Caption: Sharing a controversial opinion on CSS frameworks in the Dutch language]

Twitter is an authoring tool, and for authoring tools the Authoring Tool Accessibility Guidelines (ATAG) recommend ensuring that “accessible content production is possible” (Guideline B.2.1).

The `lang` attribute

Language attributes identify which language a piece of web content is in. They are usually set at the page level, on the `<html>` element.
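As a minimal sketch, assuming a page whose main language is English (the title and body text are placeholders):

```html
<!DOCTYPE html>
<!-- lang on the root element declares the page's main language -->
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Example page</title>
  </head>
  <body>
    <p>Hello, world.</p>
  </body>
</html>
```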

Most developers don’t write these attributes often; the code usually lives somewhere in a template that we don’t touch every day, or ever. But it’s an important attribute. Setting it correctly gets your page to pass one whole WCAG criterion (3.1.1 Language of Page).

In some cases, we have to set language attributes on individual elements, too, for example when some of our content is not in the page’s main language. On the website I built for the British-Taiwanese band Transition, we combine content in Mandarin with content in English on one page:

[Screenshot: website with English text and Chinese text alongside it, and links to YouTube videos. Caption: The Transition “Music” page]

We picked `en` as the main language and set it on the `<html>` element. This meant we had to mark all Chinese content as `zh`, in this case `zh-TW`, as it is specifically Mandarin as spoken in Taiwan. Of course, we could have done this the other way around, too. Usually we want to pick the language that’s most common on the page as the page’s language.
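A rough sketch of that setup follows. It is not the actual Transition markup; the heading and text are placeholders chosen for illustration:

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Music</title>
  </head>
  <body>
    <!-- Inherits lang="en" from the root element -->
    <h1>Music</h1>
    <!-- Overrides the page language for this element and its descendants -->
    <p lang="zh-TW">音樂</p>
  </body>
</html>
```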

Setting a lang attribute on parts of a page is its own WCAG criterion, too (3.1.2 Language of parts), by the way.

The user need

Setting the language is important for end users, like:

people who use a screenreader to read out content on a page
people who use a braille display
people who end up seeing a default font (browsers can select these based on language)
people who use software to translate content
people who want to right click a word in our content to look it up in a dictionary
people who use user stylesheets

The author need

There is also an author need, both for people who write content and for web developers.

Content editors

People who write content may get browser-provided spellcheckers. They will work better if they know what the content’s language is. I think Twitter.com has somehow turned browser spellcheck off, but there may be Twitter clients or indeed other authoring tools where this is relevant.

Web developers

Language attributes are important for web developers, too, as they allow them to use the `:lang()` pseudo-class in CSS more effectively.
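To illustrate, here is a sketch that gives Chinese content its own font stack. The font name is an assumption; any font that covers Traditional Chinese would do:

```html
<style>
  /* Matches any element whose language is zh-TW, whether lang is set
     on the element itself or inherited from an ancestor */
  :lang(zh-TW) {
    font-family: "Noto Sans TC", sans-serif;
  }
</style>

<p>This paragraph keeps the page's default font.</p>
<p lang="zh-TW">音樂</p>
```

Unlike the attribute selector `[lang="zh-TW"]`, `:lang()` also matches descendants that only inherit the language, which is why correct attributes higher up in the tree matter.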

Some CSS behaves differently depending on the language. When you use `hyphens: auto`, the browser needs to look up words in a dictionary to apply hyphenation correctly. It has to know the language for this.
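A small sketch of the hyphenation case, assuming the browser ships a hyphenation dictionary for Dutch (support varies per browser and per language):

```html
<style>
  p {
    width: 8em;     /* narrow column, so the long word has to break */
    hyphens: auto;  /* the browser picks hyphenation rules based on lang */
  }
</style>

<!-- lang tells the browser to apply Dutch hyphenation rules to this word -->
<p lang="nl">Arbeidsongeschiktheidsverzekering</p>
```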

With appropriate language attributes, you can also use CSS features like writing modes and typographic properties more effectively. See Hui Jing Chen’s deep dive into CSS for internationalisation for more details.

Automating and `lang-maybe`

Identifying languages can be automated. In fact, Twitter does this. When they recognise a tweet’s language, they add the relevant lang attribute proactively. See for instance the European Commission chair’s multilingual tweets:

[Screenshot: three tweets by Ursula von der Leyen, in French, German and English, with dev tools open and each tweet pointing to the `lang` attribute in the markup. Caption: Twitter’s auto-added lang attributes in action]

Yay! I think this is very cool (thanks ThainBBdl for pointing this out). The advances in natural language processing are really impressive.

Having said that, any automated system makes mistakes. Vadim Makeev shared:

Yes, sometimes they take my Russian tweets and render them as Bulgarian. It’s not just the lang, they also use some Cyrillic font variation that makes them harder to read.

It is safe to assume such mistakes will skew towards minority languages and miss subtleties that matter a lot to individual people, especially in areas where language is political.

On the one hand, I think it makes sense to deploy automated language identification. As there are a lot of users, Twitter can safely assume not everyone would set a language for all of their tweets. People might not know or care (insert sad face here), and a fallback helps with that. On the other hand, if this tech exists, would it make more sense for a browser to deploy it rather than an individual website? Why not have the browser guess the content’s language, for every website and not just Twitter?

If browsers did this, Twitter’s `lang` attributes might get in the way. They give the impression that this information is author-provided, while it is actually a guess. This makes me wonder: should there be a way for Twitter to say their declaration is a guess? `lang-maybe`?
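Purely to make the idea concrete: `lang-maybe` does not exist in HTML, and the syntax below is a made-up illustration rather than anything that has been specified:

```html
<!-- Author-provided: the person writing the content declared the language -->
<p lang="ru">…</p>

<!-- Hypothetical, not a real attribute: the site guessed the language, so a
     browser or assistive technology could treat it as a hint only -->
<p lang-maybe="bg">…</p>
```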

Manual selection

Automated language detection probably works best if it complements manual selection. It could help provide a default choice or suggestion for manual selection, and work as a fallback. So, I’m still going to make the case for a method for users to specify a language manually.

A per-tweet manual language picker would be great as it can:

give willing authors more control to avoid issues
make sure the benefits of language identification don’t only go to users of the majority languages that AI models are best trained for
let authors express their specific intent
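A very rough sketch of what such a picker could do under the hood. This is not Twitter’s markup; the ids, the inline handler and the list of languages are all invented for illustration:

```html
<form>
  <label for="tweet">Tweet</label>
  <textarea id="tweet" lang="en"></textarea>

  <label for="tweet-lang">Language of this tweet</label>
  <!-- Copy the chosen language tag onto the textarea's lang attribute -->
  <select id="tweet-lang"
          onchange="document.getElementById('tweet').lang = this.value">
    <option value="en" lang="en">English</option>
    <option value="nl" lang="nl">Nederlands</option>
    <option value="zh-TW" lang="zh-TW">中文（台灣）</option>
  </select>
</form>
```

The same value could then be stored with the tweet and used for the `lang` attribute wherever the tweet is displayed, with automated detection only supplying the initial selection.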

Summing up

For non-English tweets to meet WCAG, they need to have their language declared with a `lang` attribute. Twitter currently guesses languages, which is a great step in the right direction, but is likely of little help to speakers of minority languages. A manual selector would be a great way to complement the automation.
