Unix Terminals: Surviving the Encoding Hell

Published April 15th, 2010, updated March 29th, 2013.

Every now and then, I see people using misconfigured text terminals. They show up in chatrooms and post gibberish, or they leave broken umlauts in text, HTML and source files. This mostly happens because they (or you) have a broken terminal configuration. In this post, I will try to explain how terminal encodings work and how you can fix things.

Generally speaking, things break if you are using a different terminal encoding than your peers. When you enter text like umlauts and other international characters, it gets encoded using your local terminal encoding (something like latin1, utf-8 or cp850). If a different encoding is used to display this data, you are likely to see gibberish and other strange effects in your terminal. Thus, we need to define which encoding we want to use for a specific file, a chatroom or on a system-wide level. A good guess would be utf8 nowadays, but us-ascii/ascii7 is also pretty common.
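To make the mismatch concrete, here is a small sketch that takes the two utf8 bytes of “ä” and deliberately reads them as latin1. It assumes an iconv(1) that knows the name ISO-8859-1 and a terminal that really displays utf8; the function name misread_as is made up for this post.

```shell
# Reinterpret the bytes on stdin as if they were encoded in $1,
# then convert them to UTF-8 so a UTF-8 terminal can display the result.
misread_as() {
    iconv -f "$1" -t UTF-8
}

# \303\244 are the two bytes c3 a4, i.e. "ä" in UTF-8.
# Read as latin1, each byte becomes its own character: "Ã¤".
printf '\303\244\n' | misread_as ISO-8859-1
```

This is the kind of gibberish that shows up when one side writes utf8 and the other side displays latin1.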

First, let’s find out our current terminal encoding. Just enter some umlauts like “äöü” and show their binary representation in hexadecimal:

$ printf "äöü" | xxd
0000000: c3a4 c3b6 c3bc                           ......

In this example, we find “c3a4 c3b6 c3bc”, which indicates that the umlauts got encoded as utf8. Other possible results would be “e4 f6 fc” for win1252 or “84 94 81” for cp850. You can look up some more encodings here. (Of course, you can also check the manual for your terminal emulator.)
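If you do not want to compare hex dumps by eye, the lookup can be scripted. The following is only a sketch: it knows just the three byte sequences mentioned above, uses od(1) instead of xxd because od is available almost everywhere, and the function name guess_umlaut_encoding is my own invention.

```shell
# Read the bytes produced by typing "äöü" from stdin and guess the encoding.
guess_umlaut_encoding() {
    # Hex-dump stdin and squeeze the dump into one string such as "c3a4c3b6c3bc".
    hex=$(od -An -tx1 | tr -d ' \n')
    case "$hex" in
        c3a4c3b6c3bc) echo "utf-8" ;;
        e4f6fc)       echo "latin1 or win1252" ;;
        849481)       echo "cp850" ;;
        *)            echo "unknown ($hex)" ;;
    esac
}

# Demo with fixed bytes (octal escapes for c3 a4 c3 b6 c3 bc).
printf '\303\244\303\266\303\274' | guess_umlaut_encoding
```

In a real terminal you would run printf "äöü" | guess_umlaut_encoding and type the umlauts yourself, so that the bytes come from your own terminal.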

Now that we know our current terminal encoding, we need to tell the system libraries and other console software about it. This is done using locale(5), a standard that is used by almost any program that actually handles character encodings instead of just passing raw binary data along. To do so, you can list all available locales by running “locale -a” and pick an appropriate one:

$ locale -a
C
de_DE.utf8
en_US.utf8
POSIX

This list contains entries in the format language_location.encoding; additional locales can be created using tools like locale-gen(1). I use “en_US.utf-8” because my terminal uses utf8 and I prefer English program output. This locale string should be set in the environment variable $LC_ALL (or in LC_CTYPE if you only want to set the encoding and leave language and location alone). Some terminals do this automatically, but we can also do this in our ~/.profile file, which login shells source at startup. For compatibility with older software, we also set $LANG to the same value:

export LC_ALL=en_US.utf-8
export LANG="$LC_ALL"
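A slightly more defensive variant only exports the locale if it shows up in “locale -a”. This is a sketch, not a drop-in replacement: the helper names normalize_locale and locale_available are made up, and the comparison assumes that spellings like “UTF-8”, “utf-8” and “utf8” all mean the same thing, which is how “locale -a” output usually differs from what people type.

```shell
# Lowercase a locale name and drop hyphens, so that "en_US.UTF-8"
# and "en_US.utf8" compare equal. Used for comparison only.
normalize_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-'
}

# Succeed if the locale named in $1 appears in the list read from stdin
# (one locale name per line, as printed by "locale -a").
locale_available() {
    want_norm=$(normalize_locale "$1")
    found=1
    while IFS= read -r name; do
        if [ "$(normalize_locale "$name")" = "$want_norm" ]; then
            found=0
        fi
    done
    return "$found"
}

my_locale=en_US.UTF-8
if locale -a 2>/dev/null | locale_available "$my_locale"; then
    export LC_ALL="$my_locale"
    export LANG="$LC_ALL"
else
    echo "locale $my_locale is not available, leaving LC_ALL untouched" >&2
fi
```

The point of the check is to notice a missing locale when the profile is read, instead of silently ending up with the “C” defaults.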

You can check the result in a new terminal by typing “locale”; if you see “C” instead of your locale string, something went wrong and the locale fell back to the default settings. In that case, check that your locale string is in the list printed by “locale -a”. When everything looks OK, you should be able to read the utf8 line in my umlauts test file (just type “cat umlauts.bin”).
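A quick way to see which encoding the active locale resolves to is “locale charmap”. The sketch below classifies its output. The function name charmap_status is made up, and the names listed for the ASCII fallback are the ones I would expect from common systems (glibc prints ANSI_X3.4-1968 for the “C” locale); other platforms may report something else.

```shell
# Classify one line of "locale charmap" output read from stdin.
charmap_status() {
    # "|| true" keeps the value even if the input lacks a trailing newline.
    IFS= read -r charmap || true
    case "$charmap" in
        UTF-8)
            echo "ok: UTF-8" ;;
        ANSI_X3.4-1968|US-ASCII|646)
            echo "fallback: ASCII-only C/POSIX locale" ;;
        "")
            echo "unknown: no charmap reported" ;;
        *)
            echo "other: $charmap" ;;
    esac
}

locale charmap 2>/dev/null | charmap_status
```

If this prints the fallback line although you exported a utf8 locale, the locale string was probably not accepted.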

Now that we have checked the local terminal settings, we should do the same for hosts that we ssh into. Luckily, ssh can forward our locale settings: just append “SendEnv LANG LC_ALL” to ~/.ssh/config and check that your locale is also available on the remote host. The remote sshd has to accept these variables as well (AcceptEnv in its sshd_config). Voilà, you have a properly working terminal with defined locales.
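To verify that forwarding worked, you can compare the LC_CTYPE value reported by “locale” on both ends. The helpers below are a sketch with made-up names (ctype_of, report_ctype); they only parse text, so the ssh call itself is left to you.

```shell
# Print the value of LC_CTYPE from "locale" output on stdin.
# The value may or may not be wrapped in double quotes.
ctype_of() {
    sed -n 's/^LC_CTYPE="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p'
}

# Compare two LC_CTYPE values and say whether they agree.
report_ctype() {
    if [ "$1" = "$2" ]; then
        echo "match: $1"
    else
        echo "mismatch: local=$1 remote=$2"
    fi
}

# Local half only; the remote half needs a real host.
locale 2>/dev/null | ctype_of
```

With a real host in place of the placeholder remotehost, the full check would be: report_ctype "$(locale | ctype_of)" "$(ssh remotehost locale | ctype_of)"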

If you still see malformed characters, it is likely that you are using software that does not know about locales at all and just passes raw data. In theory, such software should fall back to us-ascii/ascii7 and strip or replace all other characters. If this fails, you can either use another program or switch to a terminal program with the same binary encoding (or avoid umlauts if you are on IRC ;-).
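As a last resort, you can do the stripping yourself before the data reaches such a program. This is a blunt sketch (the function name strip_non_ascii is made up): it works on bytes, so every non-ASCII character disappears completely, and it also drops NUL bytes.

```shell
# Delete every byte outside the 7-bit ASCII range 0x01-0x7f from stdin.
# LC_ALL=C makes tr work on single bytes instead of multibyte characters.
strip_non_ascii() {
    LC_ALL=C tr -cd '\001-\177'
}

# "für" in UTF-8 (f, c3 bc, r) comes out as "fr".
printf 'f\303\274r\n' | strip_non_ascii
```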

  • Andy

    Hey Benjamin,

    Your post really helped me out! Thought I help you a bit by pointing out a quick typo in the article. You say “This list contains entries in the format location_language.encoding”, when I think you meant “language_location.encoding”. Not huge, just thought I’d let you know.

    Thanks!

  • http://cxcv.de/ Benjamin Schweizer

    Thanks, I’ve updated this.