When I started learning Python, working with texts in my native language was a nightmare: constant errors caused by diacritics. Now, one year more experienced, I’d like to share my observations.

 

What is Encoding?

Encoding is a way of converting a sequence of characters into a sequence of bytes, and the other way around. When is it used? Every text document is saved as bytes. When you want to display it, its binary code is converted into a visible representation, which is basically characters. So what is all the fuss about? Firstly, there are many encoding formats, and secondly, a character is represented by a different byte sequence in each encoding. If we don’t tell the computer which format was used, we can get unexpected characters or even errors:

>>> phrase = 'Pójdźmy na stołówkę!'
>>> print(phrase.decode('ISO-8859-2'))
PĂłjdĹşmy na stoĹĂłwkÄ!

>>> print(phrase.decode('Latin-1'))
PÃ³jdÅºmy na stoÅÃ³wkÄ!

>>> print(phrase.decode('UTF-8'))
Pójdźmy na stołówkę!

The basic encoding format is ASCII, which can encode 128 characters: the most popular letters, Arabic numerals and punctuation. It’s usually insufficient, as it can’t handle diacritics (ą ű Ỹ), non-Latin scripts (e.g. Japanese), or unusual signs and punctuation (∑ ‰). Characters recognized by ASCII are compatible with most encodings. For European languages we could consider Latin-1 and Latin-2, which deal with Western and Eastern European languages respectively. But they still lack unusual signs, some punctuation and some diacritics. What’s more, Latin-2 is problematic when used for website encoding.

So does an encoding exist that can deal with all possible languages and punctuation? Yes: UTF, whose aim is to handle Unicode, the standardized set of all characters used all over the world. UTF-8, which encodes each character in 1 to 4 bytes, is the most common (and the least problematic) UTF standard. It also includes many signs used in building websites, like arrows and emoji. UTF-16, which encodes each character in one or two 16-bit units (2 or 4 bytes), is also quite common. We should still remember that UTF-8 and UTF-16 are not compatible with each other. For example, the euro sign € (U+20AC in the Unicode standard) takes three bytes in UTF-8 (\xe2\x82\xac) and two bytes (\xac\x20) in UTF-16LE.
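You can verify this difference from a Python 3 prompt (note that Python displays the byte \x20 as a plain space inside the b'…' literal):

>>> '€'.encode('UTF-8')
b'\xe2\x82\xac'
>>> '€'.encode('UTF-16-LE')
b'\xac '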

Further on, we will use these terms with the following meaning:

Encoding – converting the ‘visual’ character representation into its byte notation
Decoding – converting the byte representation into its ‘visual’ representation (characters)
Diacritics – letters with an added glyph, like ąęźśćńłó
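In Python 3 notation, a minimal round trip between the two looks like this:

>>> 'ł'.encode('UTF-8')          # encoding: character -> bytes
b'\xc5\x82'
>>> b'\xc5\x82'.decode('UTF-8')  # decoding: bytes -> character
'ł'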

 

How to recognize the encoding of my text?

You should be told. If you scrape data from the internet, you can often find it in the page metadata:

<meta charset="UTF-8" />
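If you fetch pages with the requests library, it exposes both the server-declared encoding and a guess based on the content; a small sketch (the URL is just a placeholder and the exact values depend on the page):

>>> import requests
>>> r = requests.get('http://example.com')
>>> r.encoding             # whatever the Content-Type header declared
'UTF-8'
>>> r.apparent_encoding    # guessed from the raw bytes
'ascii'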

However, if you’re not sure what the encoding is, you can use several tricks:

  1. If you see b‘\x00’ bytes, the text is probably encoded in UTF-16 or UTF-32, not an 8-bit scheme.
  2. You can use chardetect: https://pypi.python.org/pypi/chardet. Just navigate to the file in a terminal, write

chardetect filename

and the program will guess what the encoding is:
chardetect test.csv
test.csv: utf-8 with confidence 0.99
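The same package can also be used from inside Python; chardet.detect() takes raw bytes and returns its best guess, something like:

>>> import chardet
>>> with open('test.csv', 'rb') as f:
...     chardet.detect(f.read())
...
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}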

Encoding in Python 2 and 3

Python 3

Python 3 is easy. Strings are Unicode by default and UTF-8 is the default source encoding, so you probably won’t have problems with encoding. I dare say that if you use Python 3 you can skip further reading.
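For instance, a Python 3 string is a sequence of characters, not bytes, so diacritics behave as expected out of the box:

>>> word = 'łyżwę'
>>> len(word)                   # five characters...
5
>>> len(word.encode('UTF-8'))   # ...even though they occupy eight bytes in UTF-8
8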

Python 2

Python 2 uses ASCII as its default encoding. Strings are represented as raw bytes, which may cause problems. You need either to use str.decode() to turn bytes into visual characters or, easier, to write this declaration at the top of the script:

# -*- coding: utf-8 -*-

It’s a comment; nevertheless, anything that matches the regex below and is located in the first two lines of code will be understood as a coding declaration:

coding[=:]\s*([-\w.]+)
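A minimal Python 2 script using the declaration could look like this:

# -*- coding: utf-8 -*-
phrase = 'Pójdźmy na stołówkę!'  # stored as raw UTF-8 bytes
print(phrase.decode('utf-8'))    # decode before working with characters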

Some IDEs (e.g. Spyder) add the UTF-8 declaration automatically. Remember that this form of encoding declaration applies only to text written directly in the Python code. If you import text, you should declare its encoding as well. For example, I imported a variable text which holds the phrase usychają uważając ją za przyjaciółkę (‘they wither away, considering her a friend’). Let’s say I want to find all the letters, spaces and newline characters:

>>> import re
>>> re.findall(r'([a-z \nćźżśńóąęł])', text)

['u', 's', 'y', 'c', 'h', 'a', 'j', '\xc4', '\x85', ' ', 'u', 'w', 'a', '\xc5', '\xbc', 'a', 'j', '\xc4', '\x85', 'c', ' ', 'j', '\xc4', '\x85', ' ', 'z', 'a', ' ', 'p', 'r', 'z', 'y', 'j', 'a', 'c', 'i', '\xc3', '\xb3', '\xc5', '\x82', 'k', '\xc4', '\x99']

Each diacritic is encoded as two bytes (ą: '\xc4\x85', ę: '\xc4\x99', etc.) but written as two separate characters ('\xc4', '\x85'), which means the computer doesn’t recognize it as one letter. This may cause problems, especially when iterating through letters. Whoops! It seems we didn’t declare the encoding of the imported text! The regex below handles diacritics properly (and, by the way, removes unnecessary characters):

>>> re.sub(r'([^\s\n\w])', '', unicode(text, 'utf-8'), flags=re.UNICODE)
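Once the text is decoded, the same lookup returns each diacritic as a single character; here is a sketch of the first ten matches (Python 2 shows ą as its escape u'\u0105'):

>>> text_u = unicode(text, 'utf-8')
>>> re.findall(ur'([a-z \nćźżśńóąęł])', text_u)[:10]
[u'u', u's', u'y', u'c', u'h', u'a', u'j', u'\u0105', u' ', u'u']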

 

\u015bl b'\x67' u'\x56'

When working with strings you may come across different kinds of character notation.

\u introduces a character in Unicode notation, whereas \x means that the next two characters are a hexadecimal code. b'word' declares that word is a sequence of raw bytes, and u'word' declares that word is text in Unicode notation. Example: the Unicode character U+00F3 may be represented as u'ó', u'\u00F3' or u'\xF3', while its byte representation is the following:

>>> u'ó'.encode('UTF-8')
b'\xc3\xb3'

You should remember that a Unicode escape without the u prefix will be interpreted literally as a string (at least in Python 2):

>>> print('\u1234')
\u1234

By contrast, u'\u1234' denotes a Unicode character and will be printed as ሴ.

HINT! If you see \x characters in your string and printing doesn’t reveal their value, it means you’re looking at a byte sequence. You should use the b prefix and decode it:

>>> print('\xe1\x95\xb2')
á²
# characters not properly detected (\x95 is an invisible control character)

>>> print(b'\xe1\x95\xb2'.decode('UTF-8'))
ᕲ
# that’s it!

And vice versa, we can obtain the byte encoding from the Unicode character:

>>> print(u'\u1572')
ᕲ
>>> print('\u1572'.encode('UTF-8'))
b'\xe1\x95\xb2'

>>> print('ᕲ'.encode('UTF-8'))
b'\xe1\x95\xb2'
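If you need to go the other way, from a character back to its Unicode code point, ord() combined with hex() does the job:

>>> hex(ord('ᕲ'))
'0x1572'
>>> '\u1572' == 'ᕲ'
True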

Tips and tricks

Diacritics in variable names

Python 3 is a hipster: it allows you to use non-ASCII characters in source code:

>>> różnica = 22
>>> różnica-20
2

However, it’s not recommended if there is any chance that the code will be used by others, because:

  • They could have different keyboard settings and may not be able to type diacritics using keyboard shortcuts.
  • Diacritics will probably appear in regional words, which may be meaningless to foreigners.
  • Diacritics may look similar to each other, so it’s easy for non-native speakers to use the wrong character (see the snippet after this list):

á vs à
Ĉ vs Č
Ō vs Ŏ
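When in doubt, the standard unicodedata module can tell such look-alikes apart by their official names:

>>> import unicodedata
>>> unicodedata.name('á')
'LATIN SMALL LETTER A WITH ACUTE'
>>> unicodedata.name('à')
'LATIN SMALL LETTER A WITH GRAVE'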

Replacement for non-recognized characters

If you’re afraid that Python will silently mangle unrecognized characters, you can set a default replacement and then print the set of characters to double-check whether suspicious characters occur:

>>> word='łyżwę'
>>> word.encode('cp437', errors='replace')
b'?y?w?'

>>> set(word.encode('cp437', errors='replace').decode('UTF-8'))
{'?', 'w', 'y'}
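Besides errors='replace', Python 3 offers a few other error handlers; which one fits depends on whether you prefer to drop, escape or HTML-encode the problematic characters:

>>> word.encode('cp437', errors='ignore')            # drop them
b'yw'
>>> word.encode('cp437', errors='backslashreplace')  # keep Unicode escapes
b'\\u0142y\\u017cw\\u0119'
>>> word.encode('cp437', errors='xmlcharrefreplace') # HTML character references
b'&#322;y&#380;w&#281;'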

No console error doesn’t mean ‘no error’

Any 8-bit encoding is able to decode any sequence of bytes, so if you use the wrong one, decoding may silently produce a noisy sequence of characters instead of raising an error. You can see it in the example (Python 3 notation this time):

>>> phrase = 'Pójdźmy na stołówkę!'

>>> print(phrase.encode('UTF-8').decode('ISO-8859-2'))
PĂłjdĹşmy na stoĹĂłwkÄ!

>>> print(phrase.encode('UTF-8').decode('Latin-1'))
PÃ³jdÅºmy na stoÅÃ³wkÄ!

>>> print(phrase.encode('UTF-8').decode('UTF-8'))
Pójdźmy na stołówkę!

So it’s good practice to print the set of unique characters to determine whether you used the proper encoding. It’s also a good moment to decide whether you should remove unwanted characters with a regex.

>>> set(phrase.encode('UTF-8').decode('ISO-8859-2'))
{'w', '\x99', 'Ă', 'P', '\x82', 'n', 'Ĺ', 'Ä', 'ł', 'm', ' ', 'a', 's', 'o', 'y', 'ş', 'k', 't', '!', 'j', 'd'}
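For comparison, decoding with the correct encoding yields a set containing only the characters you actually typed (element order will vary):

>>> set(phrase.encode('UTF-8').decode('UTF-8'))
{'ó', 'm', 'y', 'w', 'a', ' ', 'k', 'j', 'n', 'd', 'ł', 'P', 's', '!', 'ę', 't', 'o', 'ź'}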