wesc (wesc) wrote,

Unicode{De,En}codeError: 'ascii' codec can't encode characters in position 0: ordinal not in rang...

[INTERMEDIATE]

BOO!
Here is your scary Python Halloween code, followed (hopefully) by a trick or treat to remedy your situation.

Does the error in the Subject line look familiar? Yeah, this is what you have to go through because, "[it] seems that you've been living two lives," ASCII strings and Unicode strings. "One of these lives has a future, and one of them does not."[*]

Here is an example:

>>> t = u'\xae'

Problems occur when you call other code that internally tries to convert the string to ASCII; sometimes even standard library code does this! If I try the conversion myself at the interactive prompt, I get the failure:

>>> str(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
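
For what it's worth, that implicit conversion behaves roughly like an explicit ASCII encode, which fails in the same way. Here is a minimal sketch of that idea; it is my own illustration rather than anything the interpreter literally runs, and the variable names are made up. It uses explicit u'' literals so it should behave the same on 2.x and 3.x:

```python
text = u'\xae'  # one Unicode character, U+00AE

try:
    # Roughly what str(text) attempts behind the scenes on Python 2
    text.encode('ascii')
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason      # the tail end of the error message
    position = exc.start     # index of the character that could not be encoded

assert failed
assert reason == 'ordinal not in range(128)'
assert position == 0
```

The exception object carries the same details you see in the traceback, which can be handy when you need to log which character caused the trouble.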


What you need to do, because the software expects a byte string (a plain 2.x str), is to turn your Unicode string into the equivalent byte string yourself, so that the code can deal with it. You do this by encoding your Unicode string with a codec that can represent your characters. ASCII cannot represent this one, so I've chosen to encode it in UTF-8 below. Notice that the str() call on my new string does not fail anymore:

>>> u = t.encode('utf-8')
>>> u
'\xc2\xae'
>>> str(u)
'\xc2\xae'
>>> print u
®


Notice that a valid encoding of the single Unicode character takes two bytes (at least in UTF-8 it does for this one character). The bottom line is that you need to ensure that your string is either a valid Unicode string or a valid byte string (one that has been encoded by some valid and supported codec). In other words, Unicode u'\xae' is valid, and so is the byte string '\xc2\xae' because it has been encoded -- it is the UTF-8 encoding of the Unicode u'\xae' character.
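
To make the one-character-versus-two-bytes point concrete, here is a small sketch of the round trip. The variable names are just for illustration, and the explicit u'' and b'' prefixes are there so that it should run unchanged on 2.x and 3.x:

```python
text = u'\xae'                           # one Unicode character (U+00AE)
utf8_bytes = text.encode('utf-8')        # explicit encode: Unicode -> bytes
round_trip = utf8_bytes.decode('utf-8')  # explicit decode: bytes -> Unicode

assert utf8_bytes == b'\xc2\xae'  # the same two bytes shown above
assert len(text) == 1             # one character...
assert len(utf8_bytes) == 2       # ...but two bytes once encoded in UTF-8
assert round_trip == text         # decoding with the same codec gets it back
```

As long as you encode and decode with the same codec, nothing is lost along the way.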

The byte string '\xae' doesn't really have much meaning by itself other than being a byte with the value 174. ASCII only defines the values 0 through 127, so 174 is not an ASCII character at all. This issue is less painful in Python 3.x, where all strings are Unicode (and binary data lives in a separate bytes type), but those of us still on 2.x have to partially live in (probably the worst of) both worlds. When you try to convert such a string back into Unicode, the default ASCII codec cannot decode it, so you'll get a UnicodeDecodeError:

>>> s = '\xae'


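Given that "useless" byte string, attempting to convert it to Unicode will fail. These two calls fail in the same way, because encoding a byte string in 2.x first decodes it with the ASCII codec behind the scenes: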

>>> unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)


Likewise, if you try to encode it using UTF-8, it will also fail. It is not text yet, so Python first tries to decode it via ASCII before encoding it via UTF-8, and that decode step is what blows up:

>>> s.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
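
Whether a byte string can be decoded depends entirely on which codec you name. Here is a hedged sketch that tries the same byte against three codecs; try_decode is a hypothetical helper I made up for this illustration, not a standard library function:

```python
raw = b'\xae'  # a single byte with the value 174

def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

assert try_decode(raw, 'ascii') is None       # 174 is outside ASCII's 0-127
assert try_decode(raw, 'utf-8') is None       # a lone 0xAE is not valid UTF-8
assert try_decode(raw, 'latin-1') == u'\xae'  # Latin-1 maps byte 174 to U+00AE
```

Latin-1 happens to work here, but that only helps if the bytes really were produced with Latin-1 in the first place; guessing the codec is how you end up with garbage characters.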


Again, that string is only useful as the byte value 174 unless there is another encoding out there that makes this string a valid one. Okay, so I hope that I've helped you with a problem you've been having. I recommend that you read other sources of information as well. Don't take my word for it as I'm as trustworthy as... well, you can't believe everything that you read, right?!? With that said, if you are a Unicode expert and have suggestions on how I can reword portions of this article, feel free to drop me a line -- you can find me on Twitter or Google+.

One last observation: tripping up in this manner only happens if you are moving back and forth between your "two lives." If you stay all Unicode or all byte strings, things will be more stable. If you have to operate in both modes, then at least keep the translations either at the front or at the end of your processing. In other words, if your data is stored in Unicode but your code operates on byte strings (or vice versa), do one translation at the beginning, keep it the same during processing, then do a final conversion back to the data storage type. You will minimize your pain this way.
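
Here is one way that "translate at the edges" advice might look in practice. This is only a sketch under my own assumptions: the function name shout is made up, and I am assuming the stored data is UTF-8 encoded bytes while the processing wants Unicode.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Decode once on the way in, work in Unicode, encode once on the way out."""
    text = raw_bytes.decode(encoding)  # bytes -> Unicode, at the front
    text = text.upper() + u'!'         # all processing happens on Unicode
    return text.encode(encoding)       # Unicode -> bytes, at the end

result = shout(b'happy halloween \xc2\xae')
assert result == b'HAPPY HALLOWEEN \xc2\xae!'
```

Everything between the decode and the encode only ever sees Unicode, so there is no spot in the middle where an implicit ASCII conversion can sneak in.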

Finally, there are many good resources out there where you can study the struggle between Unicode and ASCII/byte strings. If I get a chance to collate such a list, I'll post it here. For now, Google is your friend.

Happy Halloween!

[*] quotes (c)1999 Groucho II Film Partnership, Silver Pictures, Village Roadshow Pictures, Warner Bros. Pictures
Tags: 2.x, 3.x, ascii, bytes, data, decode, encode, error, python, python2, python3, strings, text, unicode