The future package can be used with or without unicode_literals imports.
In general, it is more compelling to use unicode_literals when back-porting new or existing Python 3 code to Python 2/3 than when porting existing Python 2 code to 2/3. In the latter case, explicitly marking up all unicode string literals with u'' prefixes would help to avoid unintentionally changing the existing Python 2 API. However, if changing the existing Python 2 API is not a concern, using unicode_literals may speed up the porting process.
This section summarizes the benefits and drawbacks of using unicode_literals. To avoid confusion, we recommend using unicode_literals everywhere across a code-base or not at all, instead of turning it on for only some modules.
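The all-or-nothing recommendation can be checked mechanically by scanning each module for the import. The following is a minimal sketch, not part of future: the helper names uses_unicode_literals and modules_disagree are hypothetical, the two-module code-base is inlined as strings for illustration, and it assumes every module's source parses under the interpreter running the check.

```python
import ast


def uses_unicode_literals(source):
    """Return True if the module source imports unicode_literals from __future__."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ImportFrom) and node.module == '__future__':
            if any(alias.name == 'unicode_literals' for alias in node.names):
                return True
    return False


def modules_disagree(sources):
    """Given {module_name: source}, return True if only some modules use the import."""
    flags = set(uses_unicode_literals(src) for src in sources.values())
    return len(flags) > 1


# Hypothetical two-module code-base, inlined as strings for illustration.
example = {
    'a.py': "from __future__ import unicode_literals\nNAME = 'a'\n",
    'b.py': "NAME = 'b'\n",
}
mixed = modules_disagree(example)   # True: a.py opts in, b.py does not
```

In a real project the dictionary would be built by reading the package's .py files from disk.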
Benefits

Code without u'' prefixes is cleaner, one of the claimed advantages of Python 3. Even though some unicode strings would require a function call to convert them to native strings for some Python 2 APIs (see Standard library incompatibilities), the incidence of these function calls would usually be much lower than the incidence of u'' prefixes for text strings in the absence of unicode_literals.

The diff produced by the port is likely to be smaller and easier to review with unicode_literals than if an explicit u'' prefix is added to every unadorned string literal.

On Python 3.0-3.2, u'' prefixes are a SyntaxError, making unicode_literals the only option for a Python 2/3 compatible codebase that must support those versions. [However, note that future doesn’t support Python 3.0-3.2.]

Drawbacks

Adding unicode_literals to a module amounts to a “global flag day” for that module, changing the data types of all strings in the module at once. Cautious developers may prefer an incremental approach. (See here for an excellent article describing the superiority of an incremental patch-set in the case of the Linux kernel.)

Changing to unicode_literals will likely introduce regressions on Python 2 that require an initial investment of time to find and fix. The APIs may be changed in subtle ways that are not immediately obvious.
An example on Python 2:
### Module: mypaths.py
...
def unix_style_path(path):
    return path.replace('\\', '/')
...
### User code:
>>> from mypaths import unix_style_path
>>> path1 = '\\Users\\Ed'
>>> unix_style_path(path1)
'/Users/Ed'
On Python 2, adding a unicode_literals import to mypaths.py would change the return type of the unix_style_path function from str to unicode in the user code, which is difficult to anticipate and probably unintended.
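The change in return type can be observed without editing any file by compiling the same source twice, once with the unicode_literals compiler flag and once without. This is only an illustrative sketch: the module source is inlined as a string, load_unix_style_path is a hypothetical helper, and the difference is visible only on Python 2, since on Python 3 the flag has no effect.

```python
import __future__

# Source of the hypothetical mypaths.py module from the example above.
SOURCE = "def unix_style_path(path):\n    return path.replace('\\\\', '/')\n"


def load_unix_style_path(flags):
    """Compile SOURCE with the given compiler flags and return the function."""
    namespace = {}
    # dont_inherit=True keeps this module's own __future__ imports out of it.
    code = compile(SOURCE, 'mypaths.py', 'exec', flags, True)
    exec(code, namespace)
    return namespace['unix_style_path']


plain = load_unix_style_path(0)
flagged = load_unix_style_path(__future__.unicode_literals.compiler_flag)

path1 = str('\\Users\\Ed')      # native str on both Py2 and Py3
result_plain = plain(path1)
result_flagged = flagged(path1)
# On Python 2, type(result_plain) is str but type(result_flagged) is unicode,
# because str.replace() called with unicode arguments returns unicode.
# On Python 3, both results are str.
```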
The counter-argument is that this code is broken, in a portability sense; we see this from Python 3 raising a TypeError upon passing the function a byte-string. The code needs to be changed to make explicit whether the path argument is to be a byte string or a unicode string.
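One way to make the accepted types explicit is to branch on the argument type and use literals of the matching kind. This is a sketch of one possible design, not the approach prescribed by future; returning the same type as the input is an assumption made here for illustration.

```python
def unix_style_path(path):
    """Replace backslashes with forward slashes.

    Accepts a byte string or a text string and returns the same type,
    regardless of whether unicode_literals is in effect.
    """
    if isinstance(path, bytes):
        # Byte-string input: use explicit byte literals.
        return path.replace(b'\\', b'/')
    # Text input: use explicit unicode literals (valid on Py2 and Py3.3+).
    return path.replace(u'\\', u'/')


text_result = unix_style_path(u'\\Users\\Ed')
bytes_result = unix_style_path(b'\\Users\\Ed')
```

Because every literal carries an explicit b'' or u'' prefix, the function behaves the same with or without a unicode_literals import in its module.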
With unicode_literals in effect, there is no way to specify a native string literal (str type on both platforms). This can be worked around as follows:
>>> from __future__ import unicode_literals
>>> ...
>>> from future.utils import bytes_to_native_str as n
>>> s = n(b'ABCD')
>>> s
'ABCD' # on both Py2 and Py3
although this incurs a performance penalty (a function call and, on Py3, a decode method call).
This is a little awkward because various Python library APIs (standard and non-standard) require a native string to be passed on both Py2 and Py3. (See Standard library incompatibilities for some examples. WSGI dictionaries are another.)
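For readers who want to see what such a conversion involves, here is a minimal stand-alone sketch that does not depend on future. The helper to_native_str is hypothetical (it is not the future.utils API), and the ASCII default encoding is an assumption that suits header-like values.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native_str(value, encoding='ascii'):
    """Return value as the native str type (bytes on Py2, text on Py3)."""
    if isinstance(value, str):
        return value                    # already a native string
    if PY2:
        return value.encode(encoding)   # unicode -> byte str on Py2
    return value.decode(encoding)       # bytes -> text str on Py3


# WSGI-style response headers, which must be native strings on both platforms.
headers = [(to_native_str(u'Content-Type'), to_native_str(b'text/plain'))]
```

The sketch handles only byte strings and text strings; any other argument type raises an AttributeError.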
If a codebase already explicitly marks up all text with u'' prefixes, and if support for Python versions 3.0-3.2 can be dropped, then removing the existing u'' prefixes and replacing these with unicode_literals imports (the porting approach Django used) would introduce more noise into the patch and make it more difficult to review. However, note that the futurize script takes advantage of PEP 414 and does not remove explicit u'' prefixes that already exist.
Turning on unicode_literals converts even docstrings to unicode, but Pydoc breaks with unicode docstrings containing non-ASCII characters for Python versions < 2.7.7 (fix committed in Jan 2014):
>>> def f():
...     u"Author: Martin von Löwis"
...
>>> help(f)
/Users/schofield/Install/anaconda/python.app/Contents/lib/python2.7/pydoc.pyc in pipepager(text, cmd)
1376 pipe = os.popen(cmd, 'w')
1377 try:
-> 1378 pipe.write(text)
1379 pipe.close()
1380 except IOError:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 71: ordinal not in range(128)
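Before turning on unicode_literals in a codebase that must run on those older Python 2.7 releases, it may help to find the docstrings that would trigger this failure. The following is a rough sketch using the standard ast module; the helper name non_ascii_docstrings is hypothetical, and the sample source is inlined as a string.

```python
import ast


def non_ascii_docstrings(source):
    """Return names of modules, classes and functions with non-ASCII docstrings."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef)):
            continue
        doc = ast.get_docstring(node)
        if doc and any(ord(ch) > 127 for ch in doc):
            found.append(getattr(node, 'name', '<module>'))
    return found


# The docstring from the example above, with the non-ASCII letter escaped so
# that this snippet itself stays pure ASCII.
SOURCE = 'def f():\n    u"Author: Martin von L\\u00f6wis"\n'
flagged = non_ascii_docstrings(SOURCE)   # ['f']
```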
See this Stack Overflow thread for other gotchas.
In favour of unicode_literals

Django recommends importing unicode_literals as its top porting tip for migrating Django extension modules to Python 3. The following quote is from Aymeric Augustin on 23 August 2012 regarding why he chose unicode_literals for the port of Django to a Python 2/3-compatible codebase:
“... I’d like to explain why this PEP [PEP 414, which allows explicit u'' prefixes for unicode literals on Python 3.3+] is at odds with the porting philosophy I’ve applied to Django, and why I would have vetoed taking advantage of it.
“I believe that aiming for a Python 2 codebase with Python 3 compatibility hacks is a counter-productive way to port a project. You end up with all the drawbacks of Python 2 (including the legacy u prefixes) and none of the advantages Python 3 (especially the sane string handling).
“Working to write Python 3 code, with legacy compatibility for Python 2, is much more rewarding. Of course it takes more effort, but the results are much cleaner and much more maintainable. It’s really about looking towards the future or towards the past.
“I understand the reasons why PEP 414 was proposed and why it was accepted. It makes sense for legacy software that is minimally maintained. I hope nobody puts Django in this category!”
Against unicode_literals

“There are so many subtle problems that unicode_literals causes. For instance lots of people accidentally introduce unicode into filenames and that seems to work, until they are using it on a system where there are unicode characters in the filesystem path.”
-- Armin Ronacher
“+1 from me for avoiding the unicode_literals future, as it can have very strange side effects in Python 2.... This is one of the key reasons I backed Armin’s PEP 414.”
-- Nick Coghlan
“Yeah, one of the nuisances of the WSGI spec is that the header values IIRC are the str or StringType on both py2 and py3. With unicode_literals this causes hard-to-spot bugs, as some WSGI servers might be more tolerant than others, but usually using unicode in python 2 for WSGI headers will cause the response to fail.”
-- Antti Haapala