To reduce memory consumption and improve performance, Python uses three kinds of internal representations for Unicode strings:
- 1 byte per char (Latin-1 encoding)
- 2 bytes per char (UCS-2 encoding)
- 4 bytes per char (UCS-4 encoding)
When programming in Python, all strings behave the same, and most of the time we don’t notice any difference. However, the difference can be significant, and sometimes unexpected, when working with large amounts of text.
To see the difference in internal representations, we can use the sys.getsizeof function, which returns the size of an object in bytes:
>>> import sys
>>> string = 'hello'
>>> sys.getsizeof(string)
54
>>> # 1-byte encoding
>>> sys.getsizeof(string+'!')-sys.getsizeof(string)
1
>>> # 2-byte encoding
>>> string2 = '你'
>>> sys.getsizeof(string2+'好')-sys.getsizeof(string2)
2
>>> sys.getsizeof(string2)
76
>>> # 4-byte encoding
>>> string3 = '🐍'
>>> sys.getsizeof(string3+'💻')-sys.getsizeof(string3)
4
>>> sys.getsizeof(string3)
80
As you can see, depending on the content of a string, Python uses different encodings. Note that every string in Python takes an additional 49-80 bytes of memory, where it stores supplementary information such as the hash, length, length in bytes, encoding type and string flags. That’s why an empty string takes 49 bytes of memory (the exact figures depend on the CPython version and platform).
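To make the effect on large texts concrete, here is a minimal sketch that measures the per-character width indirectly. The helper name bytes_per_char is made up for this illustration, and the approach assumes CPython 3.3 or later, where sys.getsizeof reflects the compact string layout described above.

```python
import sys

def bytes_per_char(text):
    """Estimate storage per character by comparing text with text doubled.

    The fixed header cancels out in the subtraction, so only the
    character data remains. Hypothetical helper, CPython-specific.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return (sys.getsizeof(text + text) - sys.getsizeof(text)) // len(text)

ascii_text = "a" * 1000
mixed_text = ascii_text + "🐍"   # a single wide character widens every character

print(bytes_per_char(ascii_text))   # 1 on CPython 3.3+
print(bytes_per_char(mixed_text))   # 4 on CPython 3.3+
print(sys.getsizeof(ascii_text), sys.getsizeof(mixed_text))
```

The last line prints the total sizes, which are version-dependent, but the second string should be roughly four times larger than the first even though it is only one character longer.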
We can retrieve the encoding directly from an object using ctypes:
import ctypes

class PyUnicodeObject(ctypes.Structure):
    # internal fields of the string object
    # (mirrors CPython's struct layout, which is an implementation detail)
    _fields_ = [("ob_refcnt", ctypes.c_ssize_t),  # Py_ssize_t in CPython
                ("ob_type", ctypes.c_void_p),
                ("length", ctypes.c_ssize_t),
                ("hash", ctypes.c_ssize_t),
                ("interned", ctypes.c_uint, 2),
                ("kind", ctypes.c_uint, 3),
                ("compact", ctypes.c_uint, 1),
                ("ascii", ctypes.c_uint, 1),
                ("ready", ctypes.c_uint, 1),
                # ...
                # ...
                ]

def get_string_kind(string):
    # in CPython, id() returns the memory address of the object
    return PyUnicodeObject.from_address(id(string)).kind
>>> get_string_kind('Hello')
1
>>> get_string_kind('你好')
2
>>> get_string_kind('🐍')
4
If all characters in a string fit in the Latin-1 range (which includes ASCII), they are stored using the 1-byte Latin-1 encoding. Basically, Latin-1 represents the first 256 Unicode characters. It supports many languages written in the Latin alphabet, such as English, Swedish, Italian, Norwegian and so on. However, it cannot store text in non-Latin scripts, such as Chinese, Japanese, Hebrew or Cyrillic. That is because their code points (numerical indexes) are defined outside of the 1-byte (0-255) range.
>>> ord('a')
97
>>> ord('你')
20320
>>> ord('!')
33
Most of the popular natural languages fit in the 2-byte (UCS-2) encoding. The 4-byte (UCS-4) encoding is used when a string contains special symbols, emojis or rare languages. There are almost 300 blocks (ranges) in the Unicode standard. The characters that need 4 bytes are the ones in blocks whose code points lie above 0xFFFF.
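The selection rule described above can be sketched in pure Python. This is only an illustration of the rule, not CPython's actual C code, and the function name expected_kind is hypothetical: the width is determined by the largest code point in the string.

```python
def expected_kind(text):
    """Return the per-character width (1, 2 or 4) implied by the largest code point.

    Hypothetical helper that mirrors the rule from the text.
    """
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:        # fits in Latin-1
        return 1
    if highest <= 0xFFFF:      # fits in UCS-2
        return 2
    return 4                   # needs UCS-4

# "caf\u00e9" contains U+00E9, which is outside ASCII but inside Latin-1.
for sample in ("Hello", "caf\u00e9", "你好", "🐍"):
    print(repr(sample), hex(max(map(ord, sample))), expected_kind(sample))
```

Because one character above 0xFFFF is enough to push the result to 4, a long ASCII document with a single emoji is stored with 4 bytes for every character.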
Why Python doesn’t use UTF-8 encoding internally
The most well-known and popular Unicode encoding is UTF-8, but Python doesn’t use it internally.
When a string is stored in the UTF-8 encoding, each character is encoded using 1-4 bytes, depending on the character it represents. It’s a storage-efficient encoding, but it has one significant disadvantage. Since characters vary in byte length, there is no way to randomly access an individual character by index without scanning the string. So, to perform a simple operation such as indexing into a string, with UTF-8 Python would need to scan the string until it finds the required character. Fixed-length encodings don’t have this problem: to locate a character by index, Python just multiplies the index by the length of one character (1, 2 or 4 bytes).
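The difference between the two lookups can be sketched as follows. This is an assumption-laden illustration, not what CPython does internally: it uses UTF-32 (little-endian) as a stand-in for a fixed-width representation, and the function names are made up for the example.

```python
def utf8_sequence_length(lead_byte):
    """Length of a UTF-8 sequence, derived from its first byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

def utf8_char_at(data, index):
    """Variable width: walk over every preceding character (linear time)."""
    if index < 0:
        raise IndexError(index)
    pos = 0
    for _ in range(index):
        pos += utf8_sequence_length(data[pos])  # IndexError if out of range
    return data[pos:pos + utf8_sequence_length(data[pos])].decode("utf-8")

def fixed_width_char_at(data, index, width=4):
    """Fixed width: the offset is simply index * width (constant time)."""
    start = index * width
    if not 0 <= start < len(data):
        raise IndexError(index)
    return data[start:start + width].decode("utf-32-le")

text = "a你🐍b"
utf8_bytes = text.encode("utf-8")        # 1 + 3 + 4 + 1 bytes
utf32_bytes = text.encode("utf-32-le")   # 4 bytes per character

print(utf8_char_at(utf8_bytes, 2))
print(fixed_width_char_at(utf32_bytes, 2))
```

Both calls return the same character, but the first has to inspect the two characters before it, while the second jumps straight to byte offset 8.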
When working with empty strings or ASCII strings of one character Python uses string interning. Interned strings act as singletons, that is, if you have two identical strings that are interned, there is only one copy of them in the memory.
>>> a = 'hello'
>>> b = 'world'
>>> a[4],b[1]
('o', 'o')
>>> id(a[4]), id(b[1]), a[4] is b[1]
(4567926352, 4567926352, True)
>>> id('')
4545673904
>>> id('')
4545673904
As you can see, both string slices point to the same address in the memory. It’s possible because Python strings are immutable.
In Python, string interning is not limited to characters or empty strings. Strings that are created during code compilation can also be interned if their length does not exceed 20 characters (this limit is a CPython implementation detail and may differ between versions). This includes:
- variable names
- argument names
- constants (all strings that are defined in the code)
- keys of dictionaries
- names of attributes
When you hit Enter in the Python REPL, your statement gets compiled down to bytecode. That’s why all short strings in the REPL are also interned.
>>> a = 'teststring'
>>> b = 'teststring'
>>> id(a), id(b), a is b
(4569487216, 4569487216, True)
>>> a = 'test'*5
>>> b = 'test'*5
>>> len(a), id(a), id(b), a is b
(20, 4569499232, 4569499232, True)
>>> a = 'test'*6
>>> b = 'test'*6
>>> len(a), id(a), id(b), a is b
(24, 4569479328, 4569479168, False)
In the following example the strings are not interned, because strings created at run time are not constants:
>>> open('test.txt','w').write('hello')
5
>>> open('test.txt','r').read()
'hello'
>>> a = open('test.txt','r').read()
>>> b = open('test.txt','r').read()
>>> id(a), id(b), a is b
(4384934576, 4384934688, False)
>>> len(a), id(a), id(b), a is b
(5, 4384934576, 4384934688, False)
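Strings created at run time can still be interned explicitly with the standard library function sys.intern. The short sketch below builds two equal strings at run time (using str.join instead of a file, purely to keep the example self-contained) and then interns them.

```python
import sys

# Build two equal strings at run time, so they are not compile-time constants.
parts = ["hel", "lo"]
a = "".join(parts)
b = "".join(parts)
print(a == b, a is b)   # equal content; CPython normally creates two objects

# sys.intern returns the canonical copy of the string.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)           # True: both names now refer to the same object
```

This can be worthwhile when a program keeps many copies of the same run-time string, for example repeated field names parsed from a file.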
The string interning technique saves tens of thousands of duplicate string allocations. Internally, string interning is maintained by a global dictionary in which strings are used as keys. To check whether an identical string is already in memory, Python performs a dictionary membership operation.
The unicode object is almost 16 000 lines of C code, so there are a lot of small optimizations that are not mentioned in this article. If you would like to learn more about Unicode in Python, I would recommend reading the PEPs about strings and checking the code of the unicode object.