
Why "é" == "é" can be False in Python
Here’s a Unicode gotcha that can cause very confusing bugs in Python:
Two strings can look the same on screen, but still be different internally:
import unicodedata
a = "é" # single code point: U+00E9
b = "e\u0301" # "e" + combining acute accent
print(a)
print(b)
print(a == b)
# False
print(len(a))
# 1
print(len(b))
# 2
They look the same, but Python sees different sequences of Unicode code points.
You can inspect what is really inside the strings with repr() and unicodedata.name():
import unicodedata
for char in "e\u0301":
    print(repr(char), unicodedata.name(char))
Output:
'e' LATIN SMALL LETTER E
'́' COMBINING ACUTE ACCENT
To compare these strings more safely, normalize them first:
import unicodedata
a = "é"
b = "e\u0301"
a_normalized = unicodedata.normalize("NFC", a)
b_normalized = unicodedata.normalize("NFC", b)
print(a_normalized == b_normalized)
# True
NFC converts text into a composed form when possible, so "e" + combining accent becomes the single composed character "é".
This is especially useful when dealing with user input, copied text, filenames, search, imported data, or text from multiple languages.
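If you compare text in more than one place, it can help to wrap the normalization step in a small helper. The sketch below is only an illustration: the name nfc_equal is made up for this post and is not part of unicodedata. It also shows that NFD goes the opposite way and decomposes "é" into two code points.

```python
import unicodedata


def nfc_equal(left, right):
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper for illustration; not part of the standard library.
    """
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" + combining acute accent

print(nfc_equal(composed, decomposed))
# True
print(nfc_equal(composed, "e"))
# False

# NFD splits the composed character back into base letter + accent.
print(len(unicodedata.normalize("NFD", composed)))
# 2
```

Whichever form you pick, the important part is to use the same one on both sides of the comparison, ideally at the point where text enters your program.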
Another related problem: invisible characters.
For example, a zero-width space can silently break comparisons:
text = "hello\u200b"
print(text == "hello")
# False
print(text)
# hello
print(repr(text))
# 'hello\u200b'
The regular print() output can hide what is going on, but repr() escapes the invisible character so you can see it.
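If you would rather find these characters programmatically than eyeball repr() output, one possible approach is to look for the Unicode general category "Cf" (format characters), which is where the zero-width space lives. The helper names below are hypothetical, and treating every "Cf" character as unwanted is an assumption that may not fit your data.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, code point, name) for each character in category Cf.

    Hypothetical helper; "Cf" covers format characters such as U+200B.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_format_chars(text):
    """Return text with every category-Cf character removed (assumed policy)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


text = "hello\u200b"

print(find_format_chars(text))
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]
print(strip_format_chars(text) == "hello")
# True
```

Be careful with blanket stripping: the zero-width joiner (U+200D) is also in category "Cf" and is meaningful inside some emoji sequences and scripts, so reporting the characters is often safer than silently deleting them.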
You can read a more in-depth explanation here:
https://pythonkoans.substack.com/p/koan-15-the-invisible-ink