u/vivis-dev

Here’s a Unicode gotcha that can cause very confusing bugs in Python:

Two strings can look the same on screen, but still be different internally:

import unicodedata

a = "\u00e9"     # "é" as a single code point: U+00E9
b = "e\u0301"   # "e" + combining acute accent

print(a)
print(b)

print(a == b)
# False

print(len(a))
# 1

print(len(b))
# 2

They render the same, but Python compares strings code point by code point, and these are two different sequences: one code point versus two.

You can inspect what is really inside the strings with repr() and unicodedata.name():

import unicodedata

for char in "e\u0301":
    print(repr(char), unicodedata.name(char))

Output:

'e' LATIN SMALL LETTER E
'́' COMBINING ACUTE ACCENT

To compare these strings more safely, normalize them first:

import unicodedata

a = "\u00e9"     # composed "é"
b = "e\u0301"

a_normalized = unicodedata.normalize("NFC", a)
b_normalized = unicodedata.normalize("NFC", b)

print(a_normalized == b_normalized)
# True

NFC (Normalization Form C) first decomposes the text and then recomposes it wherever a precomposed character exists, so "e" + combining accent becomes the single composed character "é".
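To make the direction of the conversion concrete, here is a small sketch that also shows NFD, the decomposed counterpart of NFC. The variable names are just for illustration:

```python
import unicodedata

composed = "\u00e9"      # "é" as one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# NFC composes: two code points collapse into one.
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfc), nfc == composed)
# 1 True

# NFD decomposes: one code point is split into two.
nfd = unicodedata.normalize("NFD", composed)
print(len(nfd), nfd == decomposed)
# 2 True
```

Either form works for comparisons as long as both sides use the same one.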

This is especially useful when dealing with user input, copied text, filenames, search, imported data, or text from multiple languages.
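As a rough sketch of what that can look like in practice, you could normalize text once at the boundary and use the result as a lookup key. normalize_key() and the sample names below are hypothetical, not part of any library:

```python
import unicodedata

def normalize_key(text: str) -> str:
    """Hypothetical helper: return the NFC form of text for use as a lookup key."""
    return unicodedata.normalize("NFC", text)

# Pretend these came from two different sources (a database and a web form).
known_names = {normalize_key("Jos\u00e9")}   # composed "é"
incoming = "Jose\u0301"                      # "e" + combining accent

print(incoming in known_names)
# False

print(normalize_key(incoming) in known_names)
# True
```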

Another related problem: invisible characters.

For example, a zero-width space can silently break comparisons:

text = "hello\u200b"

print(text == "hello")
# False

print(text)
# hello

print(repr(text))
# 'hello\u200b'

The regular print() output can hide what is going on, but repr() makes the invisible character visible.
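If you want to find such characters programmatically, one possible approach is to look for the Unicode "format" category (Cf), which includes the zero-width space. find_format_chars() is a hypothetical helper, and Cf is an assumption about what counts as "invisible"; it will not catch every odd character:

```python
import unicodedata

def find_format_chars(text: str) -> list:
    """Return (index, code point, name) for each character in category Cf."""
    hits = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits

print(find_format_chars("hello\u200b"))
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]

print(find_format_chars("hello"))
# []
```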

You can read a more in-depth explanation here:

https://pythonkoans.substack.com/p/koan-15-the-invisible-ink

Why "é" == "é" can be False in Python