Semantic IDs for vulnerable code: finding 100× more cross-project clones than VUDDY
Semantic IDs for vulnerability detection. The recsys community has spent the last three years building out the TIGER substrate (Rajput et al., 2023): train a Residual-Quantized VAE on top of an item encoder, get a 3-integer discrete code per item, retrieve by hash bucket instead of by nearest neighbor. I applied that substrate to C/C++ vulnerability clone detection.
The result: on a 5,000-function CVE registry, my system (SecSid) surfaces 112 non-fork cross-project vulnerability clones. VUDDY (Kim et al., S&P 2017), the canonical token-hash baseline in this corner of the security literature, finds 1.
How it works:
🔬 Frozen UniXcoder embeds each C/C++ function into a 768-d vector.
🔬 A 3-level Residual-Quantized VAE quantizes it through learned codebooks (128 × 128 × 512).
🔬 Output: a Semantic ID [c1, c2, c3]. c1 = broad family, c2 = specific vector, c3 = exact variant.
🔬 Hashable, prefix-comparable, O(1) lookup at any level. Functions sharing a SID prefix land in the same bucket.
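The steps above can be sketched in a few lines. This is a minimal illustration of greedy residual quantization, not SecSid's actual code. It assumes random stand-in codebooks in place of learned ones, an 8-d toy vector in place of the 768-d UniXcoder embedding, and it skips the VAE encoder and decoder entirely. The names `nearest`, `residual_quantize`, and `shared_prefix_len` are hypothetical.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best, best_d = 0, float("inf")
    for i, entry in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if d < best_d:
            best, best_d = i, d
    return best

def residual_quantize(vec, codebooks):
    """Greedy residual quantization: pick one code per level, where each
    level quantizes whatever the previous levels left unexplained."""
    residual = list(vec)
    sid = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(sid)

def shared_prefix_len(a, b):
    """Number of leading codes two Semantic IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Assumption: random codebooks stand in for the learned ones.
# Sizes follow the post (128 x 128 x 512); DIM is shrunk from 768 to 8.
rng = random.Random(0)
DIM = 8
SIZES = (128, 128, 512)
codebooks = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]
             for n in SIZES]

embedding = [rng.gauss(0, 1) for _ in range(DIM)]  # stand-in for a function embedding
sid = residual_quantize(embedding, codebooks)      # a tuple (c1, c2, c3)
```

Because the result is a plain tuple, it can serve directly as a dictionary key, and `shared_prefix_len(sid_a, sid_b)` tells you whether two functions agree on c1 only, on c1 and c2, or on all three codes.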
The 111 clones VUDDY misses share the same abstract vulnerability shape across unrelated projects, but with different identifiers, types, and surface code. The codebook buckets on shape; the token hash sees them as 111 different functions.
A few examples from the registry:
• mysql-server + stunnel + weechat at SID [58, 10, 195]: "SSL credentials transmitted before TLS validation." Database client, TLS proxy, IRC client. Same bug, three domains.
• curl + evolution-ews + mysql-server at [54, 119, 242]: "SSL cert validation incomplete."
• libgcrypt + libssh2 + openssl at [110, 10, 407]: crypto state path. CWE-200, CWE-787.
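Groupings like these fall out of a simple bucket pass over the registry. The sketch below is a hedged illustration, not the SecSid pipeline: the project names and SIDs in the first three rows come from the examples above, while the function names, the remaining rows, and the helper `cross_project_buckets` are hypothetical. It also omits the fork filtering that the "non-fork" count relies on.

```python
from collections import defaultdict

# Illustrative registry rows: (project, function_name, sid).
# Function names are placeholders, and the two curl rows are made up
# to show a bucket that stays inside a single project.
registry = [
    ("mysql-server", "func_a", (58, 10, 195)),
    ("stunnel",      "func_b", (58, 10, 195)),
    ("weechat",      "func_c", (58, 10, 195)),
    ("curl",         "func_d", (54, 119, 242)),
    ("curl",         "func_e", (54, 119, 242)),
    ("openssl",      "func_f", (110, 10, 407)),
]

def cross_project_buckets(rows, depth=3):
    """Group functions by the first `depth` codes of their SID and keep
    only buckets whose members come from at least two distinct projects."""
    buckets = defaultdict(list)
    for project, name, sid in rows:
        buckets[sid[:depth]].append((project, name))
    return {
        prefix: members
        for prefix, members in buckets.items()
        if len({project for project, _ in members}) >= 2
    }

clones = cross_project_buckets(registry)
for prefix, members in clones.items():
    print(list(prefix), sorted(project for project, _ in members))
```

Lowering `depth` to 2 or 1 widens the match from exact variant to broader family, at the cost of looser buckets.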
First application of TIGER-style Semantic IDs to security clone detection that I can find. Mechanism transfer, not a new mechanism.
Full writeup: https://shrikar.com/writing/semantic-ids-for-vulnerable-code