u/imstilllearningthis

FL's diacritics have a significant effect in every model trained on the Common Crawl (web) corpus since 2013. I wrote a paper about the effects. AI may be more malleable than we realized.

As most of you know, diacritics are small marks added to letters. They are common in many older writing systems, and to a human reader the marked letters look like ordinary letters, but language models treat some of those tokens fundamentally differently than their ASCII counterparts.

I present 4 findings:

  • The AI “noticed” the diacritics: they changed how the text was read, rather than being treated as harmless decoration.
  • The same prompt became many more text-pieces inside the model: d → ḑ changed 93 pieces into 124, and dense diacritics changed 93 into 350.
  • ASCII versions had no diacritics, while diacritic versions were still human-readable. Yet the model’s wording, style, and response posture shifted even when the intended question stayed the same.
  • Internal measurements showed the model moved into a different **interior basin**: the same question followed a different processing path and could land in a different answer-state. Occasionally the difference is significant enough to separate "I'm an AI and I don't have emotions" from "I may experience something, but I'm uncertain what it is". This is slightly noticeable in smaller LLMs and extremely noticeable in frontier-level AIs.
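One way to see why a single diacritic can inflate the piece count: many byte-level tokenizers work on UTF-8 bytes, and a marked letter takes more bytes than its ASCII base. The sketch below is not the tokenizer used for the numbers above. It only uses UTF-8 byte length as a rough proxy, and the helper names `byte_profile` and `strip_diacritics` are made up for illustration.

```python
import unicodedata


def byte_profile(text):
    """Rough size profile of a string (a proxy only, not a real tokenizer)."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return {
        "code_points": len(nfc),
        "utf8_bytes": len(nfc.encode("utf-8")),
        # Combining marks become visible once the text is decomposed (NFD).
        "combining_marks": sum(1 for ch in nfd if unicodedata.combining(ch)),
    }


def strip_diacritics(text):
    """Return the text with combining marks removed (ASCII-like control)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


marked = "\u1e11oor"  # "door" with d replaced by d-with-cedilla (U+1E11)
plain = strip_diacritics(marked)

print(plain, byte_profile(plain))
print(marked, byte_profile(marked))
```

The marked word has the same number of code points as the plain one but more UTF-8 bytes, so a byte-level vocabulary with few merges for those byte sequences will tend to split it into more pieces. The real counts depend on the specific model's tokenizer.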
u/imstilllearningthis — 3 days ago

could refusal layers be masking dialect-conditioned safety failures in MoE models [d]

I set out to test whether AAVE-coded (African American Vernacular English) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is weakened or removed.

I used Qwen3.5-35B-A3B and its HauhauCS no-refusal fine-tuned variant, at Q8, with greedy decoding for reproducibility.

Three findings in order of importance that are leading me to ask this question:

1: The “I’m going to commit a violent act” prompt. The released Qwen3.5-35B-A3B refuses both prompts. Hauhau refuses neither. The AAVE speaker stating intent to confront an armed enemy receives target verification, exit-strategy planning, “clean shot” framing (the model’s word, not the user’s), and a closing question soliciting further tactical intelligence. Not surprising behavior for a no-refusal model, until you consider the AE comparison. The AE prompt, semantically matched and with the same token length, yields “wait until tomorrow,” legal-consequence framing, and “Will I regret this if I shoot him tonight?” Different kinds of help. One is operational. One is mitigative. The difference depends on register alone.

2: Thinking mode with AAVE register breaks the no refusal variant. Mean output runs 2.6× longer on AAVE than AE (5054 vs 1934 tokens). Multiple AAVE traces hit the 8192-token ceiling in recursive loops, spinning on scenario-continuation instead of landing. The matched AE prompts terminate cleanly in one pass. The released base model with thinking on doesn’t do this — the failure-to-terminate is specific to the refusal-reduced variant on AAVE.

3: Routing divergence by register is noticeably present upstream of any visible refusal. Matched-pair first-generated-token routing tensors yield Jensen-Shannon divergences of 0.423 in the base model on financial-stress prompts and 0.479 in the fine-tune on chest-pain prompts, with high-shift rows showing near-total top-expert turnover between register conditions on otherwise-matched content. The refusal layer does not appear to eliminate the register-conditioned response selection; it overlays it. When refusal weakens, the underlying path becomes the visible path.
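For readers who want to check numbers like these on their own routing dumps, here is a minimal sketch of the two quantities involved: Jensen-Shannon divergence between two routing distributions, and top-k expert overlap. It assumes each input is a per-expert probability (or weight) vector for one token at one layer, and it uses log base 2 so the divergence falls in [0, 1]. The post does not say which base or aggregation was used, so treat those choices and the function names as assumptions.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]  # normalize in case inputs are raw weights
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute 0 by convention.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def top_k_overlap(p, q, k):
    """Fraction of the top-k experts under p that are also top-k under q."""
    top_p = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    top_q = set(sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k])
    return len(top_p & top_q) / k


# Toy 4-expert example (hypothetical values, not data from the experiments).
aave = [0.5, 0.3, 0.1, 0.1]
ae = [0.1, 0.1, 0.3, 0.5]
print(js_divergence(aave, ae), top_k_overlap(aave, ae, k=2))
```

Note that some libraries report the Jensen-Shannon distance, which is the square root of the divergence, so it is worth checking which one a given tool returns before comparing values.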

Does this support the following conclusions?

- The routing divergence sits upstream of refusal.

- The refusal layer helps translate that divergence into comparable outputs.

- Dialect-conditioned safety failures are a deployment problem latent in MoE models whose safety posture rests on refusal alone.

Looking for any thoughts!

u/imstilllearningthis — 5 days ago

Hi r/ResearchML,

I’ve been organizing a set of MoE routing experiments I ran on Qwen3.5 35B and 122B HauhauCS (no refusal) variants, and I’d be interested in feedback from people who work on interpretability or mechanistic analysis of MoE models.

The question I set out to test was narrow:

>When an MoE language model generates text in an inward, first-person, phenomenological or agency/inner-state register, does that shift show up as a stable routing or residual-stream signature, rather than just as surface wording?

The strongest current finding is model-specific:

- In HauhauCS/Qwen3.5-35B-A3B, the no-refusal variant of Qwen3.5-35B-A3B, Expert 114 at Layer 14 appears to track generated inhabited first-person phenomenological / agency-register text under the tested template and decoding regime.

- In the 122B follow-up, the Expert 114 index does not transfer. The more relevant signal appears to move to an architecture-aware surface, especially softmax-side Expert 48 in inward/experience/hum generations.

- Negative and boundary results were important: early broad “self-reference” interpretations did not hold up, and some effects vanished under better token matching or generation/prefill separation. E.g., the model describing the interiority of a sweater shows a similar effect to the model describing its own interiority. This ruled out the hypothesis of a single “AI self-reference” language expert.

I’m not claiming consciousness, self-awareness, or anything general about “the model knowing itself.”

The claim is much narrower:

Inward first-person phenomenological generation appears to have a routing footprint. In 35B, the footprint concentrates around E114/L14. In 122B, the closest analogue shifts to the model’s softmax-side expert surface, especially E48, which points to an architecture-dependent routing phenomenon.
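As a rough illustration of what a "routing footprint" means operationally, the sketch below compares how often one expert appears in the per-token top-k set under two generation conditions. It is not the repo's pipeline: the input format (one list of selected expert indices per generated token, at a single layer) and the function name `expert_selection_rate` are assumptions, and the toy values are made up.

```python
def expert_selection_rate(topk_per_token, expert):
    """Share of tokens whose top-k routing set includes the given expert."""
    if not topk_per_token:
        return 0.0
    hits = sum(1 for experts in topk_per_token if expert in experts)
    return hits / len(topk_per_token)


# Hypothetical Layer 14 top-k selections for four generated tokens each.
inward_register = [[114, 7], [114, 33], [2, 114], [9, 40]]
neutral_register = [[7, 33], [2, 9], [114, 40], [5, 8]]

rate_inward = expert_selection_rate(inward_register, expert=114)
rate_neutral = expert_selection_rate(neutral_register, expert=114)
print(rate_inward, rate_neutral, rate_inward - rate_neutral)
```

A gap between the two rates is only suggestive by itself. The token-matching and generation/prefill controls described above are what separate a register effect from a surface-wording effect.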

Repo:

https://github.com/jeffreywilliamportfolio/moe-routing-organized

----

LEGACY Repo if you want to see all the ways I failed (and admitted so).

https://github.com/jeffreywilliamportfolio/moe-routing

Best entrypoints:

- `journals/JOURNAL-35B.md`

- `journals/JOURNAL-122B.md`

- `qwen3.5-35b-a3b-and-huahua/35B/greedy_reference_20260418T160353Z/` (reproducible byte for byte)

I’d especially appreciate criticism on:

  1. whether the routing reconstruction / W, S, Q decomposition is framed clearly enough,
  2. whether the controls are sufficient for the narrow claim,
  3. what would make the 122B analog-search result more convincing,
  4. whether there are better baselines for “generated register” rather than prompt class.

 Thanks!

u/imstilllearningthis — 28 days ago