
u/initiald-ejavu

Why do Byte Pair Encoders substitute in order?
Hey guys! I started learning ML about 3 weeks ago and I hit a question that really stumped me.
I watched some "colloquial" explanations of how BPEs work and I understood it generally, but then I tried to implement it by hand. The way I understand it is:
- First break down the text into single char tokens
- Find the most common consecutive pair of single chars
- Substitute that with a new token
- Repeat until you feel like stopping, the vocab reaches a certain number of tokens, or you can't merge anymore because every remaining pair has a frequency of 1
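For reference, here is a minimal sketch of the training loop described in the steps above. The names `train_bpe` and `merge_pair` are made up for illustration, ties between equally frequent pairs are broken arbitrarily, and real tokenizers usually pre-split text into words first, which this sketch skips.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges and return them in the order learned."""
    tokens = list(text)  # start from single-character tokens
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens in the current sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # every remaining pair occurs only once
        merges.append(best)
        tokens = merge_pair(tokens, best)
    return merges


merges = train_bpe("aaabdaaabac", 3)
print(merges)
```

The ordered list of merges is the important output: the merge-by-merge way of encoding replays that list in the same order.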
So... I implemented a tokenizer that does just that. It's when I got to encoding that I started wondering.
The way I made it was I turned the string to encode into a queue, then consumed the largest token I could. So if the vocab had the token "Hello" in it, and the text started with Hello, it's gobbled up and we move on.
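A minimal sketch of that left-to-right, longest-match encoder, in case it helps to see it. The function name `encode_greedy` is hypothetical, and falling back to a single character when nothing in the vocab matches is an assumption added for the sketch.

```python
def encode_greedy(text, vocab):
    """At each position, consume the longest vocab token that matches."""
    max_len = max((len(tok) for tok in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # Assumption: a single character is always allowed as a fallback.
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


print(encode_greedy("Hello world", {"Hello", "He", "wor", "l"}))
# → ['Hello', ' ', 'wor', 'l', 'd']
```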
However, apparently the way it's SUPPOSED to go is that I find the first merge and apply it across the whole string, then move on to the second, then the third, etc.
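A minimal sketch of that merge-by-merge encoding, assuming the merges are stored as an ordered list of `(left, right)` pairs. The name `encode_by_merges` is made up, and this is the naive version without any of the optimizations real libraries use.

```python
def encode_by_merges(text, merges):
    """Apply each learned merge, in training order, across the whole sequence."""
    tokens = list(text)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Toy merge list where ("b", "c") was learned before ("a", "b").
print(encode_by_merges("abc", [("b", "c"), ("a", "b")]))
# → ['a', 'bc']
```

With this toy merge list the two approaches give different tokens: a left-to-right longest-match scan over the vocab {a, b, c, bc, ab} would produce `['ab', 'c']` for the same string, so the difference is not only about speed.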
I understand the second approach is much more efficient, but is that the only reason it is used? I thought that taking the "largest level of abstraction" from left to right is a lot closer to how we process language as humans, so that's why I implemented it that way.

Is malice real?
There are 2 things that can look like malice.
1- Someone sees what they are doing as self-defense of some sort: the world is such that if they don’t act in this hurtful way, they’ll be hurt themselves.
2- Someone knows they are being hurtful and is doing it for its own sake, not for any gain or protection.
I don’t know how someone can tell the difference between 1 and 2. We often assume 2, but later find out it’s 1. Is malice as defined in 2 ever the case?
I feel like I have plenty of evidence of 1, but much less evidence of 2. As much as I hated my bullies I am realizing they may have just been… normal people.
I cannot imagine malice for its own sake. It doesn’t make any sense to me.

I'm playing as Mandala for the first time, and I realized all my tributaries split off after my first ruler died. I made a kingdom-level title with him, but my new guy only has a domain limit of about 4 and my army is too large, so I am at negative income (currently the Chinese exam cheater blackmail industrial complex is keeping me afloat, though).
Will I really have to reconquer every tributary? If I have to reconquer most of my dudes on death then I don't see the point of the "simple" succession of Mandala. Might as well be confederate partition.
It seems weird to me that a government which uses tributaries so heavily has no real way to keep them. Or is there something I am missing?