u/initiald-ejavu

Why do Byte Pair Encoders substitute in order?

Hey guys! I just started learning about ML about 3 weeks ago but I got to a question that really stumped me.

I watched some "colloquial" explanations of how BPEs work and I understood it generally, but then I tried to implement it by hand. The way I understand it is:

  1. First break down the text into single char tokens
  2. Find the most common consecutive pair of tokens (single chars on the first pass)
  3. Substitute that with a new token
  4. Repeat until you decide to stop, the vocab reaches a certain number of tokens, or you can't merge anymore because every pair has a frequency of 1
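For reference, here is a minimal sketch of those four steps in Python. The names `merge_pair` and `train_bpe` are made up for illustration, ties between equally common pairs go to whichever pair was seen first, and overlapping pairs (like "aa" in "aaa") are counted naively, so treat it as a rough version rather than how any particular library does it.

```python
from collections import Counter

def merge_pair(tokens, a, b):
    """Replace every non-overlapping occurrence of (a, b) with a + b, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Return the ordered merge list and the final tokenization of `text`."""
    tokens = list(text)                              # step 1: single-char tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))     # step 2: count adjacent pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                                # every pair is unique: stop
            break
        merges.append((a, b))
        tokens = merge_pair(tokens, a, b)            # step 3: substitute new token
    return merges, tokens                            # step 4 is the loop itself

merges, tokens = train_bpe("abababab", 5)
print(merges)   # [('a', 'b'), ('ab', 'ab')]
print(tokens)   # ['abab', 'abab']
```

The important output is the ordered `merges` list, since the order in which merges were learned is what the "apply merges in order" encoding relies on.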

So... I implemented a tokenizer that does just that. It's when I got to encoding that I started wondering.

The way I made it was I turned the string to encode into a queue, then consumed the largest token I could. So if the vocab had the token "Hello" in it, and the text started with Hello, it's gobbled up and we move on.
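A rough sketch of that "gobble the largest token" encoder, to make the idea concrete. `encode_greedy` is a hypothetical name, it uses an index instead of an actual queue, and it assumes any single character is always allowed as a fallback token.

```python
def encode_greedy(text, vocab):
    """Left-to-right longest-match: at each position take the longest vocab token."""
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:   # single chars always match (assumption)
                tokens.append(piece)
                i += length
                break
    return tokens

print(encode_greedy("Hello world", {"Hello", "wor"}))
# ['Hello', ' ', 'wor', 'l', 'd']
```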

However, apparently the way it's SUPPOSED to go is that I find the first merge and apply it across the whole string, then move on to the second, then the third, etc.
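A minimal sketch of that merge-order encoding, assuming the merges are stored as an ordered list of pairs (`encode_by_merges` is a made-up name, and the merge list in the example is hand-picked rather than learned from real text):

```python
def encode_by_merges(text, merges):
    """Apply each learned merge across the whole token list, in learned order."""
    tokens = list(text)
    for a, b in merges:
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hand-picked merge order: "bc" was learned before "ab".
print(encode_by_merges("abc", [("b", "c"), ("a", "b")]))
# ['a', 'bc']
```

This also shows that the two approaches are not just a speed tradeoff. With the vocab {a, b, c, bc, ab}, taking the largest token from the left would give "ab" + "c", while applying merges in order gives "a" + "bc", because the "bc" merge fires first and leaves no "b" for "ab" to use. So the text can end up tokenized differently from how the same string was segmented during training.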

I understand the second approach is much more efficient, but is that the only reason it is used? I thought that taking the "largest level of abstraction" from left to right is a lot closer to how we process language as humans, so that's why I implemented it that way.

u/initiald-ejavu — 7 days ago

Is malice real?

There are 2 things that can look like malice.

1- Someone sees what they are doing as self-defense of some sort: the world is such that if they don’t act in this hurtful way, they’ll be hurt themselves.

2- Someone knows they are being hurtful and is doing it for its own sake, not for any gain or protection.

I don’t know how someone can tell the difference between 1 and 2. We often assume 2, but later find out it’s 1. Is malice, as defined in 2, ever the case?

I feel like I have plenty of evidence of 1, but much less evidence of 2. As much as I hated my bullies I am realizing they may have just been… normal people.

I cannot imagine malice for its own sake. It doesn’t make any sense to me.

u/initiald-ejavu — 11 days ago

Playing as Mandala for the first time, I realized all my tributaries split off after my first ruler died. I made a kingdom-level title with him, but my new guy only has a domain limit of like 4 and my army is too large, so I have negative income (currently the Chinese exam cheater blackmail industrial complex is keeping me afloat, though).

Will I really have to reconquer every tributary? If I have to reconquer most of my dudes on death then I don't see the point of the "simple" succession of Mandala. Might as well be confederate partition.

It seems weird to me that a government which uses tributaries so heavily has no real way to keep them. Or is there something I am missing?

u/initiald-ejavu — 22 days ago