u/DiscoProphecy

Hey! I've been working on this for about a month and thought I would show folks. I've gotten to the point in learning Japanese where I'm just on the edge of being able to comfortably immerse, but I found sentence mining on mobile pretty frustrating. The question also kept coming up of which words would really help break down barriers to understanding a work.

Which made me wonder, what is understanding a work? What is friction in reading? Frequency analysis is fine and there are tools for that, but to me the issue comes after you get past the first few hundred most frequent words, when everything starts appearing only 2 or 3 times in a work. What qualitative analysis can you give to words past that point? I already had the books I wanted to read in digital form so... why not process them and figure that out?

So I tried a few things. Lots of dead ends. Then I landed on an idea I really liked: what if you chronologically map each occurrence of a word to visualize the reading experience? You upload your vocabulary and assign a set of rules to known and unknown words, ignoring particles and auxiliary verbs. Unknown words add acceleration, known words add a bunch of braking power. Make some adjustments, like diminishing acceleration, so it doesn't look super crazy and spiky. Normalize the levels between 0 and 1 and BAM. You get a visual with large peaks in runs of unknown words, frequent smaller bumps in easier sections where you're alternating between known and unknown, and smooth road when it's a stretch of known words.

From there you can use something called a peak shaving algorithm, which power companies use to figure out how to lower usage spikes, with learning a word being the way the algorithm lowers that burden. Then you assign a few factors, like rewarding a word for being in the top 20k most frequent words and giving high priority to high-frequency words, so you still get that initial important bunch really quickly. From there it ranks the rest in terms of burden reduction, all the way until the graph is fully flattened. It took forever to get right but I am stoked with how the simulations look now. Each step looks like it seriously reduces the visualized friction.
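
For anyone who wants to play with the idea, here is a rough sketch of the friction curve and a study order built on it. This is not the project's actual code: the `accel`, `brake`, and `damping` values are made-up defaults, and the greedy "learn whichever word removes the most total friction" loop is a simple stand-in for the peak shaving step, with no frequency weighting.

```python
from typing import AbstractSet, List, Sequence


def raw_friction(tokens: Sequence[str], known: AbstractSet[str],
                 ignored: AbstractSet[str] = frozenset(),
                 accel: float = 1.0, brake: float = 0.5,
                 damping: float = 0.7) -> List[float]:
    """One friction level per counted token, in reading order (not normalized)."""
    level, streak, out = 0.0, 0, []
    for tok in tokens:
        if tok in ignored:           # particles, auxiliary verbs, etc.
            continue
        if tok in known:
            level *= brake           # known word: braking power
            streak = 0
        else:
            # Unknown word: acceleration that diminishes over a run of unknowns.
            level += accel * damping ** streak
            streak += 1
        out.append(level)
    return out


def normalize(levels: Sequence[float]) -> List[float]:
    """Scale levels into [0, 1] for plotting."""
    peak = max(levels, default=0.0)
    return [x / peak for x in levels] if peak > 0 else list(levels)


def greedy_study_order(tokens: Sequence[str], known: AbstractSet[str],
                       ignored: AbstractSet[str] = frozenset()) -> List[str]:
    """Repeatedly learn the word whose removal lowers total friction the most."""
    learned = set(known)
    candidates = {t for t in tokens if t not in learned and t not in ignored}
    order = []
    while candidates:
        # sorted() only makes tie-breaking deterministic.
        best = min(sorted(candidates),
                   key=lambda w: sum(raw_friction(tokens, learned | {w}, ignored)))
        order.append(best)
        learned.add(best)
        candidates.remove(best)
    return order
```

Comparisons between "before" and "after" use the raw curve, since normalizing each curve to its own peak would hide the improvement. The loop re-simulates the whole text for every candidate, so a real book would need something smarter than this.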

Then I just kind of went from there. I used flow and frequency coverage to assign projects letter grades, so you can lock in a plan to reach a certain grade and get the word list you need to get there. I added a multi-step timeline so you can see how working towards one work affects the others; it generates a study step between each one, with the number of words left after the overlap with all previous projects is taken into account. So instead of learning 1400 words to read SAO Progressive, I can create a longer timeline of more shit I want to read at various difficulties and have it sort into steps of 200-400 words between each one.
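
The overlap part of the timeline boils down to set subtraction. A minimal sketch, assuming each project already has its target word list in priority order (the function name and the example titles are made up for illustration):

```python
from typing import AbstractSet, List, Sequence, Tuple


def study_steps(known: AbstractSet[str],
                timeline: Sequence[Tuple[str, Sequence[str]]]
                ) -> List[Tuple[str, List[str]]]:
    """For each project in reading order, list only the words not yet covered.

    `timeline` holds (title, target_words) pairs; words learned for an
    earlier project are not counted again for a later one.
    """
    learned = set(known)
    steps = []
    for title, target_words in timeline:
        # dict.fromkeys drops duplicates while keeping priority order.
        new_words = [w for w in dict.fromkeys(target_words) if w not in learned]
        steps.append((title, new_words))
        learned.update(new_words)
    return steps


plan = study_steps(
    known={"犬"},
    timeline=[
        ("Book A", ["犬", "猫", "鳥"]),
        ("Book B", ["猫", "魚", "鳥", "馬"]),
    ],
)
for title, words in plan:
    print(title, len(words), words)
```

Here the second book only needs two new words, because the other two were already picked up on the way to the first one.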

Then to tie the knot I added the Anki output. I used AnkiConnect to allow for card generation, preview, insertion, and editing. I also tried to make it as foolproof as possible against the common foibles of word-sourced flash card generation. The main method is indexing every sentence in the scanned work that uses a given word, then filtering those sentences and ranking them based on how ideal the length is, how complete a sentence the extraction algorithm thinks it is, and its i+n value, with i+1 being the default and i+2 or i+3 as the fallback in the worst case. Then it compares the word on the card with the chosen sentence to try and ensure it uses the proper reading when generating the furigana. (The tokenizer can sometimes be a bit aggressive in kanjifying kana words, so the word needs to be brought back to its true reading.) I went down a bunch of rabbit holes looking for how to get audio, to the point where I was about to try and build an audiobook extractor before I started to get tired of the project and just decided to go TTS. So I can auto-generate TTS for the sentence using API calls to a high-end TTS provider.
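
The i+n part of the sentence ranking can be sketched roughly like this. It assumes sentences are already tokenized, it leaves out the "how complete is this sentence" score entirely, and `ideal_len` and `max_n` are placeholder values rather than the ones the tool uses:

```python
from typing import AbstractSet, List, Sequence, Tuple


def rank_sentences(target: str,
                   sentences: Sequence[Sequence[str]],
                   known: AbstractSet[str],
                   ignored: AbstractSet[str] = frozenset(),
                   ideal_len: int = 12,
                   max_n: int = 3) -> List[Tuple[int, List[str]]]:
    """Return (n, sentence) pairs for sentences containing `target`.

    n is the number of distinct unknown words in the sentence, so i+1
    sentences sort first and i+2 / i+3 are kept only as fallbacks.
    Ties are broken by closeness to `ideal_len` tokens.
    """
    ranked = []
    for tokens in sentences:
        if target not in tokens:
            continue
        unknown = {t for t in tokens if t not in known and t not in ignored}
        n = len(unknown)
        if 1 <= n <= max_n:
            ranked.append((n, abs(len(tokens) - ideal_len), list(tokens)))
    ranked.sort(key=lambda item: (item[0], item[1]))
    return [(n, tokens) for n, _, tokens in ranked]


known_words = {"私", "本", "読む", "猫"}
particles = {"は", "を", "が"}
candidates = rank_sentences(
    "魚",
    [
        ["私", "は", "本", "を", "読む"],      # does not contain the target
        ["猫", "が", "魚", "を", "食べる"],    # i+2: 魚 and 食べる are unknown
        ["私", "は", "魚", "を", "読む"],      # i+1: only 魚 is unknown
    ],
    known=known_words,
    ignored=particles,
)
```

The first entry of `candidates` is the i+1 sentence, and the i+2 one is only there in case nothing better exists.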

And Bob's your uncle: I can go from scanning ebooks, game scripts, OCR'd manga, and subtitle files to making personalized Anki decks based on the literacy level I want to reach in a given work, and see how that affects all of the other stuff I want to read in the future. That way the time I spend learning words is used as efficiently as possible, and every learned word gets immediately reinforced when I read the project in question.
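
For the last hop into Anki, an AnkiConnect request is just a JSON POST to the add-on's local server. A minimal sketch using the `addNote` action: the deck name and field contents are invented for the example, and the `Front`/`Back` field names assume Anki's stock "Basic" note type, not whatever note type the tool really uses.

```python
import json
import urllib.request
from typing import Any, Dict, Iterable, Mapping

ANKI_CONNECT_URL = "http://127.0.0.1:8765"  # AnkiConnect's default address


def build_add_note(deck: str, model: str, fields: Mapping[str, str],
                   tags: Iterable[str] = ()) -> Dict[str, Any]:
    """Build the JSON body for AnkiConnect's `addNote` action (API version 6)."""
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,
                "modelName": model,
                "fields": dict(fields),
                "tags": list(tags),
            }
        },
    }


def invoke(payload: Dict[str, Any], url: str = ANKI_CONNECT_URL) -> Any:
    """POST a request to a running Anki with AnkiConnect; raise on an API error."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    if reply.get("error") is not None:
        raise RuntimeError(reply["error"])
    return reply["result"]


# Hypothetical card; nothing is sent until invoke(payload) is called.
payload = build_add_note(
    deck="Immersion::Example",
    model="Basic",
    fields={"Front": "魚", "Back": "私は魚を読む"},
    tags=["immersion-aid"],
)
```

Calling `invoke(payload)` only works while Anki is open with the AnkiConnect add-on installed; on success it returns the new note's ID.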

Other stuff:

-WaniKani-aware furigana, with the ability to update an entire deck to remove readings when you learn new kanji.

-Game scripts DO work, but given how nonlinear they tend to be, the visualization isn't really going to be accurate. I don't think there's really a way around that.

-Maybe someday I'll add chapter markers from epubs or page markers from manga but for now I'm considering this a completed project.

-Japanese tokenization is horrible and so touchy.

-The point of minimizing the study steps is to only really have one solo study session before the first project and then be able to juggle reading the target book or whatever while you work towards the next one and so on.

-This is probably strictly worse or at best equal in efficacy when compared to pure sentence mining, I just get burned out pretty quickly at my current level.
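
On the WaniKani-aware furigana item from the list above, the deck update can be pictured as a text pass over Anki-style `漢字[かんじ]` furigana that drops a reading once every kanji under it is learned. A rough sketch, assuming the set of learned kanji has already been pulled from WaniKani somehow; the function names are made up and the kanji check only covers the main CJK block:

```python
import re
from typing import AbstractSet

# Anki-style furigana: an optional separating space, a base, then [reading].
FURIGANA = re.compile(r" ?([^ \[\]]+)\[([^\[\]]+)\]")


def is_kanji(ch: str) -> bool:
    """Approximate check: CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def strip_known_readings(text: str, learned_kanji: AbstractSet[str]) -> str:
    """Remove the [reading] from every base whose kanji are all learned."""
    def replace(match: "re.Match[str]") -> str:
        base = match.group(1)
        if all(ch in learned_kanji for ch in base if is_kanji(ch)):
            return base            # drop the reading and the separator space
        return match.group(0)      # keep the furigana as it was
    return FURIGANA.sub(replace, text)


updated = strip_known_readings("私[わたし]は 本[ほん]を 読[よ]む", {"私", "本"})
print(updated)
```

Running the same pass over every card field after a new batch of kanji is learned would give the "update an entire deck" behavior.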

FINALLY, does it work?

I dunno! I think the theory of flow reduction in applied literacy makes sense but I think the placebo effect of just feeling like I'm learning words with a goal in mind and seeing the changes in the graph before I get to read the thing I'm working towards will be valuable enough. My girlfriend is also pretty intimidated by immersion and thought it would be nice to have a way to figure out if she was ready to read something before trying to tackle it.

Anywho that's my show and tell. Now I need to learn about 3.5 thousand new words and read 12 books I guess.

My personalized Anki-powered immersion aid