[Research] Help build the first public dataset on personalized vocabulary complexity
TL;DR: There's no public dataset of what real language learners actually study and how their memory responds to it. Existing data either captures words without memory patterns, or memory patterns without the words. This survey collects both: Anki cards + review logs, from real learners, in any language. Participation takes ~10 minutes and, for privacy, everything runs on your machine until you submit. You review every card and exclude anything you don't want to share. It is fully GDPR-compliant. The dataset will be released openly so anyone - not just commercial platforms - can build on it.
Survey link: https://nekear.me/research
Below is more information on why this may matter to you, participation, privacy, the purpose of this research, and its novelty - in that order.
Why this can matter to you as a learner
The most immediate benefit is that in just 10 minutes you're directly contributing to research that hasn't been done before, and to a dataset that will become a permanent public resource for the entire language-learning research community.
Longer term, this same research makes a new generation of learning tools possible:
- deck recommenders that know which words you're actually ready for;
- vocabulary sequencers tuned to your prior knowledge;
- smarter spaced repetition schedulers built on personal memory patterns instead of population averages.
And because the dataset will be public, anyone will be able to build them, not just one company.
Who can participate
To make the research outcomes meaningful, the dataset requires its content to follow specific rules. You're welcome to participate if:
- You actively use Anki for language learning;
- Your deck has been reviewed enough that some (not necessarily all) cards have 5+ reviews. This is the point where review patterns start to reflect actual memory rather than early-stage, half-random answers, but submissions below that threshold still help.
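If you'd like to check the 5+ reviews criterion yourself, here is a minimal sketch. It assumes the review log is available as a SQLite table named revlog with one row per review and a cid column holding the card ID, which is how Anki's collection database stores it. The demo data and the helper names are hypothetical, and this is not the survey's own code.

```python
import sqlite3


def cards_with_min_reviews(conn: sqlite3.Connection, min_reviews: int = 5) -> int:
    """Count cards that have at least `min_reviews` rows in the review log."""
    row = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT cid FROM revlog GROUP BY cid HAVING COUNT(*) >= ?"
        ")",
        (min_reviews,),
    ).fetchone()
    return row[0]


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the review log (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revlog (id INTEGER PRIMARY KEY, cid INTEGER)")
    # Card 1 has 6 reviews, card 2 has 2 reviews, card 3 has 5 reviews.
    reviews = [(1,)] * 6 + [(2,)] * 2 + [(3,)] * 5
    conn.executemany("INSERT INTO revlog (cid) VALUES (?)", reviews)
    return conn


if __name__ == "__main__":
    print(cards_with_min_reviews(demo_connection()))
```

To try this on a real deck you would unzip the exported .apkg and open the collection file inside it with sqlite3.connect; the exact file name depends on the export options, so treat that part as an assumption to verify.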
What participation looks like
As mentioned, it takes about 10 minutes, and the steps are as straightforward as I could make them:
- Export your Anki deck (.apkg) with the following checkboxes ticked: "Include scheduling information" (the review logs), "Include deck presets" (the scheduler configuration) and "Support older Anki versions";
- Open the survey link - it includes a built-in utility that opens your deck fully locally and lets you decide what to submit;
- Review your cards in a preview UI. The utility flags potential personal info (emails, phone numbers, names) for your attention. Exclude anything you don't want shared;
- Fill out your language proficiency and pick your domains of interest;
- Click submit. Nothing leaves your machine until this step.
You'll receive a one-time withdrawal token in case you change your mind later.
What's collected and how it's protected
TL;DR:
- Local-first review. The survey allows you to see every card/note before submission and exclude any of them individually should you deem necessary. The tool also flags potential personal information (emails, phone numbers, names). Everything runs locally.
- Identifiers stripped or randomized. Your deck names are replaced with meaningless artificial names, all timestamps (e.g., when your card was created) are offset by a random value, and Anki internal IDs are replaced with synthetic counters;
- GDPR-compliant. Data is stored in the EU, and is encrypted at rest, with a withdrawal mechanism via a one-way token you keep;
- Special-category check. Cards mentioning health, religious, or political content trigger an additional explicit notice under GDPR Article 9.
The full technical schema (every field, what's collected and why, what's transformed, and what's dropped) is accessible here: https://nekear.me/research/data-handling.
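As a rough illustration of the identifier handling described in the list above (not the survey's actual implementation; the linked schema is the authoritative description), the three transformations could look like the sketch below. The Card type, the MAX_OFFSET_MS bound, the deck_N naming scheme, and the choice of a single shared offset for all timestamps are assumptions made for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical bound for the random shift: up to one year in either direction.
MAX_OFFSET_MS = 365 * 24 * 60 * 60 * 1000


@dataclass
class Card:
    card_id: int     # Anki-internal ID
    deck_name: str   # user-chosen deck name
    created_ms: int  # creation time in epoch milliseconds


def anonymize(cards: list[Card], rng: random.Random) -> list[Card]:
    """Replace deck names and IDs, and shift timestamps by one random offset."""
    # Assumption: one offset per submission, so gaps between events are preserved.
    offset = rng.randint(-MAX_OFFSET_MS, MAX_OFFSET_MS)
    deck_aliases: dict[str, str] = {}
    result = []
    for counter, card in enumerate(cards, start=1):
        if card.deck_name not in deck_aliases:
            deck_aliases[card.deck_name] = f"deck_{len(deck_aliases) + 1}"
        result.append(
            Card(
                card_id=counter,  # synthetic counter instead of the Anki ID
                deck_name=deck_aliases[card.deck_name],
                created_ms=card.created_ms + offset,
            )
        )
    return result
```

Using one shared offset in this sketch hides the absolute dates while keeping the intervals between events intact, which is the property memory modeling depends on.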
I recommend reviewing cards and notes manually as well, because the personal-information detection runs entirely on your machine and, as a result, has real limitations.
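To show why manual review still matters, here is a minimal sketch of pattern-based flagging. The two regular expressions and the flag_personal_info helper are illustrative assumptions, not the detector the survey actually uses.

```python
import re

# Deliberately simple patterns; real detectors are more careful than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def flag_personal_info(text: str) -> list[str]:
    """Return the kinds of potential personal info found in a card's text."""
    flags = []
    if EMAIL_RE.search(text):
        flags.append("email")
    if PHONE_RE.search(text):
        flags.append("phone")
    return flags


if __name__ == "__main__":
    print(flag_personal_info("Write to anna@example.com"))         # caught
    print(flag_personal_info("Mein Freund Jonas wohnt in Berlin"))  # name missed
    print(flag_personal_info("Der Krieg 1914-1918"))                # false alarm
```

Patterns like these catch well-formed emails and phone numbers, but they miss names entirely and can mistake a date range for a phone number, so a quick look through your own cards remains the most reliable check.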
About me and the research
My name is Michael. I'm a student in the Master's in AI program at the University of Galway, Ireland, working on a thesis at the intersection of AI and language learning.
Simply put, the research involves training an AI model that predicts how hard a specific word is for you, given the words you already know and your learning patterns. The model is trained on three inputs:
- The word's morphological features (what parts it's built from) and distributional features (how often it appears in real-world usage) - that's the reason I need your cards;
- Your performance history on similar words - the reason I need your review logs;
- Your language proficiency profile, because your native and other known languages directly affect how you learn new ones - the reason I need your language profile.
You can read more here: https://nekear.me/research/data-handling#what-is-collected.
Why the research is novel
There's prior work on word-difficulty modeling: Duolingo has published a couple of important datasets in this area (HLR in 2016, SLAM in 2018), but both capture learning within Duolingo's own curriculum: platform-chosen words, platform-formatted exercises, platform scheduling. What's missing publicly is data on what learners themselves chose to study, in any language, scheduled by a memory-faithful algorithm like FSRS, with the full card content intact. Existing review-log datasets such as open-spaced-repetition (which FSRS was built on) strip the card content out for privacy, while other public vocabulary research datasets don't include memory data. A dataset that combines both sides doesn't currently exist publicly.
This survey is building the first dataset that has both. Once released publicly, it removes a real bottleneck for anyone working on personalized vocabulary learning.
Questions / concerns
Comment below, DM me, or email me at hi@nekear.me. I'm genuinely happy to discuss methodology, privacy specifics, or anything else.
Cross-posting note
I'll also be posting this in the Anki Forums and the Anki Discord #language-learning channel, with mod coordination. Apologies if you see it more than once. And I appreciate any help spreading the word, as I hope we can make a huge contribution to language learning.
Survey link: https://nekear.me/research