u/MarwanAhmed1074

How would you approach matching and filtering this "dirty" literary data?

Hey everyone,

I'm working on a literature data project and I have hit a massive wall. I'm trying to cross-reference two lists of top literature, but my methodology for filtering the data is a mess. I've been using a free AI tool to do the heavy lifting, but the lists don't fit in its context window and it hallucinates a completely different outcome every time I run it.

I need some advice on how to actually build a workflow for this.

Here are the two datasets I am working with:

List 1: A master list of the Top 10,000 works from TheGreatestBooks.org. This is generated by combining dozens of different "best of" book lists.

List 2: The 1,514 works listed in the appendix of literary critic Harold Bloom’s book, The Western Canon. (I probably need help with this too: I found sources online that claim to have the full appendix, but each one is slightly different from the others. Is there a way to verify that every work in the appendix is actually included?)
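One way to check the competing transcriptions against each other, as a rough sketch: normalize every (author, title) pair and see which entries appear in all sources and which appear in only some. The function names and the input shape (a dict of source name to list of pairs) are made up for illustration, and this only shows where the sources disagree. The printed book is still the only thing that settles which one is right.

```python
import re
import unicodedata
from collections import defaultdict

def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def compare_sources(sources):
    """sources: dict of source name -> list of (author, title) pairs.

    Returns (agreed, disputed): entries found in every source, and a dict
    of the remaining entries mapped to the sources that do contain them.
    """
    seen = defaultdict(set)
    for name, entries in sources.items():
        for author, title in entries:
            seen[(normalize(author), normalize(title))].add(name)
    agreed = sorted(k for k, v in seen.items() if len(v) == len(sources))
    disputed = {k: sorted(v) for k, v in seen.items() if len(v) < len(sources)}
    return agreed, disputed
```

The disputed entries are the short list worth checking by hand, which should be far smaller than 1,514.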

My goal is to filter Bloom's academic list against the Top 10,000 list to create a final, definitive list.

My initial methodology is to first purge any non-narrative forms of literature, and then filter the Bloom list by each work's rank in the Top 10,000, applying these rules per author:

If an author has 5+ works in the Top 500, keep their top 5.

If 4+ works in the Top 1,000, keep their top 4.

If 3+ works in the Top 2,000, keep their top 3.

If 2+ works in the Top 5,000, keep their top 2.

If 1+ work in the Top 10,000, keep their top 1.
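Rules like these are deterministic, so they can run as plain code instead of going through an AI. A minimal sketch, assuming the tiers are checked in order and the first one that applies wins (that reading is an assumption, as are the names `TIERS` and `keep_for_author`):

```python
# (works to keep, rank cutoff), checked from strictest to loosest.
TIERS = [(5, 500), (4, 1000), (3, 2000), (2, 5000), (1, 10000)]

def keep_for_author(works):
    """works: list of (title, rank) for one author, already matched to List 1.

    Returns the (title, rank) pairs to keep, best rank first.
    """
    ranked = sorted(works, key=lambda w: w[1])
    for count, cutoff in TIERS:
        in_cutoff = [w for w in ranked if w[1] <= cutoff]
        if len(in_cutoff) >= count:
            return in_cutoff[:count]
    return []  # nothing by this author is ranked in the Top 10,000
```

Running the same input twice gives the same output every time, which removes the "different result on every run" problem from this step.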

But because I'm relying on free AI, this isn't working at all. On top of the AI failing, the data itself is incredibly "dirty":

Harold Bloom doesn't always mention specific titles. For example, his list just says "William Shakespeare: Plays and Poems" or "Anton Chekhov: The Tales". Meanwhile, List 1 ranks individual books (Hamlet, Macbeth, etc.). How can I map these umbrella terms so they actually trigger a match against the individual books in List 1?
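One possible approach, sketched under assumptions: treat a Bloom entry as an umbrella when its title consists only of generic words, and let it match every List 1 work by that author. Specific titles still need an exact normalized match. The word list and the names `is_umbrella` and `expand_entry` are hypothetical, and the word list would need tuning against the real appendix.

```python
import re

# Generic words that, on their own, do not name a specific work (assumed list).
UMBRELLA_WORDS = {
    "plays", "poems", "tales", "stories", "works", "novels", "essays",
    "the", "and", "complete", "collected", "selected", "short",
}

def norm(text):
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())

def is_umbrella(title):
    words = norm(title).split()
    return bool(words) and all(w in UMBRELLA_WORDS for w in words)

def expand_entry(author, title, list1):
    """list1: dict of normalized author name -> list of (title, rank).

    Umbrella entries expand to all of the author's ranked works;
    specific titles match only the same normalized title.
    """
    candidates = list1.get(norm(author), [])
    if is_umbrella(title):
        return sorted(candidates, key=lambda w: w[1])
    wanted = norm(title)
    return [w for w in candidates if norm(w[0]) == wanted]
```

An umbrella like "Plays and Poems" will also pull in poems, so the expanded works still have to pass the narrative filter afterwards.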

Bloom's list includes philosophy, lyric poetry, and essays. I only want to compare narrative literature (novels, epics, plays, short stories). Is there a way to automate purging non-narrative works (maybe pinging an API like Goodreads or OpenLibrary to check the genre tags?) rather than deleting them manually?
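Wherever the genre or subject tags end up coming from, the classification step itself can be a small keyword check that sends anything ambiguous to manual review instead of guessing. A rough sketch that works on a plain list of tag strings; the keyword lists and the name `classify` are assumptions, and fetching the tags from an API is deliberately left out:

```python
import re

# Assumed keyword lists; extend after looking at real tags.
NARRATIVE = ("fiction", "novel", "novels", "drama", "plays", "epic",
             "short stories", "tragedy", "comedy")
NON_NARRATIVE = ("nonfiction", "philosophy", "essays", "lyric",
                 "criticism", "sermons", "letters")

def _pattern(words):
    # Whole-word match, so "fiction" does not fire inside "nonfiction".
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")

_NARRATIVE_RE = _pattern(NARRATIVE)
_NON_NARRATIVE_RE = _pattern(NON_NARRATIVE)

def classify(subjects):
    """Return 'narrative', 'non-narrative', or 'review' for a list of tags."""
    text = " | ".join(s.lower() for s in subjects).replace("non-fiction", "nonfiction")
    narrative = bool(_NARRATIVE_RE.search(text))
    non_narrative = bool(_NON_NARRATIVE_RE.search(text))
    if narrative and not non_narrative:
        return "narrative"
    if non_narrative and not narrative:
        return "non-narrative"
    return "review"  # no signal or conflicting signals: check by hand
```

Saving the result for each work to a file means the purge only has to be reviewed once, not redone on every run.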

Does anyone have any advice on how I should approach this, and what tools to use? I've been working on this project for days and have already filtered it three times, getting a different result each time and having to start all over again.

reddit.com
u/MarwanAhmed1074 — 3 days ago
