u/DJ_Beardsquirt

I want to build a PKM from a collection of a few thousand PDFs, but most strategies involve working with markdown from the beginning. So what are the best strategies for converting PDFs into markdown? Most of my docs are academic journal articles, but I have some full-length books, memoirs, biographies, etc. too.

I found a tool called openkb, which uses a VLM to summarise the texts and build wikilinks. But it seems very brittle, and it doesn't store the full text. Other options, such as OCR with Tesseract, seem to struggle badly with footnotes, endnotes, and other formatting issues.

So does anybody here have experience starting from PDFs when setting out to build a PKM? I'd love to hear what works for you.

I have a few thousand PDFs. This is cool, but I want to be able to do stuff with all of this info, rather than just open it in a PDF reader. Ideally, I want to load it into an Obsidian vault, but this requires extracting the text and converting it into markdown, and I'm not having much luck with that. The biggest problems are figuring out how to handle footnotes and endnotes (citations), and how to reliably capture images, figures, etc.
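To make the footnote problem concrete, here is a rough sketch of the kind of post-processing step I mean. It is only an illustration under some strong assumptions: the text of a page has already been extracted to a plain string, footnote bodies are the trailing lines of the page that start with a number and a space, and in-text markers are digits glued to the end of a word or punctuation mark (like `contested.1`). The function name `page_to_markdown` is made up for this example, and real PDFs will break these heuristics (multi-line footnotes, endnotes collected at the back of a book, a body line that happens to start with a year).

```python
import re


def page_to_markdown(page_text: str) -> str:
    """Rewrite one page of extracted text using Markdown footnote syntax.

    Heuristic sketch only: assumes single-line footnotes at the bottom of
    the page and numeric markers attached to the preceding word.
    """
    lines = page_text.strip().splitlines()
    notes = {}

    # Walk up from the bottom, collecting lines like "3 Smith 2001, p. 4."
    while lines:
        match = re.match(r"^(\d{1,3})\s+(\S.*)$", lines[-1])
        if not match:
            break
        notes[match.group(1)] = match.group(2)
        lines.pop()

    body = "\n".join(lines).rstrip()

    def mark(m: re.Match) -> str:
        # Only rewrite numbers that have a matching footnote on this page.
        num = m.group(2)
        return f"{m.group(1)}[^{num}]" if num in notes else m.group(0)

    # A marker is 1-3 digits directly after a letter or punctuation mark.
    body = re.sub(r"([A-Za-z.,;:!?)])(\d{1,3})\b", mark, body)

    # Emit the footnote definitions in numeric order.
    ordered = sorted(notes.items(), key=lambda kv: int(kv[0]))
    defs = "\n".join(f"[^{num}]: {text}" for num, text in ordered)
    return f"{body}\n\n{defs}" if defs else body


sample = (
    "The claim is contested.1 Others agree,2 as of 1999.\n"
    "\n"
    "1 Smith 2001, p. 4.\n"
    "2 Jones 1998."
)
result = page_to_markdown(sample)
print(result)
```

The `[^1]` and `[^1]: ...` forms are the footnote syntax Obsidian renders, so output like this could go straight into a vault. The hard part is still everything before this step: getting clean page text with the footnotes in a predictable place.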

I've had a quick look online, and most discussions just say capturing footnotes is "hard". And then there is a lot of discussion about capturing graph data, etc. which is less important to me.

There must be other people who would prefer to store their texts as markdown rather than PDF, but I can't seem to find anybody working on solutions to this problem. Does anybody here have any ideas, or has anybody achieved something like this?

Starting from PDFs, what's the first step?

Best strategy for saving PDFs as Markdown?