u/1mFlux

Built a Top Pakistani Reddit trends dashboard using ML, Reddit dumps, and a lot of painful labelling

Edit: A lot of people have asked me to open-source this and put up a public git repo so they can contribute or learn from it. I will eventually, but not before I solve the big mislabelling problem around political posts myself, clean up the flow, and add proper configuration files for the CLI and local DB support. This was supposed to be a challenge/fun project for me to learn from, and if people start contributing and solving all the problems, I won't learn as much. That was not the plan; I just didn't expect this many people to want to contribute to my "4fun" project.

Edit2: I'm also trying to use this as a prototype to show Reddit, in the hope they'll give me Reddit Data API access. That way it could have a live trends section for the latest posts on Pakistani subs, and I wouldn't have to rely on public Reddit data dumps to grow the dataset in future.

I had around 3 weeks of downtime from work and I’ve always wanted to work on a proper social media classifier / sentiment analysis pipeline.

Not just basic “positive/negative” sentiment, but something more multi-dimensional. Like topic, intent, toxicity, emotion, political framing, subreddit trends, and local context.

So I built this for Pakistani Reddit:

Dashboard

If you only want to check the dashboard, that’s the link.

If you’re more interested in how I built it, or the technical side / mistakes / pain behind it, read below.

I originally wanted to do this with Reddit’s API. I applied for data API access, but they denied it because I’m not a student or researcher, and I assume they’re careful now about people using post data for ML/custom AI models.

So I went for public Reddit data dumps instead.

Those dumps basically contain posts and comments, but they’re huge. Each monthly dump is around 60GB. My original plan was to download data from 2025 all the way to April 2026 and run my pipeline on that.

That did not happen lol.

The 2026 data was easy enough to download, but the 2025 data was barely seeded and was taking forever. I didn’t have forever. I had 3 weeks.

So I settled for the data I could actually get and process:

  • Jan 2025
  • Jan-April 2026

After that I had to figure out how to convert the dump format into something actually useful. I settled on Parquet because I like CLI workflows and wanted something fast enough to filter, split, query, and process without constantly fighting the data.

Then I filtered the biggest Pakistani subreddits and extracted the posts/comments I wanted to analyze.
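
A minimal sketch of what that subreddit filter could look like, assuming the dump has already been decompressed into newline-delimited JSON with a `subreddit` field on each record. The subreddit names and the function name are placeholders for illustration, not the actual list or script used here.

```python
import json
from typing import Iterable, Iterator

# Hypothetical target list, for illustration only.
TARGET_SUBREDDITS = {"pakistan", "pak"}


def filter_subreddits(
    lines: Iterable[str], targets: set[str] = TARGET_SUBREDDITS
) -> Iterator[dict]:
    """Yield parsed records whose subreddit is in `targets` (case-insensitive)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a corrupt line instead of aborting a huge run
        if not isinstance(record, dict):
            continue
        if str(record.get("subreddit", "")).lower() in targets:
            yield record


# Tiny in-memory stand-in for a dump file.
sample_lines = [
    '{"id": "a1", "subreddit": "pakistan", "title": "Load shedding again"}',
    '{"id": "b2", "subreddit": "aww", "title": "Cute cat"}',
    "not valid json",
    '{"id": "c3", "subreddit": "PAK", "title": "Match thread"}',
]
kept = list(filter_subreddits(sample_lines))
print([r["id"] for r in kept])
```

Streaming line by line keeps memory flat, which matters when a single month is around 60GB.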

My initial goal was pretty simple:

  • sentiment analysis
  • emotion detection
  • toxicity check
  • post topic
  • post intent

But then I also wanted Pakistan-specific labels. Stuff public models probably won’t understand properly. Things like Pakistan/India posts, Imran Khan/PTI context, establishment framing, local political language, local subreddit behaviour, religion/culture posts, etc.

So I made a keyword/rule filter to pull likely Pakistan-themed posts, then manually labelled 1,000 posts to train a custom classifier.
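
A rough sketch of a keyword/rule filter of this kind, using only Python's `re` module. The theme names and keyword lists are hypothetical examples; the real rule set is not shown in this post.

```python
import re

# Hypothetical keyword groups, for illustration only.
THEME_KEYWORDS = {
    "pak_india": ["india", "indian", "kashmir"],
    "imran_pti": ["imran khan", "pti"],
    "establishment": ["establishment", "army chief"],
}


def compile_rules(keywords: dict[str, list[str]]) -> dict[str, re.Pattern]:
    """Build one case-insensitive, whole-word pattern per theme."""
    return {
        theme: re.compile(
            r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
            re.IGNORECASE,
        )
        for theme, words in keywords.items()
    }


RULES = compile_rules(THEME_KEYWORDS)


def match_themes(text: str) -> list[str]:
    """Return every theme whose keywords appear in the text."""
    return [theme for theme, pattern in RULES.items() if pattern.search(text)]


print(match_themes("Why was Imran Khan arrested?"))
print(match_themes("Best biryani in Lahore"))
```

Posts that match at least one theme become candidates for manual labelling; the rules only narrow the pool, they do not assign the final label.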

The first model was terrible.

Like around 40% accuracy terrible.

So I labelled 2,000 more entries manually. That part was painful af. I think it took me around 6 days total to hand-label 3,000 entries.

After that I got the custom classifier to around 75% accuracy for Pakistan-specific themes, which was good enough for me for a side project.

The final pipeline ended up being a mix of:

  • custom classifier trained on my manually labelled Pakistani Reddit examples
  • zero-shot models for some topic/intent stuff
  • sentiment analysis model
  • emotion model
  • toxicity model
  • BART/NLI-style classification where it made sense
  • keyword/rule heuristics
  • confidence thresholds
  • summary scripts for dashboard-ready tables

I don’t have rented GPU compute or anything. I just used the GPUs I already had at home on different PCs:

  • 1080 Ti
  • 3070 Ti
  • 4080

So I started splitting the Parquet files and running parts of the pipeline across different machines.

At first I only ran around 10k posts at a time, checked the output, reviewed what looked wrong, tuned the rules/model, retrained, then ran it again.
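
The batching and splitting described above can be sketched like this. It assumes a plain round-robin split across machines; the worker names are just labels for the three GPUs, and the demo uses a batch size of 10 instead of 10k so it stays readable.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def assign_round_robin(batches: Iterable[list], workers: list[str]) -> dict[str, list]:
    """Hand out batches to workers in turn."""
    plan: dict[str, list] = {w: [] for w in workers}
    for i, batch in enumerate(batches):
        plan[workers[i % len(workers)]].append(batch)
    return plan


# Hypothetical worker labels, one per machine.
workers = ["pc-1080ti", "pc-3070ti", "pc-4080"]
post_ids = list(range(25))  # stand-in for real post IDs
plan = assign_round_robin(batched(post_ids, 10), workers)
for worker, batches in plan.items():
    print(worker, [len(b) for b in batches])
```

Round-robin ignores that the 4080 is much faster than the 1080 Ti; a weighted split would balance better, but that is a refinement this sketch leaves out.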

Somewhere in the middle of this, for some reason, I decided to add political framing too.

In hindsight, that was probably the messiest part of the whole project.

Political posts are not clean. A political post can look like a question, a rant, a joke, a news post, an advice post, a meme, or just some vague one-line complaint. So if someone posted something political but framed it like a question, the pipeline might label it as question/help instead of politics.

I could have made it multi-label, but I didn’t want every post to turn into a complicated object with 5 different overlapping labels. I also wanted to keep this cheap/free to host on Supabase + Vercel, keep the schema simple, keep the dashboard understandable, and actually finish the project before my downtime ended.

So yeah, political framing is the weakest part. Some political posts are under-labelled or mislabelled. I’m okay admitting that.

I also set confidence thresholds.

If the model/rules were not confident enough, the post stayed as unclear. That does not always mean the pipeline failed. Sometimes the post was vague, sometimes it was Urdu/Roman Urdu, sometimes it was just an image or video, and my pipeline does not read images or watch videos.
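
As a minimal sketch of the thresholding idea: take the top-scoring label and fall back to unclear when its score is too low. The 0.6 cutoff and the label names are assumptions for illustration; the actual thresholds are not stated in this post.

```python
UNCLEAR = "unclear"


def pick_label(scores: dict[str, float], threshold: float = 0.6) -> str:
    """Return the best label, or UNCLEAR if its score is below the threshold."""
    if not scores:
        return UNCLEAR  # e.g. image-only posts with no usable text
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else UNCLEAR


print(pick_label({"politics": 0.82, "question": 0.11}))
print(pick_label({"politics": 0.41, "question": 0.38}))
print(pick_label({}))
```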

Out of 95,593 total posts from the selected subreddits, 42,730 were labelled with high enough confidence that I felt okay showing them as useful dashboard signals.

The rest are still counted for volume/context, but I don’t treat them as strong labels.
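
For scale, here is the arithmetic on those two numbers:

```python
total_posts = 95_593
confident_posts = 42_730

unclear_posts = total_posts - confident_posts
confident_share = confident_posts / total_posts

print(unclear_posts)             # 52863
print(f"{confident_share:.1%}")  # 44.7%
```

So a bit under half of the posts carry a label strong enough to show as a signal, and the remaining 52,863 only count towards volume.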

I did try to review why so many were unclear, and a lot of them were either vague, media-based, Urdu/Roman Urdu, or political posts that didn’t fit cleanly into the label structure. But honestly, I was running out of time and I didn’t want to keep all my PCs running 24/7 for a “for fun” project forever.

For the workflow, I mostly avoided Jupyter notebooks. I know notebooks are useful, but for this project I preferred CLI scripts because the whole thing felt more like a data pipeline than an experiment notebook.

The general flow was:

  1. download Reddit dumps
  2. convert/extract usable data
  3. save filtered data as Parquet
  4. filter Pakistani subreddits
  5. build keyword/rule sets
  6. manually label training examples
  7. train custom classifier
  8. run sentiment/emotion/toxicity/zero-shot models
  9. combine model outputs with rules and confidence thresholds
  10. generate summary tables
  11. upload dashboard-ready tables to Supabase
  12. build the frontend on top of precomputed summaries

The app side is:

  • Next.js 14 App Router
  • TypeScript
  • Tailwind
  • Recharts
  • shadcn/ui style components
  • Supabase public read tables
  • Vercel hosting

The data side is:

  • Python scripts
  • Parquet files
  • custom classifier
  • public ML models
  • rule/keyword heuristics
  • CLI scripts for summaries/uploads

The dashboard reads precomputed summary tables from Supabase instead of trying to load a massive dataset in the browser. I wanted it to be cheap to host and not completely die if people opened it.
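
A small sketch of what precomputing a summary table can mean in practice: collapse per-post labels into counts per subreddit and topic, so the frontend only ever fetches a few small rows. The field names (`subreddit`, `topic`, `post_count`) are hypothetical, not the real schema.

```python
from collections import Counter


def summarize(posts: list[dict]) -> list[dict]:
    """Aggregate labelled posts into one row per (subreddit, topic)."""
    counts = Counter((p["subreddit"], p["topic"]) for p in posts)
    return [
        {"subreddit": sub, "topic": topic, "post_count": n}
        for (sub, topic), n in sorted(counts.items())
    ]


# Hypothetical labelled posts.
posts = [
    {"subreddit": "pakistan", "topic": "politics"},
    {"subreddit": "pakistan", "topic": "politics"},
    {"subreddit": "pakistan", "topic": "cricket"},
    {"subreddit": "pak", "topic": "politics"},
]
for row in summarize(posts):
    print(row)
```

Rows like these are what would get uploaded, which keeps the hosted table size tied to the number of subreddit/label combinations rather than the number of posts.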

I also used this project to properly test Codex after using Claude Code for months for my actual work. Codex honestly worked better for me on this project, especially for debugging frontend issues, TypeScript problems, and making the dashboard usable instead of just technically working. I honestly think Codex might be better than Claude Code, in its current state at least.
Not to mention I one-shotted the network page with Codex. That page was all Codex: I just gave it a very detailed requirements and spec sheet and fed it Obsidian articles about their network graph.

I also used it for script bugs, Supabase upload issues, and some Next.js cleanup. The actual pipeline and structure still took a lot of manual decisions, but for implementation/debugging, yeah, Codex was surprisingly good.

The dashboard is obviously not perfect.

Political framing needs work. Some labels are probably wrong. Some posts are under-labelled. Urdu/Roman Urdu support could be much better. Image/video posts are not really understood. And the dataset is limited because I could only process the dumps I could actually download in time.

But for a 3-week side project, I’m pretty happy with where it landed.

Would love feedback from any ML people on what I should do differently in future to improve the pipeline, and any ideas on how to detect political posts better, because right now it kinda sucks at that. Whenever I get some extra time, I'll probably work on this again: add more data, tune the pipeline more for political framing, and try to add Urdu script support.

Link again:

PakReddit Dashboard

reddit.com
u/1mFlux — 18 hours ago
▲ 2 r/PAK+1 crossposts

I spent 3 weeks building a Reddit sentiment/classifier dashboard for Top Pakistani subreddits

https://preview.redd.it/5q1usnon4e2h1.png?width=2197&format=png&auto=webp&s=7868fcdb585a816f4962876ecc43ac055fc6220e

I had around 3 weeks of downtime from work and I’ve always wanted to work on a proper social media classifier / sentiment analysis pipeline and dashboard to see trends in discourse across posts and comments.

Not just basic “positive/negative” sentiment, but something more multi-dimensional. Like topic, intent, toxicity, emotion, political framing, subreddit trends, and local context.

So I built this for Pakistani subreddits. The classifications and trends are based on data from Jan 2026 to April 2026, plus Jan 2025.

https://pakredditpulse.vercel.app/

If you only want to check the dashboard, that’s the link. The app is desktop-first, so it won't work as well on phones for now.

If you’re more interested in how I built it, or the technical side / mistakes / pain behind it, read below.

I originally wanted to do this with Reddit’s API. I applied for data API access, but they denied it because I’m not a student or researcher, and I assume they’re careful now about people using post data for ML/custom AI models.

So I went for public Reddit data dumps instead.

Those dumps basically contain posts and comments, but they’re huge. Each monthly dump is around 60GB. My original plan was to download data from 2025 all the way to April 2026 and run my pipeline on that.

That did not happen lol.

The 2026 data was easy enough to download, but the 2025 data was barely seeded and was taking forever. I didn’t have forever. I had 3 weeks.

So I settled for the data I could actually get and process: Jan 2025 and Jan-April 2026

After that I had to figure out how to convert the dump format into something actually useful. I settled on Parquet because I like CLI workflows and wanted something fast enough to filter, split, query, and process without constantly fighting the data.

Then I filtered the biggest Pakistani subreddits and extracted the posts/comments I wanted to analyze.

My initial goal was pretty simple:

  • sentiment analysis
  • emotion detection
  • toxicity check
  • post topic
  • post intent

But then I also wanted Pakistan-specific labels. Stuff public models probably won’t understand properly. Things like Pakistan/India posts, Imran Khan/PTI context, establishment framing, local political language, local subreddit behaviour, religion/culture posts, etc.

So I made a keyword/rule filter to pull likely Pakistan-themed posts, then manually labelled 1,000 posts to train a custom classifier.

The first model was terrible.

Like around 40% accuracy terrible.

So I labelled 2,000 more entries manually. That part was painful af. I think it took me around 6 days total to hand-label 3,000 entries.

After that I got the custom classifier to around 75% accuracy for Pakistan-specific themes, which was good enough for me for a side project.

The final pipeline ended up being a mix of:

  • custom classifier trained on my manually labelled Pakistani Reddit examples
  • zero-shot models for some topic/intent stuff
  • sentiment analysis model
  • emotion model
  • toxicity model
  • BART/NLI-style classification where it made sense
  • keyword/rule heuristics
  • confidence thresholds
  • and finally summary scripts for dashboard-ready tables

I don’t have rented GPU compute or anything. I just used the GPUs I already had at home on different PCs: a 1080 Ti, a 3070 Ti, and a 4080.

So I started splitting the Parquet files and running parts of the pipeline across different machines.

At first I only ran around 10k posts at a time, checked the output, reviewed what looked wrong, tuned the rules/model, retrained, then ran it again.

Somewhere in the middle of this, for some reason, I decided to add political framing too.

In hindsight, that was probably the messiest part of the whole project.

Political posts are not clean. A political post can look like a question, a rant, a joke, a news post, an advice post, a meme, or just some vague one-line complaint. So if someone posted something political but framed it like a question, the pipeline might label it as question/help instead of politics.

I could have made it multi-label, but I didn’t want every post to turn into a complicated object with 5 different overlapping labels. I also wanted to keep this cheap/free to host on Supabase + Vercel, keep the schema simple, keep the dashboard understandable, and actually finish the project before my downtime ended.

So yeah, political framing is the weakest part. I’m okay admitting that.

I also set confidence thresholds.

If the model/rules were not confident enough, the post stayed as unclear. That does not always mean the pipeline failed. Sometimes the post was vague, sometimes it was Urdu/Roman Urdu, sometimes it was just an image or video, and my pipeline does not read images or watch videos.

Out of 95,593 total posts from the selected subreddits, 42,730 were labelled with high enough confidence that I felt okay showing them as useful dashboard signals.
The rest are still counted for volume/context based on zero-shot predictions, but I don’t treat them as strong labels.

I did try to review why so many were unclear, and a lot of them were either vague, media-based, Urdu/Roman Urdu, or political posts that didn’t fit cleanly into the label structure. But honestly, I was running out of time and I didn’t want to keep all my PCs running 24/7 for a “for fun” project forever.

For the workflow, I mostly avoided Jupyter notebooks. I know notebooks are useful, but for this project I preferred CLI scripts because the whole thing felt more like a data pipeline than an experiment notebook.

The general flow was:

  1. download Reddit dumps
  2. convert/extract usable data
  3. save filtered data as Parquet
  4. filter Pakistani subreddits
  5. build keyword/rule sets
  6. manually label training examples
  7. train custom classifier
  8. run sentiment/emotion/toxicity/zero-shot models
  9. combine model outputs with rules and confidence thresholds
  10. generate summary tables
  11. make a SQL schema to migrate the Parquet/CSV data to Supabase for online hosting
  12. upload dashboard-ready tables to Supabase
  13. build the frontend on top of precomputed summaries

The app side is:

  • Next.js 14 App Router
  • TypeScript
  • Tailwind
  • Recharts
  • shadcn/ui style components
  • Supabase public read tables
  • Vercel hosting

The data side is:

  • Python scripts
  • Parquet files
  • custom classifier
  • public ML models
  • rule/keyword heuristics
  • CLI scripts for summaries/uploads

The dashboard reads precomputed summary tables from Supabase instead of trying to load a massive dataset in the browser. I wanted it to be cheap to host and not completely die if people opened it.

I also used this project to properly test Codex after using Claude Code. Codex honestly worked better for me on this project, especially for debugging frontend issues, TypeScript problems, and making the dashboard usable instead of just technically working.

I also used it for script bugs, Supabase upload issues, and some Next.js cleanup. The actual pipeline and structure/architecture still took a lot of manual decisions and coding, but for implementation/debugging Codex was surprisingly good.

The dashboard is obviously not perfect.

Political framing needs work. Some labels are probably wrong. Some posts are under-labelled. Urdu/Roman Urdu support is nonexistent. Image/video posts are not really understood. And the dataset is limited because I could only process the dumps I could actually download in time.

But for a 3-week side project, I’m pretty happy with where it landed.

reddit.com
u/1mFlux — 1 day ago