r/redditstock

This was originally meant for WSB, but automoderator removed it and mods were too stupid to handle a non-meme post, I guess. I'm too lazy to adjust it so here you go D:

A short primer on “AI” today

Today “AI” largely refers to the generation of text (commonly referred to as LLMs, Large Language Models) and image/video/audio (typically “diffusion models”) which I’ll call “generative models” for the rest of this post, but there are tens of other areas of AI and model types which we’ll skip for now. Generative models typically require insane amounts of data to train, e.g. recent Chinese models refer to tens of trillions of tokens (think words / numbers / pieces of data / ...) in their technical reports and documentation [1, 2]. These datasets are either generated synthetically or collected from many different sources like the broader web, books, movie subtitles, your shameful roleplay on c.ai, and so on. More data typically means stronger models, which is why (contrary to popular belief) Meta’s top model easily competes with the other AI megacaps [3]. Reddit is one such source of training data, truly immense in size and growing incredibly fast too, but we’ll dive into it later.

Usage (or “inference”) of generative models can be problematic due to hallucinations, which is why many users in the industry moved towards Retrieval Augmented Generation (RAG) which performs a search for relevant snippets and injects them in the context during inference, effectively “grounding” the answer. It doesn’t fully solve the problem, but LOTS of research exists for detecting, preventing, fixing, and otherwise combating hallucinations, including work by my way-smarter-than-you-and-me colleagues [4] that is implemented in production systems. Thus Reddit’s data can also serve as a reference for a search engine’s answers. This is where the main hype about upcoming AI deals comes from, because a per-citation pricing structure could mean a significant boost to Reddit’s earnings. Probably the biggest consumer of Reddit’s data for this purpose is Google, so before we move on, let’s take a detour to look into them, and how much this aspect could be worth.
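For anyone who hasn't seen RAG before, here is a toy sketch of the idea: retrieve the most relevant snippets, then inject them into the prompt so the answer can cite them. Everything in it is an assumption for illustration (the `SNIPPETS` list is made up, and `retrieve` / `build_prompt` are hypothetical helpers using plain keyword overlap). Real systems use proper search indexes and embeddings, and nothing here reflects how Google or anyone else actually does it.

```python
# Toy RAG sketch: keyword-overlap retrieval + prompt assembly.
# SNIPPETS, retrieve() and build_prompt() are hypothetical illustrations.

SNIPPETS = [
    "thread A: the pool import error went away after a reboot",
    "thread B: cats sleep most of the day",
    "thread C: best pizza toppings ranked",
]

def retrieve(query, snippets, k=1):
    """Return the k snippets sharing the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(snippet):
        return len(query_words & set(snippet.lower().split()))

    return sorted(snippets, key=overlap, reverse=True)[:k]

def build_prompt(query, snippets, k=1):
    """Inject the retrieved snippets into the context so the answer can cite them."""
    sources = retrieve(query, snippets, k)
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer using only these sources and cite them:\n"
        f"{context}\n\nQuestion: {query}"
    )

prompt = build_prompt("how to fix pool import error", SNIPPETS)
print(prompt)
```

Each `[n]` in the prompt is a citation slot, which is the thing a per-citation pricing structure would count.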

Search engines & their revenues

Google’s total search revenue is ~$60.4B (growing ~20%) for Q1’2026, and ~$224B for Y2025 (growing ~13%) [5]. “Trillions of searches” every year [6], upwards of 5 trillion [7]. Analyst estimates put Google Search at 4.3-4.5B MAU, which doesn’t seem crazy considering Google’s official MAU numbers where AI Mode has >1B [8], AI Overview has >2.5B, AI Lens has >1.5B, and Gemini has >900M [9, 10, 11, 12]. Putting all of this together (~$224B over ~5T searches), Google’s revenue per search is **$0.045**, which is pretty insane considering it’s freely available. This number gives us an absolute cap on what Google could potentially pay for its own costs and for licensing all data used on a single search. Google doesn’t tell us anything about either of these numbers, so we need to get creative. But before that, let’s pause to look at other search engines so we can truly appreciate how high those numbers are.

Microsoft’s Bing supposedly has a market share of >4% (up to 10%) and 1B MAU [13, 14, 15, 16, 17]. Microsoft reports earnings for Search together with Gaming and news advertising under “More Personal Computing”, which is at $54.6B for 2025 [18]. In their 10-K they say that “search and news advertising revenue increased $1.6B or 13%” [18]. A table near the end of their yearly report seems to break it down by category and shows ~$12.3B and ~$13.78B respectively (presumably the prior and the latest fiscal year), which roughly aligns with the quote. For the number of searches, a linear extrapolation from both Google and Brave (see below) using market share would suggest >200B/year, but an unreliable third-party estimate places it at ~1.1B/day or ~400B/year [16] – I’ll take the middle (~300B/year).

DuckDuckGo is reported to have ~0.8% of market share. They estimate ~22M MAU just in the EU [20] and say they have >100M users worldwide [19]; let’s assume their MAU isn’t that far off. They also report >3B monthly queries [21]. A bunch of probably outdated and unreliable-looking sources put their revenues in the $100-200M range [22, 23, 24, 25, 26]; we’ll take the midpoint.

Brave has ~1.6B queries each month (~19.2B/year) with 100M Monthly Active Users (MAU) [27]. Market share is below 1% (not clear how much, but 0.4% seems reasonable after comparing to Bing). Their total revenue is reportedly ~$100M [28]. They have a bunch of revenue sources (Search API, VPN, Wallet), but even if we assume a generous 75% of that comes from web search (so it roughly matches DDG’s assumed numbers), their numbers would still be pretty bad.

Perplexity has a page(?) about having 780M monthly queries, 22M MAU, and $100M ARR from May 2025 [29]. They have probably grown since then, but I’ll take the numbers as is, and there is some probably unreliable third-party data too [30]. Tabulated:

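| Search Engine | Searches/year | MAU | Market share | Revenues | Rev/search |
|---|---|---|---|---|---|
| Brave | 19.2B | 100M | 0.4% | $75M | $0.004 |
| Bing | 300B | 1B | 4% | $13.8B | $0.046 |
| DuckDuckGo | 36B | 100M | 0.8% | $150M | $0.004 |
| Google | 5T | 4.5B | 90% | $224B | $0.045 |
| Perplexity | 9.4B | 22M | - | $100M | $0.011 |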

Bold -> Third-party estimates, Italics -> my estimates/guesswork, Regular -> official numbers (10-K, 10-Q, official blogs, etc).
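If you want to check the Rev/search column yourself, here is a minimal sketch that just divides yearly revenue by yearly searches. The inputs are the figures from the table (with all their guesswork baked in); the `ENGINES` dict and the `revenue_per_search` helper are names I made up for this example.

```python
# Revenue per search = yearly revenue / yearly searches.
# Figures are copied from the table above, estimates and all.
ENGINES = {
    # name: (searches per year, revenue per year in USD)
    "Brave": (19.2e9, 75e6),
    "Bing": (300e9, 13.8e9),
    "DuckDuckGo": (36e9, 150e6),
    "Google": (5e12, 224e9),
    "Perplexity": (9.4e9, 100e6),
}

def revenue_per_search(searches, revenue):
    """Return revenue in USD earned per single search."""
    return revenue / searches

rev_per_search = {
    name: revenue_per_search(searches, revenue)
    for name, (searches, revenue) in ENGINES.items()
}

for name, value in rev_per_search.items():
    print(f"{name:<12} ${value:.3f}")
```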

With that out of the way, let’s go back to profits. First, we would need to find and subtract their costs, which are likely lower than ANY other search engine (top engineering, years of optimization, etc). A report by SemiAnalysis seems to suggest that this number may have been from ~$0.0161 to ~$0.025 in 2022, and their estimated cost per query was ~$0.0106 [31]. It seems pretty high to me, because if that’s what it costs Google, how could Brave and DDG cover even half of that? Are they hitting higher costs due to salaries, research, deals, and so on? Anyway, to be safe let’s add 10% to the SemiAnalysis figure for good measure and round up to $0.012/query. One more thing before we move on: even though Google reports revenue from Search, for costs they group Search under “Google Services”, and that’s ~$139B [5, 6]. Revenue for the whole segment is $342B so costs are ~40% (?). SemiAnalysis used this number for their analysis, but at the time it was 34.15%. Let’s be conservative and reduce that to 30% for Search (video ads on YouTube probably cost more than static link ads, right?); then the cost per query in 2025 would be ~$0.0134 vs $0.012 in 2022. And that’s after we increased costs in 2022 and decreased them in 2025 (as percentages). Let’s keep these two numbers.
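The two cost-per-query numbers above can be reproduced with a few lines. This is only a sketch of my own arithmetic: the 10% padding and the 30% cost share are the assumptions stated in the text, and the variable names are made up for the example.

```python
import math

# 2022: SemiAnalysis cost per query, padded by 10% and rounded up
# to the next tenth of a cent.
semianalysis_cost_2022 = 0.0106
cost_2022 = math.ceil(semianalysis_cost_2022 * 1.10 * 1000) / 1000

# 2025: Google Services costs as a share of segment revenue (reported),
# then the more conservative 30% share assumed for Search alone.
segment_cost_ratio = 139e9 / 342e9
assumed_search_cost_share = 0.30
revenue_per_search_2025 = 224e9 / 5e12
cost_2025 = assumed_search_cost_share * revenue_per_search_2025

print(f"segment cost ratio: {segment_cost_ratio:.1%}")
print(f"cost/query 2022:    ${cost_2022:.4f}")
print(f"cost/query 2025:    ${cost_2025:.4f}")
```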

That was for traditional search, but what about the LLM part? The same SemiAnalysis report [31] suggested that a similar query from OpenAI/ChatGPT would cost $0.0142, so the difference of ~$0.0036 was probably for the LLM. However, an equal-capability LLM would cost today <1% of what it would have cost in 2022, and even a 20-30B & A3-4B Lite MoE model (which seems likely given Gemma 4 sizes) would cost them way less. My rough calculations say they’d need at most ~$0.007 per 1M tokens. I’m not even factoring in cheap power, better efficiency of recent GPUs/TPUs, and so on, and I cross-referenced that with Sglang’s blog post about running DeepSeek [32], the SemiAnalysis post for various models [33], and various Nvidia blog posts using GH*/GB*/NVL72 for Gemma/GPT-OSS/DeepSeek [34, 35, 36, 37], most of which suggest even better/lower numbers. And that’s per 1M tokens, where each answer would typically have 100-1000 tokens, so the cost per query is practically zero, but let’s say it is $0.003 so we get a nice round number (and only marginally cheaper than the 2022 estimation for ChatGPT). For reference, Brave Answers cost $0.004/query (which includes their profit margin) [38], so we have plenty of error margin. This leaves us with $0.045 - $0.012 - $0.003 ~= $0.03 per query for profits and AI-related licenses. A bit higher than, but not too far off, SemiAnalysis’ 2022 estimates [31].

[Warning: heavy hopium] But wait a moment, we made so many (un)favorable assumptions (reducing current costs & inflating old costs & allowing higher profitability & increasing LLM inference cost), but there is still a cost increase of $0.0134 - $0.012 ~= $0.0014 per query (which is ~$7B/year over ~5T searches, maybe ±$2B) that is too large to be a rounding error, pretty significant, and unaccounted for (remember, we always rounded against this number, not in favor). Could that be related to the total value they spend on AI-related licenses? [/heavy hopium] Let’s look at previous deals and licenses.
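Same exercise for the LLM part and the leftover margin. Again just a sketch of the arithmetic in this section, with the post's rounded figures as inputs and variable names invented for the example.

```python
# LLM inference: ~$0.007 per 1M tokens, 100-1000 tokens per answer.
usd_per_million_tokens = 0.007
llm_cost_low = usd_per_million_tokens / 1e6 * 100    # short answer
llm_cost_high = usd_per_million_tokens / 1e6 * 1000  # long answer

# Leftover per query after search and (generously padded) LLM costs.
revenue_per_query = 0.045
search_cost = 0.012
padded_llm_cost = 0.003
leftover = revenue_per_query - search_cost - padded_llm_cost

# Unexplained cost increase between 2022 and 2025, scaled to a year.
cost_gap = 0.0134 - 0.012
searches_per_year = 5e12
unexplained_per_year = cost_gap * searches_per_year

print(f"LLM cost/query:   ${llm_cost_low:.7f} to ${llm_cost_high:.6f}")
print(f"leftover/query:   ${leftover:.3f}")
print(f"unexplained/year: ${unexplained_per_year / 1e9:.1f}B")
```

Even the long-answer case stays far below the $0.003 placeholder, which is why that number is padding rather than an estimate.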

Past AI data license deals

Google uses a Knowledge Graph for Google Search [39]. To build that knowledge graph, they obviously rely on public data and their crawlers, but also on structured data and partnerships. Reddit already had a deal with Google in 2024 [40], but these deals are not unique to Reddit. Google has deals with LyricFind, StackOverflow, and so on [41, 42]. And Google isn’t the only player: Meta, OpenAI, and others have signed deals with data-rich companies such as News Corp, Reuters, and Financial Times [42, 44]. Many data licensors (such as Refinitiv/LSEG and XE) already have deals and APIs for providing data to search engines too, including DDG and Google (who had some of them pre-2022 too) [43, 44]. More deals will follow across the board, or companies will simply launch APIs targeted at LLMs (e.g. LexisNexis’ “Nexis Data+”) [45]. This gives credibility to the Bloomberg report from ~1 year ago about data providers (including Reddit) pushing for usage-based fees / dynamic pricing rather than flat fees [46]... but how do we measure it?

Well, we know Google and OpenAI already have deals with Reddit that pay $60M/year and $70M/year [50] respectively, but we obviously don’t know the exact terms. We also don’t know how often Reddit gets cited in search. I was going to hire a bunch of randos on fiverr and create my own dataset, but I don’t care enough, so we’ll have to make do with some posts like [47, 48] which cite statista for 40% on Google and [49] which says it’s ~2-3% on ChatGPT. That’s a very wide range, so let’s lean towards the lower estimate and say it’s a convenient 5% of citations (not accounting for single/multiple citations per query).

Fun aside: Maybe Reddit also benefits from this partnership in non-monetary ways, e.g. with Reddit Answers / Ask? I created a small dataset of ~10 questions and ran them across Gemini Flash-Lite, Reddit Answers, and ChatGPT, and then did a “qualitative” / visual inspection of the results, then a few quantitative ones (cosine similarity, BERT, BLEU, and rouge scores – yes, I know the method is not great, blow me), and finally what was probably the worst way to try LLM-as-a-judge. In my eyes, Reddit’s Ask/answers feature looks almost entirely like Google Gemini, and answers in a similar way too (with similar titles, similar link/citation placement, similar markdown), while only ChatGPT emoji-spammed its answers. The various metrics were relatively close, but Gemini Flash-Lite was consistently closer to Reddit Answers than ChatGPT was. Finally, LLM-as-a-judge picked Gemini Flash-Lite too... at least when the model names were hidden – when the names were visible it dunked on the other two, calling them “community-based” versus calling its own answers “precise and factual”.

[Warning: heavy hopium] If we conveniently take that convenient 5% citation number, conveniently assume that Reddit is “only” 1 out of 5 citations (in my small test set the average was ~6.33, but if you deduplicate by website it’s more like 2-4), and conveniently apply that to the convenient $7B from before, it conveniently works out to ~$70M, conveniently close to the reported $60M. [/heavy hopium] Awfully convenient, huh?
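The hopium math spelled out, so you can plug in your own less convenient numbers. A sketch only: every input is one of the guesses from this post, and `implied_reddit_payout` is a hypothetical helper name.

```python
def implied_reddit_payout(license_pool, cited_share, reddit_share_per_answer):
    """Yearly payout if the pool were split pro rata by citations.

    license_pool:            assumed yearly spend on AI-related licenses (USD)
    cited_share:             share of queries where Reddit is cited at all
    reddit_share_per_answer: Reddit's share of the citations in such an answer
    """
    return license_pool * cited_share * reddit_share_per_answer

# The convenient inputs: $7B pool, 5% of queries, 1 out of 5 citations.
payout = implied_reddit_payout(7e9, 0.05, 1 / 5)
print(f"implied payout: ${payout / 1e6:.0f}M/year")
```

Swap the 5% for the 40% statista figure and the same formula spits out a number eight times larger, which tells you how much weight this estimate can carry.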

Well, we likely can’t work out these values ourselves without knowing what other partnerships look like, without statistics about the total distribution of queries, without citation numbers/sources per topic, and so on. Again, I’m too lazy to do this properly, feel free to commission an army of freelancers yourselves. What I can do with my current laziness levels is look at Reddit’s data quality for training.

Data quality angle

There are good arguments about data quality being trash. There is plenty of AI-generated slop to go around, like on almost every other platform, and as some have pointed out, bots can be used to abuse voting mechanisms, moderators can be overwhelmed, humans parrot LLM content without research (what I like to call “The Perplexity Idiot”), and so on. Not only that, but plenty of human-generated content is incredibly dumb too (just read the comments below, r/whooosh), with many one-liner quips, overly overoverused memes (RIP u/RepostSleuthBot), etc. If you’re worried about that, AI stroke warning >!you are absolutely right!!<

But have you guys seen what non-STEM data the various frontier AI labs like Google, OpenAI, and Anthropic have access to? Based on OpenRouter data [51], Anthropic’s API is used ~85% for programming / tech / science / etc, ~10% for RP (Nana is watching you), and the remaining 5% is for literally everything else. WSB creates more ~~shitposting~~ high-quality financial commentary in a day than a million of their users do in their lifetimes.

Jokes aside, only Meta and ~~Twitter~~ X have direct access to lots of human-to-human data, everyone else like ChatGPT and Google (and probably DeepSeek and some Chinese apps), through their free services, only have access to human-to-LLM data. Less popular platforms (Anthropic, Mistral, ...) mostly cater to enterprise and power users, rather than armies of normies, and are probably even more limited and need to rely on partnerships, which sometimes aren’t easy as data is the “moat” of the non-AI companies against the AI ones, or at least that’s what our CEO says [52] as the $TRI keeps tumbling down (not your fault boss, plz don’t fire me). This means that Reddit’s data can have value for training, but the question is how much.

A good deal of Reddit data likely gets thrown away immediately. From what we know from technical reports, common filters include: safety filtering (harmful / adult content), deduplication and excessive repetitions, token distribution outliers (too many rare words), quality filtering (grammar, syntax, length, ...) [1, 53]. Some Reddit data can be used to generate synthetic datasets (e.g. for training vision / multi-modal models). Some data, even meme posts, can be used for training multi-modal models (e.g. VLMs, diffusion models, etc), for which you traditionally relied on other platforms like Pinterest, Instagram, Shutterstock, etc. Btw, for new programming libraries and various other support stuff, Reddit is a pretty good source of answers; for some topics Reddit subs have more up-to-date information than dedicated forums (I’m looking at you, truenas) or even StackOverflow (which is generally somewhat high-quality but heavily discourages back-and-forth and follow-ups, and also penalizes those downvoting). This is even more important for later stages of training, which are the key drivers of model performance. However, this “up-to-date” point is more relevant to citations since you cannot expect that an LLM will perfectly recite a single Reddit answer from its >30 trillion token dataset. This is why we went into the usage-based fee structure first. However, we should not ignore the fact that Reddit can be, and often is, gloriously wrong (e.g. as this rando says), but we have ways of dealing with (even if not “solving”) such issues and hallucinations (as we already saw above).

Another common counter-argument is that Reddit data is filled with LLM slop, dead internet theory, and what not. You regards would be surprised that the most popular subreddits aren’t WSB and r/investing but r/movies, r/funny, r/gaming, r/news, and so on. Most of it isn’t AI-generated, and most of it is links (with/without commentary), ~~Twitter~~ X posts, recycled videos from 20 years ago, downscaled TikTok reposts, memes from the 1820’s, and cat pictures. Lots and lots of cat pictures. Many subs with the same topic but slightly different names (r/AITAH vs r/AmItheAsshole). What’s more, the most popular subs (the ones with mostly picture posts) also have many more contributions than text-only / text-mostly subs, so this reduces the percentage of AI posts even more, and people and moderators in r/pics are exactly as receptive to AI-generated content as those in r/localllama (which is pretty funny in itself). If you pair these with the comments, flair, and titles, they make good vision and generation datasets. But let’s put this aside for now and get back to slop. There is no reliable way to tell AI-generated content apart from human content, and even humans struggle and mostly rely on heuristics (common phrases, long dashes, markdown) which are relatively easy to bypass with custom prompts, finetunes, rephrasing, and other methods. There are some “fingerprinting” methods that actually embed something like an invisible marker which social media or other apps can use to detect that something is AI-generated, but there is no guarantee that every model will have this or that users can’t remove them. Do we quit? Do we blindly accept the unsubstantiated numbers we find online? No. Listen here, n00bs: you can shamelessly steal the work of universities and companies from their open-source repositories.
I found the RAID Benchmark Leaderboard [54], picked open models/code that had >90% detection accuracy, created a majority voting system (score>=4 out of 5, picked after testing on a subset of known content), and used it to classify 15.5K posts from the past 2 weeks (up to 1K posts from each of 25 popular subreddits). I marked all non-text posts (~9.1K, 60%) as non-AI (totally intentional, not because my system can’t handle that much load). I arrived at ~5% of posts being flagged as AI-generated after re-running the experiment at a few different times throughout the day.
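Roughly what the voting logic looks like, as a minimal sketch. The five `DETECTORS` below are silly placeholder heuristics I made up so the example runs on its own; they stand in for the actual open detectors from the leaderboard, which are not reproduced here. The post format (`{"text": ...}`) is likewise an assumption.

```python
# Majority voting over 5 detectors: flag a post only if >= 4 agree.
# DETECTORS are toy placeholders, NOT the real RAID leaderboard models.
DETECTORS = [
    lambda text: "delve" in text.lower(),
    lambda text: "tapestry" in text.lower(),
    lambda text: "in conclusion" in text.lower(),
    lambda text: "as an ai" in text.lower(),
    lambda text: text.count("**") >= 2,
]

def classify_post(post, detectors=DETECTORS, min_votes=4):
    """Return True if the post is flagged as AI-generated."""
    text = post.get("text")
    if not text:
        # Non-text posts (images, links, videos) are marked as non-AI.
        return False
    votes = sum(bool(detector(text)) for detector in detectors)
    return votes >= min_votes

def ai_share(posts):
    """Fraction of all posts (text and non-text) flagged as AI-generated."""
    return sum(classify_post(p) for p in posts) / len(posts)

posts = [
    {"text": "Let us delve into the rich tapestry. **Key points**. In conclusion, buy."},
    {"text": "lol look at my cat"},
    {"text": None},  # picture post
]
print(f"flagged: {ai_share(posts):.0%}")
```

The high threshold trades recall for precision: a post only counts as slop when nearly every detector agrees, so the resulting percentage is better read as a floor than as a ceiling.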

That’s just slop, but we can also filter for low-quality. Of all posts, ~25% have 0 score, and ~45% have <10. In theory this 25-45% would contain all AI slop posts, but let’s say there’s no overlap and so we have 50% of the data being worthless. I personally think there is value in the low-quality and successfully-detected AI slop posts as negative feedback for models, but maybe not. I also noticed something strange: I’ve hit the 14-day limit on a bunch of subs. This shouldn’t happen when their weekly contributions are in the multiple thousands, so I suspect this means moderation removes A LOT of content (based on [56, 57] probably ~3-10%, which doesn’t account fully for what I saw). Free labor, mod abuse, power trip, methodology error, redact fanatics, call it (and make of it) whatever you wish. I didn’t bother filtering with length limit and common phrases as that mainly applies to comment replies, and I’m not scraping and parsing hundreds of comments per post for 15.5k posts. From the half that’s left, I’ll arbitrarily discard another 90% just because Reddit users are mostly braindead NPCs. This leaves us with 5% that is not slop, downvoted, or idiotic. Reddit reports 5B posts (half of that in the past year!), 22B comments, 3.9B messages (probably half is from remindmebot though) [55, 56, 57]. Even discarding 95% of that data, we’re probably left with ~100B–1T okay-quality tokens which is more significant than you’d think, and growing very fast. This is an even bigger number when you think that it can be used for additional synthetic data generation, and (more importantly) subs with less-popular languages are way under-represented in traditional training datasets. As Reddit grows internationally, the value of its non-English subs grows tremendously.
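The filtering funnel as arithmetic, in case the 5% looks like it came out of nowhere. This is a sketch using only the numbers in this section. The last step works backwards from the ~100B–1T token range to the average tokens per surviving item that range implies, which is a derived sanity check and not a measured figure.

```python
# Content counts reported by Reddit (posts, comments, messages).
posts, comments, messages = 5e9, 22e9, 3.9e9
total_items = posts + comments + messages

# Funnel: drop 50% as slop/low score, then 90% of the rest as "NPC content".
after_quality_filter = 0.50
after_npc_filter = 1 - 0.90
kept_fraction = after_quality_filter * after_npc_filter
kept_items = total_items * kept_fraction

# Average tokens per kept item implied by the 100B-1T token range.
implied_tokens_low = 100e9 / kept_items
implied_tokens_high = 1e12 / kept_items

print(f"kept fraction: {kept_fraction:.0%}")
print(f"kept items:    {kept_items / 1e9:.2f}B")
print(f"implied tokens per item: {implied_tokens_low:.0f} to {implied_tokens_high:.0f}")
```

So the token range holds if the surviving posts and comments average somewhere between a short paragraph and a long one.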

So calls?

Eh. A week or so after Q1’26 earnings, based on currently available data, I estimated that RDDT would trade around $170-220 by the end of the year, by assuming a drop in P/E and a growth slowdown (lower than guidance). There are a few unknowns too with the lawsuits against Perplexity and Anthropic, and the next dates for those are in the last week of July. If your degen sense is tingling, maybe try to time a pre-earnings entry, otherwise maybe wait until after earnings, but if they do another triple beat you likely won’t be able to enter at a good price point. Premiums seem a bit too high to me for buying options, but maybe an August expiry could work, good luck either way.

Disclaimers

Yes, 140% of this is financial advice. I’ve been granted a lifetime, perpetual, exclusive, non-revocable license by FINMA to give financial advice on Reddit, along with full immunity if your ports blow up, which they won’t because I have a 100% banbet win rate. Also some sources are kinda BS. Much of it is guesswork. I’m as much a regard as the rest of you, if not more, so you should put as much trust in this DD as you would in your own DD: zero. I sniff 300g of Hopium2Copium3 every morning. I'm too lazy to proofread or double-check my logic. 140% of this post was written by an Actual Idiot^TM^ (yes, it is in fact possible for humans to use markdown), and all mistakes are your own.
