u/transmision

Hi everyone,

I'm not entirely sure this request belongs on this subreddit, but I'll give it a shot anyway.

I'm working on a personal project called WeakSignalFinder, focused on quantitative text analysis to help detect emerging themes.

What the project currently does:

The program relies on Natural Language Processing (NLP) to tag terms by category (nouns, pronouns, adjectives, verbs) and to count the occurrences of a given set of keywords (e.g., war, economic…). It also analyzes co-occurrences: for each word, it captures the immediate neighbors (positions n-1 and n+1) in order to build a kind of map or dictionary of the linguistic patterns within the input corpus.
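
To make the counting and neighbor-capture steps concrete, here is a minimal sketch. It is not the actual WeakSignalFinder code: the function names (`tokenize`, `keyword_counts`, `neighbor_map`), the regex tokenizer, and the sample sentence are all assumptions made for illustration, and part-of-speech tagging is left out because it would need an NLP library.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def keyword_counts(tokens, keywords):
    # Count how often each keyword of interest appears in the corpus.
    counts = Counter(tokens)
    return {k: counts[k] for k in keywords}


def neighbor_map(tokens):
    # For every word, count the words seen at positions n-1 and n+1.
    neighbors = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if i > 0:
            neighbors[tok][tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            neighbors[tok][tokens[i + 1]] += 1
    return neighbors


# Made-up sample text, only for demonstration.
tokens = tokenize("The war hurt the economy. The economy shapes the war.")
counts = keyword_counts(tokens, ["war", "economy"])
neighbors = neighbor_map(tokens)

print(counts)                    # keyword frequencies
print(dict(neighbors["war"]))    # immediate neighbors of "war"
```

The nested `Counter` per word is one simple way to represent the "map or dictionary" described above; a sparse matrix would be a reasonable alternative for a large corpus.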

The problem I'm currently stuck on:

I'm now tackling a feature that was actually the original goal of the project: identifying weak informational signals (in the Ansoff sense). For a long time this seemed too complex to me, mainly because of one core difficulty: how do you distinguish noise from a genuine weak signal?

The hypothesis I'd like to submit:

A few days ago, I came up with a possible angle. To filter out noise from the pool of terms suspected of being weak signals, one could compute an average coefficient for each suspect term, taken over all of its occurrences, in order to derive a density of "theme-words" (terms with high or very high occurrence rates).
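
One possible reading of this hypothesis, sketched below under stated assumptions: for each occurrence of a suspect term, take the share of its immediate neighbors that are theme-words, then average that share over all occurrences. The function name `theme_density`, the `theme_min_count` threshold that defines a theme-word, the `window` size, and the toy token list are all hypothetical and not part of the project as described.

```python
from collections import Counter


def theme_density(tokens, suspect, theme_min_count=3, window=1):
    # Assumption: a "theme-word" is any term occurring at least
    # theme_min_count times in the corpus.
    counts = Counter(tokens)
    theme_words = {w for w, c in counts.items() if c >= theme_min_count}

    scores = []
    for i, tok in enumerate(tokens):
        if tok != suspect:
            continue
        # Neighbors at positions n-window .. n+window, excluding the term itself.
        lo = max(0, i - window)
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        if context:
            hits = sum(w in theme_words for w in context)
            scores.append(hits / len(context))

    # Average over all occurrences; 0.0 if the term never appears.
    return sum(scores) / len(scores) if scores else 0.0


# Made-up token sequence, only for demonstration.
sample = "war drone war economy war drone economy zebra apple".split()
drone_score = theme_density(sample, "drone")
zebra_score = theme_density(sample, "zebra")

print(drone_score)  # rare term that sits next to a theme-word
print(zebra_score)  # rare term with no theme-word neighbors
```

Under this reading, a rare term with a high score keeps appearing next to dominant themes and might be a weak-signal candidate, while a rare term with a score near zero looks more like noise. In a real corpus, stop words would need to be removed first, otherwise they would dominate the theme-word set.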

I'm coming to this subreddit today hoping to get critical feedback on this hypothesis, pointers to academic literature that could help me validate, refine, or correct the approach, and ideally any existing implementations or experimental code that have explored these concepts in practice.

Thanks in advance for any help. My current self, armed only with an Associate's Degree in Computer Science, will be more than happy to quench a bit of his insatiable thirst for knowledge.