u/DataCharming133

normalized SEC data

EDGAR is already free and paying for it is dumb. My friend and I started normalizing the data ~2 years ago and decided to make it completely free through an API. We're supported by commercial partners, so everyone else can use it as a public resource, the way it's supposed to be.

Python client: https://github.com/3spread/py3spread

Currently supported form types:

3, 3/A, 4, 4/A, 5, 5/A, 13F-HR, 13F-HR/A, 13F-NT, 13F-NT/A, D, D/A, SC 13D, SC 13D/A, SC 13G, SC 13G/A, 144, 144/A, N-CEN, N-CEN/A, N-MFP, N-MFP2, N-MFP3, N-PX, N-PX/A, 1-A, 1-A/A, 1-K, 1-U, 1-Z, N-PORT, N-PORT/A, S-1, S-1/A, S-3, S-3/A, S-3ASR, S-4, S-4/A, S-11, S-11/A, F-1, F-1/A, F-3, F-3/A, F-4, F-4/A

Up next: populating the registration statement endpoints above (our first structured text datasets) over the next couple of days during downtime. 10-K/Q stuff is a little further out on the calendar, with structured text and normalized financials (not just relabeled XBRL, you know who you are).

We currently support 300 req/min and 5 years of history, and we're rolling out higher limits and deeper history as we stress test everything and put out fires.
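
For anyone scripting against the API, a client-side throttle is an easy way to stay under the 300 req/min ceiling. Here's a minimal sketch using only the standard library. `RateLimiter` is a hypothetical helper, not part of py3spread, and it assumes the limit is counted over a rolling 60-second window, which isn't specified above.

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most `max_calls` per `period` seconds."""

    def __init__(self, max_calls=300, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def _expire(self, now):
        # Forget calls that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        self._expire(now)
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._expire(now)
        self._calls.append(now)
```

Create one `RateLimiter()` and call `limiter.wait()` right before each request. If the server counts requests differently (fixed windows, bursts), lower `max_calls` a bit to leave headroom.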

u/DataCharming133 — 2 days ago

normalized EDGAR data (open source python client)

SEC data is already free and paying for it is dumb. My friend and I started normalizing the data ~2 years ago and decided to make it completely free through an API. We're covered by commercial licenses, so everyone else can use it as a public resource, the way it's supposed to be.

Free data: 3spread.com or just give your AI 3spread.com/llms.txt and it'll do the work for ya.

If you're feeding SEC data into an LLM, it's highly likely that inconsistent formatting and structure will degrade quality and performance, especially when working with smaller models. Everything we serve comes out in a consistent, normalized schema for each form type, no matter when it was filed or which generation of the SEC's spec the filer was ignoring at the time (which happens a lot).
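
To make "one schema per form type" concrete, here's a rough sketch of the general idea: map every field name a filer might have used onto one canonical key, and always emit every key. The alias table and field names are illustrative assumptions, not 3spread's actual schema or pipeline.

```python
# Illustrative aliases: canonical field -> names seen across spec generations.
ALIASES = {
    "issuer_cik": ("issuer_cik", "issuerCik", "ISSUER-CIK"),
    "shares": ("shares", "transactionShares", "SHARES"),
    "price": ("price", "transactionPricePerShare", "PRICE"),
}


def normalize(record):
    """Return a dict with every canonical key present (None when missing)."""
    out = {}
    for canonical, names in ALIASES.items():
        # Take the first alias that appears in the raw record.
        out[canonical] = next((record[n] for n in names if n in record), None)
    return out


# Two made-up records for the same kind of filing, in different layouts.
old_style = {"ISSUER-CIK": "0000320193", "SHARES": "1000"}
new_style = {
    "issuerCik": "0000320193",
    "transactionShares": "1000",
    "transactionPricePerShare": "187.5",
}
```

Both `normalize(old_style)` and `normalize(new_style)` come back with the same three keys, so whatever consumes them (a model prompt, a DataFrame) never has to branch on filing vintage.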

Text is by far the hardest part, and the most valuable, when you're using AI to extract information from prose. Filings like registration statements, 10-Ks, and 10-Qs can be hundreds of pages long (millions of tokens) and are of limited use to an LLM in raw form. We chunk these docs down with deterministic parsers across all the core sections (MD&A, risk factors, etc.), with every embedded table pulled out, cleaned, and separately referenceable. There are no LLMs in our parsing pipeline, so the same document parses identically every time and your models get clean data instead of a massive corpus of hallucination-inducing tokens. The registration statements are being populated this weekend, and 10-K/Q data will follow soon after.
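
For a feel of what "deterministic" means here, a toy sketch: split plain filing text on "Item N." headings with a regex. This is not our parser or anyone's production approach; `split_items` is a hypothetical helper, and real filings are much messier (HTML, tables of contents that repeat the headings, inconsistent numbering).

```python
import re

# Matches headings such as "Item 1A. Risk Factors" at the start of a line.
ITEM_HEADING = re.compile(r"^\s*item\s+(\d+[a-z]?)\.", re.IGNORECASE | re.MULTILINE)


def split_items(text):
    """Split plain filing text into {item_id: section_text}."""
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # If an item id repeats (e.g. table of contents, then body),
        # the later occurrence overwrites the earlier one.
        sections[m.group(1).upper()] = text[m.start():end].strip()
    return sections


sample = (
    "Item 1A. Risk Factors\n"
    "We may lose money.\n"
    "Item 7. MD&A\n"
    "Revenue grew.\n"
)
```

`split_items(sample)` yields one chunk keyed `"1A"` and one keyed `"7"`, and running it again on the same text gives the same result, which is the property an LLM-based splitter can't promise.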

Normalized financials are also on our roadmap, but they're considerably more complicated. Normalizing financials =/= relabeling XBRL (you know who you are). A proof-of-concept dataset should be out relatively soon.

Python client: https://github.com/3spread/py3spread

pip install py3spread

Currently supported form types:

3, 3/A, 4, 4/A, 5, 5/A, 13F-HR, 13F-HR/A, 13F-NT, 13F-NT/A, D, D/A, SC 13D, SC 13D/A, SC 13G, SC 13G/A, 144, 144/A, N-CEN, N-CEN/A, N-MFP, N-MFP2, N-MFP3, N-PX, N-PX/A, 1-A, 1-A/A, 1-K, 1-U, 1-Z, N-PORT, N-PORT/A, S-1, S-1/A, S-3, S-3/A, S-3ASR, S-4, S-4/A, S-11, S-11/A, F-1, F-1/A, F-3, F-3/A, F-4, F-4/A

We currently support 300 req/min and 5 years of history, and we're rolling out higher limits and deeper history as we stress test everything and put out fires.

u/DataCharming133 — 2 days ago