
I cut my AI dictionary app’s first streamed result from 13.3s to 3.0s by making it stop overthinking the word “apple”
I’m building UrLingo, a personal dictionary/wordbook app for that very specific human ritual where you search “[word] meaning,” understand it for 14 seconds, and then your brain quietly throws it into the ocean.
The core flow is simple:
User searches a word → backend checks auth/quota/preferences → OpenAI generates a structured dictionary entry → frontend streams the response (more on the streaming part in a bit).
Simple. Beautiful. Innocent.
Except my app was taking 13 seconds before showing the first useful streamed output.
Initial numbers were rough:
OpenAI TTFT: 8296ms
First frontend OpenAI chunk: 13274ms
Hidden reasoning tokens: 1088
Yes. 1088 hidden reasoning tokens.
For a dictionary response.
Apparently the model needed to assemble the Seven Kingdoms before explaining what a word means.
After profiling and fixing the path, the latest batch looks like this:
OpenAI TTFT p50/p95: 1247ms / 3514ms
First frontend OpenAI chunk p50/p95: 3038ms / 4873ms
Hidden reasoning tokens: 0
Priority tier: true on all runs
So roughly:
OpenAI TTFT p50: 6.7x faster
First frontend chunk p50: 4.4x faster
First frontend chunk p95: 2.7x faster
Reasoning overhead: eliminated
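For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes those ratios from the numbers above. The variable and function names are mine, not from the app. One caveat: the initial numbers were not reported as percentiles, so dividing them by a p95 is only an approximate comparison.

```python
# Numbers copied from the post. The dictionary keys are illustrative names only.
baseline_ms = {"openai_ttft": 8296, "first_frontend_chunk": 13274}
latest_ms = {
    "openai_ttft_p50": 1247,
    "first_frontend_chunk_p50": 3038,
    "first_frontend_chunk_p95": 4873,
}


def speedup(before_ms, after_ms):
    """Return how many times faster 'after' is than 'before', to one decimal."""
    return round(before_ms / after_ms, 1)


ttft_p50 = speedup(baseline_ms["openai_ttft"], latest_ms["openai_ttft_p50"])
chunk_p50 = speedup(baseline_ms["first_frontend_chunk"], latest_ms["first_frontend_chunk_p50"])
chunk_p95 = speedup(baseline_ms["first_frontend_chunk"], latest_ms["first_frontend_chunk_p95"])

print(f"OpenAI TTFT p50: {ttft_p50}x faster")
print(f"First frontend chunk p50: {chunk_p50}x faster")
print(f"First frontend chunk p95: {chunk_p95}x faster")
```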
What actually helped:
- Removed reasoning overhead for simple dictionary lookups. No need for Socrates to define “serendipity.”
- Verified `service_tier: priority` was actually being used, because apparently checking that the thing you paid for is turned on remains a valid engineering strategy.
- Added detailed timing logs on both server and client.
- Split metrics into same-clock measurements (server timestamps compared only with server timestamps, client only with client) so I stopped chasing fake delays like a Victorian ghost hunter with a Datadog account.
- Improved the stream path so useful chunks reached the UI earlier, not just backend tokens flapping around in the void.
- Measured backend prep separately: auth, quota, preferences, OpenAI startup, all the tiny goblins hiding before the model call.
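To make the timing-log and same-clock points concrete, here is a rough sketch of a per-request stage timer. It is not the code UrLingo runs; the class name, the stage names, and the injectable `clock` parameter are all assumptions for illustration. The idea is that every mark comes from one monotonic clock (`time.perf_counter`), so subtracting two marks never mixes server time with client time.

```python
import time


class StageTimer:
    """Records named marks on a single monotonic clock and reports the gaps."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self._marks = []  # list of (stage_name, timestamp_in_seconds)

    def mark(self, stage):
        """Record that 'stage' was reached right now."""
        self._marks.append((stage, self._clock()))

    def durations_ms(self):
        """Return {stage: ms since the previous mark} for every mark after the first."""
        result = {}
        for (_, prev_t), (name, t) in zip(self._marks, self._marks[1:]):
            result[name] = round((t - prev_t) * 1000, 1)
        return result


if __name__ == "__main__":
    timer = StageTimer()
    timer.mark("request_received")
    time.sleep(0.01)  # stand-in for auth/quota/preferences work (hypothetical)
    timer.mark("backend_prep_done")
    time.sleep(0.02)  # stand-in for waiting on the model's first token (hypothetical)
    timer.mark("first_model_token")
    print(timer.durations_ms())
```

Running one timer like this on the server and a separate one on the client, and only ever subtracting marks within the same timer, is one way to avoid the fake delays that clock skew produces.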
The biggest lesson: streaming alone does not make an AI app feel fast.
Users do not care that your backend received a token if the UI is still sitting there like Clippy after a head injury. The only thing that matters is when the first useful thing reaches the screen.
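One way to measure that difference is to track "first chunk" and "first useful chunk" separately. The sketch below assumes a chunk is just a string and that "useful" means "contains non-whitespace text"; both are simplifications I picked for the example, and a real app would plug in its own check (for instance, "the definition field has started").

```python
import time


def first_chunk_timings(chunks, is_useful, clock=time.perf_counter):
    """Consume a stream until the first useful chunk.

    Returns (ms_to_first_chunk, ms_to_first_useful_chunk). Either value is
    None if the stream ended before that event happened.
    """
    start = clock()
    first_any = None
    first_useful = None
    for chunk in chunks:
        now = clock()
        if first_any is None:
            first_any = round((now - start) * 1000, 1)
        if is_useful(chunk):
            first_useful = round((now - start) * 1000, 1)
            break
    return first_any, first_useful


def has_visible_text(chunk):
    """Assumed definition of 'useful': the chunk contains non-whitespace text."""
    return bool(chunk.strip())


if __name__ == "__main__":
    demo_stream = ["", " ", "apple: a round fruit"]
    print(first_chunk_timings(demo_stream, has_visible_text))
```

If the two numbers are far apart, the stream is technically flowing but the user still sees nothing, which is the gap worth closing.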
Also, check hidden reasoning tokens. Mine quietly ate the latency budget, stole my lunch, and left 1088 little footprints in the logs.
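A small check along these lines can go in the logging path. This sketch assumes the response has already been converted to a plain dict and that the token breakdown lives under `usage.output_tokens_details.reasoning_tokens` (the Responses API shape) or `usage.completion_tokens_details.reasoning_tokens` (the Chat Completions shape); verify the field names against the API version you are actually on. The function name and the warning text are made up for the example.

```python
def audit_response(response):
    """Pull the reasoning-token count and service tier out of a response dict.

    'response' is assumed to be a plain dict version of the API response.
    """
    usage = response.get("usage") or {}
    details = (
        usage.get("output_tokens_details")  # Responses API shape (assumed)
        or usage.get("completion_tokens_details")  # Chat Completions shape (assumed)
        or {}
    )
    return {
        "reasoning_tokens": details.get("reasoning_tokens", 0),
        "service_tier": response.get("service_tier"),
    }


def warnings_for(audit, expected_tier="priority"):
    """Return human-readable warnings for a simple dictionary lookup."""
    problems = []
    if audit["reasoning_tokens"] > 0:
        problems.append(f"{audit['reasoning_tokens']} hidden reasoning tokens")
    if audit["service_tier"] != expected_tier:
        problems.append(f"service tier is {audit['service_tier']!r}, expected {expected_tier!r}")
    return problems


if __name__ == "__main__":
    sample = {
        "service_tier": "default",
        "usage": {"output_tokens_details": {"reasoning_tokens": 1088}},
    }
    for problem in warnings_for(audit_response(sample)):
        print("WARNING:", problem)
```

Logging these two values on every request is cheap, and it would have surfaced the 1088 reasoning tokens on day one.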
Still more to clean up, but getting UrLingo’s first streamed output from 13.3s to about 3.0s made the whole product feel different. It went from “is this broken?” to “oh, this thing is alive!” (in Phoebe's high-pitched voice).
Small win, but a huge leap forward! Hope you all find this helpful too!
Website: https://urlingo.app/
App Store: https://apps.apple.com/us/app/urlingo/id6762142203