r/OpenaiCodex

how to make two codex agents communicate on the same network?

Hi guys,

I use Codex on two machines on the same local network, and my workflow is kinda clunky. I send files from one computer to the other, then feed that file to the next Codex agent manually. Is there any way to have both Codex agents communicate over the local network, so when I launch it on each machine they can stay synced and update each other?

For example, if I give instructions to Codex on machine one, I want it to communicate with Codex on machine two, so both chats get updated without me copying files every time.

reddit.com
u/Vivid_Track_3308 — 9 hours ago
▲ 0 r/OpenaiCodex+2 crossposts

Looking for ChatGPT Pro user that doesn't use Codex

I’m a student looking for a person who already has ChatGPT Pro and does NOT use Codex.

I only need access for Codex usage and won’t touch ChatGPT AT ALL, so your regular usage will stay unaffected.

I can contribute/pay a fair share for the Codex usage.

reddit.com
u/ApartRule5095 — 19 hours ago
▲ 56 r/OpenaiCodex+5 crossposts

Tailwind MCP that gives coding agents actual design taste

TL;DR: https://windframe.dev/mcp

Hi everyone 👋

I’ve been working on a Tailwind-native MCP that gives coding agents better design context when generating interfaces.

A lot of AI-generated UI still feels inconsistent because the agent has no real sense of design systems, spacing, typography, or visual structure. It can write Tailwind, but it often lacks the taste and context needed to make the result feel properly designed.

So I built the Windframe MCP around that idea.

It gives coding agents access to curated Tailwind-native styles, design tokens, and styleguides inspired by products like Linear, Notion, and other companies that invest heavily in their design systems.

The difference in output quality has been really impressive. The generated interfaces feel polished and visually cohesive, not like a random collection of Tailwind components.

I’ll keep adding new design styles to the MCP as well, so the library will continue to grow over time.

Give it a try here https://windframe.dev/mcp

Would love any thoughts or feedback :)

u/Speedware01 — 1 day ago
▲ 236 r/OpenaiCodex+2 crossposts

EDIT: A few of you have asked if I'd run this on your repo. I'm doing 5 free in May to refine the methodology (all run locally - I won't see your code). If you're debating models, harnesses, reasoning levels, or AGENTS.md and SKILL.md edits, DM me with the decision you're trying to make, and we can go from there! I'm especially interested in organizations doing evaluations (I am in one, and run into this problem frequently at work).

TLDR; OpenAI cooked with GPT-5.5

Opus 4.7 writes smaller patches. GPT-5.5 writes patches that more often survive review. Which one you want depends on whether "small" means disciplined or incomplete in your repo.

I ran both models, plus GPT-5.4, on 56 real coding tasks from two open-source repos: 27 tasks from Zod and 29 from graphql-go-tools. (These codebases were selected arbitrarily and may not represent your experience, which is exactly why running your own benchmarks matters!) Each model ran in its native agent harness at default settings: Anthropic models in Claude Code, OpenAI models in OpenAI Codex CLI.

The result was not "one model wins everything." GPT-5.5 was the best shipping default across these runs. By "shipping," I mean the model I would most often trust to produce a patch that passes tests, matches the intended human change, and survives code review. Opus 4.7 was still doing something valuable: it wrote much smaller patches.

On Zod, that looked like a real tradeoff. On graphql-go-tools, it looked more like under-implementation.

GPT-5.5 ships more often. Opus 4.7 ships smaller. Which one wins on your repo depends on whether your bottleneck is review or footprint.

That distinction is why repo-specific evals matter. Public benchmarks flatten model behavior into one number aggregated at massive scale. Real code turns it into a workflow decision on your specific codebase and standards.

I used Stet, an evaluation framework I am building for real-repo coding-agent benchmarks, to grade more than test pass/fail: behavioral equivalence to the human patch, code-review acceptability, footprint risk, and craft/discipline rubrics. This post is not a claim about all coding tasks. It is a concrete look at how three frontier models behaved on two real codebases.

| Model | Harness | Reasoning Level |
| --- | --- | --- |
| Opus 4.7 | Claude Code | high |
| GPT-5.4 | Codex CLI | high |
| GPT-5.5 | Codex CLI | high |

The short version

Across 56 scored tasks:

| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
| --- | --- | --- | --- |
| Tests pass | 33/56 | 31/56 | 38/56 |
| Equivalent to human patch | 19/56 | 35/56 | 40/56 |
| Clean pass: tests + review | 10/56 | 11/56 | 28/56 |
| Mean footprint risk, lower is better | 0.20 | 0.34 | 0.32 |
| Mean time/task | 11m18s | 8m24s | 6m56s |
| Estimated run cost | $3.43 | $2.39 | $2.86 |

GPT-5.5 is the quality leader. It passes the most tests, matches the human patch most often, and clears the reviewer about three times as often as Opus.
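The rates behind that summary can be re-derived from the table. A minimal sketch using only the counts reported above (the dictionary layout and function name are mine, not part of Stet):

```python
# Counts copied from the summary table above; 56 scored tasks per model.
TOTAL_TASKS = 56
tests_pass = {"Opus 4.7": 33, "GPT-5.4": 31, "GPT-5.5": 38}
clean_pass = {"Opus 4.7": 10, "GPT-5.4": 11, "GPT-5.5": 28}


def rate(count: int, total: int = TOTAL_TASKS) -> float:
    """Return count/total as a percentage."""
    return 100.0 * count / total


for model in tests_pass:
    print(
        f"{model}: tests {rate(tests_pass[model]):.1f}%, "
        f"clean pass {rate(clean_pass[model]):.1f}%"
    )

# How many times more often GPT-5.5 produced a clean pass than Opus 4.7.
clean_ratio = clean_pass["GPT-5.5"] / clean_pass["Opus 4.7"]
print(f"GPT-5.5 vs Opus 4.7 clean-pass ratio: {clean_ratio:.1f}x")
```

On the clean-pass row the ratio is 2.8x, which is in line with the "about three times" reviewer claim.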

Opus is the footprint leader. Its patches are smaller and lower-risk by Stet's footprint model. But a small patch is only good when it is complete. The recurring Opus failure mode is passing the visible tests while missing companion work the human PR included.

GPT-5.5 is also the efficiency leader on tokens and wall-clock. It used fewer input tokens, fewer output tokens, and less summed agent time than either competitor. GPT-5.4 is still the cost leader because its pricing is lower, but the cost advantage did not offset the clean-pass gap in these runs.

The repo split is where the result gets interesting:

| Repo | Model | Tests | Equiv yes | Review pass | Clean pass |
| --- | --- | --- | --- | --- | --- |
| Zod, 27 scored tasks | Opus 4.7 | 12 | 11 | 6 | 5 |
| Zod, 27 scored tasks | GPT-5.4 | 9 | 18 | 10 | 5 |
| Zod, 27 scored tasks | GPT-5.5 | 12 | 18 | 14 | 10 |
| graphql-go-tools, 29 tasks | Opus 4.7 | 21 | 8 | 5 | 5 |
| graphql-go-tools, 29 tasks | GPT-5.4 | 22 | 17 | 6 | 6 |
| graphql-go-tools, 29 tasks | GPT-5.5 | 26 | 22 | 19 | 18 |

On Zod, GPT-5.5 and Opus tie on tests. GPT-5.5 wins on reviewer judgment. Opus wins on diff size.

On graphql-go-tools, GPT-5.5 wins outright. It passes more tests, produces far more clean passes, and is closer to the human patch. Opus still writes the smallest patches, but the small-patch strategy misses too much.

Full scorecard

| Metric | Opus 4.7 | GPT-5.4 | GPT-5.5 |
| --- | --- | --- | --- |
| Code-review pass | 11/56 | 16/56 | 33/56 |
| Code-review avg: correctness + bug safety | 2.33 | 2.59 | 3.08 |
| - Correctness | 2.11 | 2.60 | 3.16 |
| - Introduced-bug safety | 2.55 | 2.56 | 3.04 |
| - Maintainability, GraphQL only | 2.07 | 2.55 | 3.03 |
| Custom grader avg, 8 rubrics | 2.33 | 2.40 | 2.62 |
| Craft score, 0-4 | 2.41 | 2.54 | 2.78 |
| - Clarity / coherence / robustness | 2.56 / 1.95 / 1.92 | 2.75 / 2.18 / 2.43 | 2.91 / 2.51 / 2.69 |
| Discipline score, 0-4 | 2.20 | 2.16 | 2.36 |
| - Scope discipline / diff minimality | 2.39 / 2.42 | 2.18 / 2.28 | 2.45 / 2.46 |
| Total input tokens | 239.1M | 222.3M | 201.8M |
| Total output tokens | 1.29M | 1.09M | 0.72M |

The quality-score rows are there to avoid treating "more tests passed" as the whole story. Code review is one grader: correctness, introduced-bug risk, and maintainability where available. The custom grader average is separate: eight additive rubrics split into five craft dimensions and three discipline dimensions. Across both layers, GPT-5.5 is not merely preferred in the abstract. It is rated higher on correctness, introduced-bug safety, GraphQL maintainability, coherence, robustness, scope discipline, and diff minimality relative to the requested task. Opus still wins the mechanical footprint row, which is the useful tension: smaller diffs, but not consistently more disciplined diffs.
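One way to sanity-check the scorecard: if the custom grader average is a plain mean over all eight rubrics, it should equal the craft score weighted by five plus the discipline score weighted by three, divided by eight. That relationship is an assumption on my part, not something the scorecard states, but the reported numbers are consistent with it to within rounding:

```python
# Craft (5 rubrics) and discipline (3 rubrics) scores from the scorecard above.
scores = {
    "Opus 4.7": {"craft": 2.41, "discipline": 2.20, "reported_avg": 2.33},
    "GPT-5.4": {"craft": 2.54, "discipline": 2.16, "reported_avg": 2.40},
    "GPT-5.5": {"craft": 2.78, "discipline": 2.36, "reported_avg": 2.62},
}


def combined_average(craft: float, discipline: float) -> float:
    """Weighted mean over 8 rubrics: 5 craft dimensions + 3 discipline dimensions."""
    return (5 * craft + 3 * discipline) / 8


for model, s in scores.items():
    derived = combined_average(s["craft"], s["discipline"])
    print(f"{model}: derived {derived:.4f}, reported {s['reported_avg']:.2f}")
```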

How the benchmark works

Each task is derived from a real merged commit. The model gets a frozen repo snapshot, a prompt describing the change, and one attempt to produce a patch, running in its native shipped agent harness with no Stet-side scaffolding: Opus 4.7 in Claude Code (claude -p); GPT-5.5 and GPT-5.4 in OpenAI Codex CLI (codex exec); both harnesses at default settings. Stet applies the patch and runs the task's tests in an isolated container.
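The per-task flow can be sketched roughly as follows. This is not Stet's code: the `Task` and `Attempt` types and the `run_task` function are hypothetical, and the harness call and the containerized test run are passed in as plain callables so the sketch stays runnable without either CLI installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Task:
    snapshot: str  # hypothetical identifier for the frozen repo snapshot
    prompt: str    # description of the change to make


@dataclass(frozen=True)
class Attempt:
    patch: Optional[str]  # None when the agent produced no patch
    tests_pass: bool


def run_task(
    task: Task,
    run_agent: Callable[[str, str], Optional[str]],
    apply_and_test: Callable[[str, str], bool],
) -> Attempt:
    """One attempt per task: ask the agent for a patch, then test it in isolation."""
    patch = run_agent(task.snapshot, task.prompt)
    if not patch:
        # No patch counts as a failed attempt; there is no retry.
        return Attempt(patch=None, tests_pass=False)
    return Attempt(patch=patch, tests_pass=apply_and_test(task.snapshot, patch))


# Stubs stand in for the real harness call and the containerized test run.
demo = run_task(
    Task(snapshot="snapshot-1", prompt="Fix the validator"),
    run_agent=lambda snapshot, prompt: "diff --git a/x b/x",
    apply_and_test=lambda snapshot, patch: True,
)
print(demo)
```

Keeping the agent and the test runner behind callables means the same loop can score any arm, which is the property a cross-model comparison needs.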

Then Stet grades the result beyond pass/fail:

  • Tests: did the patch satisfy the executable acceptance tests?
  • Equivalence: does the candidate patch accomplish the same behavioral change as the original human patch?
  • Code review: would a reviewer accept the patch, considering correctness, introduced-bug risk, maintainability, and edge cases?
  • Footprint risk: how much review and regression surface did the patch create?
  • Craft/discipline rubrics: clarity, simplicity, coherence, intentionality, robustness, instruction adherence, scope discipline, and diff minimality.

Every model ran once per task with a single seed. The judge model for equivalence and rubrics was GPT-5.4, run with identical rubric versions across all three arms. Each patch was scored independently — the judge sees the patch and the task, not the arm label or the model that produced it. There is no dual-rater calibration, so treat absolute scores as directional; the cross-arm deltas are the thing to trust.

Tests are signal, not the finish line

The most useful row in the table is not tests. It is clean pass: tests pass and the code-review grader accepts the patch.

On Zod, Opus and GPT-5.5 both passed 12 of 27 scored tasks. If you stop there, the models look tied. But GPT-5.5 produced 10 clean passes; Opus produced 5.

On graphql-go-tools, the same pattern was amplified. GPT-5.5 passed tests on 26 of 29 tasks and produced 18 clean passes. Opus passed tests on 21 tasks but produced only 5 clean passes.

That is the gap you feel in code review. The tests say "this patch probably works." The reviewer asks "is this the patch we want to maintain?"
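A minimal sketch of how the clean-pass column relates to the test and review columns. The records below are made up purely to show the mechanics, and `GradedPatch` is a hypothetical type, not Stet's schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedPatch:
    task_id: str       # hypothetical identifier
    tests_pass: bool   # executable acceptance tests passed
    review_pass: bool  # code-review grader accepted the patch


def is_clean_pass(patch: GradedPatch) -> bool:
    """A clean pass requires both signals, not either one alone."""
    return patch.tests_pass and patch.review_pass


# Made-up records covering all four combinations.
results = [
    GradedPatch("task-a", tests_pass=True, review_pass=True),
    GradedPatch("task-b", tests_pass=True, review_pass=False),
    GradedPatch("task-c", tests_pass=False, review_pass=True),
    GradedPatch("task-d", tests_pass=False, review_pass=False),
]

tests = sum(p.tests_pass for p in results)
clean = sum(is_clean_pass(p) for p in results)
print(f"tests pass: {tests}/{len(results)}, clean pass: {clean}/{len(results)}")
```

Because the two signals are combined with a logical AND, the clean-pass count can never exceed either the test-pass or the review-pass count.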

One GraphQL task shows the difference. PR #1001 changed an HTTP datasource OnFinished hook so consumers could inspect request and response metadata. All three models passed tests and were judged equivalent. Only GPT-5.5 cleared code review. The other two got warnings around API shape, raw HTTP object exposure, and robustness at the hook boundary.

That is not a benchmark trick. It reflects normal engineering culture, where code is reviewed: three patches can satisfy the same test and still differ materially in review quality. You only want to merge code that is high-quality and maintainable, not code that merely works.

What the reviewer saw

The code review and craft/discipline rows explain why the result is not reducible to "GPT-5.5 changes more files." Two patch autopsies make the numbers less abstract.

Zod async codecs and defaults. The task was to make codec pipelines work with async transforms, prevent defaults from becoming undefined, and generate stub package manifests for the build. All three models failed tests. If you stop at the test row, the task tells you nothing.

The reviewer found a real ordering underneath. Opus changed 8 files and missed central semantics: defaults could still allow undefined, core codec definitions remained synchronous, generated stubs were not published, and prefault() was tightened even though the request was about .default(). GPT-5.4 got closer with an 11-file patch and was judged behaviorally equivalent, but it still over-tightened adjacent API by restricting prefault. GPT-5.5 also failed tests, but it was judged equivalent and scored better on correctness and introduced-bug risk because it covered the schema/build behavior more cleanly: codec/default tests, version metadata, stub-manifest scripts, and the relevant packages/zod/src/v4/*/schemas.ts surfaces.

That is a different kind of signal from pass/fail. It says GPT-5.5 was not merely getting luckier tests; even on a miss, it more often moved the right pieces.

GraphQL Apollo-compatible validation. PR #1169 aligned field-selection validation errors with GraphQL spec and Apollo Router conventions. All three models produced patches. All three passed tests. Only GPT-5.5 cleared equivalence and review.

Opus touched 11 files and passed tests, but missed enum and wrapped-scalar leaf validation, pointed some leaf-selection locations at the field instead of the selection set, left an inline-fragment message non-spec-compliant, and did not apply validation status uniformly. GPT-5.4 touched 12 files and also passed tests, but broadened behavior in the wrong places: unconditional validation metadata, incomplete enum/wrapped scalar handling, broad request-error conversion, and stale compatibility API.

GPT-5.5 touched fewer files than either one, 10 total and 6 non-test, while still adding more targeted behavior: aligned field-selection messages, requested locations, and centralized Apollo validation metadata. This is the clean reviewer example: tests saw three passes; semantic grading saw one patch that actually matched the convention the PR was trying to establish.

This is what the score rows are trying to summarize. GPT-5.5's biggest review lead is correctness: 3.16 versus 2.60 for GPT-5.4 and 2.11 for Opus. The custom graders say the same thing from another angle: GPT-5.5 leads coherence and robustness because its patches more often carry the change through the repo's existing surfaces instead of stopping at the first passing path.

The discipline row is the one I would not overclaim. GPT-5.5 leads, but narrowly: 2.36 versus 2.20 for Opus and 2.16 for GPT-5.4. Opus wins raw footprint. GPT-5.5 narrowly wins task-relative discipline. The grader is separating "small" from "appropriately scoped." A patch can be compact and still undisciplined if it stops before the task is done.

What Opus is doing

Opus 4.7 is cautious. It writes smaller patches, touches fewer files, and has the lowest footprint risk in both repos.

On Zod, that caution is often attractive. Zod has many contained tasks where the correct move is a precise source edit, a type change, and maybe a small test update. Opus tied GPT-5.5 on tests while keeping the patch footprint lower.

But Opus's restraint has a recurring failure mode: it implements the headline behavior and stops before the companion work is done.

Zod made this easy to see. Zod has parallel Node and Deno trees. The tests exercise the main src/ path, so a patch can pass while leaving Deno mirrors stale. On several Opus test-pass-but-not-equivalent tasks, that is exactly what happened. A CIDR validation change passed tests after Opus touched four files. GPT-5.5 touched eleven, because it updated the parallel distribution surface too. The judge marked Opus non-equivalent because the human patch did the companion work.
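A check for this kind of gap could look like the sketch below. The `src` and `mirror` directory names and the one-to-one file mapping are assumptions for illustration only; a real mirror tree may rename or transform files, so this is not a description of Zod's layout or tooling.

```python
from pathlib import PurePosixPath

# Hypothetical layout: every file under SRC_ROOT has a twin under MIRROR_ROOT.
SRC_ROOT = PurePosixPath("src")
MIRROR_ROOT = PurePosixPath("mirror")


def stale_mirrors(changed_paths: set[str]) -> list[str]:
    """Return mirror paths that should have changed with a source edit but did not."""
    missing = []
    for path in sorted(changed_paths):
        p = PurePosixPath(path)
        if SRC_ROOT in p.parents:
            twin = MIRROR_ROOT / p.relative_to(SRC_ROOT)
            if str(twin) not in changed_paths:
                missing.append(str(twin))
    return missing


# A patch that updates one mirror file but forgets the other.
patch_files = {"src/checks.ts", "src/tests/checks.test.ts", "mirror/checks.ts"}
print(stale_mirrors(patch_files))
```

A test suite that only exercises the source tree would pass for this patch, while the check above would still report the forgotten mirror file.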

The same behavior looked worse on graphql-go-tools. That repo is a Go federation engine with planner, datasource, hook, validation, and runtime paths that need to line up. A minimal patch is not enough if the real change spans several engine surfaces.

On PR #1155, the task covered repeated scalar fields in a gRPC datasource, request building, response marshaling, null and invalid responses, error status information, disabled datasources, and dynamically-created clients. Opus produced no patch. GPT-5.5 passed tests, matched the human patch, and cleared review.

That is the key distinction: Opus's small patches can be discipline on local tasks and under-implementation on integration-heavy tasks.

What changed from GPT-5.4 to GPT-5.5

GPT-5.5 is not just GPT-5.4 with higher pass rates. The failure modes shift.

GPT-5.4 often sees the right general approach but fails in execution. On Zod it had 18 equivalence yes judgments, matching GPT-5.5, but only 9 test passes. The equivalence grader recognized the intended behavior; executable validation still failed.

GPT-5.5 closes that gap. It keeps more of the broad integration behavior while producing fewer broken patches.

Three Zod examples are useful.

First, a schema-to-TypeScript generator. The task asked for a recursive visitor over Zod schema definitions. Opus and GPT-5.5 both recognized it as an implementation task and built the visitor. GPT-5.4 produced repository-instruction files instead of the feature. That is not a subtle algorithmic miss. It misclassified the work.

Second, a recursive parser fix. Both GPT models reached for visit-count tracking. GPT-5.4 added an inProgress sentinel and reset logic. GPT-5.5 kept the count-and-cache-error behavior and removed the extra state. Same broad idea, fewer moving parts, passing tests.

Third, CIDR validation. GPT-5.4 and GPT-5.5 had similar core algorithms: split on /, validate the address, validate the prefix. GPT-5.5 updated the Deno mirrors. GPT-5.4 did not. This is not a reasoning leap. It is repo hygiene.
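For readers unfamiliar with the shape of that algorithm, here is a rough Python sketch of "split on /, validate the address, validate the prefix". It is not the code either model wrote and not Zod's implementation (which is TypeScript); it leans on the standard-library `ipaddress` module for the address check.

```python
import ipaddress


def is_valid_cidr(value: str) -> bool:
    """Split on '/', validate the address, then validate the prefix length."""
    address_part, sep, prefix_part = value.partition("/")
    if not sep:
        return False  # no '/' separator at all
    try:
        address = ipaddress.ip_address(address_part)
    except ValueError:
        return False
    # Require plain ASCII digits so inputs like '+8', ' 8' or '24/1' are rejected.
    if not (prefix_part.isascii() and prefix_part.isdigit()):
        return False
    max_prefix = 32 if address.version == 4 else 128
    return int(prefix_part) <= max_prefix


for candidate in ["192.168.0.0/24", "192.168.0.0/33", "2001:db8::/64", "10.0.0.1"]:
    print(candidate, is_valid_cidr(candidate))
```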

On graphql-go-tools, the separation is more operational. PR #1232 required deduplicating identical single fetches while rewriting dependency references that pointed at removed duplicates. A patch can look plausible and still leave fetch dependencies stale. GPT-5.5 was the only model to pass tests, match the human behavior, and clear review.
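To make the stale-dependency risk concrete, here is a small sketch of the general technique: drop duplicates, then rewrite every reference to a removed duplicate so it points at the surviving copy. The `Fetch` type, the `signature` key, and the function name are hypothetical, and the sketch assumes two fetches with the same signature are fully interchangeable. It is written in Python for brevity and is not the graphql-go-tools implementation.

```python
from dataclasses import dataclass


@dataclass
class Fetch:
    fetch_id: int
    signature: str         # hypothetical key: equal signatures mean identical fetches
    depends_on: list[int]  # ids of fetches that must run first


def dedupe_fetches(fetches: list[Fetch]) -> list[Fetch]:
    """Drop duplicate fetches and point dependencies at the surviving copy."""
    survivor_by_signature: dict[str, int] = {}
    replacement: dict[int, int] = {}  # removed id -> surviving id
    kept: list[Fetch] = []
    for fetch in fetches:
        if fetch.signature in survivor_by_signature:
            replacement[fetch.fetch_id] = survivor_by_signature[fetch.signature]
        else:
            survivor_by_signature[fetch.signature] = fetch.fetch_id
            kept.append(fetch)
    # Rewrite references; skipping this step leaves dependents pointing at removed ids.
    for fetch in kept:
        rewritten: list[int] = []
        for dep in fetch.depends_on:
            dep = replacement.get(dep, dep)
            if dep not in rewritten:
                rewritten.append(dep)
        fetch.depends_on = rewritten
    return kept


plan = [
    Fetch(0, "users", []),
    Fetch(1, "users", []),       # duplicate of fetch 0
    Fetch(2, "orders", [0, 1]),  # depends on both copies
]
for fetch in dedupe_fetches(plan):
    print(fetch)
```

Note that this sketch mutates the `depends_on` lists of the surviving fetches in place; a patch that performs only the first loop is the "plausible but stale" failure described above.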

The pattern is: GPT-5.5 does more of the boring integration work that turns a clever local fix into a shippable repo change.

The cost of doing more

GPT-5.5 writes larger patches than Opus.

On graphql-go-tools, average patch size was about 33 KB for GPT-5.5, 27 KB for GPT-5.4, and 19 KB for Opus. The footprint scores move accordingly: Opus 0.19, GPT-5.4 0.32, GPT-5.5 0.34.

That is not free. Bigger patches are harder to review, easier to conflict, and more likely to touch sensitive paths. If your workflow is dominated by auditability, Opus still has a real advantage.

But the craft rubric shows why raw size is not enough. On GraphQL, GPT-5.5 had the largest patches and still slightly led diff minimality relative to the task. The grader is not asking "who changed the fewest bytes?" It is asking "who changed the fewest bytes needed to solve the actual request?"

That distinction is the whole benchmark in miniature. A 5 KB patch that misses required surfaces is not more minimal than a 20 KB patch that finishes the job.

The cost story also changed between repos. On Zod, Opus and GPT-5.5 looked similar operationally: Opus used 53.0M input tokens and 359K output tokens; GPT-5.5 used 50.4M input and 290K output. Opus was faster on summed agent time, 1.99h versus 2.32h, and slightly cheaper, $45.53 versus $46.69.

GraphQL reversed that. Opus used 186.1M input tokens and 934K output tokens. GPT-5.5 used 151.4M input and 431K output. Opus took 8.56h of summed agent time; GPT-5.5 took 4.16h. That does not look like Opus sandbagging. It looks like Opus working longer, emitting more tokens, and still converging on smaller, less complete patches.

The behavior metrics point the same way. On GraphQL, Opus averaged 3.17 explicit planning calls per task; GPT-5.5 averaged zero. Opus made 10.2 patch calls per task; GPT-5.5 made 9.9. Opus was not bailing early. The difference was exploration style: GPT-5.5 made about twice as many shell calls and more search calls, while Opus spent more of its budget in planning and patch rewrite churn. In this repo, broader repo inspection appears to have mattered more than deliberating over a narrower patch.

Model personalities, in one paragraph each

Opus 4.7 — under-reach. Conservative, precise, low-footprint. Strong when the task is local and the desired change has a narrow surface. Weak when the human patch includes companion surfaces the tests do not fully cover. Its failure mode is often "tests pass, but this is not the same change."

GPT-5.4 — right shape, wrong execution. Directionally capable but uneven. It often finds the intended shape, which is why its equivalence numbers are respectable, but it is more prone to stale mirrors, extra bookkeeping, unearned refactors, and patches that the judge likes more than the test suite does.

GPT-5.5 — broader, bigger footprint. More complete on integration surface. It is more likely to update the surrounding code, pass review, and convert intended behavior into passing code. Its risk is patch footprint: when it is wrong, it can be wrong over more files.

Why this matters

The practical question is not "which model is best?"

The practical question is:

For this repo, under this harness, on the kinds of tasks we actually ship, which model produces patches we trust?

The answer changed by repo.

Zod made GPT-5.5 versus Opus look like a tradeoff: same test pass count, GPT-5.5 better reviewer alignment, Opus smaller patches.

graphql-go-tools made the tradeoff less symmetrical: GPT-5.5 was simply more shippable on the measured tasks, while Opus's small-patch advantage came with too much missed integration work.

That is why Stet is built around real repo tasks instead of synthetic prompts. Your repo has its own mirror trees, codegen surfaces, test blind spots, hook conventions, planner invariants, and review standards. You also have your own AGENTS.md, skills, model and harness settings, etc. Those details decide whether a model's "personality" is an asset or a liability.

Caveats

Fifty-six scored tasks is still small. One task swing moves a repo-level rate by a few points. Every model ran once per task. Some close calls would flip on rerun.
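To put a number on "a few points", the swing from a single task is just 100 divided by the task count. A quick sketch with the task counts from this post (the function name is mine):

```python
def swing_points(task_count: int) -> float:
    """Percentage points that one task contributes to a pass rate."""
    return 100.0 / task_count


for label, count in [("Zod", 27), ("graphql-go-tools", 29), ("combined", 56)]:
    print(f"{label}: one task = {swing_points(count):.1f} percentage points")
```

That works out to roughly 3.7 points on Zod, 3.4 on graphql-go-tools, and 1.8 on the combined set.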

The equivalence and rubric judge was GPT-5.4. That can introduce family bias. I do not think it explains the whole result: GPT-5.5 beats GPT-5.4 decisively, Opus still wins footprint, and many Opus equivalence losses are concrete missed files or missing companion surfaces.

Results are also harness-conditional. Claude Code and Codex CLI bring different system prompts, planning loops, and tool surfaces, and each model ran in the harness its vendor ships. Running Opus 4.7 inside Codex via API, or GPT-5.5 inside Claude Code, would change the picture. The numbers here describe these models in the harnesses real engineers actually use them in — not the models in isolation.

Takeaway

If I had to summarize the 56 scored tasks:

  • GPT-5.5 is the best default shipping model across these two repos.
  • Opus 4.7 is still the low-footprint model and can be preferable when narrow diffs matter most.
  • GPT-5.4 is cheaper per task, but not enough better on cost to overcome the clean-pass gap here.
  • Tests alone would have hidden the most important result.
  • The same model ranking changed by repo, which is the point.

The interesting model eval is no longer "can the model solve a hard prompt?" It is "what kind of patch does this model tend to produce in my codebase, and does that match how my team ships software?"

u/bisonbear2 — 2 days ago

Codex Local Model Switcher (Release) for Ollama Models

Hey everyone, personally I feel like Codex is the best coding agent around. It continues to develop into an incredibly capable harness.

That being said, Ollama recently added native support for Codex. This means that those of you who want to run your own local models in the Codex desktop app can now do so.

So I created an easy GUI that detects your available local models and runs the Ollama profile switch from GPT to your local models.

This is not meant to replace GPT; it's meant to give you more options.

Important note: the Ollama profile and your normal account profile do not share conversation context. This is a good supplement for when you are between sessions, and performance obviously depends on your hardware and on which Ollama model you choose.

https://github.com/MarzEnt87/Codex_Model_Switcher

https://preview.redd.it/u5a5gm2q872h1.png?width=2678&format=png&auto=webp&s=1eaae7967f61f015562e1d73701c640872d84073

reddit.com
u/PTXStudio — 2 days ago
▲ 17 r/OpenaiCodex+1 crossposts

Updated the BCSC-bound ophthalmology GPT after feedback here — would value thoughts on whether it is better or overcorrected

ANNOUNCEMENT:

V2.0 of the AAO CustomGPT is flagged due to OpenAI policy restrictions.

I am currently working on deploying it on another Open source platform.

MEANWHILE, OLDER VERSION 1.0 IS BACK! DM ME FOR ACCESS

__________________________________________________________

Background: ophthalmologist.

I updated the BCSC-bound ophthalmology GPT based on feedback from people here who stress-tested it.
Main failure mode identified: it could sometimes accept the user’s framing too early, then build a polished ophthalmology-sounding explanation around the wrong premise.

So the updated version now emphasizes:

  • objective findings before interpretation
  • authority claims as context, not evidence
  • anatomy/surgical-state checks before differential
  • morphology/location before labeling slit-lamp findings
  • benign/lookalike-first reasoning
  • stopping earlier when the premise is weak

Small internal before/after stress test, n=26 prompts:

LLM-assisted internal scoring. Not a clinical validation study.

u/Other-Vanilla-5765 — 3 days ago

Why is Codex consuming credits so fast?

Hi guys,

I started using Codex recently, and in the first few days my quota felt like it lasted way longer. I was asking the same general type of stuff, but the credits were not disappearing this fast. Now I feel like Codex consumes them much quicker, I hit the limit much sooner, and then it starts pushing me to subscribe to Pro or buy more credits.

Maybe I am wrong, not sure though, but I honestly feel this is done on purpose to make people buy more. Or maybe there is some normal reason for it and I am not sure what it is yet. What do you guys think causes the quota to get used up so fast compared to the first days of using the tool? Did you guys have a similar experience?

reddit.com
u/Vivid_Track_3308 — 3 days ago
▲ 39 r/OpenaiCodex+14 crossposts

I added dedicated AWS / EKS support to KubeShark.

Mini recap:

KubeShark is my Kubernetes skill for Claude Code and Codex.

It helps AI agents generate, review, and refactor Kubernetes manifests without falling into the usual LLM traps: missing security contexts, deprecated API versions, broken selectors, wildcard RBAC, unsafe probes, missing resource requests, and rollout configs that look okay but fail under real traffic.

The important part is that KubeShark is failure-mode-first. It does not just tell the model “write good Kubernetes”. It forces the model to reason about what can go wrong before it generates YAML, and then return validation and rollback guidance as part of the answer.

That matters a lot with Kubernetes, because many bad manifests are accepted by the API server and only fail later at runtime.

Repo: https://github.com/LukasNiessen/kubernetes-skill

---

Now what’s new:

KubeShark now has special dedicated AWS / EKS support.

When the task involves EKS, AWS, IRSA, EKS Pod Identity, AWS Load Balancer Controller, EBS/EFS CSI, AWS VPC CNI, or Karpenter, KubeShark switches into EKS-aware guidance.

That matters because EKS is “just Kubernetes” until identity, load balancing, storage, pod networking, and node provisioning enter the picture.

Common LLM mistakes include:

  • putting AWS access keys into Kubernetes Secrets
  • mixing IRSA and EKS Pod Identity assumptions
  • using nginx annotations with AWS Load Balancer Controller
  • treating EBS like ReadWriteMany storage
  • recommending Karpenter while omitting resource requests
  • assuming NetworkPolicy works without checking the CNI/policy engine

Example guidance KubeShark now keeps in mind:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-app

It also knows that EBS is usually RWO and zone-sensitive, EFS is the RWX option, and Karpenter depends heavily on good workload requests.
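As an illustration of the EBS versus EFS point, a review step could flag a PersistentVolumeClaim that asks for ReadWriteMany from an EBS-backed StorageClass. This is a hypothetical sketch, not KubeShark's implementation: the function name, the `gp3` class name, and the class-to-provisioner mapping are assumptions, and the manifest is represented as an already-parsed dictionary.

```python
# Provisioner name used by the AWS EBS CSI driver.
EBS_PROVISIONER = "ebs.csi.aws.com"


def check_pvc(pvc: dict, provisioner_by_class: dict[str, str]) -> list[str]:
    """Flag a PVC that requests ReadWriteMany from an EBS-backed StorageClass."""
    spec = pvc.get("spec", {})
    storage_class = spec.get("storageClassName", "")
    access_modes = spec.get("accessModes", [])
    warnings = []
    is_ebs = provisioner_by_class.get(storage_class) == EBS_PROVISIONER
    if is_ebs and "ReadWriteMany" in access_modes:
        warnings.append(
            f"{storage_class}: EBS is usually ReadWriteOnce; "
            "consider EFS for ReadWriteMany"
        )
    return warnings


# Hypothetical cluster data: a StorageClass named "gp3" backed by the EBS CSI driver.
classes = {"gp3": EBS_PROVISIONER}
claim = {
    "kind": "PersistentVolumeClaim",
    "spec": {"storageClassName": "gp3", "accessModes": ["ReadWriteMany"]},
}
print(check_pvc(claim, classes))
```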

So instead of generic Kubernetes advice, you get EKS-aware manifest generation and review.

u/trolleid — 4 days ago
▲ 18 r/OpenaiCodex+2 crossposts

How I run Codex from my phone and laptop using a remote Mac mini

I wanted an always-on Codex setup that I could use from my MacBook or iPhone without keeping my main machine tied up. The basic idea is simple: run Codex on a Mac mini, then connect to that machine as a remote project from the ChatGPT apps.

Here is the rough setup:

  1. Get into the Mac mini

If you already own a Mac mini, you can use that. If you don't, you can rent a Mac VPS (hyperbox.sh is just one option). Use Screen Sharing, VNC, or your provider's remote desktop.

  2. Start Codex on the Mac mini

Open Terminal on the Mac mini and run codex from the CLI:

```bash
codex --yolo
```

Codex gives you a device-login URL and code. Open the URL on your normal browser, sign in with your ChatGPT account, paste the code, then go back to the Mac mini terminal. Accept the trusted directory prompt so Codex can work in that folder.

  3. Pair your phone

Open the ChatGPT desktop app on your MacBook, go through the Codex mobile setup flow, and scan the QR code with the ChatGPT app on your iPhone. After approving it, enable the relevant connection settings like keeping the Mac awake and enabling computer use.

  4. Add the remote project on your MacBook

In the ChatGPT desktop app, go to the connections area, choose the option to control other devices, authorize on chatgpt.com, then add the Mac mini as a remote project. After that, prompts from the iPhone or MacBook hit the same Codex session on the Mac mini. Any prompts you make will show up across devices.

The result is a small always-on coding workstation: Codex runs on the Mac mini, while your phone and laptop act like lightweight "satellite" controllers.

u/Reibmachine — 5 days ago
▲ 28 r/OpenaiCodex+5 crossposts

Codex now works directly in Chrome on macOS and Windows.

It’s even better at working with apps and sites in Chrome, and now works in parallel across tabs in the background without taking over your browser.

u/dorugamer — 7 days ago

Business plan limit

Has anyone previously signed up for the ChatGPT Business Plan with one free seat? I signed up and felt like the Codex limit ran out very quickly. Sometimes, as soon as I opened the program, the limit was already gone, even though I hadn't used anything. P.S. Is it true that the limit is lower than the Plus plan?

reddit.com
u/TansawaT — 5 days ago
▲ 8 r/OpenaiCodex+5 crossposts

Audrey 1.0 is out: local-first memory guard for Claude Code / Codex-style agents

I posted Audrey here about a month ago when it was still rough. I kept grinding and turned it into a real 1.0 release.

What it does now:

  • local-first memory for agents, not another cloud memory service
  • pre-action checks before risky tool calls, with allow / warn / block verdicts
  • redacted tool-trace receipts so the system can learn from previous mistakes without leaking raw secrets
  • GuardBench artifacts so the claims are auditable instead of just vibes
  • Node package + Python client + MCP/server path

The point is simple: the model can propose, but the host has to decide. If the rule only lives in prompt text, it is advice. If it runs at the tool boundary with evidence, it becomes infrastructure.
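A toy sketch of what "runs at the tool boundary" can mean in practice. This is not Audrey's API: the `Verdict` enum, the substring rules, and both function names are hypothetical, and a real guard would match on far richer evidence than substrings.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


# Higher number means stricter verdict.
SEVERITY = {Verdict.ALLOW: 0, Verdict.WARN: 1, Verdict.BLOCK: 2}


def pre_action_check(command: str, rules: dict[str, Verdict]) -> Verdict:
    """Return the strictest verdict among the rules whose pattern appears in the command."""
    verdict = Verdict.ALLOW
    for pattern, rule_verdict in rules.items():
        if pattern in command and SEVERITY[rule_verdict] > SEVERITY[verdict]:
            verdict = rule_verdict
    return verdict


def run_tool(command: str, rules: dict[str, Verdict], execute: Callable[[str], str]) -> str:
    """Host-side gate: the model proposes `command`, the host decides whether it runs."""
    verdict = pre_action_check(command, rules)
    if verdict is Verdict.BLOCK:
        return "blocked"
    if verdict is Verdict.WARN:
        print(f"warning: {command!r} matched a remembered failure")
    return execute(command)


# Hypothetical remembered failures.
rules = {"rm -rf": Verdict.BLOCK, "git push --force": Verdict.WARN}
print(run_tool("rm -rf build", rules, execute=lambda cmd: "ran"))
print(run_tool("ls", rules, execute=lambda cmd: "ran"))
```

The design point is that `execute` is only reachable through `run_tool`, so a blocked command never runs regardless of what the prompt says.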

GitHub: https://github.com/Evilander/Audrey

Paper / artifact preview: https://paper-site-r3jdakujn-evilanders-projects.vercel.app

arXiv is submitted but still on hold, so I am not pretending there is a public arXiv ID yet. I would rather be exact than hype fake status.

If you are building agent tooling, I would genuinely like hard feedback. Especially on the GuardBench scenarios and where pre-action memory should block vs warn.

u/MomSausageandPeppers — 8 days ago
▲ 18 r/OpenaiCodex+5 crossposts

The entire codebase was made 100% with GPT Codex, from 5.3 up to 5.5 now.

It's a 100% JavaScript stack: Node on the server and Vue + native HTML canvas on the client.

I'm using the Tibia assets as placeholders for now; the plan is to replace them entirely later.


u/Top-Assumption7555 — 12 days ago
▲ 5 r/OpenaiCodex+1 crossposts

Anyone have any tricks with 5.5 and Stripe Integration?

I've been awake all night with nothing to show for it. With 5.5, I feel like I'm back in the jipity3.5 days. Stripe works and then it breaks, then it works, then it breaks. I'm about to break a keyboard in half.

u/Critical-Teacher-115 — 13 days ago