What happens when you give three open source AI assistants the same workflow
We ran one common multi-step workflow across three open source AI assistants. The task: take a list of meeting transcripts, extract action items per attendee, draft a follow-up email for each attendee, and schedule any next meetings mentioned in the transcripts. Same input data, same target output, three different outcomes.
OpenClaw: Completed the workflow, but only after significant tuning. The first three attempts looped on the email drafting step, generating endless variations without committing to one. Adding anti-loop rules to the skill file eventually fixed that. Tool call reliability for the calendar invites was the weakest link: two of seven invites contained malformed datetime arguments and failed silently. The final output was usable after manual cleanup.
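The silent datetime failures are the kind of problem a small pre-flight check can surface before the calendar tool is called. Below is a minimal sketch, not OpenClaw's actual tool interface: the function name `validate_invite_args` and the `start` / `end` argument keys are assumptions made for illustration, and the check assumes datetimes are passed as ISO 8601 strings with an explicit UTC offset.

```python
from datetime import datetime


def validate_invite_args(args: dict) -> list:
    """Return a list of problems; an empty list means the invite looks well-formed.

    Hypothetical argument shape: {"start": "<ISO 8601>", "end": "<ISO 8601>"}.
    """
    problems = []
    parsed = {}
    for key in ("start", "end"):
        value = args.get(key)
        if not isinstance(value, str):
            problems.append(f"{key}: missing or not a string")
            continue
        try:
            parsed[key] = datetime.fromisoformat(value)
        except ValueError:
            problems.append(f"{key}: not an ISO 8601 datetime: {value!r}")
            continue
        # A naive datetime is ambiguous across attendees' time zones.
        if parsed[key].tzinfo is None:
            problems.append(f"{key}: no UTC offset")
    # Only compare when both values parsed cleanly and carry an offset.
    if len(parsed) == 2 and not problems and parsed["end"] <= parsed["start"]:
        problems.append("end is not after start")
    return problems
```

Returning the problems as a list, rather than raising on the first one, lets the caller log every issue with an invite or hand the full list back to the agent for a retry, so nothing fails silently.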
Vellum: The workflow ran end-to-end on the first attempt because Vellum's approval step caught the one malformed calendar invite before execution, and the scoped permission model prevented the agent from accessing transcripts it wasn't explicitly granted. In our testing on this specific workflow, completion took about 14 minutes, with one approval prompt and no output cleanup required. The result of each step matched what was originally asked.
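The two mechanisms credited here, an approval step before side effects and an allow-list of readable transcripts, can be illustrated in a few lines. This is a hedged sketch of the general pattern only, not Vellum's implementation or API: `ScopedSession`, `read_transcript`, and `run_side_effect` are hypothetical names, and the approver is just a callback.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ScopedSession:
    """Hypothetical session: reads are allow-listed, side effects need approval."""
    granted: frozenset                      # transcript ids the agent may read
    approve: Callable[[str, dict], bool]    # human or policy approver (assumed)
    log: list = field(default_factory=list)

    def read_transcript(self, transcript_id: str, store: dict) -> str:
        # Refuse anything that was not explicitly granted.
        if transcript_id not in self.granted:
            raise PermissionError(f"transcript {transcript_id!r} not granted")
        return store[transcript_id]

    def run_side_effect(self, name: str, args: dict,
                        action: Callable[[dict], object]):
        # The action only runs if the approver accepts the exact arguments.
        if not self.approve(name, args):
            self.log.append(("rejected", name))
            return None
        self.log.append(("approved", name))
        return action(args)
```

In this sketch the approver sees the concrete arguments before anything executes, which is the point at which a malformed invite would be caught rather than sent.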
Hermes: Completed the first run with one significant error. Action items were merged across attendees, which misattributed two items. The self-evaluation still rated the output favorably, so the skill it generated reinforced the misattribution pattern. The second run had the same error baked in deeper. Manual corrections didn't persist across cycles.
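A check that does not rely on the model's own evaluation is one way to catch this class of error before it gets reinforced. The sketch below is a rough heuristic under stated assumptions, not anything Hermes provides: it assumes each extracted item carries an `owner` and an `evidence_line` index into the transcript, and that the transcript is a list of `(speaker, text)` pairs. An item is flagged when its evidence line was neither spoken by the owner nor mentions the owner by name.

```python
def misattributed_items(items, transcript):
    """Flag action items whose evidence line does not support the stated owner.

    Hypothetical shapes:
      items:      [{"owner": str, "task": str, "evidence_line": int}, ...]
      transcript: [(speaker, text), ...]
    """
    flagged = []
    for item in items:
        idx = item["evidence_line"]
        # Evidence pointing outside the transcript cannot support anything.
        if not 0 <= idx < len(transcript):
            flagged.append(item)
            continue
        speaker, text = transcript[idx]
        owner = item["owner"]
        # Accept if the owner said it, or was addressed by name in the line.
        if speaker != owner and owner.lower() not in text.lower():
            flagged.append(item)
    return flagged
```

A name match is a weak signal, so flagged items are candidates for human review rather than confirmed errors.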
The takeaway is that, on this specific task, workflow output quality tracked inversely with each system's claimed autonomy. The option that claimed the most autonomy produced the most cleanup work. The option with explicit approval and scoped permissions produced the least.