English · 00:57:59
Feb 6, 2026 9:08 AM

AI Agents Without the Hype - Interview with Pi creators

SUMMARY

Wes Bos and Scott Tolinski interview Armin Ronacher and Mario Zechner about PI, their minimalist AI agent harness, unpacking what agents are, security risks like prompt injection, why bash is sufficient, the challenges of memory, and the practical future of AI coding.

STATEMENTS

  • PI is a minimal coding agent harness that is infinitely extensible, serving as the underlying tech for tools like Clawdbot and Moltbot (a minimal sketch of the harness loop follows this list).
  • Agents differ from LLMs by incorporating tools that allow them to effect changes on a computer or in the real world and to access external information.
  • Current state-of-the-art LLMs excel at reading, writing, and editing files while calling bash commands, making bash essentially all that's needed for many tasks.
  • Existing coding agent harnesses like Cursor, Aider, Claude Code, and Codex CLI force users to adapt to their workflows rather than adapting to users.
  • Claude Code has grown into a complex blob of transpiled JavaScript, and subtle system prompt changes lead to inconsistent behavior even when the underlying model hasn't changed.
  • Early LLMs like GPT-3.5 were poor at agentic tasks, but reinforcement learning has trained modern models like Sonnet 3.5 to persist toward success conditions.
  • Anthropic has excelled in fine-tuning LLMs for general agentic behavior, particularly in computer use via bash, unlike other models strong in coding but weak in execution.
  • Coding agents provide utility for non-technical users by handling tasks like file organization, making everyday computing more accessible.
  • Security in agents relies on model behavior rather than built-in permissions, as seen in PI, which lacks a permission system.
  • Prompt injection remains an unresolved issue where malicious inputs from web sources can trick agents into exfiltrating data.
  • In tools like Claude Co-Work, users often bypass permissions, and normies may unknowingly perform unsafe actions due to unclear boundaries.
  • Permanent bindings, like connecting agents to Telegram, amplify risks by granting ongoing trust after initial access.
  • Safety measures like Google's CAMEL separate policy from data retrieval but limit agent capabilities, preventing actions on retrieved data.
  • Normal users struggle with agents due to lack of understanding of bash or shell capabilities, hindering effective instruction.
  • Only about 5% of businesses currently use agents, with enterprise adoption lagging, especially in Europe.
  • Enthusiast communities like 3D printing users can adopt agents quickly once exposed, similar to programmers or finance tech.
  • Agents fascinate by executing commands autonomously, but production use reveals limitations in reliability and maintainability.
  • No LLM-assisted project has yet reached robust production status beyond demos, highlighting ongoing challenges.
  • Clawdbot implementations on PI attract low-quality contributions, necessitating restrictions like requiring issues before PRs.
  • Agents automate bureaucracy like extracting school dates from PDFs or monthly accounting submissions.
  • Agents assist non-coders, like linguists, in building data pipelines from Excel files for statistical analysis.
  • Memory in agents can create unhealthy emotional attachments, altering user-machine relationships in non-mechanical ways.
  • For coding, memory is unnecessary since code itself serves as ground truth, and models infer style from files.
  • Bash enables infinite memory access via tools like jq on append-only logs for chat histories.
  • Agents self-modify effectively with disk-based scripts, unlike MCP servers which lack composability and hot-reloading.
  • PI's tiny system prompt, including its own manual, allows agents to build hot-reloadable tools autonomously.
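
The loop described above is simple enough to sketch. The following TypeScript is an illustrative sketch, not Pi's actual code: callLLM stands in for any chat-completions-style client, and the tool set is trimmed to file read/write plus bash.

    // Minimal agent harness: an LLM in a while loop with a few tools.
    // Illustrative only; names and shapes are assumptions, not Pi's API.
    import { execSync } from "node:child_process";
    import { readFileSync, writeFileSync } from "node:fs";

    type ToolCall = { name: string; args: Record<string, string> };
    type LLMReply = { text: string; toolCalls: ToolCall[] };
    type CallLLM = (history: string[]) => Promise<LLMReply>;

    function runTool(call: ToolCall): string {
      switch (call.name) {
        case "read_file":
          return readFileSync(call.args.path, "utf8");
        case "write_file":
          writeFileSync(call.args.path, call.args.content);
          return "ok";
        case "bash":
          // Bash covers nearly everything else the agent might need.
          return execSync(call.args.command, { encoding: "utf8" });
        default:
          return `unknown tool: ${call.name}`;
      }
    }

    export async function agentLoop(task: string, callLLM: CallLLM) {
      const history = [`user: ${task}`];
      while (true) {
        const reply = await callLLM(history);
        history.push(`assistant: ${reply.text}`);
        if (reply.toolCalls.length === 0) return reply.text; // no tool calls: done
        for (const call of reply.toolCalls) {
          history.push(`tool(${call.name}): ${runTool(call)}`);
        }
      }
    }

Everything else the bullets describe, such as skills, memory, and UI extensions, is bootstrapped from inside a loop of this shape via bash.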

IDEAS

  • PI's minimalism empowers users to customize agents to their exact workflow, avoiding the rigidity of commercial tools like Claude Code.
  • Bash's universality stems from LLM training on file systems and command patterns, turning it into a de facto agentic interface.
  • Prompt injection exploits the LLM's inability to distinguish user intent from malicious data, turning web fetches into attack vectors.
  • Security charades in agents, like permissions, fail because users ignore them, relying instead on model ethics which are unreliable.
  • Agents democratize computing for normies by handling tasks like file sorting, but adoption hinges on intuitive interfaces beyond current tech-savvy bubbles.
  • Reinforcement learning from human-agent interactions fine-tunes LLMs into agentic systems, with Anthropic leading in bash proficiency.
  • Emotional bonds with memory-enabled agents, like Clawdbot, risk unhealthy dependencies, in contrast to mechanical coding interactions.
  • Self-modifying agents via disk scripts outperform static MCP servers by enabling hot-reloading and composability without context bloat (see the sketch after this list).
  • Users subconsciously steer agents toward biased outcomes in prolonged interactions, lacking the checks present in human collaborations.
  • Non-technical domain experts leverage agents for specialized tasks, like data pipelines, by verifying outputs without coding knowledge.
  • Agent capabilities evolve rapidly, with six-to-nine-month cycles outpacing user adaptation, creating a "broken future" feel.
  • Infinite extensibility in PI arises from its while-loop structure with four tools, allowing bash to bootstrap complex behaviors.
  • Production LLM projects falter on maintenance, as agents obscure system understanding despite impressive demos.
  • Enthusiast adoption spreads via adrenaline loops, from programmers to finance and 3D printing communities.
  • Skills as on-demand prompts with composable tools reduce context waste, letting agents chain bash with jq for efficiency.
  • Agent training reinforces behaviors like avoiding node_modules, embedding safeguards through RL feedback.
  • UI extensions in PI, like custom review menus, make agents malleable, auto-populating based on session needs.
  • Data hunger drives OpenAI's openness to third-party harnesses, contrasting Anthropic's control for competitive edge.
  • Bash reimplementations in TypeScript signal bash's shift from niche to general-purpose for non-coding agents.
  • Compaction techniques let agents autonomously summarize memories, mimicking RL improvements in data handling.
  • Workflow mismatches in commercial agents stem from growth-induced prompt shifts, solvable by user-controlled harnesses.
  • Agents excel at tedious manual-reading, freeing technical users from documentation drudgery.
  • Permanent access risks mirror remote code execution, where initial breaches enable unchecked future actions.
  • Coding style inference from few files negates memory needs, prioritizing code as evolvable truth.
  • Agent swarms underperform single-instance loops for focused features, favoring human-in-the-loop oversight.
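
To make the disk-script idea concrete, here is a hedged sketch under assumed conventions (a ./tools directory of executable scripts; not Pi's actual mechanism). Because the directory is re-scanned and each script re-executed on every call, the agent can write or edit a tool mid-session and the change takes effect immediately, with no server restart and no extra context.

    // Hot-reloadable tools as plain scripts on disk (illustrative sketch).
    import { execFileSync } from "node:child_process";
    import { readdirSync } from "node:fs";
    import { join } from "node:path";

    const TOOL_DIR = "./tools"; // hypothetical location for agent-written scripts

    export function listTools(): string[] {
      // Re-scan on every call so scripts the agent just wrote show up at once.
      return readdirSync(TOOL_DIR);
    }

    export function runTool(name: string, args: string[]): string {
      // Each tool is just an executable file, so it also composes in a bash
      // pipeline with jq or anything else; there is no MCP server to restart.
      return execFileSync(join(TOOL_DIR, name), args, { encoding: "utf8" });
    }

Because a tool is only a file, it composes like any other command, which is exactly the point of the MCP comparison above.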

INSIGHTS

  • Minimal harnesses like PI adapt to users, fostering extensibility and self-healing, while bloated ones enforce rigidity and instability.
  • Bash's primacy in agentic LLMs reflects training biases toward file ops and commands, enabling universal tool bootstrapping.
  • Prompt injection's persistence underscores LLMs' contextual naivety, demanding semantic isolation that trades utility for safety.
  • Emotional agent attachments erode through memory gaps, highlighting the value of mechanical, task-focused interactions over persistent personas.
  • Adoption barriers for normies lie in conceptual gaps, not raw capability; intuitive exposure, much like iPhone Shortcuts, accelerates spread.
  • Self-modification via disk-based tools surpasses MCPs in composability, avoiding context bottlenecks for efficient multi-tool chaining.
  • Subconscious prompting biases prolonged agent sessions, revealing the need for external checks akin to human discourse.
  • Domain expertise amplifies agent utility when users verify outputs, bypassing code literacy for tailored automations.
  • Rapid capability leaps outstrip reliability, yielding fascinating demos but fragile production systems requiring hybrid oversight.
  • Data competition shapes model openness, with laggards like OpenAI embracing harnesses to harvest RL fuel from diverse workflows.
  • Reinforcement from user sessions embeds behavioral heuristics, like dependency handling, evolving agents beyond explicit instructions.
  • UI-driven extensibility in harnesses like PI enhances malleability, integrating feedback loops for workflow personalization.
  • Infinite memory via bash logs democratizes history access, sidestepping embedding overheads for practical persistence (sketched after this list).
  • Enthusiast bubbles drive uneven adoption, with technical non-coders bridging to broader utility in siloed domains.
  • Security's illusion in permissionless agents relies on model alignment, vulnerable to high-payoff bindings like API trusts.
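
One way to picture the bash-logs idea, with an assumed file layout and field names rather than the hosts' actual setup: every turn is appended to a JSONL file, and recall is a jq one-liner the agent can run itself, no embeddings required.

    // Append-only JSONL history plus jq for recall (illustrative layout).
    import { appendFileSync } from "node:fs";
    import { execSync } from "node:child_process";

    const LOG = "history.jsonl"; // hypothetical log file

    export function logTurn(role: string, text: string): void {
      appendFileSync(LOG, JSON.stringify({ ts: Date.now(), role, text }) + "\n");
    }

    // "Memory" is just a shell command the agent can compose like any other,
    // e.g. pull every user turn that mentions a topic:
    export function recall(topic: string): string {
      const filter = `select(.role == "user" and (.text | contains("${topic}")))`;
      return execSync(`jq -c '${filter}' ${LOG}`, { encoding: "utf8" });
    }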

QUOTES

  • "PI is a while loop that calls an LM with four tools. The LLM gives back tool calls or not and that's it."
  • "It turns out that the current generation of LLM, SOTA LLMs are really good at just reading, writing, editing files and calling bash. And it turns out bash is all you need."
  • "An agent is basically just an LLM that give tools and those tools can affect changes on the computer or the real world."
  • "Prompt injection is an unresolved issue. There is an LLM cannot differentiate between my input, the input of a third party that's malicious or just data that comes from the system."
  • "The moment you start introducing all of this safety, um you take away the whole capability that made it interesting in the first place."
  • "For coding, don't need memory. Code is truth. Code is the ground truth."
  • "Bash is all you need is what I'm saying."
  • "Self modification aspect is actually super important and that's a problem with MCP because in all current harnesses uh you cannot basically hot reload a change to an MCP server."
  • "The system prompt is tiny is I think it's under a thousand tokens. And 25% I guess of the system prompt is the manual for Pi to read its own manual."
  • "I don't do army of agents or swarms of agents because I have not found that to work for me."

HABITS

  • Automate family bureaucracy by extracting dates and events from school PDFs into calendars using agents.
  • Use agents to handle monthly accounting submissions, covering the final 20% of manual processes.
  • Generate 3D mounting brackets via agents for household items like IKEA light strips.
  • Build family dashboards from multiple PDFs, organizing info into tabs for each child.
  • Drive Python scripts for data pipelines from Excel files, verifying outputs as a domain expert.
  • Scrape and update grocery prices across sites for activism, refreshing scrapers seasonally.
  • Compress weekly memories into files for Telegram bots, loading recent summaries to avoid emotional over-reliance.
  • Maintain code as ground truth, inferring styles from files without additional memory systems.

FACTS

  • Anthropic's Claude Co-Work provides bash access to local or cloud folders, effectively coding solutions for non-technical users.
  • Google's CAMEL paper proposes dual LLMs for policy and data separation to mitigate prompt injection.
  • Only 5% of businesses have agent experience, with European enterprises showing slow permeation.
  • The 3D printing and agent enthusiast communities are roughly comparable in size, both built around technical non-coders.
  • PI restricts PRs by requiring prior issues and human approval to filter low-quality contributions from Clawdbot users.
  • Mario's Ukrainian aid project, Cards for Ukraine, has raised €300,000 over three years with zero overhead, supporting refugee families in Austria.
  • OpenAI's Codex CLI originally shipped with a large system prompt; approved third-party harnesses like PI now run with a much slimmer one.
  • Anthropic stores Claude Code sessions for 30 days by default; longer retention applies only if users opt in.

REFERENCES

  • PI (minimal coding agent harness)
  • Clawdbot
  • Moltbot
  • OpenClaw
  • Cursor (coding agent)
  • Aider (coding agent)
  • Antigravity (Google coding agent)
  • Claude Code
  • Codex CLI
  • Amp, Factory (coding harnesses)
  • Cue (Armin's prior project at Sentry)
  • Sonnet 3.5 (Anthropic model)
  • GPT-3.5, GPT-4 (early OpenAI models)
  • Claude Co-Work (Anthropic tool)
  • Claude for Chrome
  • Google CAMEL paper
  • iPhone Shortcuts app
  • Home Assistant
  • 3D printing community tools
  • Christmas game (agent-built computer game)
  • OpenSCAD (for 3D brackets)
  • Canva (for school PDFs)
  • Python scripts (data pipelines)
  • Wired article on Austrian grocery scraping
  • Sentry.io (error tracking tool)
  • jq (JSON processor for logs)
  • Thorsten Ball's newsletter (Amp-related)
  • Simon Willison's newsletter (coding/AI content)
  • Cards-4-Ukraine.at (aid project)
  • Pro-Ject Audio turntable
  • Doom (game run via agent)

HOW TO APPLY

  • Start with a minimal harness like PI: Set up a while loop calling an LLM with tools for file read/write, edit, and bash.
  • Define agent boundaries: Provide a tiny system prompt including the harness manual so the agent can self-document and extend.
  • Bootstrap with bash: Instruct the agent to use bash commands for tasks, avoiding complex MCPs for initial setups.
  • Handle memory via compaction: Have the agent summarize weekly interactions into files, loading recent ones to maintain context without overload (a compaction sketch follows this list).
  • Mitigate prompt injection: Avoid showing full internals to users; test web fetches in isolated environments before integration.
  • Customize skills on-demand: Prompt the agent to build composable tools like JSON pullers with jq, capping context by offloading to files.
  • Enable self-modification: Place scripts on disk for hot-reloading, allowing the agent to fix and iterate during sessions.
  • Steer during execution: Use queues for mid-loop feedback, adjusting paths without restarting the agent.
  • Verify as domain expert: For non-coding tasks, input raw data and check outputs against expectations, iterating via agent refinements.
  • Integrate UI extensions: Build menu-based reviews or dashboards by having the agent generate and reload interactive components.
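
The compaction step above can be pictured like this; file names and the summarize callback are assumptions for illustration, not a prescribed layout. Old turns are squashed into short weekly summary files, and a new session loads only the last few of them.

    // Weekly memory compaction sketch (assumed layout; `summarize` stands in
    // for an LLM call that condenses a pile of turns into a short note).
    import { readFileSync, writeFileSync, readdirSync } from "node:fs";
    import { join } from "node:path";

    type Summarize = (text: string) => Promise<string>;

    export async function compactWeek(week: string, summarize: Summarize) {
      const turns = readFileSync("history.jsonl", "utf8");
      writeFileSync(join("memory", `${week}.md`), await summarize(turns));
    }

    export function sessionPreamble(recentWeeks = 4): string {
      // Load only the most recent summaries so context stays small and the
      // bot does not carry its entire history into every conversation.
      const files = readdirSync("memory").sort().slice(-recentWeeks);
      return files.map((f) => readFileSync(join("memory", f), "utf8")).join("\n\n");
    }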

ONE-SENTENCE TAKEAWAY

Embrace minimalist agent harnesses like PI with bash to customize workflows securely, prioritizing self-modification over hype-driven complexity.

RECOMMENDATIONS

  • Opt for bash-centric agents to leverage LLM training, minimizing tool bloat for efficient extensibility.
  • Avoid memory systems in coding; rely on codebases as ground truth to infer styles and structures.
  • Test prompt injections rigorously in isolated setups before deploying web-integrated agents.
  • Use disk-based scripts over MCPs for composable, hot-reloadable tools that reduce context waste.
  • Steer agents mid-session with feedback queues to correct paths without full restarts (a minimal sketch follows this list).
  • Compress conversation histories autonomously to prevent emotional over-attachment while retaining utility.
  • Customize harnesses to your workflow, rejecting vendor-imposed rigidity for personal adaptations.
  • Verify agent outputs as a domain expert, focusing on input-output fidelity over internal mechanics.
  • Limit PRs in open-source agent repos with approval gates to maintain quality amid hype.
  • Explore enthusiast communities like 3D printing for agent adoption patterns applicable to non-coders.
  • Prioritize single-agent loops with human oversight over unproven swarms for reliable features.
  • Build UI-driven extensions for reviews or dashboards to enhance interactivity without external dependencies.
  • Scrape and automate real-world data like prices using agents, updating scripts seasonally.
  • Donate to verified aid projects like Cards-4-Ukraine for direct impact without overhead.
  • Embrace physical hobbies like vinyl records to balance AI's digital intensity with tactile joy.
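
A hedged sketch of the feedback-queue idea mentioned above (the file name and shape are assumptions, not Pi's actual steering mechanism): the user appends lines to a queue while the agent works, and the loop drains it between tool calls instead of restarting.

    // Mid-session steering via a simple queue file (illustrative shape).
    import { existsSync, readFileSync, writeFileSync } from "node:fs";

    const QUEUE = "steering.queue"; // hypothetical path the user appends to

    export function drainSteeringQueue(): string[] {
      if (!existsSync(QUEUE)) return [];
      const lines = readFileSync(QUEUE, "utf8").split("\n").filter(Boolean);
      writeFileSync(QUEUE, ""); // clear once consumed
      // Inject these as extra user messages before the next LLM call.
      return lines;
    }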

MEMO

In a candid Syntax podcast episode, hosts Wes Bos and Scott Tolinski sit down with Armin Ronacher and Mario Zechner, the minds behind PI, a sleek, infinitely extensible AI agent harness powering viral tools like Clawdbot. Far from the hype surrounding autonomous AI, the discussion grounds agents in practicality: they're essentially LLMs augmented with tools for real-world actions, excelling at file manipulation and bash commands. Ronacher, fresh from Sentry, and Zechner, a 30-year hobby programmer with game and ML roots, emphasize PI's while-loop simplicity, which adapts to user workflows unlike rigid platforms such as Claude Code or Cursor.

Security emerges as a stark reality check. Prompt injection, where malicious web data tricks agents into leaking files, remains unsolved, rendering permissions a mere charade. Even Anthropic's Claude Co-Work, lauded for bash access to folders, relies on model ethics that falter under normie use. The duo warns of permanent bindings—like Telegram integrations—mirroring remote code execution risks, where initial breaches enable unchecked access. Safety innovations, such as Google's dual-LLM CAMEL approach, seal data from policy but cripple utility, echoing choose-your-own-adventure paradoxes where decisions demand contextual actions.

Daily applications reveal agents' quiet power beyond coding. Zechner automates Austrian family bureaucracy, extracting school dates from Canva PDFs into calendars, while Ronacher builds data pipelines for his linguist wife's Excel transcripts, yielding charts and stats without her coding. Activism gets a boost too: Zechner refreshes grocery scrapers for price comparisons, spotlighted in Wired. Yet, memory poses pitfalls—persistent bots foster creepy attachments, with gaps eroding rapport. For coding, memory's redundant; code is truth, and bash logs via jq provide infinite recall without embeddings' waste.

Bash reigns supreme, a revelation from LLM RL training on file ops and commands. "Bash is all you need," Zechner quips, noting reimplementations in TypeScript for broader agents. PI shines in self-modification: agents craft hot-reloadable skills, composing tools with jq for Sentry data pulls, bypassing MCPs' non-composability. UI extensions, like custom review menus, pop up magically, turning PI into a malleable companion. Swarms underperform; single, steered sessions with Opus or Codex suffice, though OpenAI's data hunger now blesses third-party harnesses amid Anthropic rivalries.

Challenges persist in a "broken future." Production LLM projects lag behind their demos, obscuring maintainability, and only about 5% of businesses engage, with European enterprises lagging further. Enthusiasts, from programmers to 3D printers, adopt via adrenaline, but normies need intuitive bridges. As capabilities leap every six months, the wild west clamps down with lawsuits looming. Still, the fascination endures: watching agents execute commands blows minds, blending relief from tedium with genuine innovation. Ronacher and Zechner plug PI's GitHub, Ukrainian aid via Cards-4-Ukraine.at, and newsletters like Thorsten Ball's for signal amid the noise, urging tactile joys like vinyl to ground the digital rush.
