Long-horizon AI agents are becoming increasingly viable, primarily due to improvements in both underlying language models and the surrounding 'harnesses' or tooling.
The most successful current applications for these agents are in software development and other tasks that produce a 'first draft' for human review, such as research reports or incident analysis.
Building and debugging agents is fundamentally different from traditional software development; the source of truth shifts from the code alone to a combination of code and execution traces, making tracing an essential tool.
Providing agents with tools, especially file-system access, is effectively a requirement for building complex agents: it aids context management and enables more sophisticated tasks.
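The points above about tracing and file-system tools can be sketched together. Below is a minimal, hypothetical harness (all names such as `Harness`, `write_file`, and `dump_trace` are invented for illustration, not from any specific framework) that executes tool calls inside a sandbox directory and logs each step, so the execution trace, not just the code, can be inspected when debugging.

```python
import json
import pathlib

# Hypothetical minimal harness: the agent's tool calls run inside a
# sandbox directory, and every call is appended to a trace so the
# run can be replayed and debugged afterwards.
class Harness:
    def __init__(self, workdir: str):
        self.root = pathlib.Path(workdir)
        self.root.mkdir(parents=True, exist_ok=True)
        self.trace = []  # one entry per tool call

    def _log(self, tool: str, args: dict, result: str) -> None:
        self.trace.append({"tool": tool, "args": args, "result": result})

    def write_file(self, name: str, content: str) -> str:
        (self.root / name).write_text(content)
        self._log("write_file", {"name": name}, "ok")
        return "ok"

    def read_file(self, name: str) -> str:
        content = (self.root / name).read_text()
        self._log("read_file", {"name": name}, content)
        return content

    def dump_trace(self) -> str:
        # Serialized trace, ready to ship to an observability backend.
        return json.dumps(self.trace, indent=2)

h = Harness("/tmp/agent_scratch")
h.write_file("notes.md", "interim findings")
print(h.read_file("notes.md"))
print(len(h.trace), "steps traced")
```

In a real system the model would choose which tool to call; the point here is that the harness, not the model, owns the sandbox boundary and the trace.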
Concerns Raised
Current agents lack the high-level reliability needed for full autonomy in production.
Models are not yet proficient enough at using web browsers, limiting their capabilities in that domain.
Debugging agents is impossible without deep tracing, as code alone does not reveal the application's behavior.
Opportunities Identified
Building agents that generate 'first drafts' for human review in coding, research, and finance.
Developing AI SRE (Site Reliability Engineer) agents to automate incident investigation.
Creating sophisticated agent harnesses that provide essential tools like file system access and memory.
Using human-labeled traces to build 'LLM as a judge' evaluators for automated testing and calibration.
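The last opportunity can be made concrete with a small sketch: human-labeled traces act as a calibration set, and the judge is only trusted once its agreement with those labels is high enough. The `stub_judge` below is a stand-in for a real model call; `agreement_rate` is the piece being illustrated, and all names here are hypothetical.

```python
# Hypothetical sketch: calibrate an "LLM as a judge" against human labels.
# judge_fn stands in for a real model call returning "pass" or "fail".

def agreement_rate(judge_fn, labeled_traces):
    """Fraction of traces where the judge's verdict matches the human label."""
    matches = sum(
        1 for trace, human_label in labeled_traces
        if judge_fn(trace) == human_label
    )
    return matches / len(labeled_traces)

# Toy stand-in judge: flags any trace containing a Python error marker.
def stub_judge(trace: str) -> str:
    return "fail" if "Traceback" in trace else "pass"

# Human-labeled traces: (execution trace, human verdict).
labeled = [
    ("agent wrote report, all steps succeeded", "pass"),
    ("Traceback (most recent call last): ...", "fail"),
    ("agent looped without producing output", "fail"),
]
print(agreement_rate(stub_judge, labeled))  # 2 of 3 verdicts match here
```

Once agreement on the human-labeled set is acceptable, the same judge can score unlabeled runs in automated tests.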