NHDNUG / The Woodlands, TX / February 19, 2026
Software 1.0 is software you specify.
Software 2.0 is software you verify.
— Aaron Stannard
Three Core Assertions
1. Some developers are already obsolete
- Those who refuse to learn LLM-assisted techniques - "coasters"
2. Those who adapt will be more productive than ever
- LLMs are force multipliers for skilled developers
3. LLMs have fundamental, irresolvable limitations
- Mathematical constraints, not engineering problems to be solved
TurboMqtt
- High-performance MQTT client for .NET
- Open source: github.com/petabridge/TurboMqtt
- Goal: implement MQTT 5.0 protocol features autonomously
- Uses Claude Code + RALPH loop infrastructure
We're going to kick this off right now and check back on it throughout the talk.
LIVE DEMO
Kicking Off the RALPH Loop
$ cat IMPLEMENTATION_PLAN.md
$ cat ralph.sh
$ ./ralph.sh
"We'll check back on this later."
What Influences Output Quality?
- Model size - more parameters = richer representations
- Mixture of Experts (MoE) - specialized sub-networks activated per task
- Quantization - precision trade-offs for speed
- Training data quality - garbage in, garbage out still applies
- Active parameters - not all weights fire for every token
LLMs Are Stateless
Advantages
- Easy to parallelize and clone
- Good single-shot responses
- No accumulated state bugs
- Reproducible given same context
Disadvantages
- No persistent memory between sessions
- Must rebuild context every time
- Knowledge frozen at training cutoff
- Can't learn from mistakes across sessions
Key Takeaway
Prompts activate different regions of the model's learned weights.
More specific prompts = better activation = better results.
This is why context matters more than anything else.
Bad Prompt vs. Good Prompt
Bad
"Write me a marketing page."
Vague task, no context, no process, no output criteria
Good
"First analyze target audience pain points from @docs/personas.md.
Then define positioning against competitors in @docs/competitive.md.
Then write the page. Show reasoning at each step.
Output: HTML with Tailwind, mobile-first."
Context, process, verification, output format
Beyond Single Prompts
A single good prompt is fine for one-off tasks.
But a prompting system is what you need for serious work:
- System prompts that establish identity + context
- Skills that encode reusable processes
- Implementation plans that break work into context-window-sized tasks
This leads us to agentic operating systems...
The Core Files
CLAUDE.md / AGENTS.md
The "constitution" - loaded into every agent session automatically
- Build, test, deploy instructions
- Skill routing table - which /command handles which task
- Documentation pointers for deeper context
- Project conventions and coding standards
- Tool configuration (CLI tools, MCP servers)
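If you have never written one, start from a skeleton. The contents below are hypothetical (not TurboMqtt's actual file); they only exist to make the bullets above concrete:

```bash
# Bootstrap a minimal CLAUDE.md - hypothetical skeleton for illustration
cat > CLAUDE.md <<'EOF'
# MyProject - agent constitution

## Build, test, deploy
- Build: dotnet build -c Release
- Test:  dotnet test
- Never push to main directly; open a PR.

## Skill routing
- Committing work        -> /commit
- Opening a pull request -> /pr
- Reviewing a PR         -> /review-pr

## Deeper context
- Purpose, architecture, SDLC phase: PROJECT_CONTEXT.md
- Tech stack, CLI tools, CI/CD:      TOOLING.md

## Conventions
- Nullable reference types on; warnings are treated as errors.
EOF
```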
Context Files
PROJECT_CONTEXT.md
- Project purpose and mission
- Adjacent repositories
- SDLC phase (active dev? maintenance?)
- Key architectural decisions
TOOLING.md
- Tech stack details
- CLI tools and their usage
- Deployment procedures
- CI/CD pipeline structure
These establish "What am I working on?" and "What tools do I have?"
IMPLEMENTATION_PLAN.md
Breaking work into context-window-sized tasks
## Phase 2: MQTT 5.0 Auth
### Task 2.1: Implement AUTH packet
- Parse AUTH reason codes per MQTT 5.0 spec §3.15
- Add property parsing for auth-method, auth-data
- Verify: AuthPacketSpecs.cs must pass
- Verify: dotnet build succeeds with zero warnings
### Task 2.2: Implement SASL challenge flow
- Wire AUTH into connection state machine
- Support multi-step SASL exchanges
- Verify: SaslAuthIntegrationSpecs.cs must pass
The spec litmus test: "Can the agent execute without clarifying questions?"
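Those Verify lines work because they map to commands the agent can run deterministically; a sketch (the test filter is assumed from the spec file name):

```bash
# How Task 2.1's verification steps become runnable checks
dotnet build /p:TreatWarningsAsErrors=true                  # "zero warnings" is now a hard failure
dotnet test --filter "FullyQualifiedName~AuthPacketSpecs"   # run only the specs the task names
```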
Skills: Reusable Prompt Templates
Invoked with /commands
Productivity Multipliers
/commit - GPG-signed commits with proper messages
/pr - Create PR with summary and test plan
/review-pr - Comprehensive code review
Domain Skills
/close-deal - CRM workflow for sales
/draft-customer-email - Email with style guide
/create-benchmark - BenchmarkDotNet setup
Skills encode tribal knowledge as executable prompts
LIVE DEMO
TurboMqtt's Agentic OS
$ cat CLAUDE.md
$ cat IMPLEMENTATION_PLAN.md
$ ls .claude/skills/
$ cat ralph.sh
The actual file structure powering our running demo
What is a Harness?
A tool that gives LLMs access to files, shell, web, and context.
- Claude Code - CLI-based, terminal-native
- OpenCode - open source, multi-model
- Cursor / Windsurf - IDE-integrated
- GitHub Copilot Workspace - cloud-based, PR-oriented
What to Look For in a Harness
1. Model Selection
- Match the model to the task. Not everything needs Opus.
2. Intelligent Context Gathering
- Load skills, files, and relevant context automatically when needed.
3. Local Environment Feedback
- e.g., OpenCode + C# LSP = real-time compiler feedback without running dotnet build
Model Routing
Not all tasks need the most expensive model.
| Task Type | Model | Why |
|---|---|---|
| Complex architecture | Opus | Needs deep reasoning |
| Standard coding tasks | Sonnet | Good balance of speed and quality |
| Formatting, renaming | Haiku | Fast, cheap, good enough |
| Browser automation | Haiku | Token-heavy but simple logic |
Example: playwright-gopher agent uses Haiku for browser tasks that would burn through Opus budget
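At the CLI level, routing can be as simple as choosing a model per invocation. A sketch assuming Claude Code's --model flag and its short model aliases (prompts and file names are illustrative):

```bash
# Mechanical work -> small model; design work -> big model (illustrative, not petabridge's actual setup)
claude -p --model haiku "Apply the renames listed in REFACTOR_NOTES.md across src/, then run dotnet build."
claude -p --model opus  "Propose a design for MQTT 5.0 flow control and write it to docs/flow-control-plan.md. Do not modify code."
```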
Attention Degrades at Scale
Maximum effective context << advertised context
"Context Is What You Need" (arXiv:2509.21361)
200K - advertised context window
~30-50K - effective attention range
Context Compaction
What happens when the window gets too large
📝 Full conversation (50K tokens) → 🗜️ Compacted summary (10K tokens)
Compaction is lossy. Critical details get summarized away.
Context Management is a Skill
Treat each agent session like a well-designed function:
One job. Executed well. Completed decisively.
- Don't have multi-hour conversations with agents
- Break large tasks into discrete, completable units
- Start fresh sessions for fresh tasks
This leads us directly to RALPH loops...
Two Steep Requirements
1. Really Good Planning
Developers become project managers whether they want to or not
2. Really Good Verification
Non-deterministic output demands deterministic checks
"Historically, software developers are beyond terrible at both of these.
It is time to git gud."
LIVE DEMO
TurboMqtt RALPH Progress Check
What has the autonomous loop accomplished so far?
$ git log --oneline -10
$ dotnet test
Hallucinations Are Mathematical
LLMs predict plausible outputs, not verified truths.
- This is inherent to transformer architecture
- It will NEVER be fully "fixed"
- Better models hallucinate less, but never zero
- Even a 0.1% hallucination rate = real bugs in production
LLMs Don't Reason - They Predict
They predict what text humans previously used to describe the world,
one token at a time.
- Not understanding - pattern matching on training data
- Can produce alien-looking decisions - optimizing for prediction, not logic
- But emergent behaviors ARE real - autonomous coding, image generation, translation
Three Irresolvable Limitations
1. Hallucinations
- Mathematical constraint of transformer architecture
2. Finite Effective Context
- Orders of magnitude smaller than advertised
3. Misalignment
- Model biases can diverge from your actual goals
These aren't getting "fixed." Design your workflow around them.
Planning Modes
Read-only research, no writes
Claude Code: /plan command or shift+tab to toggle plan mode
OpenCode: built-in plan mode with LSP integration
General principle: research first, write second. Always.
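A concrete sketch, assuming Claude Code's permission-mode flag (OpenCode's built-in plan mode is the equivalent):

```bash
# Read-only research pass before anything gets written
claude --permission-mode plan \
  "Map how TurboMqtt handles CONNECT/CONNACK today and where an AUTH exchange would hook in. Summarize; do not edit files."
```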
Three Things to Provide
1. Detailed description of goals and desired output
- What does "done" look like? Be explicit.
2. Constraints
- Visible from source (API contracts, types) + invisible ones ("40MB App Store limit", "must run on .NET 8")
3. How the LLM can verify it did its job
- Specific test files, build commands, acceptance criteria
The Spec Litmus Test
"Can the agent execute your specification
without clarifying questions?"
If not, you haven't been specific enough.
The Fundamental Shift
Software 2.0
80% planning + review
20% coding
You're becoming a specification author and project manager.
Same Prompt, Different Results
Non-determinism demands deterministic checks
↓
Deterministic verification gates
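In practice a gate is just a script that must pass before the agent's work counts; a sketch for a .NET project using standard dotnet CLI commands (the script name is hypothetical):

```bash
#!/usr/bin/env bash
# verify.sh - deterministic gate: same checks every run, no matter what the model produced
set -euo pipefail

dotnet build -c Release /p:TreatWarningsAsErrors=true   # must compile with zero warnings
dotnet test -c Release                                  # full test suite must pass
echo "VERIFICATION PASSED"
```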
Beyond Code Correctness
- UI Design Verification - screenshot comparison against approved mockups
- Documentation Fact-Checking - link validators, API reference accuracy, code sample testing
- Adversarial Review - LLM-reviews-LLM with sufficient context and fresh eyes
- Behavioral Verification - does the feature actually work the way users expect?
Minimize Slack
More verification = less room for error
= more trust in autonomous operation
Every unchecked dimension is a dimension where hallucinations can hide.
Close the gaps. Tighten the checks. Trust the process, not the output.
Why Observability Matters
Understanding HOW agents make decisions
- Audit trail - what files were read, what was modified, in what order?
- Decision tracing - why did the agent choose approach A over B?
- Failure analysis - when something goes wrong, can you trace back to the root cause?
- Trust building - over time, patterns emerge that build or erode confidence
LIVE DEMO
RALPH Output Log Review
Tracing agent reasoning through the log
$ tail -100 ralph-output.log
$ git log --oneline --stat -5
$ git diff HEAD~3..HEAD --stat
What decisions did the agent make? How can we trace its reasoning?
Start Tomorrow
Three concrete steps to begin your Software 2.0 journey
1. Write a CLAUDE.md
- Describe how to build, test, and deploy your project. Takes 30 minutes.
2. Close one verification gap
- Add a linter rule. Write a test for that untested function. Enable a formatter (see the sketch below).
3. Apply LLM assistance to that tech debt you've been avoiding
- Perfect first project. Low stakes, high learning, immediate value.
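For step 2, the smallest gate you can close today is often a formatter check (standard dotnet CLI; wire it into CI or your loop script afterwards):

```bash
# Formatting drift now fails loudly instead of slipping through review
dotnet format --verify-no-changes
```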
Thank You!
Questions?