One AI Parameter Change Cost Me $54/Month

I trusted Claude to help me migrate from AWS Lambda to Google Cloud Run. The migration went perfectly – services deployed, workflows executed, users were happy. Then I checked our API bills and nearly fell off my chair.

*This is part 3 of my DIALØGUE engineering series. If you missed them: [Part 1: DIALØGUE Introduction] and [Part 2: From Advertising to Engineering] cover the foundation.*

This is the story of what happens when you let AI write production code without being explicit about production constraints. Spoiler alert: “make it work” and “make it production-ready” are very different requests.


The Migration That Went Too Well

After wrestling with AWS Lambda for months (cold starts, layer limits, the works), I decided to migrate everything to Google Cloud Run. Being a pragmatic developer (read: lazy), I enlisted Claude as my pair programmer.

“Help me migrate these Lambda functions to Cloud Run,” I said. “Here’s the existing code.”

Claude delivered beautifully. Clean Dockerfiles, proper service configurations, working Cloud Workflows. The migration took just one day instead of the weeks I expected. I was thrilled!

Everything deployed smoothly. Services started up fast. Users were creating podcasts without issues. Success, right?

Then I noticed something odd in our logs:

```
[ERROR] Segment generation failed: Unexpected token in JSON
[RETRY] Attempting segment generation again...
[SUCCESS] Segment generated successfully
```

About 30% of our AI calls were failing on the first attempt but succeeding on retry. Not the end of the world – my retry logic was working! Users didn’t notice any issues. But man, those extra API calls were adding up.
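
For context, the retry wrapper is nothing fancy. A simplified sketch of the pattern (names here are placeholders, not the exact production code):

```python
import json
import time

def generate_segment_with_retry(call_model, prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, parse the JSON, and retry with a short backoff if parsing fails."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)  # any callable that returns the model's raw text
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = e
            time.sleep(2 ** attempt)  # crude exponential backoff between attempts
    raise RuntimeError(f"Segment generation failed after {max_attempts} attempts") from last_error
```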

The Investigation: Blame the Migration?

My first thought was that something went wrong during the migration. Maybe the new Cloud Run environment was different? Container networking issues? I spent hours comparing AWS and GCP configurations.

Everything looked identical. Same prompts, same retry logic, same error handling. So why were we suddenly getting malformed JSON responses?

Then I started comparing the actual code. Here’s what I found:

AWS Lambda version (working fine for months):

```python
response = anthropic.messages.create(
    model="claude-3-7-sonnet-20250219",
    temperature=0,  # Deterministic for JSON
    messages=[{"role": "user", "content": prompt}]
)
```

GCP Cloud Run version (Claude’s migration):

```python
response = anthropic.messages.create(
    model="claude-3-7-sonnet-20250219",
    temperature=0.7,  # <-- Wait, what?
    messages=[{"role": "user", "content": prompt}]
)
```

There it was. Temperature 0.7.

“But that’s a reasonable default!” you might say. And you’d be right. For creative writing, exploration, brainstorming – 0.7 is perfectly sensible. But for structured JSON generation? It’s a disaster.

The Smoking Gun: Creative JSON

After adding detailed logging to capture the actual AI responses, I found exactly what was happening. Here’s what Claude was returning at temperature 0.7:

```
Here's the podcast segment you requested:

{
  "title": "The Rise of AI Podcasting",
  "content": "Welcome back, listeners! Today we're diving into something really fascinating...",
  "duration": 120
}

I hope this segment captures what you were looking for!
```

See the problem? Claude was being creative with the response format. Sometimes just JSON, sometimes JSON with helpful commentary, sometimes JSON wrapped in markdown code blocks. At temperature 0.7, it was improvising like a jazz musician when I needed a metronome.
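
If you ever need to salvage responses like that, there's a band-aid: strip the chatter and pull out the outermost JSON object before parsing. A rough sketch of that approach (not what I ended up shipping, as you'll see below):

```python
import json
import re

def extract_json_object(raw: str) -> dict:
    """Band-aid parser: drop markdown fences and surrounding commentary,
    then parse the outermost {...} block."""
    # Unwrap a markdown json code fence if the model added one
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    # Fall back to the outermost braces to skip leading/trailing chatter
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model response")
    return json.loads(raw[start:end + 1])
```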

The AI Pair Programming Blind Spot

Here’s the thing about working with AI as a pair programmer: it’s incredibly good at making code work, but it doesn’t necessarily optimize for production concerns unless you specifically ask.

When I said “migrate this Lambda function to Cloud Run,” Claude focused on the migration requirements:

- ✅ Make it run on Cloud Run
- ✅ Handle the same inputs and outputs
- ✅ Maintain the same functionality
- ❌ Optimize for production efficiency

Claude chose temperature 0.7 because it’s a “reasonable default” for AI applications. And it is! For most use cases, 0.7 gives you a nice balance of creativity and consistency.

But here’s what I learned: **AI doesn’t know your specific production constraints unless you tell it.**

My original AWS code used temperature 0 because I’d learned (the hard way) that JSON generation needs deterministic outputs. But during migration, I never explicitly said “this is for structured data generation” or “optimize for reliability over creativity.”

So Claude wrote perfectly functional code with sensible defaults. The problem wasn’t the AI – it was my incomplete requirements.

The Fix: Getting Specific with AI

Once I identified the problem, I went back to Claude with better requirements:

“Help me fix this code. It’s for JSON generation in production. I need 100% reliability, zero creativity. Use temperature 0 and any other optimizations for structured data.”

Claude immediately suggested:

```python
response = anthropic.messages.create(
    model="claude-3-5-sonnet",
    temperature=0,  # Deterministic outputs
    response_format={"type": "json_object"},  # JSON mode
    system="You are a JSON generation assistant. Output only valid JSON.",
    messages=[{"role": "user", "content": prompt}]
)
```

Wait, there’s a JSON mode now? (This is what happens when you’re heads-down building instead of reading changelogs!)

Success rate jumped from 70% to 99.9%. The remaining 0.1%? Network timeouts. Can’t blame Claude for those.

The difference? This time I was explicit about production constraints. I didn’t just ask for a migration – I asked for production-optimized, reliability-focused code.

The Real Cost of “Reasonable” Defaults

Let me break down what this “reasonable” temperature setting actually cost us:

- **Daily podcast generations**: ~200
- **Failure rate**: 30% requiring retries
- **Extra API calls per day**: 60 failed calls needing retries
- **Monthly waste**: ~1,800 unnecessary API calls
- **Claude 3.7 Sonnet pricing**: $3 input + $15 output per million tokens
- **Average tokens per call**: ~2,000 input + 1,500 output
- **Cost per retry**: ~$0.03 per failed attempt
- **Monthly overage**: ~$54 in wasted API calls
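
If you want to sanity-check that arithmetic:

```python
# Back-of-the-envelope check using the numbers above
input_tokens, output_tokens = 2_000, 1_500
input_price, output_price = 3 / 1_000_000, 15 / 1_000_000  # dollars per token

cost_per_retry = input_tokens * input_price + output_tokens * output_price
# ≈ $0.0285 per failed attempt, i.e. the ~$0.03 quoted above

retries_per_month = 200 * 0.30 * 30  # ~60 failed first attempts/day, for a month
monthly_overage = retries_per_month * cost_per_retry
# ≈ $51 as computed; calling each retry $0.03 gives the ~$54 figure above
```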

But the real cost wasn’t just the $54/month. Each retry added 3-5 seconds to generation time. Users were waiting longer, my Cloud Run instances were spinning up more often, and I was burning through quota faster.

The kicker? This all happened because I trusted AI to make production decisions without giving it production context. Classic case of “it works” vs “it works efficiently.”

What I Learned About AI Pair Programming

1. Be Explicit About Production Requirements

“Make it work” gets you functional code. “Make it work efficiently in production with these constraints” gets you optimized code. AI is great at solving the problem you describe, not the problem you have in mind.

2. AI Uses “Reasonable” Defaults, Not “Optimal” Ones

Temperature 0.7 is reasonable for most AI applications. But production systems often need unreasonable optimization for specific use cases. AI won’t know this unless you tell it.

3. Code Review Applies to AI-Generated Code Too

Just because AI wrote it doesn’t mean it’s production-ready. I should have caught this during code review, but I was so focused on whether the migration worked that I didn’t audit the parameters.

4. Context Matters More Than You Think

My original AWS code had temperature 0 for a reason – learned through painful experience. During migration, that context got lost. Now I document the “why” behind every seemingly arbitrary parameter.
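
In practice that looks something like this (illustrative values, not my exact config):

```python
# Every non-obvious parameter carries the reason it exists
SEGMENT_GENERATION_CONFIG = {
    "model": "claude-sonnet-4-20250514",
    # temperature 0: the output is parsed as JSON downstream, so it must be
    # deterministic; variety comes from the prompts and research, not sampling
    "temperature": 0,
    # cap output so a runaway response can't blow the per-call token budget
    "max_tokens": 2000,
}
```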

My New AI Pair Programming Workflow

Now when I work with AI on production code, I’m much more explicit about constraints:

**Before:**

“Migrate this Lambda function to Cloud Run”

**Now:**

“Migrate this Lambda function to Cloud Run. This is for production JSON generation – prioritize reliability over creativity. Use temperature 0, JSON mode if available, and any other optimizations for structured data output.”

Here’s the production-optimized config Claude helped me build:

```python
import json
import logging

logger = logging.getLogger(__name__)

# The `anthropic` client is assumed to be configured elsewhere in the module
def get_ai_json_response(prompt: str) -> dict:
    """Production-optimized JSON generation with AI"""
    response = anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        temperature=0,  # Zero creativity for structured data
        response_format={"type": "json_object"},  # Force JSON mode
        system="You are a JSON generation assistant. Output only valid JSON.",
        messages=[{
            "role": "user",
            "content": f"{prompt}\n\nRespond with valid JSON only."
        }]
    )

    # Explicit error handling for production
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError as e:
        logger.error(f"JSON parse failed: {e}")
        logger.error(f"Raw response: {response.content[0].text}")
        raise
```

The difference? I gave AI the context it needed to optimize for my specific use case.
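
Using it is deliberately boring (the prompt here is just an illustration):

```python
segment = get_ai_json_response(
    "Write a 120-second podcast segment about AI podcasting. "
    'Return JSON with keys "title", "content", and "duration".'
)
print(segment["title"])
```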

The Results

30% lower API costs, 40% faster generation. All from one conversation where I was actually specific about what “production-ready” meant.

The funny part? Even our “creative” dialogue generation uses temperature 0. Turns out deterministic doesn’t mean boring – it means reliable. The conversations still sound natural because the prompts and content research provide the variety, not random temperature fluctuations.

Turns out AI pair programming works great when you remember it’s still programming – precision in the requirements gets you precision in the results.

Want to create your own AI podcasts with guaranteed valid JSON? Try DIALØGUE – 2 free credits to get started! 😛

Part 3 of the DIALØGUE Engineering Series. Still learning that “make it work” and “make it work efficiently” are completely different requests. Follow more AI pair programming adventures at chandlernguyen.com.

**Next in series**: Coming in about 7 days – “From 3 Minutes to 500ms: The Signup Bug That Made No Sense”
