I’ve been using Claude Code and Cursor heavily for the past few months. The productivity gain is real - I built a working prototype of a monitoring system in an afternoon that would have taken me a week. But I also shipped a bug to production because the AI-generated code silently swallowed an exception. Lessons were learned.
Here’s what I’ve figured out about the gap between “vibe coded prototype” and “thing that runs in production without waking me up at 3am.”
The trick is telling the AI what you want to exist, not how to build it. When I say “create a Flask app with psycopg2 that queries PostgreSQL,” I’ve already made a bunch of decisions. When I say “I need an API for storing user profiles with CRUD operations,” the AI picks reasonable defaults and I can focus on whether the behavior is right.
When something breaks, I describe the problem instead of debugging it myself. “The create endpoint works but get-by-ID returns a 500, logs show connection already closed.” The AI has context about what it just wrote. It’ll usually spot the issue faster than I would and explain why it happened.
The trap is letting the AI make decisions you should be making. “Handle errors appropriately” gets you generic try/catch blocks. “When payment fails, retry 3 times with backoff, then mark the order failed and email the user” gets you code that does what you actually need.
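As a sketch of what that specific instruction buys you, here's roughly the shape of code it should produce. Everything here is hypothetical: `TransientPaymentError`, the `charge` and `notify` callables, and the dict-shaped order are stand-ins, not a real payment API.

```python
import time

class TransientPaymentError(Exception):
    """A retryable failure (gateway timeout, network blip) - hypothetical."""

def process_payment(order, charge, notify, max_attempts=3, base_delay=1.0):
    """Retry the charge with exponential backoff; after the final failure,
    mark the order failed and notify the user."""
    for attempt in range(max_attempts):
        try:
            return charge(order)
        except TransientPaymentError:
            if attempt + 1 < max_attempts:
                time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    order["status"] = "failed"
    notify(order)
    return None
```

The point isn't this exact code - it's that the retry count, the backoff, and the failure behavior were your decisions, stated up front, not defaults the AI picked for you.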
I stop vibe coding when I notice I’m re-explaining the same context, or when fixes start introducing new bugs, or when I’m not sure what the code is doing anymore. At that point I have a working prototype and it’s time to actually engineer it.
After vibe coding a few projects to production, the gaps are predictable:
No tests. The AI optimizes for “does this look right” not “will this break.” I tested it manually, it worked, ship it. Then edge cases happen.
Optimistic error handling. Everything is wrapped in try/catch that logs and returns 500. No distinction between “you sent bad data” and “our database is down.” Users see “internal server error” for everything.
print() debugging left in. Or worse, no logging at all. Something breaks in production and I’m flying blind.
Hardcoded config. The AI puts in working values to make the example run. Database URLs, API keys, timeouts - all sitting in the code.
Security as an afterthought. “I’ll add auth later” is in the commit history of every prototype I’ve written.
I don’t aim for coverage percentages. I write tests for:
The happy path, end-to-end. Does the main flow actually work?
The business logic I wrote myself. The custom stuff, not the CRUD boilerplate.
The edge cases that have bitten me before. Empty strings. Nulls where I expected values. Concurrent requests to the same resource.
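A minimal sketch of what those three buckets look like as actual tests. The `normalize_email` helper is made up for illustration; the pattern is plain asserts covering the happy path plus the empty-string and None cases that bite.

```python
def normalize_email(value):
    """Hypothetical helper under test: trim and lowercase an email address."""
    if not value or not value.strip():
        raise ValueError("email required")
    return value.strip().lower()

def test_happy_path():
    # The main flow actually works.
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"

def test_edge_cases_that_bite():
    # Empty strings, whitespace, None - the inputs production will send.
    for bad in ("", "   ", None):
        try:
            normalize_email(bad)
            assert False, f"expected ValueError for {bad!r}"
        except ValueError:
            pass
```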
I ask the AI to generate tests, but I’m specific: “Test payment processing. Include: successful payment, insufficient funds, expired card, two concurrent payments for the same order.” Generic “write tests for this module” gets you generic tests.
The other thing: vibe-coded prototypes are often untestable because everything is coupled together. The class creates its own database connection in init. You can’t mock it. I’ve started asking the AI to “refactor this to use dependency injection so I can test it” before I write any tests.
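A before/after sketch of what that refactor produces. `OrderRepository` and the schema are invented for the example; the point is that the connection arrives from outside instead of being created in `__init__`.

```python
class OrderRepository:
    """Before: __init__ called psycopg2.connect(...) itself - impossible to fake.
    After: the connection is injected, so tests can hand in a stub."""

    def __init__(self, conn):
        self.conn = conn

    def get_status(self, order_id):
        cur = self.conn.cursor()
        cur.execute("SELECT status FROM orders WHERE id = %s", (order_id,))
        row = cur.fetchone()
        return row[0] if row else None
```

In production you pass the real psycopg2 connection; in tests, anything with a `cursor()` that supports `execute()` and `fetchone()` will do.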
The minimum viable improvement:
```python
class ValidationError(Exception):
    """User sent bad data - 400"""

class NotFoundError(Exception):
    """Thing doesn't exist - 404"""

# Catch these specifically and return appropriate status codes.
# Everything else is a 500, but log it properly.
```
The thing that actually matters is separating “errors I should show to users” from “errors I should hide from users.” Validation failures, not found, conflict - users can see those. Database connection errors, null pointer exceptions, third-party API failures - users get a generic message, but I get the full stack trace in logs.
I switched to structured logging after spending an hour grep-ing through text logs trying to trace a request. Now every log line is JSON with a request ID:
```python
# Before
print(f"Processing order {order_id}")
logger.info("Payment complete")

# After
logger.info("payment_processed", order_id=order.id, amount=amount, method=method)
```
The structured version is searchable. I can query “show me all logs for request abc-123” or “show me all payment failures over $1000 in the last hour.” The print statements are write-only.
Request IDs are the key. Generate a UUID at the start of each request, attach it to every log line, return it in error responses. When a user reports “I got an error,” they give you the request ID and you can see exactly what happened.
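Here's a stdlib-only sketch of that pattern - in a real app you'd likely reach for structlog or python-json-logger instead. The function names are mine.

```python
import contextvars
import json
import uuid

# Holds the current request's ID; contextvars keeps it correct
# across threads and async tasks.
request_id_var = contextvars.ContextVar("request_id", default=None)

def start_request():
    """Call at the start of each request; echo the ID back in error responses."""
    rid = str(uuid.uuid4())
    request_id_var.set(rid)
    return rid

def log_json(event, **fields):
    """One JSON object per line, request ID attached automatically."""
    record = {"event": event, "request_id": request_id_var.get(), **fields}
    return json.dumps(record)
```

Hook `start_request` into a before_request handler and write `log_json` output to stdout, and "show me all logs for request abc-123" becomes a single filter.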
Input validation. The AI is optimistic about inputs. It assumes request.json["email"] exists and is a string. Production users will send you nulls, arrays, and SQL injection attempts. I use Pydantic now for everything that comes from outside:
```python
from pydantic import BaseModel, EmailStr, Field

class CreateUserRequest(BaseModel):
    email: EmailStr
    name: str = Field(min_length=1, max_length=100)
```
If it doesn’t validate, it throws before my code runs. No more checking for None everywhere.
Auth on every endpoint. I’ve shipped endpoints that forgot the auth decorator. Now I use a before_request hook that requires auth by default and I explicitly mark public endpoints:
```python
PUBLIC_PATHS = ["/health", "/login"]

@app.before_request
def require_auth():
    if request.path in PUBLIC_PATHS:
        return
    # ... validate token
```
Secrets out of code. The AI puts in placeholder values to make examples work. I’ve committed API keys. Now I make secrets required from environment with no defaults - the app crashes on startup if they’re missing, which is better than running with test credentials.
Hardcoded values are the silent killer. The prototype works with 5 second timeouts and 10 database connections. Production needs different numbers. I externalize anything that might vary:
```python
DATABASE_URL = os.environ["DATABASE_URL"]  # Required, no default
REQUEST_TIMEOUT = int(os.environ.get("REQUEST_TIMEOUT", "30"))  # Optional with default
DEBUG = os.environ.get("DEBUG", "").lower() == "true"  # Off unless explicitly on
```
The required ones fail fast if missing. The optional ones have sensible defaults. Debug is never on by accident.
Health check endpoints. Kubernetes (or whatever) needs to know if your app is alive:
```python
@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/ready")
def ready():
    # Check database, redis, whatever.
    # Return 503 if something's down.
    ...
```
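One framework-agnostic way to structure the readiness side - the `checks` dict shape and the `ping_db`-style callables it expects are my own convention, not a standard:

```python
def readiness(checks):
    """Run each named dependency check; callables raise on failure.
    Returns (body, status): 200 if everything passed, 503 otherwise."""
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            failures.append({"dependency": name, "error": str(exc)})
    if failures:
        return {"status": "not ready", "failures": failures}, 503
    return {"status": "ok"}, 200
```

The /ready endpoint then just returns `readiness({"database": ping_db, "redis": ping_redis})`, and the 503 body tells you which dependency is down.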
Graceful shutdown. When the container gets SIGTERM, finish in-flight requests before dying. The AI never adds this:
```python
import signal

SHUTTING_DOWN = False

def shutdown(signum, frame):
    global SHUTTING_DOWN
    SHUTTING_DOWN = True  # stop accepting new requests
    # ... wait for in-flight requests to finish, then exit

signal.signal(signal.SIGTERM, shutdown)
```
Before I deploy something I vibe coded:
Can I trace a request through the logs?
Do errors tell me what went wrong?
Is there anything hardcoded that shouldn't be?
Did I test the main flow and the obvious edge cases?
Is auth required everywhere it should be?
Will it start up correctly in production, or crash because of missing config?
That’s it. It’s not comprehensive, but it catches the stuff that actually wakes me up at night.
Vibe coding is great for getting something working. It’s not great for keeping it working. The gap is predictable - tests, errors, logging, security, config. Once you know what to look for, it takes maybe an hour to harden a prototype. That hour saves you the 3am pages.
The engineers who get the most out of these tools are the ones who know what production code looks like. The AI gets you 80% of the way there fast. The last 20% is still on you.