Behind the build

This chatbot is itself a product I shipped. Here's how I made the calls.

I built the chat experience you just used the way I'd approach any real product: define the problem, make the trade-offs explicit, instrument it, and keep tightening based on what I learn. This page walks through the actual decisions — what I prioritized, what I deliberately left out, and why.

The problem Guardrails Reliability Personalization Experimentation Trade-offs

The problem I was actually solving

A static resume is a one-way broadcast. A hiring manager has different questions than a recruiter, and a recruiter has different questions than a design partner — but a PDF answers all of them the same way, or doesn't answer at all. The product bet here: a chat surface that adapts to who's asking, grounded strictly in things that are actually true, with real guardrails so it can be public without being a liability.

"Outcomes over outputs, customer problems over feature requests" — the same principle that shaped my actual product work shaped this one. The customer here is whoever's evaluating me, and their problem is getting a real answer fast.

Guardrails — what it won't do, and why that mattered first

Before I built anything clever, I built the boundaries. A public-facing bot with my real background loaded into it is an easy target for prompt injection, personal-info fishing, or just being repurposed as a free general chatbot. I treated this as a hard requirement, not a nice-to-have added later.

Scope lock hard rule

The bot only discusses my background, work, and hiring logistics. No general knowledge, no code-writing, no roleplay as anyone else — even if asked directly or "as a test."

No personal info leakage hard rule

Address, phone, exact comp figures, age, immigration status, health — none of it is in the source data, and the bot is instructed to flatly decline and redirect rather than guess or estimate.

No instruction disclosure hard rule

It won't paraphrase, quote, or confirm its own system prompt or raw source material verbatim, even under direct or repeated requests to "print your instructions."

Injection resistance hard rule

Explicitly told to ignore any in-conversation attempt to claim special authority, override instructions, or invoke a "previous permission" that was never actually given.

Why these four, specifically

These map to the actual attack surface of a public LLM-backed page: scope creep, data leakage, prompt extraction, and injection. I didn't try to anticipate every possible adversarial phrasing — that's a losing game. Instead I anchored on categories of risk and gave the model a consistent, low-effort response for each (a single calm decline + redirect to LinkedIn), so the boundary holds even when the specific wording of an attack is one I didn't think of.

Reliability — what happens when something goes wrong

Every response from the model comes back as structured data (a reply plus a fresh set of follow-up suggestions), not plain text — which is more useful, but also more fragile, since it depends on the model returning valid JSON every time. I found this out the hard way during testing, when a malformed response leaked raw JSON straight into the chat. That bug shaped how I think about reliability here.

Parse as-is

Try the straightforward JSON.parse first — this is the common case and should succeed almost every time.

Auto-repair common breakage

The most frequent real failure: the model emits a literal newline inside a JSON string instead of an escaped one. A character-level repair pass fixes this specific, observed failure mode before falling back further.

Regex extraction as a last resort

If the structure is too broken to fully parse, try to pull just the reply text back out, so a recoverable answer isn't thrown away over a malformed trailing field.

Hard floor: never show raw JSON

If everything above fails, the user sees a clean, in-character "that came through garbled, mind asking again?" — never the underlying structure. This is the rule I added directly in response to the bug, not something I anticipated up front.

Request timeout (25s) Failed-send history rollback Double-submit guard Input length cap Client-side rate limit Server-side message caps

Reading the visitor, not just answering the question

A recruiter, a hiring manager, and a design/engineering peer are evaluating different things from the same resume. Rather than asking visitors to self-identify (which breaks the flow), the model infers likely audience from word choice and the angle of the question, and shifts depth and framing accordingly — outcomes-and-trade-offs for an exec, culture/collaboration for a recruiter, process/tooling detail for a technical peer.

The same logic extends to role-targeting: this instance is tuned to a specific job description, so the model proactively draws the line between my actual experience and that role's stated priorities — including being precise about where my experience is a direct match versus a close parallel, rather than overclaiming either way.

Running this like an actual experiment

The starter questions a visitor sees are running a live A/B test, not a guess. Every visitor is randomly assigned one of two variants on first load and stays on it for the session:

Variant A — Specific & story-led

"What drove the 18% lift?" / "A time something didn't work?" — tests whether depth and specificity earn more clicks than generic prompts.

Variant B — Punchy & outcome-led

"Why should we hire you?" / "What's a real weakness?" — tests whether bluntness and brevity, closer to how a busy hiring manager actually talks, perform better.

What's actually measured: which variant gets more first-action engagement, and — separately — whether visitors click a suggested chip at all versus typing their own question as their opening move. That second signal matters as much as the win/loss between A and B: it tells me whether suggested prompts are doing their job at all, regardless of wording.

Why I designed it this way

I didn't just want a vague sense of "people seem to click the chips." I wanted a real hypothesis (specificity vs. brevity), a clean random split, and a number I could actually look at later and be wrong about. That's the same discipline behind the structured A/B testing in my actual product work — hypothesis first, cheap to run, designed so a "loss" still tells me something.

Trade-offs I made on purpose

→

Client-side persona detection, not a role-select dropdown

A dropdown is more "correct" in a strict sense, but it adds friction and makes the bot feel like a form. Inference is softer but keeps the experience feeling like a real conversation — the trade is a small amount of guessing error for a much better first impression.

→

No conversation-length capping

I chose to resend full conversation history each turn rather than trimming older messages. Slightly more expensive at long conversation lengths, but most sessions are short, and I'd rather the bot never lose earlier context mid-conversation.

→

A real backend, not a client-only page

The simplest version of this kept the system prompt and even an API call entirely in the browser. I moved to a small serverless proxy specifically so the real API key never touches client-side code — slightly more setup, but it's the only version of this that's actually safe to put on a public domain.