Why a baseline is worthless if you only run it once

Two weeks ago we published a baseline of how AI engines describe Seenu Tech. It was useful for about a day. A baseline is a single photograph: it tells you what one moment looked like, not whether you are getting better or worse. AI answers are regenerated on every query and shift as models update, so a citation you earned in April can quietly disappear in May without any change on your end. The only way to manage that is to re-run the same checks on a fixed cadence and compare the wording, not just the presence, of your brand. So in week one we did something deliberately boring: we froze the baseline into a script of 24 prompts, assigned each to a category, and committed to running the identical set every Monday. The goal was not a bigger report. It was a smaller, repeatable one we could actually trust. This is the difference between an audit you buy once and a system you operate. If you only want the snapshot, our living lab page shows the format. If you want the loop, keep reading.

The four prompt categories we run every week

We sort every prompt into one of four intents, because each one fails differently. Brand prompts ask the engine directly: "What does Seenu Tech do?" These test whether the model has an accurate, current description. Category prompts ask without naming us: "Which agencies help NJ businesses get cited by AI engines?" These test whether we surface at all when the buyer does not already know our name. Buyer-intent prompts mimic a real decision: "I run a med-spa in Fort Lee and want to show up in ChatGPT answers, who should I call?" These are the highest-value prompts and the hardest to win. Competitor prompts ask the engine to compare options, which tells us how we are positioned against named alternatives. In week one, brand prompts were strong, category prompts were inconsistent across engines, and buyer-intent prompts were where we lost the most ground. That split is the entire point: a single "are we visible" number would have hidden it. Knowing which category is weak tells us which page to fix next, which is far more actionable than an overall score.
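To make the point about a single score concrete, here is a minimal sketch, assuming each check is recorded as a category, an engine, and a yes/no on whether the answer named the brand. The names Check, visibility_by_category, and overall_visibility are hypothetical and chosen for illustration; they are not part of any toolset described in this post.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four intents described above.
CATEGORIES = ("brand", "category", "buyer_intent", "competitor")


@dataclass(frozen=True)
class Check:
    """One prompt run against one engine (hypothetical record shape)."""
    category: str    # one of CATEGORIES
    engine: str      # e.g. "ChatGPT" or "Perplexity"
    mentioned: bool  # did the answer name the brand at all?


def visibility_by_category(checks):
    """Return {category: fraction of checks in which the brand was mentioned}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for check in checks:
        if check.category not in CATEGORIES:
            raise ValueError(f"unknown category: {check.category!r}")
        totals[check.category] += 1
        hits[check.category] += int(check.mentioned)
    # Only report categories that were actually run.
    return {cat: hits[cat] / totals[cat] for cat in CATEGORIES if totals[cat]}


def overall_visibility(checks):
    """The single 'are we visible' number, kept only for comparison."""
    checks = list(checks)
    if not checks:
        return 0.0
    return sum(c.mentioned for c in checks) / len(checks)
```

Run both functions on the same list of checks: the overall figure can look acceptable while one category sits at zero, and that category is the one that tells you which page to fix.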

What we actually observed in the four engines

Here is a concrete example from week one. We asked Perplexity, "Which firms help local businesses in New Jersey become discoverable in AI answers?" Perplexity returned a clean list of five firms and cited the source page for each. We were not in it. Then we asked ChatGPT the brand prompt directly, and it described us accurately but used an older line about "SEO services" rather than GEO, language we had already retired on our own site. Gemini answered the category prompt by summarizing a third-party directory rather than any agency site. Google AI Overviews surfaced our living lab page for a brand query but truncated the description mid-sentence. None of these are disasters. But each is a specific, fixable gap: Perplexity needs a citable category page it can lift from, ChatGPT is reading stale cached language, and the AI Overview is pulling a weak meta description. We wrote each observation down verbatim so that next Monday we can tell whether our fix moved it.
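The three kinds of gap above (absent, stale wording, cut off mid-sentence) can be flagged mechanically before a human reads the answer. This is a rough sketch under stated assumptions: flag_response is a hypothetical helper, the list of retired phrases is something you maintain yourself, and the truncation test is a crude heuristic that only checks for closing punctuation.

```python
import re


def flag_response(response, brand, retired_phrases):
    """Return plain-language flags for one verbatim engine answer."""
    flags = []
    text = response.casefold()

    # Gap 1: the answer never names the brand.
    if brand.casefold() not in text:
        flags.append("brand absent")

    # Gap 2: the answer repeats wording we have retired on our own site.
    for phrase in retired_phrases:
        if phrase.casefold() in text:
            flags.append(f"retired phrase: {phrase}")

    # Gap 3 (heuristic, an assumption): a non-empty answer that does not end
    # in ., ! or ? (optionally followed by a closing quote or bracket)
    # may have been cut off mid-sentence.
    stripped = response.strip()
    if stripped and not re.search(r'[.!?]["\')\]]*$', stripped):
        flags.append("possibly truncated")

    return flags
```

The flags do not replace reading the answer. They only make sure the same three questions get asked of every response, every week.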

The three pages we changed in response

Observation is only half the loop; week one ended with three concrete edits. First, the stale "SEO services" line ChatGPT echoed traced back to two older pages we had not fully updated, so we rewrote them to lead with GEO and a one-sentence definition an engine can quote cleanly. Second, because the category prompt kept surfacing directories instead of us, we strengthened our NY/NJ services page with a plainly stated list of who we help and where, the exact phrasing a model needs to associate us with the query. Third, we fixed the truncated description Google AI Overviews was pulling by tightening the page summary to a complete, self-contained sentence. We did not touch design, add keywords, or chase volume. Each change was a direct response to a specific thing an engine got wrong. That is the discipline of a growth log: you do not edit because it feels productive, you edit because you watched a model misread you and you can name the page responsible.

Making it repeatable without it becoming busywork

The risk with any weekly ritual is that it turns into theater: a report nobody reads, run because the calendar says so. We protect against that with three rules. First, the prompt set is frozen for a full quarter. Changing the prompts every week makes the results impossible to compare, and comparability is the one thing that gives the exercise its value. Second, every check ends with a decision, not a document: either a page gets edited this week or we explicitly note why nothing needs to change. A week with zero edits is a valid outcome, but it has to be a choice. Third, we keep the verbatim engine responses in one running file so the comparison is mechanical rather than from memory. The entire week-one pass took under two hours. That is the real argument for systematizing it: a process this light has no excuse to be skipped, and the compounding value comes from running it forty times, not once brilliantly. Consistency beats intensity here.
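The third rule, one running file and a mechanical comparison, can be sketched with nothing beyond the Python standard library. This is an illustration and not the tooling we use: the JSON Lines layout, the field names, and the helpers append_response, load_responses, and week_over_week_diff are all assumptions made for the example.

```python
import difflib
import json
from pathlib import Path


def append_response(log_path, week, engine, prompt, response):
    """Append one verbatim answer to the running file (one JSON object per line)."""
    record = {"week": week, "engine": engine, "prompt": prompt, "response": response}
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_responses(log_path):
    """Read every record back; a missing file is treated as an empty log."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def week_over_week_diff(records, engine, prompt, old_week, new_week):
    """Return unified-diff lines between two weeks' answers to the same prompt.

    Returns None if either week is missing, and an empty list if the
    wording did not change at all.
    """
    def find(week):
        for r in records:
            if (r["week"], r["engine"], r["prompt"]) == (week, engine, prompt):
                return r["response"]
        return None

    old, new = find(old_week), find(new_week)
    if old is None or new is None:
        return None
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=f"week {old_week}", tofile=f"week {new_week}", lineterm="",
    ))
```

Because the prompt text is part of the lookup key, this only works while the prompt set stays frozen, which is one more practical reason for the first rule.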

What this means for your own business

If you have ever run a one-time AI visibility check, you already have a baseline, and you already know it is going stale. The work that matters is the loop: the same prompts, the same engines, the same cadence, and a willingness to edit the specific page each gap points to. You do not need our toolset to start. Pick eight prompts across the four categories, run them in ChatGPT and Perplexity, paste the answers into a document, and look honestly at where you are absent or misdescribed. Do it again next Monday. If you would rather we run the loop for you with a structured weekly log like this one, start with our AI visibility audit and we will build your fixed prompt set together. The first photograph is easy. The second, third, and tenth are where visibility is actually won, and that is the only part most businesses skip.
