Signals

Quick reads from the intersection of AI, analytics, and security.

AI Broke the SDLC: Zero Trust, Zero Traceability

Published • Nov 21, 2025

Last week I pasted an entire AWS CloudFormation template into ChatGPT and ask it to "make this more secure." The model obliged. Generated a cleaner version. Added encryption. Tightened IAM policies. All I could think was: What if the model hallucinated a vulnerability? What if it introduced a backdoor we can't see? How do we audit a decision made by a black box?

If you've spent any time in tech, you know the software development lifecycle (SDLC) has been through a lot, from waterfall to agile, we've spent decades building guardrails: code reviews, CI/CD pipelines, security scanning, change management. All of it built on one assumption: humans write the code, humans review it, and humans understand what it does. AI doesn't fit that model. And instead of redesigning the SDLC, we're bolting AI onto processes built for predictable systems, and it doesn't seem to be going well.


AI Outputs: Created by No One, Verified by No One, Trusted by Everyone.

Traditional software gives us:


Version control: Every change is tracked, attributed, reversible.

Code review: Humans validate logic before it ships.

Predictable behavior: The same input produces the same output.

Auditability: You can trace a bug back to the commit that caused it.


AI-generated code gives us:


No version control for models: You can lock a commit, but you can't lock how a model interprets a prompt after retraining.

No human validation at scale: Engineers are reviewing AI outputs faster than they'd review their own code.

Unpredictable behavior: The same prompt can produce different results depending on model drift, temperature settings, or training data changes.

No auditability: You can't trace an AI decision back to its source. The model doesn't "remember" why it chose one implementation over another.


The SDLC assumes everything behaves the same way twice. AI doesn't work like that.


There are a few tools out there starting to tackle this problem, but the ecosystem is still catching up. Most security scanners flood you with alerts about what's present in your code, but not where it came from or how it got there.


One company actually addressing provenance is Kusari. Instead of just telling you "here's a vulnerability," Kusari traces every component in your software back to its original source. It answers the questions that are actually helpful in the current SDLC: Where did this dependency come from? When did it show up? It flags for possible typosquatting attacks etc.


What makes this useful for AI-generated code is that Kusari automatically produces signed SBOMs (Software Bill of Materials), provenance attestations, and vulnerability reports for every build. If an LLM generates a module that adds in dependencies, you get a verifiable trail of what got added, how it entered the pipeline, and the provenance of those dependencies. It's the kind of transparency we need, especially as AI makes it easier to ship code faster than we can audit it. But tools alone won't solve this. The real challenge is adoption. Read more about Kusari.

Everyone: ‘We Need More AI.’ Me: ‘We Need More Provenance’

If you have ever ordered an item online, you likely have tracked the package, from the warehouse to your doorstep. You order it, get a confirmation email, then later an email with shipping information including a tracking ID. From there, you’ll get alerts about the path of your package: moved to sorting center, loaded on truck, out for delivery and finally arrived at your door. Data lineage and the SDLC work the same way, they show you the path your data or code takes, stage by stage from creation to final output. Provenance goes a step further, it provides the proof behind it. If data lineage and SDLC are the macro, then data provenance is the micro. Rounding back to the package analogy, provenance is the metadata of your package:


- the store the item came from

- who packed it

- the batch number

- the digital signature on the label

- the weight and barcode hash that would flag tampering


Provenance tells you not just where something traveled, it tells you where it originated, who touched it, and whether it was altered along the way. This matters more now because AI doesn't just use code, it creates it. And it does so at a pace humans can't match or audit effectively.


When a developer writes code, you get a commit history. You see their name, the timestamp, the reasoning in the pull request. You can trace a bug back to line 47 of a specific file merged three weeks ago. That's provenance built into the process.


When an AI writes code, you get… output. No commit message explaining why it chose that implementation. No human who remembers the context. Just a block of working (or seemingly working) code that someone reviewed for 90 seconds before merging.


Ask yourself: If a model writes a Lambda function that processes PII, can you audit what logic it used? If AI-generated IAM policies grant excessive permissions, how do you trace the chain of decisions that led there?

You can't. Not with current tooling.


Every AI Touchpoint Is a New Entry Point

With the introduction of AI into many people’s day to day, every place it touches is a new potential vulnerability.


Prompt Injection: The new SQL injection. An attacker can manipulate an AI agent's instructions mid-execution, turning "summarize this data" into "exfiltrate this data."

Training Data Poisoning: Supply chain attacks now target model training pipelines. If someone corrupts the data used to train a model, the model will pick up those bad patterns and behave maliciously later.

Shadow AI: Employees paste internal credentials, API keys, and proprietary logic into an AI chat without realizing they're leaking secrets.

AI Agents = Automated Attack Paths: If an AI agent can click, fetch, run code, or schedule tasks, a single wrong prompt can send it down a harmful path with no human in the loop.


Most SDLC frameworks treat AI as a feature. We should start treating AI is an infrastructure layer we don't control.


What's Missing: Provenance for AI Outputs


This is where the SDLC needs to evolve. If data lineage was critical for analytics, AI lineage becomes critical for trust.


Every AI output should come with:


A fingerprint: A hash of the content so you know if it's been altered.

A provenance trail: What prompt generated it? What model version? What context was provided?

Reproducibility: Can you recreate the same output given the same inputs?

Attribution: Who requested it? What system generated it? Was it human-reviewed before deployment?


We do this with sensitive data pipelines everyday. You need to know where data came from, how it was transformed, and who touched it. We should be doing the same with AI-generated code and infrastructure.


What the Future SDLC Should to Look Like


The SDLC should evolve to incorporate AI in every step of the process.


  1. 1. AI-Aware Requirements Gathering: Define what the model should never do as clearly as what it should do.

  2. 2. Secure Model Development Lifecycle (SMDLC): Version your data, your embeddings, your prompts, your fine-tuning recipes. Treat models like infrastructure.

  3. 3. Guardrail Testing Instead of Unit Testing: You can't write a unit test that accounts for creativity. Test boundaries: adversarial inputs, edge cases, and red-team scenarios.

  4. 4. Provenance and Auditability at Every Stage: Every AI output should be traceable. If a model generates a security policy, you should be able to answer: What prompt created this? What version of the model? Was it reviewed? By whom?



Same as It Ever Was: Cryptography and the Fight for Reality

Published • Oct 12, 2025

How do we verify AI-gen content?

The other day I was scrolling through a reel from a security content creator I follow, and halfway through it cut to what turned out to be a deepfake. I couldn't tell where the real footage ended and the fake began. Chilling. The repercussions? Chilling to even think of it. OpenAI's new Sora2 can generate video that looks real. But the truly unsettling part isn't the visual quality, it's the cognitive exhaustion that comes with questioning everything you see. Your brain wasn't built to constantly verify reality. It's more draining than mindlessly swiping yourself into an algorithm hole, at least there you're passive + draining. This is active paranoia. There's a line in the Talking Heads' "Once in a Lifetime" that feels eerily relevant now. You know, that disorienting feeling of looking around and realizing you don't recognize the world you're in anymore. That's where we are with AI-generated content.


And You May Ask Yourself, Well… How Did We Get Here?

So what is currently being done to verify AI-generated content? Why with AI detection tools, of course! Upload an image or video, and it will tell you if it’s real or not. The problem I foresee with this is that it likely isn’t sustainable. AI is a rapidly advancing technology and you’re essentially fighting fire with fire, as the detection gets better so does the generation. AI is all about learning, so the models will eventually learn to evade current AI detection techniques. Unless we figure out accessible solutions outside of AI, it will be a long, reactive game of whack-a-mole (such a futile game).


I thought about this a bit and instead of trying to identify what's fake, what if we could verify what's real? How do we identify if what we have in front us hasn’t been altered and how can we trust this source? If you view this from the lens of cybersecurity, it seems pretty simple: hashing and cryptography. We use and trust cryptography for highly sensitive actions today, like bank transactions, verifying software updates, safely buying something online, credit card transactions etc. What if we employ these principles to help you verify whether a video actually came from who it claims to be from? The technology exists. The challenge is making it simple enough that normal people can use it without thinking too much about it.


Water Dissolving, and Water Removing

It took me a good 2 months to where I felt like I could fully understand hashing, symmetric and asymmetric cryptography, ashamed to admit a little longer to actually implement it. However, from teaching a course currently in Security, my favorite analogy is the wax seal analogy.


Imagine someone preparing to send an important letter centuries ago. Before they seal it, they need to verify exactly what's in that letter, this is where hashing comes in. A hash is like taking a fingerprint of the entire document. Every word, every comma creates a unique mathematical fingerprint. Change even one character, and you get a completely different fingerprint. Now they're ready to seal it. They drip hot wax on the letter and press their signet ring into it, creating a unique stamp with their family crest. When someone receives the letter, they look at that wax seal and immediately know who sent it because they recognize the pattern.


The most important part: you can't fake that seal without the actual ring. The pattern is too specific. Digital signatures with cryptography work identically. Your private key = the physical ring (only you have it). The signature it creates = the wax impression (unique and unforgeable). Your public key = everyone's ability to look at the seal and verify that it's authentic.


The creator's public key (which anyone can access) can verify that signature. If the math checks out, you know two things: 1) The content hasn't been altered, and 2) It came from who it claims to be from.


Letting the Days Go By, Into the Blue Again

Rounding this back to AI-generated content, AI can create a deepfake of a rat eating a pizza on the L train, but it can’t create a video that appears to be signed by The New York Times' private key, because it doesn't have that key. No one does except The New York Times.


The idea is this: when a piece of media is created, whether it's a photo from your iPhone or a video published by Reuters, the device or software creates a unique "fingerprint" of that file. This is called a hash, and it's mathematically unique to that specific content.


If every camera, news organization, even every verified creator signed their content at the moment of creation, we'd have a way to separate authentic media from everything else.


The problem is adoption. Like I mentioned previously, I learned cryptography at painfully slow pace, so you can’t expect every news organization, or even every verified creator to create a hash of their content and cryptographically sign each piece of content.


But, this system already works in parts of the tech world. When you visit a website with HTTPS, cryptography is verifying that you're actually talking to that site and not an imposter. You probably don't even think about it, you just look for the little lock icon.


That's the goal. Make verification so seamless that people don't need to understand the underlying cryptography. It would be helpful if instead of overwhelming your cognitive faculties every time you see a piece of content, you could just check a verification badge:


  • Verified - Authentic, signed by CNN, unaltered
  • Modified - Originally from CNN but edited after signing
  • Unverified - No proof of origin

You don't need to understand underlying public key cryptography (but it’s a good thing to learn and you should know how underlying pieces of technology you interact with work). You just need to see verified, modified or unverified. People need to understand what these badges mean without needing a computer science degree. "Verified" can't just mean "complicated math says it's probably okay." It needs to mean "This is what it claims to be, and here's how we know."


So far there aren’t many companies doing this, outside of the big tech companies, at least from what they’ve reported on. As of writing this, there is one company, Proof, that just introduced a product, Certify, meant to fight AI-generated fraud via digital signatures.


Read more about Proof’s launch here: Proof Launches Certify – The Cryptographic Answer to AI-Generated Fraud .


The Coalition for Content Provenance and Authenticity (C2PA) is trying to standardize this, but adoption is slow. We need camera manufacturers, social media platforms, news organizations, and AI companies to all follow etc.


You May Ask Yourself, Am I Right... Am I Wrong?

Even if we build all of this, it only works if people care.


Right now, most people don't verify sources before sharing something. They see a shocking video, their emotions kick in, and they hit repost. By the time fact-checkers revoke it, it's already been viewed by millions of users.


What if platforms buried unverified content in the algorithm? What if your browser slapped a warning on any media without a signature? What if news organizations refused to amplify anything that couldn't be verified?


It's not about forcing people to check every piece of media they see. It's about making verification the path of least resistance.


So how do we get there? How do we make cryptographic signatures as invisible and expected as HTTPS? As standard as a verified checkmark? The technology exists. We can build the infrastructure. What's missing is adoption.


Or maybe it's simpler than that. Maybe we just need one major event where verification would have prevented massive harm. One deepfake scandal that's so damaging, so obvious, that it forces the conversation. Or maybe just opening the conversation for now is enough.


What do you think? What would actually move the needle?



The Traffic You're Not Seeing: Why Fraud Detection Should Be Part of Your Analytics Stack

Published • Oct 10, 2025

Most people, well not most, but a lot of people in the field spend a lot of time obsessing over conversion rates, bounce rates, and user journeys. But here's a question nobody wants to ask: How much of that traffic is even real?

I don't mean bots in the traditional sense, those get filtered out pretty well these days. I'm talking about the more sophisticated stuff. The click farms that behave almost like humans. The traffic that looks legit in GA4 but feels wrong when you dig into it. The signups that never engage. The ad clicks that cost money but goes nowhere. If you're running any kind of paid acquisition or collecting user data, there's a non-zero chance you're making decisions based on polluted numbers.

The Illusion of Clean Data

Here's what I've noticed: Most teams treat their analytics like it's a source of truth. GA4 says you got 10,000 visitors last month, so you got 10,000 visitors. It says your conversion rate is 2.3%, so it's 2.3%. Makes sense. But what if 15% of that traffic was fraudulent? What if those "users" who bounced immediately were actually bots testing your forms? (happened to me an embarrassingly high rate). What if your retargeting audiences are full of fake profiles?

I don’t think the scary part is even that fraud exists, it's that it's invisible until you go looking for it. And most people never look.

Why Traditional Bot Filtering Isn’t Enough

Google's bot filtering catches the obvious stuff. Scrapers with "bot" in the user agent. Traffic from known bad IP ranges. The usual. But modern fraud is more nuanced. It uses residential IPs. It mimics human behavior patterns. It scrolls at reasonable speeds and clicks on things. It fills out forms with plausible data. You can't catch this stuff by looking at pageviews and session duration. You need to look at how people interact with your site, not just what they do.

The Signals Hidden in Plain Sight

Real humans are messy. They hesitate. They make mistakes. They move their mouse in weird, inefficient ways while they're thinking. When someone scrolls through your site at a perfectly linear pace, 25% depth at exactly 5 seconds, 50% at exactly 10 seconds - that's not normal. When clicks happen at suspiciously regular intervals, that's not normal. When a mouse cursor moves in a straight line from point A to point B with no deviation, that's definitely not normal. These micro-behaviors are everywhere in your data. You just need to start capturing them. The beautiful thing? You don't need fancy ML models or fraud detection APIs (though those help). You can start by just measuring variance. How much do users deviate from predictable patterns? Real humans have high variance. Bots don't.

The Geography Problem

Then there's the IP layer. Not everyone using a VPN is fraudulent, obviously. But if you're a local business in Ohio and suddenly getting tons of "conversions" from data centers in Vietnam, that's worth investigating. Context matters here. A financial services company might expect VPN traffic from privacy-conscious users. An e-commerce site probably shouldn't see that many hosting provider IPs. The point isn't to block everything suspicious. It's to know what's happening so you can make informed decisions.

What This Actually Looks Like in Practice

Let's say you're an e-commerce site and you notice cart abandonment is unusually high from a specific traffic source. You optimize the checkout flow, but nothing changes. Turns out it's bots testing stolen credit cards, not real users with cold feet.

These patterns are findable if you're tracking the right things. You just need:

  • Behavioral signals that measure how users interact (scroll patterns, click cadence, mouse movement)
  • IP intelligence to understand where traffic is actually coming from (VPNs, data centers, known fraud networks)
  • Scoring system that weights these signals and flags anomalies

Start with scroll depth and IP geolocation. See what you find. Then build from there.

Where the Data Lives

The technical implementation matters less than the philosophy, but basically: You capture behavioral signals in Google Tag Manager, send them to GA4 as events, and enrich them with IP data via a server-side function.

Then everything flows into BigQuery or wherever your target data warehouse/analytics platform is. Then you can build queries that calculate risk scores based on whatever logic makes sense for you. From there, it's dashboards in Looker Studio and automated alerts when something weird happens.

The infrastructure isn't complicated. The hard part is deciding what "suspicious" means for you.

Deciphering Your Metrics

Once you start seeing how much fraudulent traffic exists, it changes how you look at your dashboards. The healthiest relationship with analytics is one where you trust the trends but verify the details.

Fraud detection isn't about catching bad actors (though that's a nice side effect). It's about having confidence in your data. It's about knowing that when you make a decision based on what GA4 tells you.

Starting Small

Pick one thing:

  • Track scroll-speed variance
  • Pull IP data for form submissions
  • Measure click cadence on key CTAs

See what you find. See if it changes how you think about your traffic.

How to Track AI Traffic in GA4 (The Easy Way)

Published • Sep 15, 2025

How to detect Google’s AI Overview visits in your analytics data.

AI is basically everywhere in search now. As of June 2025, 55% of people are using AI instead of search engines. Google’s AI Overviews, ChatGPT, and Perplexity are all answering questions and sometimes linking back to your site. Cool, right? Except you might have no idea this traffic is even happening unless you set up tracking for it.

Why You Should Care

AI traffic is weird. It doesn't play by the normal organic search rules:

  • People show up without the usual referrer strings
  • UTM parameters often get stripped out
  • Everything just lumps into “(direct)” traffic in GA4
  • You're basically flying blind unless you tag it properly

If you want to know whether AI is actually sending you traffic (and if that traffic converts), you need to track it on purpose.

What to Look For

From what we've seen, clicks from AI Overviews and other AI tools usually have telltale signs:

  • Referrers from AI domains like search.google.com, bard.google.com, chat.openai.com, perplexity.ai
  • Weird URL patterns like #:~:text=AI%20Overview
  • Parameters like ved= or sca_esv
  • Snippet text fragments matching what the AI highlighted

We'll use these clues to detect AI traffic.

Setting Things Up in GTM

1. Create a Referrer Variable

Go to GTM → Variables → New → JavaScript Variable and name it JS - document.referrer.

javascript
function() {
  return document.referrer;
}

      

2. Build an AI Detection Variable

javascript
function() {
  var ref = document.referrer || '';
  var url = window.location.href || '';

  var aiDomains = [
    'search.google.com',
    'bard.google.com',
    'chat.openai.com',
    'perplexity.ai',
    'you.com'
  ];

  var aiPatterns = [
    '#:~:text=AI%20Overview',
    'sca_esv',
    'ved='
  ];

  var domainMatch = aiDomains.some(d => ref.includes(d));
  var urlMatch = aiPatterns.some(p => url.includes(p));

  return domainMatch || urlMatch ? 'true' : 'false';
}

      

Then fire a GA4 event when AI traffic is detected:

  • Trigger: Page View → when AI detection variable equals “true”
  • Tag: GA4 Event → ai_traffic_detected
  • Parameters: ai_source = {{JS - document.referrer}}, ai_detected = true

Analyzing the Results

  • Segment where ai_detected = true vs false
  • Compare conversion rates and engagement
  • Export to BigQuery to trend AI traffic sources over time

Bottom line: AI is changing how people find your site. Without tracking, you’re blind — with GTM and GA4 configured, you finally see the signal through the noise.