Can't you just use an off-the-shelf pose detection model to judge squat depth?

Not accurately. Generic pose models place a keypoint somewhere reasonable near the hip, which is fine for animation or rep counting. But squat depth is decided by where the hip crease sits relative to the top of the knee, and the gap between a generic hip keypoint and the actual crease is often larger than the entire pass/fail margin of the lift. The real work is refining those coarse keypoints to a precision where a few pixels decide the verdict.

Does more training data always make an AI model more accurate?

No. On a precision task, combining every available data source can lower accuracy. A smaller, carefully curated set of clean examples can beat a larger pool that includes noisy or off-distribution images. More data helps when your problem is variety; when your problem is precision, extra data is often just extra noise.

How do you know when a model is accurate enough to ship?

When its error on held-out images shrinks to roughly the level where the model disagrees with human labels about as often as two careful humans disagree with each other. At that point you're measuring the ambiguity in the ground truth itself, not the model, and further gains require a harder evaluation set and better labels rather than more training tricks.

How we trained an AI to judge squat depth

June 2026 · 9 min read

In a sanctioned powerlifting meet, three referees decide whether your squat counts. The rule is simple to state and brutal to satisfy: the crease of your hip has to drop below the top of your knee. Miss it by a centimeter and the lift is three red lights.

We set out to build software that makes that same call from a single photo. This is the story of what broke along the way, and the lessons that survived contact with reality. We're not going to hand you our model weights or our recipe, but we'll be honest about every wall we walked into, because the walls turned out to be more instructive than the wins.

Why "just use a pose model" doesn't work

If you've touched computer vision recently, your first instinct is right: off-the-shelf pose estimators already find hips and knees. Feed in a photo, get back a skeleton with a dot on each joint. Problem solved?

Not even close. A generic pose model is trained to put a keypoint somewhere reasonable on the hip, good enough to animate a video-game character or count reps. But "the hip" to a pose model is a fuzzy region near the pelvis. "The hip crease" to a referee is a specific anatomical fold, and depth is decided by where that fold sits relative to the top of the kneecap. The difference between a generic hip keypoint and the actual crease is often larger than the entire pass/fail margin of the lift.

So the real problem was never "find the joints." It was: take a coarse skeleton and refine two specific landmarks to a precision where a few pixels decide the verdict. That framing shaped everything that followed.
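
To make "a few pixels decide the verdict" concrete, here is a minimal sketch of the final comparison. It is not our production code. It assumes a level, side-on photo in ordinary image coordinates (y grows downward), and the names Landmark, depth_margin_px, and is_below_parallel, along with every pixel value, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Landmark:
    x: float  # pixels from the left edge
    y: float  # pixels from the top edge (larger y is lower in the image)


def depth_margin_px(hip_crease: Landmark, knee_top: Landmark) -> float:
    """Signed margin in pixels; positive means the crease is below the knee."""
    return hip_crease.y - knee_top.y


def is_below_parallel(hip_crease: Landmark, knee_top: Landmark) -> bool:
    """Assumes a level camera, so 'lower in the image' means 'lower in space'."""
    return depth_margin_px(hip_crease, knee_top) > 0.0


# Hypothetical numbers: the same lift, judged from two different hip estimates.
knee = Landmark(x=410.0, y=512.0)
refined_hip = Landmark(x=250.0, y=515.0)  # refined hip-crease estimate
coarse_hip = Landmark(x=255.0, y=503.0)   # generic "hip" keypoint, 12 px higher

print(is_below_parallel(refined_hip, knee))  # True  (margin is +3 px)
print(is_below_parallel(coarse_hip, knee))   # False (margin is -9 px)
```

The comparison itself is one subtraction. In this made-up example, a 12-pixel disagreement about where the hip is flips a good lift into a failed one, which is why the refinement step carries all the weight.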

The hardest bug is the one inflating your numbers

Early on, our model looked great. Suspiciously great. Each new round of training reported better accuracy than the last, and we were patting ourselves on the back.

Then we found two things. First, a handful of images in our evaluation set had also been in our training set. The model wasn't generalizing on those; it had memorized them. A leak that small doesn't sound like much, but on a precision task with a small evaluation set, it quietly lifts every score and flatters every experiment.

Second, and worse, a bug in our data-loading code was silently dropping a large fraction of our training examples. We had spent weeks running dozens of experiments and ranking them against each other, and a chunk of that ranking was built on a foundation that was both leaking and starved of data at the same time.
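
Both failures are cheap to guard against once you know to look. The sketch below is one possible pair of checks, not the ones we actually run: find_leaks flags evaluation images whose bytes also appear in the training set, and check_loader_count fails loudly if the loader yields fewer examples than expected. The function names and the in-memory dictionaries are assumptions for illustration. Exact-byte hashing only catches identical files, so a resized or re-encoded copy would need a perceptual hash instead.

```python
import hashlib


def content_hash(image_bytes: bytes) -> str:
    """SHA-256 of the raw file bytes; identical files give identical hashes."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaks(train: dict[str, bytes], evaluation: dict[str, bytes]) -> list[str]:
    """Return names of evaluation images whose bytes also appear in training."""
    train_hashes = {content_hash(data) for data in train.values()}
    return sorted(
        name for name, data in evaluation.items()
        if content_hash(data) in train_hashes
    )


def check_loader_count(expected: int, loaded: int, tolerance: float = 0.0) -> None:
    """Raise if the loader dropped more than `tolerance` of the expected examples."""
    missing = expected - loaded
    if missing > expected * tolerance:
        raise ValueError(f"loader dropped {missing} of {expected} examples")


# Hypothetical data: one evaluation image is a byte-for-byte copy of a training image.
train_set = {"t1.jpg": b"squat-a", "t2.jpg": b"squat-b"}
eval_set = {"e1.jpg": b"squat-b", "e2.jpg": b"squat-c"}
print(find_leaks(train_set, eval_set))  # ['e1.jpg']
```

Running both checks at the start of every training job costs seconds, and either one would have caught its bug before the first experiment finished.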

Here's the uncomfortable part: once we fixed both bugs and re-ran everything honestly, our headline accuracy got worse. Not because the model degraded, but because we'd been grading on a curve. We made a rule for ourselves after that, and we still follow it: every result is tagged honest or not-trustworthy, and we never quote a number we can't reproduce on data the model has never seen. It's slower. It's also the only reason we trust anything we ship.

If there's one thing to take from this whole piece: a model that's wrong and a model that's lying to you look identical on a dashboard. The work is telling them apart.

More data made it worse

Conventional wisdom says throw more data at the problem. We had multiple sources of training images and the obvious move was to combine all of them.

When we did, accuracy dropped. It turned out a carefully curated, smaller set of clean examples beat the everything-in-the-pot version. The extra data wasn't free: some of it was noisier, some of it was subtly off-distribution from how people actually photograph their squats, and mixing it in dragged the model toward the average instead of the truth. The win came from being willing to exclude data, which feels wrong every time you do it.
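
One way to find out which sources earn their place is to score every combination of them on the same held-out set. The sketch below is a generic source ablation, not our pipeline: rank_source_subsets is a hypothetical helper, and train_and_evaluate stands in for whatever function trains on a set of sources and returns a held-out error. It trains once per non-empty subset, so it is only practical with a handful of sources.

```python
from itertools import combinations
from typing import Callable, Iterable


def rank_source_subsets(
    sources: Iterable[str],
    train_and_evaluate: Callable[[frozenset[str]], float],
) -> list[tuple[frozenset[str], float]]:
    """Score every non-empty subset of sources; lowest held-out error first."""
    names = sorted(sources)
    results = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            # One full training run per subset (assumed to be deterministic here).
            results.append((subset, train_and_evaluate(subset)))
    return sorted(results, key=lambda item: item[1])
```

If the top-ranked subset is smaller than the full set, the excluded sources are costing accuracy rather than adding it. The held-out set has to stay identical across every run, or the ranking means nothing.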

"More data" is advice for when your data is clean and your problem is variety. When your problem is precision, more data is often just more noise wearing a helpful disguise.

Synthetic data is a scalpel, not a firehose

We experimented with generating synthetic squat images to expand our training set. The results were not "synthetic is good" or "synthetic is bad." They were maddeningly specific: the same batch of synthetic images helped the model on one of the two landmarks and hurt it on the other.

That forced a better conclusion: synthetic data shifts the distribution your model learns, and whether that shift helps depends entirely on the task. Used as a scalpel, aimed at the exact gap you're trying to close, it earns its place. Used as a firehose to bulk up sample counts, it just moves your problem somewhere you weren't looking.
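
A split result like that only shows up if error is reported per landmark rather than as one blended number. Here is a minimal sketch of that bookkeeping, assuming predictions and labels arrive as one dictionary per image that maps a landmark name to an (x, y) pixel position. The function name and the landmark names hip_crease and knee_top are illustrative, not taken from our code.

```python
import math
from statistics import mean

Point = tuple[float, float]


def mean_error_by_landmark(
    predicted: list[dict[str, Point]],
    labeled: list[dict[str, Point]],
) -> dict[str, float]:
    """Mean Euclidean pixel error, reported separately for each landmark."""
    errors: dict[str, list[float]] = {}
    for pred, truth in zip(predicted, labeled):
        for name, (tx, ty) in truth.items():
            px, py = pred[name]
            errors.setdefault(name, []).append(math.hypot(px - tx, py - ty))
    return {name: mean(values) for name, values in errors.items()}


# Hypothetical single-image example.
labels = [{"hip_crease": (0.0, 0.0), "knee_top": (10.0, 10.0)}]
preds = [{"hip_crease": (3.0, 4.0), "knee_top": (10.0, 12.0)}]
print(mean_error_by_landmark(preds, labels))  # {'hip_crease': 5.0, 'knee_top': 2.0}
```

Comparing these per-landmark numbers with and without the synthetic batch is what separates "helped one, hurt the other" from a flat average that hides both effects.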

The famous pretrained backbone didn't transfer

At one point we tried the textbook upgrade: swap our small from-scratch vision component for a well-known backbone pretrained on millions of internet images. This is supposed to be free money: transfer learning, standing on the shoulders of giants.

It failed across the board. Every variant was worse than the humble from-scratch version it replaced.

The reason, in hindsight, is obvious. Those famous backbones learned their features on large, content-rich photos: dogs, cars, faces. We were feeding them tiny, tightly cropped patches around a single joint. The high-level features that make a network great at "is this a golden retriever" are close to useless on a postage-stamp crop of a knee. And the bigger network had far more parameters to fit on a dataset that, by computer-vision standards, is small. More capacity plus the wrong priors is a recipe for overfitting, not magic.

Borrowed knowledge only transfers when the new problem rhymes with the old one. Ours didn't.

Knowing when to stop

Eventually our error on held-out photos got small enough that we hit a different kind of wall: the model started disagreeing with our human labels about as often as two careful humans disagree with each other. When you're squabbling over a couple of pixels, you're no longer measuring the model; you're measuring the ambiguity in the ground truth itself.

That's a strange and important place to reach. It means the next gains don't come from another clever training trick; they come from a harder, larger evaluation set and better labels. We ran several more rounds of experiments that each came back "within noise," with no clear winner. The discipline was to not ship them, and to not fool ourselves into seeing improvement in what was really just randomness between training runs.
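
"Within noise" can be turned into a rule you apply before looking at the result you hoped for. The sketch below is one crude version, not a formal significance test and not our exact criterion: train the baseline and the candidate several times with different random seeds, then count the candidate as a real gain only if its mean error beats the baseline by more than k times the seed-to-seed standard deviation. The function name, the default k=2.0, and the error values are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_real_gain(
    baseline_errors: list[float],
    candidate_errors: list[float],
    k: float = 2.0,
) -> bool:
    """True only if the improvement exceeds k seed-level standard deviations.

    Each list holds one held-out error per training seed (at least two seeds).
    """
    noise = max(stdev(baseline_errors), stdev(candidate_errors))
    improvement = mean(baseline_errors) - mean(candidate_errors)
    return improvement > k * noise


# Hypothetical held-out errors in pixels, five seeds each.
baseline = [4.1, 3.9, 4.0, 4.2, 3.8]
lucky_roll = [3.9, 4.0, 3.8, 4.1, 4.0]   # slightly better on average
clear_win = [3.0, 3.1, 2.9, 3.0, 3.0]    # better by far more than the spread

print(is_real_gain(baseline, lucky_roll))  # False
print(is_real_gain(baseline, clear_win))   # True
```

The threshold is a judgment call. What matters is fixing it before the runs finish, so the rule decides what ships instead of the mood in the room.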

A lot of machine learning work is knowing the difference between a real gain and a lucky roll of the dice. We've shipped fewer models because of that rule, and slept better for it.

What we're comfortable not telling you

You'll notice we haven't given you architecture diagrams, hyperparameters, or our exact accuracy figures. That's deliberate. Those are the parts that took years and money to find, and they're what make the product hard to copy. The lessons above, though, cost us just as much, and they generalize far beyond squats.

If you lift, all of this disappears behind a single screen: you record a rep, and a moment later you get a verdict you can trust, the same call three referees would make, minus the travel and the entry fee. That trust didn't come from a model. It came from being ruthlessly honest about when the model was wrong.

See it call your depth

Squat Eye uses AI pose estimation to measure your hip crease relative to your knee, frame by frame. Private, on-device, free.
