
Claude can't find faces, here's what actually works in Rails

Ahmed Nadar · 6 min read

I needed face redaction in a Rails app. Detect faces in uploaded photos, mosaic them, store the redacted version. The kind of thing that sounds simple until you try it.

I tried using Claude’s API. Three times. It failed every time. Not because Claude is bad at understanding images; it’s excellent at that. It fails because face detection isn’t an understanding problem. It’s a geometry problem.

What went wrong with the LLM approach

My first attempt: send the photo to Claude, ask it to return face coordinates as percentages of image dimensions. Something like {"cx": 65, "cy": 30, "size": 12}: a center point and a radius.

The response looked reasonable. “There’s a face at approximately 65% from the left, 30% from the top.” But when I converted those percentages to pixel coordinates and applied the mosaic, it landed on empty sidewalk. The face was untouched.
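The conversion itself was never the hard part; the inputs were. For reference, here is roughly what I was doing with Claude's percentage response (a sketch with hypothetical names; in particular, my interpretation of "size" as a radius relative to the longer image dimension is an assumption of this example):

```python
def percent_to_pixels(det, img_w, img_h):
    """Convert a percentage-based detection like {"cx": 65, "cy": 30, "size": 12}
    into a pixel center point and radius."""
    cx = round(det["cx"] / 100 * img_w)
    cy = round(det["cy"] / 100 * img_h)
    # Treat "size" as a diameter relative to the longer dimension.
    radius = round(det["size"] / 100 * max(img_w, img_h) / 2)
    return cx, cy, radius
```

On a 2000×1500 photo, {"cx": 65, "cy": 30} lands at pixel (1300, 450). The math is exact; the coordinates coming in were not.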

Second attempt: switched to bounding boxes {"x": 40, "y": 25, "w": 12, "h": 15}. Added examples. Added explicit instructions for non-upright faces. Same result. The mosaic landed on the person’s belongings, not their face.

Third attempt: more specific prompt, coordinate validation, rejection of out-of-bounds detections. The coordinates improved slightly but were still off by 100+ pixels. Enough to completely miss a face.
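The validation could only reject impossible boxes, not fix plausible-looking wrong ones. A sketch of the kind of check I mean (hypothetical helper, using the percentage bounding-box format from the second attempt):

```python
def plausible_box(box, min_size=2, max_size=90):
    """Reject detections that fall outside the image or have absurd sizes.
    Coordinates are percentages, as in {"x": 40, "y": 25, "w": 12, "h": 15}."""
    if not (0 <= box["x"] and box["x"] + box["w"] <= 100):
        return False
    if not (0 <= box["y"] and box["y"] + box["h"] <= 100):
        return False
    if not (min_size <= box["w"] <= max_size and min_size <= box["h"] <= max_size):
        return False
    return True
```

A box that is 100 pixels off but still inside the frame passes every check like this, which is exactly why the third attempt still missed faces.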

The core issue: Claude understands “there’s a face in the upper right of this image.” But it can’t tell you the precise pixel coordinates. It approximates. For image classification (“what is this a photo of?”) that’s fine. For face detection (“where exactly are the eyes, nose, and chin?”) it’s not enough.

This is how Claude processes images: a Vision Transformer chops the photo into patches (typically 14×14 or 16×16 pixels), converts each patch into an embedding vector, and feeds those embeddings into the language model as tokens. Positional information is encoded in those embeddings (the model knows which patch came from which region), but Claude wasn’t specifically trained to decode that spatial signal back into precise pixel coordinates. It can reason about what is in an image and roughly where, but not at the precision face detection demands.

A note on other models: this isn’t a universal LLM limitation. Google’s Gemini models can return bounding box coordinates; Google specifically trained that capability into the model, and it outputs normalized coordinates on a 0–1000 scale. If you ask Gemini “where is the face in this image?” it can give you a structured bounding box. So “LLMs can’t do spatial geometry” isn’t quite right. More accurately: Claude can’t, some models can, but a dedicated CV model is still the right tool for production, because it’s faster, cheaper, runs locally, and gives deterministic results without API latency or per-image costs. When you need to process every uploaded photo automatically in a background job, you don’t want to be making cloud API calls for something a 227KB model can do in under a millisecond.
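Even with Gemini, you’d still be doing a scale conversion on your end. A sketch of that step, assuming Google’s documented [ymin, xmin, ymax, xmax] ordering on the 0–1000 scale (worth re-checking against the model version you use):

```python
def gemini_box_to_pixels(box, img_w, img_h):
    """Convert a Gemini-style [ymin, xmin, ymax, xmax] bounding box on a
    0-1000 normalized scale into pixel coordinates (x1, y1, x2, y2)."""
    ymin, xmin, ymax, xmax = box
    return (round(xmin / 1000 * img_w), round(ymin / 1000 * img_h),
            round(xmax / 1000 * img_w), round(ymax / 1000 * img_h))
```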

What actually works

OpenCV’s YuNet DNN face detector. It’s a 227KB ONNX model that processes pixels directly. It returns precise bounding boxes: x=641, y=807, w=130, h=161. Not percentages. Not approximations. Exact pixel coordinates.

YuNet handles tilted faces, sideways faces (someone lying on the ground), partial faces (covered by a hat or blanket), and faces at various distances. The Haar cascade detectors that ship with OpenCV work as a fallback for anything YuNet misses.

The mosaic itself is simple: extract the face region, resize it down to a tiny grid (8px blocks), resize it back up with nearest-neighbor interpolation. The result is the classic blocky pixelation you see on news broadcasts. It can’t be reversed. It takes less than a millisecond per face.
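In OpenCV that’s two cv2.resize calls; the effect is easy to show in dependency-free Python, which also makes clear why it’s irreversible: every pixel in a tile collapses to one average value, and the originals are gone. A toy grayscale sketch (not the production code):

```python
def mosaic(pixels, block=8):
    """Pixelate a grayscale image (a list of rows of ints) by replacing each
    block x block tile with its average value. Same effect as resizing down
    and back up with nearest-neighbor interpolation."""
    h, w = len(pixels), len(pixels[0])
    out = [row[:] for row in pixels]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            # Collect every pixel in this tile, clamped at the image edges.
            tile = [pixels[y][x]
                    for y in range(by, min(by + block, h))
                    for x in range(bx, min(bx + block, w))]
            avg = sum(tile) // len(tile)
            for y in range(by, min(by + block, h)):
                for x in range(bx, min(bx + block, w)):
                    out[y][x] = avg
    return out
```

A 2×2 tile holding 0, 8, 8, 0 becomes four 4s; no inverse function can recover which corner held which value.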

The Ruby problem

There is no maintained Ruby gem for face detection. I looked.

  • ruby-opencv exists but it’s abandoned and doesn’t compile on modern Ruby.
  • ruby-vips has great image processing but no face detection.
  • There’s no Ruby binding for YuNet or MediaPipe.

So the architecture is: Ruby calls a Python subprocess via IO.popen. The Python script loads OpenCV, runs YuNet detection, applies the mosaic, and writes the result. Ruby reads the modified file back.

IO.popen(
  ["python3", "-c", script, image_path, ...],
  &:read
).strip

Array-based IO.popen, no shell interpolation, no injection risk. The script is a heredoc constant in the Ruby class, not a file on disk.

This means your Docker image needs python3 and opencv-python-headless. That’s about 40MB. If you’re running Rails in Docker (and you should be), your image is already 400-500MB with Ruby, gems, and libvips. The 40MB is an 8% increase for face detection that actually works.
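The Docker change is a couple of lines. A sketch for a Debian-based Ruby image (your base image, and whether you need pip’s --break-system-packages flag, depend on your setup):

```dockerfile
# Add Python + headless OpenCV for face detection (~40MB total)
RUN apt-get update && apt-get install -y --no-install-recommends \
      python3 python3-pip \
    && rm -rf /var/lib/apt/lists/* \
    && pip3 install --no-cache-dir opencv-python-headless
```

opencv-python-headless skips the GUI dependencies (GTK, X11) that the full opencv-python package drags in, which is what keeps the addition around 40MB.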

The right tool for each job

Here’s what I landed on:

Task | Tool | Why
Image classification | Claude (LLM) | Understands content, context, categories
Face detection | OpenCV YuNet (CV) | Precise pixel coordinates, runs locally, no API cost
Mosaic pixelation | OpenCV (CV) | Fast, deterministic, irreversible
Report generation | Claude (LLM) | Natural language, formal tone

LLMs for understanding. Computer vision for geometry. Each tool does what it’s best at. Even if Gemini could technically handle the face detection, shelling out to a local 227KB model that runs in under a millisecond is a better architectural choice than adding another cloud API dependency to your upload pipeline.

What I’d explore next

A native Ruby face-detection library doesn’t exist, but it could. The YuNet ONNX model is 227KB. Ruby has the onnxruntime gem for running ONNX models. Someone could build a pure-Ruby face detector that loads the model, runs inference, and returns bounding boxes without touching Python.

If you’re in the Ruby community and this interests you, I’d love to hear about it. The gap is real and the use case is everywhere: any app that handles user-uploaded photos needs face detection.

In the meantime, Python subprocess via IO.popen works. It’s not elegant, but it’s correct. And correct matters more than elegant when you’re handling people’s faces. You can see it in action here.

For the non-technical side of this story, why face redaction matters for civic reporting, how I handle homelessness reports differently, and the design decisions behind it, read: Every face in a civic report deserves dignity.

Report an issue: solveto.ca

Support SolveTO: solveto.ca/support

Questions: support@solveto.ca