My wife spotted a pothole while driving. She grabbed her phone and took a photo from inside the car. Blurry. Through the windshield. At an angle. From a moving vehicle.
The AI couldn’t make sense of it. The report failed.
That moment changed everything about how I think about civic reporting.
The real problem
The problem wasn’t the AI. The problem was the process. To report a pothole through a city form, or even through SolveTO’s original flow, you had to:
- Spot an issue while driving
- Pull over safely
- Get out of the car
- Take a clear photo
- Open the app
- Fill out a form, location, description, category
- Submit
Seven steps. For a pothole.
No wonder most people just drive past.
What does the city actually need?
I looked at Toronto’s 311 form. The required fields are straightforward: location, issue type, and a short description. That’s it. The city doesn’t technically need a photo to dispatch a crew. They need to know where and what.
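In code terms, that’s a tiny payload. A sketch of what it might look like, with field names of my own invention, not Toronto’s actual 311 schema:

```typescript
// The minimum a report needs: where and what.
// Illustrative field names, not the city's real schema.
interface CivicReport {
  lat: number;          // where: from the phone's GPS
  lon: number;
  issueType: string;    // what: e.g. "pothole"
  description: string;  // context, e.g. a voice transcript
  photoUrl?: string;    // optional: a crew can be dispatched without one
}
```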
So the question became: how do I capture where and what from someone who is stopped at a red light with maybe ten seconds before the light turns green?
Safety before everything
I need to say this clearly: driving and taking photos don’t mix. Don’t do it. Focus on the road. Be safe. Period.
But what about the moments when you’re already stopped? At an intersection. At a stop sign. Waiting for a light. You see a pothole, a damaged sign, debris on the road. You can’t take a photo. You don’t want to spend time on an app. But you want to mark it, something is wrong here, and move on.
That’s the problem worth solving.
Version 1 failed
The first attempt was a stripped-down camera page. Tap a button, the camera opens, take a photo, GPS is captured automatically, tap Send. Three taps. Fast.
I tested it driving around my neighbourhood. It worked, technically. But problems surfaced immediately.
The camera wouldn’t auto-open. Mobile browsers block that unless it comes from a direct user gesture, a tap. One extra tap, but it broke the flow.
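The fix isn’t a trick, it’s a constraint you design around: the camera can only open from inside a tap handler. A minimal sketch, with made-up element IDs:

```typescript
// getUserMedia is blocked outside a user gesture on mobile browsers,
// so the camera opens from a tap handler, never on page load.
const snapButton = document.querySelector<HTMLButtonElement>("#snap")!;

snapButton.addEventListener("click", async () => {
  const stream = await navigator.mediaDevices.getUserMedia({
    video: { facingMode: "environment" }, // prefer the rear camera
  });
  const preview = document.querySelector<HTMLVideoElement>("#preview")!;
  preview.srcObject = stream;
  await preview.play();
});
```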
The upload took 30+ seconds over mobile data while you watched a spinner. Not good when you need your phone back because you’re driving.
And the photo was still the bottleneck. Even with fewer steps, taking a photo at a red light means: hold the phone up, frame the shot, tap capture, wait for focus. Too much friction for a ten-second window, and too long to make people wait.
The insight that changed everything
Then someone on Twitter said something that clicked:
“If there was a way to just force it through with just a pothole button, and just do it off of geolocation, that actually solves it, even without the picture. If 10 people report an issue in a particular geolocation cluster, the city can derive that there’s a pothole problem in that area statistically.”
He was right. There’s an entire population, drivers, commuters, delivery workers, who see problems every day but will never stop to take a photo. If we could capture even a fraction of what they see, the data alone would be powerful.
But one button with zero context isn’t enough. A bare GPS pin tells the city nothing. “Something is wrong at 43.7, -79.4.” Wrong how? Pothole? Broken light? Fallen tree?
The answer was voice.
Why voice
Drivers can’t type. They can’t browse a map. They can’t fill out forms. But they can talk.
“Big pothole on Don Mills near the gas station.”
In that sentence: issue type (pothole), location context (Don Mills near the gas station), severity (big). Everything the AI needs to classify, generate a formal report, and route it to the right department.
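A hypothetical parse of that sentence, with field names that are mine, not SolveTO’s actual output:

```typescript
// What the AI can pull out of
// "Big pothole on Don Mills near the gas station":
const parsed = {
  issueType: "pothole",                           // what
  severity: "high",                               // "big"
  locationText: "Don Mills near the gas station", // refines the GPS pin
};
```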
That’s how Pin It works:
Tap the + button. Two options: Snap (photo) or Pin (voice). Tap Pin.
Then: Confirm an existing nearby report, or New Pin. If someone already flagged that spot, tap Confirm and you’re done in five seconds! One more tap. No duplication 🎯
If it’s new, tap New Pin. The microphone opens instantly. Speak: “Big pothole on Don Mills near the gas station.” The app transcribes your words in real time. Tap Stop. Review the transcript, use it or retake. Tap Send.
New report: four taps plus speaking. Ten seconds. No photo. No typing. No map. No form.
Confirming an existing one: three taps. Five seconds.
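The live transcription can happen entirely in the browser. A sketch using the Web Speech API, which Chrome still ships behind a webkit prefix and other browsers support unevenly; updateTranscriptUI is a stand-in for whatever paints the screen:

```typescript
// Stand-in UI helper: paints the live transcript.
function updateTranscriptUI(text: string): void {
  document.querySelector("#transcript")!.textContent = text;
}

// Chrome exposes the API with a webkit prefix; support varies by browser.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionCtor();
recognition.lang = "en-CA";
recognition.interimResults = true; // stream partial text while the user speaks

recognition.onresult = (event: any) => {
  const transcript = Array.from(event.results as ArrayLike<any>)
    .map((result) => result[0].transcript)
    .join(" ");
  updateTranscriptUI(transcript);
};

// Started from the New Pin tap, a user gesture, just like the camera.
recognition.start();
// The Stop button calls recognition.stop() and shows the review screen.
```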
The art of removing things
Making something simple is genuinely hard. I spent five hours, from 11 PM to 4 AM, stripping away everything unnecessary. Different layouts, different flows, different button arrangements. Testing on my phone, reloading, trying again.
The question for every element: does removing this make the experience worse? If no, it goes.
Voice description label? Gone: the pulsing mic icon is enough. Photo preview at full size? Shrunk to a thumbnail: the Send button matters more than reviewing pixels. Category selection? Gone: the AI picks it from your words. Map picker? Gone: GPS handles it. Description field? Gone: your voice is the description.
The goal was a principle I borrowed from Rory Sutherland: make it so simple that not using it feels stupider than using it. If a ten-year-old and an eighty-year-old can both use it without thinking, you’ve designed it right.
When nobody has a photo, the community becomes the evidence
One person’s voice pin, “pothole on Don Mills”, is a data point. Easy to dismiss. But what happens when five people independently flag the same intersection?
That’s statistical proof.
When someone taps Pin, the app checks for existing reports within 50 meters. If someone already flagged that spot, you see it in the Confirm list. Tap it. Done. Five seconds. Your confirmation adds weight without creating a duplicate.
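The proximity check is plain geometry. A sketch with the haversine formula; the types and names are mine:

```typescript
type Point = { lat: number; lon: number };
type Report = Point & { id: string; confirmations: number };

const EARTH_RADIUS_M = 6_371_000;

// Great-circle distance between two coordinates, in meters.
function distanceMeters(a: Point, b: Point): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
}

// Existing reports close enough to offer as Confirm candidates.
function nearbyReports(reports: Report[], here: Point, radiusM = 50): Report[] {
  return reports.filter((r) => distanceMeters(here, r) <= radiusM);
}
```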
When enough voices accumulate, the report escalates (a sketch of the thresholds follows this list):
- 5 confirmations: the report gets flagged
- 10 confirmations: it’s sent to city 311
- 15 confirmations: it goes to both 311 and the ward councillor
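As a sketch, the whole ladder is one small function; the status labels are mine, the thresholds are the ones above:

```typescript
type Escalation = "none" | "flagged" | "sent_to_311" | "sent_to_311_and_councillor";

// Map a confirmation count to how far the report escalates.
function escalationFor(confirmations: number): Escalation {
  if (confirmations >= 15) return "sent_to_311_and_councillor";
  if (confirmations >= 10) return "sent_to_311";
  if (confirmations >= 5) return "flagged";
  return "none";
}
```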
No single pin has a photo. But ten people independently flagging the same stretch of road? The city gets something they’ve never had: real-time, crowd-sourced infrastructure intelligence. Not from sensors or inspectors. From the people who drive those roads every day.
Instant feedback, because you’re driving
The old flow made you watch a spinning icon for 30+ seconds while the photo uploaded. That’s unacceptable for someone at a red light who needs their phone back.
Now: tap Send, instantly see a green checkmark and “Safe driving.” The upload happens in the background. You’re free in under a second.
The system doesn’t need you to wait. The AI processing, the classification, the report generation, the email routing, all of that happens after you’ve already put your phone down and driven away.
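One way to build that: acknowledge first, upload after. A sketch assuming a /reports endpoint and two hypothetical helpers, showCheckmark and queueForRetry; note that keepalive caps the request body around 64 KB, fine for a voice pin, not for a photo:

```typescript
declare function showCheckmark(message: string): void;  // hypothetical UI helper
declare function queueForRetry(payload: unknown): void; // hypothetical retry queue

function sendVoicePin(payload: { lat: number; lon: number; transcript: string }): void {
  // Feedback first: the user is free in under a second.
  showCheckmark("Safe driving.");

  // keepalive lets the request finish even if the tab is backgrounded or
  // closed. Small bodies only (~64 KB), so voice pins, not photo uploads.
  fetch("/reports", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
    keepalive: true,
  }).catch(() => queueForRetry(payload));
}
```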
Five words, not two
Through testing, I discovered that two words isn’t enough. “Damage road” tells the AI almost nothing. What kind of damage? Where exactly? How bad?
Five words emerged as the minimum for a meaningful report. “Big pothole near Don Mills” gives the AI enough to work with. The app enforces this: if you say fewer than five words, it asks you to try again. Not to be difficult, but because a report with too little context wastes everyone’s time, including yours.
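The check itself is a one-liner. A sketch:

```typescript
// At least five whitespace-separated words before a pin can be sent.
function hasEnoughWords(transcript: string, min = 5): boolean {
  return transcript.trim().split(/\s+/).filter(Boolean).length >= min;
}

hasEnoughWords("Damage road");                // false: ask the user to retry
hasEnoughWords("Big pothole near Don Mills"); // true: enough to work with
```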
What changed
Before Snap: seven steps, two minutes, requires stopping your car.
After Snap: three taps, ten seconds, works at a red light.
Before Pin It: you needed a photo to file a report.
After Pin It: your voice is enough. The community validates what the camera can’t capture.
The math of civic reporting changed. Maybe 1 in 100 people who spot a pothole will stop and take a photo. With Pin It, that could be 1 in 10. The friction dropped by an order of magnitude.
A blurry photo from a car window led to a voice-first, zero-friction reporting tool that turns every driver into a city sensor.
Every decision, voice over typing, community over photos, instant feedback over spinners, five words over two, had a reason. None of it was obvious. All of it was necessary.
For the technical side of how face detection and privacy work in SolveTO, read: Claude can’t find faces: here’s what actually works in Rails. For why homelessness reports are handled differently, read: Every face in a civic report deserves dignity.
Built from 11 PM to 4 AM. Toronto.