Software Lead for ARC. Led a team of 40 engineers and architected three in-house neural networks that take camera frames in and decisions out.
NN 1 — 6D Pose Estimation
The first network predicts the opponent's 6-DoF pose (position + orientation) from a single camera frame. The output is enough to know not just where the opponent is, but which way it's facing — critical for choosing an approach angle.
Sensor fusion (EKF + IMU): The vision pose is fused with the bot's on-board IMU (Inertial Measurement Unit — accelerometer + gyroscope, hundreds of Hz) through an Extended Kalman Filter. The NN gives accurate-but-slow fixes; the IMU is fast-but-drifty; the EKF combines them into one smooth 6-DoF estimate, even between camera frames.
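The predict/correct loop described above can be sketched in a heavily simplified form. This is not the bot's filter: it is a linear Kalman filter on a single axis with state [position, velocity], whereas the real EKF covers all six degrees of freedom and linearizes the rotation model. The class name, noise values, IMU bias, and sample rates below are all assumptions picked for illustration.

```python
import numpy as np

class AxisKalmanFilter:
    """Linear Kalman filter for one axis; state = [position, velocity]."""

    def __init__(self, accel_noise=0.5, vision_noise=0.05):
        self.x = np.zeros(2)                 # state estimate
        self.P = np.eye(2)                   # state covariance
        self.accel_var = accel_noise ** 2    # assumed IMU noise variance
        self.R = np.array([[vision_noise ** 2]])
        self.H = np.array([[1.0, 0.0]])      # vision observes position only

    def predict(self, accel, dt):
        """Fast IMU step: integrate the measured acceleration over dt."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        B = np.array([0.5 * dt ** 2, dt])
        self.x = F @ self.x + B * accel
        self.P = F @ self.P @ F.T + np.outer(B, B) * self.accel_var

    def update(self, vision_pos):
        """Slow vision step: pull the estimate back toward the position fix."""
        y = np.array([vision_pos]) - self.H @ self.x      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P


def final_error(use_vision, seconds=5.0, imu_hz=500, vision_every=20, seed=0):
    """Simulate one axis and return the final absolute position error."""
    rng = np.random.default_rng(seed)
    kf = AxisKalmanFilter()
    dt = 1.0 / imu_hz
    true_pos, true_vel, true_acc = 0.0, 0.0, 0.5
    for step in range(1, int(seconds * imu_hz) + 1):
        true_pos += true_vel * dt + 0.5 * true_acc * dt ** 2
        true_vel += true_acc * dt
        # Hypothetical IMU reading: constant bias (causes drift) plus noise.
        kf.predict(true_acc + 0.2 + rng.normal(0.0, 0.5), dt)
        if use_vision and step % vision_every == 0:
            kf.update(true_pos + rng.normal(0.0, 0.05))
    return float(abs(kf.x[0] - true_pos))


imu_only_error = final_error(use_vision=False)
fused_error = final_error(use_vision=True)
```

In this toy setup the IMU-only estimate drifts steadily because of the bias, while the fused estimate stays bounded: each vision fix resets the accumulated error, and the IMU fills in the motion between fixes.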
Jargon:
- Pose: Position + orientation.
- 6-DoF: Three position numbers + three orientation numbers.
- Keypoint: A labeled landmark (e.g., weapon tip) the network finds.
- IMU: Sensor with an accelerometer (linear motion) and gyroscope (rotation).
- EKF: Extended Kalman Filter. Fuses noisy sensor streams into a single best estimate. The "Extended" part linearizes nonlinear models, such as rotation, around the current estimate.
- Drift: Error that builds up when IMU readings are integrated on their own; each vision fix corrects it.
NN 2 — Semantic Segmentation @ 90 FPS
The second network is a semantic segmentation model that labels every pixel of the camera feed as opponent or not. It was trained from scratch and quantized to run at 90 FPS on the bot.
Blue pixels = the opponent mask predicted by the network on live arena footage.
Segmentation gives us a much richer signal than a bounding box: we can see partial occlusions, shape changes during a flip, and exact weapon-arm geometry — all of which feed the strategy network downstream.
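To make the comparison with a bounding box concrete, here is a small sketch on a toy 6x6 mask. The L-shaped "opponent", the function name `mask_summary`, and the returned fields are all made up for illustration; they are not taken from our pipeline.

```python
import numpy as np

def mask_summary(mask):
    """Return the bounding box, pixel area, fill ratio, and centroid of a binary mask."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None  # nothing labeled as opponent in this frame
    top, bottom = int(rows.min()), int(rows.max())
    left, right = int(cols.min()), int(cols.max())
    box_area = (bottom - top + 1) * (right - left + 1)
    return {
        "bbox": (top, left, bottom, right),
        "mask_area": int(rows.size),
        # Fraction of the box that is actually opponent pixels.
        "fill_ratio": rows.size / box_area,
        "centroid": (float(rows.mean()), float(cols.mean())),
    }

# Toy frame: a body (column 1) with a weapon arm sticking out (row 4).
mask = np.zeros((6, 6), dtype=bool)
mask[1:5, 1] = True
mask[4, 1:5] = True
summary = mask_summary(mask)
```

Here the bounding box covers 16 pixels but only 7 of them belong to the opponent. A box alone would hide the arm's geometry, while the mask keeps it.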
Jargon:
- Semantic segmentation: Classify every pixel in an image. Output is a mask the same size as the image.
- Mask: A per-pixel label map. Here, a binary mask: opponent vs. not.
- Encoder–decoder: The standard segmentation architecture — compress the image into features, then expand back to a pixel map.
- Quantization: Storing the network's weights at lower precision (e.g., 32-bit float → 8-bit integer) so it runs faster on edge hardware, usually with only a small accuracy loss.
- FPS: Frames per second. 90 FPS = ~11 ms per frame.
- Edge inference: Running the network on the bot, not over a wireless link.
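The 32-bit → 8-bit idea from the quantization entry can be sketched with NumPy. This shows symmetric per-tensor quantization, one common scheme, and not necessarily the one used on the bot; the function names and the random stand-in weights are assumptions.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 weights -> int8 values plus one scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale

# Stand-in for one layer's weights (random values, not real model weights).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = float(np.max(np.abs(weights - restored)))
```

The int8 tensor takes a quarter of the memory of the float32 original, and the round-trip error per weight is bounded by half the scale. Whether that error matters for accuracy has to be checked on the actual model.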
NN 3 — Strategy (PPO)
A policy network trained with PPO (Proximal Policy Optimization). The agent plays against itself in simulation, learning which actions win.
PPO training dashboard: win rate, throttle, distance, planned paths.
Jargon:
- RL: Reinforcement learning. Train by rewarding good outcomes and penalizing bad ones.
- PPO: An RL algorithm that stays stable by clipping each policy update, so the new policy cannot move far from the old one in a single step.
- Policy network: Maps current state → action.
- Reward shaping: Designing the reward (we reward hits; penalize idle time and flips).
- Self-play: Train against past versions of yourself.
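The "small policy updates" idea in PPO comes from its clipped surrogate objective. Below is a minimal NumPy sketch of that objective only, not our training code. The function name is made up, and `clip_eps=0.2` is a commonly used default rather than a value taken from this project.

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch of actions."""
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return float(np.mean(np.minimum(unclipped, clipped)))

# A good action (advantage +1) whose probability doubled: the gain is capped at 1.2.
capped = ppo_clip_objective(np.log([1.0]), np.log([0.5]), np.array([1.0]))

# The same action after a modest change (ratio 1.1): no clipping applies.
uncapped = ppo_clip_objective(np.log([0.55]), np.log([0.5]), np.array([1.0]))
```

Once the ratio leaves the range [1 - clip_eps, 1 + clip_eps] in the direction the advantage favors, the objective stops improving, which is what keeps each update small.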
Why three networks?
Splitting pose → segmentation → policy gives each stage a clean contract, lets us train and debug them separately, and lets each run at the frame rate it actually needs.