FOMO is a TinyML neural network for real-time object detection


This article is part of our coverage of the latest in AI research.

A new machine learning technique developed by researchers at Edge Impulse, a platform for creating ML models for the edge, makes it possible to run real-time object detection on devices with very limited compute and memory. Called Faster Objects, More Objects (FOMO), the new deep learning architecture can unlock new computer vision applications.

Most object-detection deep learning models have memory and computation requirements that are beyond the capacity of small processors. FOMO, on the other hand, only requires several hundred kilobytes of memory, which makes it a great technique for TinyML, a subfield of machine learning focused on running ML models on microcontrollers and other memory-constrained devices that have limited or no internet connectivity.

Image classification vs object detection

TinyML has made great progress in image classification, where the machine learning model only has to predict whether a certain type of object is present in an image. Object detection, on the other hand, requires the model to identify more than one object, as well as the bounding box of each instance.

[Image: object detection on elephants]

Object detection models are much more complex than image classification networks and require more memory.
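The difference between the two tasks shows up in the shape of the model's output. The sketch below is only an illustration: the class names, fields, scores, and the `count_by_label` helper are hypothetical and do not come from Edge Impulse or any particular library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Classification:
    """Image classification: a single label and score for the whole image."""
    label: str
    score: float


@dataclass
class Detection:
    """Object detection: one entry per object instance, each with its own box."""
    label: str
    score: float
    # Bounding box in pixels: top-left corner plus width and height.
    x: int
    y: int
    width: int
    height: int


def count_by_label(detections: List[Detection], min_score: float = 0.5) -> Dict[str, int]:
    """Count detected instances per label, skipping low-confidence detections."""
    counts: Dict[str, int] = {}
    for det in detections:
        if det.score >= min_score:
            counts[det.label] = counts.get(det.label, 0) + 1
    return counts


# Made-up outputs for the same photo.
classification = Classification(label="elephant", score=0.97)
detections = [
    Detection("elephant", 0.94, x=12, y=40, width=80, height=60),
    Detection("elephant", 0.91, x=110, y=35, width=90, height=70),
    Detection("elephant", 0.30, x=200, y=50, width=40, height=30),  # below threshold
]

print(classification.label)        # the classifier only says what is in the image
print(count_by_label(detections))  # the detector also says how many, and where
```

A classifier returns a fixed-size answer no matter how many objects are in the frame, while a detector returns a variable-length list with coordinates for every instance, which is part of why detection networks tend to be larger.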

“We added computer vision support to Edge Impulse back in 2020, and we’ve seen a tremendous pickup of applications (40 percent of our projects are computer vision applications),” Jan Jongboom, CTO at Edge Impulse, told TechTalks. “But with the current state-of-the-art models you could only do image classification on microcontrollers.”

Image classification is very useful for many applications. For example, a security camera can use TinyML image classification to determine whether there’s a person in the frame or not. However, much more can be done.

“It was a big nuisance that you’re limited to these very basic classification tasks. There’s a lot of value in seeing ‘there are three people here’ or ‘this label is in the top left corner,’ e.g., counting things is one of the biggest asks we see in the market today,” Jongboom says.

Earlier object detection ML models had to process the input image several times to locate the objects, which made them slow and computationally expensive. More recent models such as YOLO (You Only Look Once) use single-shot detection to provide near real-time object detection. But their memory requirements are still large. Even models designed for edge applications are hard to run on small devices.

“YOLOv5 or MobileNet SSD are just insanely large networks that never will fit on MCU and barely fit on Raspberry Pi–class devices,” Jongboom says.

Moreover, these models are bad at detecting small objects and they need a lot of data. For example, YOLOv5 recommends more than 10,000 training instances per object class.

The idea behind FOMO is that not all object-detection applications require the high-precision output that state-of-the-art deep learning models provide. By finding the right tradeoff between accuracy, speed, and memory, you can shrink your deep learning models to very small sizes while keeping them useful.

Instead of detecting bounding boxes, FOMO predicts the object’s center. This is because many object detection applications are just interested in the location of objects in the frame and not their sizes. Detecting centroids is much more compute-efficient than bounding box prediction and requires less data.

[Image: sheep detection, bounding boxes vs. centroids]
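To make the contrast concrete, here is a minimal sketch of reducing a bounding box to its centroid. The box format, the helper names, and the 8-pixel grid used to coarsen the position are assumptions made for illustration, not a description of FOMO's internals.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels, assumed format
Point = Tuple[float, float]


def box_to_centroid(box: Box) -> Point:
    """Reduce a bounding box to the coordinates of its center."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def centroid_to_cell(centroid: Point, cell_size: int) -> Tuple[int, int]:
    """Map a centroid to the (column, row) of a coarse grid cell.

    cell_size is a hypothetical parameter: the grid cell's side in pixels.
    """
    cx, cy = centroid
    return (int(cx // cell_size), int(cy // cell_size))


# Two made-up boxes, e.g. two sheep in one frame.
boxes: List[Box] = [(12, 40, 80, 60), (110, 35, 90, 70)]

centroids = [box_to_centroid(b) for b in boxes]
cells = [centroid_to_cell(c, cell_size=8) for c in centroids]

print(centroids)  # two numbers per object instead of four
print(cells)      # coarse positions, still enough to locate and count objects
```

The size of each object is discarded, but the position and the count survive, which is all that many applications need.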

Redefining object detection deep learning architectures

FOMO also applies a major structural change to traditional deep learning architectures.

Single-shot object detectors are composed of a set of convolutional layers that extract features and several fully-connected layers that predict the bounding box. The convolutional layers extract visual features in a hierarchical way. The first layer detects simple things such as lines and edges in different directions. Each convolutional layer is usually coupled with a pooling layer, which reduces the size of the layer’s output and keeps the most prominent features in each area.
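The convolution-plus-pooling step can be sketched in plain Python. This is a toy illustration under stated assumptions: the vertical-edge kernel is written by hand (a trained network learns its own kernels), and the helper names `conv2d`, `relu`, and `max_pool` are local to this example rather than calls into any framework.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """2D cross-correlation with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map: Matrix) -> Matrix:
    """Zero out negative responses."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; leftover rows and columns are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 6x8 image: dark on the left (0), bright on the right (1), so one vertical edge.
image = [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]

# Hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
]

features = relu(conv2d(image, vertical_edge))  # 4x6 feature map
pooled = max_pool(features, size=2)            # 2x3 after pooling

for row in pooled:
    print(row)
```

The pooled map is a quarter of the size of the feature map, yet it still shows that the edge sits toward the right of the image: the strongest response in each 2x2 area is kept and the rest is thrown away.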

The pooling layer’s output is then fed to the next convolutional layer, which extracts higher-level features, such as corners, arcs, and circles. As more convolutional and pooling layers are added, the feature maps zoom out and can detect complicated things such as faces and objects.

[Image: visualization of neural network layers]
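The "zooming out" effect can be quantified with the receptive field: the number of input pixels, along one axis, that influence a single value in a layer's output. The sketch below uses the standard recurrence for stacked convolution and pooling layers. The five-layer stack is made up for illustration and is not the architecture of FOMO or any specific model.

```python
from typing import List, Tuple

Layer = Tuple[str, int, int]  # (name, kernel size, stride)


def receptive_fields(layers: List[Layer]) -> List[Tuple[str, int]]:
    """Return the receptive field (in input pixels, one axis) after each layer.

    Recurrence: rf_out = rf_in + (kernel - 1) * jump_in
                jump_out = jump_in * stride
    where jump is the distance in input pixels between neighboring outputs.
    """
    rf, jump = 1, 1
    result = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        result.append((name, rf))
    return result


# Hypothetical stack: 3x3 convolutions, each followed by 2x2 pooling.
layers: List[Layer] = [
    ("conv1", 3, 1),
    ("pool1", 2, 2),
    ("conv2", 3, 1),
    ("pool2", 2, 2),
    ("conv3", 3, 1),
]

for name, rf in receptive_fields(layers):
    print(f"{name}: each output value sees a {rf}x{rf} patch of the input")
```

In this toy stack, a value in the first layer's output depends on a 3x3 patch of the input, while a value in the third convolutional layer's output depends on an 18x18 patch, which is why deeper layers can respond to larger structures.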