What is Augmented Computer Vision?
As humans, we are capable of understanding and describing a scene captured in an image or video. Computer vision is a field of artificial intelligence that seeks to simulate the human visual system. Augmented computer vision is currently one of the most in-demand applications of artificial intelligence and machine learning, given its wide variety of uses and tremendous potential.
How is it currently applied in different industries?
Given a two-dimensional image or video, a computer vision system is trained to recognize the objects present and their characteristics, such as shape, color, size, and spatial arrangement, in order to provide as complete a description as possible of the presented image or video.
Computer vision is complex, but it has an immense range of practical applications. Enterprises of all types and sizes, from the e-commerce industry to traditional brick-and-mortar businesses, can take advantage of its powerful capabilities.
What are some well-known business use cases?
Inventory management
Predictive maintenance
Defect reduction
Medical image analysis
Visual classification and documentation
Product sorting
Soil quality analysis
What tasks are used to render augmented computer vision?
Localization: the task of localization is to define a bounding box that encloses an object in the image. Localization is particularly useful: it allows, for instance, the automatic cropping of objects in a set of images, and when combined with classification it makes it possible to quickly build a dataset of cropped images of a desired object class.
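The cropping step that localization enables can be sketched in a few lines. This is a minimal illustration using NumPy array slicing on a dummy grayscale image; the bounding-box format (x_min, y_min, x_max, y_max) is an assumption, since different frameworks order the coordinates differently.

```python
import numpy as np

def crop_box(image, box):
    """Crop a detected object from an image given a bounding box.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates;
    the maxima are exclusive and clipped to the image bounds.
    """
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w, x1), min(h, y1)
    return image[y0:y1, x0:x1]

# A dummy 100x100 grayscale "image" with a bright 20x30 patch
# standing in for a localized object.
img = np.zeros((100, 100), dtype=np.uint8)
img[40:60, 10:40] = 255
crop = crop_box(img, (10, 40, 40, 60))
```

Applied over a whole image set, this is all that is needed to turn localization output into a dataset of cropped object images.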
Image classification: one of the best-known tasks in computer vision, image classification assigns a given image to one of a set of predefined categories. A set of labeled training images is used to train the algorithm.
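The final step of classification can be sketched independently of the model that produces the scores. Here the class labels and the raw scores (logits) are made-up placeholders; in practice a trained network, such as a CNN, would output the logits, and a softmax turns them into a probability per category.

```python
import numpy as np

# Hypothetical category set; a real classifier would be trained on
# labeled images of these classes.
LABELS = ["cat", "dog", "car"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

def classify(logits):
    """Return the highest-probability label and its confidence."""
    probs = softmax(np.asarray(logits, dtype=float))
    i = int(np.argmax(probs))
    return LABELS[i], float(probs[i])

# Placeholder logits for one input image.
label, confidence = classify([0.2, 2.5, -1.0])
```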
Object detection: this task combines localization and classification, repeated for all objects of interest, so that a large number of objects in an image frame can be detected and labeled simultaneously. The purpose of object detection is therefore to find, and then classify, a variable number of objects in an image.
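Because detectors propose many overlapping boxes for the same object, a standard post-processing step is non-maximum suppression: keep the highest-scoring box and discard near-duplicates that overlap it heavily. A minimal sketch, with the (box, score, class) tuple format and the 0.5 overlap threshold chosen for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the best-scoring box among heavily overlapping detections.

    `detections` is a list of (box, score, class_name) tuples.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) < iou_threshold]
    return kept

dets = [((10, 10, 50, 50), 0.9, "car"),
        ((12, 12, 52, 52), 0.8, "car"),   # near-duplicate of the first
        ((100, 100, 140, 140), 0.7, "person")]
kept = non_max_suppression(dets)
```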
Instance segmentation: the next step after object detection, instance segmentation is not only about finding objects in an image but also about creating a separate pixel-level mask for each detected object. Using instance segmentation techniques, it is possible, for example, to blur out young children's faces in broadcast video.
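What distinguishes instance segmentation from a bounding box is that the mask selects exactly the pixels belonging to the object, so redaction touches nothing else. A minimal sketch with NumPy, using a flat fill in place of a real blur (the masking mechanics are the same either way):

```python
import numpy as np

def redact_with_mask(image, mask):
    """Replace the pixels selected by a boolean instance mask.

    A production system would blur the region instead of flat-filling
    it, but the per-instance mask picks out the same pixels either way.
    """
    out = image.copy()
    out[mask] = out[mask].mean(axis=0).astype(out.dtype)  # flat fill
    return out

# Toy 8x8 grayscale image and a 3x3 "face" instance mask.
img = np.arange(64, dtype=np.uint8).reshape(8, 8)
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 2:5] = True
redacted = redact_with_mask(img, mask)
```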
Object tracking: the purpose of object tracking is to follow an object in motion over time, using consecutive video frames as input. This functionality is essential for various human-tracking systems, from those that try to understand customer behavior, as in retail, to those that continuously monitor football or basketball players during a game.
A relatively straightforward way to perform object tracking is to apply object detection to each image in a video sequence and then compare the instances of each object across frames to determine how they moved. The drawback of this approach is that running object detection on every individual frame is typically expensive. An alternative is to capture the object being tracked only once (as a rule, the first time it appears) and then discern its movements without explicitly recognizing it in subsequent frames. Finally, an object tracking method does not necessarily need to be capable of detecting objects at all; it can be based purely on motion criteria, without knowing what kind of object is being tracked.
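The detect-every-frame strategy described above can be sketched as a greedy association step: each new detection is matched to the existing track whose last box it overlaps most, and unmatched detections open new tracks. The function name, the box-overlap matching rule, and the 0.3 threshold are illustrative choices, not a standard tracker implementation.

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def update_tracks(tracks, detections, next_id, iou_threshold=0.3):
    """Associate current-frame detections with tracks by box overlap.

    `tracks` maps track_id -> last known box. Unmatched detections
    open new tracks; tracks with no matching detection drop out.
    """
    updated, free = {}, dict(tracks)
    for box in detections:
        best, best_iou = None, iou_threshold
        for tid, prev in free.items():
            overlap = iou(prev, box)
            if overlap >= best_iou:
                best, best_iou = tid, overlap
        if best is None:
            updated[next_id] = box
            next_id += 1
        else:
            updated[best] = box
            del free[best]
    return updated, next_id

# Frame 1: two objects. Frame 2: the first moves slightly, the second
# leaves the scene, and a new object appears.
tracks, nid = update_tracks({}, [(0, 0, 10, 10), (50, 50, 60, 60)], 1)
tracks, nid = update_tracks(tracks, [(2, 2, 12, 12), (90, 90, 99, 99)], nid)
```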
Computer Vision at work: deep learning methods have deeply transformed computer vision, along with other areas of artificial intelligence, to such an extent that for many tasks their use is considered standard. In particular, Convolutional Neural Networks (CNNs) have achieved results beyond the state of the art set by traditional computer vision techniques.
These steps outline a general approach to building a successful computer vision model using CNNs:
- Create a dataset of annotated images, or use an existing one. Annotations can be the image category (for a classification problem); pairs of bounding boxes and classes (for object detection); or a pixel-wise segmentation of each object of interest (for instance segmentation).
- Extract, from each image, features pertinent to the task at hand. This is a key point in modelling the problem: the features used to recognize faces, based on facial criteria, are obviously not the same as those used to recognize tourist attractions or industrial objects.
- Train a deep learning model based on the features required for correct object classification. Training means feeding the model many images until it reaches a target level of accuracy. The self-learning ability of the deep learning algorithm also allows the system to continuously learn from live examples.
- Evaluate the model using live video or images that were not part of the training set, so that the accuracy of the trained model can be tested.
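The steps above can be sketched end to end on toy data. This is deliberately not a CNN: the random feature vectors stand in for extracted image features, and a nearest-centroid rule stands in for the trained model, so that the dataset / train / evaluate structure is visible in a few lines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a toy annotated dataset -- feature vectors standing in for
# images, with integer class labels (0 and 1).
X_train = np.vstack([rng.normal(0, 0.5, (20, 4)),
                     rng.normal(3, 0.5, (20, 4))])
y_train = np.array([0] * 20 + [1] * 20)

# Steps 2-3: "train" by computing one centroid per class from the
# extracted features (a stand-in for fitting a deep model).
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    """Assign the class whose centroid is nearest to the features."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Step 4: evaluate on held-out samples the model never saw.
X_test = np.vstack([rng.normal(0, 0.5, (5, 4)),
                    rng.normal(3, 0.5, (5, 4))])
y_test = np.array([0] * 5 + [1] * 5)
accuracy = np.mean([predict(x) == y for x, y in zip(X_test, y_test)])
```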
Importance of datasets: datasets play a very important role; every time a new dataset is released, new methods appear, and existing models are compared and often improved upon. Unfortunately, there are not enough datasets for object detection, since detection data is harder and generally more expensive to collect and annotate.
Datasets, however, are critical for developing augmented computer vision applications, and millions of images containing bounding box annotations need to be parsed for proper model training. Custom training of datasets for computer vision deployments can add to the cost of annotation, training, and final deployment.
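The parsing step mentioned above typically means reorganizing flat annotation records into a per-image lookup before training. A minimal sketch; the record fields below are illustrative (loosely COCO-style), not an exact schema of any particular dataset.

```python
# Illustrative flat annotation records: one record per labeled box,
# with (x, y, width, height) boxes. Field names are assumptions.
annotations = [
    {"image_id": 1, "category": "car",    "bbox": [10, 20, 30, 40]},
    {"image_id": 1, "category": "person", "bbox": [50, 50, 20, 60]},
    {"image_id": 2, "category": "car",    "bbox": [0, 0, 15, 15]},
]

def group_by_image(records):
    """Group (category, bbox) pairs by image id for training-time lookup."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["image_id"], []).append(
            (rec["category"], tuple(rec["bbox"])))
    return grouped

per_image = group_by_image(annotations)
```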