Software can now easily spot objects in images, but it can't always describe those objects well; a label like "short man with horse" not only sounds awkward, it doesn't reveal what's really going on. That's where a computer vision breakthrough from Google and Stanford University might come into play. Their system combines two neural networks, one for image recognition and another for natural language processing, to describe a whole scene in phrases. The program needs to be trained on captioned images, but it produces much more intelligible output than you'd get by picking out individual items. Instead of simply noting that there's a motorcycle and a person in a photo, the software can tell that the person is riding a motorcycle down a dirt road. The software is also roughly twice as accurate at labeling previously unseen objects as earlier algorithms, since it's better at recognizing patterns.
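To make the two-network idea concrete, here is a toy sketch of the pipeline: an "encoder" stands in for the image-recognition network, and a greedy "decoder" stands in for the language network that emits one word at a time. Everything here (the vocabulary, the hand-written transition table, the fake object list) is invented for illustration; the real system uses a convolutional network encoder and a recurrent language model trained on captioned images.

```python
def encode(image):
    # A real encoder is a convolutional net producing a feature vector;
    # here we just report which objects a detector might have found.
    return set(image["objects"])

def decode(features):
    # A real decoder is a recurrent net that scores the next word given
    # the image features and the words emitted so far; a hand-written
    # transition table plays that role in this sketch.
    transitions = {
        "<start>": "a",
        "a": "person" if "person" in features else "<end>",
        "person": "riding" if "motorcycle" in features else "<end>",
        "riding": "a_motorcycle",
        "a_motorcycle": "down_a_dirt_road" if "road" in features else "<end>",
        "down_a_dirt_road": "<end>",
    }
    words, word = [], "<start>"
    while word != "<end>":
        word = transitions.get(word, "<end>")
        if word != "<end>":
            words.append(word)
    return " ".join(words).replace("_", " ")

caption = decode(encode({"objects": ["person", "motorcycle", "road"]}))
print(caption)  # a person riding a motorcycle down a dirt road
```

The point of the split is that the decoder conditions every word choice on the image features, so the output is a sentence about the scene rather than a bag of object labels.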
The technology has its problems, especially if you don't have a large pool of training images. It frequently makes mistakes, as you can see in the examples above. However, these are still early days. Larger data sets should improve the detection routine's performance, and there are likely to be refinements to the code itself. When it does get significantly better, it could have a tremendous effect on everything from artificial intelligence to search. A robot could tell you exactly what it sees without requiring that you look at a camera to check its findings; alternatively, you could search for images using ordinary sentences and get only the results you want. It might be years before you see Google's technique used in the real world, but it's clear that you won't have to deal with stilted, machine-like descriptions for too much longer.
Via: New York Times