What Is Visual Search?
Visual search is not a widely known form of search, but it has existed for some time. We have compiled these articles to help explain it.
Although one of the lesser-discussed forms of search, visual search has been around for several years and is built into several popular search engines and social media platforms, including Pinterest, Bing, Snapchat, Amazon, and, of course, Google.
According to one statistic reported here, 62% of millennials and members of Gen Z want to use visual search over all other search types.
Visual search uses artificial intelligence technology to help people search through real-world imagery rather than through text search.
So, when a person takes a photograph of an object using Google Lens, for instance, the software identifies the object in the picture and provides information and search results to the user.
This technology is beneficial for eCommerce stores and brands: with well-optimized content, they stand a chance of being the search result returned to a user.
A company that appears for the result of a popular search query could stand to make a lot of money.
Visual search falls under the umbrella of what is known as “sensory search,” which includes searching via text, voice, and vision.
Although both visual and image search are based on imagery, the crucial difference is that people use words to conduct an image search, whereas with visual search, a user uses an image to perform the search.
As you are probably aware, image search has been in the public consciousness for nearly 20 years. Google introduced the search format way back in July 2001 because the search engine could not handle the number of people searching for an image of Jennifer Lopez in a particular green dress.
Why is Visual Search Important?
Visual search has the potential to change the way we interact with the world around us.
Our culture is already dominated by the visual, so it seems natural to use an image to start a search.
After all, when we shop offline, we rarely start with text. Visual search brings that sense of visual discovery to the online world.
Moreover, we often want to find a new look, outfit, or theme rather than a single object. Visual search technology helps tie these items together based on aesthetic links in a way that text has never been able to capture.
The technology is still in its infant stages, but recent trends reveal how quickly it can advance. In particular, Pinterest’s visual search technology (known as Pinterest Lens) and Google’s visual search (also, imaginatively, called Lens) are leading the way.
So, I have compiled a list of fascinating, newsworthy, and impressive visual search facts and stats here.
I’ll keep updating it frequently, and hopefully, it will serve as a valuable resource for those who wish to understand how visual search can affect consumers’ everyday lives. Or just for those who need a few visual search statistics for a presentation.
90% of information transmitted to the human brain is visual. MIT
The human brain can identify images seen for as little as 13 milliseconds. MIT
The image recognition market will grow to $25.65 billion by 2019. MarketsandMarkets
62% of millennials want visual search over any other new technology. Vacanze
21% of advertisers believe that visual search was the most critical trend for their business in 2019. Marin Software
45% of retailers in the UK now use visual search. Tech HQ
36% of consumers have conducted a visual search. The Intent Lab
35% of marketers plan to optimize for visual search through 2020. Search Engine Journal
55% of consumers say Visual Search is instrumental in developing their style and taste. Pinterest
By 2021, early adopter brands that redesign their websites to support visual and voice search will increase digital commerce revenue by 30%. Gartner
Visual information is preferred over text by at least 50% of respondents in all categories except electronics, household goods, wine, and spirits. The Intent Lab
59% think visual information is more critical than textual information across categories (vs. 41% who think textual information is more critical). The Intent Lab
When shopping online for clothing or furniture, more than 85% of respondents place more emphasis on visual information than on text information. The Intent Lab
20% of app users make use of visual search when the feature is available. GlobalData
The Global Visual Search Market is estimated to surpass $14,727m by 2023, growing at CAGR +9% during the forecast period 2018–2023. Report Consultant
Images are returned for 19% of search queries on Google. Moz
Photos account for 34% of search results. Econsultancy
Google Lens can detect over 1 billion objects. TechSpot
Google Shopping ads see a budget increase of 33%, versus a rise of just 3% for text ads. Merkle report, Q3 2018
Google Lens is being used over 1 billion times. Google (2019)
Google Lens is used over 3 billion times per month. Google (2021)
Since the revolution in artificial neural networks, some methods previously used to solve business challenges have become less effective and less popular. The latest algorithms work on black-box principles, which significantly impacts results.
This method relies on image matrices. A matrix is an image converted into machine-readable form. Each pixel is described in numerical format depending on the colour.
With two image matrices, an algorithm can calculate the mathematical distance between the images, which allows measuring their similarity. Let’s turn to the LFW-based results below:
The closer an image is to the original photo, the shorter the distance and the higher the similarity between the two images. A tag of “Yes” means an exact match. But the results are far from impressive.
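As a minimal sketch of this idea, in pure Python with tiny invented 2×2 grayscale “images” standing in for real photos, the matrix-distance comparison might look like:

```python
import math

def image_distance(img_a, img_b):
    """Euclidean distance between two same-sized grayscale image matrices.

    Each matrix is a list of rows; each pixel is a number (e.g. 0-255).
    A smaller distance means the images are more similar."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)))

original = [[10, 20], [30, 40]]
near_copy = [[11, 20], [30, 41]]    # slightly modified copy
different = [[200, 180], [150, 90]]  # an unrelated image

print(image_distance(original, near_copy))   # small distance: similar
print(image_distance(original, different))   # large distance: dissimilar
```

An identical pair yields a distance of zero, which is the “exact match” case tagged “Yes” above.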
It is easier for computers to compare text files than graphic ones. A hash is a unique sequence of symbols, and a special algorithm is used to convert an image into a string of text. In image search, image hashing has some specifics well covered here and here.
In brief, if there are two input photos of the same person, but one image is compressed or modified in some other way, a hashing algorithm produces two different hashes. By comparing the hashes, a visual search model will then erroneously treat the two photos as different people. In computer vision, specific hashing algorithms are used to avoid this error. Nonetheless, even these don’t yield outstanding results.
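A minimal sketch of the problem, using a toy 8-pixel “image” and two illustrative hash functions (a byte-level cryptographic hash versus a simple average-based perceptual hash), might look like:

```python
import hashlib

def cryptographic_hash(pixels):
    """A naive byte-level hash: any modification changes it completely."""
    return hashlib.md5(bytes(pixels)).hexdigest()

def average_hash(pixels):
    """A simple perceptual (average) hash: each bit records whether a pixel
    is above the mean brightness, so small changes rarely flip bits."""
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)

photo = [10, 200, 30, 220, 15, 210, 25, 230]       # toy 8-pixel "image"
compressed = [12, 198, 31, 219, 16, 208, 26, 229]  # slightly altered copy

print(cryptographic_hash(photo) == cryptographic_hash(compressed))  # False
print(average_hash(photo) == average_hash(compressed))              # True
```

The byte-level hash treats the compressed copy as a completely different image, while the perceptual hash still matches, which is why computer vision leans on the latter family of algorithms.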
Algorithms employed within simple approaches perceive an image as a set of spots. So, for example, if there is a black sofa on a white background, the algorithm finds images containing similar black and white spots.
More complex architectures – i.e., artificial neural networks based on deep learning (DL) – consider a semantic meaning. They recognize meaningful objects on an image. The network structure includes multiple layers, each trained to recognize specific patterns. The deepest layer is capable of inferring the semantic meaning of an image. Roughly speaking, it is what helps computers to see images like us.
Each layer of a network provides some output or result that the next layer takes as input data. For the deepest layer, the outcome is called embedding. Next, the embedding should be taught some metrics so that the last layer can deliver quality recognition results. This method is called deep metric learning.
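To illustrate the layer-by-layer flow, here is a toy sketch in pure Python. The weights are invented for illustration (a real network learns them during training), and the final layer’s output plays the role of the embedding:

```python
def relu(values):
    """A common activation: negative outputs are clipped to zero."""
    return [max(0.0, v) for v in values]

def layer(weights, inputs):
    """One dense layer: each output is a weighted sum of all inputs."""
    return relu([sum(w * i for w, i in zip(row, inputs)) for row in weights])

def embed(image_vector, layers):
    """Feed the image through every layer; each layer's output becomes the
    next layer's input, and the final output is the embedding."""
    out = image_vector
    for weights in layers:
        out = layer(weights, out)
    return out

# Toy weights invented for illustration (a real network learns these).
layers = [
    [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],  # layer 1: 3 inputs -> 2 outputs
    [[1.0, -1.0], [0.5, 0.5]],             # layer 2: 2 inputs -> 2 outputs
]
print(embed([1.0, 0.0, 0.0], layers))
```

The resulting vector is what deep metric learning then shapes, so that distances between embeddings reflect semantic similarity.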
The first robust algorithm for deep metric learning, DeepID2, was described in 2014. It proved that computers had the potential to become just as good as humans at face recognition, or even better. Since then, neural networks driven by artificial intelligence have evolved dramatically. Today, the specific type of algorithm matters less than the challenge at hand, the chosen method, and the parameters used for model quality validation.
Loss functions in model training serve to minimize the difference between the model’s output and the ground truth, simply put, what is actually in the image. In addition, they are used to evaluate the model’s ability to predict the class of an image.
To perform image retrieval tasks, a model must first learn some discriminative features. Contrastive loss is the most basic function used to teach a model to discriminate between a pair of images based on a set of features. Triplet loss, similarly, works on three images at a time.
There are more up-to-date losses, such as CoCo loss, vMF loss, and so on. The accuracy varies insignificantly, between 99.12% and 99.86%, which shows that the choice primarily depends on the type of task.
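The two basic losses mentioned above can be sketched in a few lines of pure Python; the toy embedding vectors and margin values here are invented for illustration:

```python
import math

def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, same, margin=1.0):
    """Pulls matching pairs together and pushes mismatched pairs
    at least `margin` apart."""
    d = dist(u, v)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Requires the anchor to sit closer to the positive than to the
    negative by at least `margin`; zero loss once that holds."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)

a, p, n = [0.0, 0.0], [0.1, 0.0], [1.0, 1.0]
print(contrastive_loss(a, p, same=True))  # small: the pair is close
print(triplet_loss(a, p, n))              # zero: already well separated
```

Training minimizes these values over many pairs or triplets, which gradually reshapes the embedding space.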
Face recognition system FaceNet was developed by Google in 2015, and it has demonstrated state-of-the-art results since then. It encompasses a set of DL-based techniques: triplet loss, semi-hard negative mining, offline sampling, and so on.
The network receives a triplet of input images: Anchor, Positive, and Negative. Here, Anchor and Positive are photos of the same object or face, while Negative is a photo of a different one. The mismatching photo is not chosen at random; it is selected using a dedicated mining technique.
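A minimal sketch of that selection step, in the spirit of FaceNet’s semi-hard negative mining (the 2-D toy embeddings and margin are invented for illustration), might look like:

```python
import math

def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pick_semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Semi-hard mining: pick a negative that is farther from the anchor
    than the positive, but still within the margin, so the triplet yields
    a useful, non-zero training signal without destabilizing learning."""
    d_ap = dist(anchor, positive)
    semi_hard = [n for n in candidates
                 if d_ap < dist(anchor, n) < d_ap + margin]
    # Among semi-hard candidates, take the hardest (closest) one.
    return min(semi_hard, key=lambda n: dist(anchor, n)) if semi_hard else None

anchor, positive = [0.0, 0.0], [0.3, 0.0]
negatives = [[0.1, 0.0],   # too hard: closer to the anchor than the positive
             [0.4, 0.0],   # semi-hard: beyond the positive, within the margin
             [2.0, 2.0]]   # too easy: already far away, loss would be zero
print(pick_semi_hard_negative(anchor, positive, negatives))  # [0.4, 0.0]
```

Negatives that are too easy contribute nothing to the loss, and negatives that are too hard can collapse training, which is why this middle band is targeted.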
Whether you use it for business or personal purposes, visual search is a compelling technology.
Articles compiled by hughesagency.ca