Buy Me That Look

Satishkumar Moparthi
11 min read · Apr 29, 2021

Fashion Recommendation System, Computer Vision.

This blog is all about a fashion recommendation system. It is unusual compared with other recommendation systems: given a photo, the system recommends clothes or articles similar to the ones worn by the model in the picture. The architecture and design components are inspired by the paper "Buy Me That Look: An Approach for Recommending Similar Fashion Products" by Myntra.

Agenda

  1. Business problem
  2. ML/DL formulation
  3. Business Constraints
  4. Data Acquisition and Analysis
  5. Research section
  6. My Approach
  7. End Results
  8. Future Work
  9. Link to my profile
  10. References

1. Business Problem:

In short, the research paper describes a system that detects all products in an image and retrieves similar fashion items from a database, along with a buy link for each product.

Online business has become important in day-to-day life for everyone. Virtual stores allow people to shop from the comfort of their homes without the pressure of a salesperson. In this paper, the authors focus on the retrieval of multiple fashion items at once and propose an architecture that detects all products from an image and recommends similar kinds of products.

In a physical store, we can carry a piece of clothing and ask the salesperson to show us similar products matching its color, design, thickness, and so on. Online this is not possible, and manually searching for similar products is time-consuming. So instead, we can upload an image and search for similar items using computer vision.

2. ML/DL formulation

Let’s discuss the architecture of the model in this section. We will divide this problem into the following stages:

Stage 1: (Pose Estimation)

In this stage, we will detect whether the image is a full-front-pose image or not, so this will be a binary classifier (Yes/No).

Stage 2: (Localization and Article Detection)

In this stage, we detect all the articles (clothes) and the particular places where each article is located. This is both a classification and a regression problem: classification for article detection, and regression for localization (bounding-box coordinates).

Stage 3: (Image_embeddings)

In this stage, we will generate embeddings (dense vectors) for the images, as discussed below.

Stage 4: (Getting similar Images)

In this stage, we will use the FAISS library to fetch similar clothes for a given search query.
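To make the four stages concrete, here is a minimal, hypothetical sketch of the end-to-end flow in Python. Every helper name used here (is_full_front_pose, detect_articles, embed, search_index) is a placeholder for the corresponding module built later in this blog, not a real library call.

# Hypothetical end-to-end flow; every helper below is a placeholder
# for the corresponding module, not a real library call.
def recommend_similar_products(image, index, catalog, top_k=5):
    # Stage 1: keep only full-front-pose images (binary decision).
    if not is_full_front_pose(image):
        return []
    recommendations = []
    # Stage 2: detect each article and its bounding box.
    for article_class, bbox in detect_articles(image):
        crop = image.crop(bbox)               # cut out the detected article
        query_vector = embed(crop)            # Stage 3: dense embedding
        neighbour_ids = search_index(index, query_vector, top_k)  # Stage 4
        recommendations.append((article_class, [catalog[i] for i in neighbour_ids]))
    return recommendations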

3. Business Constraints

  • Scalable: Our system architecture should be scalable because every day thousands and thousands of new images are going to be added to the site.
  • Low Latency: A customer is not going to wait minutes, or even more than 5–10 seconds, for a recommendation. So our architecture should be able to retrieve recommendations within a tight time frame.

4. Data Acquisition and Analysis

I have used data available on Kaggle that was scraped by Shreyas. I tried scraping myself, but it is time-consuming. I will provide code snippets on GitHub to scrape the data from Myntra.

https://www.kaggle.com/shreyas90999/mycasestudy02ee

The data has different types of clothes (upper wear, lower wear, and footwear) for women. Since this data was scraped from Myntra, it does not come with masks or bounding boxes for article localization/detection. So, for the article detection and localization part, I have taken the data from the Kaggle competition "iMaterialist (Fashion) 2019 at FGVC6". This fashion dataset has approximately 45.2k files, with the labels given in encoded-pixel format along with class labels.
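Since the masks come as encoded pixels, here is a small decoding sketch. It assumes the standard Kaggle-style run-length encoding (1-based start/run-length pairs in column-major order); the exact format should be checked against the competition's data description.

import numpy as np

def rle_to_mask(encoded_pixels, height, width):
    # Assumes 1-based (start, run-length) pairs in column-major order,
    # as used in most Kaggle segmentation competitions.
    mask = np.zeros(height * width, dtype=np.uint8)
    rle = list(map(int, encoded_pixels.split()))
    for start, length in zip(rle[0::2], rle[1::2]):
        mask[start - 1:start - 1 + length] = 1
    return mask.reshape((height, width), order='F')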

5. Research section

  1. Main Paper: https://arxiv.org/pdf/2008.11638.pdf

I took the picture below from the research paper.

Buy me that look, Fashion recommendation system blueprint.

From the paper, the architecture is as explained below.

  1. Using a pose detection classifier, we first detect full-shot images, and from these full-shot images we find the front-facing ones.
  2. Front-facing images are passed to a CNN trained with active learning, which detects the fashion objects in the image and localizes them.
  3. Image embeddings are created for all images available in the catalog and stored in the database. Triplet-net-based embedding learning is used to generate these embeddings; a simple CNN-based autoencoder can also be used.
  4. A query image is passed in, and similar images are retrieved from the database. Here the authors use cosine similarity to fetch similar items from the database.

2. Pose detection: https://nanonets.com/blog/human-pose-estimation-2d-guide/

This blog covers various approaches to pose detection problems. By using these pre-trained models, we can save a lot of time: choose the architecture that works best on the dataset, then fine-tune or modify it to get the best results.

3. Localization / article detection: https://valohai.com/blog/clothes-detection-for-fashion-recommendation/

In this blog, the author explains different labeled datasets for fashion object detection. These pre-trained models can be applied on top of our data to increase the accuracy of the model. The blog also has a detailed explanation of how to use the TensorFlow Object Detection API, with good code snippets that help us try the models below first as a black box and then choose the architecture that gives the best results on our dataset.

The TensorFlow Object Detection API comes with several pre-implemented architectures with weights pre-trained on the COCO (Common Objects in Context) dataset, such as:

  • SSD (Single-Shot Multi-box Detector) with MobileNets
  • SSD with Inception V2
  • R-FCN (Region-based Fully Convolutional Networks) with ResNet 101
  • Faster R-CNN (Region-based Convolutional Neural Networks) with ResNet 101
  • Faster R-CNN with Inception ResNet V2
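As a quick illustration of using one of these models as a black box, the sketch below loads a frozen inference graph exported by the TF1 Object Detection API and runs it on a single image. The file paths are placeholders, and the tensor names are the ones the API uses by convention.

import numpy as np
import tensorflow as tf  # TensorFlow 1.x
from PIL import Image

# Load a frozen inference graph exported by the Object Detection API.
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:  # placeholder path
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

image_np = np.array(Image.open('test_image.jpg'))  # placeholder image

with tf.Session(graph=detection_graph) as sess:
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    scores = detection_graph.get_tensor_by_name('detection_scores:0')
    classes = detection_graph.get_tensor_by_name('detection_classes:0')
    out_boxes, out_scores, out_classes = sess.run(
        [boxes, scores, classes],
        feed_dict={image_tensor: np.expand_dims(image_np, 0)})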

4. Triplet Loss: https://towardsdatascience.com/image-similarity-using-triplet-loss-3744c0f67973

This blog has a good explanation of how to use triplet loss for image similarity problems. My understanding is that the triplet loss architecture helps us learn distributed embeddings based on the notion of similarity and dissimilarity. It is a kind of neural network architecture where multiple parallel networks that share weights are trained together. At prediction time, input data is passed through one of these networks to compute its distributed embedding representation.

Loss function: The cost function for Triplet Loss is as follows:

L(a, p, n) = max(0, D(a, p) - D(a, n) + margin)

where D(x, y) is the distance between the learned vector representations of x and y. As a distance metric, L2 distance or (1 - cosine similarity) can be used. The objective of this function is to keep the distance between the anchor and the positive smaller than the distance between the anchor and the negative.
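For reference, here is a minimal Keras-style implementation of this loss, assuming the anchor, positive, and negative embeddings are already computed and squared L2 distance is used as D:

import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    # L(a, p, n) = max(0, D(a, p) - D(a, n) + margin), with squared L2 as D.
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))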

6. My Approach

Here I will be explaining my implementation of the business problem.

Module_1:

In Module 1, for pose detection, I tried both HRNet and a TensorFlow Lite model. Both models produce very similar outputs, so I picked TensorFlow Lite. From the comparison below, it is clear that both models give similar results.

So, here I have used the TensorFlow Lite "PoseNet" pre-trained model from my research section to find all full-pose, front-facing images in my corpus.

How does PoseNet work?

Pose estimation is the task of using an ML model to estimate the pose of a person from an image or a video by estimating the spatial locations of key body joints (keypoints).

Pose estimation refers to computer vision techniques that detect human figures in images and videos, so that one could determine, for example, where someone’s elbow shows up in an image. It is important to be aware of the fact that pose estimation merely estimates where key body joints are and does not recognize who is in an image or video.

The PoseNet model takes a processed camera image as the input and outputs information about key points. The key points detected are indexed by a part ID, with a confidence score between 0.0 and 1.0. The confidence score indicates the probability that a key point exists in that position.
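Below is a minimal sketch of running a PoseNet TFLite model and applying a simple rule for the full-front-pose check. The heatmap decoding is deliberately simplified (it only extracts one confidence score per keypoint), the model path is a placeholder, a float (non-quantized) model is assumed, and the visibility threshold and set of required key points are my own choices rather than anything from the paper.

import numpy as np
import tensorflow as tf
from PIL import Image

KEYPOINTS = ['nose', 'leftEye', 'rightEye', 'leftEar', 'rightEar',
             'leftShoulder', 'rightShoulder', 'leftElbow', 'rightElbow',
             'leftWrist', 'rightWrist', 'leftHip', 'rightHip',
             'leftKnee', 'rightKnee', 'leftAnkle', 'rightAnkle']

interpreter = tf.lite.Interpreter(model_path='posenet_mobilenet.tflite')  # placeholder path
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()

def keypoint_scores(img):
    # Resize, scale to [-1, 1] (float PoseNet model assumed) and run inference.
    h, w = int(inp['shape'][1]), int(inp['shape'][2])
    x = np.asarray(img.resize((w, h)), dtype=np.float32)
    x = (x - 127.5) / 127.5
    interpreter.set_tensor(inp['index'], x[np.newaxis, ...])
    interpreter.invoke()
    heatmaps = interpreter.get_tensor(out[0]['index'])[0]  # assumed: output 0 = heatmaps
    probs = 1.0 / (1.0 + np.exp(-heatmaps))                # sigmoid -> confidences
    return dict(zip(KEYPOINTS, probs.max(axis=(0, 1))))

def is_full_front_pose(img, threshold=0.5):
    # Assumed rule: eyes, shoulders, hips and ankles must all be visible.
    scores = keypoint_scores(img)
    required = ['leftEye', 'rightEye', 'leftShoulder', 'rightShoulder',
                'leftHip', 'rightHip', 'leftAnkle', 'rightAnkle']
    return all(scores[k] > threshold for k in required)

# Example: is_full_front_pose(Image.open('model_photo.jpg'))  # placeholder usage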

PoseNet key points and their identification.

Results:

Results of the Module_1

Module_2:

In Module 2, we have to detect all the articles and localize them. For that, I have used the Mask R-CNN model, with data from the Kaggle competition "iMaterialist (Fashion) 2019 at FGVC6". After localization, we crop the detected articles and pass them to Module 3 for generating embeddings.

How does Mask R-CNN work?

Mask R-CNN (regional convolutional neural network) is a two-stage framework: the first stage scans the image and generates proposals (areas likely to contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. The Mask R-CNN paper is an extension of its predecessor, Faster R-CNN, by the same authors. Faster R-CNN is a popular framework for object detection, and Mask R-CNN extends it with instance segmentation, among other things.

This tutorial requires TensorFlow version 1.15.3 and Keras 2.2.4. It does not work with TensorFlow 2.0+ or Keras 2.2.5+ because a third-party library has not been updated at the time of writing.

!pip install --no-deps tensorflow==1.15.3

!pip install --no-deps keras==2.2.4

Mask R-CNN is basically an extension of Faster R-CNN. Faster R-CNN is widely used for object detection tasks. The Mask R-CNN framework is built on top of Faster R-CNN. So, for a given image, Mask R-CNN, in addition to the class label and bounding box coordinates for each object, will also return the object mask.

  1. Faster R-CNN first uses a ConvNet to extract feature maps from the images
  2. These feature maps are then passed through a Region Proposal Network (RPN) which returns the candidate bounding boxes
  3. We then apply an RoI ( Region of Interest ) pooling layer on these candidate bounding boxes to bring all the candidates to the same size
  4. And finally, the proposals are passed to a fully connected layer to classify and output the bounding boxes for objects

Similar to the ConvNet that we use in Faster R-CNN to extract feature maps from the image, we use the ResNet 101 architecture to extract features from the images in Mask R-CNN. So, the first step is to take an image and extract features using the ResNet 101 architecture. These features act as an input for the next layer.

Now, we take the feature maps obtained in the previous step and apply a region proposal network (RPN). This basically predicts whether an object is present in that region (or not). In this step, we get those regions or feature maps that the model predicts contain some object.

The regions obtained from the RPN might be of different shapes, right? Hence, we apply a pooling layer and convert all the regions to the same shape. Next, these regions are passed through a fully connected network so that the class label and bounding boxes are predicted.

Till this point, the steps are almost like how Faster R-CNN works. Now comes the difference between the two frameworks. In addition to this, Mask R-CNN also generates the segmentation mask.

For that, we first compute the regions of interest so that the computation time can be reduced. For all the predicted regions, we compute the Intersection over Union (IoU) with the ground-truth boxes. We can compute IoU like this:

IoU = Area of the intersection / Area of the union

Now, only if the IoU is greater than or equal to 0.7 do we consider it a region of interest; otherwise, we neglect that region. We do this for all the regions and then keep only the set of regions whose IoU is at least 0.7.
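A small helper for this computation, with boxes given as (x1, y1, x2, y2):

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection area / union area.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0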

Building a Model for our Data

First, let’s build a model with COCO weights by loading the Mask R-CNN pre-trained model, then modify the config Python file as per our requirements (please refer to the GitHub link below).

Create our own dataset class by inheriting from utils.Dataset, and then split the data into train and test sets.
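A condensed sketch of these steps using the Matterport Mask_RCNN library (linked in the references) is shown below. The class count and training schedule are illustrative, and dataset_train / dataset_val stand for instances of the custom utils.Dataset subclass described above.

from mrcnn.config import Config
from mrcnn import model as modellib

class FashionConfig(Config):
    NAME = "fashion"
    NUM_CLASSES = 1 + 8          # background + article categories (illustrative)
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    STEPS_PER_EPOCH = 1000

config = FashionConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# Start from COCO weights, skipping the heads that depend on the class count.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val are custom utils.Dataset subclasses,
# already loaded and prepared with dataset.prepare().
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=5, layers="heads")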

After a few epochs, I stopped training at a point where the model neither overfits nor underfits.

Example Epoch data

After localization and article detection, the detected articles are cropped and sent to Module 3 for generating embeddings.

Results:

Localization / article detection

Module_3:

In Module 3, I tried DenseNet121, ResNet50, ResNet101, MobileNet, and InceptionV3. Out of all of these, DenseNet121 gave the best results.

DenseNet121 embeddings have lower sparsity compared with the others, so I chose DenseNet121.

Average Sparsity of embedding generated by the Models

DenseNet121 generated 1024-dimensional embeddings with low sparsity.
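A minimal way to obtain these 1024-dimensional embeddings with tf.keras.applications is shown below (global average pooling over the last convolutional feature map; the image path is a placeholder):

import numpy as np
from tensorflow.keras.applications.densenet import DenseNet121, preprocess_input
from tensorflow.keras.preprocessing import image

# include_top=False with pooling='avg' yields a 1024-dimensional vector per image.
embedder = DenseNet121(weights='imagenet', include_top=False, pooling='avg')

def get_embedding(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return embedder.predict(x)[0]             # shape: (1024,)

vec = get_embedding('cropped_article.jpg')    # placeholder path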

As we saw, we have 8 categories of data; I have divided them into 3 super-categories for indexing, as shown below.

Upper_wear: women_shirts_tops_tess
Lower_wear: women jeans juggings, women skirts, women trousers
Foot_wear: women casual shoes, flats, heels
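As a small illustration, this grouping can be captured as a plain mapping from category label to super-category; the label strings here simply mirror the list above and may differ slightly in the actual scraped data.

# Map each scraped category to the super-category whose index it belongs to.
SUPER_CATEGORY = {
    'women_shirts_tops_tess': 'upper_wear',
    'women jeans juggings':   'lower_wear',
    'women skirts':           'lower_wear',
    'women trousers':         'lower_wear',
    'women casual shoes':     'foot_wear',
    'flats':                  'foot_wear',
    'heels':                  'foot_wear',
}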

Module_4:

In Module 4, I have used FAISS (Facebook AI Similarity Search) library to retrieve similar articles.

FAISS works only with float32 ndarrays, so first we have to convert our embeddings into float32 ndarrays.

Now let’s create three FAISS indexes: one each for upper wear, lower wear, and footwear.

From the Facebook GitHub page linked above, IndexFlatL2 is a brute-force index that uses Euclidean (L2) distance to find the nearest neighbours, so I used it. We can use cosine similarity as well, but then we have to normalize the vectors first; cosine distance is normally used for text similarity.

The index’s add method takes a single argument, the vector (or array of vectors) to index; if we pass multiple vectors, they must all have the same dimension.

FAISS also provides a search method on the index to retrieve similar articles based on the indexed vectors. The query vector passed to search must have the same shape as the indexed vectors.

Since the embedding generated in Module 3 comes back as a Python list of length 1024, which is neither the right shape (a row vector) nor the right type (float32 ndarray), we have to convert it before searching.
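Putting this together, a minimal sketch for one of the three indexes looks like the following. Here upper_wear_embeddings and query_embedding are assumed to be the DenseNet121 vectors produced in Module 3.

import faiss
import numpy as np

d = 1024                                               # DenseNet121 embedding size
upper_wear_vecs = np.array(upper_wear_embeddings, dtype=np.float32)  # shape (n, 1024)

index = faiss.IndexFlatL2(d)                           # brute-force L2 index
index.add(upper_wear_vecs)                             # add catalog embeddings

# A single query must be a float32 row vector of shape (1, d).
query = np.array(query_embedding, dtype=np.float32).reshape(1, d)
distances, ids = index.search(query, 5)                # 5 nearest catalog items
print(ids[0])                                          # positions in the catalog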

7. End Results

So we have our final solution. The model is able to detect and retrieve fashion objects from a given image. There are a few wrong detections and retrievals, but this is because the model was trained for only a few epochs. Some wrong retrievals occur because whole images were embedded rather than the cropped objects, and the database is also very small. Overall, we have a first-cut solution that can be further expanded and optimized. Please check it out via the GitHub link.

Recommendation
Lower wear

8. Future Work

  1. Reduce the latency of the end-to-end application.
  2. For embeddings, try different approaches, such as building your own embedding model that achieves a good score.
  3. Collect more data for object detection and use the latest segmentation approach other than MASK RCNN.
  4. Train the Triplet-net Based Embedding layer network for getting similar images as per the research paper.

9. Link to my profile

Please go through my GitHub profile to have a glance at the code.

Click here to connect with me on LinkedIn.

10. References

  1. https://arxiv.org/pdf/2008.11638.pdf
  2. https://www.tensorflow.org/lite/examples/pose_estimation/overview
  3. https://programmer.group/analysis-of-official-post-energy-model-of-tensorflow.html
  4. https://github.com/matterport/Mask_RCNN
  5. https://towardsdatascience.com/mask-rcnn-implementation-on-a-custom-dataset-fd9a878123d4
  6. https://www.tensorflow.org/api_docs/python/tf/keras/applications
  7. https://github.com/facebookresearch/faiss/wiki
  8. https://www.appliedaicourse.com/
