Thousands of animals live in shelters, and many more live on the streets. To reduce cruelty toward and euthanization of these animals, we need to increase adoption rates. Animals with cute photos are more likely to be adopted, so shelters need a way to estimate and improve the “cuteness” of their animals' photos to get them adopted faster. The goal of our project is to use machine learning to make accurate predictions of “cuteness” and thereby increase adoption rates from shelters.
For CS7641, our project aims at estimating the cuteness/popularity of images of shelter animals. This is an open Kaggle challenge. The dataset contains raw images of shelter animals along with metadata consisting of a set of binary features, such as the presence of eyes, a face, etc.
In this project, we use both supervised and unsupervised learning to estimate the popularity/cuteness of images. In particular, we use representation learning to learn features from the raw images, along with PCA to select prominent features from the metadata. Finally, we plan to demonstrate the effectiveness of our solution by plotting training and validation losses and performing an ablation study.
The dataset consists of:
For each image in the training set, we also have metadata available, consisting of the following twelve binary features: Focus, Eyes, Face, Near, Action, Accessory, Group, Collage, Human, Occlusion, Info, Blur.
We have visualized the distribution of pawpularity with respect to each of these features, using box plots, histograms, and the proportion of presence and absence of each feature at each pawpularity level. We used modifications of the method described in [7] for the visualizations. The distributions for each of the features are given below:
[Figures: per-feature box plots, histograms, and positive/negative proportion charts of pawpularity]
As we can see from the charts, the distribution of pawpularity scores does not vary much between the positive and negative states of each feature. The proportions of the positive and negative states of the features remain almost constant across all pawpularity values. This is consistent with our analysis in other stages, shared in [Fig 3], where we found almost no correlation between the metadata and pawpularity.
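A quick numerical check of this observation can be done with pandas; this is a minimal sketch, assuming the metadata is loaded from the training CSV and uses the feature/target column names listed above (the file path and exact column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the metadata CSV
features = ["Focus", "Eyes", "Face", "Near", "Action", "Accessory",
            "Group", "Collage", "Human", "Occlusion", "Info", "Blur"]
# Pearson correlation of each binary feature with the pawpularity target;
# values near zero indicate the feature carries little linear signal.
print(df[features].corrwith(df["Pawpularity"]).sort_values())
```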
The metadata was split 80-20 for training and validation. All results reported are on the validation set.
Without PCA: We first ran linear regression on the metadata. However, the \(R^2\) score of the regressor turned out to be very poor, at only \(0.003\). This meant that the variation in the input features did not explain the variation in the target. Additionally, the RMSE was \(20.4944\).
With PCA [Unsupervised Learning]: We next ran PCA on the metadata, retaining \(90\)% of the variance in the data, and then regressed the target on the transformed features. This reduced the \(R^2\) of the model to an even lower value of \(0.0001\).
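A minimal sketch of these two baselines with scikit-learn, assuming the same metadata DataFrame as above (file path and column names are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")  # hypothetical metadata CSV
features = ["Focus", "Eyes", "Face", "Near", "Action", "Accessory",
            "Group", "Collage", "Human", "Occlusion", "Info", "Blur"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["Pawpularity"], test_size=0.2, random_state=42)

# Baseline: linear regression directly on the raw metadata.
lr = LinearRegression().fit(X_train, y_train)
pred = lr.predict(X_val)
print("R2:", r2_score(y_val, pred),
      "RMSE:", np.sqrt(mean_squared_error(y_val, pred)))

# PCA variant: keep enough components to explain 90% of the variance.
pca = PCA(n_components=0.90).fit(X_train)
lr_pca = LinearRegression().fit(pca.transform(X_train), y_train)
pred_pca = lr_pca.predict(pca.transform(X_val))
print("R2 (PCA):", r2_score(y_val, pred_pca),
      "RMSE (PCA):", np.sqrt(mean_squared_error(y_val, pred_pca)))
```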
Our initial attempt was to build the regressor as a modification of the ResNet18 model. The preprocessing step consisted of:
Unsupervised method to remove duplicates:
While manually inspecting the image data, we also found that the dataset contains a lot of noise. Looking at individual photos, we noticed that the popularity score did not always tally with the cuteness/quality of the animal's photo. In addition, we noticed that there are several duplicate images with different popularity scores in the dataset.
Given this newfound knowledge, we tried multiple ways to mitigate the problem. The final solution was a small script that extracts duplicate images using the cosine similarity between pairs of images: we flatten each image into a vector and compute the similarity between two images \(a\) and \(b\) as \(\frac{a \cdot b}{\lVert a \rVert\,\lVert b \rVert}\).
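A minimal sketch of this duplicate detection, assuming the images are resized to a common shape before flattening (the folder name, resize resolution, and similarity threshold are illustrative assumptions):

```python
import numpy as np
from pathlib import Path
from PIL import Image

def load_flat(path, size=(64, 64)):
    """Resize an image to a common size and flatten it into a float vector."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32).ravel()

paths = sorted(Path("train_images").glob("*.jpg"))   # hypothetical image folder
vecs = np.stack([load_flat(p) for p in paths])
# L2-normalize so that a plain dot product equals cosine similarity a.b / |a||b|.
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

sim = vecs @ vecs.T                                   # pairwise cosine similarities
idx_i, idx_j = np.where(np.triu(sim, k=1) > 0.99)     # threshold chosen empirically
duplicates = [(paths[i].name, paths[j].name) for i, j in zip(idx_i, idx_j)]
print(f"Found {len(duplicates)} near-duplicate pairs")
```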
The images below show a sample of the duplicate images with contradictory pawpularity scores that we were able to find.
To help with training, we chose to exclude these images from our training and test set.
The final model architecture of the DNN to perform regression on the image data is as below:
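One plausible way to implement a modified ResNet18 regressor of this kind is sketched below, with the ImageNet-pretrained encoder frozen and a small fully connected head trained on top; the exact layer sizes here are illustrative assumptions, not the precise architecture we used:

```python
import torch
import torch.nn as nn
from torchvision import models

class PawpularityRegressor(nn.Module):
    """ResNet18 backbone with its 1000-class head replaced by a 1-output head."""
    def __init__(self, freeze_encoder=True):
        super().__init__()
        # Requires a recent torchvision for the weights= API.
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        if freeze_encoder:
            for p in self.backbone.parameters():
                p.requires_grad = False
        in_features = self.backbone.fc.in_features  # 512 for ResNet18
        # New regression head; created after freezing, so it remains trainable.
        self.backbone.fc = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.backbone(x).squeeze(1)  # (batch,) pawpularity predictions

model = PawpularityRegressor()
dummy = torch.randn(2, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([2])
```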
We ran the model against the target pawpularity score, which gave an RMSE of \(19.187\). Below is the plot of RMSE loss versus epoch. As we can see from the training loss plot, we are able to reduce the RMSE from \(40\) to \(20\) using the above model.
[Fig 7] Training loss change with epochs
Our next attempt was to take deep vision architectures pre-trained on the ImageNet dataset and fine-tune them on the Petfinder pawpularity dataset. We used PyTorch Image Models (timm) to obtain the pre-trained ImageNet models and used them as the backbone feature extractor during fine-tuning.
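A minimal sketch of this setup with timm is shown below, using one of the backbones from our experiments and the SGD/multi-step settings reported in the results table; the batch contents here are placeholders:

```python
import timm
import torch
import torch.nn as nn

# ImageNet-pretrained backbone with the classifier swapped for a single-output
# regression head (timm does this when num_classes=1).
model = timm.create_model("efficientnet_b0", pretrained=True, num_classes=1)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 8], gamma=0.1)

images = torch.randn(4, 3, 224, 224)   # placeholder batch of pet images
scores = torch.rand(4) * 100           # pawpularity targets in [0, 100]
loss = torch.sqrt(criterion(model(images).squeeze(1), scores))  # RMSE objective
loss.backward()
optimizer.step()
scheduler.step()                        # stepped once per epoch in practice
print(loss.item())
```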
We obtained the pet image dataset after duplicate removal and performed several transforms/augmentations to improve model robustness and effectively enlarge the dataset. Some examples of the different transforms we experimented with are given below:
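A sketch of an augmentation pipeline along these lines using torchvision is given below; the specific operations and parameters are illustrative assumptions rather than the exact set we tuned:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    # ImageNet statistics, matching the pre-trained backbones.
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```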
We experimented with the models listed below, replacing the final classification head with a fully connected regressor head that outputs the final pawpularity value.
Below are the major hyperparameters we experimented with and tuned to obtain the optimal architecture:
Best model configuration: We achieved the best validation RMSE of 18.67 using the architecture and hyperparameter settings below.
After the ImageNet results, we felt that the model needed to be pre-trained on a pet dataset so that it could learn heuristics relevant to assigning a pawpularity score on our target dataset. We used the Oxford-IIIT dataset [9] to pre-train an EfficientNet_b4 from scratch on a breed-detection task for both cats and dogs and then fine-tuned it on our pawpularity dataset. Our assumption here was that the breed of the pet plays a major role in determining the pawpularity score, and that while learning to classify breeds the pre-trained model would also learn to pick up fine-grained features of pets. Note that this assumption is quite different from our ImageNet approach, where lighting conditions and the background of the image are given a higher priority in assigning the score.
As before, we obtained the pet image dataset after duplicate removal and performed several transforms/augmentations to improve model robustness and enhance the dataset size. Some examples of the Oxford-IIIT dataset images are:
We used the EfficientNet_b4 model for the pre-training task by attaching a linear head for breed classification. After pre-training, we repeated our configuration from the ImageNet models: we removed the last layer and attached a fully connected regressor head that outputs the final pawpularity score.
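A minimal sketch of this two-stage setup with timm is shown below, assuming the standard 37-breed Oxford-IIIT Pet labels; the training loops themselves are omitted:

```python
import timm
import torch.nn as nn

# Stage 1: train an EfficientNet_b4 from scratch on Oxford-IIIT breed
# classification (the standard label set has 37 cat and dog breeds).
model = timm.create_model("efficientnet_b4", pretrained=False, num_classes=37)
criterion_cls = nn.CrossEntropyLoss()
# ... breed-classification training loop on the Oxford-IIIT dataset ...

# Stage 2: replace the breed head with a single-output regression head and
# fine-tune the whole network on the pawpularity dataset with an MSE/RMSE loss.
model.reset_classifier(num_classes=1)
criterion_reg = nn.MSELoss()
# ... pawpularity fine-tuning loop on the Petfinder dataset ...
```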
Below are the major hyperparameters we experimented with and tuned to obtain the optimal architecture:
Best model configuration: We achieved the best validation RMSE of 19.67 using the architecture and hyperparameter settings below.
| Setting | Value |
|---|---|
| Backbone architecture | EfficientNet_b4 |
| Training/validation data ratio | 4:1 |
| Epochs | 10 |
| Training/validation batch size | 64 |
| Learning rate | 1e-2 |
| Optimizer | Adam |
| LR scheduler | Exponential decay |
| Architecture | Weights | Duplicate images | Optimizer | Batch size | Train-test split | Epochs | Learning rate | LR scheduler | RMSE |
|---|---|---|---|---|---|---|---|---|---|
| EfficientNet-B0 | Encoder frozen (ImageNet) | Untouched | SGD | 24 | 90:10 | 10 | 1e-3 | Multi-step, epochs [5, 8], gamma 0.1 | 19.2 |
| EfficientNet-B0 | Fully trainable (ImageNet) | Untouched | SGD | 24 | 90:10 | 10 | 1e-3 | Multi-step, epochs [5, 8], gamma 0.1 | 18.7 |
| EfficientNet-B0 | Fully trainable (ImageNet) | Duplicates set to max score of 2 | SGD | 24 | 90:10 | 10 | 1e-3 | Multi-step, epochs [5, 8], gamma 0.1 | 18.72 |
| EfficientNet-B0 | Fully trainable (ImageNet) | Duplicates set to max score of 2 | SGD | 24 | 90:10 | 10 | 1e-3 | Multi-step, epochs [5, 8], gamma 0.1 | 18.68 |
| EfficientNet-B4 | Fully trainable (ImageNet) | Untouched | Adam | 64 | 4:1 | 10 | 1e-3 | Step, epoch 1, gamma 0.9 | 18.67 |
| EfficientNet-B4 | Fully trainable (pre-trained on Oxford-IIIT breed dataset) | Duplicates removed | Adam | 64 | 4:1 | 10 | 1e-2 | Exponential decay | 19.67 |
| ResNet18 | Encoder frozen (ImageNet) | Untouched | Adam | 64 | 4:1 | 10 | 1e-3 | Step, epoch 1, gamma 0.9 | 19.17 |
| Swin Transformer | Fully trainable (ImageNet) | Untouched | Adam | 64 | 4:1 | 10 | 1e-3 | Step, epoch 1, gamma 0.9 | 19.8 |
After trying different data modalities (metadata and images), architectures, pre-training datasets, and hyperparameters, we compile our results in the following table. Note that the lower the RMSE, the better the model is at predicting the pawpularity score.
| Method | RMSE |
|---|---|
| Linear Regression | 20.4944 |
| Linear Regression with PCA | 20.4777 |
| LightGBM | 20.4667 |
| ImageNet-pretrained CNN, frozen weights | 19.187 |
| ImageNet-pretrained CNN, fine-tuned weights | 18.67 |
| Oxford-IIIT-pretrained CNN, fine-tuned weights | 19.67 |
GradCAM++ Visualizations:
To further analyze the causes behind the performance we obtained, we visualized the gradients learned by the best-performing network to understand where it focuses when producing the popularity score. For this, we used GradCAM++ (Generalized Gradient-based Visual Explanations for Deep Convolutional Networks) [8], via a PyTorch implementation.
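One way to reproduce such visualizations is sketched below, assuming the third-party `pytorch-grad-cam` package and a timm EfficientNet-style regressor; the choice of target layer and the placeholder inputs are illustrative assumptions:

```python
import numpy as np
import timm
import torch
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# Stand-in for the trained pawpularity regressor (load real weights in practice).
model = timm.create_model("efficientnet_b4", pretrained=False, num_classes=1).eval()
target_layers = [model.conv_head]            # last conv block of a timm EfficientNet

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
input_tensor = torch.randn(1, 3, 224, 224)   # a preprocessed pet image tensor
# With a single regression output, output index 0 is the pawpularity score.
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(0)])[0]

rgb_img = np.random.rand(224, 224, 3).astype(np.float32)  # same image, scaled to [0, 1]
overlay = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
print(overlay.shape)
```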
The data exploration stage gave us a clear insight: the metadata does not discriminate between different pawpularity scores and is therefore very noisy. We shifted our focus to using the images directly for predicting the score and tested various deep learning methods with different initializations and hyperparameters. By comparing the performance obtained with the ImageNet initialization and the Oxford-IIIT initialization, we can infer that the pawpularity score depends heavily on extrinsic features such as the background of the image, the objects on the pet, and lighting conditions, rather than on the breed of the pet itself. Based on this result, we can conclude with reasonable certainty that the chances of a pet getting adopted increase if the photograph is taken in good lighting, perhaps with some pet-friendly objects, irrespective of the breed of the pet.
All team members contributed equally towards the completion of this project. We appreciate the help from the teaching staff throughout the course. Cheers!