Camera traps are widely used in wildlife research. To avoid recording when nothing happens, the camera is usually triggered only when movement is detected. Nonetheless, wind can move the plants and trigger a recording, so a large share of the snapshots end up blank (no animal recorded). During research, these images are excluded manually, which is a mundane and time-consuming job.
This encouraged me to test how hard it is to use transfer learning to train a model that recognizes blank images. Although I planned to train only a proof-of-concept model, I ended up doing deeper data exploration than I expected. In the next paragraphs, you’ll find a description of the data exploration process along with the final results on the test set. The model was trained using transfer learning on top of ResNet34.
About the Dataset
I used data from the Serengeti Dataset to train, validate, and test the model. The data consists of camera trap images from Serengeti National Park in Tanzania, with camera traps placed in multiple locations. About 75% of the images in the dataset are blank. Although there are >4TB of images labeled by numerous volunteers from Zooniverse, only ~3200 images were used to train the model. The small dataset was chosen on purpose to represent the real-life data of a research team limited by money or time (one that has little time to label more images or little money to buy camera traps). The final model was tested on a test set of ~40 000 images. The data annotations were downloaded from here.
The images were originally labeled with species names, which I merged into a single non-blank class. The blank class (no animals present) was kept as is. I selected the data so that the dataset is balanced.
The model was created after 3 attempts; in each attempt the dataset was modified to tackle new obstacles.
I evaluated the model using precision and recall, both calculated for the non-blank class. Ideally, all non-blank images would be predicted as non-blank so that no information about animals living in the area is lost. Blank images labeled as non-blank are acceptable as long as the model labels most of the blank images correctly. So, for the non-blank class, recall should be close to 1, with precision as high as possible.
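As a reminder of what these two metrics measure, here is a minimal sketch that computes precision and recall for the non-blank class from lists of true and predicted labels. The function name and the toy labels are made up for illustration; they are not taken from the project.

```python
def precision_recall(y_true, y_pred, positive="non-blank"):
    """Return (precision, recall) computed for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical): one animal is missed, one blank is flagged.
y_true = ["non-blank", "non-blank", "non-blank", "blank", "blank", "blank"]
y_pred = ["non-blank", "non-blank", "blank", "non-blank", "blank", "blank"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A missed animal lowers recall, while a blank image flagged as non-blank lowers precision, which is why the second kind of error is the more tolerable one here.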
I chose ~1000 images, of which ~250 were in the validation set. The images used in the model were resized to 255px. I used data from 4 locations: 3 of them in the train set and one in the validation set (to check whether the model can generalize to unknown locations). I chose rolls (a roll is the collection of images from one location captured over one battery lifetime) with a balanced blank/non-blank ratio and more than 3 species in the roll. Thanks to that, I didn’t need over- or undersampling to balance the dataset. When the camera was triggered by movement, it took 1 to 3 shots in a row. For simplicity, I used only the second image from each set.
- some images labeled as blank contain visible fragments of an animal’s body. In the next attempts, I decided to label such images as non-blank (unlike the volunteers, experts have a good chance of recognizing the species from body fragments, so we prefer the model to label them as non-blank).
- some non-blank images contain no animal. My investigation showed that a label was given to the whole set of images if at least one image in the set had a visible animal. As I used only the second image from the set, there was a chance that the animal appeared in the other images of the set and the image was indeed blank. Moreover, I found in the documentation that night images have only one shot in the set, so they were misrepresented. Thus, in the next attempts, the first image in the set is used.
- some animals were in the background. After resizing the image from ~2000px to 255px they were no longer detectable.
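The two selection rules mentioned above (one frame per capture set, and a whole location held out for validation) could be sketched roughly as follows. The record fields `capture_id`, `shot`, and `location` and the sample rows are hypothetical; the real annotation files may use a different layout.

```python
from collections import defaultdict


def pick_frame(records, frame_index=0):
    """Keep one image per capture set: the one at `frame_index` in shot order.

    Sets with too few shots are skipped, which is an assumption of this sketch.
    """
    by_capture = defaultdict(list)
    for rec in records:
        by_capture[rec["capture_id"]].append(rec)
    picked = []
    for shots in by_capture.values():
        shots.sort(key=lambda r: r["shot"])
        if frame_index < len(shots):
            picked.append(shots[frame_index])
    return picked


def split_by_location(records, validation_locations):
    """Put every image from the held-out locations into the validation set."""
    train = [r for r in records if r["location"] not in validation_locations]
    valid = [r for r in records if r["location"] in validation_locations]
    return train, valid


# Hypothetical records: two daytime sets of 3 shots and one single-shot night set.
records = [
    {"capture_id": "c1", "shot": 1, "location": "A"},
    {"capture_id": "c1", "shot": 2, "location": "A"},
    {"capture_id": "c1", "shot": 3, "location": "A"},
    {"capture_id": "c2", "shot": 1, "location": "B"},
    {"capture_id": "c3", "shot": 1, "location": "C"},
    {"capture_id": "c3", "shot": 2, "location": "C"},
    {"capture_id": "c3", "shot": 3, "location": "C"},
]

first_frames = pick_frame(records, frame_index=0)
second_frames = pick_frame(records, frame_index=1)
train, valid = split_by_location(first_frames, validation_locations={"C"})
print(len(first_frames), len(second_frames), len(train), len(valid))
```

In this toy data, asking for the second frame leaves the single-shot set with nothing to contribute, while asking for the first frame keeps every set.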
All the images were manually labeled to avoid the wrong labels seen in the first attempt. Mislabeled images were omitted. I extended the train set to ~2800 images and the test set to ~600 images. The original images were compressed to 1000px to save space and resized to 500px when loaded into the notebook. I used the same locations as previously in the train and validation sets. In this and the next attempt, only the first image from each set was used. I created a heatmap to verify whether the model focuses on the right target. A little data augmentation was added (brightness, contrast, saturation).
RESULTS: Both precision and recall for the non-blank label in the validation set were around 95%. The misclassified images suffer from the following problems:
- sunlight covers most of the picture
- the fallen trunk to the right in the background is recognized as an animal
- over/underexposed rabbits are poorly detected
- night images seem to be problematic
On the other hand, most of the images were well predicted:
A detailed analysis of this step can be found in the Kaggle notebook.
The train and validation sets have the same properties as in the second attempt, except for some of the images. I used images from 7 locations, each of which has a subset in both the train and the validation set (to test whether the model copes better with inanimate objects that look like an animal, and to expose the model to more animal species).
RESULTS: the validation precision and recall were 96.5% and 90.1%, respectively, for the non-blank class. Although its recall is lower than in the second attempt, I used this model because it was trained on more diverse images and was expected to generalize better. The model could be improved by:
- shifting the probability cutoff for non-blank images
- using more data augmentation
- adding more images to the train set
Precision & recall
The model was deployed on AWS EC2, with the ~40 000 test images stored on AWS S3. The original dataset has a majority of blank images. In the validation set, however, the model coped better with blank images, so the test set is imbalanced, with a majority of non-blank images. Prediction on a p2.xlarge instance with GPU support took 2.5 hours. The predicted classes were saved along with the probabilities for each class. The confusion matrix below shows the predicted results for a probability cutoff of 50%:
The recall for non-blank images equals 85.7%, which is lower than the respective recall in the third attempt. The model copes better with blank images (recall 93.2%). To increase the non-blank recall, we can choose a lower cutoff probability for non-blank images. Currently, an image is classified as non-blank if the probability of non-blank is >= 50%. The chart below shows the metrics for non-blank images and the number of images with a given probability of being non-blank.
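To illustrate how the cutoff trades precision for recall, here is a small sketch that recomputes both metrics at two cutoffs from saved non-blank probabilities. The probabilities and labels are toy values invented for the example, not the model's actual predictions.

```python
def metrics_at_cutoff(probs, labels, cutoff):
    """Precision and recall for non-blank at a given probability cutoff.

    probs  -- predicted probability of non-blank for each image
    labels -- True if the image really is non-blank
    """
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p >= cutoff and y)
    fp = sum(1 for p, y in pairs if p >= cutoff and not y)
    fn = sum(1 for p, y in pairs if p < cutoff and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy predictions (hypothetical values).
probs = [0.95, 0.80, 0.40, 0.30, 0.60, 0.20, 0.10, 0.05]
labels = [True, True, True, False, False, True, False, False]

results = {c: metrics_at_cutoff(probs, labels, c) for c in (0.50, 0.25)}
for cutoff, (precision, recall) in results.items():
    print(f"cutoff={cutoff:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the cutoff moves borderline images into the non-blank class, so in this toy data recall goes up while precision goes down.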
Based on the chart, I (arbitrarily) chose a cutoff value of 25%. It would increase the non-blank recall to 91% with a precision of 94%. Keeping in mind that the real dataset has 75% blank images, we’ll estimate the real dataset by duplicating the blank occurrences. The charts below show the metrics for the estimated dataset:
On the one hand, non-blank precision would decrease to 57%. On the other hand, the model managed to detect 101892 blank images, which is 79.6% of all blanks. Assuming that I am able to manually label ~500 images/hour, the model would save about 203 man-hours (5 weeks, assuming a 40 man-hour work week).
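The re-weighting idea can be sketched as follows: scale the blank counts of a confusion matrix until blanks make up the target share, then recompute precision. Scaling by a factor is used here as a stand-in for duplicating rows, and the confusion-matrix counts are hypothetical, not the ones from the test set.

```python
def reweight_blanks(tp, fn, fp, tn, blank_share):
    """Scale the blank counts so that blanks make up `blank_share` of the data.

    tp, fn -- non-blank images predicted non-blank / blank
    fp, tn -- blank images predicted non-blank / blank
    Returns (weight, precision_after) for the non-blank class.
    """
    non_blank = tp + fn
    blank = fp + tn
    weight = blank_share * non_blank / ((1 - blank_share) * blank)
    precision_after = tp / (tp + weight * fp)
    return weight, precision_after


# Hypothetical counts: 1000 non-blank and 250 blank images.
tp, fn, fp, tn = 910, 90, 50, 200

precision_before = tp / (tp + fp)
weight, precision_after = reweight_blanks(tp, fn, fp, tn, blank_share=0.75)
print(f"weight={weight:.1f} before={precision_before:.3f} after={precision_after:.3f}")
```

Non-blank recall does not change under this re-weighting, because it depends only on the non-blank images. Precision drops, because every blank image mistaken for an animal is now counted several times.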
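The time estimate is plain arithmetic on the figures quoted above. The variable names are only illustrative.

```python
blank_images_detected = 101_892  # blank images the model filtered out
labeling_rate_per_hour = 500     # assumed manual labeling speed
hours_per_work_week = 40

hours_saved = blank_images_detected / labeling_rate_per_hour
weeks_saved = hours_saved / hours_per_work_week
print(f"{hours_saved:.1f} hours, {weeks_saved:.1f} weeks")
```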
Why can’t you recognize it?
The last question is: why wasn’t the model able to correctly recognize 2817 non-blank images? To verify that, I took a sample of the non-blank images: 1144 misrecognized and 600 well recognized. None of the misrecognized non-blank images turned out to be blank. I labeled the images based on the animal’s position in the image:
- close to the camera
- in the background
- on the border of an image
- animal covers camera (only the fur is visible, sometimes blurred)
Among the labels above, I aimed to use only one, the easiest to recognize (if an image has one animal close to the camera and one in the background, I keep only the close-to-the-camera label). However, if I had any doubts, I used all suitable labels describing the animal position. Besides that, I labeled other circumstances occurring in the images:
- high grass/plants cover the image
- weather: rain/fog/waterdrops on the camera/sunlight
- broken camera (unrealistic colors, blurred set of images)
- animal blurred due to its movement
I used multiple circumstance labels when suitable. The results, in percentages, were as follows:
Mispredicted non-blank images had a higher fraction of animals in the background and on the border of the image. Besides that, mispredicted images were taken in harder circumstances than well-predicted images (high grass, weather conditions, night, animals moving very fast). Let’s also check whether some species were problematic for the model. The chart below shows the number of misrecognized non-blank images along with the fraction of misrecognized images per species:
Among well-populated labels, the model has difficulties with the human and otherBird labels. Both stand for diverse objects: human includes balloons, cars, people, and any other artificial objects, while otherBird stands for multiple bird species, some of them very small and hardly visible. The other problematic species (fraction of misrecognized >15%: porcupine, aardvark, serval, rodents, hyenaStripped, honeyBadger) are poorly represented in the dataset, so the model had few opportunities to learn them. Finally, let’s display sample misrecognized non-blank images:
I also had an idea to detect under/overexposed animals and correct the gamma parameter to extract the animals from the sunlight or shadow. The idea hasn’t been tested yet, as I haven’t found a clear way to detect under/overexposed animals.
The model is able to detect ~91% of the images with animals or humans. However, if the data is to be used in research, the recall should be higher (I guess 95-97% is sufficient, as 5% of the human-labeled images in the first attempt were mislabeled).
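The numbers behind such a per-species chart can be computed along these lines. This is a sketch under the assumption that each non-blank image is available as a pair of species label and model decision; the sample counts are invented.

```python
from collections import Counter


def misrecognized_by_species(records):
    """records: (species, predicted_non_blank) pairs for non-blank images.

    Returns {species: (misrecognized_count, misrecognized_fraction)}.
    """
    total = Counter()
    missed = Counter()
    for species, predicted_non_blank in records:
        total[species] += 1
        if not predicted_non_blank:
            missed[species] += 1
    return {s: (missed[s], missed[s] / total[s]) for s in total}


# Invented sample: 10 zebra images (1 missed) and 4 otherBird images (1 missed).
records = (
    [("zebra", True)] * 9 + [("zebra", False)]
    + [("otherBird", True)] * 3 + [("otherBird", False)]
)

stats = misrecognized_by_species(records)
problematic = sorted(s for s, (_, frac) in stats.items() if frac > 0.15)
print(stats, problematic)
```

Reporting the count next to the fraction matters, because a high fraction for a species with only a handful of images is much weaker evidence than the same fraction for a common one.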
Further steps could be:
- a bigger train/validation set, extended by images with:
  - plants covering the image
  - poor weather conditions (rain, fog, waterdrops on the camera)
  - daylight covering the image
  - blurred animals in motion
  - rare, mislabeled species
- training on the original image size. Although computationally expensive, it could help to detect the animals in the background
- using more data augmentation
However, if the results of the model are satisfactory, it can save about 5 weeks of manually excluding blank images.