Building a Custom Image Dataset for Deep Learning projects

I have worked predominantly on NLP for the last three months at work, and it has been a long time since I worked with image data. So I decided to build an image classifier model as a personal learning project.

One thing I really miss during the current pandemic is travelling. These days I watch a lot of travel vlogs and look at travel pictures on Instagram, wondering when we will get back to the normal world.

This gave me the idea of creating an image classifier model with five classes: Mountain, Beach, Desert, Lake, and Museum. However, I did not have an image dataset to build the model and could not find a suitable one through Google. One option is to scrape the images manually, but that takes time. I then came across google images download and bing image downloader and found that they make it very easy to build a custom image dataset.

I am planning to use transfer learning, so I need only a small number of images. Hence, I decided to collect 100 images per class using google images download. This post explains how to build a custom image dataset using google images download and bing image downloader.

For simplicity, I am going to build the dataset for only two classes: ‘Mountain’ and ‘Beach’.

1) Google Image download

Installation of Google Image download

Initially, I tried pip install google_image_download. However, it did not work. I referred to Stack Overflow and installed the library from Joeclinton1’s GitHub link.

You can check the official google images download documentation (see Reference 1).

The next step is to import google_images_download from the google_images_download package and instantiate its downloader class as an object called response.
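A minimal sketch of this step is shown here. It assumes the library was installed from the GitHub fork with something like pip install git+https://github.com/Joeclinton1/google-images-download.git (the exact command is an assumption). The try/except guard is not part of the library; it is only there so the snippet also loads on a machine where the package is missing.

```python
# Assumption: the package was installed from the Joeclinton1 GitHub fork.
try:
    from google_images_download import google_images_download
except ImportError:
    google_images_download = None  # library not installed on this machine

# Instantiate the downloader class; the object is named `response` as in the text.
response = (
    google_images_download.googleimagesdownload()
    if google_images_download is not None
    else None
)
```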

Now we need to pass our arguments. I need Mountain and Beach images, so I am passing ‘Mountain’ and ‘Beach’ as the keywords.

format: The file type of the images. I am looking for jpg files. According to the documentation, gif, png, bmp, svg, webp, ico, and raw are also supported.

limit: The number of images to download. The default is 100. If you want to download more than 100 images, you need to install Selenium along with the chromedriver extension. I have not tried this, as I need only 100 images.

Sometimes we get fewer than 100 images because of occasional errors while downloading.

print_urls: Prints the URLs of the images that are downloaded.

There are other arguments available, such as color and aspect ratio. Please check the documentation and give them a try.
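The arguments described above can be put together roughly as follows. This is a sketch, not the original notebook code: build_arguments and download_images are hypothetical helper names, and the values mirror the ones mentioned in the text (two keywords, jpg format, a limit of 100, URL printing switched on).

```python
def build_arguments(keywords, limit=100, image_format="jpg", print_urls=True):
    """Build the arguments dictionary passed to google_images_download."""
    return {
        "keywords": ",".join(keywords),  # the library expects comma-separated keywords
        "format": image_format,          # jpg, gif, png, bmp, svg, webp, ico or raw
        "limit": limit,                  # more than 100 requires Selenium + chromedriver
        "print_urls": print_urls,        # print each image URL
    }


def download_images(arguments):
    """Run the download. Needs the library installed and network access."""
    # Imported lazily so the helper above can be used without the library.
    from google_images_download import google_images_download

    response = google_images_download.googleimagesdownload()
    return response.download(arguments)


arguments = build_arguments(["Mountain", "Beach"])
# paths = download_images(arguments)  # uncomment to actually download the images
```

The download call is left commented out so the snippet can be read and run without hitting the network; uncomment it once the library is installed.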

The flow chart below explains the process. The library takes the query (arguments), runs the search, downloads the raw HTML of the results page, scrapes all the image links, and then downloads and saves the images.

Flow chart of the google images download process: https://github.com/Joeclinton1/google-images-download/blob/patch-1/images/flow-chart.png

All the images are downloaded and stored in a folder called “downloads”, with the subfolders “Mountain” and “Beach”. Mountain images are stored in the Mountain folder, and Beach images are stored in the Beach folder.

We can see that all the images have been stored in their respective folders. The images come in different sizes, so they need to be resized before being fed into the model. This library is very useful if you want to build a custom image dataset for an image classifier.
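The resizing step is not covered in this post, but a minimal sketch could look like the following. It assumes Pillow is installed, that a 224 x 224 target size suits the pretrained model (a common choice, but an assumption here), and that the resized copies go to a separate folder. The names find_images and resize_images are hypothetical helpers.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_images(root):
    """Return all image files below `root`, sorted for reproducibility."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def resize_images(root, out_root, size=(224, 224)):
    """Resize every image under `root` and save it under `out_root`,
    keeping the class subfolders (e.g. Mountain/, Beach/)."""
    from PIL import Image  # Pillow, imported lazily

    for src in find_images(root):
        dst = Path(out_root) / src.relative_to(root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(src) as img:
            # Convert to RGB so every saved file has three channels.
            img.convert("RGB").resize(size).save(dst)
```

For example, resize_images("downloads", "resized") would write the resized copies to resized/Mountain and resized/Beach, leaving the original downloads untouched.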


2) Bing Image downloader

Bing image downloader is a Python library used to download images in bulk from bing.com. Please check its PyPI page (see Reference 4) for more information.

Installation of Bing Image downloader

Here, I am going to download only Mountain images, so I am creating a local directory called ‘mountain’ to store them.
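A small sketch of the setup, assuming the package is installed from PyPI under the name shown in Reference 4 and that the directory is created relative to the current working directory:

```python
# Assumed install step, run once in a shell:
#   pip install bing-image-downloader
from pathlib import Path

# Create the local directory that will hold the downloaded images.
output_dir = Path("mountain")
output_dir.mkdir(exist_ok=True)  # no error if the directory already exists
```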

Next, I import bing image downloader and pass the arguments. We need mountain images, so I am passing ‘mountain’ as the string to be searched.

limit: The number of images to download. Bing search can download images in bulk. I am limiting it to 200 because of my RAM. You can try a higher number and check.

output_dir: The name of the output directory. It is optional. I created a directory called mountain and am storing all the images there. If you don’t specify a directory, the images are stored under your current working directory.

adult_filter_off: Disables the adult content filter. It is True by default.

force_replace: Deletes the folder if it is already present and starts a fresh download.
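Put together, the call might look like the sketch below. The helper names build_download_kwargs and download_from_bing are hypothetical, the limit and directory mirror the values in the text, and force_replace=False is an assumption since the post does not say which value was used.

```python
def build_download_kwargs(limit=200, output_dir="mountain"):
    """Collect the keyword arguments described above."""
    return {
        "limit": limit,            # number of images to download
        "output_dir": output_dir,  # where the images are stored
        "adult_filter_off": True,  # adult filter disabled (the library default)
        "force_replace": False,    # assumption: keep an existing folder
    }


def download_from_bing(query, **kwargs):
    """Run the download. Needs bing-image-downloader and network access."""
    # Imported lazily so the helper above can be used without the library.
    from bing_image_downloader import downloader

    downloader.download(query, **kwargs)


kwargs = build_download_kwargs()
# download_from_bing("mountain", **kwargs)  # uncomment to actually download
```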

Checking the image files in the mountain directory
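One way to do this check with the standard library is sketched below. count_images is a hypothetical helper, and it searches subfolders as well because the downloader may place the files in a subfolder named after the query.

```python
from pathlib import Path


def count_images(directory, extensions=(".jpg", ".jpeg", ".png")):
    """Count image files in `directory`, including its subfolders."""
    return sum(
        1 for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
```

Calling count_images("mountain") after the download should report how many image files were actually saved.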

200 image files are stored in the directory. Let’s display some of them using IPython.
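A sketch of how a couple of images could be displayed in a Jupyter notebook. It assumes IPython is available; first_images and show_images are hypothetical helpers, and which images the original post displayed is not known.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def first_images(directory, n=2):
    """Return the first `n` image files (sorted by path) under `directory`."""
    files = sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    return files[:n]


def show_images(paths):
    """Render each image inline; intended for use inside a notebook."""
    from IPython.display import Image, display  # imported lazily

    for path in paths:
        display(Image(filename=str(path)))
```

In a notebook cell, show_images(first_images("mountain")) would render the first two downloaded files.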

Another one

We can download images in bulk with bing image downloader. However, getting images that accurately match the query is sometimes challenging.


Note

Please check the copyright terms before using these images for any commercial purpose, as doing so may violate them. Neither the Google nor the Bing downloader owns the copyright of the images; it belongs to the original creators of the images.

Thanks for reading. Keep learning and stay tuned for more!

Thanks to Anirudh Koul

Reference

  1. https://google-images-download.readthedocs.io/en/latest/arguments.html
  2. https://github.com/Joeclinton1/google-images-download
  3. https://stackoverflow.com/questions/60370799/google-image-download-with-python-cannot-download-images
  4. https://pypi.org/project/bing-image-downloader/