Create Custom Dataset for YOLO by Google Image Crawling

Create Custom Dataset for YOLO by Google Image Crawling

2023, Nov 15    

To create your own dataset for YOLO (You Only Look Once) in Google Colab and save images through Google image crawling, you can use a combination of web scraping tools and Python code. Additionally, you can use VSCode for writing and executing your Python scripts. Here’s a step-by-step guide:

Step 1: Install necessary libraries

In your Colab notebook, install the required libraries for web scraping:

!pip install beautifulsoup4
!pip install requests

Step 2: Create a script for image crawling

Create a Python script (e.g., image_crawler.py) in VSCode and write a script using Beautiful Soup and Requests libraries for web scraping. The script might look like this:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def download_image(url, folder):
    try:
        response = requests.get(url, stream=True)
        file_extension = url.split('.')[-1]
        file_name = os.path.join(folder, f"{hash(url)}.png")

        with open(file_name, 'wb') as file:
            for chunk in response.iter_content(chunk_size=128):
                file.write(chunk)
        
        print(f"Downloaded: {url}")
        return file_name
    except Exception as e:
        print(f"Error downloading {url}: {e}")
        return None

def crawl_images(search_query, num_images, save_folder):
    search_url = f"https://www.google.com/search?q={search_query}&tbm=isch"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    image_urls = []
    for img in soup.find_all('img', class_='t0fcAb'):
        image_url = img.get('src')
        if image_url:
            image_urls.append(image_url)

    if not os.path.exists(save_folder):
        os.makedirs(save_folder)

    count = 0
    for url in image_urls[:num_images]:
        download_image(url, save_folder)
        count += 1

        if count == num_images:
            break

if __name__ == "__main__":
    search_query = "your_search_query"  # Replace with your desired search query
    num_images = 100  # Adjust the number of images you want
    save_folder = "images"  # Replace with your desired save folder

    crawl_images(search_query, num_images, save_folder)

Step 3: Run the script

In VSCode, open a terminal and run the script:

python image_crawler.py

This script will download images from Google Image search based on the specified query and save them in the specified folder.

Step 4: Annotate images

After downloading images, you need to annotate them for YOLO. You can use annotation tools like LabelImg for this purpose.

Step 5: Organize your dataset

Organize your dataset into the YOLO format. Each image should have a corresponding text file with the same name containing the bounding box annotations.

Now, you can use this dataset for training your YOLO model.

Note: Ensure that you have permission to use and download images from the internet, and comply with the terms of service of the websites you scrape. Additionally, web scraping may be subject to legal and ethical considerations. Always respect copyright and intellectual property rights.