In this post, we'll walk through a Python script that uses Selenium to automate scrolling through a web page. This can be particularly useful for scraping content from websites with infinite scrolling, like social media feeds or news sites.
Why Automate Scrolling?
Websites with infinite scrolling load more content as you scroll down the page. To scrape data from such sites, you need to keep scrolling to load all the available content. Manually doing this is tedious, so automating the process with Selenium can save time and effort.
The Script
Here’s a breakdown of our script:
1. Importing Required Libraries
First, we import the necessary libraries. Selenium helps us control the web browser, and the time and datetime libraries allow us to manage wait times and log timestamps.
import datetime
import time

from selenium import webdriver
2. Logging Function
The log function helps us log messages with the current time. This is useful for debugging and tracking the script’s progress.
def log(text):
    """
    Logs a message with the current time.
    """
    now = datetime.datetime.now()
    current_time = now.strftime("%H:%M:%S")
    print(f"[LOG {current_time}] {text}")
3. Scrolling Function
The scroll_page function takes a Selenium WebDriver instance and scrolls the web page. It scrolls a maximum number of times (max_scroll_count) or until no new content is loaded.
def scroll_page(driver, max_scroll_count=5, scroll_pause_time=5):
    """
    Scrolls the webpage using the provided WebDriver instance.

    Args:
        driver (webdriver): The WebDriver instance to use for scrolling.
        max_scroll_count (int): The maximum number of times to scroll the page.
        scroll_pause_time (int): The time in seconds to wait after each scroll.

    Returns:
        bool: False when the bottom of the page is reached or the max
        scroll count is exceeded.
    """
    scroll_count = 1
    last_height = driver.execute_script("return document.body.scrollHeight")
    while scroll_count <= max_scroll_count:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for new content to load
        time.sleep(scroll_pause_time)
        log(f"Scroll count={scroll_count}")
        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            log("Reached the bottom of the page or no new content loaded.")
            return False
        last_height = new_height
        scroll_count += 1
    log(f"Reached max scroll count {max_scroll_count}")
    return False
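Because scroll_page always returns False, the interesting question is when it stops. The stop condition can be reasoned about without a browser at all; here is a hypothetical pure helper (not part of the original script) that mirrors the same loop logic on a precomputed list of page heights, which makes the behavior easy to unit-test:

```python
def count_scrolls(heights, max_scroll_count=5):
    """Mirror scroll_page's loop on a list of scrollHeight readings.

    heights[0] is the height before any scrolling; heights[i] is the
    height observed after the i-th scroll (the last reading repeats
    once the page stops growing). Returns the number of scrolls
    performed before the loop stops.
    """
    last_height = heights[0]
    scroll_count = 1
    while scroll_count <= max_scroll_count:
        # Reading taken after the scroll_count-th scroll
        new_height = heights[min(scroll_count, len(heights) - 1)]
        if new_height == last_height:
            return scroll_count  # bottom reached: no new content loaded
        last_height = new_height
        scroll_count += 1
    return max_scroll_count  # gave up after the maximum number of scrolls


# A page that stops growing after two extra loads stops on scroll 3:
print(count_scrolls([1000, 2000, 3000, 3000]))  # → 3
```

The same reasoning shows why max_scroll_count matters: on a feed that never stops growing, the height comparison never triggers, and the cap is the only thing that ends the loop.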
4. Main Function
The main function initializes the WebDriver, navigates to the target URL, and calls the scroll_page function. It also ensures the browser closes properly after the script finishes.
def main():
    url = "https://reddit.com"
    # Initialize the WebDriver
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_page(driver)
        # Keep the browser open for 60 seconds after scrolling
        time.sleep(60)
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
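If you run the script on a server or in CI, you may not want a visible browser window. A minimal sketch of a headless variant of the setup, assuming a recent Chrome that supports the `--headless=new` flag:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")  # give the page a real viewport

driver = webdriver.Chrome(options=options)
```

Headless pages can lay out differently from a visible browser, so setting an explicit window size is worthwhile when a site's infinite scroll depends on viewport height.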
Running the Script
To run this script, make sure you have Selenium installed and the Chrome WebDriver available on your system. You can install Selenium using pip:
pip install selenium
Recent Selenium releases (4.6 and later) ship with Selenium Manager, which downloads a matching driver automatically. On older versions, download the appropriate ChromeDriver for your Chrome version from the official site and ensure it's on your system PATH, or specify its location when initializing the WebDriver.
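If the driver binary is not on your PATH, Selenium 4 lets you point at it explicitly through a Service object. The path below is a hypothetical example; substitute wherever you saved the driver:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path -- replace with your own chromedriver location.
service = Service(executable_path="/usr/local/bin/chromedriver")
driver = webdriver.Chrome(service=service)
```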
Conclusion
This script automates the process of scrolling through a web page using Selenium, making it easier to scrape content from sites with infinite scrolling. By customizing the max_scroll_count and scroll_pause_time, you can adapt the script to different websites and use cases. Happy coding!