Scraping Hacker News with Selenium

Ever wanted to automatically fetch the top stories from Hacker News? In this guide, we'll build a simple yet powerful Selenium script to extract the most popular posts, including their titles, links, and points. This is perfect for developers looking to automate web scraping or learn web automation basics.

Why Selenium?

While there are many web scraping tools available, Selenium offers unique advantages:

  • Handles dynamic JavaScript content
  • Simulates real browser interactions
  • Perfect for complex web applications
  • Great for learning web automation

Prerequisites

Before we dive in, make sure you have:

  • Node.js installed
  • Selenium WebDriver (npm install selenium-webdriver)
  • ChromeDriver set up and matching your installed Chrome version (the smoke test after this list can confirm it)
  • Basic JavaScript knowledge
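
If you want to confirm the setup before writing any scraping logic, a minimal smoke test like the following (saved as smoketest.js, a name chosen here purely for illustration) should open Chrome, print a page title, and exit cleanly:

const {Builder} = require('selenium-webdriver');

// If this runs without errors, Selenium and ChromeDriver are wired up correctly
(async function smokeTest() {
    let driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('https://news.ycombinator.com');
        console.log('Page title:', await driver.getTitle());
    } finally {
        await driver.quit();
    }
})();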

The Code

Let's break down our solution into manageable pieces:

const {Builder, By, until} = require('selenium-webdriver');

async function getHackerNews() {
    let driver = await new Builder().forBrowser('chrome').build();
    
    try {
        // Navigate to Hacker News
        await driver.get('https://news.ycombinator.com');
        
        // Wait for the story list to load
        await driver.wait(until.elementLocated(By.className('titleline')), 10000);
        
        // Get the top 5 posts. Each story spans three table rows:
        // the title row (tr.athing), a subtext row, and a spacer row.
        for (let i = 1; i <= 5; i++) {
            // Get title and link from the story's title row
            let titleElement = await driver.findElement(
                By.css(`tr.athing:nth-child(${3 * i - 2}) td.title span.titleline a`));
            let title = await titleElement.getText();
            let link = await titleElement.getAttribute('href');
            
            // Get points from the subtext row that follows. Job postings
            // have no score, so guard against the element being absent.
            let pointsElements = await driver.findElements(
                By.css(`tr:nth-child(${3 * i - 1}) td.subtext span.score`));
            let points = pointsElements.length > 0
                ? await pointsElements[0].getText()
                : 'n/a';
            
            console.log(`\nPost #${i}:`);
            console.log(`Title: ${title}`);
            console.log(`Link: ${link}`);
            console.log(`Points: ${points}`);
        }
        
    } catch (error) {
        console.error('An error occurred:', error);
    } finally {
        // Close the browser whether or not scraping succeeded
        await driver.quit();
    }
}

getHackerNews();

How It Works

1. Setting Up the Browser

let driver = await new Builder().forBrowser('chrome').build();

This line initializes a new Chrome browser instance that Selenium will control.
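
The Builder API also accepts browser-specific options. For example, you can run Chrome headless (no visible window), which is common for scraping jobs. This is a sketch using the standard selenium-webdriver/chrome module; it would replace the single Builder line inside getHackerNews():

const chrome = require('selenium-webdriver/chrome');

// Run Chrome without opening a visible window
let options = new chrome.Options().addArguments('--headless=new');

let driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(options)
    .build();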

2. Navigation and Waiting

await driver.get('https://news.ycombinator.com');
await driver.wait(until.elementLocated(By.className('titleline')), 10000);

We navigate to Hacker News and wait for the content to load by checking for the 'titleline' class.
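
The until module ships with other ready-made conditions, too. A couple of variations on the same idea (same 10-second timeout, shown only for illustration; these lines belong inside the async function):

// Wait until the page title contains a known string
await driver.wait(until.titleContains('Hacker News'), 10000);

// Wait until at least one story row is present, keeping the matched elements
let rows = await driver.wait(until.elementsLocated(By.css('tr.athing')), 10000);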

3. Extracting Information

We use CSS selectors to locate and extract:

  • Post titles and links using the 'titleline' class
  • Points using the 'score' class (a more defensive selector strategy is sketched after this list)
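
The nth-child arithmetic works because each story occupies exactly three table rows, but it breaks as soon as that layout shifts. A more defensive alternative (my suggestion, not part of the script above) is to iterate over the tr.athing rows directly and read each score from the sibling row, which also tolerates job posts that have no score:

// Inside the try block, replacing the nth-child loop:
let rows = await driver.findElements(By.css('tr.athing'));

for (let [index, row] of rows.slice(0, 5).entries()) {
    let anchor = await row.findElement(By.css('span.titleline a'));
    let title = await anchor.getText();
    let link = await anchor.getAttribute('href');

    // The subtext row immediately follows each story row;
    // findElements returns an empty array when no score exists
    let scores = await row.findElements(
        By.xpath('following-sibling::tr[1]//span[@class="score"]'));
    let points = scores.length > 0 ? await scores[0].getText() : 'n/a';

    console.log(`\nPost #${index + 1}: ${title} (${points})`);
    console.log(`Link: ${link}`);
}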

4. Error Handling

The try-catch block ensures our script handles errors gracefully, and the finally block guarantees browser cleanup.
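
Beyond try/catch, transient failures (slow loads, elements going stale mid-read) are often worth retrying rather than failing outright. Here is a small generic retry helper, purely as a sketch; the attempt count and delay are arbitrary defaults, not Selenium settings:

// Retry an async operation a few times before giving up
async function withRetries(operation, attempts = 3, delayMs = 1000) {
    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            return await operation();
        } catch (error) {
            if (attempt === attempts) throw error;
            console.warn(`Attempt ${attempt} failed, retrying...`);
            await new Promise(resolve => setTimeout(resolve, delayMs));
        }
    }
}

// Usage: wrap any flaky lookup
// let title = await withRetries(() => titleElement.getText());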

Running the Script

Save the code in a file (e.g., hackernews.js) and run:

node hackernews.js

Expected Output

Post #1:
Title: Example Post Title
Link: https://example.com
Points: 100 points

[... more posts follow]

Common Challenges and Solutions

  1. Timing Issues: Use explicit waits instead of fixed delays (see the comparison below)
  2. Selector Changes: Keep selectors updated with site changes
  3. Error Handling: Implement robust error handling for reliability
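
To make the first point concrete, here is the difference between a fixed delay and an explicit wait (both driver.sleep and driver.wait are standard selenium-webdriver calls):

// Fragile: always burns 5 seconds, yet still fails if the page is slower
await driver.sleep(5000);
let el = await driver.findElement(By.className('titleline'));

// Robust: returns as soon as the element appears, fails only after 10 seconds
let el2 = await driver.wait(until.elementLocated(By.className('titleline')), 10000);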

Extending the Script

You could enhance this script by:

  • Saving results to a file (sketched below)
  • Filtering posts by points
  • Adding more post details
  • Implementing regular scheduling
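
As an example of the first idea, here is a sketch that collects the posts into an array and writes them out with Node's built-in fs module (the filename hn-top.json is just a suggestion):

const fs = require('fs');

// Inside getHackerNews(), collect results instead of only logging them:
let posts = [];
// ...in the loop: posts.push({rank: i, title, link, points});

// After the loop, persist everything in one write:
fs.writeFileSync('hn-top.json', JSON.stringify(posts, null, 2));
console.log(`Saved ${posts.length} posts to hn-top.json`);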

A Note on Responsible Scraping

Respect websites' terms of service and implement appropriate delays in your scraping scripts.
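
For instance, if you extend the script to walk through several pages, a short pause between page loads keeps the traffic polite (the 2-second figure is an arbitrary choice; Hacker News paginates with the p query parameter):

// Pause between consecutive page loads to avoid hammering the server
await driver.get('https://news.ycombinator.com');
// ...scrape page 1...
await driver.sleep(2000);
await driver.get('https://news.ycombinator.com/news?p=2');
// ...scrape page 2...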