Web Scraping at Scale

Scrape data from protected websites using SessionKit's stealth sessions, residential proxies, and automatic retry logic.

Basic Scraping Pattern

import { SessionKit } from '@sessionkit/sdk'
import { chromium } from 'playwright'

const sk = new SessionKit({ apiKey: process.env.SESSIONKIT_API_KEY })

async function scrape(url: string) {
  const session = await sk.sessions.create({
    proxy: { type: 'residential', country: 'US', sticky: true },
    stealth: 'max',
    fingerprint: 'auto',
    timeout: 120,
  })

  const browser = await chromium.connectOverCDP(session.cdpUrl)
  const page = await browser.newPage()

  try {
    await page.goto(url, { waitUntil: 'networkidle' })

    // Extract the title, price, and link from each product card
    const data = await page.evaluate(() => {
      const items = document.querySelectorAll('.product-card')
      return Array.from(items).map(item => ({
        title: item.querySelector('h2')?.textContent?.trim(),
        price: item.querySelector('.price')?.textContent?.trim(),
        url: item.querySelector('a')?.href,
      }))
    })

    return data
  } finally {
    await browser.close()
    await sk.sessions.destroy(session.id)
  }
}

Parallel Scraping with Fleets

For high-throughput scraping, use a fleet to avoid cold-start latency:

// Create a fleet for scraping
const fleet = await sk.fleets.create({
  name: 'product-scraper',
  size: 20,
  maxSize: 50,
  proxy: { type: 'datacenter', region: 'us' },
  stealth: 'basic',
  sessionTimeout: 60,
})

// Scrape multiple URLs in parallel
const urls = ['https://example.com/page/1', 'https://example.com/page/2' /* , ... */]

const results = await Promise.all(
  urls.map(async (url) => {
    const session = await sk.fleets.acquire(fleet.id)
    const browser = await chromium.connectOverCDP(session.cdpUrl)
    const page = await browser.newPage()

    try {
      await page.goto(url)
      const data = await page.evaluate(() => document.title)
      return { url, data }
    } finally {
      await browser.close()
      await sk.fleets.release(fleet.id, session.id)
    }
  })
)
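
Note that Promise.all starts every request at once and rejects as soon as any single page fails. When the URL list is longer than the fleet, it may help to cap the number of pages in flight and collect per-URL outcomes. The following is a minimal sketch of one way to do that; mapWithConcurrency is a hypothetical helper written for this guide, not part of the SessionKit SDK, and it assumes nothing about how the fleet behaves when it runs out of sessions.

```typescript
// Hypothetical helper (not part of the SessionKit SDK): runs `worker` over
// `items` with at most `limit` calls in flight, and records every outcome
// instead of rejecting on the first failure.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<PromiseSettledResult<R>[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer')
  }

  const results: PromiseSettledResult<R>[] = new Array(items.length)
  let next = 0

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++
      try {
        const value = await worker(items[index], index)
        results[index] = { status: 'fulfilled', value }
      } catch (reason) {
        results[index] = { status: 'rejected', reason }
      }
    }
  }

  const runnerCount = Math.min(limit, items.length)
  await Promise.all(Array.from({ length: runnerCount }, () => runner()))
  return results
}
```

To use it with the fleet example, move the body of the urls.map callback into the worker and pass a limit no larger than the fleet's size, for example mapWithConcurrency(urls, 20, scrapeOne), where scrapeOne is whatever function wraps the acquire, scrape, and release steps. Results come back in the same order as the input URLs.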

Handling Anti-Bot Challenges

async function scrapeProtectedSite(url: string) {
  const session = await sk.sessions.create({
    proxy: { type: 'residential', country: 'US' },
    stealth: 'max',
    fingerprint: 'auto',
  })

  const browser = await chromium.connectOverCDP(session.cdpUrl)
  const page = await browser.newPage()

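  try {
    await page.goto(url)

    // Wait for the Cloudflare challenge element to disappear.
    // Playwright's signature is waitForFunction(fn, arg, options), so the
    // options object must be the third argument.
    await page.waitForFunction(
      () => !document.querySelector('#cf-challenge-running'),
      undefined,
      { timeout: 15000 }
    ).catch(() => {
      console.log('Challenge did not resolve; a retry may be needed')
    })

    // Check if we passed
    const title = await page.title()
    if (title.includes('Just a moment')) {
      throw new Error('Anti-bot challenge not bypassed')
    }

    // Continue with scraping...
    const content = await page.content()
    return content
  } finally {
    // Release the browser and session even when the challenge check throws
    await browser.close()
    await sk.sessions.destroy(session.id)
  }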
}

Best Practices

  1. Use residential proxies for anti-bot protected sites
  2. Enable sticky sessions for multi-page crawls requiring login
  3. Add random delays between actions to mimic human behavior
  4. Rotate fingerprints between sessions to avoid tracking
  5. Respect robots.txt and rate-limit your requests
  6. Handle errors gracefully with exponential backoff
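
Practices 3 and 6 can be sketched with a few small helpers. This is an illustrative sketch only: sleep, randomDelayMs, backoffDelayMs, and withBackoff are hypothetical names, not SessionKit APIs, and the default values (500 ms base, 30 s cap, 4 attempts) are assumptions to tune per target site. The backoff uses the common "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing ceiling.

```typescript
// Hypothetical helpers for illustration; none are part of the SessionKit SDK.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Uniform random delay in [minMs, maxMs), for pacing actions on a page.
function randomDelayMs(
  minMs: number,
  maxMs: number,
  rand: () => number = Math.random,
): number {
  return minMs + rand() * (maxMs - minMs)
}

// Exponential backoff with full jitter:
// uniform in [0, min(capMs, baseMs * 2^attempt)), with attempt starting at 0.
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
  return rand() * ceiling
}

// Runs `task` up to `maxAttempts` times, sleeping with backoff between
// failures, and rethrows the last error if every attempt fails.
async function withBackoff<T>(
  task: (attempt: number) => Promise<T>,
  maxAttempts = 4,
  baseMs = 500,
  capMs = 30000,
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task(attempt)
    } catch (err) {
      lastError = err
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, capMs))
      }
    }
  }
  throw lastError
}
```

A scrape function that throws when a challenge is not bypassed, such as scrapeProtectedSite above, could then be wrapped as withBackoff(() => scrapeProtectedSite(url)), so that each retry starts from a fresh session. Between clicks or page loads inside a single session, await sleep(randomDelayMs(800, 2500)) adds a pause of varying length; the bounds here are placeholders.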

Important: Always comply with the target website's Terms of Service and applicable laws when scraping.