1) Quick decision map

  • If the site provides an API / RSS / sitemap → use that (cleanest, reliable).

  • If pages are static HTML → use requests + BeautifulSoup (fast and simple).

  • If pages render with JavaScript → use a headless browser: Playwright, Puppeteer, or Selenium.

  • If you need the whole site → use wget or HTTrack.

  • Always check robots.txt and the site’s Terms of Service before scraping.

2) Legal & ethical checklist (read first!)

  • Check the site’s robots.txt (e.g., https://example.com/robots.txt) and follow its crawl rules.

  • Respect rate limits (add a delay between requests).

  • Don’t bypass paywalls or authentication you’re not allowed to use.

  • Don’t collect personal data you aren’t authorized to collect.

  • If in doubt, ask the site owner or use the official API.

3) Tools summary (pick one)

  • Python: requests, BeautifulSoup, lxml, pandas (for postprocessing).

  • Headless browsers: Playwright (recommended), Selenium, Puppeteer (Node).

  • Framework for large projects: Scrapy.

  • Command-line whole-site: wget, HTTrack.

  • For JS API endpoints: inspect the Network tab in DevTools; the data often comes from JSON endpoints you can call directly.

4) Minimal examples

A — Static HTML: Python (requests + BeautifulSoup)

import requests
from bs4 import BeautifulSoup
import csv

url = "https://example.com/some-page"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")

# Example: extract all article titles and links
rows = []
for a in soup.select("article h2 a"):
    title = a.get_text(strip=True)
    href = a.get("href")
    rows.append((title, href))

# Save to CSV
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])
    writer.writerows(rows)

print(f"Saved {len(rows)} rows to articles.csv")

Notes: set a realistic User-Agent and add time.sleep() between requests when fetching multiple pages.

B — JavaScript-rendered pages: Playwright (Python)

Install first:

pip install playwright
python -m playwright install

Then run:

from playwright.sync_api import sync_playwright
import csv

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dynamic-page", timeout=60000)
    page.wait_for_selector("article h2 a")  # wait until content loads

    items = page.query_selector_all("article h2 a")
    rows = []
    for it in items:
        title = it.inner_text().strip()
        href = it.get_attribute("href")
        rows.append((title, href))

    browser.close()

with open("dynamic.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])
    writer.writerows(rows)

print("Saved dynamic.csv")

C — Whole-site mirror with wget

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/
  • --mirror = -r -N -l inf --no-remove-listing

  • Good for offline browsing; use it carefully and respect robots.txt.

D — Finding JSON endpoints (faster)

  1. Open Developer Tools → Network tab.

  2. Reload page and filter XHR / fetch.

  3. Look for requests returning application/json. Those endpoints often return structured data you can call directly with requests, as in the sketch below.
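
A minimal sketch of calling such an endpoint with requests. The endpoint URL and the JSON field names here are placeholders (assumptions for illustration); copy the real ones from the Network tab.

import requests

# Hypothetical endpoint spotted in the Network tab -- replace with the real URL.
api_url = "https://example.com/api/articles?page=1"
resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0",
                                      "Accept": "application/json"})
resp.raise_for_status()
data = resp.json()

# Field names ("title", "url") are assumptions; inspect the real JSON first.
for item in data:
    print(item.get("title"), item.get("url"))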

5) Pagination & links

  • Find the next-page selector (e.g., .next) and loop until there is no next link.

  • Use canonical URLs or construct absolute URLs with urllib.parse.urljoin(base, href); see the sketch below.
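
A minimal pagination sketch, assuming a hypothetical listing page whose "next" link matches a.next (the URL and selectors are placeholders):

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/articles"   # hypothetical listing page
rows = []
while url:
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    for a in soup.select("article h2 a"):
        rows.append((a.get_text(strip=True), urljoin(url, a.get("href"))))

    # Follow the "next" link if present; stop when it disappears.
    next_link = soup.select_one("a.next")
    url = urljoin(url, next_link.get("href")) if next_link else None
    time.sleep(1)  # be polite between pages

print(f"Collected {len(rows)} items")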

6) Authentication & sessions

  • Use requests.Session() to persist cookies (see the sketch after this list).

  • For login forms: submit credentials to the login endpoint, or use Playwright for sites with OAuth/JavaScript-driven logins.

  • For 2FA or captchas: you generally shouldn’t bypass them; look for an API or request permission.
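
A minimal form-login sketch with requests.Session(). The login URL and form field names are hypothetical; inspect the real login form in DevTools for the actual endpoint and fields, and only log in where you are permitted to.

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Hypothetical login endpoint and field names -- check the real form in DevTools.
login_url = "https://example.com/login"
payload = {"username": "me@example.com", "password": "my-password"}

resp = session.post(login_url, data=payload)
resp.raise_for_status()

# The session keeps the cookies, so later requests are authenticated.
profile = session.get("https://example.com/account")
print(profile.status_code)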

7) Respectful scraping best practices

  • Add sleep between requests: e.g., time.sleep(1 + random.random()).

  • Use retry logic (with backoff) for transient errors, as sketched after this list.

  • Limit concurrent requests (Scrapy and Playwright concurrency settings).

  • Use caching if you re-run collection.
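
A minimal politeness sketch combining a randomized delay with simple retry-and-backoff logic. The URL is a placeholder, and the retry counts and delays are arbitrary illustrative choices, not values mandated by requests or any other library.

import random
import time
import requests

def polite_get(url, max_retries=3):
    """GET with a random delay and basic retries for transient errors."""
    for attempt in range(max_retries):
        time.sleep(1 + random.random())  # delay before every request
        try:
            resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
            if resp.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"transient status {resp.status_code}")
            return resp
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")

resp = polite_get("https://example.com/some-page")  # placeholder URL
print(resp.status_code)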

8) Parsing tips

  • Use CSS selectors (soup.select(...)) or XPath (lxml).

  • For tables, pandas.read_html() can quickly convert HTML tables into DataFrames (see the sketch below).

  • Clean text with .get_text(strip=True) and normalize whitespace.
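
A minimal table-extraction sketch, assuming the page contains at least one <table> element (the URL is a placeholder; read_html needs an HTML parser such as lxml installed):

import io

import pandas as pd
import requests

resp = requests.get("https://example.com/page-with-tables",   # placeholder URL
                    headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()

# read_html returns a list of DataFrames, one per <table> in the HTML.
tables = pd.read_html(io.StringIO(resp.text))
print(f"Found {len(tables)} tables")
tables[0].to_csv("first_table.csv", index=False)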

9) Storing results

  • Small: CSV or JSON.

  • Larger: SQLite or a database like PostgreSQL.

  • For images/files: download binary content with requests.get(url).content and save it to disk (see the sketch below).
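
A minimal file-download sketch (the image URL is a placeholder). For large files, stream=True with iter_content avoids holding the whole file in memory; for small files, resp.content is enough.

import requests

img_url = "https://example.com/images/photo.jpg"  # placeholder URL

resp = requests.get(img_url, headers={"User-Agent": "Mozilla/5.0"}, stream=True)
resp.raise_for_status()

# Stream to disk in chunks so large files don't sit entirely in memory.
with open("photo.jpg", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)

print("Saved photo.jpg")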

10) Example project outline (scale up)

  1. Start with one page → confirm selector(s).

  2. Implement pagination and link discovery.

  3. Add rate limiting, error handling, logging.

  4. Save into structured format (JSON/CSV/DB).

  5. Add tests and incremental crawling to avoid re-downloading.

  6. Monitor for schema drift (site changes).