1) Quick decision map
- If the site provides an API / RSS / sitemap → use that (cleanest, most reliable).
- If pages are static HTML → use requests + BeautifulSoup (fast and simple).
- If pages render with JavaScript → use a headless browser: Playwright, Puppeteer, or Selenium.
- If you need the whole site → use wget or HTTrack.
- Always check robots.txt and the site’s Terms of Service before scraping.
2) Legal & ethical checklist (read first!)
- Check https://example.com/robots.txt and follow its crawl rules (a quick programmatic check is sketched below).
- Respect rate limits (add a delay between requests).
- Don’t bypass paywalls or authentication you’re not allowed to use.
- Don’t collect personal data you aren’t authorized to collect.
- If in doubt, ask the site owner or use the official API.
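A minimal sketch of that check using only the standard library; the domain, path, and user-agent string are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Placeholder domain; swap in the site you actually intend to crawl.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse robots.txt

# can_fetch() reports whether this user agent may request the given URL.
if parser.can_fetch("my-scraper/1.0", "https://example.com/some/page"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt; pick another source or ask permission")
```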
3) Tools summary (pick one)
- Python: requests, BeautifulSoup, lxml, pandas (for post-processing).
- Headless browsers: Playwright (recommended), Selenium, Puppeteer (Node).
- Framework for large projects: Scrapy.
- Command-line whole-site mirroring: wget, HTTrack.
- For JS API endpoints: inspect the Network tab in DevTools — sometimes data comes from JSON endpoints you can call directly.
4) Minimal examples
A — Static HTML: Python (requests + BeautifulSoup)
Notes: set a realistic User-Agent and add time.sleep() between requests when fetching multiple pages.
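A minimal sketch, assuming a hypothetical listing page at https://example.com/articles whose headlines are h2.title elements (adjust the URL and selector to the real site):

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical target page and selector; replace with the real ones.
URL = "https://example.com/articles"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}

response = requests.get(URL, headers=HEADERS, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx

soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.select("h2.title"):
    print(heading.get_text(strip=True))

time.sleep(1)  # be polite before requesting the next page
```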
B — JavaScript-rendered pages: Playwright (Python)
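A minimal sketch using Playwright’s sync API; the URL and selector are placeholders, and Playwright needs a one-time `pip install playwright` plus `playwright install`:

```python
from playwright.sync_api import sync_playwright

# Hypothetical JS-rendered page and selector; replace with the real ones.
URL = "https://example.com/js-page"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")  # let JS-driven requests settle
    page.wait_for_selector("h2.title")        # confirm the content actually rendered

    for title in page.locator("h2.title").all_text_contents():
        print(title.strip())

    browser.close()
```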
C — Whole-site mirror with wget
- --mirror is shorthand for -r -N -l inf --no-remove-listing.
- Good for offline browsing; use carefully and respect robots.txt.
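- Example (hypothetical target domain): wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/ produces a local, browsable copy while staying within the site.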
D — Finding JSON endpoints (faster)
- Open Developer Tools → Network tab.
- Reload the page and filter by XHR/fetch.
- Look for requests returning application/json. Those endpoints often return structured data you can call directly with requests (see the sketch below).
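A minimal sketch, assuming you spotted a hypothetical endpoint such as https://example.com/api/items?page=1 in the Network tab (the payload shape is also an assumption):

```python
import requests

# Hypothetical JSON endpoint found in the Network tab; replace with the real one.
API_URL = "https://example.com/api/items"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}

response = requests.get(API_URL, params={"page": 1}, headers=HEADERS, timeout=10)
response.raise_for_status()

data = response.json()  # already structured; no HTML parsing needed
for item in data.get("items", []):  # "items" is a guess at the payload shape
    print(item)
```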
5) Pagination & links
- Find the next-page selector (e.g., .next) and loop until there is no next link (see the sketch below).
- Use canonical URLs or construct absolute URLs with urllib.parse.urljoin(base, href).
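A minimal sketch, assuming a hypothetical listing whose next-page link is an a.next element:

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical start page and selectors; replace with the real ones.
url = "https://example.com/articles"
headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}

while url:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    for heading in soup.select("h2.title"):
        print(heading.get_text(strip=True))

    # Follow the next-page link until there is none.
    next_link = soup.select_one("a.next")
    url = urljoin(url, next_link["href"]) if next_link else None
    time.sleep(1)  # stay polite between pages
```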
6) Authentication & sessions
- Use requests.Session() to persist cookies (see the sketch below).
- For login forms: submit credentials to the login endpoint, or use Playwright for sites with OAuth/JavaScript-driven logins.
- For 2FA or captchas: you generally shouldn’t bypass them; look for an API or request permission.
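A minimal sketch of a form-based login with a persistent session; the login URL and the username/password field names are assumptions about a hypothetical site (inspect the real form to find them):

```python
import requests

# Hypothetical endpoints and form field names.
LOGIN_URL = "https://example.com/login"
PROTECTED_URL = "https://example.com/account"

session = requests.Session()  # keeps cookies across requests
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"})

# Field names must match the real form's input names.
resp = session.post(LOGIN_URL, data={"username": "alice", "password": "secret"}, timeout=10)
resp.raise_for_status()

# The session now carries the auth cookies, so protected pages work.
page = session.get(PROTECTED_URL, timeout=10)
print(page.status_code)
```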
7) Respectful scraping best practices
- Add a sleep between requests, e.g. time.sleep(1 + random.random()).
- Use retry logic for transient errors (a sketch follows this list).
- Limit concurrent requests (Scrapy and Playwright both expose concurrency settings).
- Use caching if you re-run collection.
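A minimal sketch of the delay plus retry logic, using urllib3’s Retry via a requests adapter; the retry counts, status codes, and URLs are placeholder choices:

```python
import random
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures with exponential backoff.
retry = Retry(
    total=3,                                   # up to 3 retries per request
    backoff_factor=1,                          # roughly 1s, 2s, 4s between attempts
    status_forcelist=[429, 500, 502, 503, 504],
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"})

for url in ["https://example.com/page/1", "https://example.com/page/2"]:  # placeholders
    response = session.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(1 + random.random())  # polite, slightly jittered delay
```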
8) Parsing tips
- Use CSS selectors (soup.select(...)) or XPath (lxml).
- For tables, pandas.read_html() can quickly convert HTML tables into DataFrames.
- Clean text with .get_text(strip=True) and normalize whitespace (all three are combined in the sketch below).
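A minimal sketch combining these tips on a made-up HTML snippet (pandas.read_html() needs lxml or html5lib installed):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page.
html = """
<div><h2 class="title">  First   article </h2></div>
<table><tr><th>name</th><th>price</th></tr>
<tr><td>Widget</td><td>9.99</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector + get_text(strip=True), then collapse internal whitespace.
title = soup.select_one("h2.title").get_text(strip=True)
print(" ".join(title.split()))  # -> "First article"

# pandas.read_html() returns a list of DataFrames, one per <table>.
tables = pd.read_html(StringIO(html))
print(tables[0])
```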
9) Storing results
- Small: CSV or JSON.
- Larger: SQLite or a database like PostgreSQL.
- For images/files: download binary content with requests.get(url).content and save to disk (see the sketch below).
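A minimal sketch showing CSV, SQLite, and binary downloads side by side; file names, the table schema, and the image URL are placeholders:

```python
import csv
import sqlite3

import requests

rows = [{"title": "First article", "url": "https://example.com/a/1"}]  # placeholder data

# Small result sets: CSV.
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)

# Larger or incremental collections: SQLite.
con = sqlite3.connect("results.db")
con.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, url TEXT)")
con.executemany("INSERT INTO articles VALUES (:title, :url)", rows)
con.commit()
con.close()

# Images/files: write response.content (bytes) to disk.
img = requests.get("https://example.com/logo.png", timeout=10)  # placeholder URL
with open("logo.png", "wb") as f:
    f.write(img.content)
```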
10) Example project outline (scale up)
- Start with one page → confirm selector(s).
- Implement pagination and link discovery.
- Add rate limiting, error handling, logging.
- Save into structured format (JSON/CSV/DB).
- Add tests and incremental crawling to avoid re-downloading.
- Monitor for schema drift (site changes).




