Cheerio is a lightweight, fast, and flexible library designed for web scraping and HTML manipulation. It provides a jQuery-like API for the server, allowing you to parse and manipulate HTML and XML documents.
Key Features:
jQuery-like syntax: A familiar API for selecting, traversing, and manipulating elements.
Speed: Parses markup directly, with no browser or rendering overhead.
HTML and XML support: Works with both document types.
Static parsing only: Does not execute JavaScript or render the page.
Installation:
npm install cheerio axios
(axios is included here because the examples below use it to fetch pages)
Basic Usage Example:
const cheerio = require("cheerio");
const axios = require("axios");

const scrapeWebsite = async () => {
  // Fetch the raw HTML, then load it into Cheerio for jQuery-style queries
  const { data } = await axios.get("https://example.com");
  const $ = cheerio.load(data);
  // Print the text of every <h1> on the page
  $("h1").each((i, element) => {
    console.log($(element).text());
  });
};

scrapeWebsite();
Key Methods:
load(html): Load HTML into Cheerio.
$(selector): Select elements using CSS selectors.
text(): Get or set text content.
attr(name): Get or set attribute values.
html(): Get or set HTML content.
Advanced Usage Example:
const getLinks = async () => {
  const { data } = await axios.get("https://example.com");
  const $ = cheerio.load(data);
  $("a").each((i, element) => {
    console.log($(element).attr("href"));
  });
};

getLinks();
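Note that text(), attr(), and html() can set values as well as read them. A minimal sketch on an inline HTML string (no network request needed):

const cheerio = require("cheerio");

// Load a static snippet; the same API works on fetched pages
const $ = cheerio.load('<div class="box"><a href="/old">Old link</a></div>');

console.log($("a").text()); // "Old link" (get)
$("a").text("New link");    // set text content
$("a").attr("href", "/new"); // set an attribute
console.log($(".box").html()); // '<a href="/new">New link</a>'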
Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium browsers. It allows for full automation of browser tasks, such as web scraping, UI testing, and rendering JavaScript-heavy pages.
Key Features:
Browser automation: Drives a headless (or full) Chrome/Chromium instance programmatically.
JavaScript execution: Renders pages exactly as a real browser does.
User simulation: Types, clicks, scrolls, and navigates like a real user.
Screenshots and PDFs: Captures rendered pages as images or PDF files.
Installation:
npm install puppeteer
Basic Usage Example:
const puppeteer = require("puppeteer");

const scrapeWebsite = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  const title = await page.title();
  console.log(`Title: ${title}`);
  await browser.close();
};

scrapeWebsite();
Key Methods:
launch(): Launch a new browser instance.
newPage(): Create a new page/tab.
goto(url): Navigate to a URL.
evaluate(): Execute JavaScript in the page context.
screenshot(options): Take a screenshot of the page.
Advanced Usage Example:
const scrapeWithPuppeteer = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  // Run code inside the page context and return the result to Node
  const data = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("h1")).map(
      (element) => element.textContent
    );
  });
  console.log(data);
  await browser.close();
};

scrapeWithPuppeteer();
Cheerio: Best for static pages where the HTML you need is present in the initial response; it is fast and lightweight because no browser is involved.
Puppeteer: Best for JavaScript-heavy pages and tasks that need a real browser, such as user interactions, screenshots, and PDFs.
Improvement Note: The basic examples above can be extended to handle real-world complications such as pagination, nested structures, forms, and dynamically loaded content, as shown in the following examples.
Cheerio:
Scraping Data with Pagination:
If a website uses pagination to display results, you might need to navigate through multiple pages to collect data.
const cheerio = require("cheerio");
const axios = require("axios");

const scrapePaginatedData = async (url) => {
  let pageNumber = 1;
  let hasMorePages = true;
  while (hasMorePages) {
    const { data } = await axios.get(`${url}?page=${pageNumber}`);
    const $ = cheerio.load(data);
    // Replace "selector-for-items" with the selector that matches
    // result items on the target site
    const items = $("selector-for-items");
    if (items.length === 0) {
      // An empty page means we have walked past the last page
      hasMorePages = false;
    } else {
      items.each((i, element) => {
        console.log($(element).text());
      });
      pageNumber++;
    }
  }
};

scrapePaginatedData("https://example.com/items");
Handling Complex HTML Structures:
When the data you need is nested, select the parent container first and use find() to extract each field within it.
const cheerio = require("cheerio");
const axios = require("axios");

const getComplexData = async () => {
  const { data } = await axios.get("https://example.com/complex-page");
  const $ = cheerio.load(data);
  // Select each parent container, then pull out its nested fields
  $("div.parent-class").each((i, element) => {
    const title = $(element).find("h2.title").text();
    const description = $(element).find("p.description").text();
    console.log({ title, description });
  });
};

getComplexData();
Puppeteer:
Navigating and Interacting with Forms:
const puppeteer = require("puppeteer");

const interactWithForm = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/form-page");
  await page.type('input[name="username"]', "myUsername");
  await page.type('input[name="password"]', "myPassword");
  // Start waiting for the navigation before clicking, so the
  // navigation triggered by the click is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);
  console.log("Form submitted successfully");
  await browser.close();
};

interactWithForm();
Generating PDFs and Screenshots:
const puppeteer = require("puppeteer");

const generatePDF = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  // Save the rendered page as an A4 PDF
  await page.pdf({ path: "page.pdf", format: "A4" });
  console.log("PDF generated");
  // Capture the full scrollable page, not just the viewport
  await page.screenshot({ path: "screenshot.png", fullPage: true });
  console.log("Screenshot taken");
  await browser.close();
};

generatePDF();
Handling JavaScript-Heavy Pages:
const puppeteer = require("puppeteer");

const scrapeJavaScriptHeavyPage = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // networkidle2: treat navigation as finished once there are no more
  // than two in-flight network requests for 500 ms
  await page.goto("https://example.com/js-heavy-page", {
    waitUntil: "networkidle2",
  });
  const content = await page.evaluate(() => {
    return document.querySelector("div.dynamic-content").innerText;
  });
  console.log(content);
  await browser.close();
};

scrapeJavaScriptHeavyPage();
Real-Time Data Monitoring:
Track changes on a webpage or service in real time and alert when specific conditions are met. Useful for monitoring price changes, stock updates, or news feeds.
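A minimal sketch of this pattern with Cheerio, assuming a hypothetical product page with a .price element and a fixed threshold:

const cheerio = require("cheerio");
const axios = require("axios");

const monitorPrice = async () => {
  try {
    const { data } = await axios.get("https://example.com/product");
    const $ = cheerio.load(data);
    // Strip currency symbols before parsing the number
    const price = parseFloat($(".price").first().text().replace(/[^0-9.]/g, ""));
    if (price < 100) {
      console.log(`Alert: price dropped to ${price}`);
    }
  } catch (err) {
    console.error("Check failed:", err.message);
  }
};

// Poll once per minute
setInterval(monitorPrice, 60 * 1000);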
Dynamic Content Extraction:
Extract data from pages where content is generated dynamically with JavaScript. Puppeteer can be used to render the page and extract data after all scripts have executed.
Automated Testing of User Flows:
Test complex user flows in web applications by simulating user interactions. Useful for QA teams to ensure that application workflows perform as expected under different conditions.
Web Data Aggregation:
Combine data from multiple sources or websites into a single dataset. Cheerio can be used to extract data from static pages, while Puppeteer can handle dynamic content.
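A possible sketch with Cheerio: fetch several static pages in parallel and merge one record per page into a single dataset (the URLs are placeholders):

const cheerio = require("cheerio");
const axios = require("axios");

const aggregate = async (urls) => {
  // Fetch all sources in parallel
  const responses = await Promise.all(urls.map((url) => axios.get(url)));
  // Extract one record per page and merge into a single dataset
  return responses.map(({ data }, i) => {
    const $ = cheerio.load(data);
    return { source: urls[i], title: $("h1").first().text() };
  });
};

aggregate(["https://example.com/a", "https://example.com/b"]).then((dataset) =>
  console.log(dataset)
);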
Social Media Content Scraping:
Extract posts, comments, or user interactions from social media platforms. Puppeteer can handle pages with infinite scrolling or dynamic content loading.
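A common Puppeteer approach to infinite scrolling is to scroll in a loop until the page height stops growing; a sketch, assuming a hypothetical .post selector:

const puppeteer = require("puppeteer");

const scrapeInfiniteScroll = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/feed");
  let previousHeight = 0;
  // Keep scrolling until no new content is loaded
  while (true) {
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === previousHeight) break;
    previousHeight = height;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((resolve) => setTimeout(resolve, 1000)); // let content load
  }
  const posts = await page.evaluate(() =>
    Array.from(document.querySelectorAll(".post")).map((el) => el.innerText)
  );
  console.log(posts.length, "posts collected");
  await browser.close();
};

scrapeInfiniteScroll();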
SEO Analysis:
Scrape web pages to analyze SEO elements such as meta tags, headings, and keyword usage. Automated tools can check large numbers of pages and generate SEO reports.
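For instance, a minimal SEO check with Cheerio might collect the title, meta description, and heading counts:

const cheerio = require("cheerio");
const axios = require("axios");

const seoReport = async (url) => {
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);
  console.log({
    title: $("title").text(),
    metaDescription: $('meta[name="description"]').attr("content"),
    h1Count: $("h1").length,
    h2Count: $("h2").length,
  });
};

seoReport("https://example.com");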
Content Personalization:
Gather user data from websites to create personalized content recommendations. Useful for building user profiles based on their browsing history.
Data Migration:
Transfer content from old websites to new platforms by scraping the old site and injecting the data into the new one.
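A sketch of this flow, assuming the new platform exposes a (hypothetical) REST endpoint for creating articles:

const cheerio = require("cheerio");
const axios = require("axios");

const migrateArticle = async (oldUrl) => {
  // Scrape the article from the old site
  const { data } = await axios.get(oldUrl);
  const $ = cheerio.load(data);
  const article = {
    title: $("h1").first().text(),
    body: $("div.article-body").html(),
  };
  // Push it into the new platform; this endpoint is hypothetical
  await axios.post("https://new-site.example.com/api/articles", article);
  console.log(`Migrated: ${article.title}`);
};

migrateArticle("https://old-site.example.com/articles/1");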
Competitor Analysis:
Collect data on competitors’ websites, including product prices, promotions, and reviews, to perform market analysis.
Content Quality Assurance:
Validate the quality of content on a website, including checking for broken links, missing images, and formatting issues.
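As a sketch, a simple broken-link check can combine Cheerio (to collect links) with HEAD requests (to test them):

const cheerio = require("cheerio");
const axios = require("axios");

const checkLinks = async (url) => {
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);
  // Collect absolute URLs from every anchor on the page
  const links = $("a[href]")
    .map((i, el) => new URL($(el).attr("href"), url).href)
    .get();
  for (const link of links) {
    try {
      // HEAD avoids downloading the full body; some servers reject it,
      // in which case a GET fallback would be needed
      await axios.head(link);
    } catch {
      console.log(`Broken link: ${link}`);
    }
  }
};

checkLinks("https://example.com");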