1. Setting up your project
Install the NuGet packages:

```shell
dotnet add package HtmlAgilityPack
dotnet add package AngleSharp
dotnet add package System.Text.Json
```

2. Scraping with HtmlAgilityPack (XPath)
HtmlAgilityPack is the most popular .NET HTML parser. It uses XPath expressions to query elements:
```csharp
using HtmlAgilityPack;

var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
var doc = web.Load("https://books.toscrape.com/");

// Select all book articles using XPath
var books = doc.DocumentNode
    .SelectNodes("//article[@class='product_pod']");

foreach (var book in books ?? Enumerable.Empty<HtmlNode>())
{
    var title = book.SelectSingleNode(".//h3/a")
        ?.GetAttributeValue("title", "N/A");
    var price = book.SelectSingleNode(".//p[@class='price_color']")
        ?.InnerText.Trim();
    Console.WriteLine($"{title}: {price}");
}
```

XPath is powerful for complex queries. SelectNodes() returns all matches (or null when nothing matches, hence the null guard above), while SelectSingleNode() returns the first match.
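To show what a more complex XPath query looks like, here is a small sketch against an inline HTML string (the markup and class names below are invented purely for illustration) using contains() in a predicate:

```csharp
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Hypothetical markup, just to demonstrate the query syntax
doc.LoadHtml("""
    <ul>
      <li class="item sale"><a href="/a">Widget</a><span>$5</span></li>
      <li class="item"><a href="/b">Gadget</a><span>$9</span></li>
    </ul>
    """);

// All <li> whose class attribute contains 'sale', taking the link inside
var saleLinks = doc.DocumentNode
    .SelectNodes("//li[contains(@class, 'sale')]/a");

foreach (var a in saleLinks ?? Enumerable.Empty<HtmlNode>())
    Console.WriteLine($"{a.InnerText} -> {a.GetAttributeValue("href", "")}");
```

The same null-guard idiom applies here: `contains(@class, ...)` is the usual way to match one class among several, since XPath's `@class='...'` compares the whole attribute string.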
3. Scraping with AngleSharp (CSS selectors)
AngleSharp provides a browser-like API with CSS selectors. If you come from JavaScript, this will feel familiar:
```csharp
using AngleSharp;
using AngleSharp.Dom;

var config = Configuration.Default
    .WithDefaultLoader()
    .WithDefaultCookies();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(
    "https://books.toscrape.com/"
);

// CSS selectors - just like JavaScript
var books = document.QuerySelectorAll("article.product_pod");

foreach (var book in books)
{
    var title = book.QuerySelector("h3 a")
        ?.GetAttribute("title") ?? "N/A";
    var price = book.QuerySelector(".price_color")
        ?.TextContent.Trim();
    Console.WriteLine($"{title}: {price}");
}
```

4. Configuring HttpClient
Always reuse a single HttpClient instance and set realistic browser headers:
```csharp
using System.Net.Http.Headers;

// Reuse a single HttpClient instance
var handler = new HttpClientHandler
{
    AutomaticDecompression =
        System.Net.DecompressionMethods.GZip |
        System.Net.DecompressionMethods.Deflate
};
var client = new HttpClient(handler);

client.DefaultRequestHeaders.UserAgent.ParseAdd(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
    "AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36"
);
client.DefaultRequestHeaders.Accept.ParseAdd(
    "text/html,application/xhtml+xml"
);
client.DefaultRequestHeaders.AcceptLanguage.ParseAdd("en-US,en;q=0.9");

var html = await client.GetStringAsync(
    "https://example.com/products"
);
Console.WriteLine($"Fetched {html.Length} chars");
```

Pro tip
Never create a new HttpClient per request. .NET's socket exhaustion problem is real. Use IHttpClientFactory in ASP.NET or a shared static instance in console apps.
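Concretely, that means something like the sketch below: a static field in a console app, or a named client registered with IHttpClientFactory in ASP.NET Core (the "scraper" client name here is arbitrary):

```csharp
// Console app: one static client for the lifetime of the process
static class Http
{
    public static readonly HttpClient Client = new()
    {
        Timeout = TimeSpan.FromSeconds(30)
    };
}

// ASP.NET Core: register a named client once at startup...
// builder.Services.AddHttpClient("scraper", c =>
//     c.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (...)"));
//
// ...then resolve it via the injected factory where needed:
// var client = httpClientFactory.CreateClient("scraper");
```

The factory route has the extra benefit of rotating the underlying handlers periodically, which avoids stale-DNS issues that a long-lived static client can hit.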
5. Async concurrent scraping
C#'s async/await and SemaphoreSlim make concurrent scraping elegant:
```csharp
using System.Collections.Concurrent;

var urls = Enumerable.Range(1, 50)
    .Select(i => $"https://example.com/page/{i}")
    .ToList();

var results = new ConcurrentBag<(string url, string data)>();
var semaphore = new SemaphoreSlim(5); // max 5 concurrent

var tasks = urls.Select(async url =>
{
    await semaphore.WaitAsync();
    try
    {
        var html = await client.GetStringAsync(url);
        // ... parse with HtmlAgilityPack or AngleSharp
        results.Add((url, html));
        Console.WriteLine($"Done: {url}");
    }
    catch (HttpRequestException ex)
    {
        Console.WriteLine($"Error on {url}: {ex.Message}");
    }
    finally
    {
        // Polite delay while still holding the semaphore slot,
        // so it actually spaces out requests instead of running
        // after the next task has already started
        await Task.Delay(Random.Shared.Next(1000, 3000));
        semaphore.Release();
    }
});
await Task.WhenAll(tasks);

Console.WriteLine($"Scraped {results.Count} pages");
```

6. Storing scraped data
System.Text.Json handles JSON serialization. For CSV, string interpolation is fine for simple values; if fields may contain commas or quotes, use the CsvHelper package, which handles escaping for you:

```csharp
using System.Text.Json;

var products = new List<Product>
{
    new("Widget Pro", "$29.99", "/products/widget-pro"),
    new("Gadget Max", "$49.99", "/products/gadget-max")
};

// Save to JSON
var json = JsonSerializer.Serialize(products,
    new JsonSerializerOptions { WriteIndented = true });
File.WriteAllText("products.json", json);

// Save to CSV (no escaping - fine for simple values like these)
var csv = "Name,Price,Url\n" + string.Join("\n",
    products.Select(p => $"{p.Name},{p.Price},{p.Url}"));
File.WriteAllText("products.csv", csv);

record Product(string Name, string Price, string Url);
```

7. Handling JavaScript pages with SnapRender
HttpClient and HtmlAgilityPack cannot execute JavaScript. SPAs built with React, Angular, or Blazor return empty shells. Use SnapRender to get fully-rendered content:
Render as markdown

```csharp
using System.Net.Http.Json;
using System.Text.Json;

var client = new HttpClient();
client.DefaultRequestHeaders.Add("x-api-key", "sr_live_YOUR_KEY");

// Render any JS-heavy page as clean markdown
var payload = new
{
    url = "https://example.com/spa-page",
    format = "markdown"
};
var response = await client.PostAsJsonAsync(
    "https://api.snaprender.dev/v1/render", payload
);

var json = await response.Content.ReadAsStringAsync();
var doc = JsonDocument.Parse(json);
var markdown = doc.RootElement
    .GetProperty("data")
    .GetProperty("markdown")
    .GetString();
Console.WriteLine(markdown);
```

Extract structured data
Use CSS selectors to pull specific fields. Returns clean JSON — no parsing needed.
```csharp
var payload = new
{
    url = "https://example.com/products/widget-pro",
    selectors = new Dictionary<string, string>
    {
        ["name"] = "h1.product-title",
        ["price"] = ".price-current",
        ["rating"] = ".star-rating",
        ["description"] = ".product-description p",
        ["in_stock"] = ".availability-status"
    }
};
var response = await client.PostAsJsonAsync(
    "https://api.snaprender.dev/v1/extract", payload
);
var result = await response.Content.ReadAsStringAsync();
Console.WriteLine(result);
```

Comparison: when to use what
| Approach | Best for | Limitation |
|---|---|---|
| HtmlAgilityPack | Static HTML, XPath queries | No JS rendering |
| AngleSharp | CSS selectors, modern API | JS support is experimental |
| Playwright .NET | Full browser automation | Heavy, slow, resource-hungry |
| SnapRender API | JS + anti-bot + scale | API cost at high volume |
Skip the browser infrastructure
SnapRender handles JavaScript rendering, anti-bot bypass, and data extraction. Just send a URL from your C# app and get results back as JSON.
Frequently asked questions
**Is C# good for web scraping?**

C# is excellent for web scraping, especially in enterprise environments. HtmlAgilityPack and AngleSharp are mature, well-maintained libraries. C# offers strong typing, async/await, and excellent IDE support. It integrates naturally into .NET pipelines, Azure Functions, and Windows services.
**What's the difference between HtmlAgilityPack and AngleSharp?**

HtmlAgilityPack uses XPath for querying and is the older, more established library. AngleSharp uses CSS selectors (like jQuery) and has a more modern API. AngleSharp also supports JavaScript execution via AngleSharp.Js. For most scraping, AngleSharp is the more developer-friendly choice.
**Can C# scrape JavaScript-rendered pages?**

Standard C# HTTP clients (HttpClient, RestSharp) cannot execute JavaScript. You can use Playwright for .NET, Selenium WebDriver, or AngleSharp.Js for local rendering. For production scraping, an API like SnapRender handles JS rendering server-side without browser dependencies.
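As a rough sketch of the Playwright for .NET route (requires the Microsoft.Playwright package and a one-time browser download; the URL is a placeholder):

```csharp
using Microsoft.Playwright;

using var playwright = await Playwright.CreateAsync();
await using var browser = await playwright.Chromium.LaunchAsync(
    new BrowserTypeLaunchOptions { Headless = true });
var page = await browser.NewPageAsync();

await page.GotoAsync("https://example.com/spa-page");
// Wait until client-side rendering settles, then grab the final DOM
await page.WaitForLoadStateAsync(LoadState.NetworkIdle);
var html = await page.ContentAsync();

Console.WriteLine($"Rendered {html.Length} chars");
```

The rendered `html` can then be fed into HtmlAgilityPack or AngleSharp exactly like a static page.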
**How do I avoid getting blocked while scraping?**

Set realistic User-Agent headers, implement random delays between requests, rotate proxies, and handle 429/403 responses with exponential backoff. For Cloudflare-protected sites, use SnapRender with the use_flaresolverr flag to bypass protection automatically.
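A simple retry wrapper with exponential backoff along those lines (the retry count and delay values are arbitrary starting points, not recommendations):

```csharp
using System.Net;

async Task<string?> GetWithBackoffAsync(
    HttpClient client, string url, int maxRetries = 4)
{
    for (var attempt = 0; attempt < maxRetries; attempt++)
    {
        var response = await client.GetAsync(url);

        if (response.StatusCode is not (HttpStatusCode.TooManyRequests
            or HttpStatusCode.Forbidden))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }

        // 1s, 2s, 4s, 8s... plus jitter so parallel retries don't synchronize
        var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt))
                  + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 500));
        await Task.Delay(delay);
    }
    return null; // gave up after maxRetries attempts
}
```

Honoring a Retry-After header, when the server sends one, is a further refinement worth adding before the computed delay.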
**Should I use HttpClient or RestSharp?**

HttpClient is built into .NET and is the recommended choice for most scraping. It supports connection pooling, automatic decompression, and cookies. RestSharp adds convenience methods but introduces an extra dependency. For scraping, HttpClient with a shared instance is the standard approach.