Website scrape
This section helps you customize the website scraping process to efficiently collect the data you need. Here, you'll learn how to specify which pages to scrape, refine data extraction, and adjust scraper behavior to respect website protocols and optimize performance.
It's important to note that websites differ significantly in their structure and content management. Therefore, tweaking the settings to align with the specific characteristics of the site you are targeting is crucial for getting accurate and relevant results. Each setting is designed to enhance control and efficiency in your data collection tasks, allowing for a tailored approach to each unique web environment.
Let's get started and configure your scraper effectively to help EbbotGPT gain some knowledge!
Here you provide the URL that the scraper should start from. If "Only scrape pages starting with url" is checked, the scraper will only visit pages whose URL contains the full URL provided in this field.
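If it helps to picture that behavior, here is a minimal TypeScript sketch of the check; the function name and the contains-based matching rule are assumptions for illustration, not the scraper's actual internals:

```typescript
// Hypothetical sketch of the "Only scrape pages starting with url" filter.
// The exact matching rule the scraper applies may differ.
const startUrl = "https://ebbot.com/sv";

function shouldScrape(pageUrl: string): boolean {
  // A page qualifies only if it contains the full start URL.
  return pageUrl.includes(startUrl);
}

console.log(shouldScrape("https://ebbot.com/sv/ai-chatbot")); // true
console.log(shouldScrape("https://ebbot.com/en/pricing"));    // false
```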
Also referred to as whitelist/blacklist, these settings let you control which pages are included in or excluded from the scrape. For the whitelist, specify parts of the URL that must match; for the blacklist, parts that must not. For example, to include pages containing "/ai-chatbot" in the URL, add "ebbot.com/sv/ai-chatbot" or "/ai-chatbot" to the whitelist. Use "and" to require that all conditions are met, or "or" to make any single condition sufficient.
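As a rough illustration of how "and"/"or" combine whitelist conditions, consider this hedged TypeScript sketch; the rule structure and names are assumptions, not the scraper's implementation:

```typescript
// Hypothetical sketch of combining whitelist patterns with "and"/"or".
type Mode = "and" | "or";

function matchesWhitelist(url: string, patterns: string[], mode: Mode): boolean {
  return mode === "and"
    ? patterns.every((p) => url.includes(p)) // all patterns must match
    : patterns.some((p) => url.includes(p)); // any single match suffices
}

console.log(matchesWhitelist("https://ebbot.com/sv/ai-chatbot", ["/ai-chatbot"], "or")); // true
```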
Understanding query selectors is beneficial when configuring this setting. The scraper waits for the specified HTML element to appear before proceeding. To define a selector, inspect the desired element on the web page, identify a unique ID or class, and input it as `#yourId` for IDs or `.yourClass` for classes. You can combine these for more specific targeting (e.g., `#id1.class1.class2`). Test your selector using the browser's console to ensure it selects the intended element correctly.
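For example, you can paste a quick check like this into the browser console on the target page (the selector shown is a placeholder; substitute your own):

```typescript
// Run in the browser console on the target page to test a selector.
// "#id1.class1.class2" is a placeholder, not a real site value.
const el = document.querySelector("#id1.class1.class2");
console.log(el ? "Selector matches:" : "No match for selector.", el);
```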
This option allows you to remove elements from the HTML by matching them with a query selector. If certain content is consistently undesirable (cookie consent elements, for example), specify it here to exclude it from all scraped data. The same selector syntax described above applies here.
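As an illustration of the effect, removing matched elements in the browser works like this (the `.cookie-consent` selector is a hypothetical example):

```typescript
// Remove every element matching a selector before reading the page content.
// ".cookie-consent" is an example selector, not a real site value.
document.querySelectorAll(".cookie-consent").forEach((el) => el.remove());
```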
Enabling this setting will stop all JavaScript on the page, which can prevent tracking scripts and cookie banners from loading. However, be cautious with sites that rely on JavaScript for rendering, as disabling it may prevent the page from displaying correctly.
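If you are unsure whether a site relies on JavaScript for its content, one informal check is to fetch the raw HTML and look for text you can see in the rendered page. This is a rough sketch with placeholder URL and text, not part of the scraper itself:

```typescript
// Rough check for JavaScript-dependent content: fetch the raw HTML and
// look for a text snippet visible in the rendered page.
async function needsJavaScript(url: string, visibleText: string): Promise<boolean> {
  const html = await (await fetch(url)).text();
  // If the text is missing from the raw HTML, it is likely rendered by JS.
  return !html.includes(visibleText);
}

needsJavaScript("https://ebbot.com/sv", "AI chatbot").then((jsOnly) =>
  console.log(jsOnly ? "Content likely requires JavaScript." : "Content present in raw HTML.")
);
```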
This setting turns off the readability script, allowing the scraper to view the page as a regular visitor would. If the site does not support readability mode, or if readability alters the desired content, consider disabling this feature.
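To get a feel for what a readability pass keeps or discards, you can experiment locally with Mozilla's Readability library. Whether the scraper's readability script behaves identically is an assumption; this is only a comparison aid:

```typescript
// Approximate what a readability pass extracts, using Mozilla's Readability.
// Whether the scraper uses this exact library is an assumption.
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

const html = "<html><body><article><h1>Title</h1><p>Main text.</p></article></body></html>";
const dom = new JSDOM(html, { url: "https://example.com/" });
const article = new Readability(dom.window.document).parse();
console.log(article?.title, article?.textContent);
```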
If enabled, the scraper checks the site's sitemap.xml for URLs before starting, which may reveal unlinked or hidden pages. Be mindful of scraping non-public pages and disable this feature if necessary.
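For reference, reading a standard sitemap.xml looks roughly like this; the simple `<loc>` pattern match assumes a conventional urlset layout, and a real XML parser would be more robust:

```typescript
// Minimal sketch: fetch sitemap.xml and list the URLs it exposes.
// Assumes a standard <urlset><url><loc>…</loc></url></urlset> layout.
async function sitemapUrls(siteRoot: string): Promise<string[]> {
  const xml = await (await fetch(`${siteRoot}/sitemap.xml`)).text();
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

sitemapUrls("https://ebbot.com").then((urls) => console.log(urls.length, "URLs found"));
```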
Slowmode reduces the number of simultaneous connections to one, decreasing the likelihood of being perceived as a threat by security systems. This mode significantly increases scrape duration. If a lengthy scrape is necessary, consider breaking it into segments using targeted whitelists/blacklists to ensure completion within time limits.
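The idea behind slowmode can be pictured as fetching one URL at a time instead of many in parallel, as in this sketch (the URLs are placeholders, and the scraper's internals may differ):

```typescript
// Illustration of the slowmode idea: one request at a time instead of
// a concurrent batch, so only a single connection is ever open.
async function scrapeSequentially(urls: string[]): Promise<void> {
  for (const url of urls) {
    const res = await fetch(url); // exactly one open connection at a time
    console.log(url, res.status);
  }
}

scrapeSequentially(["https://ebbot.com/sv", "https://ebbot.com/sv/ai-chatbot"]);
```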
These settings provide comprehensive control over the scraping process, allowing for tailored data collection that respects site constraints and maximizes efficiency.