# Scrape Site

## How to scrape a website

This section helps you customize the website scraping process to efficiently collect the data you need. Here, you'll learn how to specify which pages to scrape, refine data extraction, and adjust scraper behavior to respect website protocols and optimize performance.

It's important to note that websites differ significantly in their structure and content management. Therefore, tweaking the settings to align with the specific characteristics of the site you are targeting is crucial for getting accurate and relevant results. Each setting is designed to enhance control and efficiency in your data collection tasks, allowing for a tailored approach to each unique web environment.

Let's get started and configure your scraper effectively to let your AI agent gain some knowledge!

<figure><img src="https://2117387010-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F3rWESGvwA3vHJ3zNiAG1%2Fuploads%2Fd4koSsjKjyIFMhsOmw4U%2Fimage.png?alt=media&#x26;token=bb410a19-5259-4e6b-92e9-b170b5c8e395" alt=""><figcaption></figcaption></figure>

### **URL**

Here you provide the URL that the scraper should start from. If "Only scrape pages starting with url" is checked, the scraper will only scrape pages whose URLs contain the full URL provided in this field.
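
For example (URLs are illustrative):

```
Start URL: https://ebbot.com/sv        [x] Only scrape pages starting with url

https://ebbot.com/sv/ai-chatbot   scraped (contains the full start URL)
https://ebbot.com/en/ai-chatbot   skipped
```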

### **Include/Exclude Settings**

Also referred to as whitelist/blacklist, these settings let you control which pages are included in or excluded from the scrape: whitelist entries are URL fragments a page's URL must contain, while blacklist entries are fragments it must not contain. For example, to include pages containing "/ai-chatbot" in the URL, you can add "ebbot.com/sv/ai-chatbot" or "/ai-chatbot" to the whitelist. Use "and" to require that all conditions are met, or "or" to allow any single condition to be sufficient.
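
To illustrate, two whitelist entries behave differently depending on how they are combined (URLs are illustrative):

```
Entries: "/sv/" and "/ai-chatbot"

and → the URL must contain both entries:
      https://ebbot.com/sv/ai-chatbot   scraped
      https://ebbot.com/en/ai-chatbot   skipped

or  → the URL may contain either entry:
      https://ebbot.com/sv/ai-chatbot   scraped
      https://ebbot.com/en/ai-chatbot   scraped
```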

### **Query Selector**

Understanding query selectors is beneficial when configuring this setting. The scraper waits for the specified HTML element to appear before proceeding. To define a selector, inspect the desired element on the web page, identify a unique ID or class, and input it as `#yourId` for IDs or `.yourClass` for classes. You can combine these for more specific targeting (e.g., `#id1.class1.class2`). Test your selector using the browser's console to ensure it selects the intended element correctly. For more information on how to use query selectors [click here](#how-to-set-up-a-scrape-query-selectors).
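
Before saving a selector, it's worth verifying it in the browser's DevTools console, for example (the selectors are illustrative):

```
// Returns the first matching element, or null if nothing matches.
document.querySelector("#id1.class1.class2");

// Counts all matches; useful for spotting selectors that are too broad.
document.querySelectorAll(".yourClass").length;
```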

### **Query Selector Remove**

This option allows you to remove elements from the HTML by matching them with a query selector. If certain content is consistently undesirable (cookie consent elements for example), specify it here to exclude it from all scraped data. Similar methods apply as with the initial query selector configuration.
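
For example, to strip a cookie consent banner and the page footer from every scraped page, you might use something like the following (the class name is a made-up example; inspect your own site for the real one):

```
.cookie-banner, footer
```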

### **Disable JavaScript**

Enabling this setting will stop all JavaScript on the page, which can prevent tracking scripts and cookie banners from loading. However, be cautious with sites that rely on JavaScript for rendering, as disabling it may prevent the page from displaying correctly.

### **Disable Readability**

Readability mode extracts only the core, readable content from a webpage, omitting clutter like sidebars and popups. If the website doesn't support this mode, or if it accidentally hides the data you need, enable this setting to turn Readability off.

Tired of headers and menus cluttering your results? Leave Readability enabled to clean up the noise automatically. Just double-check that it isn't accidentally filtering out the info you actually need!

### **Check sitemap.xml**

If enabled, the scraper checks the site's sitemap.xml for URLs before starting, which may reveal unlinked or hidden pages. Be mindful of scraping non-public pages and disable this feature if necessary.
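
You can preview what the scraper will find by opening the sitemap in a browser, e.g. `https://www.yourwebsite.com/sitemap.xml` (substitute your own domain). A minimal sitemap looks like this:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yourwebsite.com/some-unlinked-page</loc>
  </url>
</urlset>
```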

### **Automatically remove duplicate elements**

When enabled, the scraper will try to automatically detect and remove repeated content such as headers, footers, cookie banners, and other common elements. However, it’s still strongly recommended to use query selectors to explicitly exclude headers, footers, and popups for more reliable and accurate results.

### **Slowmode**

Slowmode reduces the number of simultaneous connections to one, decreasing the likelihood of being perceived as a threat by security systems. This mode significantly increases scrape duration. If a lengthy scrape is necessary, consider breaking it into segments using targeted whitelists/blacklists to ensure completion within time limits.
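
For example, rather than one long scrape of an entire site, you might run several smaller scrapes in slowmode, each whitelisted to a single section (paths are illustrative):

```
Scrape 1 whitelist: /docs/
Scrape 2 whitelist: /blog/
Scrape 3 whitelist: /support/
```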

These settings provide comprehensive control over the scraping process, allowing for tailored data collection that respects site constraints and maximizes efficiency.

### **Advanced transformer settings**

Use this feature to create rules that transform text on all documents from this specific source. [Click here](https://docs.ebbot.ai/ebbot-docs/core-capabilities/ebbotgpt/ebbotgpt-knowledge/knowledge-pre-processing/advanced-transformer-settings?q=click+here) for more information about the transformer.

Some websites may block the scraper, viewing it as hostile due to the speed at which it collects data. Make sure that the scraper is not blocked by whitelisting the scraper's IP address.

## How to use Query Selectors when scraping <a href="#how-to-set-up-a-scrape-query-selectors" id="how-to-set-up-a-scrape-query-selectors"></a>

Using query selectors is one of the easiest ways to improve the quality of your web scraping results. Query selectors allow you to target only the parts of a webpage you actually want, like the main content, and skip over irrelevant sections like headers, footers, or cookie banners.

By adding just a few query selectors, you can reduce the number of scraped documents by **50–70%**, making your data cleaner and easier to work with.

Since every website is structured differently, you'll need to figure out which selectors work best for your specific site. To do this, open your site in a browser, right-click on the content you want, and choose **Inspect**. This will open the **Elements** panel, where you can find and test different selectors.

**Tip: target the main content element!** When scraping websites, it's helpful to identify the main container element that holds the core content of the page, excluding headers, footers, and popups. Most pages have a wrapper element like this, and selecting it ensures you capture only the relevant content.
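
A quick way to look for such a wrapper is to try a few common conventions in the DevTools console (these selectors are typical examples and may not exist on your site):

```
// Returns the first match among several common main-content wrappers, or null.
document.querySelector("main, [role='main'], #content, .main-content");
```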

### **How to write query selectors**

Once you've identified the element you want to include or exclude, check whether it's using a `class` or an `id`:

* For an **ID** (`id="main"`), use a hashtag (`#`):

  ```
  #main
  ```
* For a **Class** (`class="main"`), use a dot (`.`):

  ```
  .main
  ```

### **Multiple selectors**

You can include multiple selectors by separating them with a comma and a space:

```
.header, .footer
```
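
To see how much a combined selector would match, you can count the elements in the DevTools console (class names are illustrative):

```
// Counts every element matched by either selector.
document.querySelectorAll(".header, .footer").length;
```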

### **Handling multiple class or ID names**

If an element has multiple class or ID names separated by spaces (e.g., `id="elementor-container elementor-column"`), you need to replace the space with a dot (`.`).

For example:

```
<div id="elementor-container elementor-column">
```

Should be written in Ebbot as:

```
.elementor-container.elementor-column
```

> Even though it's an `id`, Ebbot uses a dot (`.`) prefix here because of its custom selector rules. Just remember: replace spaces with dots, and always start with a `.` or `#` based on the rule.

## Find out why your web scraping is failing <a href="#is-your-web-scrape-failing" id="is-your-web-scrape-failing"></a>

If your web scraping efforts result in 0 documents retrieved, the issue most frequently stems from one of the following reasons:

* **Scrape URL settings**: Is the "Only scrape pages starting with URL" setting correctly configured?
* **Anti-Scraping Measures:** Your website may have active security tools or detection systems designed to identify and block automated scraping attempts. If your website uses such tools, make sure that Ebbot's scraper IP is whitelisted.
* **robots.txt Directives:** Your website's robots.txt file might contain rules that explicitly disallow web scrapers from accessing content on your site. This file serves as a guide for benevolent web crawlers, indicating which parts of the site they should not visit.
* **Rendering**: Does the site render its content with JavaScript, requiring a delay or a specific selector to wait for?

{% tabs %}
{% tab title="URL settings" %}

### URL settings

To prevent the scraper from wandering into irrelevant parts of the site:

* **Check Setting**: "Only scrape pages starting with URL"
* **Best Practice**: If you only need the support section, use <https://client.com/support/>

**The Risk**: If this is too restrictive, you might miss pages; if it's too broad, you might clutter the data set with a lot of useless data.
{% endtab %}

{% tab title="Whitelist web scrape IP" %}

### Whitelist web scrape IP address

Current web scrape IP address: **35.246.173.254**\
New web scrape IP address: **141.94.175.178**

The current IP address will be replaced with the new one during 2026. To avoid future issues, ensure both IP addresses are whitelisted now, so you don’t need to update the whitelist later.
{% endtab %}

{% tab title="Add ebbot to robots.txt" %}

### Add Ebbot to your robots.txt file

To check if your website has robots.txt rules, simply go to your website's URL and append `/robots.txt` (e.g., <https://www.yourwebsite.com/robots.txt>).

Within the robots.txt file, you'll see directives starting with `User-agent:`. A User-agent identifies a specific web crawler (like a search engine bot or a web scraper).

* If you see `User-agent: *` followed by `Allow: /`, it means all web scrapers (and other bots) are generally allowed to access the entire site.

```
User-agent: *
Allow: /
```

* Conversely, if it says `User-agent: *` followed by `Disallow: /`, it indicates that all web scrapers (and other bots) are disallowed from crawling the entire site.

```
User-agent: *
Disallow: /
```

#### How to add Ebbot to robots.txt

If you want to allow only Ebbot's scraper while still disallowing others, you can add the following specific rule:

```
User-agent: Ebbot-Scraper
Allow: /
```
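
Combined with a catch-all disallow rule, the complete file would look like this (crawlers are expected to follow the most specific `User-agent` group that matches them):

```
User-agent: Ebbot-Scraper
Allow: /

User-agent: *
Disallow: /
```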

{% endtab %}

{% tab title="Rending" %}

### JavaScript & SPA Rendering

If the scraper returns an empty result or a "loading" icon, the site is likely a Single Page Application (SPA). The scraper might be grabbing the HTML before the content has loaded.

* **The Fix**: Set a specific Query Selector to tell the scraper to wait for the content.

**Recommended**: Use `main[role="main"]` or a specific container ID like `#main-content`.
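
To confirm that content is injected by JavaScript, you can compare the server-sent HTML with the live DOM in the DevTools console (a minimal sketch; `#main-content` is an illustrative selector):

```
// Fetch the raw HTML as a scraper without JavaScript would see it.
const raw = await (await fetch(location.href)).text();

// If the id is absent from the raw HTML but present in the live DOM,
// the content is rendered client-side and the scraper must wait for it.
console.log(raw.includes("main-content"));              // false -> injected by JS
console.log(!!document.querySelector("#main-content")); // true in the live page
```
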
{% endtab %}
{% endtabs %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.ebbot.ai/ebbot-docs/core-capabilities/ebbotgpt/ebbotgpt-knowledge/source-types/scrape-site.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
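
For example, a minimal sketch of such a request (the question string is illustrative):

```
// Ask the documentation a question via the ?ask= query parameter.
const url = new URL(
  "https://docs.ebbot.ai/ebbot-docs/core-capabilities/ebbotgpt/ebbotgpt-knowledge/source-types/scrape-site.md"
);
url.searchParams.set("ask", "Which IP addresses does the Ebbot scraper use?");

const res = await fetch(url);
console.log(await res.text()); // a direct answer plus relevant excerpts and sources
```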

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
