Website scrape
This section helps you customize the website scraping process to efficiently collect the data you need. Here, you'll learn how to specify which pages to scrape, refine data extraction, and adjust scraper behavior to respect website protocols and optimize performance.
It's important to note that websites differ significantly in their structure and content management. Therefore, tweaking the settings to align with the specific characteristics of the site you are targeting is crucial for getting accurate and relevant results. Each setting is designed to enhance control and efficiency in your data collection tasks, allowing for a tailored approach to each unique web environment.
Let's get started and configure your scraper effectively to help EbbotGPT gain some knowledge!
Here you provide the URL that the scraper should start from. If "Only scrape pages starting with url" is checked, the scraper will only scrape pages whose URL contains the full URL provided in this field.
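For example (the domain and paths below are placeholders), with the checkbox enabled the scraper stays within the given path:

```
Start URL: https://www.example.com/docs
Only scrape pages starting with url: ✔

https://www.example.com/docs/setup   → scraped
https://www.example.com/blog/news    → skipped
```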
Also referred to as whitelist/blacklist, these settings allow you to control which pages are included or excluded during scraping. For inclusion, specify parts of the URL that must or must not match. For example, to include pages containing "/ai-chatbot" in the URL, you can add "ebbot.com/sv/ai-chatbot" or "/ai-chatbot" to the whitelist. Use "and" to require that all conditions are met, or "or" to allow any condition to be sufficient.
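As an illustration (the URL fragments other than the ebbot.com example are placeholders), combining conditions works like this:

```
Whitelist, "or":  /ai-chatbot, /pricing
→ a page matching either fragment is included

Whitelist, "and": /sv, /ai-chatbot
→ only pages matching both fragments (e.g. ebbot.com/sv/ai-chatbot) are included
```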
This option allows you to remove elements from the HTML by matching them with a query selector. If certain content is consistently undesirable (cookie consent elements for example), specify it here to exclude it from all scraped data. Similar methods apply as with the initial query selector configuration.
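For instance, a typical exclusion list could look like this (the selector names are hypothetical; inspect your own site to find the real ones):

```
#cookie-banner, .cookie-consent, .newsletter-popup
```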
Enabling this setting will stop all JavaScript on the page, which can prevent tracking scripts and cookie banners from loading. However, be cautious with sites that rely on JavaScript for rendering, as disabling it may prevent the page from displaying correctly.
This setting turns off the readability script, allowing the scraper to view the page as a regular visitor would. If the site does not support readability mode, or if readability alters the desired content, consider disabling this feature.
If enabled, the scraper checks the site's sitemap.xml for URLs before starting, which may reveal unlinked or hidden pages. Be mindful of scraping non-public pages and disable this feature if necessary.
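For reference, sitemap.xml files follow the standard sitemap protocol, so the URLs the scraper reads look like this (the addresses are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/unlinked-page</loc></url>
</urlset>
```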
When enabled, the scraper will try to automatically detect and remove repeated content such as headers, footers, cookie banners, and other common elements. However, it's still strongly recommended to manually use query selectors to explicitly exclude headers, footers, and popups for more reliable and accurate results.
Slowmode reduces the number of simultaneous connections to one, decreasing the likelihood of being perceived as a threat by security systems. This mode significantly increases scrape duration. If a lengthy scrape is necessary, consider breaking it into segments using targeted whitelists/blacklists to ensure completion within time limits.
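For example (the paths are placeholders), a long slowmode scrape could be split into smaller runs, each with its own whitelist:

```
Run 1 — whitelist: /products
Run 2 — whitelist: /blog
Run 3 — whitelist: /support
```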
These settings provide comprehensive control over the scraping process, allowing for tailored data collection that respects site constraints and maximizes efficiency.
Some websites may block the scraper, viewing it as hostile due to the speed at which it collects data. Make sure that the scraper is not blocked by whitelisting the scraper's IP address.

Current web scrape IP address: 35.246.173.254
New web scrape IP address: 141.94.175.178
In 2025, the current IP address will be replaced with the new one. To avoid future issues, ensure both IP addresses are whitelisted now, so you don’t need to update the whitelist later.
Using query selectors is one of the easiest ways to improve the quality of your web scraping results. Query selectors allow you to target only the parts of a webpage you actually want, like the main content, and skip over irrelevant sections like headers, footers, or cookie banners.
By adding just a few query selectors, you can reduce the number of scraped documents by 50–70%, making your data cleaner and easier to work with.
Since every website is structured differently, you'll need to figure out which selectors work best for your specific site. To do this, open your site in a browser, right-click on the content you want, and choose Inspect. This will open the Elements panel, where you can find and test different selectors.
When scraping websites, it's helpful to identify the main container element that holds the core content of the page, excluding headers, footers, and popups. Most pages have a wrapper element like this, and selecting it ensures you capture only the relevant content.
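For example, many pages are structured roughly like this (the element and ID names are illustrative and vary from site to site):

```html
<body>
  <header>…</header>    <!-- exclude: navigation, repeated on every page -->
  <div id="main">       <!-- include: the core content wrapper -->
    <article>…</article>
  </div>
  <footer>…</footer>    <!-- exclude: links and legal text, repeated -->
</body>
```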
Once you've identified the element you want to include or exclude, check whether it's using a `class` or an `id`:

For an ID (`id="main"`), use a hashtag (`#`): `#main`

For a class (`class="main"`), use a dot (`.`): `.main`
You can include multiple selectors by separating them with a comma and a space, for example (the selector names are illustrative):
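```
#main, .content, .article-body
```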
If an element has multiple class or ID names separated by spaces (e.g., `id="elementor-container elementor-column"`), you need to replace the space with a dot (`.`).
For example:
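```
id="elementor-container elementor-column"
```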
Should be written in Ebbot as:
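```
.elementor-container.elementor-column
```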
Even though it's an `id`, Ebbot uses a dot (`.`) prefix here because of its custom selector rules. Just remember: replace spaces with dots, and always start with a `.` or `#` based on the rule.
Understanding query selectors is beneficial when configuring this setting. The scraper waits for the specified HTML element to appear before proceeding. To define a selector, inspect the desired element on the web page, identify a unique ID or class, and input it as `#yourId` for IDs or `.yourClass` for classes. You can combine these for more specific targeting (e.g., `#id1.class1.class2`). Test your selector using the browser's console to ensure it selects the intended element correctly.
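For instance, you can check a selector in the browser's console before saving it (the selector below is a placeholder):

```js
// Returns the first matching element, or null if nothing matches
document.querySelector("#id1.class1.class2");

// Returns all matching elements — useful to confirm the selector isn't too broad
document.querySelectorAll("#id1.class1.class2");
```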
For more information on how to use query selectors, see the query selector guide above.
Use this feature to create rules that transform text on all documents from this specific source. See the transformer documentation for more information.