The scrapper data source lets you retrieve data from the client page, using the link attribute from the input; all retrieved data is added and transformed according to the mappings.
Scrapped data is added to the input data, or replaces existing attributes when the same fields are used.
Create a Scrapper Data source
Link one input node to provide data to the scrapper
Configure a Scrapper API Data source
From your Scrapper node, click on Config to access the settings
Selectors
Use CSS selectors to fetch elements from the product page. Adding a selector creates a field in your scrapper dataset. Use the right configuration depending on your field.
Output is the value returned by the scrapper after fetching through the selector:
html: returns the inner HTML
outerHtml: returns the outer HTML
attribute: returns the value of the attribute set in the "Output attribute" input
text: returns the inner text of all children elements
httpStatusCode: a special output that returns the HTTP status code of the scrapped page response
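The sketch below illustrates how each output type behaves; it is not the Dataiads implementation, and it assumes a BeautifulSoup-style CSS selection over the fetched page.

```python
# Illustration only: how each selector output behaves, assuming a
# BeautifulSoup-style CSS selection (not the Dataiads implementation).
import requests
from bs4 import BeautifulSoup

def scrape_field(url, selector, output, output_attribute=None):
    response = requests.get(url)
    if output == "httpStatusCode":       # special output: status code of the page response
        return response.status_code
    element = BeautifulSoup(response.text, "html.parser").select_one(selector)
    if element is None:
        return None
    if output == "html":                 # inner HTML of the matched element
        return element.decode_contents()
    if output == "outerHtml":            # the element including its own tag
        return str(element)
    if output == "attribute":            # value of the configured "Output attribute"
        return element.get(output_attribute)
    if output == "text":                 # inner text of all children elements
        return element.get_text(strip=True)

# e.g. scrape_field("https://example.com/p/123", "span.price", "text")
```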
Scheduling settings
Scrapper job start
Your scrapper data source can be scheduled using the cron scheduler.
The scrapper is also triggered when the input node completes. To disable this behavior, disable the auto execute option on the source link.
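For instance, assuming the standard 5-field cron syntax, a schedule such as 0 3 * * * would start the scrapper every day at 03:00.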
Scrapper job end
The scrapper execution runs until all products in the input node are processed. The next nodes are then executed, unless the auto execute option is disabled.
Products limit
If you want to execute the scrapper in batches, you can use the products limit parameter.
When the products limit is reached, the job is snoozed until the next snooze schedule.
The same job then resumes until the limit is reached again or all products are processed.
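As a rough illustration (not the actual job runner), the batching behaviour can be pictured as below: each run processes at most the configured limit, then the job snoozes and the next run resumes where the previous one stopped. The products_limit name and scrape stub are placeholders.

```python
# Rough illustration of the batching behaviour; not the actual job runner.
def scrape(product):
    print(f"scraping {product}")             # stand-in for the real per-product scrape

def run_batch(products, cursor, products_limit):
    end = min(cursor + products_limit, len(products))
    for product in products[cursor:end]:
        scrape(product)
    return end, end >= len(products)          # new cursor, and whether all products are done

products = [f"sku-{i}" for i in range(10)]
cursor, done = 0, False
while not done:                               # in practice each turn is a separate scheduled run
    cursor, done = run_batch(products, cursor, products_limit=4)
```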
Delay settings
By default, a new request is sent as soon as the previous one completes.
To reduce the request rate on the customer website, you can configure a delay so the scrapper waits between every single request.
When used with parallelism > 1, the same delay is used across all parallel threads (see the sketch after the parallelism settings below).
Parallelism settings
By default the scrapper uses a single thread to send requests sequentially.
To increase the scrapper speed when requests take too long to complete, you can increase the parallelism to send multiple requests at the same time.
When used with delay > 0, all threads use the same delay settings.
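The sketch below shows one way the two settings can combine: each worker thread waits for the configured delay between its own requests, and several workers run at the same time. Names and values are illustrative, not the Dataiads scheduler.

```python
# Illustration of how delay and parallelism combine; not the Dataiads scheduler.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    time.sleep(0.1)                          # stand-in for an HTTP request (~100 ms response time)
    return url

def worker(urls, delay):
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)                    # per-thread delay between consecutive requests
    return results

def run(urls, parallelism=2, delay=0.5):
    chunks = [urls[i::parallelism] for i in range(parallelism)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return [r for chunk in pool.map(worker, chunks, [delay] * parallelism) for r in chunk]

# run([f"https://example.com/p/{i}" for i in range(8)])
```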
Advanced settings
These settings affect how the Dataiads bot crawls the products.
Browser mode
A special crawling mode that fetches the product page after it has been rendered. Use this mode only when there is no other way to scrap the page.
User Agent
Always keep the default DataiadsBot user agent
HTTP headers
Headers to add to your request
To receive your data as JSON output, don't forget to add: Accept: application/json
Query params
Parameters to add to your endpoint URL
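Put together, a configured request roughly behaves like the example below; the endpoint URL and parameter names are hypothetical.

```python
# Hypothetical endpoint and parameter names, for illustration only.
import requests

response = requests.get(
    "https://example.com/api/products/123",      # endpoint URL from the input link attribute
    headers={"Accept": "application/json"},      # HTTP headers setting
    params={"lang": "en", "currency": "EUR"},    # Query params setting, appended to the URL
)
data = response.json()                           # JSON output thanks to the Accept header
```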
Testing your products
Once your setup is finished, you can test your products to check the scrapped data without saving the configuration.
Warning - the input node must be connected in order to have input products for the test.
Optionally, you can provide a product ID to test a specific product from your input feed. If the product is not in the feed, the response will be null.
Finally, you can save your settings to commit your configuration.
Request rate formula
For example, with a 100 ms response time, parallelism 2 and a 500 ms delay:
req rate = 2 × (0.5 + 0.1) = 1.2 req / sec