
Nodes: Scrapper

Written by Yohan
Updated this week

The scrapper data source retrieves data from the client page, using the link attribute from the input. All retrieved data is then added to and transformed by the mappings.

Scraped data is added to the input data, or replaces existing attributes when the same fields are used.
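
For example (the field names here are illustrative only): if an input product already has a description attribute and a selector is mapped to description, the scraped value replaces it; a selector mapped to a new field such as warranty is simply added to the product.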

Create a Scrapper Data source

  • In the Dataflow section, click on Add node

  • Choose External API, then Web scrapper

Link one input node to provide data to the scrapper.

Configure a Scrapper API Data source

From your Scrapper node, click on Config to access the settings.

Selectors

Use CSS selectors to fetch elements from the product page. Each selector you add creates a field in your scrapper dataset. Use the right configuration for each field (see the example after the output list below).

Output is the value returned by the scrapper for the element fetched through the selector:

  • html: returns the inner HTML

  • outerHtml: returns the outer HTML

  • attribute: returns the value of the attribute named in the "Output attribute" input

  • text: returns the inner text of all child elements

  • httpStatusCode: a special output that returns the HTTP status code of the scraped page response
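
To make the outputs concrete, here is a minimal TypeScript browser-console sketch. The HTML fragment, the .price span selector and the data-currency attribute are hypothetical examples, not part of the product; each DOM property shown corresponds to one scrapper output type.

  // Hypothetical page fragment (illustration only):
  //   <div class="price"><span data-currency="EUR">42,90 €</span></div>
  // For the selector ".price span", each output roughly corresponds to:
  const el = document.querySelector(".price span")!;
  const html = el.innerHTML;                      // html output      -> "42,90 €"
  const outer = el.outerHTML;                     // outerHtml output -> <span data-currency="EUR">42,90 €</span>
  const attr = el.getAttribute("data-currency");  // attribute output -> "EUR" (with Output attribute = "data-currency")
  const text = el.textContent;                    // text output      -> "42,90 €"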

Scheduling settings

Scrapper job start

Your scrapper data source can be scheduled using the cron scheduler.

The scrapper is also triggered when the input node completes. To disable this behavior, disable the auto execute option on the source link.
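
For example, a cron expression such as 0 6 * * * (an illustrative schedule, not a default) starts the scrapper every day at 06:00.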

Scrapper job end

The scrapper execution runs until all products in the input node are processed. The next nodes are then executed, unless the auto execute option is disabled.

Products limit

If you want to execute the scrapper in batches, you can use the products limit parameter.

When the products limit is reached, the job will be snoozed until the next snooze schedule.

The same job will then resume until the limit is reached again or all products are processed.
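
For example (the numbers are illustrative): with 10,000 input products and a products limit of 2,500, each run processes 2,500 products and then snoozes; four runs are needed to cover the whole feed.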

Delay settings

By default, a new request is sent as soon as the previous one completes.

To reduce the request rate on the customer website, you can configure a delay to wait between each request.

When used with parallelism > 1, the same delay is used across all parallel threads.

Parallelism settings

By default, the scrapper uses a single thread to send requests sequentially.

To increase the scrapper speed when requests take too long to complete, you can increase the parallelism to send multiple requests at the same time.

When used with delay > 0, all threads use the same delay settings (see the sketch below).
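
Here is a minimal TypeScript sketch of this behavior (not the actual Dataiads implementation; scrapeAll, the use of fetch and the parameter names are illustrative): each of the parallel workers sends one request at a time and waits the configured delay between its own requests.

  const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

  async function scrapeAll(urls: string[], parallelism: number, delayMs: number): Promise<void> {
    const queue = [...urls];
    const worker = async () => {
      while (queue.length > 0) {
        const url = queue.shift();
        if (!url) break;
        await fetch(url);     // one request at a time per worker
        await sleep(delayMs); // the same delay applies in every worker
      }
    };
    // Start `parallelism` workers that drain the queue concurrently.
    await Promise.all(Array.from({ length: parallelism }, () => worker()));
  }

With parallelism = 1 and delayMs = 0, this reduces to the default behavior of sending a new request as soon as the previous one completes.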

Advanced settings

These settings affect how the Dataiads bot crawls your products.


Browser mode

A special crawling mode that fetches the product page after it has been rendered. Use this mode only when there is no other way to scrape the page.

User Agent

Always keep the default value, DataiadsBot.

HTTP headers

Additional headers to send with your requests.

To receive your data as JSON output, don't forget to add: Accept: application/json

Query params

Parameters to add to your endpoint URL (see the example below).
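
As an illustration, this is roughly the request that such a configuration would produce. The URL and the currency parameter are hypothetical, and the scrapper does not literally use fetch; the call below only shows the resulting URL and headers.

  // Equivalent request with one query param and the headers above:
  await fetch("https://shop.example.com/product/123?currency=EUR", {
    headers: {
      "Accept": "application/json",
      "User-Agent": "DataiadsBot",
    },
  });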

Testing your products

Once your setup is finished, you can test your products to check the scraped data without saving the configuration.

Warning - an input node must be connected in order to have input products for the test.



Optionally, you can provide a product ID to test a specific product from your input feed. If the product is not from the feed, the response will be null.

Finally, you can save your settings to commit your configuration.

Request rate formula

The request rate is approximately parallelism / (delay + response time), with delay and response time expressed in seconds.

For example, with a 100 ms response time, parallelism 2 and a 500 ms delay:

req rate = 2 / (0.5 + 0.1) ≈ 3.3 req / sec

