Scraping with paging
Today we are going to talk about scraping for the billionth time. Yes, I know I lack originality, but for some reason I feel an urge to scrape just about anything.
Anyway, back to our subject. How does one go about scraping multiple pages in a row without having to go page by page? It's very simple.
The use case
For the sake of the example, let’s say we want to scrape all the products of a store with 40 pages.
Here are the steps needed to do so:
- Retrieve all the URLs you need to scrape.
- Scrape all those URLs.
Retrieving the URLs we need to scrape
Most websites are built the same way, so let’s say you want to get the URLs of all the products. Here’s how to do it.
Let’s say we want to scrape this website. Notice that when we go to the second page, an additional parameter appears at the end of the URL: ?p=2. So how do you know how many pages there are? There could be a thousand, and you are not going to click a thousand times on the pagination widget to find out.
Many sites are built so that if you pass a page number far larger than the actual number of pages, they fall back to the last page and return its index.
So if you put p=100000 at the end of our example URL, it will return p=40, which is the last page.
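If you would rather not type the URL by hand, the trick above can be sketched in code. This assumes the site redirects an out-of-range page to its last page, so that the final URL carries the real page number. The function names and the example.com address are made up for illustration, and fetch is the built-in one from Node 18+ or the browser.

```javascript
// Read the page number out of a URL such as https://example.com/products?p=40.
function pageFromUrl(url) {
  const value = new URL(url).searchParams.get("p");
  const page = Number.parseInt(value, 10);
  if (Number.isNaN(page)) throw new Error(`No page number in ${url}`);
  return page;
}

// Ask for an absurdly high page number and see where we end up.
// Assumption: the site redirects to its last page.
// fetchImpl can be swapped out, which makes the function easy to test offline.
async function findLastPage(baseUrl, fetchImpl = fetch) {
  const response = await fetchImpl(`${baseUrl}?p=100000`);
  // response.url is the final URL after any redirects.
  return pageFromUrl(response.url);
}
```

If the site serves the last page without changing the URL, this will not work as is, and you would have to read the page index from the response body instead.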
Now that you know how many pages there are in total, simply build your array of URLs, one for each page from ?p=1 to ?p=40.
There you have all the URLs you need.
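Here is a minimal sketch of building that array. The base URL is a placeholder, since the real store address is not given here, and buildPageUrls is just a name picked for the example.

```javascript
// Hypothetical base URL: replace it with the listing page of the store you scrape.
const BASE_URL = "https://example.com/products";

// Build one URL per page: ?p=1, ?p=2, ... up to ?p=totalPages.
function buildPageUrls(baseUrl, totalPages) {
  return Array.from({ length: totalPages }, (_, i) => `${baseUrl}?p=${i + 1}`);
}

const urls = buildPageUrls(BASE_URL, 40);
console.log(urls.length); // 40
console.log(urls[1]); // https://example.com/products?p=2
```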
Let’s scrape
Now that we have all the URLs we want, we need to scrape them one by one and write the results into a single big JSON file.
For this example we’ll retrieve, for each perfume, the brand, name, description, product URL, price and image.
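To make the shape of one record concrete, here is a small sketch. It assumes the raw values have already been pulled out of the page by whatever HTML parser you use. The raw field names (href, imageSrc and so on) are invented for the example and are not taken from any real site.

```javascript
// Turn one raw item into the record we want to store.
// raw: hypothetical object holding the strings found on the page.
// pageUrl: the listing page the item came from, used to resolve relative links.
function toProduct(raw, pageUrl) {
  return {
    brand: raw.brand.trim(),
    name: raw.name.trim(),
    description: raw.description.trim(),
    // new URL(relative, base) turns "/p/1" into a full address.
    url: new URL(raw.href, pageUrl).href,
    price: raw.price.trim(),
    image: new URL(raw.imageSrc, pageUrl).href,
  };
}
```

Resolving the links against the page URL matters because many sites use relative paths, and a relative path is useless once it sits in a JSON file far away from the page it came from.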
Now we need to use the promise returned by this method and apply it to every URL in our array.
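A sketch of that step, assuming scrapePage is any function that takes a URL and returns a promise of an array of products. Both scrapePage and scrapeAll are names chosen for this example.

```javascript
// Run the scraping function on every URL and merge the results.
// scrapePage: stand-in for your own function, (url) => Promise<Array>.
async function scrapeAll(urls, scrapePage) {
  // Promise.all waits for every page and keeps the results in the order of urls.
  const perPage = await Promise.all(urls.map((url) => scrapePage(url)));
  // Each page gives an array of products, so flatten them into one list.
  return perPage.flat();
}
```

Keep in mind that Promise.all fires all the requests at once. For 40 pages that is usually fine, but for a much bigger site you may prefer to go through the URLs in small batches to stay polite with the server.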
And that’s it.
So here is the code all in one:
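What follows is a compact sketch of the whole flow and not the exact original script. It builds the page URLs, scrapes each of them and writes everything into one JSON file. scrapePage is again a stand-in for your own page-scraping function, and scrapeStore, the base URL and the file name are placeholders.

```javascript
const fs = require("fs").promises;

async function scrapeStore(baseUrl, totalPages, scrapePage, outFile) {
  // Step 1: one URL per page, from ?p=1 to ?p=totalPages.
  const urls = Array.from({ length: totalPages }, (_, i) => `${baseUrl}?p=${i + 1}`);

  // Step 2: scrape every page; results come back in page order.
  const perPage = await Promise.all(urls.map((url) => scrapePage(url)));
  const products = perPage.flat();

  // Step 3: write everything into a single JSON file.
  await fs.writeFile(outFile, JSON.stringify(products, null, 2), "utf8");
  return products.length;
}

// Example call, with placeholder values:
// scrapeStore("https://example.com/products", 40, scrapePage, "products.json")
//   .then((count) => console.log(`Saved ${count} products`));
```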
So why would you do this? Well, for instance, if you are building your own API and need data that no existing API provides, this comes in very handy.
Hope it helps. If you have any questions, comment down below.
If you have any additional questions or need assistance, don’t hesitate to shoot me an email at johnmeguira@gmail.com.