My last post looked at web scraping using VBA, where I loaded a single page and extracted content.
In this post I'm looking at loading multiple pages from a site and getting the content I want from each page.
Let's say I want to get the latest Excel articles from the Office 365 blog.
This is the first page of the site.
There's a header image followed by the blog posts. Some have an associated image, but all have a title linking to the main post, and a summary.
Below the posts there is navigation to access the next page or go direct to a particular page.
How To Load Multiple Pages
By creating an instance of Internet Explorer from VBA, you can navigate to whatever URL you like.
Create IE Object
Navigate to a Page
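These two steps can be sketched as follows. This is a minimal example using late binding, so no reference to Microsoft Internet Controls is needed; the variable name IE is just illustrative:

```vba
' Create an instance of Internet Explorer (late binding,
' so no library reference is required)
Dim IE As Object
Set IE = CreateObject("InternetExplorer.Application")

' Make IE visible so you can watch it load each page
IE.Visible = True

' Navigate to the blog's home page and wait until it has loaded
IE.navigate "https://www.microsoft.com/en-us/microsoft-365/blog/excel/"
Do While IE.Busy Or IE.readyState <> 4   ' 4 = READYSTATE_COMPLETE
    DoEvents
Loop
```

Waiting on Busy and readyState before touching the document is important, otherwise the page may not be fully loaded when you try to read it.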
To access multiple pages I need to find out what the URL structure is for each page.
The URL for the home page is https://www.microsoft.com/en-us/microsoft-365/blog/excel/
The URL for the 2nd page is https://www.microsoft.com/en-us/microsoft-365/blog/excel/page/2/ so to access a page you need to add /page/x/ to the end of the main URL, where x is the page number you want.
Knowing this URL structure it's now easy to write code to page through however many pages you want.
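As a sketch, building each page's URL is just string concatenation. Here I assume we want the first three pages; baseURL and pageNum are my own illustrative names:

```vba
Dim baseURL As String
Dim pageURL As String
Dim pageNum As Long

baseURL = "https://www.microsoft.com/en-us/microsoft-365/blog/excel/"

For pageNum = 1 To 3
    If pageNum = 1 Then
        pageURL = baseURL                            ' first page has no /page/x/
    Else
        pageURL = baseURL & "page/" & pageNum & "/"  ' e.g. .../page/2/
    End If

    IE.navigate pageURL
    Do While IE.Busy Or IE.readyState <> 4
        DoEvents
    Loop

    ' ... extract content from this page here ...
Next pageNum
```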
I've written the code so that IE is visible as it loads different pages. It's good to see it working so you know it's doing what it should.
If you want to hide IE, change True to False in this line.
What Are We Extracting From the Pages?
I want to get the post title, and the link to that post. To do this I need to be able to find them in the HTML code that makes up the page.
Right click on a post title and choose Inspect element - or whatever similar wording is in your browser's menu.
This will open your browser's inspection window and will highlight the element you want to inspect - the blog title.
I can see that the blog title is an H3 - a type of heading - with the CSS class "c-heading". This H3 heading contains the <a> link to the blog post itself.
Knowing this I can write my VBA to extract all H3 headings
and then look for those with the class "c-heading".
Then get the <a> link from inside that H3 heading.
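Put together, those three steps might look like this. The class check and the assumption that the first <a> inside the heading is the post link are based on the page markup described above:

```vba
Dim hTags As Object
Dim h As Object
Dim lnk As Object

' Get every H3 heading on the page
Set hTags = IE.document.getElementsByTagName("h3")

For Each h In hTags
    ' Keep only the headings with the "c-heading" class
    If InStr(h.className, "c-heading") > 0 Then
        ' The first <a> inside the heading is the link to the post
        Set lnk = h.getElementsByTagName("a")(0)
        Debug.Print lnk.innerText, lnk.href
    End If
Next h
```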
Now I write this out to the sheet as a hyperlink, and loop through all the links doing the same thing.
When I get to the last one I load the next page, and repeat the process.
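Putting the pieces together, the write-and-repeat part can be sketched like this, with a running row counter r and Hyperlinks.Add to create clickable links (the variable names are illustrative):

```vba
Dim h As Object
Dim lnk As Object
Dim pageNum As Long
Dim r As Long
r = 1

For pageNum = 1 To 3
    ' (navigate to page pageNum and wait for it to load, as above)

    For Each h In IE.document.getElementsByTagName("h3")
        If InStr(h.className, "c-heading") > 0 Then
            Set lnk = h.getElementsByTagName("a")(0)
            ' Write the post title as a clickable hyperlink in column A
            ActiveSheet.Hyperlinks.Add _
                Anchor:=ActiveSheet.Cells(r, 1), _
                Address:=lnk.href, _
                TextToDisplay:=lnk.innerText
            r = r + 1
        End If
    Next h
Next pageNum
```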
Your sheet should now contain a list of hyperlinks to the blog posts.
Websites that have paged content should have a consistent URL structure for each page, so it's just a case of finding that pattern and then writing your VBA to load each page.