Previously we looked at scraping data from the web by example, where the website paginated the data. That is, it spread the results over multiple pages. This week I’m going to show you how you can automatically scrape data from multiple web pages with Power Query using a custom function.
This approach works with URLs that have a predictable structure. Looking at the URL from the WorldCat website used in the previous post we can see in the image below there is a reference for the ‘start’ number highlighted in yellow:
This start number is referring to the first record of 10 returned on the page. As we move through the pages of results the start number in the URL increments by 10.
Note: Not all URLs will follow this pattern. Some might use a sequential numbering system that refers to the page number, others might use text strings that relate to different regions etc. Either way, the process is virtually the same.
Download Power BI Desktop File
Enter your email address below to download the sample workbook.
Watch the Video
Steps for Automating Scraping of Data from Multiple Web Pages
There are 4 steps required to scrape data from multiple web pages with Power Query. We did the first step in the previous post where we created a query that returns the first page of results.
In this post we’re going to look at the next 3 steps:
- Convert the query to a function that can be executed for each page of results
- Generate a list of page start numbers
- Execute the function for each page start number
Step 1: Convert Query to a Function
Starting off where we left the last example, we need to go to the Query Editor Home tab and open the Advanced Editor. Here we see the M code Power Query wrote for us.
We’re only interested in the first row that contains the URL for the web page. In this URL is the start number of the first record being returned on the page. You can see it magnified in the image below:
Each page returns 10 results, therefore the URL for the second page will have a start number of 11, and page 3 will have a start number of 21 and so on. We need to replace this hard-keyed value with a variable and convert the query to a function.
To convert the query to a function we add a line of code at the top for the variable name:
(PageStart as text) =>
I’ve called the variable ‘PageStart’. Then we replace the 1 in the URL with the variable name. Notice you need to append it with a double quote and ampersand on either side:
Important: Power Query is case sensitive therefore you must enter your variable name with the case that matches the variable name. e.g. pagestart would return an error.
When you click ‘Close’ on the Advanced Editor you’ll see that the query is now converted to a function. I like to give my functions a name prefixed with ‘fn’ to differentiate them from other queries, as you can see in the Properties below:
Step 2: Generate Page Start Numbers
For this website we need a series of numbers for the variable which we’ll generate using Power Query, but other websites might use text strings. If so, you can import a table that contains your list of text string variables.
Ok, we’ll create a new query to automatically generate the list of page start numbers. Right-click in the queries pane > New Query > Blank Query:
In the formula bar we’ll generate a list of numbers 1 through to 21610 with the following formula:
This will return a list that we can convert to a table: List Tools Transform tab > To Table:
Add a Modulo column to detect every 10th number in the list (remember we only need the page start number and there are 10 records per page):
Filter the Modulo column for rows containing 1:
This will leave us with a list of PageStart numbers in Column1 that we need for our variable:
The Modulo column has done its job, so you can click on the column header and press the Delete key to get rid of it.
Now, double click the header for ‘Column1’ and give it a proper name. I called mine ‘PageStartNumbers’.
We now have 2,161 page start numbers! Loading 2,161 web pages to scrape their data will take a very long time, so for the purpose of this example I’m going to filter them to the first 20 with a Filter for numbers less than 200:
Next, I need to change the data type for the PageStartNumbers column to text, as it’ll be inserted into the URL, which is a text string. Click on the ABC123 icon in the left of the column header > Text:
Now we’re ready to invoke the custom function we created in step 1.
Step 3: Invoke Custom FunctionAdd Column tab > Invoke Custom Function:
At the Invoke Custom Function dialog box select the function from the Function query list and the PageStartNumbers column for the PageStart variable:
Now we can expand the columns: click on the double headed arrow in the fnEysenckBooks column > Expand. Be sure to deselect ‘Use original column name as prefix’:
You should see a list of books with title, author, type and publisher information.
You can delete the ‘PageStartNumbers’ column as this is no longer needed.
One of the great things about Power Query is that it maintains a connection to the web page which enables you to refresh the connection and get updates as needed.