Extracting URL from the Webpage from multiple HTML tables

(@rivthebest)
Posts: 1
New Member
Topic starter
 

Hi,

Here is the scenario where I am stuck; I would appreciate your advice.

Here is the GitHub link containing category wise public API links:

https://github.com/public-apis/public-apis#test-data

It contains around 51 categories, listed as an index at the top. As you scroll down the page, you will find that each category is presented as an HTML table.

My objective is to fetch every API URL under each category, along with the other table information, and collate everything into a single table.

To accomplish this I chose Power Query and tried it in both Office 365 (Excel) and Power BI Desktop.

The challenge I faced while executing the task:

Step 1: Using the GitHub link above, I got as far as this:

[Image Can Not Be Found]

 

Now, if I expand the table, I can capture everything except the link for each API in each category. I then tried to use the following code snippet, from Chris Webb's blog, as an intermediate step to fetch the URL of each API in each category.
Chris Webb's Blog link: Chris Webb's BI Blog: Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M C...

After the relevant modifications, the code is:

"Added Custom" = Table.AddColumn(Html.Table(Source, {{"Links", "a[href^=""http""]", each [Attributes][href]}}))

and later this, following this blog:

Power Query – how to simply get hyperlinks from webpages – Trainings, consultancy, tutorials

#"Added Custom" = Table.AddColumn(Source, {{"API_URL", ":nth-last-child(155) > TBODY > TR > :nth-child(1) > A[rel=""nofollow""]:nth-child(1):nth-last-child(1)", each [Attributes][href]?}}, [RowSelector="TABLE:nth-child(20) > TBODY > TR"])

In neither case was I able to achieve the desired goal.

 
I also tried the Add Table Using Examples option in Power BI's Power Query Navigator interface. Here is the screenshot.
 
[Image Can Not Be Found]
 
 
But the problem is that it captures only one category, that is, one HTML table, not all the tables.
In this case too, I was not able to add the URL column to the rest of the data.

Please let me know if you need any other information to recreate the steps.

 
I am certain that I am making some terrible mistake or overlooking something.
 
Please help me out.
 
Regards
Ritabrata Bhattacharya
 
Posted : 02/09/2022 2:10 pm
(@catalinb)
Posts: 1937
Member Admin
 

Hi Ritabrata,

It is impossible to debug what you did without your test file. Can you upload a sample file that reproduces the error mentioned?

 
Posted : 06/09/2022 5:02 am
Philip Treacy
(@philipt)
Posts: 1632
Member Admin
 

Hi Ritabrata,

Html.Table only works in Power BI, and it requires a CSS selector to get the data from the web page. The tables on that GitHub page have no CSS classes to identify them, so Html.Table can't be used.
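
For reference, here is a minimal sketch of what a well-formed Html.Table call looks like when a page does offer something to select on. The HTML fragment, the class name "api" and the column names are all made up for illustration; they are not taken from the GitHub page. The third item in the "URL" pair is a function that reads the href attribute instead of the element text.

```
let
    // Hypothetical HTML fragment standing in for a page (assumption, not the GitHub page)
    Source = "<table class=""api""><tr><td><a href=""https://example.com/a"">Alpha</a></td><td>First API</td></tr><tr><td><a href=""https://example.com/b"">Beta</a></td><td>Second API</td></tr></table>",
    Result = Html.Table(
        Source,
        {
            // {column name, CSS selector} pairs, evaluated relative to each row
            {"Name", "td:nth-child(1) a"},
            {"Description", "td:nth-child(2)"},
            // Optional third item: a function applied to the matched element
            {"URL", "td:nth-child(1) a", each [Attributes][href]}
        },
        // One output row per element matched by RowSelector
        [RowSelector = "table.api tr"]
    )
in
    Result
```

Note that Html.Table is called directly on the page content; it is not wrapped in Table.AddColumn.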

The code in the file below grabs the entire page and then, using functions like Text.BetweenDelimiters, extracts the following data:

api table
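
A minimal sketch of the delimiter idea, assuming a simplified row layout; this is not the code from the attached file. The HTML fragment, the step names and the column names are hypothetical. For a live page, the text could come from something like Text.FromBinary(Web.Contents(url)), and the delimiters would need adjusting to the real markup.

```
let
    // Hypothetical fragment standing in for the downloaded page text (assumption)
    Page = "<tr><td><a href=""https://example.com/a"">Alpha</a></td></tr><tr><td><a href=""https://example.com/b"">Beta</a></td></tr>",
    // Split on the row opening tag; the first piece is whatever precedes the first row
    Rows = List.Skip(Text.Split(Page, "<tr>"), 1),
    // Pull the link text and the href out of one row using delimiters
    ToRecord = (row as text) as record =>
        [
            API = Text.BetweenDelimiters(row, """>", "</a>"),
            URL = Text.BetweenDelimiters(row, "href=""", """")
        ],
    Result = Table.FromRecords(List.Transform(Rows, ToRecord))
in
    Result
```

This approach relies on the markup being regular, so it may need extra cleanup steps if rows differ in structure.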

Regards

Phil

 
Posted : 06/09/2022 10:01 pm