New Member
February 11, 2022
I found your video "Scrape Data from Multiple Web Pages with Power Query" and found it very helpful. After following the steps, I realized the table I was getting contained duplicates. I assumed I had done it wrong, but after trying it several ways I realized that the website orders its listings randomly on each request, so going from page 1 to 2, 3, etc. resets the random order. If it were just duplicates, that would be easy to fix, but each duplicate row in the query takes the place of a record that the random ordering never returned.
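To illustrate why a reshuffle on every page request produces both duplicates and gaps, here is a minimal simulation in Python. It is only a sketch of the behaviour described above, not the site's actual logic: the listing names, the page size of 10, and the `fetch_page` / `scrape_all` helpers are all made up for the example.

```python
import random


def fetch_page(listings, page, page_size, rng):
    """Simulate one request to a site that reshuffles its listings every time."""
    shuffled = listings[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)           # new random order for this request only
    start = (page - 1) * page_size
    return shuffled[start:start + page_size]


def scrape_all(listings, page_size, seed=0):
    """Request pages 1..N in turn and append every row returned."""
    rng = random.Random(seed)       # seeded so the run is repeatable
    pages = -(-len(listings) // page_size)  # ceiling division
    rows = []
    for page in range(1, pages + 1):
        rows.extend(fetch_page(listings, page, page_size, rng))
    return rows


# Hypothetical data: 30 listings, 10 per page.
listings = [f"inspector-{i}" for i in range(1, 31)]
rows = scrape_all(listings, page_size=10)

unique = set(rows)
duplicates = len(rows) - len(unique)      # rows that repeat an earlier row
missing = len(listings) - len(unique)     # listings that never showed up
print(duplicates, missing)
```

Because the scrape still returns exactly 30 rows, the two counts printed are always equal: every duplicate row occupies the slot of a listing that was never returned.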
URL: (PageStart variable in red)
Example 1: https://www.nachi.org/certifie.....e/us?page=1
Example 2: https://www.nachi.org/certifie.....browse/us/florida?page=1
The attached file has Example 1, which is all listings for the United States across 100+ pages. I started scraping Example 2 because searching for the whole country doesn't show which state they're in. I included Example 1 in the file because it's less complicated and has the same duplicate problem. I tried looking at the source code of the page for help figuring out what might be causing the problem. I can see that it might be using a cookie to change "no-js" or "has-js." I don't know much about coding but that might be the trigger that resets the random order. That's my best guess, any help would be greatly appreciated!
EDIT: After uploading I noticed I mistyped the source URL in the function. I originally had "https://www.nachi.org/certified-inspectors/browse/us?page="&PageStart&"1" but should have removed the 1 on the end, leaving &PageStart&"". The duplicate problem was the same either way; the original version just requested pages 11, 21, 31, etc. instead of 1, 2, 3...
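The effect of the stray "1" can be shown with plain string concatenation. This is a small Python sketch of the same logic rather than the M code in the attached file; the function names are made up for the example.

```python
BASE = "https://www.nachi.org/certified-inspectors/browse/us?page="


def url_with_stray_suffix(page_start):
    # Mirrors "...?page=" & PageStart & "1": the "1" is appended as text,
    # so page 2 becomes "21" rather than "2".
    return BASE + str(page_start) + "1"


def url_fixed(page_start):
    # Mirrors "...?page=" & PageStart & "": nothing is appended.
    return BASE + str(page_start)


for n in (1, 2, 3):
    # Show only the page number that ends up in each URL.
    print(url_with_stray_suffix(n).split("=")[-1], url_fixed(n).split("=")[-1])
```

The first column prints 11, 21, 31 and the second prints 1, 2, 3.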
Attachments:
Capture.PNG - screenshot of website source code
Capture2.PNG - duplicates in the table coming from different pages
Web Scraping test.pbix - sample file/data
October 5, 2010
Hi Mike,
That website isn't returning the results correctly. It shouldn't be duplicating results like that. There's not much we can do to prevent this, as all the code is doing is querying the website.
All you need to do is Remove Duplicates on the Member URL column.
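For reference, removing duplicates on one column amounts to keeping the first row seen for each value in that column. Below is a minimal Python sketch of that logic, not Power Query code; the sample rows and the `remove_duplicates` helper are made up for the example. In Power Query itself, the Remove Duplicates command adds a `Table.Distinct` step to the query.

```python
def remove_duplicates(rows, key="Member URL"):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


# Hypothetical scraped rows; "/member-a" was returned by two different pages.
rows = [
    {"Name": "Inspector A", "Member URL": "/member-a"},
    {"Name": "Inspector B", "Member URL": "/member-b"},
    {"Name": "Inspector A", "Member URL": "/member-a"},
    {"Name": "Inspector C", "Member URL": "/member-c"},
]

deduped = remove_duplicates(rows)
print([row["Member URL"] for row in deduped])
```

This only drops repeated rows. It does not bring back listings that no page request returned.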
Regards
Phil