Active Member
November 10, 2020
Hi folks,
I'm a Data Analyst who has stumbled his way into using Power Query as a solution for flattening our platform data into something more useful and tailor made in a relatively automated fashion. I've never bumped into any major issues until now. My workbook has become sluggish to the point of near unusability. Let me try to spell out some details:
I'm extracting some oddly structured data focused around project details along with the projects' financials (Budget, Forecast, Actuals) by month and category. The data is laid out something like this:
Project Number| Project Name | Detail 1 | Detail 2 | ... | Detail 19 | Report Created | Report Type | Month 1 Internal | Month 1 External | Month 1 Equipment | Month 2 Internal | Month 2 External | Month 2 Equipment | ... | Month 12 Internal | Month 12 External | Month 12 Equipment
Currently, this is 101 Columns x 2341 rows, or 236,441 cells.
Query "Prep" is used first to grab and manipulate the raw data. This query moves in the typical rapid pace. I use it to do a variety of transformations, including:
- Unpivot the 84 month fields (this report uses the cost categories, but not the month spread)
- Date Calcs based around the Report Created field
- Determine whether certain month data is to be included or excluded (i.e. Actuals for future months won't be included, Forecast for past months won't be included)
- Concatenate Report Type and Cost Category attributes into full labels such as "Budget: Internal" or "Actuals: Equipment"
- I end with a Pivot/Unpivot back to back to collapse separate month's data into single rows
This produces data at 17 Columns x 14218 rows, or 241,706 cells.
Next I have a fork where two separate queries ("Project Level" and "Program Level") are referencing "Prep" and performing mirrored calculations, but at differing levels of detail. These queries seem to be where the problems start. Making the smallest changes such as renaming a step causes the query window to seize and process for 2+ minutes. Here's a quick rundown of what these queries are doing:
- Pivot the labels created in Prep
- Calculate several "Total" columns for shared Report Types (i.e. Budget: Internal + Budget: External + Budget: Equipment = Budget Total)
- Create a constant label for the level of Detail ("Project" for Project Level, "Program" for "Program Level")
- Calculate a few differences and percentages such as "Percent of Budget remaining" or "Project Budget Variance"
- Calculate a label based on the size of percentage variance (i.e. if [Project Budget Variance %] > 0.1 then "Over Budget")
- If I had to speculate where my problem is coming from, this could be it. I was dealing with a null data issue, so this is how this calc looks:
- = Table.AddColumn(#"Changed Type2", "Custom", each if [Percent Of.1] = null then "Incalculable"
else if Number.Abs([Percent Of.1]) < .1 then "On Budget"
else if [Percent Of.1] > .1 then "Over Budget"
else if [Percent Of.1] < -.1 then "Under Budget"
else "Other")
- = Table.AddColumn(#"Changed Type2", "Custom", each if [Percent Of.1] = null then "Incalculable"
- If I had to speculate where my problem is coming from, this could be it. I was dealing with a null data issue, so this is how this calc looks:
Finally I take those two queries and Append them to one another to create an illusion that each set of Projects has a subtotal row underneath it at the Program level.
So, with all of that being said, does anyone have ideas of where I may be having issues? If posting the M code from one or more of my queries would help, I'd be happy to do that. Thanks in advance for any and all help! 🙂
Active Member
November 10, 2020
As Phil requested, I'm uploading a file demonstrating the data and queries. I took a subset of the data, and tried to mask any important details with silly labels. The data subset as pulled from our system is on sheet 1, the first prep results on sheet 2, the forked levels of detail on 3a and 3b, and the final blended result on sheet 4.
Naturally, now that there is substantially less data (13 projects of the original 440), the queries seem to be speeding through without a hitch.
In any case, I'm still having troubles with the full data set, so any advice people have for optimizing or improving my queries would be greatly appreciated!
Active Member
November 10, 2020
So I realize there hasn't been real discussion of this to speak of, but I seem to have resolved the issue myself and thought I should share. I began to realize I had a lot of attribute columns that were static throughout 95% of the transformative process, and guessed that PowerQuery may be spending resources reading and reading these unnecessarily. So I stripped my source data sheet down to the bare essentials.
I then used a SQL query to pull the Project Detail data I had removed and joined it into my Project Level and Program Level queries as late as possible.
This shift has reduced my cumulative query time by roughly 90% (~10 minutes down to ~1 minute).
July 16, 2010
Glad you found some efficiency gains, Jason and thanks for reporting back. However, something concerns me with what you describe. You say:
"I stripped my source data sheet down to the bare essentials. I then used a SQL query to pull the Project Detail data I had removed and joined it into my Project Level and Program Level queries as late as possible."
Where is the data sheet? The term data sheet implies it's in an Excel, CSV or Text file, but you say you're getting it with a SQL query, so I'm puzzled.
If your data is in a SQL database then you should use Power Query to get the data and at the first step, remove the columns you don't need. This will have the same effect as removing them from the source data.
Also, you should avoid writing your own SQL query and pasting that in the Advanced dialog box in Power Query's Get SQL Data window. The reason for this is that when you write your own SQL query, Power Query can no longer fold the processing back to the SQL database. Instead, Power Query must do all the work, and often it's way less efficient than a powerful SQL database.
I hope that all makes sense and helps you further optimise your query.
Mynda
1 Guest(s)