Just recently I had a query with about 15,000 rows in it. One column recorded a 'response' as free text, and I wanted to standardise the responses so I could count positive and negative ones. I created a 'clean' response column by loading my query to a table, copy-pasting the values from the Response column into a new worksheet, removing duplicates, classifying the free text responses, loading this new table back as a new query, and merging the original response column with the new classified column using the de-duplicated column of free text responses.
First question: is this the best way to do this, or is there a smarter way to do it within PQ? I had 155 unique free text responses (spelling variants etc.), so doing a 'replace values' wasn't a viable option.
Second question: when I merged, I found a whole lot of cells that didn't match. I eventually discovered that Excel and PQ were seeing 'unique' in different ways. When I removed duplicates in Excel, it considered 'Support with caveat' and 'support with caveat' to be identical, so it only retained the first; but PQ apparently sees these as two different things and said there was no matching value for 'support with caveat'. It wasn't until the next day, when I had time to really look into it, that I figured out what had happened. Is there a way to handle this kind of situation better so I can avoid this problem next time?
As an aside, having done your PQ course made me the hero of the hour for this work - we were under real time pressure, and I was faced with 115 workbooks with 5 tabs apiece. If I hadn't done your course, we would have been manually aggregating data from all those worksheets. The course paid for itself right there.
Many thanks,
Catherine
Hi Catherine,
Firstly, congratulations on your success with Power Query. I'm so pleased I could help in some way, but you are the one who did the hard work.
Power Query is case sensitive; Excel isn't. That's the reason for the de-dup differences. If you de-dupe in PQ next time, the duplicate removal and the merge will both treat case the same way, so you'll avoid this issue. Just make a copy/reference of the query > select the column you want to remove duplicates from > Home tab > Remove Rows > Remove Duplicates.
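If you'd rather have PQ de-dupe the way Excel does (ignoring case), you can tweak the step in the formula bar. Here's a minimal sketch in M - the table name Responses and the column name Response are placeholders, so swap in your own names:

let
    // Assumed source table - replace with your actual query or table.
    Source = Excel.CurrentWorkbook(){[Name = "Responses"]}[Content],
    // Table.Distinct with Comparer.OrdinalIgnoreCase removes duplicates
    // the way Excel does, treating 'Support with caveat' and
    // 'support with caveat' as the same value.
    Deduped = Table.Distinct(Source, {{"Response", Comparer.OrdinalIgnoreCase}}),
    // Alternatively, add a lower-cased key column to both tables and
    // merge on that instead, so case differences can't break the match.
    WithKey = Table.AddColumn(Deduped, "ResponseKey", each Text.Lower([Response]), type text)
in
    WithKey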
The way you describe cleaning the responses sounds fine and probably what I would have done. Depending on how messy the data is, you could also have tried 'Column from Examples', but it won't work well when your data set requires a lot of examples to pin down the pattern.
Another option is to add a column with an if statement that tests whether a word/phrase is present in the response. You'd use the Text.Contains function to see if a keyword is present in the text. If it is, you assign 'Positive', else 'Negative'.
e.g.
if Text.Contains([response column name], "good") then "Positive" else "Negative"
(the text to search goes first, the keyword second)
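If you need to test several keywords at once, you can wrap Text.Contains in List.AnyTrue. Here's a minimal sketch for a custom column - the [Response] column name and the keyword list are assumptions, so adjust them to your data:

// Custom column formula: flag the row as Positive if the response
// contains any of the keywords, ignoring case; otherwise Negative.
let
    resp = [Response],
    keywords = {"good", "support", "agree"}
in
    if List.AnyTrue(
        List.Transform(
            keywords,
            (kw) => Text.Contains(resp, kw, Comparer.OrdinalIgnoreCase)
        )
    )
    then "Positive"
    else "Negative"

The Comparer.OrdinalIgnoreCase argument makes each keyword match case-insensitively, so the same case-sensitivity issue can't creep back in.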
Mynda
Thank you - it's really helpful to have all those options at my disposal. I'm going to use the Text.Contains option right away on another query!
Thanks for the prompt responses,
Catherine