Hi all,
I'm an occasional Excel user, but I'd say I'm at an intermediate skill level.
I have Excel Home/Student 2010 on an old Windows laptop.
I'm working with a s/sheet of 810k rows and 20 columns.
Column G holds a category text-string (up to 100 chars long) for each row. There are 362 unique categories across the 810k rows.
I've added a column with a formula that calculates, for each row, how many times that row's category occurs across the whole dataset.
Once that's calculated I can 'de-dupe' on category to reduce the 810k rows down to 362, and the grand total of the subtotals across the 362 adds up to 810k.
The formula I've used is: =COUNTIF($G$2:$G$810001,G2)
My problem is, it takes about 8 hours for this formula to finish calculating across all 810k rows.
I've tried adding a further column with a numerical representation for each of the 362 categories (i.e. numbers 1 to 362) but that isn't any faster.
Am I doing something wrong, or is the COUNTIF function on large datasets always this slow?
Is there another function I could/should use?
N.B. The data is confidential, so I can't attach a copy here, sorry.
Would appreciate any thoughts, advice, comments - let me know if you require any further detail,
regards, Chris (in the UK)
Hi Chris,
Why don't you try using a PivotTable to get the count? Insert a PivotTable with column G in both the Row Labels and the Values area, with the Values area set to Count (it will default to Count if the values in column G are text; just in case they're numbers, you'll have to change the aggregation to Count). This will give you a de-duplicated report of how many times the 362 items in column G occur.
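If you'd rather build it in code, something along these lines should work (a rough sketch only - it assumes your data sheet is named "data", the table starts at A1, and column G is headed "Offence"; adjust those to suit):

Public Sub CountByPivot()
    ' Sketch: build a PivotTable that counts the column G categories.
    ' "data" and "Offence" are assumptions - change to your own names.
    Dim WB As Workbook, ws As Worksheet, pt As PivotTable
    Set WB = ActiveWorkbook
    Set ws = WB.Worksheets.Add
    Set pt = WB.PivotCaches.Create( _
            SourceType:=xlDatabase, _
            SourceData:=WB.Worksheets("data").Range("A1").CurrentRegion) _
            .CreatePivotTable(TableDestination:=ws.Range("A3"))
    With pt
        .PivotFields("Offence").Orientation = xlRowField
        .AddDataField .PivotFields("Offence"), "Count of Offence", xlCount
    End With
End Sub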
Mynda
Hi Mynda, thanks for replying.
Yes, I created a pivot-table as you suggested, and noticed it creates the sub-total counts for each of the 362 categories almost instantly. Which raises the question: if Excel can sort, dedupe, and count the 810k rows that quickly, why does it take 8 hours with the COUNTIF function?
The issue with the pivot-table approach is that I want the counts in a helper-column in my table of 810k rows so that I can then use them in further calculations e.g. to calculate % of each of the 362 category counts based on other factors/columns.
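For example (purely illustrative - column H and the value "Yes" here are placeholders, and I'm assuming the count sits in column J), something like =COUNTIFS($G$2:$G$810001,G2,$H$2:$H$810001,"Yes")/J2 would give the proportion of each category's rows that meet some other condition.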
Since my dataset is static, having calculated the 362 subtotals with COUNTIF, I then copy the column and paste as values so that I can remove the formulas from the s/sheet.
Do you know if the issue with COUNTIF being so slow is due to the very old version of Excel I'm using? i.e. would it be much faster with the current version of Excel?
Is there another function I could use instead of COUNTIF?
Chris.
You could write a macro - a rough sketch is below. Also, if you create a copy of your file and use Find & Replace to change the sensitive contents to something different, could you then attach a copy?
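For instance, a macro along these lines tallies every category in a single pass rather than re-scanning the column 810k times (a sketch only - the sheet name "data", headers in row 1, and output column J are all assumptions):

Public Sub CountCategories()
    ' Sketch: count each column G value once with a Dictionary,
    ' then write the per-row counts to column J.
    Dim ws As Worksheet
    Dim dict As Object
    Dim vals As Variant
    Dim counts() As Variant
    Dim i As Long, lastRow As Long

    Set ws = ActiveWorkbook.Worksheets("data")   ' assumed sheet name
    Set dict = CreateObject("Scripting.Dictionary")
    lastRow = ws.Cells(ws.Rows.Count, "G").End(xlUp).Row
    vals = ws.Range("G2:G" & lastRow).Value

    ' First pass: tally how many times each category occurs.
    For i = 1 To UBound(vals, 1)
        dict(vals(i, 1)) = dict(vals(i, 1)) + 1
    Next i

    ' Second pass: write each row's category count back out.
    ReDim counts(1 To UBound(vals, 1), 1 To 1)
    For i = 1 To UBound(vals, 1)
        counts(i, 1) = dict(vals(i, 1))
    Next i
    ws.Range("J2").Resize(UBound(counts, 1), 1).Value = counts
End Sub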
Hi Hans, thanks for looking at this.
Since my initial posting of this issue, I've discovered that the s/sheet I've been working on is not confidential. The friend who provided me a copy has confirmed he downloaded it from a UK Government Ministry of Justice website, so it's in the public domain.
So I've attached my modified version of the s/sheet here, having removed unnecessary columns - now a total of 11 rather than the original 20.
Column J holds the COUNTIF calculation - as you will see, the formula is currently on the first row only, to be copied down to the rest of the rows, with 'Calculate' then to be run.
Column K holds the values from my previous run of this COUNTIF calculation (the one that took 8 hours); they sum to 809,731, i.e. equal to the number of rows in the dataset.
I stated above that there are 362 unique values in column G, but it's actually 361.
Would be interested to know how long the column J calculation takes for you if you run it, thanks.
P.S. Just noticed there's a limit on attachment size - I can't upload, as my file is 44.3MB.
Hi Chris,
Load your file to Google Drive or OneDrive and then post a link to it.
Phil
Performing 800k calculations, all but 362 of which are repetitions, is not very efficient. I'd suggest you either sort the data first so that you only need to do the count for each of the 362 individual items once, or use GETPIVOTDATA formulas to just pull the total values from your pivot table, as in the example below.
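Something like this, for instance (illustrative only - it assumes the pivot table starts at Sheet1!$A$3, with "Offence" as the row field and a count of a "Years" column as the data field):
=GETPIVOTDATA("Years",Sheet1!$A$3,"Offence",G2)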
Hi Philip - as you can probably tell, I'm somewhat new to all this. But I think I've managed to upload the large s/sheet to Google Drive, here's the link:
If you, Hans, or anyone else could copy the formula down in column J and then time how long it takes to complete all 810k calculations, that would be helpful.
Velouria, thanks for your comment. The GETPIVOTDATA function isn't available in Excel 2010.
With regard to the 362 (I was mistaken, it's 361) unique values, to be clear, each of the 810k rows holds one of these 361 values. Not sure if that makes much difference though,
regards, Chris.
My bad, turns out GETPIVOTDATA is available in my 2010 version of Excel.
I'll take a closer look at that, could be very useful, thanks Velouria.
Hi Chris,
Got the file, will see what I can tell you later today
I wrote a small VBA module that opens the file, calculates the worksheet, and then saves and closes the file:
The results, tracked from within the VBA module:
Open file: 10:11:05
File Opened: 10:11:18
Calculate: 10:11:18
Completed: 10:31:50
Close save: 10:31:50
Done: 10:32:00
The Calculate step is what takes long.
I placed this macro in my Personal Macro Workbook so that it's always available.
The macro:
Public Sub test_ChrisG()
    Dim WB As Workbook
    Const WSname As String = "data"

    ' Open the workbook, logging the time before and after.
    Debug.Print "Open file:", Time
    Set WB = Workbooks.Open(" <full path to your workbook here> Chris G dataset.xlsx")
    Debug.Print "File Opened:", Time
    Debug.Print

    ' Recalculate the data sheet - this is the step being timed.
    Debug.Print "Calculate:", Time
    WB.Worksheets(WSname).Calculate
    Debug.Print "Completed:", Time
    Debug.Print

    ' Save and close the workbook.
    Debug.Print "Close save:", Time
    WB.Close True
    Debug.Print "Done:", Time
    Debug.Print
End Sub
You run the macro while the workbook is closed but Excel is active: select the macro from the Developer tab and run it.
Thanks, Hans.
So you copied my COUNTIF formula in column J down to all rows, then ran the calculation for all of the 810k rows, which took just over 20 mins - that's a lot faster than my 8 hours.
What version of Excel do you have, and what's the summary of your computer spec?
Chris.
I also downloaded the file and created the pivot table as Mynda suggested. Rather than using COUNTIF on 810K rows, I used a VLOOKUP to pick up the count from the pivot table for each Offence. Entered the formula
=VLOOKUP(G2,Sheet1!$A$4:$B$364,2,FALSE)
in J2 and then copied it down. Timed it, and it took less than 3 seconds to fill all 809,731 rows in the data sheet; creating the pivot table itself was done in a matter of seconds as well. Now, I'm on a pretty powerful machine and can't really tell how much that matters.
Give this two-step approach a try and let us know how it goes.
For clarity, Sheet1!$A$4:$B$364 is where the pivot table sits, with Offence in the Row field and Count of Years in the Value field.
Hi Riny. After Velouria suggested the GETPIVOTDATA function, I'd wondered if it was possible to use VLOOKUP on a pivot table - that sounds like an ideal solution, thanks.
I'll give that a go when I have time this evening, interesting to see how long it takes on my very old 'steam powered' setup!
P.S. Not sure about 'Count of Years' in the value field, my calculation doesn't take Year into account. I'm just counting the number of times each of the 361 different offences appears across all 810k rows.
Just drag any column into the Value field and make sure it's being counted and not summed. I took Years as it was the first column in the table.