

October 21, 2023

Hi all,
I'm an occasional Excel user but would say intermediate skill level.
I have Excel Home/Student 2010 on an old Windows laptop.
I'm working with a s/sheet of 810k rows and 20 columns.
Column G holds a category text-string (up to 100 chars long) for each row. There are 362 unique categories across the 810k rows.
I've added a column with a formula to calculate for each row how many times each of the 362 categories occurs.
Once calculated I can 'de-dupe' on category to reduce the 810k rows down to 362, and the grand-total of the subtotals across the 362 add-up to 810k.
The formula I've used is: =COUNTIF($G$2:$G$810001,G2)
My problem is, it takes about 8 hours for this formula to finish calculating across all 810k rows.
I've tried adding a further column with a numerical representation for each of the 362 categories (i.e. numbers 1 to 362) but that isn't any faster.
Am I doing something wrong, or is the COUNTIF function on large datasets always this slow?
Is there another function I could/should use?
N.B. The data is confidential, so I can't attach a copy here, sorry.
Would appreciate any thoughts, advice, comments - let me know if you require any further detail,
regards, Chris (in the UK)


July 16, 2010

Hi Chris,
Why don't you try using a PivotTable to get the count? Insert a PivotTable with column G in the row labels and the values area, with the values area set to count (it will default to count if the values in column G are text, but just in case they're numbers, you will have to change the aggregation to count). This will give you a de-duplicated report of how many times the 362 items in column G occur.
Mynda


October 21, 2023

Hi Mynda, thanks for replying.
Yes I created a pivot-table as you suggested, and noticed it creates the sub-total counts for each of the 362 categories almost instantly. Which begs the question: if Excel can sort, dedupe, and count the 810k rows that quickly, why does it take 8 hours with the COUNTIF function?
The issue with the pivot-table approach is that I want the counts in a helper-column in my table of 810k rows so that I can then use them in further calculations e.g. to calculate % of each of the 362 category counts based on other factors/columns.
Since my dataset is static, having calculated the 362 subtotals with COUNTIF, I then copy and paste the values in a column so that I can remove the formula from the s/sheet.
Do you know if the issue with COUNTIF being so slow is due to the very old version of Excel I'm using? i.e. it would be much faster with the current version of Excel?
Is there another function I could use instead of COUNTIF?
Chris.


Trusted Members

October 17, 2018



October 21, 2023

Hi Hans, thanks for looking at this.
Since my initial posting of this issue, I've discovered that the s/sheet I've been working on is not confidential. The friend who provided me a copy has confirmed he downloaded it from a UK Government Ministry of Justice website, so it's in the public domain.
So I've attached my modified version of the s/sheet here, having removed unnecessary columns, now total of 11 rather than the original 20.
Column J holds the COUNTIF calculation - as you will see, the formula is currently on the first row only, to be copied down to the rest of the rows, and then 'calculate' to be run.
Column K holds the values for this COUNTIF calculation for all the rows from my previous running (that took 8 hours), summing to 809,731 i.e. equal to the number of rows in the dataset.
I stated above there are 362 different unique values in column G, but it's actually 361.
Would be interested to know how long the column J calculation takes for you if you'd run it, thanks.
P.S. Just noticed there's a limit on size of attachment, I can't upload, my file is 44.3MB


Trusted Members
Moderators

November 1, 2018

Performing 800k calculations, all but 362 of which are repetitions, is not very efficient. I'd suggest you either sort the data first to that you only need to do the count for each of the 362 individual items once, or use GETPIVOTDATA formulas to just get the total values from your pivot table.


October 21, 2023

Hi Philip - as you can probably tell, I'm somewhat new to all this. But I think I've managed to upload the large s/sheet to Google Drive, here's the link:
https://docs.google.com/spread.....38;sd=true
If you, or Hans (or anyone else) can copy the formula down in column J, then time how long it takes to complete all 810k calculations, that would be helpful.
Velouria, thanks for your comment. The GETPIVOTDATA function isn't available in Excel 2010.
With regard to the 362 (I was mistaken, it's 361) unique values, to be clear, each of the 810k rows holds one of these 361 values. Not sure if that makes much difference though,
regards, Chris.


Trusted Members

October 17, 2018



Trusted Members

October 17, 2018

I wrote a small vba module to do the process of opening the file, calculate the worksheet
and then save and close the file:
The results
Using a VBA module to track it all
Open file: 10:11:05
File Opened: 10:11:18
Calculate: 10:11:18
Completed: 10:31:50
Close save: 10:31:50
Done: 10:32:00
The Calcultae is what takes long
I placed this macro in my Personal macro workbook so that it stays visible
the macro:
Public Sub test_ChrisG()
Dim WB As Workbook
Const WSname As String = "data"
Debug.Print "Open file:", Time
Set WB = Workbooks.Open(" <full path to your workbook here> \Chris G dataset.xlsx")
Debug.Print "File Opened:", Time
Debug.Print
Debug.Print "Calculate:", Time
WB.Worksheets(WSname).Calculate
Debug.Print "Completed:", Time
Debug.Print
Debug.Print "Close save:", Time
WB.Close True
Debug.Print "Done:", Time
Debug.Print
End Sub
You run the macro when the workbook is closed but Excel is active, select the macro from the Developper tab and run it


October 21, 2023

Thanks, Hans.
So you you copied my COUNTIF formula in column J down to all rows, then ran the calculation for all of the 810k rows which took just over 20 mins, that's a lot faster than my 8 hours.
What version of Excel do you have, and what's the summary of your computer spec?
Chris.


Trusted Members
Moderators

January 31, 2022

I also downloaded the file and created the pivot table as Mynda suggested. Rather then using COUNTIF on 810K rows, I used a VLOOKUP to pickup the count from the pivot table for each Offence. Entered the formula
=VLOOKUP(G2,Sheet1!$A$4:$B$364,2,FALSE)
in J2 and then copied it down. Timed it and it took less then 3 seconds to fill all 809731 rows in the data sheet and creating the pivot table itself was done in a matter of seconds as well. Now, I'm on a pretty powerful machine and can't really tell how much that matters.
Give this two-step approach a try and let us know how it goes.
For clarity, Sheet1!$A$4:$B$364 is where the pivot table sits. Offence in the Row field and Count of Years in the Value field.


October 21, 2023

Hi Riny. I wondered if it was possible to use VLOOKUP on a pivot table, after Velouria suggested the GETPIVOTDATA function - that sounds like an ideal solution, thanks.
I'll give that a go when I have time this evening, interesting to see how long it takes on my very old 'steam powered' setup!
P.S. Not sure about 'Count of Years' in the value field, my calculation doesn't take Year into account. I'm just counting the number of times each of the 361 different offences appears across all 810k rows.


Trusted Members
Moderators

January 31, 2022



Trusted Members
Moderators

November 1, 2018

For a pure formula approach, if you sort the data on column G and then use:
=IF(G2=G1,J1,COUNTIF(G:G,G2))
copied down, that should be considerably faster. It took about 7-8 seconds on my machine (running in a VM on my MacBook). Using a GETPIVOTDATA formula produced similar timings (not taking advantage of sorting). Adjusting the GETPIVOTDATA to take advantage of sorted column G made the calculation pretty much instantaneous.


October 21, 2023

Okay, so I've implemented both methods.
Velouria's formula =IF(G2=G1,J1,COUNTIF(G:G,G2)) took about 20 seconds to calculate (it actually took longer to copy the formula down the whole column before calculating).
But Riny's formula using VLOOKUP on a pivot-table took less than 10 seconds - I've attached a screenshot of the outcome where I placed the pivot-table on the same worksheet as the source-data.
After running both methods, I deduped the data on OFFENCE, so that I could sum the total to confirm = 809,731 which of course it did in both cases.
Really impressed with your thinking guys, thanks again for lending your expertise to this.
I'll definitely be back on here the next time I'm struggling,
regards, Chris.


Trusted Members

October 17, 2018



October 21, 2023

I ran additional COUNTIFS calculations for each of the 361 offence types, to calculate the sub-totals for 'Plea Guilty' (column L in the Example 1 screenshot attached) and another calculation for 'Plea Not Guilty' (column M) - they took even longer than the previous calculations.
I've extended the pivot-table to include the sub-totals for 'Guilty' and 'Not Guilty' for each of the 361 offence-types as shown in Example 2 screenshot.
As indicated with the red-circles, the total number of rows for each offence-type is copied into the main table using a VLOOKUP on the pivot-table data as suggested by Riny above.
Now I want to use a similar VLOOKUP to populate columns K and L each indicated by colours green and blue.
How do I do this given the format of the pivot-table i.e. Guilty and Not Guilty sub-totals in the lines under each offence category?
Is there some way of formatting the pivot-table so that there's still only 1 row for each of the 361 offence-types, with 'Count of Offence', 'Count for Guilty', and 'Count for Not Guilty' in consecutive columns?
So that I can use the same formula to look-up each of the 3 numbers by simply changing the column-reference
e.g. =VLOOKUP(G127,$N$4:$O$364,2,FALSE) pulls the 'Count of Offence' from column 2
But =VLOOKUP(G127,$N$4:$O$364,3,FALSE) would pull the 'Count for Guilty' from column 3 etc.


1 Guest(s)
