How to sort the data by the number of identical records?

Question

There is a dataframe of the following form:

    id     Name     Sex
     0     Jack    male
     1   Andrew    male
     2   Andrew  female
     3     Jack    male
     4    Yuriy    male
     5  Johanna  female
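To reproduce the frame locally, a minimal sketch follows. It assumes that `id` is the index rather than a regular column, which the question does not state explicitly.

```python
import pandas as pd

# Rebuild the frame from the question.
# Assumption: "id" is the index, not an ordinary column.
df = pd.DataFrame(
    {
        "Name": ["Jack", "Andrew", "Andrew", "Jack", "Yuriy", "Johanna"],
        "Sex": ["male", "male", "female", "male", "male", "female"],
    }
)
df.index.name = "id"

print(df)
```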

I need to get the most frequently used name for each sex (female / male). How can this be implemented?

Related question: Group by pandas dataframe and select most common string factor

MaxU · Answer 1 · 2018-04-22T18:55:18

Series.value_counts() returns the number of occurrences of each value as a Series sorted in descending order of count. The first index label is therefore already the most frequent value, and a separate .idxmax() call is an unnecessary waste of resources.

Example:

    In [50]: df.groupby('Sex')['Name'].agg(lambda g: g.value_counts().index[0]).reset_index(name='Most_popular_name')
    Out[50]:
          Sex Most_popular_name
    0  female           Johanna
    1    male              Jack

    In [51]: df.groupby('Sex')['Name'].agg(lambda g: g.value_counts().index[0]).to_dict()
    Out[51]: {'female': 'Johanna', 'male': 'Jack'}

Sorting (O(n log n)) is not needed; the maximum can be found in linear time.
Your answer uses sorting. This is a trifle, but since you raised the topic of performance yourself, adequate solutions should be proposed. I would not venture to say which option is faster for a specific input.
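The two approaches being compared can be sketched side by side. The series below holds only the male names from the question, and the `Counter` variant is an extra illustration that neither answer uses.

```python
from collections import Counter

import pandas as pd

# Male names from the question's frame.
names = pd.Series(["Jack", "Andrew", "Jack", "Yuriy"])

# Variant 1: let value_counts() sort the counts, then take the first label.
by_sort = names.value_counts().index[0]

# Variant 2: skip the sort and scan the counts once for the maximum.
by_scan = names.value_counts(sort=False).idxmax()

# Variant 3 (illustration only): count in plain Python, then one linear scan.
counts = Counter(names)
by_counter = max(counts, key=counts.get)

print(by_sort, by_scan, by_counter)
```

In both pandas variants the counting pass over all rows is the same. The sort in variant 1 only reorders the distinct values, so the difference matters mainly when there are many distinct names.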

Answer 2 · 2018-04-22T17:28:48

To print the most common female and male names:

    for sex in ['male', 'female']:
        print(df.loc[df.Sex == sex, 'Name'].value_counts(sort=False).idxmax())

Result:

    Jack
    Andrew

Or as one expression:

    >>> df.groupby('Sex').agg(lambda g: g.value_counts(sort=False).idxmax())
              Name
    Sex
    female  Andrew
    male      Jack

Or explicitly choosing names:

    >>> top_names = df.groupby('Sex')['Name'].agg(lambda g: g.value_counts(sort=False).idxmax())
    >>> top_names.to_dict()
    {'female': 'Andrew', 'male': 'Jack'}
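Note that in the sample data the female group is tied: Andrew and Johanna each occur once, which is why the two answers report different female names. If ties matter, one option is to list every tied name with `Series.mode()`. This is a sketch that rebuilds the frame from the question; the name `ties` is made up for the example.

```python
import pandas as pd

# Frame from the question, rebuilt so the example is self-contained.
df = pd.DataFrame(
    {
        "Name": ["Jack", "Andrew", "Andrew", "Jack", "Yuriy", "Johanna"],
        "Sex": ["male", "male", "female", "male", "male", "female"],
    }
)

# Series.mode() returns all values that share the highest count, sorted.
ties = {sex: group.mode().tolist() for sex, group in df.groupby("Sex")["Name"]}

print(ties)
```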

@MaxU is correct: the answer at the link I gave uses index[0].
That is less readable, so I use an explicit idxmax() after value_counts() (this does not worsen the big-O complexity of the solution). If the constant factors in the running time ever need to be improved, that can be bolted on later, but only if a profiler shows that this is a bottleneck in the program.
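Whether the difference is measurable can be checked with a quick timing run before optimizing anything. The sketch below uses made-up data (100,000 random names) and hypothetical helper names `with_sort` and `with_idxmax`; timings vary by machine and pandas version.

```python
import random
import timeit

import pandas as pd

# Made-up data for timing purposes only: 100,000 names drawn at random.
random.seed(0)
names = pd.Series(
    random.choices(["Jack", "Andrew", "Yuriy", "Johanna"], weights=[4, 3, 2, 1], k=100_000)
)


def with_sort():
    # Sorted counts, first label is the most frequent name.
    return names.value_counts().index[0]


def with_idxmax():
    # Unsorted counts, then a single scan for the maximum.
    return names.value_counts(sort=False).idxmax()


t_sort = timeit.timeit(with_sort, number=20)
t_idxmax = timeit.timeit(with_idxmax, number=20)

print(f"value_counts().index[0]:           {t_sort:.4f} s")
print(f"value_counts(sort=False).idxmax(): {t_idxmax:.4f} s")
```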
