There is a dataframe of the following form:
id  Name     Sex
0   Jack     male
1   Andrew   male
2   Andrew   female
3   Jack     male
4   Yuriy    male
5   Johanna  female

I need to get the most frequently used name for each sex (female / male). How can this be implemented?
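For anyone who wants to try the answers, the frame from the question can be rebuilt roughly as follows. This is only a sketch: it assumes that id is the default integer index rather than a regular column, and the variable name df is chosen to match the answers.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: `id` is the frame's index, not an ordinary column.
df = pd.DataFrame({
    "Name": ["Jack", "Andrew", "Andrew", "Jack", "Yuriy", "Johanna"],
    "Sex": ["male", "male", "female", "male", "male", "female"],
})
df.index.name = "id"

print(df)
```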
Series.value_counts() returns the number of occurrences of each value as a Series sorted in descending order of count, so the most frequent value is already the first entry of its index. A separate .idxmax() call is therefore unnecessary extra work.
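This claim can be checked on the question's data with a small sketch like the one below. It assumes the sample frame is rebuilt by hand as df; the names counts, first, via_idxmax and female_modes are purely illustrative. The comparison uses the male rows, where Jack is the only most frequent name. For the female rows Andrew and Johanna each occur once, so which of the two is reported depends on how the tie happens to be broken; Series.mode() is used here to list all tied names.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumed layout).
df = pd.DataFrame({
    "Name": ["Jack", "Andrew", "Andrew", "Jack", "Yuriy", "Johanna"],
    "Sex": ["male", "male", "female", "male", "male", "female"],
})

# Counts for the male names; value_counts() sorts by count, descending.
counts = df.loc[df["Sex"] == "male", "Name"].value_counts()
print(counts.is_monotonic_decreasing)

# With a unique maximum, both approaches pick the same name.
first = counts.index[0]
via_idxmax = counts.idxmax()
print(first, via_idxmax)

# For the female rows there is a tie; mode() returns every tied name.
female_modes = df.loc[df["Sex"] == "female", "Name"].mode().tolist()
print(female_modes)
```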
Example:
In [50]: df.groupby('Sex')['Name'].agg(lambda g: g.value_counts().index[0]).reset_index(name='Most_popular_name')
Out[50]:
      Sex Most_popular_name
0  female           Johanna
1    male              Jack

In [51]: df.groupby('Sex')['Name'].agg(lambda g: g.value_counts().index[0]).to_dict()
Out[51]: {'female': 'Johanna', 'male': 'Jack'}

To print the most common female and male names:
for sex in ['male', 'female']:
    print(df.loc[df.Sex==sex, 'Name'].value_counts(sort=False).idxmax())

Result:
Jack
Andrew

Or as one expression:
>>> df.groupby('Sex').agg(lambda g: g.value_counts(sort=False).idxmax())
          Name
Sex
female  Andrew
male      Jack

Or, selecting the Name column explicitly:
>>> top_names = df.groupby('Sex')['Name'].agg(lambda g: g.value_counts(sort=False).idxmax())
>>> top_names.to_dict()
{'female': 'Andrew', 'male': 'Jack'}

[...] index[0]. This is less readable, so I use an explicit idxmax() after value_counts() (this does not worsen the big-O complexity of the solution). If the constant factors in the running time ever need improving, it can be sped up, provided a profiler shows that this is a bottleneck in the program. - jfs

Source: https://ru.stackoverflow.com/questions/817773/