Counting the number of duplicates

Question

I have a list of records (object parameters), for example:

    N       Событие                 Исход
    123213  Реал Мадрид-Барселона   1
    123214  Фиорентина-Аталанта     X
    123213  Реал Мадрид-Барселона   2
    123213  Реал Мадрид-Барселона   1
    123215  Венеция-Перуджа         X
    123213  Реал Мадрид-Барселона   1

I need to count the number of duplicates. The expected result is something like this:

    N       Событие                 1  X  2
    123213  Реал Мадрид-Барселона   3  0  1
    123214  Фиорентина-Аталанта     0  1  0
    123215  Венеция-Перуджа         0  1  0

Of course, I could build a list of lists (or a list of tuples, etc.) and, before adding each record, check whether an identical one is already there, adding it only if it is not, and then count the occurrences of each distinct record.

Is there a more elegant way? Perhaps using a pandas DataFrame?

Please add a reproducible example of the data to the question, as a list of lists or a list of tuples.

Accepted Answer · 2019-03-04T23:37:08

I tried using Pandas to get the desired result.

The input data is placed in CSV and read from there:

    N,Событие,Исход
    123213,Реал Мадрид-Барселона,1
    123214,Фиорентина-Аталанта,X
    123213,Реал Мадрид-Барселона,2
    123213,Реал Мадрид-Барселона,1
    123215,Венеция-Перуджа,X
    123213,Реал Мадрид-Барселона,1

Code:

    import pandas as pd

    df = pd.read_csv('events.csv')
    res = (df.groupby(['N', 'Событие'])['Исход']
             .value_counts()
             .unstack()
             .reset_index()
             .fillna(0)
             .astype({'1': int, 'X': int, '2': int})
             .reindex(columns=['N', 'Событие', '1', 'X', '2']))
    res

Result:

    Исход       N                Событие  1  X  2
    0      123213  Реал Мадрид-Барселона  3  0  1
    1      123214    Фиорентина-Аталанта  0  1  0
    2      123215        Венеция-Перуджа  0  1  0

To convert the counts to percentages:

    res.loc[:, '1':] = (res.loc[:, '1':]
                           .div(res.loc[:, '1':].sum(axis=1), axis=0)
                           .mul(100).astype(int))
    res

Result:

    Исход       N                Событие   1    X   2
    0      123213  Реал Мадрид-Барселона  75    0  25
    1      123214    Фиорентина-Аталанта   0  100   0
    2      123215        Венеция-Перуджа   0  100   0

When saving, for example to CSV, the index (the leftmost column, shown under the "Исход" label) can be omitted:

 res.to_csv('output.csv', index=False)

But is it possible to compute, for each row, not the number of duplicates but their percentage within each event?
Yes, but I don't like that the percentages are whole numbers. What needs to be changed to fix that?
@danilshik how do you want it to look: 75.0, 25.0, etc.?
Yes, only not of the float type but of the double type, because there are many values there and the precision needs to be higher than float.
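For the fractional percentages discussed in these comments, one option is simply to skip the integer cast. The following is a minimal sketch, not the answer author's code: it assumes the data arrives as a list of tuples, the names `rows` and `pct` are illustrative, and it swaps in `pd.crosstab` with `normalize='index'` instead of the groupby chain above. pandas stores floating-point columns as `float64` by default, which is already double precision, so no separate "double" type is needed.

```python
import pandas as pd

# Sample rows from the question, assumed here to be a list of tuples.
rows = [
    (123213, "Реал Мадрид-Барселона", "1"),
    (123214, "Фиорентина-Аталанта", "X"),
    (123213, "Реал Мадрид-Барселона", "2"),
    (123213, "Реал Мадрид-Барселона", "1"),
    (123215, "Венеция-Перуджа", "X"),
    (123213, "Реал Мадрид-Барселона", "1"),
]
df = pd.DataFrame(rows, columns=["N", "Событие", "Исход"])

# normalize="index" divides each count by its row total, so the shares
# for one event sum to 1; multiplying by 100 turns them into percentages.
pct = (
    pd.crosstab([df["N"], df["Событие"]], df["Исход"], normalize="index")
    .mul(100)
    .reindex(columns=["1", "X", "2"])   # fix the column order
    .rename_axis(None, axis=1)          # drop the "Исход" axis label
    .reset_index()
)

print(pct.dtypes)  # the outcome columns are float64 (double precision)
print(pct)
```

If a fixed number of decimals is wanted only for display, `pct.round(2)` can be applied at the end without changing the underlying precision of `pct`.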

Answer 2 · 2019-03-04T23:05:02

Using pandas.crosstab:

    import pandas as pd

    d = '''123213 Реал Мадрид-Барселона 1
    123214 Фиорентина-Аталанта X
    123213 Реал Мадрид-Барселона 2
    123213 Реал Мадрид-Барселона 1
    123215 Венеция-Перуджа X
    123213 Реал Мадрид-Барселона 1'''

    # Parse each line into [N, event, outcome]; the event name may contain spaces.
    lol = []
    for l in d.splitlines():
        t = l.rstrip().split()
        lol.append([t[0], ' '.join(t[1:-1]), t[-1]])
    print(f'Список списков: {lol}\n')  # "List of lists"

    df = pd.DataFrame(lol, columns=['N', 'Событие', 'Исход'])
    print(f'Вход:\n{df}\n')  # "Input"

    # Count outcomes per (N, event) and turn the outcomes into columns.
    df = pd.crosstab([df['N'], df['Событие']], df['Исход']).rename_axis(None, axis=1).reset_index()
    print(f'Посчитали исходы и перевернули:\n{df}\n')  # "Counted the outcomes and pivoted"
    print(f'Колонки: {df.columns.tolist()}\n')  # "Columns"
    print(f'Список списков: {df.values.tolist()}')  # "List of lists"

Result:

    Список списков: [['123213', 'Реал Мадрид-Барселона', '1'], ['123214', 'Фиорентина-Аталанта', 'X'], ['123213', 'Реал Мадрид-Барселона', '2'], ['123213', 'Реал Мадрид-Барселона', '1'], ['123215', 'Венеция-Перуджа', 'X'], ['123213', 'Реал Мадрид-Барселона', '1']]

    Вход:
            N                Событие Исход
    0  123213  Реал Мадрид-Барселона     1
    1  123214    Фиорентина-Аталанта     X
    2  123213  Реал Мадрид-Барселона     2
    3  123213  Реал Мадрид-Барселона     1
    4  123215        Венеция-Перуджа     X
    5  123213  Реал Мадрид-Барселона     1

    Посчитали исходы и перевернули:
            N                Событие  1  2  X
    0  123213  Реал Мадрид-Барселона  3  1  0
    1  123214    Фиорентина-Аталанта  0  0  1
    2  123215        Венеция-Перуджа  0  0  1

    Колонки: ['N', 'Событие', '1', '2', 'X']

    Список списков: [['123213', 'Реал Мадрид-Барселона', 3, 1, 0], ['123214', 'Фиорентина-Аталанта', 0, 0, 1], ['123215', 'Венеция-Перуджа', 0, 0, 1]]

@AndreyOdegov, that is a good example, but the data will come either as a list of lists or as a list of tuples.

ChocolateSwan · Answer 3 · 2019-03-04T19:31:30

If each such row (123213 Реал Мадрид-Барселона 1) is an object in the list, you can build a set() from the list and then use the list's count method to count how many times each row occurs in the original list: map(lambda x: (x, data_list.count(x)), data_set)

This is only reasonable if the list is not very large, since every count call scans the whole list.
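The set-plus-count idea can be sketched as a runnable snippet. This assumes each row is a tuple (rows must be hashable to go into a set, so inner lists would first need converting with `tuple`); the sample contents of `data_list` are taken from the question, and `counts` is an illustrative name.

```python
# Rows from the question, assumed to be tuples so they can be put in a set.
data_list = [
    (123213, "Реал Мадрид-Барселона", "1"),
    (123214, "Фиорентина-Аталанта", "X"),
    (123213, "Реал Мадрид-Барселона", "2"),
    (123213, "Реал Мадрид-Барселона", "1"),
    (123215, "Венеция-Перуджа", "X"),
    (123213, "Реал Мадрид-Барселона", "1"),
]

# Distinct rows.
data_set = set(data_list)

# For each distinct row, count its occurrences in the original list.
# Each list.count call scans the whole list, so the total work grows
# with (number of distinct rows) * (length of the list).
counts = dict(map(lambda x: (x, data_list.count(x)), data_set))

for row, n in sorted(counts.items()):
    print(row, n)
```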

This row will be a nested list, a tuple, or some other Python element.
@danilshik I mean it depends on whether the list is large or not. If it is possible to avoid recounting everything each time, it is better to keep some counter variables or something similar.
Otherwise you end up scanning the whole list every time, which is slow, although the method in the answer should work.
Converting everything to a set is fast; all that remains is to count the occurrences, either in a loop (in a single pass) or however is more convenient.
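The single-pass counting mentioned in the last comment can be sketched with `collections.Counter` from the standard library. This is one possible reading of the suggestion, not code from the thread: rows are again assumed to be tuples, the outcome labels `1`, `X`, `2` are taken from the question, and the names `counts` and `table` are illustrative.

```python
from collections import Counter, defaultdict

# Rows from the question, assumed to be tuples (hashable).
data_list = [
    (123213, "Реал Мадрид-Барселона", "1"),
    (123214, "Фиорентина-Аталанта", "X"),
    (123213, "Реал Мадрид-Барселона", "2"),
    (123213, "Реал Мадрид-Барселона", "1"),
    (123215, "Венеция-Перуджа", "X"),
    (123213, "Реал Мадрид-Барселона", "1"),
]

# One pass over the list: Counter tallies each distinct row.
counts = Counter(data_list)

# Regroup the tallies per (N, event), with one slot per outcome,
# which mirrors the table layout requested in the question.
table = defaultdict(lambda: {"1": 0, "X": 0, "2": 0})
for (n, event, outcome), c in counts.items():
    table[(n, event)][outcome] = c

for (n, event), outcomes in sorted(table.items()):
    print(n, event, outcomes["1"], outcomes["X"], outcomes["2"])
```

Unlike repeated `list.count` calls, this visits every row once, so it should stay practical for larger lists.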
