Why do you get numbers in a different format when sorting data?

Question

I have a dataset that needs to be prepared for merging with another block of data. To do that, I would like to sort it chronologically. Sample of the source data:

 df1
              pid  syear                   pgsbil               pgfamstd  \
 0            101   1984   [3] Fachhochschulreife   [1] verheiratet zus.
 1            101   1985   [3] Fachhochschulreife   [1] verheiratet zus.
 2            101   1986   [3] Fachhochschulreife   [1] verheiratet zus.
 ...          ...    ...                      ...                    ...
 6            102   1984  [1] Hauptschulabschluss   [1] verheiratet zus.
 7            102   1985  [1] Hauptschulabschluss   [1] verheiratet zus.
 ...          ...    ...                      ...                    ...
 484168  31433802   2012   [2] Realschulabschluss   [1] verheiratet zus.
 484169  31433901   2012               [4] Abitur  [2] verheiratet getr.

I tried sorting with this code:

 DF1 = df1.sort_values(by='syear', ascending=1)

But instead of years, I get values that look to me as if they were in a different encoding (and so does everything else!):

 DF1
 Out[53]:
               pid  syear pgsbil pgfamstd  \
 248899  320797655 -32656     81      -95
 248825  891723238 -32419     43       43
 250014  345587954 -32377    NaN     -119
 ...           ...    ...    ...      ...
 250163  957561202  31108    -91       27
 250166  449665857  31554     -1       -1

Why do you get numbers in a different format when sorting data? How do I fix this?

Please show the output of these commands: df1.syear.min(), df.syear.max() and df1.dtypes
The df1.syear.agg(['min','max']) command fails with AttributeError: 'Series' object has no attribute 'agg'. The second command gives: pid int32 / syear int16 / pgsbil category / pgfamstd category / pglabgro int32 / pgemplst category / dtype: object
Yes, I already noticed the error and corrected the code in the comment ...
Strange: the first command, df1.syear.min(), returns -32656, while the second, df.syear.max(), returns a normal value, 2012. Yet print(max(df1['syear'])) gives 31554.
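
The checks requested in these comments can be sketched on a tiny stand-in frame. The frame `df` and its three rows are hypothetical, chosen only to mirror the dtypes reported above. Series.agg is available in current pandas releases, so the AttributeError above suggests an older installation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1, using the dtypes reported in the comment above.
df = pd.DataFrame({
    "pid": np.array([101, 102, 320797655], dtype=np.int32),
    "syear": np.array([1984, 2012, -32656], dtype=np.int16),
})

# Smallest and largest year in one call.
year_range = df["syear"].agg(["min", "max"])
print(year_range)

# Column dtypes: a 16-bit year column can only hold -32768..32767.
print(df.dtypes)
limits = np.iinfo(df["syear"].dtype)
print(limits.min, limits.max)
```

A minimum far below any plausible survey year, combined with an int16 dtype, is the hint that the column holds corrupt or wrapped values.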

Accepted Answer · 2016-08-11T18:09:40

It looks like your syear column contains either genuinely negative numbers or, more likely, large positive numbers (greater than 32767) that wrap around to negative values when stored as np.int16 ...

Demo:

The minimum and maximum values of the np.int16 type:

 In [67]: np.iinfo(np.int16)
 Out[67]: iinfo(min=-32768, max=32767, dtype=int16)

How a large positive integer (32880) turns into a negative one (-32656) when stored as np.int16:

 In [72]: df = pd.DataFrame({'a':[32880]}, dtype=np.int16)

 In [73]: df
 Out[73]:
        a
 0 -32656
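
The wraparound can also be reproduced with an explicit cast. This is a minimal sketch, not the answerer's code: newer pandas/NumPy releases may reject an out-of-range value in the constructor call above instead of wrapping it, so the example casts an existing 64-bit array with astype. The names `wide`, `narrow` and `by_hand` are illustrative only.

```python
import numpy as np

# int16 can represent values in [-32768, 32767].
info = np.iinfo(np.int16)

# Cast a 64-bit value that does not fit into 16 bits.
# astype uses unsafe casting by default, so the value wraps modulo 2**16.
wide = np.array([32880], dtype=np.int64)
narrow = wide.astype(np.int16)

# The same result by hand: values above the int16 maximum lose 2**16.
by_hand = 32880 - 2**16

print(info.min, info.max)
print(int(narrow[0]), by_hand)
```

Both printed values on the last line agree, which matches the -32656 seen at the top of the sorted output in the question.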

The wrong ("bad") years:

 In [88]: df1.query('syear <= 1980 or syear > 2016').syear
 Out[88]:
 248737    -9076
 248738   -26593
 248739     1725
 248740   -25171
 248741     7963
 248742    27137
 248743    19854
 248744    26738
 248745     6716
 248746     9885
 248747    19361
 248748   -19726
 248749   -24605
 248750    24074
 248751    -8070
 248752   -16027
 248753   -23424
 248754     3848
 248755     1471
 248756    30634
 248757    -8162
 248758   -18937
 248759    16733
 248760   -21923
 248761    16817
 248762     3834
 248763   -13556
 248764   -16229
 248765    24272
 248766    25642
           ...
 252510       -1
 252511       -1
 252512       -1
 252513       -1
 252514       -1
 252515       -1
 252516       -1
 252517       -1
 Name: syear, dtype: int16

An interesting observation: all the "bad" data sits in one contiguous block (indices 248737 to 252517).
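
Whether the bad rows form a single block can be checked programmatically. The sketch below uses a small hypothetical frame in place of df1 and assumes a default integer index, as in the output above. The bounds 1980 and 2016 are taken from the query shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for df1: valid years around a run of corrupt ones.
df = pd.DataFrame({"syear": [1984, 1985, -9076, 1725, -1, 2011, 2012]})

# A row counts as "bad" when its year falls outside the plausible range.
bad = df[(df["syear"] <= 1980) | (df["syear"] > 2016)]

first, last = int(bad.index.min()), int(bad.index.max())

# With a default integer index, the bad rows are contiguous exactly when
# their count equals the span of their index labels.
is_contiguous = len(bad) == last - first + 1

print(len(bad), first, last, is_contiguous)
```

Applied to the indices quoted above, the same arithmetic gives 252517 - 248737 + 1 = 3781 rows.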

While scrolling through the data, I did not see any values like 32767, or any negative values.
Can you upload your data somewhere in CSV / JSON / HDF5 format?
To show the bad data: df1.query('syear < 0 or syear > 2016')
I did not put it clearly: I meant that you have 3781 rows with "bad" years.
For example, in these rows: df1.loc[248737:248747, 'syear']

user21 · Answer 2 · 2016-08-11T18:15:53

I think the data really does contain these values: since I sort in ascending order, the strangest values come first. I tried this command:

 syear_counts = df1['syear'].value_counts()
 syear_counts
 2000      24174
 2002      23541
 2006      22399
 2003      22285
 2001      21985
 2004      21703
 2011      21154
 -1         3274
 ....
 -17733        1
  29884        1
  24765        1
 -11361        1

Most likely, I need to discard this unrepresentative part of the data, that is, all rows that contain such values.

The main question is how the syear column in df1 was created.
“Throwing away” the wrong data is simple (df = df[df.syear > 1950]) - I think it’s worth ...
I loaded the data with df1 = pd.read_stata('gen_data.dta').
I loaded the data again: import pandas as pd -> import os -> os.chdir(r'C:\Users\...\SOEPlongv29_stata') -> df1 = pd.read_stata('gen_data.dta') -> ... -> syear_counts[:10], and got the same results as described above.
I understand how this filter works. What remains is to apply an upper limit of 2012 as well, to discard the very large values that appear in place of a year.
That is not difficult either: df1_clean = df1.query('1950 < syear <= 2012')
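
The filter-then-sort step from these comments can be sketched end to end. The frame below is a hypothetical miniature of df1, built from a few values that appear in the question. The bounds 1950 and 2012 come from the comments. Series.between is shown as an alternative to query, not as the answerer's method.

```python
import pandas as pd

# Hypothetical miniature of df1: valid rows mixed with corrupt years.
df = pd.DataFrame({
    "pid": [31433802, 320797655, 101, 449665857, 101],
    "syear": [2012, -32656, 1985, 31554, 1984],
})

# Keep only plausible survey years, as in the comment above.
clean_q = df.query("1950 < syear <= 2012")

# Equivalent for integer years: between() is inclusive on both ends.
clean = df[df["syear"].between(1951, 2012)]

# Sort chronologically only after the corrupt rows are gone.
sorted_clean = clean.sort_values("syear").reset_index(drop=True)
print(sorted_clean)
```

Filtering first means the sorted result starts at the earliest real year instead of at the wrapped negative values.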
