I have data that needs to be prepared for merging with another block of data. To do that, I would like to sort it chronologically. Sample source data:

    df1
                 pid  syear                   pgsbil                pgfamstd \
    0            101   1984   [3] Fachhochschulreife    [1] verheiratet zus.
    1            101   1985   [3] Fachhochschulreife    [1] verheiratet zus.
    2            101   1986   [3] Fachhochschulreife    [1] verheiratet zus.
    ...          ...    ...                      ...                     ...
    6            102   1984  [1] Hauptschulabschluss    [1] verheiratet zus.
    7            102   1985  [1] Hauptschulabschluss    [1] verheiratet zus.
    ...          ...    ...                      ...                     ...
    484168  31433802   2012   [2] Realschulabschluss    [1] verheiratet zus.
    484169  31433901   2012               [4] Abitur   [2] verheiratet getr.

I tried to sort it with this code:

 DF1 = df1.sort_values(by='syear', ascending=1) 

But instead of years I get values that, as far as I can tell, are in some different encoding (and so is everything else!):

    Df1
    Out[53]:
                  pid  syear pgsbil pgfamstd \
    248899  320797655 -32656     81      -95
    248825  891723238 -32419     43       43
    250014  345587954 -32377    NaN     -119
    ...           ...    ...    ...      ...
    250163  957561202  31108    -91       27
    250166  449665857  31554     -1       -1

Why do the numbers come out in a different format after sorting? How do I fix this?

  • Please show the output of the following commands: df1.syear.min() , df.syear.max() and df1.dtypes - MaxU
  • The command df1.syear.agg(['min','max']) fails with AttributeError: 'Series' object has no attribute 'agg'. And the output of the second command: pid int32 / syear int16 / pgsbil category / pgfamstd category / pglabgro int32 / pgemplst category / dtype: object - user21
  • Yes, I had already noticed the error and corrected the code in my comment ... - MaxU
  • Strange: the first command, df1.syear.min() , gives -32656, while the second, df.syear.max() , gives a normal value, 2012. But print(max(df1['syear'])) gives 31554. - user21
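
The checks requested in the comments can be tried on a small stand-in frame first. This is only a sketch: the frame `df` and its values are made up for illustration (one of them is the suspicious -32656 from the question), and only the column names and dtypes follow the dtypes listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: same dtypes as reported, invented rows.
df = pd.DataFrame({
    "pid": np.array([101, 101, 102], dtype=np.int32),
    "syear": np.array([1984, -32656, 2012], dtype=np.int16),
})

# The storage type of each column; syear is a 16-bit signed integer here.
print(df.dtypes)

# Smallest and largest value actually stored in the column.
year_min = df["syear"].min()
year_max = df["syear"].max()
print(year_min, year_max)
```

If the minimum is far below any plausible survey year, the problem is already in the loaded data rather than in the sort.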

2 answers

It looks like the syear column either really contains negative numbers or, more likely, contains large positive numbers (greater than 32767) that turn into negative ones under the np.int16 data type ...

Demo:

The minimum and maximum values for the np.int16 type:

    In [67]: np.iinfo(np.int16)
    Out[67]: iinfo(min=-32768, max=32767, dtype=int16)

How a large positive integer (32880) becomes a negative one (-32656) when the np.int16 type is used:

    In [72]: df = pd.DataFrame({'a':[32880]}, dtype=np.int16)

    In [73]: df
    Out[73]:
           a
    0 -32656
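
The same wraparound can be reproduced with plain NumPy by casting explicitly. A minimal sketch, assuming the values start out as 64-bit integers; the array `values` is invented for illustration. Anything above 32767 loses 65536 (2**16) when squeezed into 16 bits.

```python
import numpy as np

info = np.iinfo(np.int16)  # limits of a signed 16-bit integer
print(info.min, info.max)

# Hypothetical values: one that fits, and two that exceed the int16 maximum.
values = np.array([32767, 32768, 32880], dtype=np.int64)

# An explicit cast keeps only the low 16 bits, so out-of-range values wrap around.
wrapped = values.astype(np.int16)
print(wrapped.tolist())  # [32767, -32768, -32656]
```

Here 32880 - 65536 = -32656, which is exactly the smallest value in the sorted output from the question.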

wrong ("bad") years:

    In [88]: df1.query('syear <= 1980 or syear > 2016').syear
    Out[88]:
    248737    -9076
    248738   -26593
    248739     1725
    248740   -25171
    248741     7963
    248742    27137
    248743    19854
    248744    26738
    248745     6716
    248746     9885
    248747    19361
    248748   -19726
    248749   -24605
    248750    24074
    248751    -8070
    248752   -16027
    248753   -23424
    248754     3848
    248755     1471
    248756    30634
    248757    -8162
    248758   -18937
    248759    16733
    248760   -21923
    248761    16817
    248762     3834
    248763   -13556
    248764   -16229
    248765    24272
    248766    25642
              ...
    252510       -1
    252511       -1
    252512       -1
    252513       -1
    252514       -1
    252515       -1
    252516       -1
    252517       -1
    Name: syear, dtype: int16

An interesting observation: all the "bad" data sits in one contiguous block (indices 248737 to 252517).
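
Whether the bad rows really form a single contiguous block can be checked by comparing the number of bad rows with the span of their index labels. A rough sketch on an invented toy frame; it assumes a default integer index, as in df1, and reuses the year limits from the query above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: plausible years with a run of bad values in the middle.
df = pd.DataFrame({
    "syear": np.array([1984, 1985, -9076, 1725, -1, 2011, 2012], dtype=np.int16),
})

# Same condition as in the query above.
bad = df.query("syear <= 1980 or syear > 2016")

# With a default integer index, the block is contiguous when the number
# of bad rows equals the distance between the first and last bad label.
first, last = bad.index.min(), bad.index.max()
is_contiguous = len(bad) == last - first + 1
print(first, last, len(bad), is_contiguous)
```

On the real data the same arithmetic gives 252517 - 248737 + 1 = 3781 rows.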

  • While scrolling through the data, I did not see any value like 32767, nor any negative values. - user21
  • Can you upload your data somewhere in CSV / JSON / HDF5 format? - MaxU
  • How to show the bad data: df1.query('syear < 0 or syear > 2016') - MaxU
  • I uploaded two files, the source file and an HDF5 version. Link on Dropbox - user21
  • I did not put it very clearly: I meant that you have 3781 rows with "bad" years. For example, in the rows: df1.loc[248737:248747, 'syear'] - MaxU

I think the data really does look like this: since I sort in ascending order, the strangest values come first. I tried the command:

    syear_counts = df1['syear'].value_counts()
    syear_counts

     2000      24174
     2002      23541
     2006      22399
     2003      22285
     2001      21985
     2004      21703
     2011      21154
    -1          3274
    ....
    -17733         1
     29884         1
     24765         1
    -11361         1

Most likely I need to discard this unrepresentative piece of data, that is, all rows that contain such values.

  • Yes, sorting could not have "spoiled" the data. The main question is how the syear column ended up in df1 ? - MaxU
  • "Throwing away" the wrong data is simple ( df = df[df.syear > 1950] ), but I think it is worth finding out where the negative years come from ... - MaxU
  • I loaded the data with df1 = pd.read_stata('gen_data.dta') . It looks like half the data. I loaded the data again: import pandas as pd -> import os -> os.chdir(r'C:\Users\...\SOEPlongv29_stata') -> df1 = pd.read_stata('gen_data.dta') -> ... -> syear_counts[:10] and got the same results as described above. - user21
  • I understand how this filter works. What remains is to do the same for the upper limit of 2012, to discard the very large values that appear instead of a year, such as: 18770 1 .... 24276 1 - user21
  • That is not difficult: df1_clean = df1.query('1950 < syear <= 2012') - MaxU
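
Putting the comments together, the cleanup and the chronological sort asked about in the question might look like this. It is a sketch on invented rows, not on the real file: the limits 1950 and 2012 come from the comments above, while sorting by pid first and then syear is an assumption about what "chronological" should mean for panel data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1 with two implausible years (-1 and 31554).
df1 = pd.DataFrame({
    "pid": [102, 101, 101, 103, 102],
    "syear": np.array([1985, 1986, 1984, -1, 31554], dtype=np.int16),
})

# Keep only rows whose year lies in the plausible range.
df1_clean = df1.query("1950 < syear <= 2012")

# Order each person's records by year; the person-then-year order is an assumption.
df1_sorted = df1_clean.sort_values(by=["pid", "syear"]).reset_index(drop=True)
print(df1_sorted)
```

Filtering only hides the symptom. As noted in the comments, it is still worth finding out why read_stata produced those values in the first place.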