Suppose there is an array of observations in which one of the variables takes random values from 1 to 100. How can I derive an ordinal variable from it that takes values depending on specified thresholds (for example: "1" if < 50; "2" if in [50, 60]; otherwise "3")? I wanted to use map or a lambda function, but failed.

    2 answers

    In [123]: lst
    Out[123]: [87, 92, 22, 1, 94, 18, 92, 44, 77, 73, 53, 24, 9, 67, 20]

    In [142]: res = ["1" if x < 50 else "2" if x <= 60 else "3" for x in lst]

    In [143]: res
    Out[143]: ['3', '3', '1', '1', '3', '1', '3', '1', '3', '3', '2', '1', '1', '3', '1']
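Since the question mentions map and lambda, here is a minimal sketch of the same conditional expression written that way. The name to_label is only illustrative, and the thresholds are the ones from the question ("2" covers 50 through 60 inclusive).

```python
# Sample data copied from the session above.
lst = [87, 92, 22, 1, 94, 18, 92, 44, 77, 73, 53, 24, 9, 67, 20]

# Same chained conditional as in the list comprehension, wrapped in a lambda.
# to_label is a hypothetical name chosen for this sketch.
to_label = lambda x: "1" if x < 50 else "2" if x <= 60 else "3"

# map() applies the lambda to every element; list() materializes the result.
res = list(map(to_label, lst))
print(res)
```

The list comprehension is usually considered more readable, but both forms produce the same list.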

    For large amounts of data it is better to use NumPy or Pandas, since they work much faster:

     import pandas as pd
     import numpy as np  # for generating random numbers

    Sample input data:

     In [166]: df = pd.DataFrame({'var':np.random.randint(1, 101, 10)})

     In [167]: df
     Out[167]:
        var
     0   38
     1  100
     2   74
     3    5
     4   66
     5   32
     6   91
     7    6
     8   68
     9   50

    Use pd.cut():

     In [168]: df['tag1'] = pd.cut(df['var'], bins=[0,50,60,101], labels=[1,2,3])

     In [169]: df['tag2'] = pd.cut(df['var'], bins=[0,50,60,101])

    Result (if you do not specify the labels parameter, pd.cut() constructs the interval labels itself, which may come in handy):

     In [170]: df
     Out[170]:
        var tag1       tag2
     0   38    1    (0, 50]
     1  100    3  (60, 101]
     2   74    3  (60, 101]
     3    5    1    (0, 50]
     4   66    3  (60, 101]
     5   32    1    (0, 50]
     6   91    3  (60, 101]
     7    6    1    (0, 50]
     8   68    3  (60, 101]
     9   50    1    (0, 50]

    You can also make the intervals closed on the left instead of on the right:

     In [172]: df['tag3'] = pd.cut(df['var'], bins=[0,50,60,101], right=False)

     In [173]: df
     Out[173]:
        var tag1       tag2       tag3
     0   38    1    (0, 50]    [0, 50)
     1  100    3  (60, 101]  [60, 101)
     2   74    3  (60, 101]  [60, 101)
     3    5    1    (0, 50]    [0, 50)
     4   66    3  (60, 101]  [60, 101)
     5   32    1    (0, 50]    [0, 50)
     6   91    3  (60, 101]  [60, 101)
     7    6    1    (0, 50]    [0, 50)
     8   68    3  (60, 101]  [60, 101)
     9   50    1    (0, 50]   [50, 60)
    • The check 50 <= is redundant; it works without it too. - insolor
    • @insolor, thanks! Fixed. - MaxU
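Note that with bins=[0,50,60,101] the value 50 falls into (0, 50] under the default right=True, and 60 falls into [60, 101) under right=False, whereas the question puts both 50 and 60 into class "2". A minimal sketch that reproduces the question's thresholds, assuming the data are integers from 1 to 100 (the sample values and the column name tag are made up for illustration):

```python
import pandas as pd

# Hypothetical sample covering the boundary values; integer data is assumed.
df = pd.DataFrame({'var': [1, 49, 50, 60, 61, 100]})

# With the default right=True the bins are (0, 49], (49, 60], (60, 100],
# which for integers means: below 50, 50..60 inclusive, above 60.
df['tag'] = pd.cut(df['var'], bins=[0, 49, 60, 100], labels=["1", "2", "3"])

print(df)
```

For non-integer data the edge 49 would misclassify values between 49 and 50, so the bin edges would need to be chosen differently in that case.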

    There is also the numpy.digitize() function, which returns the indices of the ranges to which the array elements belong:

     >>> import numpy as np
     >>> a = np.random.randint(1, 101, size=10)
     >>> a
     array([16, 42, 19, 88, 69, 15, 5, 1, 33, 50])
     >>> np.digitize(a, [1, 50, 60, 101])
     array([1, 1, 1, 3, 3, 1, 1, 1, 1, 2])
    • 1 <= 16 < 50 so the range number for 16 is 1
    • 50 <= 50 < 60 so the range number for 50 is 2
    • 60 <= 88 < 101 so the range number for 88 is 3
    • @MaxU: I think the author is simply not familiar with the [50, 60) notation. For example, the question does not say whether "from 1 to 100" is inclusive or not. That is why I spelled out the ranges explicitly, to avoid any ambiguity. If necessary, you can write 61 instead of 60. - jfs
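Following the remark about writing 61 instead of 60, here is a minimal sketch in which 60 is counted in the second range, as in the question's [50, 60]. It assumes integer data; the sample array is made up to include the boundary values.

```python
import numpy as np

# Hypothetical sample that includes the boundaries 50, 60 and 61.
a = np.array([16, 42, 19, 88, 69, 15, 5, 1, 33, 50, 60, 61])

# By default np.digitize returns i such that bins[i-1] <= x < bins[i],
# so the ranges here are [1, 50), [50, 61) and [61, 101).
labels = np.digitize(a, [1, 50, 61, 101])

print(labels)
```

Values outside 1..100 would get 0 or 4 with these bins, so the input is assumed to stay within that range.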