I am running a cluster analysis with k-means. I have generated 720 datasets and collected them in a single list, and a separate data frame holds the number of centers (k) for each of the 720 datasets. I want to run the clustering for all datasets in the list in one step with lapply, using the number of centers that corresponds to each dataset. The problem is that I do not know how to make lapply iterate over the matching numbers of centers along with the datasets.

Example:

 # generate 5 datasets
 set.seed(199)
 df1 <- data.frame(replicate(4, sample(1:100, 40, replace = TRUE)))
 df2 <- data.frame(replicate(3, sample(1:100, 30, replace = TRUE)))
 df3 <- data.frame(replicate(5, sample(1:100, 20, replace = TRUE)))
 df4 <- data.frame(replicate(6, sample(1:100, 40, replace = TRUE)))
 df5 <- data.frame(replicate(3, sample(1:100, 50, replace = TRUE)))
 # put them in a list
 list_df <- list(df1, df2, df3, df4, df5)
 list_names <- c("df1", "df2", "df3", "df4", "df5")
 list_df <- setNames(list_df, list_names)
 # create a data frame with the number of centers for each dataset
 df_centers <- data.frame(centers = c(3, 4, 2, 6, 8))
 # attempt to use lapply (incorrect code)
 km.clust <- lapply(list_df, kmeans, centers = df_centers$centers)

I am trying to do it along these lines, but I do not know how to make kmeans use the matching entry of df_centers for each dataset in list_df. How can this be done?

    2 answers

    As you have already worked out, mapply() is the most suitable option. But you can also use lapply(); you just need to iterate over indices instead of values:

     km.clust <- lapply(seq_along(list_df), function(i) {
       kmeans(list_df[[i]], centers = df_centers$centers[i])
     })
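
One side effect of iterating over indices is that the result is an unnamed list, because lapply() takes its names from the integer vector and not from list_df. A minimal sketch of restoring the names, assuming three small stand-in datasets and a plain `centers` vector in place of df_centers (both are illustrative, not the data from the question):

```r
set.seed(199)

# Illustrative stand-in data: three small data frames and their cluster counts
list_df <- list(
  df1 = data.frame(replicate(4, sample(1:100, 40, replace = TRUE))),
  df2 = data.frame(replicate(3, sample(1:100, 30, replace = TRUE))),
  df3 = data.frame(replicate(5, sample(1:100, 20, replace = TRUE)))
)
centers <- c(3, 4, 2)

# Iterate over positions so that dataset i is paired with centers[i]
km.clust <- lapply(seq_along(list_df), function(i) {
  kmeans(list_df[[i]], centers = centers[i])
})

names(km.clust)  # NULL: the names of list_df are lost

# Copy the names back so that results can be looked up by dataset name
km.clust <- setNames(km.clust, names(list_df))
km.clust$df2$size  # cluster sizes for the second dataset
```

With the names restored, each result can be accessed the same way as with the mapply() variant.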

    Also keep in mind parallel::mcmapply, a parallel counterpart of mapply that can speed things up when there are many datasets. It relies on forking, so it does not run in parallel on Windows.
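
A rough sketch of the mcmapply() variant, assuming two worker processes and the same kind of stand-in data (the `n_cores` variable and the `centers` vector are illustrative names). On Windows the sketch falls back to a single core, since mc.cores greater than 1 is not supported there:

```r
library(parallel)

set.seed(199)

# Illustrative stand-in data: three small data frames and their cluster counts
list_df <- list(
  df1 = data.frame(replicate(4, sample(1:100, 40, replace = TRUE))),
  df2 = data.frame(replicate(3, sample(1:100, 30, replace = TRUE))),
  df3 = data.frame(replicate(5, sample(1:100, 20, replace = TRUE)))
)
centers <- c(3, 4, 2)

# Assumption: two workers; forking is unavailable on Windows, so use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Same call shape as mapply(); SIMPLIFY = FALSE keeps one kmeans object per dataset
km.clust <- mcmapply(
  kmeans, list_df,
  centers = centers,
  SIMPLIFY = FALSE,
  mc.cores = n_cores
)

names(km.clust)
```

For a handful of small datasets the cost of forking probably outweighs the gain; it is more likely to pay off with hundreds of datasets, as in the question.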

    If you need to vary several parameters at once, the pmap function from the purrr package is convenient:

     params <- list(x = list_df, centers = df_centers$centers)
     km.clust <- purrr::pmap(params, kmeans)
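
To show what varying several parameters might look like, here is a sketch that adds a per-dataset nstart value to the parameter list. It assumes purrr is installed; the nstart values and the stand-in data are made up for illustration. Each element name in `params` has to match an argument name of kmeans:

```r
library(purrr)

set.seed(199)

# Illustrative stand-in data: three small data frames
list_df <- list(
  df1 = data.frame(replicate(4, sample(1:100, 40, replace = TRUE))),
  df2 = data.frame(replicate(3, sample(1:100, 30, replace = TRUE))),
  df3 = data.frame(replicate(5, sample(1:100, 20, replace = TRUE)))
)

# One entry per kmeans argument; position i of each entry belongs to dataset i
params <- list(
  x       = list_df,
  centers = c(3, 4, 2),
  nstart  = c(5, 10, 5)   # hypothetical numbers of random restarts
)

# pmap() calls kmeans(x = ..., centers = ..., nstart = ...) once per dataset
km.clust <- pmap(params, kmeans)

km.clust$df2$size  # cluster sizes for the second dataset
```

pmap() always returns a list, so there is no simplification step to switch off.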

      Solved it myself. Replacing lapply with mapply lets a different parameter value be used for each element.

       km.clust <- mapply(kmeans, list_df, centers = df_centers$centers) 
      • It is better to add SIMPLIFY = FALSE to the mapply() call, since kmeans returns a list; otherwise mapply() tries to simplify the results into a matrix. - Artem Klevtsov
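
The difference mentioned in the comment can be checked with a short sketch, again on small stand-in data with an illustrative `centers` vector:

```r
set.seed(199)

# Illustrative stand-in data: two small data frames and their cluster counts
list_df <- list(
  df1 = data.frame(replicate(4, sample(1:100, 40, replace = TRUE))),
  df2 = data.frame(replicate(3, sample(1:100, 30, replace = TRUE)))
)
centers <- c(3, 4)

# Default SIMPLIFY = TRUE: the components of each result become a matrix column
simplified <- mapply(kmeans, list_df, centers = centers)

# SIMPLIFY = FALSE: a named list with one kmeans object per dataset
km.clust <- mapply(kmeans, list_df, centers = centers, SIMPLIFY = FALSE)

is.matrix(simplified)   # TRUE
class(km.clust$df1)     # "kmeans"
km.clust$df2$size       # cluster sizes for the second dataset
```

In the simplified form the individual results lose their "kmeans" class, so functions such as print() or fitted() no longer treat them as clustering results.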