
Building a Data Science "special forces" team in-house

Practice shows that many enterprise companies struggle to implement analytics projects.


The thing is that, unlike classical projects for procuring hardware or deploying vendor solutions, which fit a linear execution model, tasks involving advanced analytics (data science) are very hard to formalize as a clear and unambiguous technical specification detailed enough to hand off to a contractor. The situation is aggravated by the fact that the work requires integrating a mass of different internal IT systems and data sources, and some questions and answers appear only after work with the data begins and the real state of affairs is revealed, which usually differs greatly from the documented picture of the world.

All this means that writing a competent specification requires preliminary work comparable to half the project: studying and formalizing the real needs, analyzing the data sources, their relationships, structure and gaps. Employees capable of pulling off work of that scale hardly exist inside organizations. As a result, tenders are published with rather raw requirements. In the best case, the tender is canceled (sent back for revision) after a cycle of clarifying questions. In the worst case, after a hefty budget and a long timeline, the result is something completely different from what the authors of the requirements had in mind, and they are left with nothing.


A sensible alternative is to create data science (DS) teams within the company. If you are not aiming to build the Egyptian pyramids, a team of 2-3 competent specialists can accomplish a great deal. But then another question arises: how to train these specialists. Below I want to share a set of successfully tested considerations for quickly preparing such a "special forces" team with R as the weapon of choice.


This is a continuation of previous publications.


The problem


At the moment, finding competent specialists on the market is a big problem. Therefore, it is very useful to consider a strategy of training merely literate and adequate people instead. At the same time, the required training has its own specifics.



For all the merits of Coursera, DataCamp, various books and ML programs, none of these course sets provided the required combination of skills. They serve as excellent sources for further mastery, but not for a quick start. The main task of a quick start is to point out the paths, swamps and traps; to acquaint people with the range of existing tools; to show how the company's tasks can be solved with the tool; and to throw them out of the boat into the lake and make them swim.


It is important to show that R is not only a tool but also a living community. Therefore, using a large number of relevant community materials, including presentations, is one format of working with that community. You can even write questions to Hadley on Twitter or GitHub; worthy questions get comprehensive answers.


As a result of various experiments, a structured "Deep Dive Into R" approach to presenting the base material took shape.


Diving into R



Each student receives a practical assignment ("coursework") from their management in the form of a real task, which they will have to complete during the dive and defend upon completion of the course.


Day 1


A brief introduction to R. Syntax and structure of the language. Basics of using the RStudio IDE for analysis and development. Basic types and data structures. Interactive calculations and execution of program code. A first look at R Markdown and R Notebook. Principles of working with libraries. Preparing for analytical work: installing the necessary libraries, creating a project. Principles of profiling calculations, finding bottlenecks (extremely slow spots) and eliminating them.
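As a rough illustration of the Day 1 material, the sketch below shows a one-off environment setup, an interactive calculation and a simple timing check; the package set and the profiled snippet are illustrative, not prescribed by the course.

```r
# One-off environment setup (illustrative package set)
install.packages(c("tidyverse", "data.table", "profvis"))

library(tidyverse)

# Interactive calculation in the console
x <- rnorm(1e6)
mean(x)

# Simple timing to locate bottlenecks
system.time(sapply(1:1e4, function(i) sqrt(i)))

# profvis gives a detailed, line-by-line profile:
# profvis::profvis({ <code under study> })
```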



Day 2


The concept and ecosystem of the tidyverse packages (https://www.tidyverse.org/). A brief overview of the packages included in it (import / processing / visualization / export). The concept of tidy data as the basis of the tidyverse working methods. tibble as the data representation format. Transformations and data manipulations. Syntax and principles of pipeline processing (pipe). The "group - compute - assemble" pattern. (Packages tibble, dplyr, tidyr.)
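A minimal sketch of the pipe-based "group - compute - assemble" pattern; the sales table is invented purely for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented sales data, already in tidy form
sales <- tibble(
  region  = c("N", "N", "S", "S", "S"),
  product = c("a", "b", "a", "a", "b"),
  amount  = c(10, 15, 7, 12, 9)
)

# Pipe-based processing: group -> compute -> assemble
sales %>%
  group_by(region, product) %>%
  summarise(total = sum(amount), .groups = "drop") %>%
  pivot_wider(names_from = product, values_from = total)
```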



Day 3


Building graphical representations with ggplot (https://ggplot2.tidyverse.org/reference/index.html). Using graphical tools for analyzing business data.
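A possible ggplot2 example of the kind used on this day; the revenue data frame is invented for illustration.

```r
library(ggplot2)

# Invented monthly revenue by sales channel
df <- data.frame(
  month   = rep(1:6, times = 2),
  channel = rep(c("retail", "online"), each = 6),
  revenue = c(5, 6, 7, 6, 8, 9, 2, 3, 3, 4, 5, 7)
)

ggplot(df, aes(month, revenue, colour = channel)) +
  geom_line() +
  geom_point() +
  labs(title = "Revenue by sales channel", x = "Month", y = "Revenue, mln")
```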



Day 4


Working with string and factor (enum) types. Basics of regular expressions. Working with dates. (Packages stringi, stringr, forcats, re2r, lubridate, anytime.)
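An illustrative snippet combining regular expressions, dates and factors; the log lines are made up.

```r
library(stringr)
library(lubridate)
library(forcats)

log_lines <- c("2019-02-11 10:15:03 ERROR timeout",
               "2019-02-11 10:15:07 INFO ok")

# Regular expressions: extract the timestamp and the severity level
ts    <- str_extract(log_lines, "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}")
level <- str_extract(log_lines, "ERROR|WARN|INFO")

# Dates and factors
ymd_hms(ts)
fct_count(factor(level))
```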



Day 5


Advanced data import: txt, csv, json, odbc, web scraping (REST API), xlsx.
(Packages readr, openxlsx, jsonlite, curl, rvest, httr, readtext, DBI, data.table.)
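A sketch of typical import calls; the file paths and the REST endpoint below are hypothetical placeholders.

```r
library(readr)
library(jsonlite)
library(httr)

# Flat files
df_csv <- read_csv("data/sales.csv")            # hypothetical path

# JSON over a REST API (hypothetical endpoint)
resp   <- GET("https://example.com/api/v1/orders")
orders <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Excel workbooks
df_xlsx <- openxlsx::read.xlsx("data/plan.xlsx", sheet = 1)
```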



Day 6


Data export: rds, csv, json, xlsx, docx, pptx, odbc. Basics of R Markdown and R Notebook.
(Packages openxlsx, officer, DBI, jsonlite, readr, data.table, knitr.)
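A short illustration of the export formats listed above; file names are arbitrary.

```r
library(openxlsx)
library(readr)
library(jsonlite)

results <- data.frame(kpi = c("revenue", "churn"), value = c(10.5, 0.07))

saveRDS(results, "results.rds")        # native R serialization
write_csv(results, "results.csv")
write_json(results, "results.json")
write.xlsx(results, "results.xlsx")    # openxlsx

# docx/pptx reports are assembled with the officer package,
# or rendered from an R Markdown document via knitr/rmarkdown.
```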





Day 7


Basics of programming in R. Creating functions. Variable scope. Inspecting objects and their structure. Principles of working "by reference."
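The following sketch touches the three themes of the day: functions with lexical scope, object inspection, and modification "by reference" via data.table; the counter example is invented.

```r
# Functions and lexical scope
make_counter <- function() {
  n <- 0
  function() {
    n <<- n + 1        # modifies n in the enclosing environment
    n
  }
}
counter <- make_counter()
counter(); counter()   # 1, then 2

# Inspecting objects and their structure
str(mtcars)
class(mtcars)

# "By reference" semantics: data.table modifies the object in place
library(data.table)
dt <- as.data.table(mtcars)
dt[, wt_kg := wt * 453.6]   # adds a column without copying the table
```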



Day 8


Approaches to validating intermediate and final results. Principles of collaboration and building reproducible calculations. A demonstration of shiny applications as a target interface for end users. (Packages checkmate, reprex, futile.logger, shiny.)
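A minimal checkmate-based validation sketch; the calc_margin function is an invented example.

```r
library(checkmate)

# Validate inputs before the result travels further down the pipeline
calc_margin <- function(revenue, cost) {
  assert_numeric(revenue, lower = 0, any.missing = FALSE)
  assert_numeric(cost,    lower = 0, len = length(revenue))
  (revenue - cost) / revenue
}

calc_margin(c(100, 250), c(60, 200))
# calc_margin(c(100, -1), c(60, 200))   # fails fast with a clear message

# reprex::reprex({ ... }) packages a minimal reproducible example
# for sharing with colleagues or with the community.
```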



Day 9


Methods and approaches for working with data of "medium" size. The data.table package: main functions, a comparative experimental analysis.
A review of additional questions that came up during days 1-8.
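An illustrative data.table snippet showing the dt[i, j, by] idiom on synthetic data; the referenced csv file is hypothetical.

```r
library(data.table)

# Synthetic "medium"-sized data
dt <- data.table(region = sample(c("N", "S"), 1e6, replace = TRUE),
                 amount = runif(1e6))

# The dt[i, j, by] idiom: filter, aggregate and group in one call
dt[amount > 0.5, .(total = sum(amount), n = .N), by = region]

# fread()/fwrite() are typically much faster than base read.csv()/write.csv()
# on files of hundreds of megabytes:
# system.time(fread("big_file.csv"))    # hypothetical file
```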


Day 10


Defense of the coursework


Requirements for participants' workstations (Windows 10 based)



Books



Conclusion


  1. The proposed sequence of presenting the material is not a dogma. Various digressions and the inclusion of additional material, including mathematical interludes, are possible. Everything is determined by the actual topical issues and tasks chosen for the coursework and by the list of popular production questions. The most requested topics are regression algorithms, clustering, text mining and work with time series.
  2. Questions of parallel computing, creating shiny applications, using ML algorithms and external platforms do not fit into the concept of a "fast immersion," but can be a continuation once practical work has begun.

P.S. HR usually has difficulty formulating job requirements.
Here is a possible example as a seed; everyone can supplement and adjust it based on their own expectations.


Data Science (DS): Big Data and Analytics. Job Requirements



Previous publication: "How fast is R for production?".



Source: https://habr.com/ru/post/440700/