What is the data science life cycle?
Let’s take a look at the 5 most important steps in the data science life cycle. Knowing the algorithms is not enough; you also need a holistic view of the project, and that is what the life cycle provides. As in any software development project (data science is, in effect, software development applied to business), the life cycle describes the stages needed to develop a data science project correctly.
Not every data science project needs every one of these stages; what follows is a generalization.
Understanding the problem and the business
The first thing to do when a problem is raised is to understand what the problem is and how it might be solved. Data science is not about algorithms or technology; it is about solving problems. If you already have business knowledge, this phase can be very fast, but in other cases it may take a lot of research on the problem.
Information location and analysis
The second stage is to locate and understand the data that will be used to address the problem. You need to know where the data is, what it means, how to extract it, and its approximate volume. The sources can be very varied: internal to the organization, public data sources, or external information providers. The extraction method will also affect productivity.
Data quality and preprocessing
The next step is to preprocess and analyze the data. One of the most important concepts here is quality assurance, that is, knowing the data from a technical point of view. Data validation is also necessary so that we work with quality information. A well-known phrase in the industry is “garbage in, garbage out”: if the data is bad, the outputs are bad.
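As a rough illustration of data validation, the sketch below checks each record against a few rules and separates clean rows from problems. The column names (`age`, `country`), the rules, and the sample rows are all hypothetical, chosen only to make the idea concrete.

```python
# Hypothetical validation rules: column name -> check a valid value must pass.
RULES = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "country": lambda v: isinstance(v, str) and v.strip() != "",
}

def validate(rows):
    """Return (valid_rows, problems); each problem is (row_index, column, value)."""
    valid, problems = [], []
    for i, row in enumerate(rows):
        bad = [(i, col, row.get(col)) for col, ok in RULES.items() if not ok(row.get(col))]
        if bad:
            problems.extend(bad)
        else:
            valid.append(row)
    return valid, problems

# Made-up sample data with three deliberately bad rows.
rows = [
    {"age": 34, "country": "ES"},
    {"age": -5, "country": "FR"},    # impossible age
    {"age": 51, "country": ""},      # empty country
    {"age": None, "country": "DE"},  # missing age
]
valid, problems = validate(rows)
print(len(valid), "valid rows;", len(problems), "problems")
```

Keeping the problems alongside the valid rows, rather than silently dropping bad records, makes it easier to trace quality issues back to their source.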
Modeling
The modeling stage has four sub-steps: first exploratory analysis, then feature (variable) engineering, then training, and finally validation. They all belong to the same stage because the work is usually iterative: a change in how a variable is generated affects training and validation, and so does a change in an independent variable. Suppose the exploratory analysis shows that a variable makes no business sense: it is removed, and all four sub-steps have to be run again.
This is the stage where we build the machine learning model and check, through validation, that it works correctly. Companies usually give modeling a lot of weight, but in practice it is only part of the picture. Data scientists generally spend much more time on the other phases than on this one.
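To make the four sub-steps concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the data is synthetic, the variable names (`size_m2`, `price`) are invented, and a one-variable least-squares fit stands in for whatever model a real project would use.

```python
import random
from statistics import mean

def explore(rows, column):
    # Exploratory analysis, reduced here to a few summary statistics.
    values = [r[column] for r in rows]
    return {"min": min(values), "max": max(values), "mean": mean(values)}

def engineer_features(row):
    # Feature engineering: keep only the variable assumed to make business sense.
    return row["size_m2"]

def train(xs, ys):
    # Training: ordinary least squares for one feature, y = a + b * x.
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def validate(model, xs, ys):
    # Validation: mean absolute error on rows the model has not seen.
    a, b = model
    return mean(abs(a + b * x - y) for x, y in zip(xs, ys))

# Synthetic data: price grows with size, plus random noise.
random.seed(0)
data = [{"size_m2": s, "price": 1000 * s + random.gauss(0, 5000)} for s in range(40, 140)]
random.shuffle(data)
train_rows, valid_rows = data[:80], data[80:]

print(explore(train_rows, "size_m2"))
model = train([engineer_features(r) for r in train_rows], [r["price"] for r in train_rows])
mae = validate(model, [engineer_features(r) for r in valid_rows], [r["price"] for r in valid_rows])
print(f"slope={model[1]:.1f}, validation MAE={mae:.1f}")
```

Because each sub-step is its own function, changing `engineer_features` means rerunning training and validation, which mirrors the iterative nature of this stage.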
Deployment
At this point we have a model that was built and validated in the previous stage, and here it is deployed; that is, it starts running automatically. Deployment is followed by monitoring of the deployed model. When monitoring detects a failure, the model has to be reviewed, which means returning to the previous stage or even to earlier ones.
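One simple way to picture monitoring is a check that compares the model's recent error with the error measured at validation time. The sketch below assumes such a check; the function name `needs_review`, the tolerance factor of 1.5, and the error values are hypothetical.

```python
from statistics import mean

def needs_review(baseline_error, recent_errors, tolerance=1.5):
    """Return True when the recent mean error exceeds the baseline by the tolerance factor."""
    return mean(recent_errors) > tolerance * baseline_error

# Errors close to the validation baseline: no action needed.
print(needs_review(4000.0, [3900.0, 4200.0, 4100.0]))  # False
# Errors well above the baseline: go back to the modeling stage.
print(needs_review(4000.0, [7000.0, 8200.0, 9100.0]))  # True
```

In practice the right threshold depends on the business cost of a wrong prediction, so the tolerance here should be read as a placeholder.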
This whole process is described at a very high level; each stage could be elaborated in much more detail.
So these are the 5 major stages in the data science life cycle.