The data science process

The data science process typically consists of six steps:


1. Setting the research goal:

  • This step defines the purpose and goals of the data science project.
  • The what, the why, and the how of the project are recorded in a project charter.
  • The charter states what is being researched, the benefits to the company, the required data and resources, a timetable, and the expected deliverables.

2. Retrieving data:

  • Collect the necessary data as specified in the project charter.
  • This step confirms that the specified data is available, assesses its quality, and verifies access to the identified data sources.
  • The data is either found within the company or retrieved from a third party.
  • It takes many forms, ranging from Excel spreadsheets to different types of databases.
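
As a rough illustration of retrieving data from both kinds of sources, here is a minimal sketch using only the Python standard library. The `orders` table, the column names, and the inlined CSV text are hypothetical stand-ins for a company database and a third-party export; a real project would point at the sources named in its project charter.

```python
import csv
import io
import sqlite3

# Hypothetical third-party export, inlined as text so the sketch is self-contained.
THIRD_PARTY_CSV = """customer_id,region
1,North
2,South
3,North
"""


def load_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def make_demo_database():
    """Create an in-memory SQLite database standing in for a company data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 120.0), (2, 75.5), (3, 210.0)],
    )
    conn.commit()
    return conn


def load_internal_orders(conn):
    """Read order rows from the (hypothetical) internal `orders` table."""
    cursor = conn.execute(
        "SELECT customer_id, amount FROM orders ORDER BY customer_id"
    )
    return [{"customer_id": cid, "amount": amount} for cid, amount in cursor]


regions = load_csv_rows(THIRD_PARTY_CSV)
orders = load_internal_orders(make_demo_database())
print(len(regions), "rows from the third-party file")
print(len(orders), "rows from the internal database")
```

Note that the CSV reader returns every value as a string while the database returns typed values, which is one reason a separate preparation step is needed.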

3. Data preparation:

  • Data preparation mitigates errors introduced during data collection. It consists of three subphases:
    • Data Cleansing: Remove false values and inconsistencies.
    • Data Integration: Combine information from multiple sources.
    • Data Transformation: Format data for modeling.
  • Together, these subphases improve the quality of the data and put it into a format suitable for your models and for use in subsequent steps.
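
The three subphases can be sketched as three small functions applied in sequence. This is only an illustration under assumptions: the records, the rule that a negative amount counts as a false value, and the choice to aggregate per region are all hypothetical, and real projects would define their own rules.

```python
# Hypothetical raw inputs: order amounts (with errors) and a customer lookup.
raw_orders = [
    {"customer_id": "1", "amount": "120.0"},
    {"customer_id": "2", "amount": "-75.5"},  # negative amount: treated as a false value
    {"customer_id": "3", "amount": "n/a"},    # unparseable amount
    {"customer_id": "1", "amount": "80.0"},
]
customers = {"1": "North", "2": "South", "3": "North"}


def cleanse(rows):
    """Data cleansing: drop rows whose amount is unparseable or negative."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        if amount < 0:
            continue
        clean.append({"customer_id": row["customer_id"], "amount": amount})
    return clean


def integrate(rows, region_by_customer):
    """Data integration: attach each customer's region from the second source."""
    return [
        {**row, "region": region_by_customer.get(row["customer_id"], "unknown")}
        for row in rows
    ]


def transform(rows):
    """Data transformation: aggregate to one total per region for modeling."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


prepared = transform(integrate(cleanse(raw_orders), customers))
print(prepared)  # {'North': 200.0}
```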

4. Data exploration:

  • Exploratory Data Analysis (EDA) is a crucial step in the data science process, aiming to enhance our understanding of the dataset.
  • It uses descriptive statistics, visual techniques, and simple modeling to understand how variables interact, see how the data is distributed, and identify outliers.
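
To make the descriptive-statistics and outlier parts concrete, here is a small sketch using Python's `statistics` module. The sample values are made up, and the 1.5 × IQR rule is just one common convention for flagging outliers, chosen here for illustration; it is not prescribed by the process itself.

```python
import statistics

# Hypothetical measurements; the last value looks suspicious.
values = [12.0, 14.0, 13.5, 15.0, 14.5, 13.0, 95.0]


def describe(data):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
        "min": min(data),
        "max": max(data),
    }


def iqr_outliers(data, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


summary = describe(values)
outliers = iqr_outliers(values)
print(summary)
print("possible outliers:", outliers)
```

A flagged value is a prompt for investigation, not an automatic deletion: it may be a genuine observation or a sign of an import error.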

5. Data modeling or model building:

  • The goal of this phase is to utilize models, domain knowledge, and insights gained from earlier steps to address the research question.
  • It involves selecting a modeling technique from fields such as statistics, machine learning, and operations research.
  • Building a model is an iterative process that involves selecting the variables for the model, executing the model, and running model diagnostics.
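
The loop of selecting variables, executing the model, and checking diagnostics might look like the sketch below. It assumes a one-variable least-squares line as the modeling technique and R² as the diagnostic, both chosen only to keep the example short; the variable names and numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Execute the model: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Model diagnostic: share of the variance in ys explained by the line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical candidate variables and target.
candidates = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "store_temperature": [21.0, 19.0, 22.0, 20.0, 21.0],
}
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

# Variable selection: try each candidate, fit, and compare diagnostics.
results = {}
for name, xs in candidates.items():
    slope, intercept = fit_line(xs, sales)
    results[name] = r_squared(xs, sales, slope, intercept)

best = max(results, key=results.get)
print("best single variable:", best)
```

In practice each pass through this loop can send you back to choose different variables or a different technique, which is what makes model building iterative.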

6. Presentation and automation:

  • The last step is to present your results to the stakeholders.
  • The results can be presented in various forms, from presentations to research reports.
  • Automating your model makes it easier to reuse insights in other projects and enables an operational process to use the outcome from the model.
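
One simple way to picture automation is to save a fitted model's parameters and load them elsewhere as a scoring function. The sketch below assumes a linear model with a slope and an intercept and uses JSON as the storage format; the function names `export_model` and `load_scorer` are hypothetical, and real deployments often use dedicated tooling instead.

```python
import json


def export_model(slope, intercept):
    """Serialize the fitted parameters so another process can reuse them."""
    return json.dumps({"slope": slope, "intercept": intercept})


def load_scorer(serialized):
    """Rebuild a scoring function from serialized parameters."""
    params = json.loads(serialized)

    def score(x):
        return params["slope"] * x + params["intercept"]

    return score


serialized = export_model(2.0, 1.0)
score = load_scorer(serialized)
print(score(3.0))  # 7.0
```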

An iterative process

  • The data science process is not strictly linear. Instead, it involves iterative steps with the potential for revisiting and reworking findings. For example, discovering outliers during data exploration might signal data import errors, prompting a return to earlier stages. Incremental insights often lead to new questions. To minimize rework, it’s crucial to establish a clear and comprehensive scope for the business question at the outset.