Mastering MLOps: Enhancing Machine Learning Workflows with Prefect - Insights from MLOps Zoomcamp 2023

The third topic covered in MLOps Zoomcamp by DataTalksClub is the orchestration of machine learning workflows. Orchestration involves managing and coordinating tasks and components within a machine learning pipeline or workflow. It streamlines and automates the various tasks and stages in a structured and controlled manner.

In this module, the orchestration platform used is Prefect. The tasks covered include:

  1. Using a local Prefect server

  2. Setting up files containing tasks and flows

  3. Deploying the flow

  4. Scheduling tasks

  5. Sending email notifications within the workflow

  6. Using Prefect Cloud

  7. Automating notifications

Installing Prefect locally is straightforward: use pip inside a Python environment.

pip install prefect

Then start the server from the working directory with the following command.

prefect server start

Workflows in Prefect are built from two main units: tasks and flows. Both are applied as decorators to ordinary Python functions. A task represents a single unit of work within a workflow, which can be a simple or a complex operation. A flow, on the other hand, is a collection of tasks arranged in a specific order with predefined dependencies. Here is an example implementation in Python.

import pandas as pd
from prefect import flow, task


@task(retries=3, retry_delay_seconds=2, name="Read taxi data")
def read_data(filename: str) -> pd.DataFrame:
    """Read data into DataFrame"""
    # Load or read data into dataframe code

@task(name="Extract the features")
def add_features(df_train: pd.DataFrame, df_val: pd.DataFrame):
    """Add features to the model"""
    # Extract the features from the data code
    # (should return X_train, X_val, y_train, y_val, dv)

@task(log_prints=True, name="Train the model")
def train_best_model(X_train, X_val, y_train, y_val, dv):
    """Train a model with best hyperparams and write everything out"""
    # train the model code

@flow
def send_notification_email(email_addresses: list[str], msg: str):
    """Send a notification email to each address"""
    # send notification to emails code

@flow
def main_flow():
    """The main training pipeline"""

    # train_path, val_path, email_addresses, and msg are not shown in this
    # excerpt; define them (for example as flow parameters) before running.

    # Load
    df_train = read_data(train_path)
    df_val = read_data(val_path)

    # Transform
    X_train, X_val, y_train, y_val, dv = add_features(df_train, df_val)

    # Train
    markdown_report = train_best_model(X_train, X_val, y_train, y_val, dv)

    # Send notification
    send_notification_email(email_addresses, msg)


if __name__ == "__main__":
    main_flow()

In this code, the following functions serve as tasks:

  1. read_data()

  2. add_features()

  3. train_best_model()
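The read_data() task is declared with retries=3 and retry_delay_seconds=2. To make that behaviour concrete, here is a minimal plain-Python sketch of the idea: one initial attempt, then up to three more, with a pause between them. The with_retries helper, the flaky_read function, and the file name are hypothetical names made up for this illustration; this is not how Prefect implements retries internally.

```python
import time
from functools import wraps


def with_retries(retries: int, retry_delay_seconds: float):
    """Hypothetical helper: re-run a function up to `retries` extra times."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):  # initial attempt + retries
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of retries, surface the last error
                    time.sleep(retry_delay_seconds)
        return wrapper
    return decorator


calls = {"count": 0}


# Delay set to 0 so the example finishes instantly; the article uses 2 seconds.
@with_retries(retries=3, retry_delay_seconds=0)
def flaky_read(filename: str) -> str:
    """Hypothetical stand-in for read_data() that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("temporary failure")
    return f"loaded {filename}"


result = flaky_read("taxi.parquet")  # hypothetical file name
```

Here the third attempt succeeds, so the caller never sees the two earlier failures. With a real task, Prefect records each attempt in the UI as well.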

Functions that serve as flows:

  1. main_flow()

  2. send_notification_email(), which runs as a subflow of main_flow()
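One way to picture "tasks arranged in a specific order with predefined dependencies" is as a small dependency graph. The sketch below uses only Python's standard library (graphlib, Python 3.9+) to derive a valid execution order. The graph is my own reading of main_flow(), written out by hand as an assumption; it is not something Prefect exposes in this form.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps that must finish first.
# This mapping is an assumption based on the call order inside main_flow().
dependencies = {
    "read_data": set(),
    "add_features": {"read_data"},
    "train_best_model": {"add_features"},
    "send_notification_email": {"train_best_model"},
}

# static_order() yields the steps so that every dependency comes first.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because this graph is a simple chain, there is only one valid order, and it matches the Load, Transform, Train, Notify sequence in the flow body.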

In practice, I found that the work divides into two stages: development and deployment. In the development stage, we only need to make sure that the pipeline itself is built correctly, without worrying about scheduling, notifications, or other operational concerns. As long as the Prefect server is running, we can simply execute the file that contains the tasks and flows. For example, if the code lives in orchestrate.py in the working directory, run the following command.

python orchestrate.py

The results can be viewed in the Prefect UI.

For the deployment stage, the following steps are taken:

  1. Initialize the Prefect project by running

     prefect project init

     This generates the configuration files for the Prefect project. (Newer Prefect releases rename this command to prefect init.)

  2. Deploy the flow by running

     prefect deploy orchestrate.py:main_flow -n hw_taxi1 -p zoomcamppool

     This command names the deployment 'hw_taxi1' and assigns it to the work pool 'zoomcamppool', which will run the flow. The work pool can be created through the UI or the CLI.

  3. Start a worker for the pool, after checking that the work pool is not paused.

     prefect worker start -p zoomcamppool

  4. Run the deployed flow

     prefect deployment run main-flow/hw_taxi1

  5. Set the schedule through either the UI or the CLI. For example, the following cron expression runs the deployment at 09:00 on the 3rd day of every month.

     prefect deployment set-schedule --cron "0 9 3 * *" main-flow/hw_taxi1
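To see what the cron expression "0 9 3 * *" from the last step means in practice (minute 0, hour 9, day 3, every month, any weekday), here is a small standard-library sketch that lists the next few run times. The function next_monthly_runs is a hypothetical helper written for this one schedule; it is not a general cron parser and not part of Prefect, and it ignores time zones.

```python
from datetime import datetime


def next_monthly_runs(after: datetime, count: int,
                      day: int = 3, hour: int = 9, minute: int = 0) -> list[datetime]:
    """Return the next `count` run times strictly after `after`."""
    runs = []
    year, month = after.year, after.month
    while len(runs) < count:
        candidate = datetime(year, month, day, hour, minute)
        if candidate > after:
            runs.append(candidate)
        # Move on to the following month, rolling over the year if needed.
        month += 1
        if month > 12:
            month = 1
            year += 1
    return runs


# Starting from an arbitrary example date, list the next two scheduled runs.
upcoming = next_monthly_runs(datetime(2023, 6, 10), count=2)
for run in upcoming:
    print(run.isoformat())
```

Since the 3rd of June has already passed on the example date, the first run falls in July.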
    

We can monitor the running flows and configure settings through the Prefect UI.
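A small detail that is easy to miss: the deployment is addressed as main-flow/hw_taxi1, even though the Python function is called main_flow. As far as I understand, when a flow is not given an explicit name, Prefect derives one from the function name by replacing underscores with hyphens, and the deployment identifier is that flow name plus the deployment name. The helper default_flow_name below is a hypothetical function that only mimics this convention for illustration.

```python
def default_flow_name(function_name: str) -> str:
    """Hypothetical helper mimicking the default flow-name convention."""
    return function_name.replace("_", "-")


# Flow name and deployment name joined by a slash, as used on the CLI.
deployment_id = f"{default_flow_name('main_flow')}/hw_taxi1"
print(deployment_id)
```

If the flow were declared with an explicit name, for example @flow(name="..."), that name would be used in the identifier instead.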

That's all for this week's article and progress. I believe there is still much to explore, especially regarding integration with other aspects of MLOps. Are there any alternative platforms for machine learning orchestration?

saifulrijal-ds/mlops-zoomcamp-hw-prefect (github.com)