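5.2. Hyperparameter Tuning

We can use Dask for hyperparameter tuning in two main ways:

  • Use scikit-learn's joblib backend to distribute multiple hyperparameter tuning tasks across a Dask cluster

  • Use the hyperparameter tuning API provided by Dask-ML

Both approaches target scenarios where the training data fits in the memory of a single machine.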

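scikit-learn joblib

Single-machine scikit-learn already provides rich, easy-to-use interfaces for model training and hyperparameter tuning, and by default it uses joblib to parallelize across the cores of one machine. Hyperparameter tuning tasks such as random search and grid search are easy to parallelize because the individual trials do not depend on one another.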
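
Because each grid point is evaluated independently, the work maps directly onto a pool of workers. The sketch below illustrates the idea with the Python standard library only; it does not use joblib, and the `evaluate` function with its toy scoring rule is a hypothetical stand-in for training and validating a model.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(alpha):
    """Hypothetical stand-in for 'train a model with this alpha and score it'.

    The toy score peaks at alpha = 0.01; a real task would fit a model instead.
    """
    return -abs(alpha - 0.01)


# A small grid of candidate values; no candidate depends on another.
grid = [0.0001, 0.001, 0.01, 0.1, 1.0]

# Each candidate is an independent task, so the pool may run them in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, grid))

# Pick the candidate with the highest score.
best_score, best_alpha = max(zip(scores, grid))
print(best_alpha)
```

Swapping the thread pool for a cluster of remote workers changes where the tasks run but not how they are defined, which is why a grid search is straightforward to scale out.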

Example: Flight Delay Prediction (scikit-learn)

The following is a machine learning classification example based on scikit-learn, using the grid search that scikit-learn provides.

import os

import sys
sys.path.append("..")
from utils import nyc_flights

import numpy as np
import pandas as pd

folder_path = nyc_flights()
file_path = os.path.join(folder_path, "nyc-flights", "1991.csv")
input_cols = [
    "Year",
    "Month",
    "DayofMonth",
    "DayOfWeek",
    "CRSDepTime",
    "CRSArrTime",
    "UniqueCarrier",
    "FlightNum",
    "ActualElapsedTime",
    "Origin",
    "Dest",
    "Distance",
    "Diverted",
    "ArrDelay",
]

df = pd.read_csv(file_path, usecols=input_cols)
df = df.dropna()

# Binary target: whether the flight arrived more than 10 minutes late
df["ArrDelayBinary"] = 1.0 * (df["ArrDelay"] > 10)

df = df[df.columns.difference(["ArrDelay"])]

# Convert string fields such as Dest/Origin/UniqueCarrier to integer category codes
for col in df.select_dtypes(["object"]).columns:
    df[col] = df[col].astype("category").cat.codes.astype(np.int32)

for col in df.columns:
    df[col] = df[col].astype(np.float32)

from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import GridSearchCV as SkGridSearchCV
from sklearn.model_selection import train_test_split as sk_train_test_split

_y_label = "ArrDelayBinary"
X_train, X_test, y_train, y_test = sk_train_test_split(
    df.loc[:, df.columns != _y_label], 
    df[_y_label], 
    test_size=0.25,
    shuffle=False,
)

model = SGDClassifier(penalty='elasticnet', max_iter=1_000, warm_start=True, loss='log_loss')
params = {'alpha': np.logspace(-4, 1, num=81)}

sk_grid_search = SkGridSearchCV(model, params)
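
To get a feel for the size of this search, the sketch below rebuilds the same logarithmic grid in plain Python and counts the model fits. It assumes the 5-fold cross-validation that GridSearchCV applies in recent scikit-learn releases when `cv` is not given, plus one final refit of the best candidate (the default `refit=True`); both are assumptions about defaults rather than settings taken from the code above.

```python
# Rebuild np.logspace(-4, 1, num=81) without NumPy: 81 exponents evenly
# spaced between -4 and 1, each used as a power of 10.
num = 81
alphas = [10 ** (-4 + 5 * i / (num - 1)) for i in range(num)]

n_folds = 5  # assumed: default cv of GridSearchCV when cv is not given
n_fits = len(alphas) * n_folds + 1  # 81 * 5 + 1 = 406, the +1 is the refit

print(alphas[0], alphas[-1])
print(n_fits)
```

Several hundred independent fits is the kind of workload that benefits from being spread over more cores than one machine offers.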

When running the hyperparameter search, we only need to wrap the call to fit() in with joblib.parallel_config('dask'): to scale the grid search out to the Dask cluster.

import joblib
from dask.distributed import Client, LocalCluster

# Change this to the IP address of your Dask scheduler
client = Client("10.0.0.3:8786")
with joblib.parallel_config('dask'):
    sk_grid_search.fit(X_train, y_train)

Use the score() method to check the model's accuracy:

sk_grid_search.score(X_test, y_test)
0.7775224665166276
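
For a classifier such as SGDClassifier, score() reports mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch with made-up labels (the values are illustrative only and are not taken from the flight data):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels: 1.0 means delayed, 0.0 means on time.
y_true = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
y_pred = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

print(accuracy(y_true, y_pred))  # 6 of 8 predictions match
```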

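Dask-ML API

The previous section covered hyperparameter tuning based on scikit-learn: the only change needed in the whole workflow is to add joblib.parallel_config('dask'), and the computation is dispatched to the Dask cluster.

Dask-ML also implements its own hyperparameter tuning APIs. Besides GridSearchCV and RandomizedSearchCV, which mirror their scikit-learn counterparts, it provides the successive halving and Hyperband algorithms through SuccessiveHalvingSearchCV and HyperbandSearchCV.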

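Example: Flight Delay Prediction (Dask-ML)

The following is a Hyperband hyperparameter tuning example based on Dask-ML.

Dask-ML's hyperparameter tuning algorithms require input that can be partitioned, such as a Dask DataFrame or Dask Array, rather than a pandas DataFrame, so the data preprocessing has to be rewritten with Dask.

Note that Dask-ML algorithms such as SuccessiveHalvingSearchCV and HyperbandSearchCV require the model to support both partial_fit() and score(). In scikit-learn, partial_fit() performs one iteration of an iterative algorithm such as gradient descent. Successive halving and Hyperband start by allocating a limited compute budget: instead of running every trial to completion, they run only a bounded number of iterations (a limited number of partial_fit() calls), evaluate performance (by calling score() on a validation set), and eliminate the trials that perform poorly.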
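
The elimination loop described above can be sketched in plain Python. This is a simplified illustration of successive halving, not Dask-ML's implementation: `ToyModel` is a hypothetical model whose partial_fit() and score() take no data (the real scikit-learn methods take X and y), and the reduction factor `eta=3` is an assumption chosen for the example.

```python
class ToyModel:
    """Hypothetical iterative model exposing partial_fit() and score().

    Its score rises toward a ceiling that depends on how close alpha is to 0.01.
    """

    def __init__(self, alpha):
        self.alpha = alpha
        self.n_calls = 0

    def partial_fit(self):
        # One training iteration; a real model would update its weights here.
        self.n_calls += 1
        return self

    def score(self):
        ceiling = 1.0 / (1.0 + abs(self.alpha - 0.01) * 10)
        return ceiling * (1 - 0.5 ** self.n_calls)


def successive_halving(alphas, n_initial_calls=1, eta=3):
    """Train every trial a little, keep the top 1/eta, repeat with more budget."""
    trials = [ToyModel(a) for a in alphas]
    calls_this_round = n_initial_calls
    while len(trials) > 1:
        for model in trials:
            for _ in range(calls_this_round):
                model.partial_fit()
        # Rank the trials by their current score and drop the weakest ones.
        trials.sort(key=lambda m: m.score(), reverse=True)
        trials = trials[: max(1, len(trials) // eta)]
        calls_this_round *= eta  # survivors get a larger budget next round
    return trials[0]


alphas = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 0.003, 0.03, 0.3]
best = successive_halving(alphas)
print(best.alpha, best.n_calls)
```

Most of the nine trials here stop after a single partial_fit() call, so the budget is concentrated on the few candidates that look promising early.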

import dask.dataframe as dd

input_cols = [
    "Year",
    "Month",
    "DayofMonth",
    "DayOfWeek",
    "CRSDepTime",
    "CRSArrTime",
    "UniqueCarrier",
    "FlightNum",
    "ActualElapsedTime",
    "Origin",
    "Dest",
    "Distance",
    "Diverted",
    "ArrDelay",
]

ddf = dd.read_csv(file_path, usecols=input_cols,)

# Binary target: whether the flight arrived more than 10 minutes late
ddf["ArrDelayBinary"] = 1.0 * (ddf["ArrDelay"] > 10)

ddf = ddf[ddf.columns.difference(["ArrDelay"])]
ddf = ddf.dropna()
ddf = ddf.repartition(npartitions=8)

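In addition, Dask handles categorical variables slightly differently from pandas/scikit-learn. We need to:

  • Convert the feature to the category type, for example with the categorize() method of Dask DataFrame or the Categorizer preprocessor in Dask-ML.

  • Apply one-hot encoding: DummyEncoder in Dask-ML one-hot encodes categorical features and is the Dask counterpart of scikit-learn's OneHotEncoder.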
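
The two steps above can be sketched in plain Python to show why the category step comes first: one-hot encoding needs the complete set of categories up front, and with partitioned data no single partition is guaranteed to contain all of them. The partitions and airport codes below are made up for illustration, the helper names are hypothetical, and this is not how Dask-ML implements DummyEncoder.

```python
# Two hypothetical partitions of an "Origin" column; neither one alone
# contains every category.
partitions = [["EWR", "JFK", "EWR"], ["LGA", "JFK"]]


def collect_categories(parts):
    """Step 1: scan all partitions to learn the complete, sorted category set."""
    return sorted({value for part in parts for value in part})


def one_hot(part, categories):
    """Step 2: encode each value as a 0/1 row with one column per category."""
    return [[1 if value == cat else 0 for cat in categories] for value in part]


categories = collect_categories(partitions)
encoded = [one_hot(part, categories) for part in partitions]

print(categories)
print(encoded[1])
```

Because every partition is encoded against the same category list, all partitions end up with the same columns in the same order.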

from dask_ml.preprocessing import DummyEncoder

dummy = DummyEncoder()
ddf = ddf.categorize(columns=["Dest", "Origin", "UniqueCarrier"])
dummified_ddf = dummy.fit_transform(ddf)

Then split the data into training and test sets with Dask-ML's train_test_split:

from dask_ml.model_selection import train_test_split as dsk_train_test_split

_y_label = "ArrDelayBinary"
X_train, X_test, y_train, y_test = dsk_train_test_split(
    dummified_ddf.loc[:, dummified_ddf.columns != _y_label], 
    dummified_ddf[_y_label], 
    test_size=0.25,
    shuffle=False,
)

The model and search space are defined much as in scikit-learn; we then call Dask-ML's HyperbandSearchCV to tune the hyperparameters.

from dask_ml.model_selection import HyperbandSearchCV

# client = Client(LocalCluster())
model = SGDClassifier(penalty='elasticnet', max_iter=1_000, warm_start=True, loss='log_loss')
params = {'alpha': np.logspace(-4, 1, num=30)}

dsk_hyperband = HyperbandSearchCV(model, params, max_iter=243)
dsk_hyperband.fit(X_train, y_train, classes=[0.0, 1.0])
/fs/fast/u20200002/envs/dispy/lib/python3.11/site-packages/sklearn/model_selection/_search.py:318: UserWarning: The total space of parameters 30 is smaller than n_iter=81. Running 30 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
/fs/fast/u20200002/envs/dispy/lib/python3.11/site-packages/sklearn/model_selection/_search.py:318: UserWarning: The total space of parameters 30 is smaller than n_iter=34. Running 30 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
HyperbandSearchCV(estimator=SGDClassifier(loss='log_loss', penalty='elasticnet',
                                          warm_start=True),
                  max_iter=243,
                  parameters={'alpha': array([1.00000000e-04, 1.48735211e-04, 2.21221629e-04, 3.29034456e-04,
       4.89390092e-04, 7.27895384e-04, 1.08263673e-03, 1.61026203e-03,
       2.39502662e-03, 3.56224789e-03, 5.29831691e-03, 7.88046282e-03,
       1.17210230e-02, 1.74332882e-02, 2.59294380e-02, 3.85662042e-02,
       5.73615251e-02, 8.53167852e-02, 1.26896100e-01, 1.88739182e-01,
       2.80721620e-01, 4.17531894e-01, 6.21016942e-01, 9.23670857e-01,
       1.37382380e+00, 2.04335972e+00, 3.03919538e+00, 4.52035366e+00,
       6.72335754e+00, 1.00000000e+01])})
dsk_hyperband.score(X_test, y_test)
0.8241373877422404

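This book also covers hyperparameter tuning with Ray. Compared with Dask, Ray offers better compatibility and more complete functionality for hyperparameter tuning; readers can choose the framework that best fits their needs.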