API
Algorithm library
Low-level foundations
Notes:
01. Low-level foundations:
Linear algebra library: Breeze
breeze.linalg
Supports dense vectors, sparse vectors, and scalar values
Supports matrices: RowMatrix, IndexedRowMatrix, CoordinateMatrix (Spark's distributed matrix types)
org.apache.spark.ml.linalg.Vectors
object Vectors {}
org.apache.spark.ml.linalg.Matrices
object Matrices {}
sealed trait Matrix extends Serializable {}
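A minimal sketch of constructing these local types with the standard spark.ml API (the values here are illustrative):
import org.apache.spark.ml.linalg.{Matrices, Vectors}
// Dense vector: every entry stored explicitly
val dv = Vectors.dense(1.0, 0.0, 3.0)
// Sparse vector: size, indices of the non-zero entries, and their values
val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Dense 3x2 matrix, values given in column-major order
val dm = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
// The distributed matrices (RowMatrix, IndexedRowMatrix, CoordinateMatrix) live in
// org.apache.spark.mllib.linalg.distributed and wrap RDDs of rows or entries.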
Spark MLlib version notes
Machine Learning Library (MLlib)
The MLlib DataFrame-based API (the spark.ml package) is the primary API.
The MLlib RDD-based API (the spark.mllib package) is now in maintenance mode.
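Concretely, this means preferring imports from the spark.ml package; both classes below are real, the choice of package is the point:
// Primary, DataFrame-based API
import org.apache.spark.ml.classification.LogisticRegression
// Maintenance-mode, RDD-based API (avoid for new code)
// import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS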
02. Spark ML Pipeline: design philosophy and basic concepts
A workflow typically consists of several stages:
1. Source data ETL
2. Data preprocessing
3. Feature selection
4. Model training and validation
MLlib provides standard APIs for combining multiple algorithms into a single pipeline, or workflow.
Pipeline-related concepts include DataFrame, Transformer, Estimator, and Parameter.
DataFrame: a DataFrame can hold ML Vector types as columns
Transformer: transforms one DataFrame into another DataFrame
it exposes a transform() method
Estimator: fits on a DataFrame to produce a Transformer
it exposes a fit() method, which accepts a DataFrame and produces a Model, which is a Transformer
(an ML model is a Transformer)
Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
Parameter: All Transformers and Estimators now share a common API for specifying parameters
Example:
LogisticRegression is an Estimator; calling fit() trains a LogisticRegressionModel, which is a Transformer.
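A short sketch of that shared parameter API, assuming a training DataFrame trainingDF; setter style and ParamMap style are interchangeable, and a ParamMap passed to fit() overrides earlier setter calls:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01) // setter style
val paramMap = ParamMap(lr.maxIter -> 20).put(lr.regParam, 0.1)    // ParamMap style
val model = lr.fit(trainingDF, paramMap)                           // ParamMap wins at fit() time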
Related classes:
Pipeline, PipelineModel
Predictor class (a subclass of Estimator)
/**
 * LogisticRegression model parameters:
 * setMaxIter: maximum number of iterations (default 100); training may converge and stop before reaching it
 * setTol: convergence tolerance (default 1e-6); each iteration computes an error that shrinks as training proceeds, and iteration stops once the error falls below the tolerance
 * setRegParam: regularization coefficient (default 0); regularization mainly guards against overfitting. With a small dataset and many features, overfitting is likely, so consider increasing this coefficient
 * setElasticNetParam: mixing ratio between the regularization norms (default 0). There are two kinds of regularization: L1 (Lasso), which sparsifies features, and L2 (Ridge), which guards against overfitting
 * setPredictionCol: sets the prediction column
 * setThreshold: sets the binary classification threshold
 */
// Randomly split the data into a training set and a test set (30% held out for testing)
val Array(trainingDF, testDF) = vecDF.randomSplit(Array(0.7, 0.3))
val lr = new LogisticRegression()
  .setMaxIter(12) // maximum number of iterations
  .setRegParam(0.3) // regularization coefficient; if the dataset is small with many features, increase it to curb overfitting
  .setFamily("auto") // binomial (binary) / multinomial (multiclass) / auto, default auto; auto infers binary vs. multiclass from the schema or the classes present in the samples, but it is best to set it explicitly
  .setFeaturesCol("feature_columns") // feature column
  .setLabelCol("label_column") // label column
  .setThreshold(0.5) // binary classification threshold
// Assemble the stages into a Pipeline
// (Java form: new Pipeline().setStages(new PipelineStage[] {labelIndexer, featureIndexer, rf, labelConverter}))
val lrPipeline = new Pipeline().setStages(Array(g_encoder, a_encoder, assembler, lr))
// Fit the pipeline to the training documents.
val lrModel = lrPipeline.fit(trainingDF)
// Now we can optionally save the fitted pipeline to disk
lrModel.write.overwrite().save("/mymodel")
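A hedged continuation of the example above: apply the fitted PipelineModel to the held-out testDF and reload the saved model (column names follow the example):
import org.apache.spark.ml.PipelineModel
val predictions = lrModel.transform(testDF)
predictions.select("label_column", "prediction", "probability").show(5)
val reloaded = PipelineModel.load("/mymodel") // same stages, ready for transform()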
Examples from the source:
//Scala
// val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, countVectorizer))
//Java
// Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {tokenizer, hashingTF, lr});
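Expanded into a runnable sketch (the docs' text-classification pipeline; a training DataFrame with "text" and "label" columns is assumed):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr2 = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr2))
// val model = pipeline.fit(training) // training: DataFrame("text", "label")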
Source:
abstract class Estimator[M <: Model[M]] extends PipelineStage
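For context, the matching Model signature from the same source:
abstract class Model[M <: Model[M]] extends Transformer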
Miscellaneous:
Dataset<Row> df = spark.createDataFrame(data, schema);
OneHotEncoderEstimator encoder = new OneHotEncoderEstimator()
.setInputCols(new String[] {"categoryIndex1", "categoryIndex2"})
.setOutputCols(new String[] {"categoryVec1", "categoryVec2"});
OneHotEncoderModel model = encoder.fit(df);
Dataset<Row> encoded = model.transform(df);
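Scala equivalent, for reference (note: OneHotEncoderEstimator was renamed to OneHotEncoder in Spark 3.0; the fit/transform pattern is the same):
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex1", "categoryIndex2"))
  .setOutputCols(Array("categoryVec1", "categoryVec2"))
val encModel = encoder.fit(df)
val encoded = encModel.transform(df)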
A learning model (a Transformer) might take a DataFrame, read the column containing feature vectors, predict a label for each vector, and output a new DataFrame with the predicted labels appended as a column.
A feature transformer might take a DataFrame, read a column, map it into a new column, and output a new DataFrame with the mapped column appended.
How to handle imbalanced data in machine learning: https://www.cnblogs.com/zhaokui/p/5101301.html
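One common remedy is class weighting, not necessarily the linked article's approach; a hedged sketch using LogisticRegression's setWeightCol (assumes a DataFrame df with a binary "label" column; names are illustrative):
import org.apache.spark.sql.functions.{col, when}
val posFraction = df.filter(col("label") === 1.0).count.toDouble / df.count
val weighted = df.withColumn("classWeight",
  when(col("label") === 1.0, 1.0 - posFraction).otherwise(posFraction))
val lrw = new LogisticRegression().setWeightCol("classWeight") // rare class gets the larger weight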
Getting started with Spark: converting between labels and indices with StringIndexer, IndexToString, and VectorIndexer
http://dblab.xmu.edu.cn/blog/1297-2/
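A sketch of the round-trip the linked tutorial covers, assuming a DataFrame df with a string "category" column; StringIndexer maps strings to indices, and IndexToString maps them back (it reads the label metadata StringIndexer attaches, so setLabels is optional here):
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)
val converter = new IndexToString().setInputCol("categoryIndex").setOutputCol("originalCategory")
val converted = converter.transform(indexed)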
ML Pipelines http://spark.apache.org/docs/latest/ml-pipeline.html