[Dr. Elephant Chinese Documentation 4] Developer Guide

on 2019-02-18 | by hyperxu

Setting up `Dr.Elephant`

Follow the quick setup instructions first.

Prerequisites

Play/Activator

Hadoop/Spark on Yarn

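To deploy Dr. Elephant locally for testing, you need Hadoop (version 2.x) or Spark (YARN mode, version > 1.4.0) installed, together with the ResourceManager and the Job History Server (a pseudo-distributed setup is enough). Instructions for running MapReduce jobs on YARN in pseudo-distributed mode are available in the Hadoop documentation.

If the environment variables are not set yet, export HADOOP_HOME and HADOOP_CONF_DIR:

$> export HADOOP_HOME=/path/to/hadoop/home
$> export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Add Hadoop's bin directory to the system PATH, because Dr. Elephant calls some of Hadoop's class libraries: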

$> export PATH=$HADOOP_HOME/bin:$PATH

Make sure the Job History Server is running, because Dr. Elephant depends on it:

$> $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

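Database

Dr. Elephant needs a database to store job information and analysis results.

Set up and start a MySQL instance locally. The latest version of MySQL is available at https://www.mysql.com/downloads/. `Dr.Elephant` supports MySQL 5.5 and later; if you run into problems, you can discuss them with Alex (wget.null@gmail.com) in the Google group. Create a database named `drelephant`.

$> mysql -u root -p
mysql> create database drelephant;

The database URL, database name, user name, and password can be set in Dr. Elephant's configuration file app-conf/elephant.conf.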

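Using another database
Dr. Elephant currently supports MySQL by default, and the DDL statements in the evolution files are written for it. To use a different database, follow the Play Framework documentation on database configuration.

Testing `Dr.Elephant`

You can run the tests by invoking the compile script, which runs all of the unit tests.

Project structure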


app                             → Contains all the source files
 └ com.linkedin.drelephant      → Application Daemons
 └ org.apache.spark             → Spark Support
 └ controllers                  → Controller logic
 └ models                       → Includes models that Map to DB
 └ views                        → Page templates
app-conf                        → Application Configurations
 └ elephant.conf                → Port, DB, Keytab and other JVM Configurations (Overrides application.conf)
 └ FetcherConf.xml              → Fetcher Configurations
 └ HeuristicConf.xml            → Heuristic Configurations
 └ JobTypeConf.xml              → JobType Configurations
conf                            → Configurations files
 └ evolutions                   → DB Schema
 └ application.conf             → Main configuration file
 └ log4j.properties             → log configuration file
 └ routes                       → Routes definition
images
 └ wiki                         → Contains the images used in the wiki documentation
public                          → Public assets
 └ assets                       → Library files
 └ css                          → CSS files
 └ images                       → Image files
 └ js                           → Javascript files
scripts
 └ start.sh                     → Starts Dr. Elephant
 └ stop.sh                      → Stops Dr. Elephant
test                            → Source folder for unit tests
compile.sh                      → Compiles the application

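Heuristics

Dr. Elephant ships with a set of heuristics for MapReduce and Spark. For details on each of them, see the heuristics guide. The heuristics are pluggable modules and are easy to configure.

Adding a new heuristic

You can add your own heuristics to Dr. Elephant:

1. Create the new heuristic and test it.
2. Create a new view page for the heuristic, for example helpMapperSpill.scala.html.
3. Add the heuristic's details to the HeuristicConf.xml file. Each entry should contain the following:
   applicationtype: the application type, either mapreduce or spark
   heuristicname: the name of the heuristic
   classname: the fully qualified class name
   viewname: the fully qualified name of the view page
   hadoopversions: the Hadoop versions the heuristic is compatible with
4. Run Dr. Elephant. It should now include the new heuristic.

Example HeuristicConf.xml entry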

<heuristic>
  <applicationtype>mapreduce</applicationtype>
  <heuristicname>Mapper GC</heuristicname>
  <classname>com.linkedin.drelephant.mapreduce.heuristics.MapperGCHeuristic</classname>
  <viewname>views.html.help.mapreduce.helpGC</viewname>
</heuristic>
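To make the role of each field concrete, here is a minimal sketch that reads one heuristic entry with the JDK's DOM parser and prints its fields. `HeuristicEntryReader` and its methods are hypothetical names chosen for illustration; this is not how Dr. Elephant itself loads the file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative only: HeuristicEntryReader is a hypothetical helper, not a Dr. Elephant class.
class HeuristicEntryReader {

    // Parses an XML string and returns its root element (expected to be <heuristic>).
    static Element parseHeuristic(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse heuristic entry", e);
        }
    }

    // Returns the trimmed text of the first element with the given tag, or null if it is absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<heuristic>"
                + "<applicationtype>mapreduce</applicationtype>"
                + "<heuristicname>Mapper GC</heuristicname>"
                + "<classname>com.linkedin.drelephant.mapreduce.heuristics.MapperGCHeuristic</classname>"
                + "<viewname>views.html.help.mapreduce.helpGC</viewname>"
                + "</heuristic>";
        Element heuristic = parseHeuristic(xml);
        for (String tag : new String[] {"applicationtype", "heuristicname", "classname", "viewname"}) {
            System.out.println(tag + " = " + childText(heuristic, tag));
        }
    }
}
```

A check like this can be handy before starting Dr. Elephant, since a missing `classname` or `viewname` shows up as `null` right away.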

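Configuring heuristics

To override the severity thresholds that a heuristic uses, specify their values in the HeuristicConf.xml file, as in the following example.
Configuring severity thresholds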


<heuristic>
  <applicationtype>mapreduce</applicationtype>
  <heuristicname>Mapper Data Skew</heuristicname>
  <classname>com.linkedin.drelephant.mapreduce.heuristics.MapperDataSkewHeuristic</classname>
  <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
  <params>
    <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
    <deviation_severity>2, 4, 8, 16</deviation_severity>
    <files_severity>1/8, 1/4, 1/2, 1</files_severity>
  </params>
</heuristic>
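The sketch below shows one way such a comma-separated threshold list could be read and applied. It assumes the four values are ascending cut-offs for four increasing severity levels, and that a value's level is the number of cut-offs it has reached. `ThresholdSketch` is a hypothetical class, and the actual parsing and comparison rules inside Dr. Elephant may differ per heuristic.

```java
import java.util.Arrays;

// Illustrative only: ThresholdSketch is hypothetical and not part of Dr. Elephant.
class ThresholdSketch {

    // Parses a list such as "10, 50, 100, 200" or "1/8, 1/4, 1/2, 1".
    static double[] parseThresholds(String raw) {
        String[] parts = raw.split(",");
        double[] thresholds = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            thresholds[i] = parseNumber(parts[i].trim());
        }
        return thresholds;
    }

    // Accepts plain numbers ("16") and simple fractions ("1/8").
    static double parseNumber(String text) {
        int slash = text.indexOf('/');
        if (slash < 0) {
            return Double.parseDouble(text);
        }
        double numerator = Double.parseDouble(text.substring(0, slash).trim());
        double denominator = Double.parseDouble(text.substring(slash + 1).trim());
        return numerator / denominator;
    }

    // Assumption: thresholds are ascending. Returns how many of them the value has reached,
    // so 0 means no threshold was reached and thresholds.length means the highest level.
    static int level(double value, double[] ascendingThresholds) {
        int level = 0;
        for (double threshold : ascendingThresholds) {
            if (value >= threshold) {
                level++;
            }
        }
        return level;
    }

    public static void main(String[] args) {
        double[] deviation = parseThresholds("2, 4, 8, 16");
        System.out.println(Arrays.toString(deviation));
        System.out.println("level for deviation 5.0: " + level(5.0, deviation));
    }
}
```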

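Schedulers

Dr. Elephant currently supports three workflow schedulers: Azkaban, Airflow, and Oozie. All three are enabled by default. Apart from Airflow and Oozie, which need some configuration, they work out of the box.

Scheduler configuration

The schedulers and all of their parameters are configured in the SchedulerConf.xml file in the app-conf directory.
The sample SchedulerConf.xml file below shows the available schedulers and their properties.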


<!-- Scheduler configurations -->
<schedulers>
    <scheduler>
        <name>azkaban</name>
        <classname>com.linkedin.drelephant.schedulers.AzkabanScheduler</classname>
    </scheduler>
    <scheduler>
        <name>airflow</name>
        <classname>com.linkedin.drelephant.schedulers.AirflowScheduler</classname>
        <params>
            <airflowbaseurl>http://localhost:8000</airflowbaseurl>
        </params>
    </scheduler>
    <scheduler>
        <name>oozie</name>
        <classname>com.linkedin.drelephant.schedulers.OozieScheduler</classname>
        <params>
            <!-- URL of oozie host -->
            <oozie_api_url>http://localhost:11000/oozie</oozie_api_url>
            <!-- ### Non mandatory properties ###
            ### choose authentication method
            <oozie_auth_option>KERBEROS/SIMPLE</oozie_auth_option>
            ### override oozie console url with a template (only parameter will be the id)
            <oozie_job_url_template></oozie_job_url_template>
            <oozie_job_exec_url_template></oozie_job_exec_url_template>
            ### (if scheduled jobs are expected make sure to add following templates since oozie doesn't provide their URLS on server v4.1.0)
            <oozie_workflow_url_template>http://localhost:11000/oozie/?job=%s</oozie_workflow_url_template>
            <oozie_workflow_exec_url_template>http://localhost:11000/oozie/?job=%s</oozie_workflow_exec_url_template>
            ### Use true if you can assure all app names are unique.
            ### When true dr-elephant will unit all coordinator runs (in case of coordinator killed and then run again)
            <oozie_app_name_uniqueness>false</oozie_app_name_uniqueness>
            -->
        </params>
    </scheduler>
</schedulers>
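The Oozie URL templates above take the ID as their only parameter, written as `%s`. As a rough illustration of what such a template expands to, the sketch below fills it with `String.format`. Whether Dr. Elephant uses `String.format` internally is an assumption, and both the class name and the ID value are made up.

```java
// Illustrative only: OozieUrlTemplateSketch is a hypothetical class.
class OozieUrlTemplateSketch {

    // Substitutes the ID into a template whose single placeholder is %s.
    static String expand(String template, String id) {
        return String.format(template, id);
    }

    public static void main(String[] args) {
        // Template copied from the sample configuration; the ID is a made-up placeholder.
        String template = "http://localhost:11000/oozie/?job=%s";
        System.out.println(expand(template, "example-workflow-id"));
    }
}
```

Because the template is treated as a format string, any literal percent sign in a custom template would have to be written as `%%` under this assumption.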

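Contributing a new scheduler

To take advantage of all of Dr. Elephant's features, a scheduler must provide the following four IDs:

Job definition ID: a unique ID for the job as defined in the flow. Filtering on this ID returns the full history of the job.
Job execution ID: a unique ID for one execution of the job.
Flow definition ID: a unique ID for the whole flow, independent of any execution.
Flow execution ID: a unique ID for one execution of the flow.

Dr. Elephant expects these IDs to be available in order to integrate with any scheduler. Without them, it cannot offer the level of integration it provides for Azkaban. For example, without the job definition ID, Dr. Elephant cannot capture a job's history. Likewise, without the flow definition ID, it cannot capture a flow's history. Without any of the above, Dr. Elephant can only show a job's performance data at the execution level (the MapReduce job level).

Besides the four IDs, Dr. Elephant accepts an optional job name and four optional links that let users jump from Dr. Elephant to the corresponding job in the scheduler.
Note that these do not affect Dr. Elephant's functionality.

Flow Definition Url
Flow Execution Url
Job Definition Url
Job Execution Url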
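The requirements above can be sketched as a small interface: four mandatory IDs, plus an optional name and optional links that default to "not provided". All type and method names here are hypothetical and only illustrate the shape of the data; they are not Dr. Elephant's actual scheduler API.

```java
// Hypothetical names throughout: a sketch of the data a scheduler integration supplies.
interface SchedulerIds {
    // Mandatory IDs.
    String jobDefinitionId();
    String jobExecutionId();
    String flowDefinitionId();
    String flowExecutionId();

    // Optional extras; null means "not provided".
    default String jobName() { return null; }
    default String flowDefinitionUrl() { return null; }
    default String flowExecutionUrl() { return null; }
    default String jobDefinitionUrl() { return null; }
    default String jobExecutionUrl() { return null; }
}

// Example implementation that derives the IDs from a flow name, a job name, and an execution number.
class ExampleSchedulerIds implements SchedulerIds {
    private final String flow;
    private final String job;
    private final int execution;

    ExampleSchedulerIds(String flow, String job, int execution) {
        this.flow = flow;
        this.job = job;
        this.execution = execution;
    }

    // Definition IDs do not contain the execution number, so they stay the same across runs.
    @Override public String jobDefinitionId() { return flow + "/" + job; }
    @Override public String flowDefinitionId() { return flow; }

    // Execution IDs include the execution number, so each run gets its own ID.
    @Override public String jobExecutionId() { return flow + "/" + job + "#" + execution; }
    @Override public String flowExecutionId() { return flow + "#" + execution; }

    @Override public String jobName() { return job; }

    public static void main(String[] args) {
        SchedulerIds first = new ExampleSchedulerIds("daily-report", "aggregate", 41);
        SchedulerIds second = new ExampleSchedulerIds("daily-report", "aggregate", 42);
        System.out.println(first.jobDefinitionId().equals(second.jobDefinitionId()));
        System.out.println(first.jobExecutionId().equals(second.jobExecutionId()));
    }
}
```

The point of the split is that history queries group runs by the definition IDs, while the execution IDs tell individual runs apart.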

Score calculation

In Dr. Elephant, analyzing a completed job with the heuristics produces a score. The calculation is simple: multiply the severity value by the number of tasks.

int score = 0;
if (severity != Severity.NONE && severity != Severity.LOW) {
    score = severity.getValue() * tasks;
}
return score;

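We define the following score types:

Task score: the sum of the severity values over all tasks
Job score: the sum of all task scores in the job
Flow score: the sum of all job scores in the flow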
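As a worked example of the rule above, the sketch below wraps the score fragment in a method and adds up several scores. The `Severity` enum is a stand-in: the source only shows `Severity.NONE`, `Severity.LOW`, and `getValue()`, so the other level names and the numeric values 0 to 4 are assumptions, as is the class name `ScoreSketch`.

```java
// Illustrative only: ScoreSketch and its Severity enum are stand-ins for this example.
class ScoreSketch {

    // Assumed levels and values; only NONE, LOW, and getValue() appear in the source.
    enum Severity {
        NONE(0), LOW(1), MODERATE(2), SEVERE(3), CRITICAL(4);

        private final int value;

        Severity(int value) {
            this.value = value;
        }

        int getValue() {
            return value;
        }
    }

    // Same logic as the fragment above: NONE and LOW contribute nothing.
    static int score(Severity severity, int tasks) {
        int score = 0;
        if (severity != Severity.NONE && severity != Severity.LOW) {
            score = severity.getValue() * tasks;
        }
        return score;
    }

    // Adds up lower-level scores to get the score one level up.
    static int sum(int... scores) {
        int total = 0;
        for (int s : scores) {
            total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        int skew = score(Severity.SEVERE, 10);   // 3 * 10
        int gc = score(Severity.LOW, 100);       // LOW is ignored
        int spill = score(Severity.MODERATE, 4); // 2 * 4
        System.out.println("combined score: " + sum(skew, gc, spill));
    }
}
```

Ignoring NONE and LOW keeps large jobs with only minor findings from outscoring small jobs with serious ones.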

Source:

Author: hyperxu
Link: http://www.hyperxu.com/2019/02/18/dr-elephant-4/
