[Dr.Elephant Documentation 6] Metrics and Heuristics

Metrics

Resource Usage

Resource usage is the amount of resources your job consumes, measured in GB-hours.

Calculation

We define a task's resource usage as the product of its container size and its runtime. A job's resource usage is then the sum of the resource usage of all its mapper and reducer tasks.

Example

```
Consider a job with:
    4 mappers with runtime {12, 15, 20, 30} mins
    4 reducers with runtime {10, 12, 15, 18} mins
    Container size of 4 GB
Then,
    Resource used by all mappers: 4 * ((12 + 15 + 20 + 30) / 60) GB Hours = 5.133 GB Hours
    Resource used by all reducers: 4 * ((10 + 12 + 15 + 18) / 60) GB Hours = 3.666 GB Hours
    Total resource used by the job = 5.133 + 3.666 = 8.799 GB Hours
```
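The arithmetic above is easy to reproduce. Here is a minimal Python sketch of the GB-hours formula (the function and variable names are illustrative, not Dr. Elephant's API):

```python
# GB-hours sketch: container size (GB) times total task runtime (hours).
def resource_usage_gb_hours(runtimes_mins, container_size_gb):
    return container_size_gb * sum(runtimes_mins) / 60.0

mapper_usage = resource_usage_gb_hours([12, 15, 20, 30], 4)   # 5.133 GB Hours
reducer_usage = resource_usage_gb_hours([10, 12, 15, 18], 4)  # 3.667 GB Hours
print(f"total = {mapper_usage + reducer_usage:.3f} GB Hours") # total = 8.800
```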

Wasted Resources

This metric shows the amount of resources the job wasted, in GB-hours or as a percentage of the resources it used.

Calculation

```
To calculate the resources wasted, we calculate the following:

    The minimum memory wasted by the tasks (Map and Reduce)
    The runtime of the tasks (Map and Reduce)

The minimum memory wasted by a task is equal to the difference between the container size and the maximum task memory (peak memory) among all tasks. The resources wasted by a task is then the minimum memory wasted by the task multiplied by the duration of the task. The total resources wasted by the job is the sum of the wasted resources of all the tasks.

Let us define the following for each task:

    peak_memory_used := The upper bound on the memory used by the task.
    runtime := The run time of the task.

The peak_memory_used for any task is calculated by finding the maximum of the physical memory (max_physical_memory) used by all the tasks and the virtual memory (virtual_memory) used by the task. Since peak_memory_used for each task is upper bounded by max_physical_memory, we can say for each task:

    peak_memory_used = Max(max_physical_memory, virtual_memory / 2.1)

where 2.1 is the cluster memory factor. The minimum memory wasted by each task can then be calculated as:

    wasted_memory = container_size - peak_memory_used

The minimum resource wasted by each task can then be calculated as:

    wasted_resource = wasted_memory * runtime
```
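As a rough sketch of these formulas in Python, assuming each task is described by its peak physical memory, virtual memory, and runtime (all names and the sample numbers are illustrative):

```python
# Minimum resource wasted by a set of tasks, per the formulas above.
# Each task: (max_physical_memory_mb, virtual_memory_mb, runtime_hours).
CLUSTER_MEMORY_FACTOR = 2.1

def wasted_resource_mb_hours(tasks, container_size_mb):
    total = 0.0
    for max_physical_mb, virtual_mb, runtime_hours in tasks:
        peak_memory_used = max(max_physical_mb, virtual_mb / CLUSTER_MEMORY_FACTOR)
        wasted_memory = container_size_mb - peak_memory_used
        total += wasted_memory * runtime_hours
    return total

tasks = [(1500, 3000, 0.5), (1200, 2800, 0.4)]  # hypothetical task stats
print(wasted_resource_mb_hours(tasks, container_size_mb=2048))  # ~559.9 MB-hours
```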

Runtime

The runtime metric shows the total time the job took to run.

Calculation

A job's runtime is the difference between the time the job was submitted to the resource manager and the time the job finished.

Example

If a job was submitted at 1461837302868 ms and finished at 1461840952182 ms, its runtime is 1461840952182 - 1461837302868 = 3649314 ms, i.e. about 1.01 hours.

Wait Time

Wait time is the time a job spends waiting before it runs.

Calculation

```
For each task, let us define the following:

    ideal_start_time := The ideal time when all the tasks should have started
    finish_time := The time when the task finished
    task_runtime := The runtime of the task

- Map tasks

  For map tasks, we have:

      ideal_start_time := The job submission time

  We find the mapper task with the longest runtime (task_runtime_max) and the task which finished last (finish_time_last). The total wait time of the job due to mapper tasks is:

      mapper_wait_time = finish_time_last - (ideal_start_time + task_runtime_max)

- Reduce tasks

  For reduce tasks, we have:

      ideal_start_time := Computed from the reducer slow start percentage (mapreduce.job.reduce.slowstart.completedmaps): the finish time of the map task after which the first reducer should have started

  We find the reducer task with the longest runtime (task_runtime_max) and the task which finished last (finish_time_last). The total wait time of the job due to reducer tasks is:

      reducer_wait_time = finish_time_last - (ideal_start_time + task_runtime_max)
```
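A sketch of the mapper half of this calculation in Python, with each mapper given as a (start, finish) pair in milliseconds (hypothetical inputs; the reducer case differs only in how ideal_start_time is derived, which is omitted here):

```python
# Wait time attributable to mappers, per the formula above.
def mapper_wait_time(job_submission_time_ms, mappers):
    ideal_start_time = job_submission_time_ms
    task_runtime_max = max(finish - start for start, finish in mappers)
    finish_time_last = max(finish for _, finish in mappers)
    return finish_time_last - (ideal_start_time + task_runtime_max)

mappers = [(1000, 5000), (1500, 9500), (2000, 12000)]  # hypothetical tasks
print(mapper_wait_time(0, mappers))  # 12000 - (0 + 10000) = 2000 ms
```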

Heuristics

Map-Reduce

Mapper Data Skew

The Mapper Data Skew heuristic shows whether the job suffers from data skew. It splits all the mappers into two groups, where the first group's average is smaller than the second's.
For example, the first group might contain 900 mappers averaging 7 MB of input each, while the second contains 1200 mappers averaging 500 MB of input each.

Calculation

The heuristic's severity is evaluated by recursively splitting the tasks into two groups based on the average memory they consume. The deviation is the difference between the two groups' average memory consumption divided by the smaller of the two averages.

```
Let us define the following variables:

    deviation: the deviation in input bytes between two groups
    num_of_tasks: the number of map tasks
    file_size: the average input size of the larger group
    num_tasks_severity: List of severity thresholds for the number of tasks. e.g. num_tasks_severity = {10, 20, 50, 100}
    deviation_severity: List of severity threshold values for the deviation of input bytes between two groups. e.g. deviation_severity = {2, 4, 8, 16}
    files_severity: The severity threshold values for the fraction of HDFS block size. e.g. files_severity = { ⅛, ¼, ½, 1}

Let us define the following functions:

    func avg(x): returns the average of a list x
    func len(x): returns the length of a list x
    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

We compute two groups recursively based on the average memory consumed by them. Call the two groups group_1 and group_2. Without loss of generality, assume that avg(group_1) > avg(group_2) and len(group_1) < len(group_2). Then:

    deviation = (avg(group_1) - avg(group_2)) / min(avg(group_1), avg(group_2))
    file_size = avg(group_1)
    num_of_tasks = len(group_1) + len(group_2)

The overall severity of the heuristic can be computed as:

    severity = min(getSeverity(deviation, deviation_severity),
                   getSeverity(file_size, files_severity),
                   getSeverity(num_of_tasks, num_tasks_severity))
```

Configuration

The threshold parameters deviation_severity, num_tasks_severity and files_severity are easily configurable. To learn more about configuring these parameters, see here.

Mapper GC

The Mapper GC heuristic analyzes the tasks' GC efficiency. It computes the percentage of total CPU time spent in garbage collection.

Calculation

The heuristic computes the Mapper GC severity as follows. First, it computes the average CPU time, average runtime, and average garbage-collection time over all tasks. The severity is then the minimum of the severity of the average runtime and the severity of the ratio of average GC time to average CPU time.

```
Let us define the following variables:

    avg_gc_time: average time spent garbage collecting
    avg_cpu_time: average cpu time of all the tasks
    avg_runtime: average runtime of all the tasks
    gc_cpu_ratio: avg_gc_time / avg_cpu_time
    gc_ratio_severity: List of severity threshold values for the ratio of avg_gc_time to avg_cpu_time.
    runtime_severity: List of severity threshold values for the avg_runtime.

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as:

    severity = min(getSeverity(avg_runtime, runtime_severity),
                   getSeverity(gc_cpu_ratio, gc_ratio_severity))
```
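In Python, the combination step might look like the sketch below (get_severity is the same assumed ascending-threshold lookup as in the skew example; the actual threshold values live in Dr. Elephant's heuristic configuration):

```python
def get_severity(value, thresholds):
    # Number of ascending thresholds reached: 0 = NONE .. 4 = CRITICAL (assumed scale).
    return sum(1 for t in thresholds if value >= t)

def gc_severity(avg_gc_time, avg_cpu_time, avg_runtime,
                gc_ratio_severity, runtime_severity):
    gc_cpu_ratio = avg_gc_time / avg_cpu_time
    # min(): only tasks that run long AND spend much CPU time in GC are flagged.
    return min(get_severity(avg_runtime, runtime_severity),
               get_severity(gc_cpu_ratio, gc_ratio_severity))
```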

Configuration

The threshold parameters gc_ratio_severity and runtime_severity are easily configurable. To learn more about configuring these parameters, see here.

Mapper Memory Usage

This heuristic checks the mappers' memory usage. It compares each task's consumed memory with the memory requested for its container. The consumed memory is the average of the peak physical memory snapshots of all tasks. The container's requested memory is the job's mapreduce.map/reduce.memory.mb setting, i.e. the maximum physical memory the job may request.

Calculation

```
Let us define the following variables:

    avg_physical_memory: Average of the physical memories of all tasks.
    container_memory: Container memory
    container_memory_severity: List of threshold values for the average container memory of the tasks.
    memory_ratio_severity: List of threshold values for the ratio of avg_physical_memory to container_memory

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity can then be computed as:

    severity = min(getSeverity(avg_physical_memory / container_memory, memory_ratio_severity),
                   getSeverity(container_memory, container_memory_severity))
```

Configuration

The threshold parameters container_memory_severity and memory_ratio_severity are easily configurable. To learn more about configuring these parameters, see here.

Mapper Speed

This heuristic analyzes how efficiently the mapper code runs. It can show whether the mappers are CPU-bound or processing too much data, by relating each mapper's speed to the amount of data it processes.

Calculation

The heuristic's severity is the smaller of the severity of the mappers' speed and the severity of the mappers' runtime.

```
Let us define the following variables:

    median_speed: median of the speeds of all the mappers. A mapper's speed is the ratio of its input bytes to its runtime.
    median_size: median of the sizes of all the mappers
    median_runtime: median of the runtimes of all the mappers.
    disk_speed_severity: List of threshold values for the median_speed.
    runtime_severity: List of severity threshold values for median_runtime.

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as:

    severity = min(getSeverity(median_speed, disk_speed_severity),
                   getSeverity(median_runtime, runtime_severity))
```

Configuration

The threshold parameters disk_speed_severity and runtime_severity are easily configurable. To learn more about configuring these parameters, see here.

Mapper Spill

This heuristic gauges mapper performance by analyzing disk I/O. The mapper spill ratio (records spilled / total output records) is a key indicator of mapper performance: if the ratio is close to 2, almost every record was spilled and written to disk twice (once when the in-memory sort buffer overflows, and once when merging all the spilled chunks). When that happens, the mappers are processing too much data.

Calculation

```
Let us define the following parameters:

    total_spills: The sum of spills from all the map tasks.
    total_output_records: The sum of output records from all the map tasks.
    num_tasks: Total number of tasks.
    ratio_spills: total_spills / total_output_records
    spill_severity: List of the threshold values for ratio_spills
    num_tasks_severity: List of threshold values for the total number of tasks.

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as:

    severity = min(getSeverity(ratio_spills, spill_severity),
                   getSeverity(num_tasks, num_tasks_severity))
```

Configuration

The thresholds spill_severity and num_tasks_severity are easily configurable. To learn more about configuring these parameters, see here.

Mapper Time

This heuristic analyzes whether the number of mappers is appropriate, which helps tune the mapper count for the job. The mapper count needs tuning in either of the following situations:

- Mapper runtimes are very short. This usually means the job has: too many mappers; a very short average mapper runtime; files that are too small.
- Large or unsplittable files. This usually means the job has: too few mappers; a very long average mapper runtime; files that are too large (individual files reaching GB scale).

Calculation

```
Let us define the following variables:

    avg_size: average size of input data for all the mappers
    avg_time: average runtime of all the tasks.
    num_tasks: total number of tasks.
    short_runtime_severity: The list of threshold values for tasks with short runtime
    long_runtime_severity: The list of threshold values for tasks with long runtime.
    num_tasks_severity: The list of threshold values for the number of tasks.

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func max(x,y): returns the maximum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as:

    short_task_severity = min(getSeverity(avg_time, short_runtime_severity),
                              getSeverity(num_tasks, num_tasks_severity))
    severity = max(getSeverity(avg_size, long_runtime_severity), short_task_severity)
```
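A sketch of how the two signals combine, with the caveat that for a "short runtime is bad" list the threshold comparison plausibly runs in the opposite direction, so the helper below takes an ascending flag (an assumption about how the threshold lists are ordered):

```python
def get_severity(value, thresholds, ascending=True):
    # Number of thresholds reached; descending lists flip the comparison
    # (e.g. short runtimes, where a smaller value is worse).
    if ascending:
        return sum(1 for t in thresholds if value >= t)
    return sum(1 for t in thresholds if value <= t)

def mapper_time_severity(avg_size, avg_time, num_tasks,
                         short_runtime_severity, long_runtime_severity,
                         num_tasks_severity):
    # Short tasks are a problem only if there are also many of them.
    short_task_severity = min(
        get_severity(avg_time, short_runtime_severity, ascending=False),
        get_severity(num_tasks, num_tasks_severity))
    # As in the pseudocode, avg_size is compared against the long-runtime list.
    return max(get_severity(avg_size, long_runtime_severity), short_task_severity)
```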

Configuration

The thresholds short_runtime_severity, long_runtime_severity and num_tasks_severity are easily configurable. To learn more about configuring these parameters, see here.

Reducer Data Skew

This heuristic analyzes whether the data is skewed across the reducers. It splits the reducers into two groups and checks whether one group's input size is significantly larger than the other's.

Calculation

As with the mapper heuristic, the severity is evaluated by recursively splitting the tasks into two groups based on the average memory each group consumes. The deviation is the difference between the two groups' average memory consumption divided by the smaller of the two averages.

```
Let us define the following variables:

    deviation: deviation in input bytes between two groups
    num_of_tasks: number of reduce tasks
    file_size: average input size of the larger group
    num_tasks_severity: List of severity threshold values for the number of tasks. e.g. num_tasks_severity = {10, 20, 50, 100}
    deviation_severity: List of severity threshold values for the deviation of input bytes between two groups. e.g. deviation_severity = {2, 4, 8, 16}
    files_severity: The severity threshold values for the fraction of HDFS block size. e.g. files_severity = { ⅛, ¼, ½, 1}

Let us define the following functions:

    func avg(x): returns the average of a list x
    func len(x): returns the length of a list x
    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

We compute two groups recursively based on the average memory consumed by them. Call the two groups group_1 and group_2. Without loss of generality, assume that avg(group_1) > avg(group_2) and len(group_1) < len(group_2). Then:

    deviation = (avg(group_1) - avg(group_2)) / min(avg(group_1), avg(group_2))
    file_size = avg(group_1)
    num_of_tasks = len(group_1) + len(group_2)

The overall severity of the heuristic can be computed as:

    severity = min(getSeverity(deviation, deviation_severity),
                   getSeverity(file_size, files_severity),
                   getSeverity(num_of_tasks, num_tasks_severity))
```

Configuration

The thresholds deviation_severity, num_tasks_severity and files_severity are easily configurable. To learn more about configuring these parameters, see here.

Reducer GC

This heuristic analyzes the tasks' GC efficiency by computing the percentage of CPU time spent in garbage collection.

Calculation

First, the average CPU time, average runtime, and average garbage-collection time are computed over all tasks. The severity is then the minimum of the severity of the average runtime and the severity of the ratio of GC time to average CPU time.

```
Let us define the following variables:

    avg_gc_time: average time spent garbage collecting
    avg_cpu_time: average cpu time of all the tasks
    avg_runtime: average runtime of all the tasks
    gc_cpu_ratio: avg_gc_time / avg_cpu_time
    gc_ratio_severity: List of severity threshold values for the ratio of avg_gc_time to avg_cpu_time.
    runtime_severity: List of severity threshold values for the avg_runtime.

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as:

    severity = min(getSeverity(avg_runtime, runtime_severity),
                   getSeverity(gc_cpu_ratio, gc_ratio_severity))
```

Configuration

The thresholds gc_ratio_severity and runtime_severity are easily configurable. To learn more about configuring these parameters, see here.

Reducer Memory Usage

This heuristic shows the tasks' memory utilization. It compares the memory the job consumed with the memory requested for its containers. The consumed memory is the average of each task's peak memory. The requested memory is the task's mapreduce.map/reduce.memory.mb setting, i.e. the maximum physical memory the task may use.

Calculation

```
Let us define the following variables:

    avg_physical_memory: Average of the physical memories of all tasks.
    container_memory: Container memory
    container_memory_severity: List of threshold values for the average container memory of the tasks.
    memory_ratio_severity: List of threshold values for the ratio of avg_physical_memory to container_memory

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity can then be computed as:

    severity = min(getSeverity(avg_physical_memory / container_memory, memory_ratio_severity),
                   getSeverity(container_memory, container_memory_severity))
```

Configuration

The thresholds container_memory_severity and memory_ratio_severity are easily configurable. To learn more about configuring these parameters, see here.

Reducer Time

This heuristic analyzes the reducers' execution efficiency and helps tune the number of reducers in a job. The reducer count needs tuning when either of the following occurs:

- Too many reducers. The Hadoop job typically shows: an excessive reducer count; very short reducer runtimes.
- Too few reducers. The Hadoop job typically shows: too few reducers; very long reducer runtimes.

Calculation

```
Let us define the following variables:

    avg_size: average size of input data for all the mappers
    avg_time: average runtime of all the tasks.
    num_tasks: total number of tasks.
    short_runtime_severity: The list of threshold values for tasks with short runtime
    long_runtime_severity: The list of threshold values for tasks with long runtime.
    num_tasks_severity: The list of threshold values for the number of tasks.

Let us define the following functions:

    func min(x,y): returns the minimum of x and y
    func max(x,y): returns the maximum of x and y
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity of the heuristic can then be computed as:

    short_task_severity = min(getSeverity(avg_time, short_runtime_severity),
                              getSeverity(num_tasks, num_tasks_severity))
    severity = max(getSeverity(avg_size, long_runtime_severity), short_task_severity)
```

Configuration

The threshold parameters short_runtime_severity, long_runtime_severity and num_tasks_severity are easily configurable. To learn more about configuring these parameters, see here.

Shuffle & Sort

This heuristic analyzes the total time the reducers spend against the time they spend shuffling and sorting, which indicates how efficiently the reducers execute.

Calculation

```
Let us define the following variables:

    avg_exec_time: average time spent in execution by all the tasks.
    avg_shuffle_time: average time spent in shuffling.
    avg_sort_time: average time spent in sorting.
    runtime_ratio_severity: List of threshold values for the ratio of twice the average shuffle or sort time to the average execution time.
    runtime_severity: List of threshold values for the runtime of the shuffle or sort stages.

The overall severity can then be found as:

    severity = max(shuffle_severity, sort_severity)

where shuffle_severity and sort_severity are:

    shuffle_severity = min(getSeverity(avg_shuffle_time, runtime_severity),
                           getSeverity(avg_shuffle_time * 2 / avg_exec_time, runtime_ratio_severity))
    sort_severity = min(getSeverity(avg_sort_time, runtime_severity),
                        getSeverity(avg_sort_time * 2 / avg_exec_time, runtime_ratio_severity))
```
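A Python sketch of this max-of-mins structure (same assumed threshold helper as in the earlier examples):

```python
def get_severity(value, thresholds):
    # Number of ascending thresholds reached: 0 = NONE .. 4 = CRITICAL (assumed scale).
    return sum(1 for t in thresholds if value >= t)

def shuffle_sort_severity(avg_exec_time, avg_shuffle_time, avg_sort_time,
                          runtime_severity, runtime_ratio_severity):
    def stage_severity(stage_time):
        # A stage is flagged only if it is long both in absolute terms and
        # relative to the execution time (hence the min).
        return min(get_severity(stage_time, runtime_severity),
                   get_severity(stage_time * 2 / avg_exec_time,
                                runtime_ratio_severity))
    return max(stage_severity(avg_shuffle_time), stage_severity(avg_sort_time))
```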

Configuration

The severity thresholds runtime_severity and runtime_ratio_severity for avg_exec_time, avg_shuffle_time and avg_sort_time are easily configurable. To learn more about configuring these parameters, see here.

Spark

Spark Event Log Limit

The Spark event-log processor currently cannot handle very large log files. Dr. Elephant takes a long time to process a large Spark event log, which can even threaten the stability of Dr. Elephant itself. For now, a log size limit (100 MB) is therefore enforced; for logs above the limit, a new process is spawned to handle them.

Calculation

If the data was throttled, the heuristic reports the highest severity, CRITICAL; otherwise it reports no severity.

Spark Executor Load Balance

Unlike Map/Reduce tasks, a Spark application allocates all the resources it needs when it starts and does not release them until the whole job ends. Because of this, balancing the load across a Spark application's executors is particularly important, to avoid overloading individual nodes in the cluster.

Calculation

```
Let us define the following variables:

    peak_memory: List of peak memories for all executors
    durations: List of durations of all executors
    inputBytes: List of input bytes of all executors
    outputBytes: List of output bytes of all executors.
    looser_metric_deviation_severity: List of threshold values for deviation severity, loose bounds.
    metric_deviation_severity: List of threshold values for deviation severity, tight bounds.

Let us define the following functions:

    func getDeviation(x): returns max(|maximum - avg|, |minimum - avg|) / avg, where
        x = list of values
        maximum = maximum of values in x
        minimum = minimum of values in x
        avg = average of values in x
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.
    func max(x,y): returns the maximum value of x and y.
    func Min(l): returns the minimum of a list l.

The overall severity can be found as:

    severity = Min(getSeverity(getDeviation(peak_memory), looser_metric_deviation_severity),
                   getSeverity(getDeviation(durations), metric_deviation_severity),
                   getSeverity(getDeviation(inputBytes), metric_deviation_severity),
                   getSeverity(getDeviation(outputBytes), looser_metric_deviation_severity))
```
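The deviation function is simple to state in Python; the sketch below mirrors the pseudocode (the min over the four metrics means an application is flagged only if every metric looks imbalanced; the threshold helper and its scale are assumptions as before):

```python
def get_deviation(values):
    # max(|maximum - avg|, |minimum - avg|) / avg, per the pseudocode above.
    avg = sum(values) / len(values)
    return max(abs(max(values) - avg), abs(min(values) - avg)) / avg

def get_severity(value, thresholds):
    # Number of ascending thresholds reached: 0 = NONE .. 4 = CRITICAL (assumed scale).
    return sum(1 for t in thresholds if value >= t)

def executor_balance_severity(peak_memory, durations, input_bytes, output_bytes,
                              looser_metric_deviation_severity,
                              metric_deviation_severity):
    return min(
        get_severity(get_deviation(peak_memory), looser_metric_deviation_severity),
        get_severity(get_deviation(durations), metric_deviation_severity),
        get_severity(get_deviation(input_bytes), metric_deviation_severity),
        get_severity(get_deviation(output_bytes), looser_metric_deviation_severity))
```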

Configuration

The threshold parameters looser_metric_deviation_severity and metric_deviation_severity are easily configurable. To learn more about configuring these parameters, see here.

Spark Job Runtime

This heuristic tunes the runtime of Spark jobs. A Spark application can be split into multiple jobs, and each job into multiple stages.

Calculation

```
Let us define the following variables:

    avg_job_failure_rate: Average job failure rate
    avg_job_failure_rate_severity: List of threshold values for average job failure rate

Let us define the following variables for each job:

    single_job_failure_rate: Failure rate of a single job
    single_job_failure_rate_severity: List of threshold values for single job failure rate.

The severity of the heuristic is the maximum of avg_job_failure_rate_severity and the single_job_failure_rate_severity over all jobs, i.e.:

    severity = max(getSeverity(single_job_failure_rate, single_job_failure_rate_severity),
                   getSeverity(avg_job_failure_rate, avg_job_failure_rate_severity))

where single_job_failure_rate is computed for all the jobs.
```

Configuration

The threshold parameters single_job_failure_rate_severity and avg_job_failure_rate_severity are easily configurable. For more details, see here.

Spark Memory Limit

Spark applications currently lack dynamic resource allocation. Unlike Map/Reduce jobs, which allocate resources for each map/reduce process and release them gradually as execution proceeds, a Spark application requests all the resources it needs when it starts and does not release them until it finishes. Excessive memory use can affect the stability of cluster nodes, so the maximum fraction of memory a Spark application can use needs to be limited.

Calculation

```
Let us define the following variables:

    total_executor_memory: total memory of all the executors
    total_storage_memory: total memory allocated for storage by all the executors
    total_driver_memory: total driver memory allocated
    peak_memory: total memory used at peak
    mem_utilization_severity: The list of threshold values for the memory utilization.
    total_memory_severity_in_tb: The list of threshold values for total memory.

Let us define the following functions:

    func max(x,y): Returns the maximum of x and y.
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity can then be computed as:

    severity = max(getSeverity(total_executor_memory, total_memory_severity_in_tb),
                   getSeverity(peak_memory / total_storage_memory, mem_utilization_severity))
```

Configuration

The threshold parameters total_memory_severity_in_tb and mem_utilization_severity are easily configurable. For more details, see here.

Spark Stage Runtime

As with the Spark job runtime heuristic, a Spark application can be split into multiple jobs, and each job into multiple stages.

Calculation

```
Let us define the following variable for each Spark job:

    stage_failure_rate: The stage failure rate of the job
    stage_failure_rate_severity: The list of threshold values for stage failure rate.

Let us define the following variables for each stage of a Spark job:

    task_failure_rate: The task failure rate of the stage
    runtime: The runtime of a single stage
    single_stage_tasks_failure_rate_severity: The list of threshold values for task failure of a stage
    stage_runtime_severity_in_min: The list of threshold values for stage runtime.

Let us define the following functions:

    func max(x,y): returns the maximum value of x and y.
    func getSeverity(x,y): Compares value x with the severity threshold values in y and returns the severity.

The overall severity can be found as:

    severity_stage = max(getSeverity(task_failure_rate, single_stage_tasks_failure_rate_severity),
                         getSeverity(runtime, stage_runtime_severity_in_min))
    severity_job = getSeverity(stage_failure_rate, stage_failure_rate_severity)
    severity = max(severity_stage, severity_job)

where task_failure_rate is computed for all the tasks.
```

Configuration

The threshold parameters single_stage_tasks_failure_rate_severity, stage_runtime_severity_in_min and stage_failure_rate_severity are easily configurable. For more details, see here.

This chapter is long; specific terms and the function of each parameter can be looked up in the Dr. Elephant dashboard.

Source:

Author: hyperxu
Link: http://www.hyperxu.com/2019/06/13/dr-elephant-6/