Exploring BPF LSM support on aarch64 with ftrace

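Translator's note

While developing eBPF programs in a Linux virtual machine on a MacBook M2, I found that some LSM-type eBPF programs would not run. While trying to pin down the cause, I came across this article, which gives a more concrete picture of how LSM eBPF support differs between ARM64 and AMD64 kernels.
Original article: Exploring BPF LSM support on aarch64 with ftrace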

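Original text

This blog post summarizes our internal research on BPF LSM support for aarch64 in Linux. If you are not familiar with the kernel code base, getting started with the kernel sources is very hard, so we decided to publish this post and show our approach, since it may help anyone who wants to explore kernel internals.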

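Introduction

On x86_64 we already use BPF LSM, while on aarch64 we rely on kprobes, so we wanted to find out which features are missing from the kernel for BPF LSM to work on aarch64.

We have dug into the kernel sources many times, but usually we search for something that already exists in order to understand how it works. In this case we were looking for something that does not exist: code paths that return an error because a feature is not implemented.

Recalling Steven Rostedt's talk on how to start learning the Linux kernel, we started from ftrace (and the tools built on the tracing infrastructure) to understand what happens when we load an unsupported BPF program into the kernel.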

The problem

This is the output of our software, pulsar, when we try to load a BPF LSM program on an aarch64 Linux 5.15 kernel:

    root@pine64-1:/home/exein# ./pulsar-enterprise-exec pulsard
    [2023-02-16T14:52:45Z INFO  pulsar::pulsard::daemon] Starting module process-monitor
    [2023-02-16T14:52:45Z INFO  pulsar::pulsard::daemon] Starting module file-system-monitor
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module network-monitor
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module logger
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module rules-engine
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module desktop-notifier
    [2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in file-system-monitor: failed program attach lsm path_mknod

        Caused by:
            0: `bpf_raw_tracepoint_open` failed
            1: No error information (os error 524)
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module anomaly-detection
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module malware-detection
    [2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in malware-detection: /var/lib/pulsar/malware_detection/models/parameters.json not found
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module platform-connector
    [2023-02-16T14:52:46Z INFO  platform_connector::client] Connected to https://platform-dev-instance.exein.io:8001/
    [2023-02-16T14:52:46Z INFO  pulsar::pulsard::daemon] Starting module threat-response
    [2023-02-16T14:52:46Z ERROR pulsar::pulsard::module_manager] Module error in network-monitor: failed program attach lsm socket_bind

        Caused by:
            0: `bpf_raw_tracepoint_open` failed
            1: No error information (os error 524)

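pulsar fails with error 524 (ENOTSUPP) when it tries to attach the BPF program for the path_mknod LSM hook; the log shows the same failure for socket_bind. Let's dig into the problem.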
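
The message "No error information" appears because 524 is a kernel-internal code (ENOTSUPP in include/linux/errno.h) that has no userspace errno name, so libc has no text for it. The following is a minimal sketch of how a tool could decode it; the helper name describe_os_error and the message string are hypothetical and are not taken from pulsar.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Kernel-internal error code (include/linux/errno.h). It is not part of
 * the userspace errno set, so strerror() has no specific text for it. */
#define KERNEL_ENOTSUPP 524

/* Hypothetical helper: map an OS error code to a readable description. */
static const char *describe_os_error(int code)
{
    assert(code >= 0);
    if (code == KERNEL_ENOTSUPP)
        return "ENOTSUPP (kernel-internal): operation is not supported";
    /* Every other code is left to libc. */
    return strerror(code);
}
```

Calling describe_os_error(524) returns the ENOTSUPP description, while ordinary codes such as EPERM still go through strerror().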

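Note: when we did this research we could not find a precompiled aarch64 kernel with BPF and BTF enabled, so we had to build a custom kernel. We also enabled the tracing options and the function_graph plugin in order to use the tools below.
All experiments were run on a Pine A64 with a custom Armbian image.
These images ship a custom kernel with a standard Ubuntu 22.04 LTS Jammy userspace.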
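
When preparing such a kernel, one quick sanity check is whether the BPF LSM is in the list of active LSMs, which the kernel exposes as a comma-separated string in /sys/kernel/security/lsm (securityfs must be mounted). The sketch below is an assumption about how one might check this, with hypothetical function names. It only tells whether the LSM is enabled, not whether the architecture can actually attach programs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if `name` is one of the comma-separated entries in `list`. */
static bool lsm_list_contains(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        /* Length of the current entry, up to a comma or newline. */
        size_t len = strcspn(p, ",\n");
        if (len == name_len && strncmp(p, name, len) == 0)
            return true;
        p += len;
        if (*p != '\0')
            p++; /* skip the separator */
    }
    return false;
}

/* Read the active LSM list from `path` (normally
 * /sys/kernel/security/lsm); returns false if it cannot be read. */
static bool bpf_lsm_active(const char *path)
{
    char buf[512];
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return false;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return lsm_list_contains(buf, "bpf");
}
```

A caller would typically use bpf_lsm_active("/sys/kernel/security/lsm") and, if it returns false, check CONFIG_BPF_LSM and the lsm= boot parameter.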

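Tools

To investigate the problem we used the following tools:

- bpftrace: a BPF-based tool that attaches probes dynamically using a custom C-like language.
- trace-cmd: a wrapper around the tracefs file system for interacting with the ftrace infrastructure.

To use these tools you need some options enabled in the Linux kernel; see the official documentation for the full requirements.

Note: other tools can do the same job, for example funcgraph and kprobe from perf-tools.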

Linux 5.15

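Let's start using these tools to see what happens when we try to load our BPF program on kernel 5.15.

From here to the end of the article we will use the probe binary instead of pulsar because it is simpler. As a brief summary of how it works, here is its command-line help: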

    exein@pine64-1:~$ ./probe 
    Test runner for eBPF programs

    Usage: probe [OPTIONS] <COMMAND>

    Commands:
      file-system-monitor  Watch file creations
      process-monitor      Watch process events (fork/exec/exit)
      network-monitor      Watch network events
      help                 Print this message or the help of the given subcommand(s)

    Options:
      -v, --verbose  
      -h, --help     Print help
      -V, --version  Print version

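In these examples we will try to load the file-system-monitor probe.

Running the following commands shows the function graph of the calls made under __sys_bpf, the entry point of the BPF system call: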

    trace-cmd record -p function_graph -g __sys_bpf ./probe file-system-monitor
    trace-cmd report
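
Because trace-cmd is a wrapper around tracefs, the record options above correspond roughly to writing a few control files under the tracefs mount point (usually /sys/kernel/tracing). The sketch below only illustrates that configuration step under this assumption; the function names are hypothetical, and the real trace-cmd also collects the per-CPU trace data, which is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write `value` to the control file `name` under the tracefs directory. */
static int tracefs_write(const char *tracefs_dir, const char *name,
                         const char *value)
{
    char path[512];

    assert(tracefs_dir != NULL && name != NULL && value != NULL);
    int n = snprintf(path, sizeof(path), "%s/%s", tracefs_dir, name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Roughly what `-p function_graph -g <root_func>` asks ftrace to do. */
static int start_function_graph(const char *tracefs_dir, const char *root_func)
{
    /* Pause tracing while the tracer is being reconfigured. */
    if (tracefs_write(tracefs_dir, "tracing_on", "0") != 0)
        return -1;
    /* Only graph the calls made below this function. */
    if (tracefs_write(tracefs_dir, "set_graph_function", root_func) != 0)
        return -1;
    /* Select the function_graph tracer. */
    if (tracefs_write(tracefs_dir, "current_tracer", "function_graph") != 0)
        return -1;
    return tracefs_write(tracefs_dir, "tracing_on", "1");
}
```

Run as root, start_function_graph("/sys/kernel/tracing", "__sys_bpf") would set up the same tracer and root function as the command above; the resulting graph can then be read from the trace file in the same directory.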

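The output is a huge function graph, too large to paste here. Since we hit an error, we are interested in the last few functions before the program stops. Here are the last lines of the trace-cmd report output: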

    ...
     tokio-runtime-w-1666  [003]  1318.058019: funcgraph_entry:                   |        bpf_trampoline_link_prog() {
     tokio-runtime-w-1666  [003]  1318.058020: funcgraph_entry:        2.292 us   |          bpf_attach_type_to_tramp();
     tokio-runtime-w-1666  [003]  1318.058024: funcgraph_entry:        1.250 us   |          mutex_lock();
     tokio-runtime-w-1666  [003]  1318.058028: funcgraph_entry:                   |          bpf_trampoline_update() {
     tokio-runtime-w-1666  [003]  1318.058030: funcgraph_entry:                   |            kmem_cache_alloc_trace() {
     tokio-runtime-w-1666  [003]  1318.058031: funcgraph_entry:        1.167 us   |              should_failslab();
     tokio-runtime-w-1666  [003]  1318.058036: funcgraph_exit:         6.792 us   |            }
     tokio-runtime-w-1666  [003]  1318.058039: funcgraph_entry:                   |            kmem_cache_alloc_trace() {
     tokio-runtime-w-1666  [003]  1318.058042: funcgraph_entry:        2.750 us   |              should_failslab();
     tokio-runtime-w-1666  [003]  1318.058046: funcgraph_exit:         6.417 us   |            }
     tokio-runtime-w-1666  [003]  1318.058048: funcgraph_entry:        2.708 us   |            bpf_jit_charge_modmem();
     tokio-runtime-w-1666  [003]  1318.058053: funcgraph_entry:                   |            bpf_jit_alloc_exec_page() {
     tokio-runtime-w-1666  [003]  1318.058055: funcgraph_entry:                   |              bpf_jit_alloc_exec() {
     tokio-runtime-w-1666  [003]  1318.058057: funcgraph_entry:                   |                vmalloc() {
     tokio-runtime-w-1666  [003]  1318.058059: funcgraph_entry:                   |                  __vmalloc_node() {
     tokio-runtime-w-1666  [003]  1318.058061: funcgraph_entry:                   |                    __vmalloc_node_range() {
     tokio-runtime-w-1666  [003]  1318.058064: funcgraph_entry:                   |                      __get_vm_area_node.constprop.64() {
     tokio-runtime-w-1666  [003]  1318.058067: funcgraph_entry:                   |                        kmem_cache_alloc_node_trace() {
     tokio-runtime-w-1666  [003]  1318.058069: funcgraph_entry:        1.459 us   |                          should_failslab();
     tokio-runtime-w-1666  [003]  1318.058073: funcgraph_exit:         6.292 us   |                        }
     tokio-runtime-w-1666  [003]  1318.058075: funcgraph_entry:                   |                        alloc_vmap_area() {
     tokio-runtime-w-1666  [003]  1318.058077: funcgraph_entry:                   |                          kmem_cache_alloc_node() {
     tokio-runtime-w-1666  [003]  1318.058079: funcgraph_entry:        1.167 us   |                            should_failslab();
     tokio-runtime-w-1666  [003]  1318.058085: funcgraph_exit:         7.625 us   |                          }
     tokio-runtime-w-1666  [003]  1318.058088: funcgraph_entry:                   |                          kmem_cache_alloc_node() {
     tokio-runtime-w-1666  [003]  1318.058089: funcgraph_entry:        1.208 us   |                            should_failslab();
     tokio-runtime-w-1666  [003]  1318.058092: funcgraph_exit:         4.584 us   |                          }
     tokio-runtime-w-1666  [003]  1318.058104: funcgraph_entry:                   |                          kmem_cache_free() {
     tokio-runtime-w-1666  [003]  1318.058107: funcgraph_entry:        2.084 us   |                            __slab_free();
     tokio-runtime-w-1666  [003]  1318.058110: funcgraph_exit:         5.667 us   |                          }
     tokio-runtime-w-1666  [003]  1318.058112: funcgraph_entry:        6.375 us   |                          insert_vmap_area.constprop.74();
     tokio-runtime-w-1666  [003]  1318.058119: funcgraph_exit:       + 44.667 us  |                        }
     tokio-runtime-w-1666  [003]  1318.058122: funcgraph_exit:       + 58.250 us  |                      }
     tokio-runtime-w-1666  [003]  1318.058124: funcgraph_entry:                   |                      __kmalloc_node() {
     tokio-runtime-w-1666  [003]  1318.058125: funcgraph_entry:        1.625 us   |                        kmalloc_slab();
     tokio-runtime-w-1666  [003]  1318.058128: funcgraph_entry:        1.167 us   |                        should_failslab();
     tokio-runtime-w-1666  [003]  1318.058131: funcgraph_exit:         7.208 us   |                      }
     tokio-runtime-w-1666  [003]  1318.058133: funcgraph_entry:                   |                      alloc_pages() {
     tokio-runtime-w-1666  [003]  1318.058135: funcgraph_entry:        1.583 us   |                        get_task_policy.part.48();
     tokio-runtime-w-1666  [003]  1318.058138: funcgraph_entry:        1.500 us   |                        policy_node();
     tokio-runtime-w-1666  [003]  1318.058141: funcgraph_entry:        1.209 us   |                        policy_nodemask();
     tokio-runtime-w-1666  [003]  1318.058143: funcgraph_entry:                   |                        __alloc_pages() {
     tokio-runtime-w-1666  [003]  1318.058145: funcgraph_entry:        1.458 us   |                          should_fail_alloc_page();
     tokio-runtime-w-1666  [003]  1318.058147: funcgraph_entry:                   |                          get_page_from_freelist() {
     tokio-runtime-w-1666  [003]  1318.058150: funcgraph_entry:        1.583 us   |                            prep_new_page();
     tokio-runtime-w-1666  [003]  1318.058153: funcgraph_exit:         5.459 us   |                          }
     tokio-runtime-w-1666  [003]  1318.058154: funcgraph_exit:       + 10.542 us  |                        }
     tokio-runtime-w-1666  [003]  1318.058155: funcgraph_exit:       + 22.083 us  |                      }
     tokio-runtime-w-1666  [003]  1318.058157: funcgraph_entry:                   |                      __cond_resched() {
     tokio-runtime-w-1666  [003]  1318.058158: funcgraph_entry:        1.833 us   |                        rcu_all_qs();
     tokio-runtime-w-1666  [003]  1318.058161: funcgraph_exit:         4.167 us   |                      }
     tokio-runtime-w-1666  [003]  1318.058166: funcgraph_entry:        5.542 us   |                      vmap_pages_range_noflush();
     tokio-runtime-w-1666  [003]  1318.058173: funcgraph_exit:       ! 112.375 us |                    }
     tokio-runtime-w-1666  [003]  1318.058175: funcgraph_exit:       ! 116.000 us |                  }
     tokio-runtime-w-1666  [003]  1318.058176: funcgraph_exit:       ! 119.292 us |                }
     tokio-runtime-w-1666  [003]  1318.058177: funcgraph_exit:       ! 122.542 us |              }
     tokio-runtime-w-1666  [003]  1318.058179: funcgraph_entry:                   |              find_vm_area() {
     tokio-runtime-w-1666  [003]  1318.058180: funcgraph_entry:        1.375 us   |                find_vmap_area();
     tokio-runtime-w-1666  [003]  1318.058183: funcgraph_exit:         4.333 us   |              }
     tokio-runtime-w-1666  [003]  1318.058185: funcgraph_entry:                   |              set_memory_x() {
     tokio-runtime-w-1666  [003]  1318.058186: funcgraph_entry:                   |                change_memory_common() {
     tokio-runtime-w-1666  [003]  1318.058188: funcgraph_entry:                   |                  find_vm_area() {
     tokio-runtime-w-1666  [003]  1318.058189: funcgraph_entry:        1.333 us   |                    find_vmap_area();
     tokio-runtime-w-1666  [003]  1318.058192: funcgraph_exit:         3.875 us   |                  }
     tokio-runtime-w-1666  [003]  1318.058193: funcgraph_entry:                   |                  vm_unmap_aliases() {
     tokio-runtime-w-1666  [003]  1318.058194: funcgraph_entry:                   |                    _vm_unmap_aliases.part.58() {
     tokio-runtime-w-1666  [003]  1318.058196: funcgraph_entry:        1.542 us   |                      rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058199: funcgraph_entry:        1.208 us   |                      rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058202: funcgraph_entry:        1.166 us   |                      rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058205: funcgraph_entry:        1.208 us   |                      rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058207: funcgraph_entry:        1.208 us   |                      mutex_lock();
     tokio-runtime-w-1666  [003]  1318.058210: funcgraph_entry:                   |                      purge_fragmented_blocks_allcpus() {
     tokio-runtime-w-1666  [003]  1318.058212: funcgraph_entry:        1.500 us   |                        rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058214: funcgraph_entry:        1.500 us   |                        rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058217: funcgraph_entry:        1.500 us   |                        rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058220: funcgraph_entry:        1.167 us   |                        rcu_read_unlock_strict();
     tokio-runtime-w-1666  [003]  1318.058222: funcgraph_exit:       + 11.917 us  |                      }
     tokio-runtime-w-1666  [003]  1318.058224: funcgraph_entry:                   |                      __purge_vmap_area_lazy() {
     tokio-runtime-w-1666  [003]  1318.058232: funcgraph_entry:                   |                        kmem_cache_free() {
     tokio-runtime-w-1666  [003]  1318.058234: funcgraph_entry:        1.250 us   |                          __slab_free();
     tokio-runtime-w-1666  [003]  1318.058237: funcgraph_exit:         4.791 us   |                        }
     tokio-runtime-w-1666  [003]  1318.058241: funcgraph_entry:        1.209 us   |                        __cond_resched_lock();
     tokio-runtime-w-1666  [003]  1318.058244: funcgraph_exit:       + 19.625 us  |                      }
     tokio-runtime-w-1666  [003]  1318.058245: funcgraph_entry:        1.167 us   |                      mutex_unlock();
     tokio-runtime-w-1666  [003]  1318.058247: funcgraph_exit:       + 53.042 us  |                    }
     tokio-runtime-w-1666  [003]  1318.058248: funcgraph_exit:       + 55.625 us  |                  }
     tokio-runtime-w-1666  [003]  1318.058250: funcgraph_entry:                   |                  __change_memory_common() {
     tokio-runtime-w-1666  [003]  1318.058251: funcgraph_entry:                   |                    apply_to_page_range() {
     tokio-runtime-w-1666  [003]  1318.058253: funcgraph_entry:                   |                      __apply_to_page_range() {
     tokio-runtime-w-1666  [003]  1318.058255: funcgraph_entry:        1.250 us   |                        pud_huge();
     tokio-runtime-w-1666  [003]  1318.058258: funcgraph_entry:        1.166 us   |                        pmd_huge();
     tokio-runtime-w-1666  [003]  1318.058260: funcgraph_entry:        1.208 us   |                        change_page_range();
     tokio-runtime-w-1666  [003]  1318.058263: funcgraph_exit:         9.834 us   |                      }
     tokio-runtime-w-1666  [003]  1318.058264: funcgraph_exit:       + 12.709 us  |                    }
     tokio-runtime-w-1666  [003]  1318.058266: funcgraph_exit:       + 15.459 us  |                  }
     tokio-runtime-w-1666  [003]  1318.058268: funcgraph_exit:       + 80.791 us  |                }
     tokio-runtime-w-1666  [003]  1318.058270: funcgraph_exit:       + 84.834 us  |              }
     tokio-runtime-w-1666  [003]  1318.058272: funcgraph_exit:       ! 218.500 us |            }
     tokio-runtime-w-1666  [003]  1318.058274: funcgraph_entry:                   |            __alloc_percpu_gfp() {
     tokio-runtime-w-1666  [003]  1318.058276: funcgraph_entry:                   |              pcpu_alloc() {
     tokio-runtime-w-1666  [003]  1318.058281: funcgraph_entry:        2.250 us   |                mutex_lock_killable();
     tokio-runtime-w-1666  [003]  1318.058290: funcgraph_entry:                   |                pcpu_find_block_fit() {
     tokio-runtime-w-1666  [003]  1318.058293: funcgraph_entry:        2.833 us   |                  pcpu_next_fit_region.constprop.38();
     tokio-runtime-w-1666  [003]  1318.058299: funcgraph_exit:         9.084 us   |                }
     tokio-runtime-w-1666  [003]  1318.058301: funcgraph_entry:                   |                pcpu_alloc_area() {
     tokio-runtime-w-1666  [003]  1318.058315: funcgraph_entry:        4.000 us   |                  pcpu_block_update_hint_alloc();
     tokio-runtime-w-1666  [003]  1318.058320: funcgraph_entry:        2.208 us   |                  pcpu_chunk_relocate();
     tokio-runtime-w-1666  [003]  1318.058324: funcgraph_exit:       + 22.625 us  |                }
     tokio-runtime-w-1666  [003]  1318.058327: funcgraph_entry:        1.208 us   |                mutex_unlock();
     tokio-runtime-w-1666  [003]  1318.058332: funcgraph_entry:        1.584 us   |                pcpu_memcg_post_alloc_hook();
     tokio-runtime-w-1666  [003]  1318.058335: funcgraph_exit:       + 58.833 us  |              }
     tokio-runtime-w-1666  [003]  1318.058336: funcgraph_exit:       + 61.834 us  |            }
     tokio-runtime-w-1666  [003]  1318.058338: funcgraph_entry:                   |            kmem_cache_alloc_trace() {
     tokio-runtime-w-1666  [003]  1318.058339: funcgraph_entry:        1.167 us   |              should_failslab();
     tokio-runtime-w-1666  [003]  1318.058342: funcgraph_exit:         4.458 us   |            }
     tokio-runtime-w-1666  [003]  1318.058359: funcgraph_entry:                   |            bpf_image_ksym_add() {
     tokio-runtime-w-1666  [003]  1318.058360: funcgraph_entry:                   |              bpf_ksym_add() {
     tokio-runtime-w-1666  [003]  1318.058363: funcgraph_entry:        1.583 us   |                __local_bh_enable_ip();
     tokio-runtime-w-1666  [003]  1318.058366: funcgraph_exit:         5.750 us   |              }
     tokio-runtime-w-1666  [003]  1318.058369: funcgraph_exit:         9.834 us   |            }
     tokio-runtime-w-1666  [003]  1318.058371: funcgraph_entry:        1.250 us   |            arch_prepare_bpf_trampoline();
     tokio-runtime-w-1666  [003]  1318.058373: funcgraph_entry:        2.292 us   |            kfree();
     tokio-runtime-w-1666  [003]  1318.058377: funcgraph_exit:       ! 348.625 us |          }
     tokio-runtime-w-1666  [003]  1318.058379: funcgraph_entry:        1.250 us   |          mutex_unlock();
     tokio-runtime-w-1666  [003]  1318.058382: funcgraph_exit:       ! 363.167 us |        }
     tokio-runtime-w-1666  [003]  1318.058384: funcgraph_entry:                   |        bpf_link_cleanup() {
     tokio-runtime-w-1666  [003]  1318.058386: funcgraph_entry:                   |          bpf_link_free_id.part.30() {
     tokio-runtime-w-1666  [003]  1318.058392: funcgraph_entry:                   |            call_rcu() {
     tokio-runtime-w-1666  [003]  1318.058396: funcgraph_entry:        1.834 us   |              rcu_segcblist_enqueue();
     tokio-runtime-w-1666  [003]  1318.058401: funcgraph_exit:         9.333 us   |            }
     tokio-runtime-w-1666  [003]  1318.058403: funcgraph_entry:        1.542 us   |            __local_bh_enable_ip();
     tokio-runtime-w-1666  [003]  1318.058406: funcgraph_exit:       + 19.542 us  |          }
     tokio-runtime-w-1666  [003]  1318.058408: funcgraph_entry:                   |          fput() {
     tokio-runtime-w-1666  [003]  1318.058409: funcgraph_entry:                   |            fput_many() {
     tokio-runtime-w-1666  [003]  1318.058411: funcgraph_entry:                   |              task_work_add() {
     tokio-runtime-w-1666  [003]  1318.058414: funcgraph_entry:        1.625 us   |                kick_process();
     tokio-runtime-w-1666  [003]  1318.058418: funcgraph_exit:         6.750 us   |              }
     tokio-runtime-w-1666  [003]  1318.058419: funcgraph_exit:       + 10.333 us  |            }
     tokio-runtime-w-1666  [003]  1318.058420: funcgraph_exit:       + 12.708 us  |          }
     tokio-runtime-w-1666  [003]  1318.058422: funcgraph_entry:        2.250 us   |          put_unused_fd();
     tokio-runtime-w-1666  [003]  1318.058426: funcgraph_exit:       + 41.416 us  |        }
     tokio-runtime-w-1666  [003]  1318.058428: funcgraph_entry:        1.292 us   |        mutex_unlock();
     tokio-runtime-w-1666  [003]  1318.058430: funcgraph_entry:        1.250 us   |        kfree();
     tokio-runtime-w-1666  [003]  1318.058433: funcgraph_exit:       ! 567.458 us |      }
     tokio-runtime-w-1666  [003]  1318.058435: funcgraph_entry:        2.125 us   |      __bpf_prog_put.isra.47();
     tokio-runtime-w-1666  [003]  1318.058438: funcgraph_exit:       ! 602.291 us |    }
     tokio-runtime-w-1666  [003]  1318.058439: funcgraph_exit:       ! 631.791 us |  }

This is the source code in kernel/bpf/trampoline.c for bpf_trampoline_update, the last function executed before the cleanup in bpf_link_cleanup starts:

    static int bpf_trampoline_update(struct bpf_trampoline *tr)
    {
        struct bpf_tramp_image *im;
        struct bpf_tramp_progs *tprogs;
        u32 flags = BPF_TRAMP_F_RESTORE_REGS;
        bool ip_arg = false;
        int err, total;

        tprogs = bpf_trampoline_get_progs(tr, &total, &ip_arg);
        if (IS_ERR(tprogs))
            return PTR_ERR(tprogs);

        if (total == 0) {
            err = unregister_fentry(tr, tr->cur_image->image);
            bpf_tramp_image_put(tr->cur_image);
            tr->cur_image = NULL;
            tr->selector = 0;
            goto out;
        }

        im = bpf_tramp_image_alloc(tr->key, tr->selector);
        if (IS_ERR(im)) {
            err = PTR_ERR(im);
            goto out;
        }

        if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
            tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
            flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;

        if (ip_arg)
            flags |= BPF_TRAMP_F_IP_ARG;

        err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
                          &tr->func.model, flags, tprogs,
                          tr->func.addr);
        if (err < 0)
            goto out;

        WARN_ON(tr->cur_image && tr->selector == 0);
        WARN_ON(!tr->cur_image && tr->selector);
        if (tr->cur_image)
            /* progs already running at this address */
            err = modify_fentry(tr, tr->cur_image->image, im->image);
        else
            /* first time registering */
            err = register_fentry(tr, im->image);
        if (err)
            goto out;
        if (tr->cur_image)
            bpf_tramp_image_put(tr->cur_image);
        tr->cur_image = im;
        tr->selector++;
    out:
        kfree(tprogs);
        return err;
    }

From the earlier output we can see:

     tokio-runtime-w-1666  [003]  1318.058371: funcgraph_entry:        1.250 us   |            arch_prepare_bpf_trampoline();
     tokio-runtime-w-1666  [003]  1318.058373: funcgraph_entry:        2.292 us   |            kfree();

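There are no other function calls between arch_prepare_bpf_trampoline and kfree, so most likely the former returned an error code in the err variable and the `if (err < 0) goto out;` check jumped straight to kfree(tprogs). We can verify this with bpftrace.

Starting bpftrace in a shell as follows, we can capture the return value of arch_prepare_bpf_trampoline and print it to the console: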
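
To make the inference from the trace concrete, the tail of bpf_trampoline_update can be modelled in user space. This is a simplified sketch, not kernel code: the call log stands in for the ftrace output, and the two arch_prepare_* stand-ins and the value 128 are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define KERNEL_ENOTSUPP 524

/* Call log standing in for the ftrace output. */
static char call_log[128];

static void log_call(const char *name)
{
    size_t used = strlen(call_log);

    assert(used + strlen(name) + 2 < sizeof(call_log));
    snprintf(call_log + used, sizeof(call_log) - used, "%s%s",
             used ? "," : "", name);
}

/* Hypothetical stand-in for an architecture without trampoline support. */
static int arch_prepare_unsupported(void)
{
    log_call("arch_prepare_bpf_trampoline");
    return -KERNEL_ENOTSUPP;
}

/* Hypothetical stand-in for a supported architecture; 128 is arbitrary. */
static int arch_prepare_ok(void)
{
    log_call("arch_prepare_bpf_trampoline");
    return 128;
}

/* Simplified model of the tail of bpf_trampoline_update(). */
static int trampoline_update_model(int (*arch_prepare)(void))
{
    int err;

    call_log[0] = '\0';
    err = arch_prepare();
    if (err < 0)
        goto out;            /* error: skip registration entirely */
    log_call("register_fentry");
    err = 0;
out:
    log_call("kfree");       /* cleanup runs on both paths */
    return err;
}
```

With the failing stand-in the log contains only arch_prepare_bpf_trampoline followed by kfree, which is the pattern seen in the trace; with the succeeding one, register_fentry appears in between.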

    bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'

After starting probe in another terminal, we get the following output from bpftrace:

    root@pine64-1:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval link: %d\n", retval); }'
    Attaching 1 probe...
    retval link: -524

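The -524 (-ENOTSUPP) return value arises because kernel 5.15 has no aarch64 implementation of arch_prepare_bpf_trampoline and falls back to the default __weak placeholder implementation: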

    int __weak
    arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
                    const struct btf_func_model *m, u32 flags,
                    struct bpf_tramp_links *tlinks,
                    void *orig_call)
    {
        return -ENOTSUPP;
    }

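So this feature is not supported on this kernel version. The good news is that, thanks to this patch, it has been implemented in 6.x kernels.

Let's move on to a 6.x kernel.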
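
The __weak annotation is what lets generic kernel code call the hook on every architecture: the stub is used only when no architecture provides its own definition. The following user-space sketch shows the same linker mechanism with GCC/Clang's weak attribute; the names arch_prepare_trampoline_demo and link_prog_demo are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define KERNEL_ENOTSUPP 524

/* Generic fallback, analogous to the kernel's __weak stub. If another
 * object file linked into the program defines a non-weak function with
 * the same name, the linker uses that definition instead. */
__attribute__((weak)) int arch_prepare_trampoline_demo(void)
{
    return -KERNEL_ENOTSUPP;
}

/* Generic caller: it does not know which definition it will get. */
int link_prog_demo(void)
{
    int err = arch_prepare_trampoline_demo();

    if (err < 0)
        fprintf(stderr, "trampoline not supported in this build: %d\n", err);
    return err;
}
```

Built on its own, link_prog_demo() returns -524. Linking in a second source file with a regular definition of arch_prepare_trampoline_demo changes the result without touching the caller, which is in effect what adding an architecture-specific implementation does.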

Linux 6.1

If we try to run probe on kernel 6.1, we get the following output:

    root@pine64:/home/exein# ./probe file-system-monitor
    thread 'main' panicked at 'initialization failed: ProgramAttachError { program: "lsm path_mknod", program_error: SyscallError { call: "bpf_raw_tracepoint_open", io_error: Os { code: 524, kind: Uncategorized, message: "No error information" } } }', src/bin/probe.rs:72:43
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

On kernel 6.1 we still get the same error as on 5.15! Let's find out why.

Running bpftrace on arch_prepare_bpf_trampoline again, this time we get:

    root@pine64:/home/exein# bpftrace -e 'kretprobe:arch_prepare_bpf_trampoline { printf("retval tp link: %d\n", retval); }'
    Attaching 1 probe...
    retval tp link: 284

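So the problem is not here: the function now returns a positive value rather than an error. Let's go back to the function call graph.

This time we start trace-cmd and use -n to skip some functions and get cleaner output: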

    trace-cmd record \
        -p function_graph \
        -g bpf_trampoline_link_prog \
        -n bpf_jit_alloc_exec \
        -n kmalloc_trace \
        -n arch_prepare_bpf_trampoline \
        -n generic_handle_domain_irq \
        -n do_interrupt_handler \
        -n irq_exit_rcu \
        ./probe file-system-monitor

We get the following output from trace-cmd report:

    root@pine64:/home/exein# trace-cmd report 
    CPU 0 is empty
    CPU 1 is empty
    CPU 3 is empty
    cpus=4
     tokio-runtime-w-11886 [002] 193385.056283: funcgraph_entry:                   |  bpf_trampoline_link_prog() {
     tokio-runtime-w-11886 [002] 193385.056321: funcgraph_entry:      + 15.042 us  |    mutex_lock();
     tokio-runtime-w-11886 [002] 193385.056373: funcgraph_entry:                   |    __bpf_trampoline_link_prog() {
     tokio-runtime-w-11886 [002] 193385.056395: funcgraph_entry:      + 14.833 us  |      bpf_attach_type_to_tramp();
     tokio-runtime-w-11886 [002] 193385.056428: funcgraph_entry:                   |      bpf_trampoline_update.isra.23() {
     tokio-runtime-w-11886 [002] 193385.056459: funcgraph_entry:        2.917 us   |        bpf_jit_charge_modmem();
     tokio-runtime-w-11886 [002] 193385.056531: funcgraph_entry:                   |        find_vm_area() {
     tokio-runtime-w-11886 [002] 193385.056540: funcgraph_entry:        3.000 us   |          find_vmap_area();
     tokio-runtime-w-11886 [002] 193385.056547: funcgraph_exit:       + 16.208 us  |        }
     tokio-runtime-w-11886 [002] 193385.056554: funcgraph_entry:                   |        __alloc_percpu_gfp() {
     tokio-runtime-w-11886 [002] 193385.056563: funcgraph_entry:                   |          pcpu_alloc() {
     tokio-runtime-w-11886 [002] 193385.056568: funcgraph_entry:        4.875 us   |            mutex_lock_killable();
     tokio-runtime-w-11886 [002] 193385.056591: funcgraph_entry:                   |            pcpu_find_block_fit() {
     tokio-runtime-w-11886 [002] 193385.056599: funcgraph_entry:        8.625 us   |              pcpu_next_fit_region.constprop.38();
     tokio-runtime-w-11886 [002] 193385.056608: funcgraph_exit:       + 17.166 us  |            }
     tokio-runtime-w-11886 [002] 193385.056610: funcgraph_entry:                   |            pcpu_alloc_area() {
     tokio-runtime-w-11886 [002] 193385.056639: funcgraph_entry:        9.167 us   |              pcpu_block_update();
     tokio-runtime-w-11886 [002] 193385.056656: funcgraph_entry:        7.667 us   |              pcpu_block_update_hint_alloc();
     tokio-runtime-w-11886 [002] 193385.056671: funcgraph_entry:        7.750 us   |              pcpu_chunk_relocate();
     tokio-runtime-w-11886 [002] 193385.056679: funcgraph_exit:       + 69.667 us  |            }
     tokio-runtime-w-11886 [002] 193385.056682: funcgraph_entry:        7.042 us   |            mutex_unlock();
     tokio-runtime-w-11886 [002] 193385.056703: funcgraph_entry:        2.792 us   |            pcpu_memcg_post_alloc_hook();
     tokio-runtime-w-11886 [002] 193385.056712: funcgraph_exit:       ! 148.709 us |          }
     tokio-runtime-w-11886 [002] 193385.056719: funcgraph_exit:       ! 165.250 us |        }
     tokio-runtime-w-11886 [002] 193385.056866: funcgraph_entry:                   |        bpf_image_ksym_add() {
     tokio-runtime-w-11886 [002] 193385.056873: funcgraph_entry:                   |          bpf_ksym_add() {
     tokio-runtime-w-11886 [002] 193385.056882: funcgraph_entry:        2.750 us   |            __local_bh_disable_ip();
     tokio-runtime-w-11886 [002] 193385.056897: funcgraph_entry:        4.625 us   |            __local_bh_enable_ip();
     tokio-runtime-w-11886 [002] 193385.056905: funcgraph_exit:       + 32.459 us  |          }
     tokio-runtime-w-11886 [002] 193385.056922: funcgraph_entry:        7.584 us   |          perf_event_ksymbol();
     tokio-runtime-w-11886 [002] 193385.056944: funcgraph_exit:       + 78.417 us  |        }
     tokio-runtime-w-11886 [002] 193385.057492: funcgraph_entry:                   |        set_memory_ro() {
     tokio-runtime-w-11886 [002] 193385.057501: funcgraph_entry:                   |          change_memory_common() {
     tokio-runtime-w-11886 [002] 193385.057504: funcgraph_entry:                   |            find_vm_area() {
     tokio-runtime-w-11886 [002] 193385.057506: funcgraph_entry:        8.875 us   |              find_vmap_area();
     tokio-runtime-w-11886 [002] 193385.057518: funcgraph_exit:       + 14.250 us  |            }
     tokio-runtime-w-11886 [002] 193385.057522: funcgraph_entry:                   |            __change_memory_common() {
     tokio-runtime-w-11886 [002] 193385.057531: funcgraph_entry:                   |              apply_to_page_range() {
     tokio-runtime-w-11886 [002] 193385.057538: funcgraph_entry:                   |                __apply_to_page_range() {
     tokio-runtime-w-11886 [002] 193385.057544: funcgraph_entry:      + 12.791 us  |                  pud_huge();
     tokio-runtime-w-11886 [002] 193385.057559: funcgraph_entry:        2.708 us   |                  pmd_huge();
     tokio-runtime-w-11886 [002] 193385.057574: funcgraph_entry:      + 15.125 us  |                  change_page_range();
     tokio-runtime-w-11886 [002] 193385.057591: funcgraph_exit:       + 53.792 us  |                }
     tokio-runtime-w-11886 [002] 193385.057597: funcgraph_exit:       + 66.083 us  |              }
     tokio-runtime-w-11886 [002] 193385.057610: funcgraph_exit:       + 88.125 us  |            }
     tokio-runtime-w-11886 [002] 193385.057619: funcgraph_entry:                   |            vm_unmap_aliases() {
     tokio-runtime-w-11886 [002] 193385.057622: funcgraph_entry:                   |              _vm_unmap_aliases.part.77() {
     tokio-runtime-w-11886 [002] 193385.057625: funcgraph_entry:        9.125 us   |                mutex_lock();
     tokio-runtime-w-11886 [002] 193385.057637: funcgraph_entry:        3.084 us   |                purge_fragmented_blocks_allcpus();
     tokio-runtime-w-11886 [002] 193385.057643: funcgraph_entry:                   |                __purge_vmap_area_lazy() {
     tokio-runtime-w-11886 [002] 193385.057687: funcgraph_entry:                   |                  kmem_cache_free() {
     tokio-runtime-w-11886 [002] 193385.057693: funcgraph_entry:      + 13.250 us  |                    __slab_free();
     tokio-runtime-w-11886 [002] 193385.057705: funcgraph_exit:       + 18.750 us  |                  }
     tokio-runtime-w-11886 [002] 193385.057718: funcgraph_entry:        7.416 us   |                  __cond_resched_lock();
     tokio-runtime-w-11886 [002] 193385.057733: funcgraph_exit:       + 90.042 us  |                }
     tokio-runtime-w-11886 [002] 193385.057741: funcgraph_entry:        2.792 us   |                mutex_unlock();
     tokio-runtime-w-11886 [002] 193385.057747: funcgraph_exit:       ! 124.666 us |              }
     tokio-runtime-w-11886 [002] 193385.057749: funcgraph_exit:       ! 130.291 us |            }
     tokio-runtime-w-11886 [002] 193385.057756: funcgraph_entry:                   |            __change_memory_common() {
     tokio-runtime-w-11886 [002] 193385.057759: funcgraph_entry:                   |              apply_to_page_range() {
     tokio-runtime-w-11886 [002] 193385.057765: funcgraph_entry:                   |                __apply_to_page_range() {
     tokio-runtime-w-11886 [002] 193385.057768: funcgraph_entry:        4.125 us   |                  pud_huge();
     tokio-runtime-w-11886 [002] 193385.057778: funcgraph_entry:        8.750 us   |                  pmd_huge();
     tokio-runtime-w-11886 [002] 193385.057790: funcgraph_entry:        4.625 us   |                  change_page_range();
     tokio-runtime-w-11886 [002] 193385.057797: funcgraph_exit:       + 31.958 us  |                }
     tokio-runtime-w-11886 [002] 193385.057803: funcgraph_exit:       + 44.375 us  |              }
     tokio-runtime-w-11886 [002] 193385.057817: funcgraph_exit:       + 61.208 us  |            }
     tokio-runtime-w-11886 [002] 193385.057820: funcgraph_exit:       ! 319.292 us |          }
     tokio-runtime-w-11886 [002] 193385.057826: funcgraph_exit:       ! 333.667 us |        }
     tokio-runtime-w-11886 [002] 193385.057840: funcgraph_entry:                   |        set_memory_x() {
     tokio-runtime-w-11886 [002] 193385.057847: funcgraph_entry:                   |          change_memory_common() {
     tokio-runtime-w-11886 [002] 193385.057855: funcgraph_entry:                   |            find_vm_area() {
     tokio-runtime-w-11886 [002] 193385.057858: funcgraph_entry:        2.917 us   |              find_vmap_area();
     tokio-runtime-w-11886 [002] 193385.057870: funcgraph_exit:       + 14.375 us  |            }
     tokio-runtime-w-11886 [002] 193385.057876: funcgraph_entry:                   |            vm_unmap_aliases() {
     tokio-runtime-w-11886 [002] 193385.057879: funcgraph_entry:                   |              _vm_unmap_aliases.part.77() {
     tokio-runtime-w-11886 [002] 193385.057882: funcgraph_entry:        3.959 us   |                mutex_lock();
     tokio-runtime-w-11886 [002] 193385.057893: funcgraph_entry:        3.000 us   |                purge_fragmented_blocks_allcpus();
     tokio-runtime-w-11886 [002] 193385.057900: funcgraph_entry:        2.791 us   |                __purge_vmap_area_lazy();
     tokio-runtime-w-11886 [002] 193385.057907: funcgraph_entry:        2.709 us   |                mutex_unlock();
     tokio-runtime-w-11886 [002] 193385.057913: funcgraph_exit:       + 33.708 us  |              }
     tokio-runtime-w-11886 [002] 193385.057915: funcgraph_exit:       + 43.000 us  |            }
     tokio-runtime-w-11886 [002] 193385.057922: funcgraph_entry:                   |            __change_memory_common() {
     tokio-runtime-w-11886 [002] 193385.057925: funcgraph_entry:                   |              apply_to_page_range() {
     tokio-runtime-w-11886 [002] 193385.057930: funcgraph_entry:                   |                __apply_to_page_range() {
     tokio-runtime-w-11886 [002] 193385.057933: funcgraph_entry:        4.292 us   |                  pud_huge();
     tokio-runtime-w-11886 [002] 193385.057945: funcgraph_entry:        8.750 us   |                  pmd_huge();
     tokio-runtime-w-11886 [002] 193385.057956: funcgraph_entry:        3.958 us   |                  change_page_range();
     tokio-runtime-w-11886 [002] 193385.058037: funcgraph_exit:       + 32.083 us  |                }
     tokio-runtime-w-11886 [002] 193385.058089: funcgraph_entry:        7.667 us   |                irq_enter_rcu();
     tokio-runtime-w-11886 [002] 193385.058233: funcgraph_exit:       ! 308.041 us |              }
     tokio-runtime-w-11886 [002] 193385.058239: funcgraph_exit:       ! 316.709 us |            }
     tokio-runtime-w-11886 [002] 193385.058247: funcgraph_exit:       ! 400.417 us |          }
     tokio-runtime-w-11886 [002] 193385.058255: funcgraph_exit:       ! 415.000 us |        }
     tokio-runtime-w-11886 [002] 193385.058555: funcgraph_entry:        8.250 us   |        irq_enter_rcu();
     tokio-runtime-w-11886 [002] 193385.058958: funcgraph_entry:                   |        kallsyms_lookup_size_offset() {
     tokio-runtime-w-11886 [002] 193385.058974: funcgraph_entry:      + 36.333 us  |          get_symbol_pos();
     tokio-runtime-w-11886 [002] 193385.059017: funcgraph_exit:       + 59.750 us  |        }
     tokio-runtime-w-11886 [002] 193385.059043: funcgraph_entry:                   |        kfree() {
     tokio-runtime-w-11886 [002] 193385.059057: funcgraph_entry:        3.000 us   |          __kmem_cache_free();
     tokio-runtime-w-11886 [002] 193385.059065: funcgraph_exit:       + 22.833 us  |        }
     tokio-runtime-w-11886 [002] 193385.059073: funcgraph_exit:       # 2644.708 us |      }
     tokio-runtime-w-11886 [002] 193385.059079: funcgraph_exit:       # 2706.292 us |    }
     tokio-runtime-w-11886 [002] 193385.059095: funcgraph_entry:        2.792 us   |    mutex_unlock();
     tokio-runtime-w-11886 [002] 193385.059101: funcgraph_exit:       # 2870.416 us |  }

This time the program got past arch_prepare_bpf_trampoline, set_memory_ro, and set_memory_x. The last function we see in the trace is kallsyms_lookup_size_offset.

Yet the bpf_trampoline_update function in kernel/bpf/trampoline.c contains no explicit call to kallsyms_lookup_size_offset:

    static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
    {

    // ... OTHER CODE ...

    #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
    again:
        if ((tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY) &&
            (tr->flags & BPF_TRAMP_F_CALL_ORIG))
            tr->flags |= BPF_TRAMP_F_ORIG_STACK;
    #endif

        err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
                          &tr->func.model, tr->flags, tlinks,
                          tr->func.addr);
        if (err < 0)
            goto out;

        set_memory_ro((long)im->image, 1);
        set_memory_x((long)im->image, 1);

        WARN_ON(tr->cur_image && tr->selector == 0);
        WARN_ON(!tr->cur_image && tr->selector);
        if (tr->cur_image)
            /* progs already running at this address */
            err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
        else
            /* first time registering */
            err = register_fentry(tr, im->image);

    #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
        if (err == -EAGAIN) {
            /* -EAGAIN from bpf_tramp_ftrace_ops_func. Now
             * BPF_TRAMP_F_SHARE_IPMODIFY is set, we can generate the
             * trampoline again, and retry register.
             */
            /* reset fops->func and fops->trampoline for re-register */
            tr->fops->func = NULL;
            tr->fops->trampoline = 0;

            /* reset im->image memory attr for arch_prepare_bpf_trampoline */
            set_memory_nx((long)im->image, 1);
            set_memory_rw((long)im->image, 1);
            goto again;
        }
    #endif
        if (err)
            goto out;

        if (tr->cur_image)
            bpf_tramp_image_put(tr->cur_image);
        tr->cur_image = im;
        tr->selector++;
    out:
        /* If any error happens, restore previous flags */
        if (err)
            tr->flags = orig_flags;
        kfree(tlinks);
        return err;
    }

Note: this implementation of bpf_trampoline_update differs slightly from the one in the 5.15 kernel discussed earlier.

The call to kallsyms_lookup_size_offset is hidden inside another function. We cannot see that function in the function graph because the compiler inlined it.

It looks like kallsyms_lookup_size_offset is called by ftrace_location:

    unsigned long ftrace_location(unsigned long ip)
    {
        struct dyn_ftrace *rec;
        unsigned long offset;
        unsigned long size;

        rec = lookup_rec(ip, ip);
        if (!rec) {
            if (!kallsyms_lookup_size_offset(ip, &size, &offset))
                goto out;

            /* map sym+0 to __fentry__ */
            if (!offset)
                rec = lookup_rec(ip, ip + size - 1);
        }

        if (rec)
            return rec->ip;

    out:
        return 0;
    }
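
To make the "map sym+0 to __fentry__" step concrete, here is a minimal user-space sketch of the same control flow. It is not kernel code: the types, the function names prefixed with model_, and the flat arrays standing in for the ftrace record table and for kallsyms are all assumptions made for illustration, and the record lookup is simplified to an exact address-range match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kallsyms entry; not a kernel type. */
struct model_symbol {
    unsigned long start; /* address of sym+0 */
    unsigned long size;  /* symbol size in bytes */
};

/* Return the first ftrace patch-site address inside [lo, hi], or 0. */
static unsigned long model_lookup_rec(const unsigned long *sites, size_t n,
                                      unsigned long lo, unsigned long hi)
{
    assert(lo <= hi);
    for (size_t i = 0; i < n; i++)
        if (sites[i] >= lo && sites[i] <= hi)
            return sites[i];
    return 0;
}

/* Follows the same steps as the ftrace_location() listing above. */
static unsigned long model_ftrace_location(const unsigned long *sites,
                                           size_t n_sites,
                                           const struct model_symbol *syms,
                                           size_t n_syms, unsigned long ip)
{
    /* Step 1: is ip itself a patch site? */
    unsigned long hit = model_lookup_rec(sites, n_sites, ip, ip);
    if (hit)
        return hit;

    /* Step 2: find the symbol covering ip (kallsyms_lookup_size_offset). */
    for (size_t i = 0; i < n_syms; i++) {
        if (ip < syms[i].start || ip >= syms[i].start + syms[i].size)
            continue;
        /* Only sym+0 (offset == 0) is mapped to the patch site. */
        if (ip != syms[i].start)
            return 0;
        return model_lookup_rec(sites, n_sites, ip,
                                ip + syms[i].size - 1);
    }
    return 0; /* no symbol covers ip */
}
```

With a patch site at a made-up address 0x1004 inside a symbol starting at 0x1000, both 0x1000 and 0x1004 resolve to 0x1004, while 0x1008 resolves to 0. A non-zero result is what makes register_fentry treat the address as managed by ftrace.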

ftrace_location is called by register_fentry, which, right after that call, checks the fops field of struct bpf_trampoline *tr.

    /* first time registering */
    static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
    {
        void *ip = tr->func.addr;
        unsigned long faddr;
        int ret;

        faddr = ftrace_location((unsigned long)ip);
        if (faddr) {
            if (!tr->fops)
                return -ENOTSUPP;
            tr->func.ftrace_managed = true;
        }

        if (bpf_trampoline_module_get(tr))
            return -ENOENT;

        if (tr->func.ftrace_managed) {
            ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 1);
            ret = register_ftrace_direct_multi(tr->fops, (long)new_addr);
        } else {
            ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
        }

        if (ret)
            bpf_trampoline_module_put(tr);
        return ret;
    }

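Indeed, if the address is managed by ftrace and tr->fops is NULL, the function returns -ENOTSUPP. ENOTSUPP is the kernel-internal error code 524, which matches the "os error 524" reported when attaching the program.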
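
The early return can be isolated in a small user-space model. This is only a sketch: model_register_fentry_check and model_describe_error are hypothetical names, and MODEL_ENOTSUPP simply copies the value 524 that the kernel defines in include/linux/errno.h. User-space errno.h has no such constant, which is why the loader can only print a generic message for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Value of the kernel-internal ENOTSUPP; user-space <errno.h> lacks it. */
#define MODEL_ENOTSUPP 524

/*
 * Hypothetical model of the first check in register_fentry():
 * an ftrace-managed address (faddr != 0) without an allocated
 * ftrace_ops (has_fops == false) is rejected. A return of 0 only
 * means this check passed, not that the attach would succeed.
 */
static int model_register_fentry_check(unsigned long faddr, bool has_fops)
{
    if (faddr && !has_fops)
        return -MODEL_ENOTSUPP;
    return 0;
}

/* Give the kernel-internal code a readable name; defer to libc otherwise. */
static const char *model_describe_error(int err)
{
    assert(err <= 0);
    if (err == 0)
        return "ok";
    if (err == -MODEL_ENOTSUPP)
        return "ENOTSUPP (kernel-internal, 524)";
    return strerror(-err);
}
```

Calling model_register_fentry_check(0x1004UL, false) yields -524, the situation on a kernel built without CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS.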

Let's find out where tr->fops is initialized.

If we are right, the trampoline should be created inside the bpf_trampoline_lookup function.

    static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
    {
        struct bpf_trampoline *tr;
        struct hlist_head *head;
        int i;

        mutex_lock(&trampoline_mutex);
        head = &trampoline_table[hash_64(key, TRAMPOLINE_HASH_BITS)];
        hlist_for_each_entry(tr, head, hlist) {
            if (tr->key == key) {
                refcount_inc(&tr->refcnt);
                goto out;
            }
        }
        tr = kzalloc(sizeof(*tr), GFP_KERNEL);
        if (!tr)
            goto out;
    #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
        tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL);
        if (!tr->fops) {
            kfree(tr);
            tr = NULL;
            goto out;
        }
        tr->fops->private = tr;
        tr->fops->ops_func = bpf_tramp_ftrace_ops_func;
    #endif

        tr->key = key;
        INIT_HLIST_NODE(&tr->hlist);
        hlist_add_head(&tr->hlist, head);
        refcount_set(&tr->refcnt, 1);
        mutex_init(&tr->mutex);
        for (i = 0; i < BPF_TRAMP_MAX; i++)
            INIT_HLIST_HEAD(&tr->progs_hlist[i]);
    out:
        mutex_unlock(&trampoline_mutex);
        return tr;
    }

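After the allocation, the fops field of the trampoline is populated only when CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is enabled. That option depends on HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS, which aarch64 does not provide in the kernels examined here.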
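
A quick way to confirm this on a given machine is to look for the option in the kernel build configuration. The helper below is a minimal sketch that scans configuration text already loaded into memory. config_option_enabled is a hypothetical name, and reading the text itself is left out because its location varies: /boot/config-<release> or a compressed /proc/config.gz are common, but neither is guaranteed to exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if config_text has a line that is exactly "<option>=y". */
static bool config_option_enabled(const char *config_text, const char *option)
{
    assert(config_text != NULL && option != NULL);

    size_t opt_len = strlen(option);
    const char *line = config_text;

    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        /* Exact match avoids confusing FOO with FOO_WITH_SUFFIX. */
        if (len == opt_len + 2 &&
            strncmp(line, option, opt_len) == 0 &&
            line[opt_len] == '=' && line[opt_len + 1] == 'y')
            return true;

        if (!eol)
            break;
        line = eol + 1;
    }
    return false;
}
```

For example, config_option_enabled(text, "CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS") returns false for a configuration that lacks the line, in which case tr->fops stays NULL and the attach fails as described above.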

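Conclusion

As things stand, BPF LSM cannot be used on aarch64 because the ftrace direct calls feature is missing. Fortunately, a [patch](https://lore.kernel.org/bpf/20230207182135.2671106-5-revest@chromium.org/T/) has already been merged into the current mainline branch, and it will be available in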

Article source:

Author: CFC4N
Link: https://www.cnxct.com/exploring-bpf-lsm-support-on-aarch64-with-ftrace/