0. Environment

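Recently at work we ran into a problem that tormented us for more than half a month. Our storage server runs CentOS 7 with Linux kernel 3.10.0-229.el7.x86_64 and exports block storage through a QLogic FC HBA. The server also provides a snapshot service that takes a snapshot every 15 seconds. With the initiator continuously issuing heavy I/O against a snapshotted volume, long-running tests ended in a kernel crash.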

1. Symptoms

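We also found that the same load over iSCSI caused no problem. As soon as the storage pool was accessed through the QLogic HBA, however, the crash reproduced every time after 6 to 10 hours of running. At the same time, dmesg showed many list_del corruption warnings. Part of the log: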

<4>1 2017-11-20T16:13:18.725115+08:00 localhost kernel - - - list_del corruption, ffff881d52b36890->next is LIST_POISON1 (dead000000100100)
<4>1 2017-11-20T16:13:18.725148+08:00 localhost kernel - - - Modules linked in: target_core_file nfsd auth_rpcgss nfs_acl lockd sunrpc fuse arxcis(OF) ext4 mbcache jbd2 dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_filter tun bridge stp llc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid1 ses enclosure iscsi_target_mod(OF) tcm_qla2xxx(F) coretemp kvm_intel iTCO_wdt iTCO_vendor_support kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr mei_me lpc_ich mei mfd_core i2c_i801 shpchp wmi qla2xxx(F) target_core_iblock target_core_pscsi target_core_mod acpi_power_meter acpi_pad scsi_transport_fc
<4>1 2017-11-20T16:13:18.727391+08:00 localhost kernel - - -  scsi_tgt sg(F) PlxSvc_dbg(OF) Plx8000_NT_dbg(OF) Plx8000_DMA_dbg(OF) ipmi_watchdog(F) ipmi_poweroff(F) ipmi_si(F) ipmi_devintf(F) ipmi_msghandler(F) blktap(OF) uinput ip_tables xfs libcrc32c sd_mod crc_t10dif ast syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ixgbe ahci libahci igb mpt3sas libata crct10dif_pclmul mdio crct10dif_common crc32c_intel ptp raid_class pps_core scsi_transport_sas i2c_algo_bit i2c_core dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: arxcis]
<4>1 2017-11-20T16:13:18.727418+08:00 localhost kernel - - - CPU: 30 PID: 262 Comm: kworker/u66:1 Tainted: GF          O--------------   3.10.0-229.el7.x86_64+ #1

2. Analysis

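The kernel log shows that list_del tried to remove a node that had already been removed: LIST_POISON1 is the value list_del writes into the next pointer of an entry once it has been unlinked, so finding it there means a second delete of the same entry. The kernel reports this as list corruption. This list_del is called from the target_tmr_work work item in the target_core_mod driver, which is triggered after the client detects an I/O timeout and sends a LUN Reset request.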
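
To make the warning concrete, here is a minimal user-space sketch of a doubly linked list with a poison check on delete. It is not the kernel's implementation: `checked_list_del` is a hypothetical name, the checks are simplified, and the poison constants are the x86_64 values (the first one matches the address printed in the log above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Poison values written into an entry after it has been unlinked
 * (x86_64 values; LIST_POISON1 is the one shown in the log). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000100100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000200200ULL)

/* Make an empty circular list: the head points at itself. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Hypothetical helper: unlink entry, refusing to touch the list if the
 * entry looks already deleted. Returns true if the entry was unlinked. */
static bool checked_list_del(struct list_head *entry)
{
    assert(entry != NULL);

    if (entry->next == LIST_POISON1) {
        fprintf(stderr, "list_del corruption, %p->next is LIST_POISON1\n",
                (void *)entry);
        return false;
    }
    if (entry->prev == LIST_POISON2) {
        fprintf(stderr, "list_del corruption, %p->prev is LIST_POISON2\n",
                (void *)entry);
        return false;
    }

    /* Normal unlink, then poison the entry so a second delete is caught. */
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return true;
}
```

Calling `checked_list_del` twice on the same entry succeeds the first time and reports the corruption the second time, which is the pattern in the log: two code paths both believe they own the same list entry.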

See SPC-4 (spc4r11):

5.6.10.4.2 Failed persistent reservation preempt

If the preempting I_T nexus’ PREEMPT service action or PREEMPT AND ABORT service action fails (e.g., repeated TASK SET FULL status, repeated BUSY status, SCSI transport protocol time-out, or time-out due to the task set being blocked due to failed initiator port or failed SCSI initiator device), the application client may send a LOGICAL UNIT RESET task management function to the failing logical unit to remove blocking tasks and then reissue the preempting service action.

And the SCSI Primary Commands specification:

5.5.1 Reservations overview
Reservations may be used to allow a device server to execute commands from a selected set of initiator ports and reject commands from initiator ports outside the selected set of initiator ports. The device server uniquely identifies initiator ports using protocol specific mechanisms. Application clients may add or remove initiator ports from the selected set using reservation commands.

In particular, note the following sentence:

If the application clients do not cooperate in the reservation protocol, data may be unexpectedly modified and deadlock conditions may occur.

The scope of a reservation shall be one of the following:
a) Logical unit reservations – a logical unit reservation restricts access to the entire logical unit; and
b) Element reservations

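From the text above, the LUN Reset issued by the client presupposes that the client follows the reservation protocol: the kernel driver and the application have to cooperate for reservations to work properly. Otherwise data may be modified unexpectedly, leading to deadlocks or other failures. The symptom in our case also looks like data being modified unexpectedly and ending in kernel list corruption, so we inferred that the reservation commands issued inside the client are incompatible with our storage server.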

3. Workaround

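Based on this analysis and hypothesis, one feasible workaround is to stop the storage server from acting on LUN Reset for the time being. We changed the following code in LIO's target_core_mod driver accordingly: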

@@ -2870,7 +2870,10 @@ static void target_tmr_work(struct work_struct *work)
         tmr->response = TMR_TASK_MGMT_FUNCTION_NOT_SUPPORTED;
         break;
     case TMR_LUN_RESET:
-        ret = core_tmr_lun_reset(dev, tmr, NULL, NULL);
+        //ret = core_tmr_lun_reset(dev, tmr, NULL, NULL);
+        //tmr->response = (!ret) ? TMR_FUNCTION_COMPLETE :
+        //        TMR_FUNCTION_REJECTED;
+        ret = TMR_FUNCTION_REJECTED;
         tmr->response = (!ret) ? TMR_FUNCTION_COMPLETE :
                      TMR_FUNCTION_REJECTED;
         break;
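
The patch above relies on a small detail: `ret` is set to `TMR_FUNCTION_REJECTED`, and the unchanged `(!ret) ? ... : ...` line then maps any nonzero `ret` to the rejected response. The following user-space sketch shows that mapping in isolation. All `demo_` names and the numeric values are placeholders for illustration, not the kernel's definitions, and the sketch assumes the real `TMR_FUNCTION_REJECTED` constant is nonzero.

```c
#include <assert.h>

/* Placeholder response codes for illustration only; the real TMR_*
 * values come from the kernel's target headers and may differ. */
enum demo_tmr_response {
    DEMO_TMR_FUNCTION_COMPLETE = 1,
    DEMO_TMR_FUNCTION_REJECTED = 5,
};

/* The unchanged mapping line: ret == 0 means success, anything else
 * is reported to the initiator as rejected. */
static int demo_response_from_ret(int ret)
{
    return (!ret) ? DEMO_TMR_FUNCTION_COMPLETE : DEMO_TMR_FUNCTION_REJECTED;
}

/* The patched branch: the reset itself is skipped and ret is forced to
 * a nonzero constant, so the mapping always yields "rejected". */
static int demo_patched_lun_reset(void)
{
    int ret = DEMO_TMR_FUNCTION_REJECTED;

    assert(ret != 0); /* the workaround depends on this being nonzero */
    return demo_response_from_ret(ret);
}

/* A more direct way to express the same behavior. */
static int demo_direct_reject(void)
{
    return DEMO_TMR_FUNCTION_REJECTED;
}
```

Under that assumption both forms produce the same response. Assigning the rejected response directly would state the intent of the workaround more plainly than routing a response code through `ret`.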

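We then rebuilt target_core_mod.ko, replaced the old module with it, and ran the same stress test for a long time; the problem disappeared. Even after raising the load to the maximum, it did not reproduce.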

4. Summary