Back to Linux Kernel

See Also: Startup Process, sysctl, TCP

Linux Kernel Parameters

The Linux kernel exposes many parameters that have a large impact on system performance and reliability. This page collects and organizes problems encountered in day-to-day work and study, together with their solutions.

1. Setting Kernel Parameters

Kernel behavior can be controlled by passing parameters to it. There are three ways to do so:

1.1. Building the kernel

See also: https://wiki.archlinux.org/index.php/Kernels/Traditional_compilation

1.2. Starting the kernel

See also: https://wiki.archlinux.org/index.php/Kernels/Traditional_compilation

1.2.1. GRUB

1.3. At runtime

See also: https://wiki.archlinux.org/index.php/Kernels/Traditional_compilation

1.3.1. Accessing the Linux kernel through the /sys filesystem

/sys/kernel

This is the location of the kernel's tunable parameters. At present only a few newer subsystems use it, such as uevent_helper, kexec_loaded, mm, and the new-style slab allocator.

The sysctl (/proc/sys/kernel) interface

Other tunable kernel parameters

1.3.2. sysctl.conf

/etc/sysctl.conf holds system-wide parameter settings.

The sysctl command can be used to view and update the settings in /etc/sysctl.conf.
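As a minimal sketch (the parameter name is just an example), a value can be read, changed at runtime, and then persisted like this:

```shell
# Read a parameter (the dotted name mirrors the /proc/sys path):
sysctl net.ipv4.ip_forward
cat /proc/sys/net/ipv4/ip_forward

# Change it for the running kernel only (lost on reboot):
sudo sysctl -w net.ipv4.ip_forward=0

# Persist it: add the line to /etc/sysctl.conf, then reload the file:
echo 'net.ipv4.ip_forward = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```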

1.3.3. limits.conf

/etc/security/limits.conf is actually the configuration file for pam_limits.so, part of Linux PAM (Pluggable Authentication Modules).

* soft nofile 1024000
* hard nofile 1024000

The pam_limits.so module applies ulimit limits, nice priority and number of simultaneous login sessions limit to user login sessions. This description of the configuration file syntax applies to the /etc/security/limits.conf file and *.conf files in the /etc/security/limits.d directory.

A login session cannot exceed the values configured here; this prevents sessions from overwhelming the system's capacity and gives system resources a measure of protection.

## View the system-wide limit: /proc/sys/fs/file-max
$ cat /proc/sys/fs/file-max
1024000
## View the number of file handles currently in use
$ cat /proc/sys/fs/file-nr
1056    0       1024000
## Count open handles per process (column 2 of lsof output is the PID)
$ lsof -n |awk '{print $2}'|sort|uniq -c |sort -nr|more
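The per-session limits that limits.conf produces can be checked from a login shell with ulimit (a sketch; the actual values depend on your configuration):

```shell
# Soft and hard limits on open files for the current session:
ulimit -Sn
ulimit -Hn
# -a prints all the limits pam_limits can control (files, processes, memory, ...)
ulimit -a
```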

2. Kernel Parameters

2.1. /proc/sys/kernel

See also: https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

==============================================================

core_uses_pid:

The default coredump filename is "core".  By setting
core_uses_pid to 1, the coredump filename becomes core.PID.
If core_pattern does not include "%p" (default does not)
and core_uses_pid is set, then .PID will be appended to
the filename.
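For example (a sketch; changing the setting requires root):

```shell
# Append the PID to core-dump filenames:
sudo sysctl -w kernel.core_uses_pid=1
# If this shows just "core" (no %p), dumps will then appear as core.<pid>:
cat /proc/sys/kernel/core_pattern
```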

==============================================================

shmall:

This parameter sets the total amount of shared memory pages that
can be used system wide. Hence, SHMALL should always be at least
ceil(shmmax/PAGE_SIZE).

If you are not sure what the default PAGE_SIZE is on your Linux
system, you can run the following command:

# getconf PAGE_SIZE

==============================================================

shmmax:

This value can be used to query and set the run time limit
on the maximum shared memory segment size that can be created.
Shared memory segments up to 1Gb are now supported in the
kernel.  This value defaults to SHMMAX.
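The relationship above can be checked with a small sketch that computes ceil(shmmax/PAGE_SIZE):

```shell
#!/bin/sh
# Minimum consistent SHMALL (in pages) for a given SHMMAX (in bytes):
# ceil(shmmax / page_size)
shmall_for() {
    shmmax=$1
    page_size=$2
    echo $(( (shmmax + page_size - 1) / page_size ))
}

# e.g. SHMMAX = 64 GB with 4 KB pages:
shmall_for 68719476736 4096    # -> 16777216 pages
# Or with the live values on a Linux box:
# shmall_for "$(cat /proc/sys/kernel/shmmax)" "$(getconf PAGE_SIZE)"
```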

2.1.1. Using sysrq

The sysrq key combination is a way to inspect the system's current state. Its configuration is fairly involved, and disabling it on production systems is recommended.

See also: https://www.kernel.org/doc/Documentation/sysrq.txt

Here is the list of possible values in /proc/sys/kernel/sysrq:
   0 - disable sysrq completely
   1 - enable all functions of sysrq
  >1 - bitmask of allowed sysrq functions (see below for detailed function
       description):
          2 =   0x2 - enable control of console logging level
          4 =   0x4 - enable control of keyboard (SAK, unraw)
          8 =   0x8 - enable debugging dumps of processes etc.
         16 =  0x10 - enable sync command
         32 =  0x20 - enable remount read-only
         64 =  0x40 - enable signalling of processes (term, kill, oom-kill)
        128 =  0x80 - allow reboot/poweroff
        256 = 0x100 - allow nicing of all RT tasks
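For instance, to allow only the sync (0x10) and remount-read-only (0x20) functions, set the bitmask to 16+32 = 48 (a sketch; requires root):

```shell
# Runtime only:
echo 48 | sudo tee /proc/sys/kernel/sysrq
# Or persistently, in /etc/sysctl.conf:
#   kernel.sysrq = 48
```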

2.2. /proc/sys/fs

See also: https://www.kernel.org/doc/Documentation/sysctl/fs.txt

2.2.1. /proc/sys/fs

Use fs.file-max to set the system-wide maximum number of file handles; use fs.file-nr to view the number of handles currently allocated, the number allocated but unused, and the maximum.

file-max & file-nr:

The value in file-max denotes the maximum number of file-
handles that the Linux kernel will allocate. When you get lots
of error messages about running out of file handles, you might
want to increase this limit.

Historically, the kernel was able to allocate file handles
dynamically, but not to free them again. The three values in
file-nr denote the number of allocated file handles, the number
of allocated but unused file handles, and the maximum number of
file handles. Linux 2.6 always reports 0 as the number of free
file handles -- this is not an error, it just means that the
number of allocated file handles exactly matches the number of
used file handles.

Attempts to allocate more file descriptors than file-max are
reported with printk, look for "VFS: file-max limit <number>
reached".
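A sketch of raising the limit at runtime and re-checking the three file-nr counters described above (requires root):

```shell
sudo sysctl -w fs.file-max=1024000
# allocated handles, allocated-but-unused (always 0 since 2.6), maximum:
cat /proc/sys/fs/file-nr
```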

2.2.2. /proc/sys/fs/mqueue

Settings related to the POSIX message queues filesystem.

$ sudo sysctl -a -e |grep fs.mqueue
fs.mqueue.msg_default = 10
fs.mqueue.msg_max = 10
fs.mqueue.msgsize_default = 8192
fs.mqueue.msgsize_max = 8192
fs.mqueue.queues_max = 256
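Besides sysctl, the mqueue filesystem itself can be mounted to inspect existing queues (a sketch; the mount point is arbitrary and mounting requires root):

```shell
sudo mkdir -p /dev/mqueue
sudo mount -t mqueue none /dev/mqueue
ls -l /dev/mqueue    # one entry per POSIX message queue
```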

2.3. /proc/sys/net/

See also:

2.3.1. Socket buffer

/proc/sys/net/core/rmem_max
/proc/sys/net/core/rmem_default
/proc/sys/net/core/wmem_max
/proc/sys/net/core/wmem_default
/proc/sys/net/ipv4/tcp_mem
/proc/sys/net/ipv4/tcp_rmem
/proc/sys/net/ipv4/tcp_wmem

socket_buffer_memory_allocation.png
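The current values can be listed together; note that tcp_rmem/tcp_wmem are min/default/max triples in bytes, while the net.core values are single byte counts (a sketch):

```shell
sysctl net.core.rmem_default net.core.rmem_max \
       net.core.wmem_default net.core.wmem_max
sysctl net.ipv4.tcp_mem net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# Buffers set explicitly with setsockopt(SO_RCVBUF/SO_SNDBUF) are capped by
# the net.core.*_max values; TCP autotuning instead grows within the
# tcp_rmem/tcp_wmem ranges.
```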

2.3.2. TCP/IP

See the TCP protocol notes for background. See also:

ip_local_port_range - 2 INTEGERS
        Defines the local port range that is used by TCP and UDP to
        choose the local port. The first number is the first, the
        second the last local port number.
        If possible, these numbers should have different parity
        (one even and one odd value).
        The default values are 32768 and 60999 respectively.
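For example, to widen the ephemeral range (a sketch; both bounds are inclusive and must be passed as one quoted value):

```shell
sudo sysctl -w net.ipv4.ip_local_port_range="10000 65000"
# or in /etc/sysctl.conf:
#   net.ipv4.ip_local_port_range = 10000 65000
```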

TCP variables:

somaxconn - INTEGER
        Limit of socket listen() backlog, known in userspace as SOMAXCONN.
        Defaults to 128.  See also tcp_max_syn_backlog for additional tuning
        for TCP sockets.

tcp_max_orphans - INTEGER
        Maximal number of TCP sockets not attached to any user file handle,
        held by system. If this number is exceeded orphaned connections are
        reset immediately and warning is printed. This limit exists
        only to prevent simple DoS attacks, you _must_ not rely on this
        or lower the limit artificially, but rather increase it
        (probably, after increasing installed memory),
        if network conditions require more than default value,
        and tune network services to linger and kill such states
        more aggressively. Remember: each orphan eats
        up to ~64K of unswappable memory.

tcp_max_syn_backlog - INTEGER
        Maximal number of remembered connection requests, which have not
        received an acknowledgment from connecting client.
        The minimal value is 128 for low memory machines, and it will
        increase in proportion to the memory of machine.
        If server suffers from overload, try increasing this number.

tcp_max_tw_buckets - INTEGER
        Maximal number of timewait sockets held by system simultaneously.
        If this number is exceeded time-wait socket is immediately destroyed
        and warning is printed. This limit exists only to prevent
        simple DoS attacks, you _must_ not lower the limit artificially,
        but rather increase it (probably, after increasing installed memory),
        if network conditions require more than default value.

tcp_mem - vector of 3 INTEGERs: min, pressure, max
        min: below this number of pages TCP is not bothered about its
        memory appetite.

        pressure: when amount of memory allocated by TCP exceeds this number
        of pages, TCP moderates its memory consumption and enters memory
        pressure mode, which is exited when memory consumption falls
        under "min".

        max: number of pages allowed for queueing by all TCP sockets.

        Defaults are calculated at boot time from amount of available
        memory.

tcp_tw_recycle - BOOLEAN
        Enable fast recycling TIME-WAIT sockets. Default value is 0.
        It should not be changed without advice/request of technical
        experts.

tcp_tw_reuse - BOOLEAN
        Allow to reuse TIME-WAIT sockets for new connections when it is
        safe from protocol viewpoint. Default value is 0.
        It should not be changed without advice/request of technical
        experts.


tcp_keepalive_time - INTEGER
        How often TCP sends out keepalive messages when keepalive is enabled.
        Default: 2hours.

tcp_keepalive_probes - INTEGER
        How many keepalive probes TCP sends out, until it decides that the
        connection is broken. Default value: 9.

tcp_keepalive_intvl - INTEGER
        How frequently the probes are sent out. Multiplied by
        tcp_keepalive_probes it is time to kill not responding connection,
        after probes started. Default value: 75sec i.e. connection
        will be aborted after ~11 minutes of retries.
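Putting the three keepalive knobs together: the time to detect a dead peer is tcp_keepalive_time + tcp_keepalive_probes × tcp_keepalive_intvl. A sketch with the defaults:

```shell
#!/bin/sh
# Seconds from the last data until an unresponsive connection is killed:
keepalive_timeout() {
    time=$1
    probes=$2
    intvl=$3
    echo $(( time + probes * intvl ))
}

keepalive_timeout 7200 9 75    # defaults: 7200 + 9*75 = 7875 s (~2h11m)
```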

3. Kernel Parameter Tuning

vi /etc/sysctl.conf && sysctl -p

3.1. Linux kernel tuning parameters circulating on the web

# TCP tuning
vi /etc/sysctl.conf
# Disable IP forwarding (this host is not a router)
net.ipv4.ip_forward = 0
# Enable reverse-path (source address) validation
net.ipv4.conf.default.rp_filter = 1
# Reject all IP source-routed packets
net.ipv4.conf.default.accept_source_route = 0
# The sysrq key combination exposes system state; set to 0 to disable for safety
kernel.sysrq = 0
# Whether to append the PID to core-dump filenames
kernel.core_uses_pid = 1
# Enable SYN cookies to cope with SYN-queue overflow
net.ipv4.tcp_syncookies = 1
# Maximum size of a single message queue, in bytes
kernel.msgmnb = 65536
# Maximum size of a single message, in bytes (the widely copied comment
# calling this a system-wide queue-count limit is wrong)
kernel.msgmax = 65536
# Maximum size of a single shared-memory segment, in bytes: 64*1024*1024*1024 (64 GB)
kernel.shmmax = 68719476736
# Total shared memory system-wide, in pages (1 page = 4 KB). Note: this
# widely copied value equals 16 TB; for 16 GB the correct value would be
# 16*1024*1024*1024/4096 = 4194304
kernel.shmall = 4294967296
# Maximum number of TIME-WAIT sockets; the default is 180000
net.ipv4.tcp_max_tw_buckets = 6000
# Enable selective acknowledgments (SACK)
net.ipv4.tcp_sack = 1
# Support TCP windows larger than 65535 (64 KB); must be 1 for large windows
net.ipv4.tcp_window_scaling = 1
# TCP read buffer: min, default, max in bytes
net.ipv4.tcp_rmem = 4096 131072 1048576
# TCP write buffer: min, default, max in bytes
net.ipv4.tcp_wmem = 4096 131072 1048576
# Default send-buffer memory reserved per TCP socket, in bytes
net.core.wmem_default = 8388608
# Maximum send-buffer memory per TCP socket, in bytes
net.core.wmem_max = 16777216
# Default receive-buffer memory reserved per TCP socket, in bytes
net.core.rmem_default = 8388608
# Maximum receive-buffer memory per TCP socket, in bytes
net.core.rmem_max = 16777216
# Maximum number of packets queued per interface when packets arrive faster
# than the kernel can process them
net.core.netdev_max_backlog = 262144
# A web application's listen() backlog is capped by net.core.somaxconn
# (default 128), while nginx's NGX_LISTEN_BACKLOG defaults to 511, so this
# value needs to be raised
net.core.somaxconn = 262144
# Maximum number of TCP sockets not attached to any user file handle. This
# limit only guards against simple DoS attacks; do not rely on it or lower
# it artificially -- rather increase it (after adding memory)
net.ipv4.tcp_max_orphans = 3276800
# Maximum number of remembered connection requests that have not yet been
# acknowledged by the client; the default is 1024 on a 128 MB system and
# 128 on small-memory systems
net.ipv4.tcp_max_syn_backlog = 262144
# Timestamps protect against sequence-number wraparound, which a 1 Gbps link
# will certainly hit; they let the kernel accept such "anomalous" packets.
# Disabled here
net.ipv4.tcp_timestamps = 0
# Number of SYN+ACK retransmissions (the second step of the three-way
# handshake) before the kernel gives up on the connection
net.ipv4.tcp_synack_retries = 1
# Number of SYN retransmissions before the kernel gives up establishing
# a connection
net.ipv4.tcp_syn_retries = 1
# Enable fast recycling of TIME-WAIT sockets
net.ipv4.tcp_tw_recycle = 1
# Allow TIME-WAIT sockets to be reused for new TCP connections
net.ipv4.tcp_tw_reuse = 1
# TCP memory, in pages: below the 1st value TCP is under no memory pressure;
# above the 2nd it enters memory-pressure mode; at the 3rd it refuses to
# allocate new sockets
net.ipv4.tcp_mem = 94500000 915000000 927000000
# How long a locally closed socket stays in FIN-WAIT-2. The peer may
# misbehave and never close the connection, or even crash. The default is
# 60 s (2.2 kernels used 180 s). Even a lightly loaded web server risks
# running out of memory from dead sockets; FIN-WAIT-2 is less dangerous
# than FIN-WAIT-1 because it eats at most 1.5 KB, but it lives longer
net.ipv4.tcp_fin_timeout = 15
# How often TCP sends keepalive messages when keepalive is enabled, in seconds
net.ipv4.tcp_keepalive_time = 30
# Local port range for outbound connections
net.ipv4.ip_local_port_range = 2048 65000
# Maximum number of file handles
fs.file-max = 102400

3.2. Production high-concurrency settings @ 4.4.5-15.26.amzn1.x86_64 / EC2 Large (4 cores, 16 GB)

# liyan add @20160905
# System-wide maximum file handles; sysctl -n fs.file-nr shows the current,
# unreleased, and maximum values
fs.file-max = 1024000
#############################################
# Increase Socket(TCP&UDP) Buffer Sizes
#############################################
# The maximum receive socket buffer size in bytes
net.core.rmem_max = 26214400
# The maximum send socket buffer size in bytes
net.core.wmem_max = 26214400
# The default values for receive/send socket buffer
net.core.rmem_default = 26214400
net.core.wmem_default = 26214400
#############################################
# TCP Tuning Buffer Sizes and so on
#############################################
# buffered by the network card
net.core.netdev_max_backlog = 262144
# Limit of socket listen() backlog
net.core.somaxconn = 262144
# each 16384 sockets cost ~1G Memory
net.ipv4.tcp_max_orphans = 65536
# Maximal number of remembered connection requests
net.ipv4.tcp_max_syn_backlog = 262144
# don't go back after an idle period
net.ipv4.tcp_slow_start_after_idle = 0
# Maximal number of TIME-WAIT sockets (since Linux 4.1, TIME-WAIT tracking
# has been reworked for performance and parallelism; the death row is now
# just a hash table)
net.ipv4.tcp_max_tw_buckets = 200000
# Fast recycling of TIME-WAIT sockets; harmful behind NAT devices
net.ipv4.tcp_tw_recycle = 0
# Allow reuse of TIME-WAIT sockets for new connections; useless on the server side
net.ipv4.tcp_tw_reuse = 1
# TCP Keepalive Control
net.ipv4.tcp_fin_timeout = 8
# Number of SYN retransmissions before the kernel gives up; the default is 6
net.ipv4.tcp_syn_retries = 3
# Number of SYN+ACK retransmissions before the kernel gives up; the default is 5
net.ipv4.tcp_synack_retries = 2
# Enable timestamps as defined in RFC 1323, TCP Extensions for High Performance
net.ipv4.tcp_timestamps = 1
# TCP Memory Control
net.ipv4.tcp_mem = 131072  262144  524288
net.ipv4.tcp_rmem = 8760  256960  4088000
net.ipv4.tcp_wmem = 8760  256960  4088000
# default is 7200s
net.ipv4.tcp_keepalive_time = 30
# default 9 times
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.ip_local_port_range = 8000 65000
# disable IPv6
net.ipv6.conf.all.disable_ipv6 = 1


# vi /etc/sysctl.conf
# sysctl -p
# sysctl -a |grep tcp_max_tw_buckets
net.ipv4.tcp_max_tw_buckets = 6000

3.3. 测试机高并发@OS X 10.11.5 (15F34)

The following configuration allows a MacBook Pro to open 1000+ connections.

➜  ~ uname -r
15.5.0
➜  ~ cat /etc/sysctl.conf 
# https://rolande.wordpress.com/2014/05/17/performance-tuning-the-network-stack-on-mac-os-x-part-2/
kern.maxfiles=65536
kern.maxfilesperproc=65536
kern.ipc.somaxconn=2048
net.inet.tcp.rfc1323=1
net.inet.tcp.win_scale_factor=4
net.inet.tcp.sendspace=1042560
net.inet.tcp.recvspace=1042560
net.inet.tcp.mssdflt=1448
net.inet.tcp.v6mssdflt=1412
net.inet.tcp.msl=15000
net.inet.tcp.always_keepalive=0
net.inet.tcp.delayed_ack=3
net.inet.tcp.slowstart_flightsize=20
net.inet.tcp.local_slowstart_flightsize=9
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.icmp.icmplim=50

➜  ~ ulimit -n 10240
➜  stsdk wrk -t4 -c1000 -d100s --timeout 30s --latency -sreport.lua --latency http://t.li3huo.com/
...
  4 threads and 1000 connections
...
  Latency Distribution
     50%   64.56ms
     75%  198.56ms
     90%  434.45ms
     99%    1.29s
  923659 requests in 1.67m, 400.60MB read
Requests/sec:   9232.56
Transfer/sec:      4.01MB

3.3.1. Change ulimit settings

https://blogs.progarya.dk/blog/how-to-persist-ulimit-settings-in-osx/

For 10.9 (Mavericks), 10.10 (Yosemite), 10.11 (El Capitan), and 10.12 (Sierra): you have to create a file at /Library/LaunchDaemons/limit.maxfiles.plist (owner: root:wheel, mode: 0644):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
        "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>limit.maxfiles</string>
    <key>ProgramArguments</key>
    <array>
      <string>launchctl</string>
      <string>limit</string>
      <string>maxfiles</string>
      <string>262144</string>
      <string>524288</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>ServiceIPC</key>
    <false/>
  </dict>
</plist>

➜  ~ sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
Password:
➜  ~ ulimit -n                                                         
1024
➜  ~ launchctl limit maxfiles
        maxfiles    262144         524288

Note: the already-running shell still reports its old soft limit (1024); open a new shell session to pick up the raised maxfiles limit.

4. Reference


CategoryLinux

MainWiki: Kernel_Parameters (last edited 2016-09-05 00:18:32 by twotwo)