The sysctl Command and Kernel Parameters

Last Updated: 2023-03-27 09:41:05 Monday


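The Linux kernel exposes many runtime parameters through the pseudo-filesystem /proc, specifically under /proc/sys. The sysctl command reads and writes these parameters, adjusting the kernel's behavior while it is running.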

A very interesting part of /proc is the directory /proc/sys. This is not only a source of information, it also allows you to change parameters within the kernel. Be very careful when attempting this. You can optimize your system, but you can also cause it to crash. Never alter kernel parameters on a production system. Set up a development machine and test to make sure that everything works the way you want it to. You may have no alternative but to reboot the machine once an error has been made.

The sysctl command

sysctl - configure kernel parameters at runtime

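Settings written to the default configuration file /etc/sysctl.conf remain in effect after a reboot.

After editing the configuration file, apply it without rebooting: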

$ sudo sysctl -p[FILE]

Show all current settings:

$ sudo sysctl -a

Query a single parameter with sysctl:

$ sudo sysctl kernel.printk
kernel.printk = 3       4       1       7

Or add -n to print only the value:

$ sudo sysctl -n kernel.hostname  # disable printing key
K

Modify a single parameter:

$ sudo sysctl -w kernel.printk='4 4 1 7'
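As a rough illustration of how sysctl keys relate to the files under /proc/sys, the sketch below maps a dotted key to its file and reads it. The helper names sysctl_path and read_sysctl are made up for this example, and treating every dot as a path separator is a simplifying assumption.

```python
from pathlib import Path

PROC_SYS = Path("/proc/sys")


def sysctl_path(key: str) -> Path:
    """Map a dotted sysctl key to its file under /proc/sys."""
    # Assumption: every dot in the key is a path separator.
    return PROC_SYS.joinpath(*key.split("."))


def read_sysctl(key: str):
    """Return the current value as text, or None if it cannot be read."""
    try:
        return sysctl_path(key).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key in ("kernel.printk", "net.core.somaxconn"):
        print(key, "=", read_sysctl(key))
```

Reading usually works for any user. Writing the same file is what sysctl -w does and needs root.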

kernel.printk

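Readable with cat /proc/sys/kernel/printk. It controls which printk message levels are written to the console (TTY), and which level is given to messages that do not specify one.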

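The four numbers of this parameter mean:

  1. The console log level: the upper bound for output to the TTY. Only messages with a level lower than this (that is, more severe) are shown on the TTY.
  2. The default message level, given to printk calls that use KERN_DEFAULT or no level macro at all.
  3. The minimum console log level: the lowest value to which the first field can be set.
  4. The boot-time default console log level.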

The result shows the current, default, minimum and boot-time-default log levels.
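To make the four fields concrete, here is a minimal sketch that parses the value shown by sysctl kernel.printk and checks whether a message of a given level would reach the console. The names PrintkLevels, parse_printk and reaches_console are hypothetical, chosen only for this illustration.

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    console: int          # messages with a level below this reach the console
    default_message: int  # level given to messages without an explicit one
    minimum_console: int  # lowest value the console level may be set to
    default_console: int  # boot-time default for the console level


def parse_printk(text: str) -> PrintkLevels:
    """Parse the whitespace-separated content of /proc/sys/kernel/printk."""
    fields = [int(f) for f in text.split()]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return PrintkLevels(*fields)


def reaches_console(message_level: int, levels: PrintkLevels) -> bool:
    """Lower numbers are more severe; only levels below the limit are shown."""
    return message_level < levels.console


if __name__ == "__main__":
    levels = parse_printk("3       4       1       7")
    print(levels)
    print(reaches_console(2, levels))  # KERN_CRIT is level 2
    print(reaches_console(4, levels))  # KERN_WARNING is level 4
```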

kernel.printk_ratelimit

Some warning messages are rate limited. printk_ratelimit specifies the minimum length of time between these messages (in seconds). The default value is 5 seconds.

A value of 0 will disable rate limiting.

kernel.printk_ratelimit_burst

While long term we enforce one message per printk_ratelimit seconds, we do allow a burst of messages to pass through. printk_ratelimit_burst specifies the number of messages we can send before ratelimiting kicks in.

The default value is 10 messages.

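On the meaning of burst: the kernel's limiter (___ratelimit() in lib/ratelimit.c) works on a time window. It lets at most printk_ratelimit_burst messages through in each window of printk_ratelimit seconds and suppresses the rest until the next window begins.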

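These two parameters apply to code that checks printk_ratelimit(). The pr_*_ratelimited family of macros uses the same mechanism, but each call site keeps its own state initialized with the compile-time defaults (5 seconds, 10 messages).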
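The interplay of the interval and the burst is easier to see in a small model. The sketch below is a simplified Python imitation of a windowed rate limiter, assuming the behavior "no more than burst messages in every interval". The class RateLimit and its method allow are illustrative names, not kernel APIs.

```python
class RateLimit:
    """Simplified model of a windowed rate limiter (interval + burst)."""

    def __init__(self, interval: float = 5.0, burst: int = 10):
        self.interval = interval  # window length in seconds
        self.burst = burst        # messages allowed per window
        self.begin = None         # start time of the current window
        self.printed = 0          # messages let through in this window

    def allow(self, now: float) -> bool:
        if self.interval == 0:
            return True  # an interval of 0 disables rate limiting
        if self.begin is None or now - self.begin >= self.interval:
            # The previous window is over: start a new one.
            self.begin = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        return False


if __name__ == "__main__":
    rl = RateLimit()
    # 25 messages, one every 100 ms, all inside the first 5-second window.
    passed = sum(rl.allow(i * 0.1) for i in range(25))
    print("passed in first window:", passed)
    print("first message of next window allowed:", rl.allow(5.0))
```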

kernel.randomize_va_space

Controls address space layout randomization (ASLR). See: Configuring ASLR on Linux

vm.swappiness

See: swappiness and swap space

net.core.somaxconn

somaxconn - INTEGER

Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 4096. (Was 128 before linux-5.4) See also tcp_max_syn_backlog for additional tuning for TCP sockets.

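listen() takes a backlog argument, and the queue length that actually takes effect is min(backlog, somaxconn). This parameter therefore caps both queues of each listening socket. (Many people online raise it to 16K, i.e. 16384.)

Each listening socket has two queues: the half-open connection queue (SYN queue) and the fully established connection queue (accept queue). Since kernel 4.3 the two queues have the same size, both set through the backlog argument of listen().

Use the ss command to inspect the queues of each socket. For a listening socket, ss -lnt shows the current accept-queue length in Recv-Q and its limit in Send-Q.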
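The min(backlog, somaxconn) rule can be sketched as follows. The function names effective_backlog and read_somaxconn are assumptions made for this example, and the fallback value of 4096 simply mirrors the documented default for kernels 5.4 and later.

```python
import socket


def effective_backlog(backlog: int, somaxconn: int) -> int:
    """Queue limit that takes effect: the smaller of the two values."""
    return min(backlog, somaxconn)


def read_somaxconn(default: int = 4096) -> int:
    """Read net.core.somaxconn; fall back to `default` if unavailable."""
    try:
        with open("/proc/sys/net/core/somaxconn") as f:
            return int(f.read())
    except (OSError, ValueError):
        return default


if __name__ == "__main__":
    requested = 16384
    limit = read_somaxconn()
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a port
            srv.listen(requested)       # the kernel silently caps this value
            print("requested:", requested)
            print("effective:", effective_backlog(requested, limit))
    except OSError as exc:
        print("could not open a listening socket:", exc)
```

The kernel caps the request silently, so an application asking for a large backlog gets no error when somaxconn is smaller.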

net.core.netdev_budget

Maximum number of packets taken from all interfaces in one polling cycle (NAPI poll). In one polling cycle interfaces which are registered to polling are probed in a round-robin manner. Also, a polling cycle may not exceed netdev_budget_usecs microseconds, even if netdev_budget has not been exhausted.

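That is, the maximum number of packets received within one poll cycle across all registered interfaces, which are serviced round-robin. The default appears to be 300.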

net.core.netdev_budget_usecs

Maximum number of microseconds in one NAPI polling cycle. Polling will exit when either netdev_budget_usecs have elapsed during the poll cycle or the number of packets processed reaches netdev_budget.

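The maximum duration of one poll cycle. The default appears to be 2000, i.e. 2 ms, although the value on a given system may differ, so check it with sysctl.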

net.core.netdev_max_backlog

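Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them. (The default appears to be 1000.)

In other words, it caps the receive backlog queue for the case where packets arrive faster than the kernel can process them.

A poll cycle ends as soon as either net.core.netdev_budget_usecs has elapsed or net.core.netdev_budget packets have been processed. Packets that were not picked up in that cycle stay queued for the next one. net.core.netdev_max_backlog caps the per-CPU input backlog queue.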

net.ipv4.ip_default_ttl

ip_default_ttl - INTEGER

Default value of TTL field (Time To Live) for outgoing (but not forwarded) IP packets. Should be between 1 and 255 inclusive. Default: 64 (as recommended by RFC1700)

The default TTL of IP packets sent by the protocol stack. (Raw IP packets are not bound by it.)

net.ipv4.ip_local_port_range

ip_local_port_range - 2 INTEGERS

Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first, the second the last local port number. If possible, it is better these numbers have different parity (one even and one odd value). Must be greater than or equal to ip_unprivileged_port_start. The default values are 32768 and 60999 respectively.

Port numbers that Linux assigns automatically are taken from this range.
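A minimal sketch of how the two integers of ip_local_port_range can be interpreted: it parses the value, counts the ports in the range, and checks the parity recommendation quoted above. The function names are invented for this illustration.

```python
from typing import Tuple


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse the content of ip_local_port_range into (first, last)."""
    first, last = (int(f) for f in text.split())
    if not 1 <= first <= last <= 65535:
        raise ValueError(f"invalid port range: {first}-{last}")
    return first, last


def port_count(first: int, last: int) -> int:
    """Number of ports in the inclusive range."""
    return last - first + 1


def has_different_parity(first: int, last: int) -> bool:
    """True if one bound is even and the other is odd."""
    return first % 2 != last % 2


if __name__ == "__main__":
    first, last = parse_port_range("32768\t60999")
    print(first, last, port_count(first, last), has_different_parity(first, last))
```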

net.ipv4.ip_unprivileged_port_start

ip_unprivileged_port_start - INTEGER

This is a per-namespace sysctl. It defines the first unprivileged port in the network namespace. Privileged ports require root or CAP_NET_BIND_SERVICE in order to bind to them. To disable all privileged ports, set this to 0. They must not overlap with the ip_local_port_range.

Default: 1024

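It defines the lowest port number an unprivileged program may bind. Processes running as root (for example via sudo) or holding CAP_NET_BIND_SERVICE are not restricted.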

net.ipv4.ip_local_reserved_ports

ip_local_reserved_ports - list of comma separated ranges

Specify the ports which are reserved for known third-party applications. These ports will not be used by automatic port assignments (e.g. when calling connect() or bind() with port number 0). Explicit port allocation behavior is unchanged.

The format used for both input and output is a comma separated list of ranges (e.g. 1,2-4,10-10 for ports 1, 2, 3, 4 and 10). Writing to the file will clear all previously reserved ports and update the current list with the one given in the input.

Note that ip_local_port_range and ip_local_reserved_ports settings are independent and both are considered by the kernel when determining which ports are available for automatic port assignments.

You can reserve ports which are not in the current ip_local_port_range, e.g.:

$ cat /proc/sys/net/ipv4/ip_local_port_range
32000       60999
$ cat /proc/sys/net/ipv4/ip_local_reserved_ports
8080,9148

although this is redundant. However such a setting is useful if later the port range is changed to a value that will include the reserved ports. Also keep in mind, that overlapping of these ranges may affect probability of selecting ephemeral ports which are right after block of reserved ports.

Default: Empty
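The comma-separated range format described above can be parsed with a few lines of code. The sketch below expands a list such as 1,2-4,10-10 into a set of ports and counts how many ports of a local port range remain available for automatic assignment. parse_port_list and assignable_ports are hypothetical helper names.

```python
def parse_port_list(text: str) -> set:
    """Expand a list like '1,2-4,10-10' into a set of port numbers."""
    ports = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # an empty value means no reserved ports
        first, _, last = part.partition("-")
        lo = int(first)
        hi = int(last) if last else lo
        if lo > hi:
            raise ValueError(f"invalid range: {part}")
        ports.update(range(lo, hi + 1))
    return ports


def assignable_ports(first: int, last: int, reserved: set) -> int:
    """Count ports in [first, last] that are not reserved."""
    return sum(1 for p in range(first, last + 1) if p not in reserved)


if __name__ == "__main__":
    print(sorted(parse_port_list("1,2-4,10-10")))
    reserved = parse_port_list("8080,9148")
    # Both reserved ports lie outside 32000-60999, so nothing is excluded.
    print(assignable_ports(32000, 60999, reserved))
```

With the values from the example above, the reserved ports fall outside the local port range, which is why the quoted text calls that setting redundant.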

The Linux kernel really is a core software module that can be trimmed down and configured.

net.ipv4.ip_forward

ip_forward - BOOLEAN

Forward Packets between interfaces.

This variable is special, its change resets all configuration parameters to their default state (RFC1122 for hosts, RFC1812 for routers)

Enable this parameter when the Linux system is used as a gateway device.

net.ipv4.icmp_echo_ignore_all

If set non-zero, then the kernel will ignore all ICMP ECHO requests sent to it.

Default: 0

Setting this switch to 1 stops the Linux system from replying to ping requests. With the default value of 0, replies are enabled.

net.ipv4.tcp_syn_retries

tcp_syn_retries - INTEGER

Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. Default value is 6, which corresponds to 63seconds till the last retransmission with the current initial RTO of 1second. With this the final timeout for an active TCP connection attempt will happen after 127seconds.

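The initial SYN is retransmitted 6 times by default. A connection timeout that long is rarely acceptable, so application code usually sets a much shorter timeout of its own, and changing this parameter brings little benefit.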
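The 63-second and 127-second figures quoted above follow from exponential backoff starting at an initial RTO of 1 second. The sketch below reproduces that arithmetic. retransmit_schedule is a name made up for this example, and the model deliberately ignores the kernel's upper bound on the RTO.

```python
def retransmit_schedule(retries: int, initial_rto: float = 1.0):
    """Return (time of each retransmission, time of the final timeout).

    Times are seconds since the first transmission. Simplification: the
    RTO doubles after every attempt and is never capped.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto   # wait one RTO, then retransmit
        times.append(elapsed)
        rto *= 2         # exponential backoff
    # After the last retransmission, wait one more RTO before giving up.
    return times, elapsed + rto


if __name__ == "__main__":
    for name, retries in (("tcp_syn_retries", 6), ("tcp_synack_retries", 5)):
        times, final = retransmit_schedule(retries)
        print(name, "retransmissions at", times, "final timeout at", final)
```

The same arithmetic gives 31 and 63 seconds for 5 retries. In application code, a shorter limit is typically set per connection, for example with the timeout argument of socket.create_connection().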

net.ipv4.tcp_synack_retries

tcp_synack_retries - INTEGER

Number of times SYNACKs for a passive TCP connection attempt will be retransmitted. Should not be higher than 255. Default value is 5, which corresponds to 31seconds till the last retransmission with the current initial RTO of 1second. With this the final timeout for a passive TCP connection will happen after 63seconds.

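The SYNACK that a TCP server sends in reply to a SYN is retransmitted at most 5 times by default. Consider lowering this value, for example to 2, so that resources are held for a shorter time during a SYN attack.

Note that a TCP ACK segment is never retransmitted. When an ACK is lost, the peer retransmits the corresponding segment.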

net.ipv4.tcp_max_syn_backlog (?)

Maximal number of remembered connection requests (SYN_RECV), which have not received an acknowledgment from connecting client.

This is a per-listener limit.

The minimal value is 128 for low memory machines, and it will increase in proportion to the memory of machine. If server suffers from overload, try increasing this number.

Remember to also check /proc/sys/net/core/somaxconn. A SYN_RECV request socket consumes about 304 bytes of memory.

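Some articles online say this parameter is obsolete, and that the SYN queue length is determined by the backlog argument of listen() and by net.core.somaxconn. Even so, 128 is the default that some language runtimes use when listen is called without an explicit backlog.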

net.ipv4.tcp_abort_on_overflow

tcp_abort_on_overflow - BOOLEAN

If listening service is too slow to accept new connections, reset them. Default state is FALSE. It means that if overflow occurred due to a burst, connection will recover. Enable this option only if you are really sure that listening daemon cannot be tuned to accept connections faster. Enabling this option can harm clients of your server.

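The default is False: when the server's accept queue overflows, the server does not send a RST but silently drops the client's final ACK. The client is not reset and still has a chance to connect.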

net.ipv4.tcp_syncookies

This feature is not part of the TCP standard.

tcp_syncookies - INTEGER

Only valid when the kernel was compiled with CONFIG_SYN_COOKIES. Send out syncookies when the syn backlog queue of a socket overflows. This is to prevent against the common 'SYN flood attack'. Default: 1

Note, that syncookies is fallback facility. It MUST NOT be used to help highly loaded servers to stand against legal connection rate. If you see SYN flood warnings in your logs, but investigation shows that they occur because of overload with legal connections, you should tune another parameters until this warning disappear. See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.

syncookies seriously violate TCP protocol, do not allow to use TCP extensions, can result in serious degradation of some services (f.e. SMTP relaying), visible not by you, but your clients and relays, contacting you. While you see SYN flood warnings in logs not being really flooded, your server is seriously misconfigured.

If you want to test which effects syncookies have to your network connections you can set this knob to 2 to enable unconditionally generation of syncookies.

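This feature requires a kernel built with CONFIG_SYN_COOKIES. Enabling it on a server that is merely under heavy legitimate load is not recommended: it is not part of the TCP standard, and it prevents the use of TCP extensions. To test it, set the value to 2, which means SYN cookies are always used.

For details, see: SYN Cookie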

net.ipv4.tcp_tw_reuse

tcp_tw_reuse - INTEGER

Enable reuse of TIME-WAIT sockets for new connections when it is safe from protocol viewpoint.

It should not be changed without advice/request of technical experts.

Default: 2

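This parameter only matters when connect() is called, so servers generally do not need to consider enabling it. A TIME_WAIT socket can be reused only when net.ipv4.tcp_tw_reuse is enabled and the previous connection used the TCP timestamp option, which is needed for PAWS. (See the kernel source, searching for sysctl_tcp_tw_reuse.) With tcp_timestamps as protection, segments from the previous connection that linger in the network do no harm: they are simply dropped on arrival.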

net.ipv4.tcp_max_tw_buckets

tcp_max_tw_buckets - INTEGER

Maximal number of timewait sockets held by system simultaneously. If this number is exceeded time-wait socket is immediately destroyed and warning is printed. This limit exists only to prevent simple DoS attacks, you must not lower the limit artificially, but rather increase it (probably, after increasing installed memory), if network conditions require more than default value.

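This is the parameter to configure on servers. Once the limit is exceeded, a socket that would enter TIME_WAIT is destroyed immediately and its resources are released, so the parameter in effect caps the number of TIME_WAIT sockets on the system. If this value were lowered to 0, would the server then have no TIME_WAIT sockets at all?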

net.ipv4.tcp_fin_timeout

tcp_fin_timeout - INTEGER

The length of time an orphaned (no longer referenced by any application) connection will remain in the FIN_WAIT_2 state before it is aborted at the local end. While a perfectly valid "receive only" state for an un-orphaned connection, an orphaned connection in FIN_WAIT_2 state could otherwise wait forever for the remote to close its end of the connection.

Cf. tcp_max_orphans

Default: 60 seconds

The maximum time an orphaned socket waits in the FIN_WAIT_2 state. The default of 60 seconds is the same as the default TIME_WAIT duration.

net.ipv4.tcp_timestamps

tcp_timestamps - INTEGER

Enable timestamps as defined in RFC1323.

Default: 1

TCP timestamps are enabled by default.

net.ipv4.tcp_sack

tcp_sack - BOOLEAN

Enable select acknowledgments (SACKS).

TCP SACK is enabled by default.

net.ipv4.tcp_dsack

tcp_dsack - BOOLEAN

Allows TCP to send "duplicate" SACKs.

TCP DSACK is enabled by default.

net.ipv4.tcp_window_scaling

tcp_window_scaling - BOOLEAN

Enable window scaling as defined in RFC1323.

TCP window scaling is enabled by default.

net.ipv4.tcp_keepalive_time

tcp_keepalive_time - INTEGER

How often TCP sends out keepalive messages when keepalive is enabled.

Default: 2hours.

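The idle time before the TCP keepalive mechanism starts: once a TCP connection has carried no data for this long, keepalive probing begins. There is one precondition: the socket must have SO_KEEPALIVE enabled in the code that creates it.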

net.ipv4.tcp_keepalive_probes

tcp_keepalive_probes - INTEGER

How many keepalive probes TCP sends out, until it decides that the connection is broken. Default value: 9.

The number of probes sent once the TCP keepalive mechanism has started.

net.ipv4.tcp_keepalive_intvl

tcp_keepalive_intvl - INTEGER

How frequently the probes are send out. Multiplied by tcp_keepalive_probes it is time to kill not responding connection, after probes started. Default value: 75sec i.e. connection will be aborted after ~11 minutes of retries.

The interval between probes once the TCP keepalive mechanism has started.
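The three keepalive parameters combine as follows: after tcp_keepalive_time seconds of silence, up to tcp_keepalive_probes probes are sent tcp_keepalive_intvl seconds apart. The sketch below enables SO_KEEPALIVE on a socket and computes the resulting detection time. enable_keepalive and detection_time are illustrative names. The per-socket options TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific, so the code only sets them where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle=7200, interval=75, probes=9):
    """Turn on keepalive; override the system-wide sysctls per socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    overrides = (
        ("TCP_KEEPIDLE", idle),       # counterpart of tcp_keepalive_time
        ("TCP_KEEPINTVL", interval),  # counterpart of tcp_keepalive_intvl
        ("TCP_KEEPCNT", probes),      # counterpart of tcp_keepalive_probes
    )
    for name, value in overrides:
        opt = getattr(socket, name, None)  # missing on some platforms
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def detection_time(idle=7200, interval=75, probes=9):
    """Seconds from the last traffic until a dead peer is given up on."""
    return idle + interval * probes


if __name__ == "__main__":
    print("defaults:", detection_time(), "seconds")
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            enable_keepalive(s, idle=60, interval=10, probes=3)
            print("tuned:", detection_time(60, 10, 3), "seconds")
    except OSError as exc:
        print("could not configure socket:", exc)
```

With the defaults, the probing phase alone lasts 9 × 75 = 675 seconds, which is the "~11 minutes" in the quoted documentation.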

net.ipv4.tcp_slow_start_after_idle

tcp_slow_start_after_idle - BOOLEAN

If set, provide RFC2861 behavior and time out the congestion window after an idle period. An idle period is defined at the current RTO. If unset, the congestion window will not be timed out after an idle period.

Default: 1

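After the connection has been idle for one RTO, slow start is restarted. This is enabled by default. Consider disabling it to reduce latency.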

Link to this article: https://cs.pynote.net/sf/linux/sys/202112141/

-- EOF --
