Hardware
No matter what operating system you choose, the machine you run on will determine the theoretical speed limit you can expect to achieve. When people talk about how fast a system is they always mention CPU clock speed. We would expect a 2.4 GHz AMD64 to run faster than a 1.0 GHz Pentium 3, but CPU speed is not the key; motherboard bus speed is.

In terms of a firewall or bridge we are looking to move data through the system as fast as possible. This means we need a PCI bus that is able to move data quickly between network interfaces. To do this the machine must have a wide bus and a high bus speed. CPU clock speed is a very minor part of the equation.
The quality of a network card is key to high throughput. As a very general rule, an on-board network card is going to be much slower than an add-in PCI card. The reason is that most desktop motherboard manufacturers use cheap on-board network chipsets that use CPU processing time instead of handling TCP traffic by themselves. This leads to very slow network performance and high CPU load.
A gigabit network controller built on board using the CPU will slow the entire system down. More than likely the system will peg the CPU at 100% and still not be able to sustain 100MB speeds. A network controller that is able to negotiate a gigabit link is _very_ different from a controller that can transfer a gigabit of data per second.
Ideally you want to use a server class add-in card with a TCP offload engine or TCP accelerator. We have seen very good speeds with the Intel Pro/1000 MT series (em(4)) cards. They are not too expensive and all operating systems support them.
That is not to say that all on-board chipsets are bad. Supermicro uses an Intel 82546EB Gigabit Ethernet Controller on their server motherboards. It offers two(2) copper gigabit ports through a single chipset on a 133MHz PCI-X, 128 bit wide bus, pre-fetches up to 64 packet descriptors and has two 64 KB on-chip packet buffers. This is an exceptionally fast chip and it saves space by being built onto the server board.
Now, in order to move data in and out of the network cards as fast as possible we need a bus that is both wide and running at a high clock speed. For example, a 64bit PCI-X slot is wider than a 32bit PCI-X slot, just as a 66MHz bus is faster than a 33MHz bus. Wide is good, fast is good, but wide and fast together are better.
The equation to calculate the theoretical speed of a PCI or PCI-X slot is the following:
    (bus speed in MHz) * (bus width in bits) / 8 = speed in Megabytes/second
    66 MHz * 32 bit / 8 = 264 Megabytes/second

For example, if we have a motherboard with a 32bit wide bus running at 66MHz then the theoretical max speed we can push data through the slot is 66*32/8 = 264 Megabytes/second. With a server class board we could use a 64bit slot running at 133MHz and reach speeds of 133*64/8 = 1064 Megabytes/second.
Now that you have the max speed of a single PCI slot, understand that this number is the max speed of the bus if nothing else is using it. All PCI cards and on-board chips use the same bus, so they must also be taken into account. If we have two network cards each using a 64bit, 133MHz slot then each card will get to use 50% of the total speed of the PCI bus. The bus can do 133*64/8 = 1064 Megabytes/second, so if both network cards are being used at once, like on a firewall, then each card can use 1064/2 = 532 Megabytes/second at most. This is still well above the maximum speed of a gigabit connection, which can move 1000/8 = 125 Megabytes/second.
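To make the arithmetic above easy to repeat for your own board, here is a minimal Python sketch of the same equation. The function names are purely illustrative, and the even split between active cards is the simplifying assumption used in the text, not a measurement.

```python
def bus_bandwidth_mbytes(clock_mhz: float, width_bits: int) -> float:
    """Theoretical peak of a parallel PCI or PCI-X bus in Megabytes/second."""
    return clock_mhz * width_bits / 8


def per_card_share(clock_mhz: float, width_bits: int, active_cards: int) -> float:
    """Even split of the shared bus between cards that are busy at the same time."""
    return bus_bandwidth_mbytes(clock_mhz, width_bits) / active_cards


# One gigabit link expressed in Megabytes/second (1000 Mbit / 8).
GIGABIT_MBYTES = 1000 / 8

if __name__ == "__main__":
    for clock, width in [(33, 32), (66, 32), (66, 64), (133, 64)]:
        total = bus_bandwidth_mbytes(clock, width)
        share = per_card_share(clock, width, active_cards=2)
        verdict = "enough" if share >= GIGABIT_MBYTES else "too slow"
        print(f"{width}bit @ {clock}MHz: {total:6.0f} MB/s total, "
              f"{share:6.1f} MB/s per card with two cards ({verdict} for gigabit)")
```

Swap in the clock speed and width from your own motherboard manual, and raise active_cards if more devices share the bus.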
Look at the specifications of the motherboard you expect to use and apply the above equation to get a rough idea of the speeds you can expect out of the box. Hardware speed is the key to a fast firewall. Before setting up your new system and possibly wasting hours wondering why it is not reaching your speed goals, make sure you understand the limitations of the hardware. Do not expect throughput out of your system hardware that it is _not_ capable of.
For example, when using a four port network card on a machine, consider the bandwidth of the adapter slot you put it into. Standard PCI is a 32 bit wide interface with a bus speed of 33 MHz or 66 MHz, and this bandwidth is shared across all devices on the same bus. PCI-e is a serial connection running at 2.5 GHz in each direction for a 1x slot, which gives an effective maximum bandwidth of 2 Gbps in each direction. So, if you decide to support four 1 Gbps connections on one card it is best to do it with a PCI-e 4x or faster slot and card.
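As a rough sketch of the slot sizing above, the following Python works out the smallest first generation PCI-e slot that can carry a multi port card. The 8b/10b line encoding is what turns the 2.5 GHz signalling rate into 2 Gbps of usable bandwidth per lane. The list of slot widths is an assumption about what a typical board offers, and the function names are illustrative.

```python
RAW_MBITS_PER_LANE = 2500           # 2.5 GHz signalling rate, per direction
COMMON_SLOT_WIDTHS = (1, 4, 8, 16)  # assumption: slot sizes found on most boards


def lane_mbits() -> int:
    """Usable Mbit/s per lane and direction after 8b/10b encoding."""
    return RAW_MBITS_PER_LANE * 8 // 10


def smallest_slot(ports: int, port_mbits: int = 1000) -> int:
    """Smallest common slot width whose lanes cover all ports at line rate."""
    needed = ports * port_mbits
    for width in COMMON_SLOT_WIDTHS:
        if width * lane_mbits() >= needed:
            return width
    raise ValueError("no common slot is wide enough")


if __name__ == "__main__":
    for ports in (1, 2, 4):
        print(f"{ports} gigabit port(s): PCI-e {smallest_slot(ports)}x slot")
```

This ignores PCI-e protocol overhead, so treat the result as a lower bound and round up when in doubt.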
How much RAM do I need for a firewall?
For a standard OpenBSD firewall one(1) gigabyte of RAM is more than enough. In fact, unless you are running many memory hungry services you will use less than 100 megabytes of RAM at any one time. On our testing system we had eight(8) gigabytes available, but OpenBSD will only recognize 3.1 gigabytes of that whether you use the i386 or AMD64 kernel. One of the few times you may need more RAM is if your firewall is going to load tables in Pf with tens of thousands of entries. These days RAM is cheap, but there is no need to put four(4) to eight(8) gigabytes in the machine as it will only go to waste.

You can reduce the power consumption of your firewall and keep track of system temperatures by using Power Management with apmd and Sensorsd hardware monitor (sensorsd.conf).
Is a Maximum Transmission Unit (MTU) over 1500 really better?
It is sometimes recommended to set the MTU of your network interface above the default value of 1500. Users of jumbo frames can set the MTU as high as 9000. The MTU value tells the network card to send Ethernet frames of up to the specified size in bytes. While this may be useful when connecting two hosts directly together using the same MTU, it is a lot less useful when connecting through a switch which does not support a larger MTU.

When a switch or a machine receives a frame that is larger than it is able to forward, it must fragment the packet. This takes time and is very inefficient. The throughput you may gain when connecting to similar high MTU machines you will lose when connecting to any 1500 MTU machine.
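To put a number on what a larger MTU can buy, this sketch compares how much of each Ethernet frame carries TCP payload at an MTU of 1500 and of 9000. It assumes IPv4 and TCP headers without options (40 bytes) and the usual 38 bytes of per frame Ethernet overhead on the wire. The function name is illustrative.

```python
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12  # preamble/SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20             # assumption: IPv4 + TCP without options


def tcp_goodput_mbits(mtu: int, link_mbits: float = 1000.0) -> float:
    """Best case TCP payload rate on a link for a given MTU."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETHERNET_OVERHEAD
    return link_mbits * payload / on_wire


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {tcp_goodput_mbits(mtu):.1f} Mbit/s of payload on gigabit")
```

Under these assumptions a gigabit link tops out at roughly 949 Mbit/s of TCP payload with a 1500 byte MTU and roughly 991 Mbit/s with a 9000 byte MTU, so jumbo frames are worth a few percent at best.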
Either way, increasing the MTU may not be necessary. 930Mb/s can be attained at the normal 1500 byte MTU setting with the following network tweaks.
When trying to attain maximum throughput, the most important options involve TCP window sizes and send/receive space buffers.
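The reason window and buffer sizes matter so much can be sketched with the bandwidth delay product: a sender can have at most one window of data in flight per round trip. The Python below is an illustration only. The round trip times and the 16384 byte window are example values picked for the sketch, while 262144 is the buffer size used in the sysctl.conf in this article.

```python
def bdp_bytes(link_mbits: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep the link full for one round trip."""
    return link_mbits * 1_000_000 / 8 * (rtt_ms / 1000)


def max_throughput_mbits(window_bytes: int, rtt_ms: float,
                         link_mbits: float = 1000.0) -> float:
    """Best case for one TCP connection: one window per round trip, capped by the link."""
    window_limited = window_bytes * 8 / (rtt_ms / 1000) / 1_000_000
    return min(window_limited, link_mbits)


if __name__ == "__main__":
    for rtt_ms in (0.2, 1.0, 10.0):
        print(f"RTT {rtt_ms:5.1f} ms: {bdp_bytes(1000, rtt_ms):>10,.0f} bytes needed in flight")
        for window in (16384, 262144):
            rate = max_throughput_mbits(window, rtt_ms)
            print(f"    window {window:>6} bytes -> {rate:7.1f} Mbit/s")
```

With a 1 ms round trip, a 16384 byte window caps a single connection at about 131 Mbit/s no matter how fast the link is, while a 262144 byte window leaves the gigabit link itself as the limit.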
OpenBSD network stack "speed_tweaks"
First, make sure you are running the latest version of OpenBSD. Not necessarily the bleeding edge -current tree; the -stable tree will work just fine. As of OpenBSD v4.7 a lot of work has been done to remove many of the bottlenecks in the network code and in how Pf handles traffic.

Second, make sure you have applied any patches to the system according to the OpenBSD page. We have a patch guide if you need it, Patching OpenBSD kernel and packages.
The following options are put in the /etc/sysctl.conf file. They will increase the network buffer sizes and allow TCP window scaling. Understand that these settings are at the upper extreme. We found them perfectly suited to a production environment which can saturate a gigabit link. You may not need to set each of the values this high, but that is up to your environment and testing methods. A summary explanation follows each option.
    ### Calomel.org  OpenBSD  /etc/sysctl.conf
    ##
    ddb.panic=0                     # do not enter ddb console on kernel panic, reboot if possible
    kern.maxclusters=128000         # Cluster allocation limit
    machdep.allowaperture=2         # Access the X Window System if you need it, otherwise set to 0
    net.inet.icmp.errppslimit=1000  # Maximum number of outgoing ICMP error messages per second
    net.inet.icmp.rediraccept=0     # Deny icmp redirects
    net.inet.ip.forwarding=1        # Permit forwarding (routing) of packets if this is a firewall
    net.inet.ip.ifq.maxlen=512      # Maximum allowed input queue length (256*number of interfaces)
    net.inet.ip.mtudisc=0           # TCP MTU (Maximum Transmission Unit) discovery off since our mss is small enough
    net.inet.ip.ttl=254             # the TTL should match what we have for "min-ttl" in scrub rule in pf.conf
    net.inet.ipcomp.enable=1        # IP Payload Compression protocol (IPComp) reduces the size of IP datagrams
    net.inet.tcp.ackonpush=1        # acks for packets with the push bit set should not be delayed
    net.inet.tcp.ecn=1              # Explicit Congestion Notification enabled
    net.inet.tcp.mssdflt=1472       # maximum segment size (1472 from scrub pf.conf)
    net.inet.tcp.recvspace=262144   # Increase TCP "receive" window size to increase performance
    net.inet.tcp.rfc1323=1          # RFC1323 TCP window scaling
    net.inet.tcp.rfc3390=1          # RFC3390 for TCP window increasing
    net.inet.tcp.sack=1             # TCP Selective ACK (SACK) Packet Recovery
    net.inet.tcp.sendspace=262144   # Increase TCP "send" window size to increase performance
    net.inet.udp.recvspace=262144   # Increase UDP "receive" window size to increase performance
    net.inet.udp.sendspace=262144   # Increase UDP "send" window size to increase performance
    vm.swapencrypt.enable=1         # encrypt pages that go to swap

    ### CARP options if needed
    # net.inet.carp.arpbalance=0    # CARP load-balance
    # net.inet.carp.log=2           # Log CARP state changes
    # net.inet.carp.preempt=1       # Enable CARP interfaces to preempt each other (0 -> 1)
    # net.inet.ip.forwarding=1      # Enable packet forwarding through the firewall (0 -> 1)
You can apply each of these settings manually by using sysctl on the command line. For example, "sysctl kern.maxclusters=128000" will set the kern.maxclusters variable until the machine is rebooted. By setting the variables manually you can test each of them to see if they will help your machine.
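If you want to try the settings one at a time, a small helper can turn sysctl.conf style text into the matching command lines. This is a minimal sketch: the function name and the sample text are illustrative, and the script only prints the commands so you can run them by hand as root.

```python
def sysctl_commands(conf_text: str) -> list[str]:
    """Turn 'name=value  # comment' lines into 'sysctl name=value' commands."""
    commands = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        if name:
            commands.append(f"sysctl {name}={value}")
    return commands


# Sample input for the sketch, taken from the listing in this article.
SAMPLE = """\
kern.maxclusters=128000        # Cluster allocation limit
net.inet.tcp.recvspace=262144  # TCP receive window
# net.inet.carp.log=2          # commented out, so it is skipped
"""

if __name__ == "__main__":
    for command in sysctl_commands(SAMPLE):
        print(command)
```

Apply one command, rerun your benchmark, and keep the setting only if it helps. A reboot returns everything to the values in /etc/sysctl.conf.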
For more information about OpenBSD's Pf firewall and HFSC quality of service options check out our PF Config (pf.conf) and PF quality of service HFSC "how to's".
Testing and verifying network speeds (UPDATED)
Continuing with OpenBSD v4.5, a lot of work has been done on the single and multi-core kernels focused on speed and efficiency improvements. Since many OpenBSD machines will be used as a firewall or bridge we wanted to see what type of speeds we could expect passing through the machine. Let's take a look at the single and multi core kernels, the effect of having PF enabled or disabled and the effect of our "speed tweaks" listed in the section above.

The testing hardware
To do our testing we will use the latest patches applied to the latest distribution. Our test setup consists of two(2) identical boxes containing an Intel Core 2 Quad (Q9300), eight(8) gigs of ram and an Intel PRO/1000 MT (CAT5e copper) network card. The cards were put in a 64bit PCI-X slot running at 133 MHz. The boxes are connected to each other by an Extreme Networks Summit X450a-48t gigabit switch using 12' unshielded CAT6 cable.
The testing software
The following iperf options were used on the machines we will call test0 and test1. We will be sustaining a full speed transfer for 30 seconds and take the average speed in Mbits/sec as the result. Iperf is available through the OpenBSD repositories using "pkg_add iperf".
    ## iperf listening server
    root@test1: iperf -s

    ## iperf sending client
    root@test0: iperf -i 1 -t 30 -c test1

The PF rules
The following minimal PF rules were used if PF was enabled (pf=YES):

    # pfctl -sr
    scrub in all fragment reassemble
    pass in all flags S/SA keep state
    block drop in on ! lo0 proto tcp from any to any port = 6000
Test 1: No Speed Tweaks. Using the GENERIC and GENERIC.MP kernels (patched -stable) with the default TCP window sizes we are able to sustain over 300 Mbits/sec (37 Megabytes/sec). Since the link was at gigabit (1000 Mbits/sec maximum) we are using less than 40% of our network line speed.
    bsd.single_processor_patched   pf=YES   speed_tweaks=NO
    [ 1] 0.0-30.0 sec 1.10 GBytes 315 Mbits/sec

    bsd.single_processor_patched   pf=NO    speed_tweaks=NO
    [ 1] 0.0-30.0 sec 1.24 GBytes 356 Mbits/sec

    bsd.multi_processor_patched    pf=YES   speed_tweaks=NO
    [ 4] 0.0-30.2 sec 1.13 GBytes 321 Mbits/sec

    bsd.multi_processor_patched    pf=NO    speed_tweaks=NO
    [ 4] 0.0-30.0 sec 1.28 GBytes 368 Mbits/sec
According to the results the network utilization was quite poor. We are able to push data across the network at less than half of its capacity (Gigabit=1000Mbit/s and we used at most 368Mbit/s, or about 37%). For most uses on a home network with a cable modem or FIOS you will not notice. But what if you have access to a high speed gigabit or 10 gigabit network?
Test 2: Calomel.org Speed Tweaks. Using the GENERIC and GENERIC.MP (patched -stable) kernel we are able to sustain around 800 Mbits/sec, almost three(3) times the default speeds.
    bsd.single_processor_patched   pf=YES   speed_tweaks=YES
    [ 1] 0.0-30.0 sec 2.95 GBytes 845 Mbits/sec

    bsd.single_processor_patched   pf=NO    speed_tweaks=YES
    [ 1] 0.0-30.0 sec 3.25 GBytes 868 Mbits/sec

    bsd.multi_processor_patched    pf=YES   speed_tweaks=YES
    [ 4] 0.0-30.0 sec 2.69 GBytes 772 Mbits/sec

    bsd.multi_processor_patched    pf=NO    speed_tweaks=YES
    [ 4] 0.0-30.2 sec 2.82 GBytes 803 Mbits/sec
These results are much better. We are utilizing roughly 77% to 87% of a gigabit network, which means we can sustain around 100 megabytes per second. Both the single processor and multi processor kernels performed efficiently. The use of PF reduced our throughput only minimally.
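For readers who want to compare their own runs the same way, here is a small Python sketch that pulls the Mbits/sec figure out of an iperf summary line and reports link utilization and the gain from the tweaks. The summary lines are copied from the two tests above. The helper names and the regular expression are our own and assume the summary format shown in this article.

```python
import re

MBITS = re.compile(r"([\d.]+)\s+Mbits/sec")
LINK_MBITS = 1000.0  # gigabit link


def mbits_per_sec(summary_line: str) -> float:
    """Extract the average rate from an iperf summary line."""
    match = MBITS.search(summary_line)
    if match is None:
        raise ValueError(f"no Mbits/sec figure in: {summary_line!r}")
    return float(match.group(1))


def compare(default_line: str, tweaked_line: str) -> tuple[float, float, float]:
    """Return (default utilization %, tweaked utilization %, speedup factor)."""
    before = mbits_per_sec(default_line)
    after = mbits_per_sec(tweaked_line)
    return before / LINK_MBITS * 100, after / LINK_MBITS * 100, after / before


# (kernel, pf) -> (default summary, tweaked summary) from Test 1 and Test 2
RESULTS = {
    ("single", "pf=YES"): ("[ 1] 0.0-30.0 sec 1.10 GBytes 315 Mbits/sec",
                           "[ 1] 0.0-30.0 sec 2.95 GBytes 845 Mbits/sec"),
    ("single", "pf=NO"): ("[ 1] 0.0-30.0 sec 1.24 GBytes 356 Mbits/sec",
                          "[ 1] 0.0-30.0 sec 3.25 GBytes 868 Mbits/sec"),
    ("multi", "pf=YES"): ("[ 4] 0.0-30.2 sec 1.13 GBytes 321 Mbits/sec",
                          "[ 4] 0.0-30.0 sec 2.69 GBytes 772 Mbits/sec"),
    ("multi", "pf=NO"): ("[ 4] 0.0-30.0 sec 1.28 GBytes 368 Mbits/sec",
                         "[ 4] 0.0-30.2 sec 2.82 GBytes 803 Mbits/sec"),
}

if __name__ == "__main__":
    for (kernel, pf), lines in RESULTS.items():
        before, after, speedup = compare(*lines)
        print(f"{kernel:6} {pf:6}: {before:4.1f}% -> {after:4.1f}% of the link "
              f"({speedup:.2f}x)")
```

Paste in the last line of your own iperf runs to see where your machine lands.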
Why do these "speed tweaks" work? What is the theory?
The dominant protocol used on the Internet today is TCP, a "reliable" "window-based" protocol. The best possible network performance is achieved when the network pipe between the sender and the receiver is kept full of data. Take a look at the excellent study done at the Pittsburgh Supercomputing Center titled, "Enabling High Performance Data Transfers". They cover bandwidth delay products (BDP), buffers, maximum TCP buffer (memory) space, socket buffer sizes, TCP large window extensions (RFC1323), TCP selective acknowledgments option (SACK, RFC2018) and path MTU theory.

Should we use the GENERIC or GENERIC.MP kernel?
As of OpenBSD v4.5 you are welcome to use either one. Both kernels performed exceptionally well in our speed tests.

Despite the recent development of multiple processor support in OpenBSD, the kernel still operates as if it were running on a single processor system. On an SMP system only one processor is able to run the kernel at any point in time, a semantic which is enforced by a Big Giant Lock. The Big Giant Lock (BGL) works like a token. If the kernel is running on one CPU then that CPU holds the BGL and the kernel can _not_ run on a second CPU. The network stack, and thus pf and pfsync, runs in the kernel and so under the Big Giant Lock.
If you have access to a multi core machine and expect to run programs that can take advantage of the cores, for example an intrusion detection app, monitoring script or real time network reporting tool, then the multi core kernel is a good choice. PF is _not_ a multi core program so it will not benefit from the multi core kernel. Truthfully, if you have multiple cores then use them.
Your firewall is one of the most important machines on the network. Keep the system time up to date with OpenNTPD "how to" (ntpd.conf), monitor your hardware with S.M.A.R.T. - Monitoring hard drive health and keep track of any changed files with a custom Intrusion Detection (IDS) using mtree. If you need to verify a hard drive for bad sectors check out Badblocks hard drive validation/wipe.
Other Operating System Software
The next few sections are dedicated to operating systems other than OpenBSD. Each OS has some way in which you can increase the overall throughput of the system. Just scroll to the OS you are most interested in.

FreeBSD network stack
    ### Calomel.org  FreeBSD  /etc/sysctl.conf
    ##
    kern.ipc.maxsockbuf=262144     # Maximum window size
    net.inet.tcp.sendspace=65536   # Increase TCP windows size to increase performance
    net.inet.tcp.recvspace=65536   # "
    net.inet.tcp.rfc1323=1         # RFC1323 TCP window scaling
    kern.ipc.nmbclusters=32768     # Buffers
RedHat or CentOS Linux network stack
    ### Calomel.org  RedHat or CentOS Linux  /etc/sysctl.conf
    ##
    # some of the defaults may be different for your kernel
    # call this file with sysctl -p
    # these are just suggested values that worked well to
    # increase throughput in several network benchmark tests

    ### IPV4 specific settings
    # turns TCP timestamp support off, default 1, reduces CPU use
    net.ipv4.tcp_timestamps = 0
    # turn SACK support on -- you probably want this off for 10GigE
    net.ipv4.tcp_sack = 1
    # scaling support
    net.ipv4.tcp_window_scaling=1
    # on systems with a VERY fast bus to memory interface this is the big plus
    # sets min/default/max TCP read buffer, default 4096 87380 174760
    # setting to 100M - 10M is too small for cross country (chsmall)
    net.ipv4.tcp_rmem = 1000000 1000000 1000000
    # sets min/default/max TCP write buffer, default 4096 16384 131072
    net.ipv4.tcp_wmem = 1000000 1000000 1000000
    # sets min/pressure/max TCP buffer space, default 31744 32256 32768
    net.ipv4.tcp_mem = 150000000 150000000 150000000

    ### CORE settings (for socket and UDP effect)
    # maximum receive socket buffer size, default 131071
    net.core.rmem_max = 1000000
    # maximum send socket buffer size, default 131071
    net.core.wmem_max = 1000000
    # default receive socket buffer size, default 65535
    net.core.rmem_default = 2524287
    # default send socket buffer size, default 65535
    net.core.wmem_default = 2524287
    # maximum amount of option memory buffers, default 10240
    net.core.optmem_max = 2524287
    # number of unprocessed input packets before kernel starts dropping them, default 300
    net.core.netdev_max_backlog = 300000
    # enable window scaling RFC1323 TCP window scaling
    net.ipv4.tcp_window_scaling=1
Suse or openSUSE Linux network stack
    ### Calomel.org  Suse or openSUSE Linux  /etc/sysctl.conf
    ##
    # some of the defaults may be different for your kernel
    # call this file with sysctl -p
    # these are just suggested values that worked well to
    # increase throughput in several network benchmark tests

    # packet reordering in a network can be interpreted as packet loss
    # and increasing the value of this parameter should improve performance
    net.ipv4.tcp_reordering = 20
    # Sets the Maximum Socket Send Buffer for TCP Protocol
    net.ipv4.tcp_wmem = 8192 87380 16777216
    # Sets the Maximum Socket Receive Buffer for TCP Protocol
    net.ipv4.tcp_rmem = 8192 87380 16777216
    # Do not cache performance metrics from closed connections
    net.ipv4.tcp_no_metrics_save = 1
    # You can set this to one of the many available high speed congestion variants like "cubic" or "highspeed" (HS-TCP)
    net.ipv4.tcp_congestion_control = cubic
    # Sets the Maximum Socket Send Buffer for all protocols
    net.core.wmem_max = 16777216
    # Sets the Maximum Socket Receive Buffer for all protocols
    net.core.rmem_max = 16777216
Windows XP/2000 Server/Server 2003 network stack
Edit the registry using "regedit" and look for the following section:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Now add the following values:
- Add a registry DWORD named TcpWindowSize with a value of 131400 (click on 'decimal').
- Add a registry DWORD named Tcp1323Opts with a value of 3. This will enable rfc1323 scaling and timestamps.
- Add a registry DWORD named ForwardBufferMemory with a value of 80000. This increases the memory set aside for buffering forwarded packet data.
- Add a registry DWORD named NumForwardPackets with a value of 60000. Increase buffer for forwarded packets.
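The Tcp1323Opts value is a two bit flag field, which is why 3 turns on both features. The sketch below decodes the four possible values and also shows that the suggested TcpWindowSize is a whole number of segments. The function names are illustrative, and the 1460 byte segment size is an assumption for a standard 1500 byte Ethernet MTU.

```python
WINDOW_SCALING = 0x1  # bit 0 of Tcp1323Opts
TIMESTAMPS = 0x2      # bit 1 of Tcp1323Opts


def decode_tcp1323opts(value: int) -> dict[str, bool]:
    """Report which RFC1323 options a Tcp1323Opts value enables."""
    if value not in range(4):
        raise ValueError("Tcp1323Opts must be 0, 1, 2 or 3")
    return {
        "window_scaling": bool(value & WINDOW_SCALING),
        "timestamps": bool(value & TIMESTAMPS),
    }


def window_in_segments(window_bytes: int, mss: int = 1460) -> float:
    """How many full segments fit in the window (mss of 1460 is an assumption)."""
    return window_bytes / mss


if __name__ == "__main__":
    for value in range(4):
        print(value, decode_tcp1323opts(value))
    print("TcpWindowSize 131400 =", window_in_segments(131400), "segments")
```

If you only want window scaling without timestamps, a value of 1 is the one to use.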