SysAdmin Blog

Linux ATA bus errors with ASMedia ASM1062 PCIe card

Alexander Bochmann Saturday 22 of October, 2016
I recently added a cheap ASM1062 2-port SATA card to my Linux box at home, since its Asus C8HM70-I board only has two SATA ports and I wanted to use an additional small SSD as boot device.

With my disks hooked up to the new card, I started to get SATA errors when there was moderate write load:

kernel log
ata5.00: exception Emask 0x10 SAct 0x7c000000 SErr 0x400000 action 0x6 frozen
ata5.00: irq_stat 0x08000000, interface fatal error
ata5: SError: { Handshk }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/00:d0:00:2b:6f/0a:00:ac:00:00/40 tag 26 ncq 1310720 out
         res 40/00:f4:00:53:6f/00:00:ac:00:00/40 Emask 0x10 (ATA bus error)
ata5.00: status: { DRDY }
ata5.00: failed command: WRITE FPDMA QUEUED
ata5.00: cmd 61/00:d8:00:35:6f/0a:00:ac:00:00/40 tag 27 ncq 1310720 out
         res 40/00:f4:00:53:6f/00:00:ac:00:00/40 Emask 0x10 (ATA bus error)
[..]
ata5: hard resetting link
ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata5.00: configured for UDMA/133
ata5: EH complete


I'm not yet ready to blame the card itself, since I remembered that I had recycled a pair of rather old SATA cables to connect the drives, and the card supports SATA 6G... The mainboard itself has just one SATA 6G connector, and for that one I used different cables that clip into the port, but the clip mechanism doesn't work with the connectors on the ASMedia card.

For now, I turned the SATA link speed down to 3G by adding a libata.force parameter to the kernel command line:

libata.force=5:3.0G,6:3.0G

(5 and 6 correspond to ata5 and ata6 in the libata kernel messages.)

This seems to work as a stopgap measure - the bus errors haven't reappeared since.

Before:

ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

With libata.force:

ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
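The SStatus and SControl values in these messages are printed in hexadecimal. In both SATA registers, bits 4-7 hold the speed field: the negotiated speed in SStatus, the allowed speed limit in SControl. A minimal Python sketch of that decoding follows; the helper name decode_spd is made up for illustration and is not part of any tool.

```python
# SPD field values as defined for the SATA SStatus/SControl registers.
SPEEDS = {0: None, 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_spd(register_hex):
    """Return the speed encoded in bits 4-7 of an SStatus or SControl value.

    register_hex is the hex string as printed by libata, e.g. "133".
    None means "nothing negotiated" (SStatus) or "no limit" (SControl).
    """
    value = int(register_hex, 16)
    spd = (value >> 4) & 0xF
    return SPEEDS.get(spd, "unknown (%d)" % spd)


# The two register pairs from the kernel messages above.
for line in ("SStatus 133 SControl 300", "SStatus 123 SControl 320"):
    _, sstatus, _, scontrol = line.split()
    print("link:", decode_spd(sstatus), "limit:", decode_spd(scontrol))
```

Read this way, SControl 300 means no speed limit is set, while SControl 320 shows the 3.0 Gbps limit requested via libata.force.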



syslog-ng and RcvbufErrors on Linux

Alexander Bochmann Tuesday 10 of May, 2016
We're running a syslog-ng installation to collect syslog data from quite a lot of systems (and then selectively feed them into our Splunk installation). Almost all of these send syslog via UDP.

Recently, when adding a couple more machines, I noticed that the syslog server was dropping UDP datagrams:

udp RcvbufErrors
# netstat -su | grep -A6 "^Udp:"
Udp:
    518026364 packets received
    36078 packets to unknown port received.
    23164168 packet receive errors
    1248583 packets sent
    RcvbufErrors: 23164167
UdpLite:


Yikes!

This is mentioned in the syslog-ng OSE docs, but it seems no one here ever got to that section, including myself.

So, in that context I learned about the so-rcvbuf() parameter to the udp() source in syslog-ng, and the Linux kernel net.core.rmem_max sysctl...

Kernel configuration
# sysctl -w net.core.rmem_max=16777216

(Add the same setting to /etc/sysctl.conf to make it persistent.)

syslog-ng.conf
source s_net {  
                udp(ip(0.0.0.0) port(514) so-rcvbuf(8388608)); 
};

(There's no reason why so-rcvbuf() couldn't be the same as rmem_max, and neither needs to be a multiple of 1024 - both just bad habits of mine...)
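To see how the kernel limit interacts with a requested buffer size, a small Python sketch can request a UDP receive buffer and read back what was granted. This is only an illustration, not part of the syslog-ng setup, and the function name granted_rcvbuf is hypothetical. On Linux, the value read back is double the requested one (the kernel adds room for bookkeeping overhead), and a request above net.core.rmem_max is silently capped.

```python
import socket


def granted_rcvbuf(requested):
    """Request a UDP receive buffer and return the size the kernel reports."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# Same size as the so-rcvbuf() value in the syslog-ng.conf snippet above.
wanted = 8388608
print("requested %d, kernel reports %d" % (wanted, granted_rcvbuf(wanted)))
```

If the reported value is far below twice the request, rmem_max is most likely still too low for the so-rcvbuf() setting.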

Don't increase net.core.rmem_default, as that would make the Linux kernel use a bigger buffer for every UDP socket being created on the system.

The RcvbufErrors counter hasn't been increasing since that change, but I'll add monitoring for that, so drops won't go unnoticed in the future.
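One possible shape for such a check, as a sketch only: the same counters that netstat -su prints are exposed in /proc/net/snmp as a header line and a value line, both prefixed with "Udp:". The function names and the sample text below are illustrative (the sample reuses the numbers from the netstat output above, and a real kernel prints more columns).

```python
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 518026364 36078 23164168 1248583 23164167 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
UdpLite: 0 0 0 0 0 0
"""


def udp_counters(snmp_text):
    """Parse the two "Udp:" lines of /proc/net/snmp into a dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) != 2:
        raise ValueError("expected one header and one value line for Udp")
    names, values = rows
    return dict(zip(names, (int(v) for v in values)))


def new_drops(previous, current):
    """Number of datagrams dropped for lack of buffer space between samples."""
    return current["RcvbufErrors"] - previous["RcvbufErrors"]


counters = udp_counters(SAMPLE)
print("RcvbufErrors:", counters["RcvbufErrors"])
```

On a live system, the text would come from reading /proc/net/snmp at regular intervals, with an alert whenever new_drops() returns a value above zero.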

killing your network with Cisco ASA 9.x identity NAT and proxy arp

Alexander Bochmann Sunday 17 of April, 2016
I was about to prepare a longer blog post on one of the pitfalls when migrating the NAT ruleset of an older Cisco ASA to a 9.x release - but as it turns out, the problem is already documented pretty well by Cisco, if you know what to look for...

With "Twice NAT", as implemented in 9.x software versions, an ASA firewall in routed mode will automatically do proxy ARP for all addresses covered by a NAT rule, to attract traffic for them. This is usually the intended effect, unless you're configuring Identity NAT rules (used to inhibit address translation for certain source/destination pairs) that cover address space locally connected to the firewall. This was not a problem with NAT exempt rules on older ASA software, but if such a rule is used now without the no-proxy-arp parameter, the ASA will act as a black hole for traffic on the local network segment by sending proxy ARP replies for addresses it doesn't own.

In Proxy ARP Problems with Identity NAT (cache), Cisco illustrates the problem with this diagram:

image copied from vendor documentation, (c) Cisco


Yeah, don't do that. Always consider whether no-proxy-arp is required for a NAT rule before it is deployed.
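That review step can be partly automated. The following Python sketch flags identity NAT ranges that overlap a network directly connected to the firewall, which is the situation where no-proxy-arp matters. The function name and all networks shown are hypothetical examples, and the check does not parse any ASA configuration.

```python
import ipaddress


def needs_no_proxy_arp(nat_network, connected_networks):
    """True if an identity NAT range overlaps a directly connected network."""
    nat = ipaddress.ip_network(nat_network)
    return any(nat.overlaps(ipaddress.ip_network(net))
               for net in connected_networks)


# Hypothetical interface networks of the firewall.
connected = ["192.168.10.0/24", "172.16.5.0/24"]

# Covers the first connected segment, so the rule would trigger proxy ARP there.
print(needs_no_proxy_arp("192.168.0.0/16", connected))

# Only remote address space, no overlap with a local segment.
print(needs_no_proxy_arp("10.20.0.0/16", connected))
```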

(Also see ASA FAQ: Why does the ASA reply to ARP requests for other IP addresses in the subnet? (cache).)

Cyanogenmod 12.1 device encryption fails after wiping filesystems with TWRP

Alexander Bochmann Wednesday 25 of November, 2015
I recently bought a 2nd hand Android mobile (Samsung) to install Cyanogenmod on. The process is quite straightforward from the documentation on the CM website. I installed TWRP using Heimdall and wiped the system partitions from the recovery before installing CM 12.1.

Once running Cyanogenmod, I wasn't able to activate device encryption though. I unsuccessfully tried several of the tips out there, like disabling SELinux before starting the encryption process. After retrying with an active adb logcat, I found this message in the log:

E/Cryptfs (  183): Orig filesystem overlaps crypto footer region.  Cannot encrypt in place.

...which in turn led me to this thread on the Cyanogenmod forums (cache). The hint to resize the data partition is correct, but it's not actually required to reformat the filesystem, as Android comes with a resize2fs. So I booted into TWRP recovery and connected to the system via adb shell. Turns out that /data is mounted on /dev/block/mmcblk0p24:

# df
[..]
/dev/block/mmcblk0p24
                       5584700    931020   4653680  17% /data
/dev/block/mmcblk0p24
                       5584700    931020   4653680  17% /sdcard

After unmounting /data and /sdcard, I had a quick look at the partition with tune2fs:

# tune2fs -l /dev/block/mmcblk0p24
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name:   
Last mounted on:          /data
Filesystem UUID:          17e3f4bc-acf2-631e-af53-921ea0c9e21a
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode filetype extent sparse_super large_file uninit_bg
Filesystem flags:         unsigned_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Remount read-only
Filesystem OS type:       Linux
Inode count:              355520
Block count:              1421307
Reserved block count:     0
Free blocks:              1163420
Free inodes:              353516
First block:              0
Block size:               4096
Fragment size:            4096
[..]

So, 1421307 blocks of 4096 bytes. Since the forum thread was not quite clear on how much space is required to facilitate encryption, I decided to shrink the filesystem by 8 blocks (32k):

# e2fsck -fy /dev/block/mmcblk0p24 
# resize2fs /dev/block/mmcblk0p24 1421299

...rebooted into CM, and successfully activated system encryption without further problems.
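The block count passed to resize2fs above is plain arithmetic on the tune2fs values. As a short Python sketch (the function name is illustrative, and the 32 KiB reserve is the value chosen in this post, not a documented requirement):

```python
def blocks_after_shrink(block_count, block_size, reserve_bytes):
    """New filesystem size in blocks after freeing reserve_bytes at the end."""
    # Round the reserve up to whole filesystem blocks.
    reserve_blocks = -(-reserve_bytes // block_size)
    return block_count - reserve_blocks


# Block count and block size from the tune2fs output, 32 KiB reserve.
print(blocks_after_shrink(1421307, 4096, 32 * 1024))  # 1421299
```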

Checkpoint vpn debug - Cannot signal vpnd: No such process

Alexander Bochmann Wednesday 05 of August, 2015
Jotting this down as I've found no useful reference to the above error message on the net:

Trying to enable IKE debugging on a Checkpoint FW1 using vpn debug ikeon results in the error message Cannot signal vpnd: No such process.

This happens when the PID in /opt/CPsuite-Rxx/fw1/tmp/vpnd.pid has been overwritten. I've seen various people (including myself) run into this because they had typed vpnd debug instead of vpn debug at some point...

Solution: Overwrite vpnd.pid with the correct PID

pgrep vpnd > $FWDIR/tmp/vpnd.pid
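To check for this condition before touching the file, a PID file can be tested with signal 0, which delivers nothing and only reports whether the process exists. The Python sketch below assumes the PID file holds a single number; the function name is made up, and it is a generic check rather than anything Checkpoint ships.

```python
import os


def pid_file_is_stale(path):
    """Return True if the PID stored in path does not name a live process."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return True  # missing, unreadable or garbled PID file
    if pid <= 0:
        return True  # not a usable process ID
    try:
        os.kill(pid, 0)  # signal 0: existence check only
    except ProcessLookupError:
        return True  # ESRCH, i.e. "No such process"
    except PermissionError:
        return False  # process exists but belongs to another user
    return False


# Falls back to the current directory if FWDIR is not set.
pid_file = os.path.join(os.environ.get("FWDIR", "."), "tmp", "vpnd.pid")
print(pid_file, "stale:", pid_file_is_stale(pid_file))
```

Note that a live PID in the file is not proof that it belongs to vpnd, so comparing it against the pgrep output is still worthwhile.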

FortiOS 5.2 upgrade problems on Fortigate 80C

Alexander Bochmann Sunday 05 of July, 2015
Recently I tried upgrading my Fortigate 80C firewall to a current FortiOS (5.2.3) following the - supposedly - supported upgrade path, from 5.0.10.

Unfortunately I ran into the ehci_hcd 5035: fatal error that's been mentioned on the Fortinet forums in various places (here, for example) - system doesn't boot. Good thing it's possible to easily fall back to the previous release by booting from the backup partition. When you're connected to the console port, that is.

Today I found out FortiOS 5.2.3 can be installed after wiping the internal flash from the bootloader, using a serial console. My Fortigate had originally been installed with some FortiOS 4 release - I assume the boot disk layout has changed somewhere between releases, and the new image just doesn't fit.

Prerequisites:
  • a TFTP server holding the new firmware image, with an address in the 192.168.1.0/24 network and reachable via interface 1 of your Fortigate (mine wasn't, and I had to quickly shuffle some things around to recover from that)
  • a USB stick with the current configuration to import after the upgrade has finished (or just put it on the TFTP server, too)

First, select

[F]: Format boot device.

from the bootloader menu. As soon as that is finished, use

[G]: Get firmware image from TFTP server.

to fetch the new firmware image via tftp. The system will reboot with a default configuration. Log in with the admin account (no password) and restore your configuration from the USB stick:

config global
execute restore config usb <filename>

Done.

ATEN support - a positive surprise

Alexander Bochmann Friday 17 of October, 2014
In my previous post - quite some time ago - I was bitching about the scarcely documented RADIUS authorization function on one of our ATEN SN0116 serial console servers.

Some time later, I found out that RADIUS authentication doesn't quite work - only one user could log on when using RADIUS, and subsequent login attempts were denied. At the same time, there was no such problem when using local user accounts on the system.

Initially, I didn't bother opening a support request with ATEN for this issue, because I didn't expect anything to happen (no support contracts and all). Nevertheless, just before giving up on the RADIUS functionality, I registered one of our systems and described the problem.

After a bit of back and forth with the support representative on the ticket, I quickly got through both the "reporting this to engineering" and "engineering acknowledged the problem" stages, and received updated software only a couple of days later.

It's not yet up on the ATEN firmware download page, but I assume the upcoming version (the next after v3.1.303) will have the relevant bug fix.

At our company, we've been quite exhausted by our support experiences with the likes of Cisco and Juniper in the past months, so this was really a pleasant experience for a change. Thank you, ATEN.

RADIUS authorization on an ATEN SN0116 serial console server

Alexander Bochmann Friday 07 of March, 2014
For several days I've been puzzled by the documentation for RADIUS authorization on an ATEN Altusen SN0116 serial console server (cache), using Windows NPS as RADIUS server. The ATEN docs unhelpfully state, "On the RADIUS server, set the access rights for each user according to the attribute information in the table, below" - and then there's a list of flags that specify the authorization options.

The docs fail to mention in which RADIUS attribute these authorization flags are supposed to be returned to the console server, though.

After some twiddling, it turns out that the flags should be placed (in Microsoft NPS terms) as a string into a vendor-specific attribute with vendor-code 0 and vendor-attribute 0. Additionally, if the RADIUS policy configuration contains several vendor-specific attributes, it seems that the ATEN device only parses the first one that's returned by the server.
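For reference, this is roughly what such an attribute looks like on the wire. The Python sketch encodes a Vendor-Specific attribute (type 26) in the sub-attribute layout that RFC 2865 recommends. It assumes the server is set to send the attribute in that RFC-conformant form, and the flag string is a placeholder, not a real ATEN flag list.

```python
import struct

VENDOR_SPECIFIC = 26  # RADIUS attribute type, RFC 2865


def encode_vsa(vendor_id, vendor_type, value):
    """Encode a string as a RADIUS Vendor-Specific attribute."""
    data = value.encode("ascii")
    if len(data) > 247:
        raise ValueError("value too long for a single attribute")
    # Sub-attribute: vendor type, vendor length (includes these 2 octets), data.
    sub = struct.pack("!BB", vendor_type, len(data) + 2) + data
    # Outer attribute: type, total length, 4-octet vendor ID, sub-attribute.
    return struct.pack("!BBI", VENDOR_SPECIFIC, len(sub) + 6, vendor_id) + sub


# Vendor code 0 and vendor attribute 0 as described above;
# "example-flags" is a hypothetical placeholder.
print(encode_vsa(0, 0, "example-flags").hex())
```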

NPS configuration to make this work looks something like this:

MS NPS RADIUS configuration for ATEN serial console server

upgrading from php 5.2 to php 5.5 in one step...

Alexander Bochmann Friday 24 of January, 2014
...is actually not that hard (at least with something like my site, which is of limited complexity after all).

I ran into just a handful of problems:
  • php does not start at all after Zend Opcache is enabled (php.ini with zend_extension=opcache.so, opcache.enable=1, etc.)
Fatal Error Unable to allocate shared memory segment of 134217728 bytes: shmat: Cannot allocate memory (12)

On my not-quite-current OpenBSD system, the kern.shminfo.shmmax sysctl was set rather low by default. For Opcache to work, it needs to be large enough to provide the amount of memory configured in opcache.memory_consumption (which is actually the value from the error message).
The one-time fix is running sysctl -w kern.shminfo.shmmax=134217728; to make it permanent, add the equivalent setting to sysctl.conf.
  • error log is full of PHP Strict Standards and PHP Deprecated messages, and tweaking error_reporting in php.ini does not help at all
This actually cost me quite some time to fix, until it finally clicked after I repeatedly read over "Prior to PHP 5.4.0 E_STRICT was not included within E_ALL" in the php documentation: Some of the (old) PHP applications I run set their own error reporting values, and one offender in particular had @error_reporting(E_ALL); in its index.php... Changing that to @error_reporting(E_ALL ^ E_NOTICE ^ E_STRICT ^ E_DEPRECATED); keeps the log file clean (but I guess I'm not going to upgrade past php 5.5 anytime soon).
  • PHP Warning: strftime(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function.
That was unexpected, but easy to get rid of - obviously by actually setting date.timezone to a reasonable value in php.ini (CET in my case).
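The error_reporting change in the second item is bit arithmetic on PHP's error level constants. The Python sketch below replays it with the numeric values from the PHP documentation (E_ALL is 32767 as of PHP 5.4 and was 30719 in PHP 5.3). It is only meant to show what the XOR expression evaluates to.

```python
E_NOTICE = 8
E_STRICT = 2048
E_DEPRECATED = 8192
E_ALL_PHP54 = 32767  # E_ALL in PHP 5.4 and later, includes E_STRICT
E_ALL_PHP53 = 30719  # E_ALL in PHP 5.3, without E_STRICT

# The expression from index.php, evaluated with the PHP 5.4+ value of E_ALL.
mask = E_ALL_PHP54 ^ E_NOTICE ^ E_STRICT ^ E_DEPRECATED
print(mask)  # 22519

# With all three bits set in E_ALL, XOR gives the same result as AND-NOT.
print(mask == E_ALL_PHP54 & ~(E_NOTICE | E_STRICT | E_DEPRECATED))  # True

# XOR toggles bits: against the PHP 5.3 value it would switch E_STRICT on.
print(bool((E_ALL_PHP53 ^ E_STRICT) & E_STRICT))  # True
```

Since XOR toggles rather than clears, the form E_ALL & ~(E_NOTICE | E_STRICT | E_DEPRECATED) yields the same mask on PHP 5.4+ and also behaves as intended on older versions.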