OpenSUSE 42.3, Lenovo ThinkPad P70, data corruption issues?

Discussion in 'Linux Compatibility and Software' started by rlk, Aug 29, 2017.

  1. rlk

    rlk Notebook Consultant

    Reputations:
    17
    Messages:
    187
    Likes Received:
    65
    Trophy Points:
    41
    I recently bought a refurb ThinkPad P70 (Xeon E3-1505M v5, NVIDIA M4000M, UHD screen, 2x16GB RAM), and am having an odd data corruption issue with openSUSE Leap 42.3 (with either the stock 4.4.79+ kernel or a more recent 4.12). Specifically, anything that uses sockets -- network or UNIX domain -- gets sporadic data corruption. Sometimes it's one or two flipped bits, sometimes it's several bits within a single byte. The bad bytes appear to be roughly clustered, but each one is isolated: there are no runs of consecutive bad bytes.

    One pattern I did pick out is that the bottom 5 bits of every affected file offset (using scp to copy the files) are all ones; that is, every bad byte sits at an offset equal to 31 mod 32.
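    For anyone who wants to check their own files for the same pattern, here is a minimal Python sketch of the comparison. It is not the tool I used; the function names (diff_offsets, summarize) are made up for illustration, and the demo at the bottom uses synthetic data rather than a real pair of files.

```python
from collections import Counter


def diff_offsets(good: bytes, bad: bytes):
    """Return (offset, xor_mask) for every position where the two copies differ."""
    return [(i, g ^ b) for i, (g, b) in enumerate(zip(good, bad)) if g != b]


def summarize(diffs):
    """Tally the low 5 bits of each bad offset and the number of flipped bits per bad byte."""
    low5 = Counter(offset & 0x1F for offset, _ in diffs)
    flipped = Counter(bin(mask).count("1") for _, mask in diffs)
    return low5, flipped


if __name__ == "__main__":
    # Synthetic stand-in for a known-good file and its corrupted copy.
    good = bytes(range(256)) * 4
    bad = bytearray(good)
    bad[31] ^= 0x04   # one flipped bit at offset 31
    bad[95] ^= 0x81   # two flipped bits at offset 95
    low5, flipped = summarize(diff_offsets(good, bytes(bad)))
    print("low 5 bits of bad offsets:", dict(low5))
    print("flipped bits per bad byte:", dict(flipped))
```

    With real files you would read both copies in binary mode and pass their contents to diff_offsets. If every key in the first tally is 31 (0x1F), you are seeing the same pattern I did.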

    This happens with inbound, outbound, and local rsync (rsync's additional checks are what expose it): whether I rsync from a remote machine, rsync between two directories on the local machine (which uses a UNIX domain socket), or rsync to a remote machine, I get the same kinds of comparison or protocol failures. I've also seen it with inbound ftp (downloading RPMs, where of course there are checksums).

    I have not seen any errors with simple file copy using cp -r or tar through a pipe.
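    A socket-only reproducer that takes rsync and the network out of the picture might look like the sketch below. This is an assumption about how one could test it, not what I actually ran: it pushes random data through a local socket pair (AF_UNIX by default on Linux) and compares SHA-256 digests on both sides. The names socket_roundtrip and run_trials are hypothetical.

```python
import hashlib
import os
import socket
import threading


def socket_roundtrip(payload: bytes, chunk: int = 65536) -> bytes:
    """Send payload through a local socket pair and return the bytes that arrived."""
    tx, rx = socket.socketpair()  # defaults to AF_UNIX where available
    received = bytearray()

    def reader():
        # Drain the receiving end until the sender closes its side.
        while True:
            data = rx.recv(chunk)
            if not data:
                break
            received.extend(data)

    thread = threading.Thread(target=reader)
    thread.start()
    tx.sendall(payload)
    tx.close()  # closing the sending end signals EOF to the reader
    thread.join()
    rx.close()
    return bytes(received)


def run_trials(trials: int = 10, size: int = 1 << 20) -> int:
    """Return how many of the trials came back with a different SHA-256 digest."""
    mismatches = 0
    for _ in range(trials):
        payload = os.urandom(size)
        echoed = socket_roundtrip(payload)
        if hashlib.sha256(echoed).digest() != hashlib.sha256(payload).digest():
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatched trials:", run_trials())
```

    On a healthy machine the mismatch count should be zero. Since the corruption here is sporadic, it would probably take far more data than the defaults above to trigger it.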

    This does not happen if I boot Knoppix 7.7.1 (based on kernel 4.7.9).

    I have run a full pass of memtest86 with no errors. I have tried using just one DIMM at a time and changing which slot I use, and using the rear panel slots vs. the under-keyboard slots; no change in the symptoms. The BIOS is up to date, presumably with the microcode fix.

    I am presently running diagnostics; that will take a while longer. I am also going to try a vanilla kernel (provided by openSUSE RPMs) to see if that makes a difference.

    What I have to decide is whether to return it (and most likely eat a 15% restocking fee), get the mobo replaced under warranty (I'd have to ship it to Lenovo, presumably at my expense), or find a solution on my own. I didn't see any indication of this under Windows, which I don't plan to use, although I kept the SSD it came installed on. I have not been able to find anything on the net about a problem like this, either. It's an odd one: the symptoms look generally memory-ish, but it happened with two different DIMMs in different slots, and it's only happening with sockets. It also appears to be happening above the transport layer, since TCP checksums aren't catching it.
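    To spell out the checksum reasoning: TCP uses the 16-bit ones' complement Internet checksum (RFC 1071), and any single flipped bit in a segment changes that checksum, so a bit flipped between the sender computing it and the receiver verifying it should be rejected. Corruption that still reaches the application most likely happened outside that window. The sketch below is only an illustration of the arithmetic; it ignores the TCP pseudo-header, and some multi-bit error patterns can cancel out and go undetected.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data, as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def single_bit_flips_detected(data: bytes) -> bool:
    """Check that flipping any one bit of data changes the checksum."""
    original = internet_checksum(data)
    for byte_index in range(len(data)):
        for bit in range(8):
            corrupted = bytearray(data)
            corrupted[byte_index] ^= 1 << bit
            if internet_checksum(bytes(corrupted)) == original:
                return False
    return True


if __name__ == "__main__":
    sample = b"example payload!"
    print(hex(internet_checksum(sample)))
    print(single_bit_flips_detected(sample))
```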

    Anyone have any thoughts here?
     
  2. rlk

    rlk Notebook Consultant

    So, some more information overnight.

    It appears that if I remove the xf86-video-nouveau package and use the vanilla kernel (4.12.9-1.gf2ab6ba-vanilla), this problem goes away. This seems distinctly odd, and this holds even if I boot to runlevel 3 and never start the X server to begin with (and blacklist the nouveau kernel driver). However, with either nouveau installed or using the default kernel of the same vintage I have the data corruption issue I described.

    I ran a full pass of the Lenovo diagnostics in addition to memtest86, and found nothing. But it has me at a loss for explanation. The failure is robust against SSD and memory configuration, and appears confined to something both very specific and very general (use of sockets). It also happens regardless of whether I have hyperthreading enabled in the BIOS (and the BIOS is up to date in any event). But with two software changes, one of which should be completely unrelated, the problem appears to reliably go away.

    This is making me nervous; if I can't find an explanation and fix, I'll certainly have to return the machine even if I have to eat the restocking fee.
     
  3. rlk

    rlk Notebook Consultant

    OK, to close this off: it turned out to be a hardware problem with the motherboard that could be reproduced under Windows, although it was more difficult (perhaps because I couldn't get as much throughput through the 1Gb NIC). The nouveau driver and kernel version turned out to be red herrings; I was eventually able to reproduce the corruption even with nouveau removed and the vanilla kernel installed.

    It took two tries through Lenovo warranty service to get it fixed. The first time they simply reimaged it, evidently without reading my notes. The second time I made sure they understood what was going on, and it came back with a new motherboard. Lenovo's depot warranty service is pretty good; they overnight you a box with packaging instructions and materials and a mailing label, you take it to FedEx where it's overnighted to Lenovo, they fix it (both times it took only a day or so, although they say 5 business days), and they overnight it back.

    I don't particularly like having a newly refurbed laptop come with this, but it did need fairly extensive stress testing to trigger. I can't be certain that this never happens to other brands of laptops, and I'm otherwise quite happy with the machine. The display is much better than my M6500, and the keyboard also appears to be more robust. It's markedly faster for my main processing-intensive work, namely image editing; CPU alone is about 2.5x, with the GPU involved it's more like 6x. Still having software issues getting the M4000M to operate as it really should, but I'd probably have the same issue with just about any laptop of this class due to the weirdness of dual GPUs and Optimus.

    I've ordered a bracket for the 2.5" bay (why on earth do you need a special bracket just to install that?); I can't completely switch over until I have that, but at present I'm quite happy with the P70.
     