RAID6-2016

  • Server "raib2" in the Ubstadt data center
  • RAID 6 built from 8x 2 TB disks; I bought 9 hard disks for it:
    • 2 TB capacity - 3.5" form factor - SATA 6 Gb/s - 7200 rpm
    • 3 different manufacturers, to increase the variance in when the disks fail

Serial numbers

  • WD Purple
    • S/N: WCC4M0SC7C9R
    • S/N: WCC4M0SC7AR1 +'12.2017
    • S/N: WCC4M0XEZ7CH
  • SEAGATE BARRACUDA
    • S/N: Z4Z2W81E
    • S/N: Z4Z32SNR
    • S/N: Z4Z2XNWC
  • Toshiba DT01ACA
    • S/N: X5RAD3XGSTZ5
    • S/N: X5RAD2GGSTZ5
    • S/N: Y5GHNDBTSTZ5 +'03.2018
  • That makes 18 TB, costing 684.93 € in total (as of Feb 2016).
  • 8 of the disks go into the RAID; one disk is kept aside as a cold spare, just in case
    • in storage: Toshiba DT01ACA S/N: X5RAD3XGSTZ5
  • 2018: 2 disks installed, 130 euros for both:
    • scsi-SATA_WDC_WD20EFRX-68E_WD-WCC4M4KXAVNT
    • scsi-SATA_TOSHIBA_HDWD120_X7T0V1LAS
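The capacity and cost figures can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, the numbers come from these notes, and the cross-check uses the "Used Dev Size" and "Array Size" values (in KiB) from the mdadm --detail output further down.

```python
# Figures taken from the notes; the variable names are illustrative only.
DISKS_BOUGHT = 9        # 8 array members + 1 cold spare
DISK_TB = 2
PRICE_EUR = 684.93
RAID_MEMBERS = 8
PARITY_DISKS = 2        # RAID 6 spends two disks' worth of space on parity

raw_tb = DISKS_BOUGHT * DISK_TB                       # everything that was bought
usable_tb = (RAID_MEMBERS - PARITY_DISKS) * DISK_TB   # net capacity of the array
eur_per_tb = PRICE_EUR / raw_tb

# Cross-check against mdadm --detail: 6 data disks x Used Dev Size (KiB)
USED_DEV_SIZE_KIB = 1953382400
array_size_kib = (RAID_MEMBERS - PARITY_DISKS) * USED_DEV_SIZE_KIB

print(raw_tb, usable_tb, round(eur_per_tb, 2), array_size_kib)
```

So of the 18 TB bought, 12 TB are usable in the array, and the computed array size matches the "Array Size" of 11720294400 that mdadm reports.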

Block B (top)

Block A (bottom)

The table lists the bays in order from bottom to top!

HDD - Cage

Bay  No.  SATA cable  Model             Serial           /dev/disk/by-id                                Device  RAID device
X3   sas  red         HUS724020ALS640   P6K4TJSV         scsi-SHGST_HUS724020ALS640_P6K4TJSV            sdk     SPARE
A1   1    B-P4        DT01ACA2          X5RAD2GGS        scsi-SATA_TOSHIBA_DT01ACA2_X5RAD2GGS           sdk     7
A2   2    B-P3        WD20PURX-64P      WD-WCC4M0SC7AR1  scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0SC7AR1     sde     REMOVED
A3   3    B-P2        ST2000DM001-1ER1  Z4Z32SNR         scsi-SATA_ST2000DM001-1ER1_Z4Z32SNR            sdd     5
A4   4    B-P1        DT01ACA2          X5RAD3XGS        scsi-SATA_TOSHIBA_DT01ACA2_X5RAD3XGS           sdc     4
X1   a    A-P1        WD20EFRX-68E      WD-WCC4M4KXAVNT  scsi-SATA_WDC_WD20EFRX-68E_WD-WCC4M4KXAVNT     sda     6
X2   b    A-P2        HDWD120           X7T0V1LAS        scsi-SATA_TOSHIBA_HDWD120_X7T0V1LAS            sdb     SPARE
S1   c    SSD         SSD 850           S1SMNSAG110166X  ata-Samsung_SSD_850_PRO_128GB_S1SMNSAG110166X  sdf     SYSTEM
B1   5    red         ST2000DM001-1ER1  Z4Z2XNWC         ata-ST2000DM001-1ER164_Z4Z2XNWC                sdj     3
B2   6    red         DT01ACA2          Y5GHNDBTS        ata-TOSHIBA_DT01ACA200_Y5GHNDBTS               sdi     2
B3   7    red         WD20PURX-64P      WD-WCC4M0SC7C9R  ata-WDC_WD20PURX-64P6ZY0_WD-WCC4M0SC7C9R       sdg     0
B4   8    red         ST2000DM001-1ER1  Z4Z2W81E         ata-ST2000DM001-1ER164_Z4Z2W81E                sdh     1

Setup and assembly

Production

Incident of 28.12.2017

[243680.637402] aacraid: Host adapter abort request (0,2,3,0)
[243691.068772] sd 0:2:3:0: [sdi] tag#1 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[243691.068778] sd 0:2:3:0: [sdi] tag#1 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[243691.068786] blk_update_request: I/O error, dev sdi, sector 2064
[243691.068788] md: super_written gets error=-5
[243691.068793] md/raid:md127: Disk failure on sdi1, disabling device.
               md/raid:md127: Operation continuing on 7 devices.
[243801.115324] aacraid: Host adapter abort request timed out
[243801.115334] aacraid: Host adapter abort request (0,2,3,0)
[243801.115384] aacraid: Host adapter reset request. SCSI hang ?
[243921.593220] aacraid: Host adapter reset request timed out
[243921.593230] sd 0:2:3:0: Device offlined - not ready after error recovery
[243921.593233] sd 0:2:3:0: Device offlined - not ready after error recovery
[243921.593248] sd 0:2:3:0: [sdi] tag#8 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[243921.593252] sd 0:2:3:0: [sdi] tag#8 CDB: Read(10) 28 00 04 a0 c4 00 00 02 00 00
[243921.593256] blk_update_request: I/O error, dev sdi, sector 77644800
[243921.593289] sd 0:2:3:0: [sdi] tag#11 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[243921.593292] sd 0:2:3:0: [sdi] tag#11 CDB: Read(10) 28 00 04 a0 c6 00 00 02 00 00
[243921.593294] blk_update_request: I/O error, dev sdi, sector 77645312
[416403.254386] hrtimer: interrupt took 29227 ns
[853039.443372] sd 0:2:3:0: rejecting I/O to offline device
[853039.443402] sd 0:2:3:0: rejecting I/O to offline device
[853039.443411] sd 0:2:3:0: rejecting I/O to offline device
[853039.443418] sd 0:2:3:0: rejecting I/O to offline device
[853039.443426] sd 0:2:3:0: rejecting I/O to offline device
[853039.443433] sd 0:2:3:0: rejecting I/O to offline device
[853039.443440] sd 0:2:3:0: rejecting I/O to offline device
[853039.443448] sd 0:2:3:0: rejecting I/O to offline device
[853039.443455] sd 0:2:3:0: rejecting I/O to offline device
[853039.443633] sd 0:2:3:0: rejecting I/O to offline device
[853039.443646] sd 0:2:3:0: rejecting I/O to offline device
[853039.443653] sd 0:2:3:0: rejecting I/O to offline device
[853039.443660] sd 0:2:3:0: rejecting I/O to offline device
[853039.443667] sd 0:2:3:0: rejecting I/O to offline device
[853039.443674] sd 0:2:3:0: rejecting I/O to offline device
[853039.443681] sd 0:2:3:0: rejecting I/O to offline device
[853039.443687] sd 0:2:3:0: rejecting I/O to offline device 
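To see at a glance which device and which sectors the kernel complained about, the I/O error lines can be pulled out of such a log. The following is a minimal sketch; the helper name `io_errors_by_device` is my own, and it only understands the "I/O error, dev X, sector N" wording seen in the log above.

```python
import re
from collections import defaultdict

# Matches e.g. "blk_update_request: I/O error, dev sdi, sector 2064"
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def io_errors_by_device(log_text):
    """Map each device name to the list of sectors reported as failed."""
    errors = defaultdict(list)
    for match in IO_ERROR.finditer(log_text):
        errors[match.group(1)].append(int(match.group(2)))
    return dict(errors)

# Three lines copied from the kernel log above
sample = """\
[243691.068786] blk_update_request: I/O error, dev sdi, sector 2064
[243921.593256] blk_update_request: I/O error, dev sdi, sector 77644800
[243921.593294] blk_update_request: I/O error, dev sdi, sector 77645312
"""
print(io_errors_by_device(sample))
```

On a live system the input would come from dmesg or the journal instead of a pasted string.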
  • I wanted to find out the serial ID of the disk, but by then hwinfo --disk only returned the following for it:

28: IDE 23.0: 10600 Disk

 [Created at block.245]
 Unique ID: ipPt.uEhVIzZ7wdA
 Parent ID: B35A.VPIkJrtnW73
 SysFS ID: /class/block/sdi
 SysFS BusID: 0:2:3:0
 SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
 Hardware Class: disk
 Model: "WDC WD20PURX-64P"
 Vendor: "WDC"
 Device: "WD20PURX-64P"
 Revision: "0A80"
 Driver: "aacraid", "sd"
 Driver Modules: "aacraid", "sd_mod"
 Device File: /dev/sdi
 Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
 Device Number: block 8:128-8:143
 Drive status: no medium
 Config Status: cfg=new, avail=yes, need=no, active=unknown
 Attached to: #15 (Serial Attached SCSI controller)
  • but it should have shown:

28: IDE 23.0: 10600 Disk

 [Created at block.245]
 Unique ID: ipPt.dZvPpEVVaL9
 Parent ID: B35A.VPIkJrtnW73
 SysFS ID: /class/block/sdi
 SysFS BusID: 0:2:3:0
 SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
 Hardware Class: disk
 Model: "WDC WD20PURX-64P"
 Vendor: "WDC"
 Device: "WD20PURX-64P"
 Revision: "0A80"
 Serial ID: "WD-WCC4M0XEZ7CH"
 Driver: "aacraid", "sd"
 Driver Modules: "aacraid", "sd_mod"
 Device File: /dev/sdi
 Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
 Device Number: block 8:128-8:143
 Geometry (Logical): CHS 243201/255/63
 Size: 3907029168 sectors a 512 bytes
 Capacity: 1863 GB (2000398934016 bytes)
 Config Status: cfg=new, avail=yes, need=no, active=unknown
 Attached to: #15 (Serial Attached SCSI controller)
  • so the disk I am looking for is "WD-WCC4M0XEZ7CH"
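Even without the "Serial ID" line, the serial is still visible in the "Device Files" line of the first hwinfo output, because the by-id name ends with it. A small sketch of that lookup follows. The function name is hypothetical, and it assumes that by-id names starting with "scsi-SATA_" or "ata-" end in "_<serial>" and that the serial itself contains no underscore.

```python
def serials_from_device_files(device_files):
    """Extract serial numbers from the by-id names in a hwinfo 'Device Files' value."""
    serials = set()
    for path in device_files.split(","):
        name = path.strip().rsplit("/", 1)[-1]    # last path component
        # Assumption: only these two by-id name styles carry the serial at the end
        if name.startswith(("scsi-SATA_", "ata-")) and "_" in name:
            serials.add(name.rsplit("_", 1)[1])
    return sorted(serials)

# 'Device Files' value copied from the hwinfo output of the failed disk
device_files = (
    "/dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, "
    "/dev/disk/by-id/scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH, "
    "/dev/disk/by-id/wwn-0x30000d170092e908, "
    "/dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0"
)
print(serials_from_device_files(device_files))
```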

It was pulled and replaced by:

28: IDE 23.0: 10600 Disk

 [Created at block.245]
 Unique ID: ipPt.IyRYgsTsxUD
 Parent ID: B35A.VPIkJrtnW73
 SysFS ID: /class/block/sdi
 SysFS BusID: 0:2:3:0
 SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
 Hardware Class: disk
 Model: "TOSHIBA DT01ACA2"
 Vendor: "TOSHIBA"
 Device: "DT01ACA2"
 Revision: "ABB0"
 Serial ID: "X5RAD3XGS"
 Driver: "aacraid", "sd"
 Driver Modules: "aacraid", "sd_mod"
 Device File: /dev/sdi
 Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_TOSHIBA_DT01ACA2_X5RAD3XGS, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
 Device Number: block 8:128-8:143
 Geometry (Logical): CHS 243201/255/63
 Size: 3907029168 sectors a 512 bytes
 Capacity: 1863 GB (2000398934016 bytes)
 Config Status: cfg=new, avail=yes, need=no, active=unknown
 Attached to: #15 (Serial Attached SCSI controller)
  • let me check the status of the array:

raib2:~ # mdadm --detail /dev/md127
/dev/md127:

          Version : 1.2
    Creation Time : Fri Oct 28 11:41:55 2016
       Raid Level : raid6
       Array Size : 11720294400 (11177.34 GiB 12001.58 GB)
    Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
     Raid Devices : 8
    Total Devices : 7
      Persistence : Superblock is persistent
    Intent Bitmap : Internal
      Update Time : Thu Dec 28 14:39:27 2017
            State : clean, degraded
   Active Devices : 7
  Working Devices : 7
   Failed Devices : 0
    Spare Devices : 0
           Layout : left-symmetric
       Chunk Size : 512K

Consistency Policy : bitmap


             Name : raib2:0  (local to host raib2)
             UUID : 500aa0db:5aca5187:5617c3ff:dc97c2c4
           Events : 10316
   Number   Major   Minor   RaidDevice State
      0       8       17        0      active sync   /dev/sdb1
      1       8       33        1      active sync   /dev/sdc1
      2       8       49        2      active sync   /dev/sdd1
      3       8       65        3      active sync   /dev/sde1
      -       0        0        4      removed
      5       8      113        5      active sync   /dev/sdh1
      6       8       97        6      active sync   /dev/sdg1
      7       8       81        7      active sync   /dev/sdf1
 
  • so the defective device is now 100% "removed"!
  • in that case it is enough to add a spare:
 mdadm /dev/md127 --add-spare /dev/sdi1
 
  • the command above starts the rebuild automatically because a device is "missing"; once the rebuild has finished, the disk is automatically added as a full "U" device!
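The check "which slot is missing?" can also be done by a script instead of by eye. The sketch below parses the device table at the end of an mdadm --detail dump; `removed_slots` is a name I made up, and the parser assumes the column layout shown above (Number, Major, Minor, RaidDevice, State).

```python
def removed_slots(detail_text):
    """Return the RaidDevice numbers whose state is 'removed'."""
    slots = []
    in_table = False
    for line in detail_text.splitlines():
        fields = line.split()
        if fields[:2] == ["Number", "Major"]:
            in_table = True            # the device table starts after this header
            continue
        if in_table and len(fields) >= 5 and fields[4] == "removed":
            slots.append(int(fields[3]))
    return slots

# Device table copied from the mdadm --detail output above
detail = """\
   Number   Major   Minor   RaidDevice State
      0       8       17        0      active sync   /dev/sdb1
      1       8       33        1      active sync   /dev/sdc1
      2       8       49        2      active sync   /dev/sdd1
      3       8       65        3      active sync   /dev/sde1
      -       0        0        4      removed
      5       8      113        5      active sync   /dev/sdh1
      6       8       97        6      active sync   /dev/sdg1
      7       8       81        7      active sync   /dev/sdf1
"""
print(removed_slots(detail))
```

An empty list means every slot is populated; here slot 4 is the one waiting for the new disk.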

Incident of 06.03.2018

Feb 10 09:15:25 raib2 kernel: ata5.00: exception Emask 0x0 SAct 0x38 SErr 0x0 action 0x0
Feb 10 09:15:25 raib2 kernel: ata5.00: irq_stat 0x40000008
Feb 10 09:15:25 raib2 kernel: ata5.00: failed command: READ FPDMA QUEUED
Feb 10 09:15:25 raib2 kernel: ata5.00: cmd 60/f0:20:00:d2:11/01:00:00:00:00/40 tag 4 ncq dma 253952 in
                                       res 51/40:80:70:d2:11/00:01:00:00:00/40 Emask 0x409 (media error) <F>
Feb 10 09:15:25 raib2 kernel: ata5.00: status: { DRDY ERR }
Feb 10 09:15:25 raib2 kernel: ata5.00: error: { UNC }
Feb 10 09:15:25 raib2 kernel: ata5.00: configured for UDMA/133
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 Sense Key : Medium Error [current]
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 CDB: Read(10) 28 00 00 11 d2 00 00 01 f0 00
Feb 10 09:15:25 raib2 kernel: blk_update_request: I/O error, dev sdd, sector 1167984
Feb 10 09:15:25 raib2 kernel: ata5: EH complete
Feb 10 09:15:29 raib2 kernel: ata5.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 10 09:15:29 raib2 kernel: ata5.00: irq_stat 0x40000008
Feb 10 09:15:29 raib2 kernel: ata5.00: failed command: READ FPDMA QUEUED
Feb 10 09:15:29 raib2 kernel: ata5.00: cmd 60/08:08:70:d2:11/00:00:00:00:00/40 tag 1 ncq dma 4096 in
                                       res 51/40:08:70:d2:11/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Feb 10 09:15:29 raib2 kernel: ata5.00: status: { DRDY ERR }
Feb 10 09:15:29 raib2 kernel: ata5.00: error: { UNC }
Feb 10 09:15:29 raib2 kernel: ata5.00: configured for UDMA/133
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 Sense Key : Medium Error [current]
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 CDB: Read(10) 28 00 00 11 d2 70 00 00 08 00
Feb 10 09:15:29 raib2 kernel: blk_update_request: I/O error, dev sdd, sector 1167984
Feb 10 09:15:29 raib2 kernel: ata5: EH complete
Feb 10 09:15:45 raib2 kernel: md/raid:md127: read error corrected (8 sectors at 1165936 on sdd1)
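The "CDB: Read(10)" lines in the kernel log carry the start sector and the length of the failed request. A minimal decoder, assuming the standard Read(10) layout (opcode 0x28, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8); `decode_read10` is just an illustrative name.

```python
def decode_read10(cdb_hex):
    """Return (lba, sector_count) of a SCSI Read(10) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # logical block address
    count = int.from_bytes(cdb[7:9], "big")    # number of blocks to read
    return lba, count

# Second failed read of sdd in the log above
lba, count = decode_read10("28 00 00 11 d2 70 00 00 08 00")
print(lba, count)

# md reported the corrected sectors relative to sdd1
md_sector_on_sdd1 = 1165936
print(lba - md_sector_on_sdd1)
```

The decoded LBA is the sector 1167984 from the I/O error line, and the length of 8 matches the "8 sectors" in the md message. The difference of 2048 to the md sector would fit a partition sdd1 that starts at sector 2048; that is an assumption, since the partition table is not shown here.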
  • smartd also noticed the read error
Feb 10 09:42:03 raib2 smartd[2004]: Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors
Feb 10 09:42:03 raib2 smartd[2004]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 253 to 200
Feb 10 09:42:04 raib2 smartd[2004]: Device: /dev/sdd [SAT], ATA error count increased from 0 to 2
  • the disk is "ata-TOSHIBA_DT01ACA200_Y5GHNDBTS"
  • mdadm never reported that a disk had been kicked out of the array
  • After a reboot, the log simply said
Feb 10 13:00:02 raib2 mdadm[3784]: NewArray event detected on md device /dev/md127
Feb 10 13:00:02 raib2 mdadm[3784]: DegradedArray event detected on md device /dev/md127
  • my solution was to bring 2 spares into the array:
  • scsi-SATA_WDC_WD20EFRX-68E_WD-WCC4M4KXAVNT
mdadm /dev/md127 --add-spare /dev/sdf1
mdadm /dev/md127 --add-spare /dev/sdg1
  • After the first command the recovery started immediately, as I had expected
  • I had a quick look at it, but it was still at "0%"
  • The 2nd command took a very long time, I think 40 seconds, before it was accepted; after that everything was OK (recovery running and one spare!)
  • But then a further incident occurred

Incident of 06.03.2018

Mar 07 19:24:34 raib2 kernel: md: md127: recovery done.
Mar 07 19:24:35 raib2 kernel: md: recovery of RAID array md127
  • so immediately after the first recovery had finished, a 2nd one started, and it was RAID drive 6 again. Very odd; it looks to me like a bug in the md software
  • I will simply sit out the recovery and then turn the removed drive into a spare!
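While sitting out a recovery, the progress can be read from /proc/mdstat. The sketch below extracts the percentage for one array. Note that the sample text is invented for illustration (the notes contain no mdstat dump), and `sync_progress` is a hypothetical helper that only looks for the usual "recovery = N%" or "resync = N%" wording.

```python
import re

PROGRESS = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%")

def sync_progress(mdstat_text, array="md127"):
    """Return (action, percent) for the array, or None if nothing is rebuilding."""
    in_array = False
    for line in mdstat_text.splitlines():
        if line and not line[0].isspace():
            # A new top-level entry begins: is it the array we care about?
            in_array = line.startswith(array + " ")
        elif in_array:
            match = PROGRESS.search(line)
            if match:
                return match.group(1), float(match.group(2))
    return None

# Invented sample in the usual /proc/mdstat layout, NOT taken from raib2
sample_mdstat = """\
Personalities : [raid6] [raid5] [raid4]
md127 : active raid6 sdg1[8] sdb1[0] sdc1[1] sdd1[2] sde1[3] sdh1[5] sdf1[7]
      11720294400 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [UUUUUU_U]
      [>....................]  recovery =  0.4% (7813529/1953382400) finish=310.5min speed=104400K/sec

unused devices: <none>
"""
print(sync_progress(sample_mdstat))
```

On the server itself the text would be read from /proc/mdstat, for example in a loop with a sleep, until the function returns None.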