RAID6-2016
{|
|
* Backup server "raib2" in the Ubstadt data center
* 2 TB device capacity - 3.5" form factor - SATA 6Gb/s - 7200 rpm, ext4 64bit
* 4 different manufacturers, to increase the variance in when the drives fail
* Fractal Design Define 7 case
* BeQuiet modular power supply
|}
== Serial Numbers ==
* HGST
** S/N: P6K4TJSV sdk
* WD
** S/N: WCC4M0SC7C9R sdc
** <s>S/N: WCC4M0SC7AR1</s> †'09.2019
** <s>S/N: WCC4M0XEZ7CH</s> †'12.2017
** S/N: WCC4M4KXAVNT sdb
* SEAGATE
** S/N: Z4Z2W81E sdd
** S/N: Z4Z32SNR sdi
** S/N: Z4Z2XNWC sdf
* Toshiba
** S/N: X7T0V1LAS sdg
** S/N: X5RAD3XGSTZ5 sdh
** S/N: X5RAD2GGSTZ5 sdj
** <s>S/N: Y5GHNDBTSTZ5</s> †'03.2018 †'11.2019
** S/N: Z8K7WMMAS *'11.2019 sde
* The drives sit in 3 HDD cages holding up to 4 drives each
* The picture shows 2 of the 3 cages
[[Datei:Raid6-A+B.jpg|350px]]
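On a live system, the serial-to-device mapping above can be cross-checked against udev's by-id symlinks. A minimal sketch, assuming a helper name of my own choosing (`list_serials`) and a directory parameter that on "raib2" would simply be left at its default, /dev/disk/by-id:

```shell
#!/bin/sh
# Print "serial -> device" pairs from the by-id symlinks that udev
# creates for SATA disks. $1 overrides the directory to scan
# (default: /dev/disk/by-id).
list_serials() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/scsi-SATA_*; do
        [ -e "$link" ] || continue
        # The serial is the last underscore-separated field of the link name.
        serial=$(basename "$link" | awk -F_ '{print $NF}')
        dev=$(basename "$(readlink -f "$link")")
        echo "$serial -> $dev"
    done
}
```

Called without arguments it would print lines such as `WD-WCC4M0SC7C9R -> sdc`, which can be compared against the list above.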
== Array Layout ==
{| class="wikitable"
|Slot\Property
!Role
!Device
!Serial number
!Label
|-
!10
|0
|sdb
|P6K4TJSV
|[[Datei:10-raib2.jpg|40px]]
|-
!9
|4
|sda
|X7T0V1LAS
|[[Datei:09-raib2.jpg|40px]]
|-
!8
|1
|sdc
|WD-WCC4M4KXAVNT
|[[Datei:08-raib2.jpg|40px]]
|-
!7
|3
|sdk
|Z4Z2XNWC
|[[Datei:07-raib2.jpg|40px]]
|-
!6
|2
|sdk
|WD-WCC4M0SC7C9R
|[[Datei:06-raib2.jpg|40px]]
|-
!5
|spare
|sde
|Z4Z2W81E
|[[Datei:05-raib2.jpg|40px]]
|-
!4
|5
|sdb
|X5RAD3XGS
|[[Datei:04-raib2.jpg|40px]]
|-
!3
|6
|sdd
|Z4Z32SNR
|[[Datei:03-raib2.jpg|40px]]
|-
!2
|8
|sdd
|Z8K7WMMAS
|[[Datei:02-raib2.jpg|40px]]
|-
!1
|7
|sdf
|X5RAD2GGS
|[[Datei:01-raib2.jpg|40px]]
|}
== Logbook ==
=== Initialization 2016 ===
[[Datei:20160202 141516.jpg|350px]]
* Purchase of 9x2 TB = 18 TB, for a total of 684.93 € (as of Feb 2016).
* 8 of the drives go into the RAID, one drive is set aside for a rainy day
** in stock: Toshiba DT01ACA S/N: X5RAD3XGSTZ5
=== Xsilence until 2020 ===
{|
|[[Datei:20191128 181955.jpg|200px]]
|[[Datei:20180328 172106.jpg|200px]]
|[[Datei:20191128 105954.jpg|200px]]
|}
=== Incident of 28.12.2017 ===
 [243680.637402] aacraid: Host adapter abort request (0,2,3,0)
 [243691.068772] sd 0:2:3:0: [sdi] tag#1 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
 [243691.068778] sd 0:2:3:0: [sdi] tag#1 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
 [243691.068786] blk_update_request: I/O error, dev sdi, sector 2064
 [243691.068788] md: super_written gets error=-5
 [243691.068793] md/raid:md127: Disk failure on sdi1, disabling device.
 md/raid:md127: Operation continuing on 7 devices.
 [243801.115324] aacraid: Host adapter abort request timed out
 [243801.115334] aacraid: Host adapter abort request (0,2,3,0)
 [243801.115384] aacraid: Host adapter reset request. SCSI hang ?
 [243921.593220] aacraid: Host adapter reset request timed out
 [243921.593230] sd 0:2:3:0: Device offlined - not ready after error recovery
 [243921.593233] sd 0:2:3:0: Device offlined - not ready after error recovery
 [243921.593248] sd 0:2:3:0: [sdi] tag#8 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
 [243921.593252] sd 0:2:3:0: [sdi] tag#8 CDB: Read(10) 28 00 04 a0 c4 00 00 02 00 00
 [243921.593256] blk_update_request: I/O error, dev sdi, sector 77644800
 [243921.593289] sd 0:2:3:0: [sdi] tag#11 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
 [243921.593292] sd 0:2:3:0: [sdi] tag#11 CDB: Read(10) 28 00 04 a0 c6 00 00 02 00 00
 [243921.593294] blk_update_request: I/O error, dev sdi, sector 77645312
 [416403.254386] hrtimer: interrupt took 29227 ns
 [853039.443372] sd 0:2:3:0: rejecting I/O to offline device
     (message repeated 17 times in total)
* I wanted to find out the drive's serial ID, but for this drive hwinfo --disk only returned:
 28: IDE 23.0: 10600 Disk
 [Created at block.245]
 Unique ID: ipPt.uEhVIzZ7wdA
 Parent ID: B35A.VPIkJrtnW73
 SysFS ID: /class/block/sdi
 SysFS BusID: 0:2:3:0
 SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
 Hardware Class: disk
 Model: "WDC WD20PURX-64P"
 Vendor: "WDC"
 Device: "WD20PURX-64P"
 Revision: "0A80"
 Driver: "aacraid", "sd"
 Driver Modules: "aacraid", "sd_mod"
 Device File: /dev/sdi
 Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
 Device Number: block 8:128-8:143
 Drive status: no medium
 Config Status: cfg=new, avail=yes, need=no, active=unknown
 Attached to: #15 (Serial Attached SCSI controller)
* it should have shown:
 28: IDE 23.0: 10600 Disk
 [Created at block.245]
 Unique ID: ipPt.dZvPpEVVaL9
 Parent ID: B35A.VPIkJrtnW73
 SysFS ID: /class/block/sdi
 SysFS BusID: 0:2:3:0
 SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
 Hardware Class: disk
 Model: "WDC WD20PURX-64P"
 Vendor: "WDC"
 Device: "WD20PURX-64P"
 Revision: "0A80"
 Serial ID: "WD-WCC4M0XEZ7CH"
 Driver: "aacraid", "sd"
 Driver Modules: "aacraid", "sd_mod"
 Device File: /dev/sdi
 Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
 Device Number: block 8:128-8:143
 Geometry (Logical): CHS 243201/255/63
 Size: 3907029168 sectors a 512 bytes
 Capacity: 1863 GB (2000398934016 bytes)
 Config Status: cfg=new, avail=yes, need=no, active=unknown
 Attached to: #15 (Serial Attached SCSI controller)
* So I locate the drive "WD-WCC4M0XEZ7CH", pull it out, and replace it with:
 28: IDE 23.0: 10600 Disk
 [Created at block.245]
 Unique ID: ipPt.IyRYgsTsxUD
 Parent ID: B35A.VPIkJrtnW73
 SysFS ID: /class/block/sdi
 SysFS BusID: 0:2:3:0
 SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
 Hardware Class: disk
 Model: "TOSHIBA DT01ACA2"
 Vendor: "TOSHIBA"
 Device: "DT01ACA2"
 Revision: "ABB0"
 Serial ID: "X5RAD3XGS"
 Driver: "aacraid", "sd"
 Driver Modules: "aacraid", "sd_mod"
 Device File: /dev/sdi
 Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_TOSHIBA_DT01ACA2_X5RAD3XGS, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
 Device Number: block 8:128-8:143
 Geometry (Logical): CHS 243201/255/63
 Size: 3907029168 sectors a 512 bytes
 Capacity: 1863 GB (2000398934016 bytes)
 Config Status: cfg=new, avail=yes, need=no, active=unknown
 Attached to: #15 (Serial Attached SCSI controller)
* Let's check what the status of the array is:
 raib2:~ # mdadm --detail /dev/md127
 /dev/md127:
 Version : 1.2
 Creation Time : Fri Oct 28 11:41:55 2016
 Raid Level : raid6
 Array Size : 11720294400 (11177.34 GiB 12001.58 GB)
 Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
 Raid Devices : 8
 Total Devices : 7
 Persistence : Superblock is persistent
 Intent Bitmap : Internal
 Update Time : Thu Dec 28 14:39:27 2017
 State : clean, degraded
 Active Devices : 7
 Working Devices : 7
 Failed Devices : 0
 Spare Devices : 0
 Layout : left-symmetric
 Chunk Size : 512K
 Consistency Policy : bitmap
 Name : raib2:0 (local to host raib2)
 UUID : 500aa0db:5aca5187:5617c3ff:dc97c2c4
 Events : 10316
 Number Major Minor RaidDevice State
 0 8 17 0 active sync /dev/sdb1
 1 8 33 1 active sync /dev/sdc1
 2 8 49 2 active sync /dev/sdd1
 3 8 65 3 active sync /dev/sde1
 - 0 0 4 removed
 5 8 113 5 active sync /dev/sdh1
 6 8 97 6 active sync /dev/sdg1
 7 8 81 7 active sync /dev/sdf1
* So the defective device is now definitely "removed"!
* Then adding a spare is all it takes:
 mdadm /dev/md127 --add-spare /dev/sdi1
* After the rebuild - which the command above starts automatically, since a device is "missing" - the drive is automatically added as a full "U" device!
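The rebuild triggered above can be watched from /proc/mdstat. A small sketch, with a helper name of my own (`md_progress`) and a file parameter so it is not tied to the live system:

```shell
#!/bin/sh
# Report resync/recovery progress from an mdstat dump.
# $1 is the file to read; on the live system pass /proc/mdstat.
md_progress() {
    grep -oE '(resync|recovery) = *[0-9.]+%' "$1" || echo "idle"
}
```

On "raib2" one would call `md_progress /proc/mdstat` (or simply `watch cat /proc/mdstat`) until the recovery line disappears.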
=== Incident of 06.03.2018 ===
==== A ====
 Feb 10 09:15:25 raib2 kernel: ata5.00: exception Emask 0x0 SAct 0x38 SErr 0x0 action 0x0
 Feb 10 09:15:25 raib2 kernel: ata5.00: irq_stat 0x40000008
 Feb 10 09:15:25 raib2 kernel: ata5.00: failed command: READ FPDMA QUEUED
 Feb 10 09:15:25 raib2 kernel: ata5.00: cmd 60/f0:20:00:d2:11/01:00:00:00:00/40 tag 4 ncq dma 253952 in
          res 51/40:80:70:d2:11/00:01:00:00:00/40 Emask 0x409 (media error) <F>
 Feb 10 09:15:25 raib2 kernel: ata5.00: status: { DRDY ERR }
 Feb 10 09:15:25 raib2 kernel: ata5.00: error: { UNC }
 Feb 10 09:15:25 raib2 kernel: ata5.00: configured for UDMA/133
 Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 Sense Key : Medium Error [current]
 Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
 Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 CDB: Read(10) 28 00 00 11 d2 00 00 01 f0 00
 Feb 10 09:15:25 raib2 kernel: blk_update_request: I/O error, dev sdd, sector 1167984
 Feb 10 09:15:25 raib2 kernel: ata5: EH complete
 Feb 10 09:15:29 raib2 kernel: ata5.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
 Feb 10 09:15:29 raib2 kernel: ata5.00: irq_stat 0x40000008
 Feb 10 09:15:29 raib2 kernel: ata5.00: failed command: READ FPDMA QUEUED
 Feb 10 09:15:29 raib2 kernel: ata5.00: cmd 60/08:08:70:d2:11/00:00:00:00:00/40 tag 1 ncq dma 4096 in
          res 51/40:08:70:d2:11/00:00:00:00:00/40 Emask 0x409 (media error) <F>
 Feb 10 09:15:29 raib2 kernel: ata5.00: status: { DRDY ERR }
 Feb 10 09:15:29 raib2 kernel: ata5.00: error: { UNC }
 Feb 10 09:15:29 raib2 kernel: ata5.00: configured for UDMA/133
 Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 Sense Key : Medium Error [current]
 Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
 Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 CDB: Read(10) 28 00 00 11 d2 70 00 00 08 00
 Feb 10 09:15:29 raib2 kernel: blk_update_request: I/O error, dev sdd, sector 1167984
 Feb 10 09:15:29 raib2 kernel: ata5: EH complete
 Feb 10 09:15:45 raib2 kernel: md/raid:md127: read error corrected (8 sectors at 1165936 on sdd1)
* smartd noticed the read error as well
 Feb 10 09:42:03 raib2 smartd[2004]: Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors
 Feb 10 09:42:03 raib2 smartd[2004]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 253 to 200
 Feb 10 09:42:04 raib2 smartd[2004]: Device: /dev/sdd [SAT], ATA error count increased from 0 to 2
* It is the disk "ata-TOSHIBA_DT01ACA200_Y5GHNDBTS"
* mdadm never announced that a drive had just been kicked out
* After a reboot it simply said:
 Feb 10 13:00:02 raib2 mdadm[3784]: NewArray event detected on md device /dev/md127
 Feb 10 13:00:02 raib2 mdadm[3784]: DegradedArray event detected on md device /dev/md127
* My solution was to bring 2 spares into the array:
* 2018: installed 2 drives for 130 euros in total:
** scsi-SATA_WDC_WD20EFRX-68E_WD-WCC4M4KXAVNT
** scsi-SATA_TOSHIBA_HDWD120_X7T0V1LAS
 mdadm /dev/md127 --add-spare /dev/sdf1
 mdadm /dev/md127 --add-spare /dev/sdg1
* After the first command the recovery started immediately, just as I had expected
* I took a quick look; it was still at "0%"
* The 2nd command took very long - I believe 40 seconds - before it was accepted, but after that everything was OK (recovery plus one spare!)
* But then a further incident occurred
==== B ====
 Mar 07 19:24:34 raib2 kernel: md: md127: recovery done.
 Mar 07 19:24:35 raib2 kernel: md: recovery of RAID array md127
* So immediately after the first recovery finished, a 2nd one started, again concerning RAID drive 6 - very strange - looks to me like a bug in the md software
* I will simply sit out the recovery and then turn the removed drive into a spare!
=== Hardware change of 26.03.2018 ===
* Removed the "Adaptec" controller, since it needs active cooling
* Installed an AOC SAS MC controller
* Purchased:
** cable SFF-8087 to 4x SATA
** cable SFF-8087 to 4x SAS with 5.25" power connector
* This made it possible to install an earlier mispurchase that had been lying around - a SAS drive ""
* Now 11 drives in the system
* RAID grown to 9 drives
* Number of spares raised to 2
* All drives housed inside the case again
=== Incident of 30.09.2019 ===
* After the reboot all drives are present, but "sdk1" has stolen the role (5) from "sdi1"
* md now thinks it is a raid0 system; the array is inactive
 /dev/md0:
 Version : 1.2
 Raid Level : raid0
 Total Devices : 11
 Persistence : Superblock is persistent
 State : inactive
 Name : raib2:0 (local to host raib2)
 UUID : 500aa0db:5aca5187:5617c3ff:dc97c2c4
 Events : 61369
 Number Major Minor RaidDevice
 - 8 17 - /dev/sdb1
 - 8 33 - /dev/sdc1
 - 8 49 - /dev/sdd1
 - 8 65 - /dev/sde1
 - 8 81 - /dev/sdf1
 - 8 97 - /dev/sdg1
 - 8 113 - /dev/sdh1
 - 8 129 - /dev/sdi1
 - 8 145 - /dev/sdj1
 - 8 161 - /dev/sdk1
 - 8 177 - /dev/sdl1
* Using mdadm --examine /dev/sd*1 I inspected each drive's role
* "k" was actually no longer part of the array, but carried the same identity "5" as "i"
* So my first idea was to shut down "k" completely!! "WD-WCC4M0SC7AR1"
* OK, that at least got the "evil" drive out of the way, but the problem was not solved
* Then I re-created the array:
* mdadm --create --assume-clean --verbose /dev/md0 --level=6 --raid-devices=9 /dev/sd[cdefghijk]1
* I then saw that a read check yielded practically 100% errors
* I saw that it had not used the old order of the drives at all - which is bad
* Nothing fit together any more; ext4.fsck reported millions of filesystem errors - it was hopeless
* I lost this array because I did not research how to preserve the partitions' roles during a "re-create"; with that it would surely have gone well
* Not too bad: it was only a backup system
* New RAID created "Mon Sep 30 11:27:05 2019"
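The lesson from this loss: before any --create over an existing array, record each member's role so the original order can be reproduced. A sketch under my own naming (`md_roles`), fed with a file of concatenated `mdadm --examine` output rather than the live devices:

```shell
#!/bin/sh
# Extract "device -> role" pairs from saved `mdadm --examine` output,
# so the original member order can be given to a later --create.
# $1 is a file with concatenated examine output
# (live: mdadm --examine /dev/sd?1 > roles.txt).
md_roles() {
    awk '/^\/dev\// { dev = $1; sub(/:$/, "", dev) }
         /Device Role/ { print dev, "->", $NF }' "$1"
}
```

The devices would then be passed to --create sorted by role, with the keyword `missing` in place of any absent member.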
=== Incident of 27.11.2019 ===
* Y5GHNDBTS, B2: the Toshiba is causing problems again after its 2018 reinstallation
* it was also responsible for the rebuild of the array
 [Wed Nov 27 14:54:39 2019] ata5.00: exception Emask 0x0 SAct 0x7ff0003f SErr 0x0 action 0x0
 [Wed Nov 27 14:54:39 2019] ata5.00: irq_stat 0x40000008
 [Wed Nov 27 14:54:39 2019] ata5.00: failed command: READ FPDMA QUEUED
 [Wed Nov 27 14:54:39 2019] ata5.00: cmd 60/40:00:00:d8:50/05:00:00:00:00/40 tag 0 ncq dma 688128 in
          res 51/40:40:00:d8:50/00:05:00:00:00/40 Emask 0x409 (media error) <F>
 [Wed Nov 27 14:54:39 2019] ata5.00: status: { DRDY ERR }
 [Wed Nov 27 14:54:39 2019] ata5.00: error: { UNC }
 [Wed Nov 27 14:54:39 2019] ata5.00: configured for UDMA/133
 [Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 [Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 Sense Key : Medium Error [current]
 [Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
 [Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 CDB: Read(10) 28 00 00 50 d8 00 00 05 40 00
 [Wed Nov 27 14:54:39 2019] blk_update_request: I/O error, dev sde, sector 5298176
 [Wed Nov 27 14:54:39 2019] ata5: EH complete
* As a routine measure I ran a complete read check
* sde then stood out with various read errors, which I saw via dmesg
* the read mismatch count had risen to 24
* sde's temperature was 4 degrees higher than the average of the other drives
* md apparently resolved various read errors by overwriting
* I interrupted the "check" (I think via "frozen") and issued a --replace
 [Wed Nov 27 14:36:08 2019] md/raid:md0: read error corrected (8 sectors at 5257216 on sde1)
 [Wed Nov 27 14:36:08 2019] md/raid:md0: read error corrected (8 sectors at 5260120 on sde1)
 [Wed Nov 27 14:39:48 2019] md/raid:md0: read error corrected (8 sectors at 17007032 on sde1)
 [Wed Nov 27 14:39:48 2019] md/raid:md0: read error corrected (8 sectors at 17000048 on sde1)
 [Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17007096 on sde1)
 [Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17018104 on sde1)
 [Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17018232 on sde1)
 [Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17018256 on sde1)
 [Wed Nov 27 14:41:17 2019] md: md0: data-check interrupted.
 [Wed Nov 27 14:41:28 2019] md/raid:md0: read error corrected (8 sectors at 18543120 on sde1)
 [Wed Nov 27 14:53:13 2019] md: recovery of RAID array md0
 [Wed Nov 27 14:55:10 2019] md/raid:md0: read error corrected (8 sectors at 5282632 on sde1)
 [Wed Nov 27 14:55:10 2019] md/raid:md0: read error corrected (8 sectors at 5282640 on sde1)
 [Wed Nov 27 14:55:11 2019] md/raid:md0: read error corrected (8 sectors at 5282392 on sde1)
 [Wed Nov 27 14:56:56 2019] md/raid:md0: read error corrected (8 sectors at 17023768 on sde1)
 [Wed Nov 27 14:59:22 2019] md/raid:md0: read error corrected (8 sectors at 18601648 on sde1)
 [Wed Nov 27 14:59:22 2019] md/raid:md0: read error corrected (8 sectors at 18604832 on sde1)
 [Wed Nov 27 14:59:29 2019] md/raid:md0: read error corrected (8 sectors at 18604992 on sde1)
 [Wed Nov 27 14:59:48 2019] md/raid:md0: read error corrected (8 sectors at 18598848 on sde1)
 [Wed Nov 27 14:59:49 2019] md/raid:md0: read error corrected (8 sectors at 18598856 on sde1)
 [Wed Nov 27 14:59:49 2019] md/raid:md0: read error corrected (8 sectors at 18598880 on sde1)
 [Wed Nov 27 14:59:49 2019] md/raid:md0: read error corrected (8 sectors at 18598928 on sde1)
 [Wed Nov 27 15:00:13 2019] md/raid:md0: read error corrected (8 sectors at 18612776 on sde1)
* This drive finally goes to recycling
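To confirm which member produced the corrected read errors above, the dmesg output can be tallied per device. A minimal sketch (the helper name `count_corrected` is my own; on the live system $1 would be a dmesg dump):

```shell
#!/bin/sh
# Count md "read error corrected" messages per member device in a
# dmesg dump ($1); the drive that dominates the count is the
# replacement candidate.
count_corrected() {
    grep -o 'read error corrected (8 sectors at [0-9]* on [a-z0-9]*)' "$1" |
        awk '{ gsub(/\)/, "", $NF); print $NF }' |
        sort | uniq -c
}
```

Here the output would be dominated by sde1, matching the decision to --replace and recycle the drive.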
=== Hardware change of 28.11.2019 ===
* Disposed of WCC4M0SC7AR1
* Disposed of Y5GHNDBTSTZ5
* Installed the 3rd cage, which means the front panel can no longer be used
* Newly purchased Z8K7WMMAS as spare (sde)
=== Incident of 29.11.2019 ===
* Scrubbing turns up errors
* After the "check" I am now doing a "repair" run
 raib2:~ # cat /sys/block/md0/md/mismatch_cnt
 88
 raib2:~ # cat /proc/mdstat
 Personalities : [raid6] [raid5] [raid4]
 md0 : active raid6 sdf1[0] sdg1[3] sdi1[4] sdk1[8] sdh1[9] sdj1[7] sde1[10](S) sdd1[6] sdb1[1] sdc1[5]
       13673676800 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
       [==>..................] resync = 14.1% (276989284/1953382400) finish=203.8min speed=137064K/sec
       bitmap: 0/15 pages [0KB], 65536KB chunk
 unused devices: <none>
* Result of another "repair" run after the "88" errors:
* This time there were "0" errors - the incident is resolved
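The check-then-repair cycle used here can be sketched as a small helper. This is my own scripting, not part of mdadm; the sysfs directory is a parameter so that on the live system one would pass /sys/block/md0/md:

```shell
#!/bin/sh
# After a "check", read mismatch_cnt and say whether a "repair" run is
# warranted. $1 is the md sysfs directory (live: /sys/block/md0/md).
check_mismatch() {
    cnt=$(cat "$1/mismatch_cnt")
    if [ "$cnt" -gt 0 ]; then
        echo "mismatch_cnt=$cnt: start repair with: echo repair > $1/sync_action"
    else
        echo "mismatch_cnt=0: array is consistent"
    fi
}
```

After the repair run, a second "check" should bring mismatch_cnt back to 0, as it did above.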
=== Incident of 11.12.2019 ===
* After a crash of the host controller software in kernel 5.3, the following picture presented itself
* The kernel was very new - so I installed openSUSE 15.1 again
* The array was inactive and the level was reported as "raid0"
 /dev/md0:
 Version : 1.2
 Raid Level : raid0
 Total Devices : 10
 Persistence : Superblock is persistent
 State : inactive
 Working Devices : 10
 Name : raib2:0 (local to host raib2)
 UUID : d014324b:85ea6d08:42120868:6465e2b2
 Events : 12812
 Number Major Minor RaidDevice
 - 8 161 - /dev/sdk1
 - 8 145 - /dev/sdj1
 - 8 129 - /dev/sdi1
 - 8 113 - /dev/sdh1
 - 8 97 - /dev/sdg1
 - 8 81 - /dev/sdf1
 - 8 65 - /dev/sde1
 - 8 49 - /dev/sdd1
 - 8 33 - /dev/sdc1
 - 8 17 - /dev/sdb1
* It was no longer "md0" but "md127"
* All 11 drives had NO role, only "-"
* I ran "examine" on all drives - some showed "AAAAAAAAA", some ".AA.AAAAA"
* But the core data - role and raid6 - were all OK; apart from that everything actually looked fine!
 mdadm --stop /dev/md127
 mdadm --assemble --force --uuid d014324b:85ea6d08:42120868:6465e2b2 /dev/md0
* Without "force" it did not work; nearly all drives were busy
* Hooray, the array was 100% back (even "clean"), without even needing --run
* I distrusted the result and first ran a read check
 mdadm --action=check /dev/md0
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294080-1061294088
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294088-1061294096
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294096-1061294104
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294104-1061294112
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294112-1061294120
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294120-1061294128
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294128-1061294136
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294136-1061294144
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294144-1061294152
 [Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294152-1061294160
 mismatch_count=13776
 mdadm --action=repair /dev/md0
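An --assemble --force like the one above is only reasonably safe when the members' superblock event counts are close together. A sketch for checking this beforehand (the helper `event_spread` is my own naming; $1 is a file of concatenated `mdadm --examine` output):

```shell
#!/bin/sh
# Compute the spread (max - min) of the superblock event counts in
# concatenated `mdadm --examine` output ($1). A small spread means an
# --assemble --force will lose little or nothing.
event_spread() {
    awk '/Events :/ {
             if (!n || $NF > max) max = $NF
             if (!n || $NF < min) min = $NF
             n++
         }
         END { print max - min }' "$1"
}
```

A spread of 0 means the members are perfectly in sync; small values are what makes the forced assembly end up "clean", as it did here.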
=== Incident of 17.12.2019 ===
 [Tue Dec 17 11:24:44 2019] sas: Enter sas_scsi_recover_host busy: 6 failed: 6
 [Tue Dec 17 11:24:44 2019] sas: trying to find task 0xffff8808518fec00
 [Tue Dec 17 11:24:44 2019] sas: sas_scsi_find_task: aborting task 0xffff8808518fec00
 [Tue Dec 17 11:24:44 2019] sas: sas_scsi_find_task: task 0xffff8808518fec00 is aborted
 [Tue Dec 17 11:24:44 2019] sas: sas_eh_handle_sas_errors: task 0xffff8808518fec00 is aborted
 [Tue Dec 17 11:24:44 2019] sas: trying to find task 0xffff88085292dc00
 [Tue Dec 17 11:24:44 2019] sas: sas_scsi_find_task: aborting task 0xffff88085292dc00
 [Tue Dec 17 11:25:04 2019] ../drivers/scsi/mvsas/mv_sas.c 1330:TMF task[1] timeout.
 [Tue Dec 17 11:25:04 2019] ../drivers/scsi/mvsas/mv_sas.c 1555:mvs_abort_task:rc= 5
 [Tue Dec 17 11:25:04 2019] sas: sas_scsi_find_task: querying task 0xffff88085292dc00
 [Tue Dec 17 11:25:25 2019] ../drivers/scsi/mvsas/mv_sas.c 1330:TMF task[80] timeout.
 [Tue Dec 17 11:25:25 2019] ../drivers/scsi/mvsas/mv_sas.c 1477:mvs_query_task:rc= 5
 [Tue Dec 17 11:25:25 2019] sas: sas_scsi_find_task: task 0xffff88085292dc00 failed to abort
 [Tue Dec 17 11:25:25 2019] sas: task 0xffff88085292dc00 is not at LU: I_T recover
 [Tue Dec 17 11:25:25 2019] sas: I_T nexus reset for dev 5000cca028b1d079
 [Tue Dec 17 11:25:27 2019] ../drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[1]:rc= 0
 [Tue Dec 17 11:25:27 2019] sas: task done but aborted
 [Tue Dec 17 11:25:27 2019] BUG: unable to handle kernel paging request at ffffffff815a1d70
 [Tue Dec 17 11:25:27 2019] IP: native_queued_spin_lock_slowpath+0x161/0x190
 [Tue Dec 17 11:25:27 2019] PGD 200e067 P4D 200e067 PUD 200f063 PMD 14000e1
 [Tue Dec 17 11:25:27 2019] Oops: 0003 [#1] SMP PTI
 [Tue Dec 17 11:25:27 2019] CPU: 2 PID: 247 Comm: scsi_eh_0 Not tainted 4.12.14-lp151.28.36-default #1 openSUSE Leap 15.1
 [Tue Dec 17 11:25:27 2019] Hardware name: Supermicro Super Server/X11SSL-F, BIOS 2.2a 05/24/2019
 [Tue Dec 17 11:25:27 2019] task: ffff880853ce0100 task.stack: ffffc900038dc000
 [Tue Dec 17 11:25:27 2019] RIP: 0010:native_queued_spin_lock_slowpath+0x161/0x190
 [Tue Dec 17 11:25:27 2019] RSP: 0018:ffffc900038dfcb0 EFLAGS: 00010082
 [Tue Dec 17 11:25:27 2019] RAX: ffffffff815a1d70 RBX: ffff8808518fec00 RCX: ffff8808779250c0
 [Tue Dec 17 11:25:27 2019] RDX: 0000000000002048 RSI: 000000008125d290 RDI: ffff8808518fec08
 [Tue Dec 17 11:25:27 2019] RBP: ffff8808529c0000 R08: 00000000000c0000 R09: 000000061a54a000
 [Tue Dec 17 11:25:27 2019] R10: ffff8808529c0f40 R11: 0000000000000001 R12: ffff8808529c00b0
 [Tue Dec 17 11:25:27 2019] R13: 0000000000000002 R14: 0000000000000002 R15: ffff8808539b8c00
 [Tue Dec 17 11:25:27 2019] FS: 0000000000000000(0000) GS:ffff880877900000(0000) knlGS:0000000000000000
 [Tue Dec 17 11:25:27 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [Tue Dec 17 11:25:27 2019] CR2: ffffffff815a1d70 CR3: 000000000200a001 CR4: 00000000003606e0
 [Tue Dec 17 11:25:27 2019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [Tue Dec 17 11:25:27 2019] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [Tue Dec 17 11:25:27 2019] Call Trace:
 [Tue Dec 17 11:25:27 2019] _raw_spin_lock+0x1d/0x20
 [Tue Dec 17 11:25:27 2019] mvs_slot_complete+0x8c/0x5d0 [mvsas]
 [Tue Dec 17 11:25:27 2019] mvs_int_rx+0x68/0x130 [mvsas]
 [Tue Dec 17 11:25:27 2019] mvs_do_release_task+0x37/0x100 [mvsas]
 [Tue Dec 17 11:25:27 2019] mvs_release_task+0xf9/0x110 [mvsas]
 [Tue Dec 17 11:25:27 2019] mvs_I_T_nexus_reset+0xac/0xc0 [mvsas]
 [Tue Dec 17 11:25:27 2019] sas_scsi_recover_host+0x27d/0xb20 [libsas]
 [Tue Dec 17 11:25:27 2019] ? __pm_runtime_resume+0x54/0x70
 [Tue Dec 17 11:25:27 2019] ? scsi_try_target_reset+0x90/0x90
 [Tue Dec 17 11:25:27 2019] scsi_error_handler+0xc7/0x5c0
 [Tue Dec 17 11:25:27 2019] ? __schedule+0x287/0x830
 [Tue Dec 17 11:25:27 2019] ? scsi_eh_get_sense+0x200/0x200
 [Tue Dec 17 11:25:27 2019] kthread+0x10d/0x130
 [Tue Dec 17 11:25:27 2019] ? kthread_create_worker_on_cpu+0x50/0x50
 [Tue Dec 17 11:25:27 2019] ret_from_fork+0x35/0x40
 [Tue Dec 17 11:25:27 2019] Code: c3 f3 90 4c 8b 09 4d 85 c9 74 f6 eb d2 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 c0 50 02 00 48 03 04 d5 00 e5 ed 81 <48> 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 85 c0 74 f7 4c 8b
 [Tue Dec 17 11:25:27 2019] Modules linked in: scsi_transport_iscsi af_packet br_netfilter bridge stp llc iscsi_ibft iscsi_boot_sysfs dmi_sysfs msr raid456 async_raid6_recov async_memcpy libcrc32c async_pq async_xor xor async_tx raid6_pq nls_iso8859_1 nls_cp437 vfat fat ipmi_ssif joydev iTCO_wdt iTCO_vendor_support hid_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd pcspkr md_mod igb i2c_i801 ptp pps_core dca mei_me usbhid mei intel_pch_thermal ie31200_edac fan thermal ipmi_si ipmi_devintf ipmi_msghandler video button pcc_cpufreq ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm xhci_pci xhci_hcd usbcore drm mvsas drm_panel_orientation_quirks
 [Tue Dec 17 11:25:27 2019] libsas ahci libahci scsi_transport_sas sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
 [Tue Dec 17 11:25:27 2019] CR2: ffffffff815a1d70
 [Tue Dec 17 11:25:27 2019] ---[ end trace 4a47a2eaecf2db5d ]---
 [Tue Dec 17 11:25:27 2019] RIP: 0010:native_queued_spin_lock_slowpath+0x161/0x190
 [Tue Dec 17 11:25:27 2019] RSP: 0018:ffffc900038dfcb0 EFLAGS: 00010082
 [Tue Dec 17 11:25:27 2019] RAX: ffffffff815a1d70 RBX: ffff8808518fec00 RCX: ffff8808779250c0
 [Tue Dec 17 11:25:27 2019] RDX: 0000000000002048 RSI: 000000008125d290 RDI: ffff8808518fec08
 [Tue Dec 17 11:25:27 2019] RBP: ffff8808529c0000 R08: 00000000000c0000 R09: 000000061a54a000
 [Tue Dec 17 11:25:27 2019] R10: ffff8808529c0f40 R11: 0000000000000001 R12: ffff8808529c00b0
 [Tue Dec 17 11:25:27 2019] R13: 0000000000000002 R14: 0000000000000002 R15: ffff8808539b8c00
 [Tue Dec 17 11:25:27 2019] FS: 0000000000000000(0000) GS:ffff880877900000(0000) knlGS:0000000000000000
 [Tue Dec 17 11:25:27 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [Tue Dec 17 11:25:27 2019] CR2: ffffffff815a1d70 CR3: 000000000200a001 CR4: 00000000003606e0
 [Tue Dec 17 11:25:27 2019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [Tue Dec 17 11:25:27 2019] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
* After the controller swap, the SAS drive unfortunately could no longer be connected
 /dev/md127:
 Version : 1.2
 Raid Level : raid0
 Total Devices : 9
 Persistence : Superblock is persistent
 State : inactive
 Working Devices : 9
 Name : raib2:0 (local to host raib2)
 UUID : d014324b:85ea6d08:42120868:6465e2b2
 Events : 13670
 Number Major Minor RaidDevice
 - 8 145 - /dev/sdj1
 - 8 129 - /dev/sdi1
 - 8 113 - /dev/sdh1
 - 8 97 - /dev/sdg1
 - 8 81 - /dev/sdf1
 - 8 65 - /dev/sde1
 - 8 49 - /dev/sdd1
 - 8 33 - /dev/sdc1
 - 8 17 - /dev/sdb1
=== Hardware change of 19.12.2019 ===
* Via a SAS-to-SATA adapter I was able to take the SAS drive back into the array as a spare
* The SATA cable is provided by the Adaptec, since I suspect that this controller copes better with a SAS drive
* So I removed the SATA connector from another drive and attached it to the SAS adapter
* The SATA port now needed for that drive is provided by a 4-port PCI Express host adapter card
=== Incident 16.05.2020 ===
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#4 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#0 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#0 CDB: Read(10) 28 00 8b fa f5 c8 00 00 08 00
[Sat May 16 16:41:05 2020] print_req_error: I/O error, dev sdd, sector 2348479944
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#4 CDB: Write(10) 2a 00 8b 8d 2d 00 00 00 90 00
[Sat May 16 16:41:05 2020] print_req_error: I/O error, dev sdd, sector 2341285120
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#9 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#9 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sat May 16 16:41:05 2020] print_req_error: I/O error, dev sdd, sector 2064
[Sat May 16 16:41:05 2020] md/raid:md127: Disk failure on sdd1, disabling device.
* Failure of sdd (WD-WCC4M0SC7C9R)
* Spare sdb (P6K4TJSV) goes into recovery
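The failing sector numbers in the log can be cross-checked against the SCSI CDB bytes that the kernel prints. A small sketch (my own illustration, not from the original log; plain byte arithmetic):

```python
# Illustrative sketch: in a SCSI Read(10)/Write(10) CDB, bytes 2-5 hold the
# big-endian LBA and bytes 7-8 the transfer length in blocks.
cdb = [0x28, 0x00, 0x8b, 0xfa, 0xf5, 0xc8, 0x00, 0x00, 0x08, 0x00]

lba = int.from_bytes(bytes(cdb[2:6]), "big")
length = int.from_bytes(bytes(cdb[7:9]), "big")

print(lba)     # 2348479944 - matches "I/O error, dev sdd, sector 2348479944"
print(length)  # 8 sectors requested
```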
=== Incident 05.06.2020 ===
[Fri Jun 5 12:35:52 2020] md: kicking non-fresh sda1 from array!
* a resync with the spare took place; afterwards everything seemed normal
* I took a closer look at sda1:
raib2:~ # mdadm --examine /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : d014324b:85ea6d08:42120868:6465e2b2
Name : raib2:0 (local to host raib2)
Creation Time : Mon Sep 30 11:27:05 2019
Raid Level : raid6
Raid Devices : 9
Avail Dev Size : 3906764943 sectors (1862.89 GiB 2000.26 GB)
Array Size : 13673676800 KiB (12.73 TiB 14.00 TB)
Used Dev Size : 3906764800 sectors (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=143 sectors
State : active
Device UUID : cc7b9892:466e800c:75e17d58:48b3ccce
Internal Bitmap : 8 sectors from superblock
Update Time : Sat May 23 15:59:44 2020
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : cee8350f - correct
Events : 22668
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
* I can't make sense of it; my reaction:
** <code>smartctl --test=long /dev/sda</code> (was successful!)
** <code>mdadm --zero-superblock /dev/sda1</code>
** <code>mdadm /dev/md127 --add-spare /dev/sda1</code>
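The `Array State` string from `--examine` can be evaluated mechanically. A hedged sketch (the helper function is my own, not an mdadm feature): RAID6 stays usable with up to two missing members.

```python
# Sketch: interpret mdadm's "Array State" string ('A' = active, '.' = missing).
def raid6_health(array_state: str) -> str:
    missing = array_state.count(".")
    if missing == 0:
        return "optimal"
    if missing <= 2:  # RAID6 survives the loss of any two members
        return f"degraded ({missing} missing)"
    return "failed"

print(raid6_health("AAAAAAAAA"))  # optimal
print(raid6_health(".AA.AAAAA"))  # degraded (2 missing)
```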
=== Incident of 17.07.2020 ===
[Fri Jul 17 16:50:19 2020] md: kicking non-fresh sdi1 from array!
[Fri Jul 17 16:50:19 2020] md: kicking non-fresh sdg1 from array!
* I believe a reboot brought the solution and there were no particular problems - though I'm not sure
=== Incident of 01.09.2020 ===
* sdg, the spare, is producing odd messages in dmesg
* I think I will get 2 new disks in October
=== Hardware change 10.09.2020 ===
* Dropped: Aerocool Strike-X One PC case (ATX, 7x 3.5" internal, 9x 5.25" external, 2x USB 2.0), black
* Dropped: XigmaTek 4-in-3 HDD cage
* Moved everything into a Fractal Define 7 in the storage layout
* Added photos of all disks
=== Check 2023 ===
* hm, strange - the spare was gone; I made sdj1 a spare again
== --detail ==
* As of 01.10.2020
/dev/md0:
Version : 1.2
Creation Time : Mon Sep 30 11:27:05 2019
Raid Level : raid6
Array Size : 13673676800 (12.73 TiB 14.00 TB)
Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
Raid Devices : 9
Total Devices : 10
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Oct 2 13:06:14 2020
State : active
Active Devices : 9
Working Devices : 10
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : raib2:0 (local to host raib2)
UUID : d014324b:85ea6d08:42120868:6465e2b2
Events : 36592
Number Major Minor RaidDevice State
11 8 129 0 active sync /dev/sdi1 P6K4TJSV
14 8 1 1 active sync /dev/sda1 Z4Z2W81E
12 8 97 2 active sync /dev/sdg1 M0SC7C9R
3 8 161 3 active sync /dev/sdk1 Z4Z2XNWC
4 8 113 4 active sync /dev/sdh1 7T0V1LAS
5 8 17 5 active sync /dev/sdb1 5RAD3XGS
13 8 49 6 active sync /dev/sdd1 Z4Z32SNR
7 8 81 7 active sync /dev/sdf1 5RAD2GGS
10 8 33 8 active sync /dev/sdc1 8K7WMMAS
9 8 145 - spare /dev/sdj1 M4KXAVNT
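The `--detail` numbers above can be sanity-checked with a little arithmetic (my own sketch, not part of the original page): a 9-device RAID6 keeps two devices' worth of parity, so the usable array size is (N-2) times the per-device size.

```python
# Quick cross-check of the mdadm --detail figures above.
raid_devices = 9
used_dev_size_kib = 1_953_382_400            # "Used Dev Size" in KiB

array_size_kib = (raid_devices - 2) * used_dev_size_kib
print(array_size_kib)                        # 13673676800, matching "Array Size"
print(round(array_size_kib / 2**30, 2), "TiB")  # 12.73 TiB
```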
Current version as of 6 March 2023, 22:33
Logbook
Initialization 2016
- Purchase of 9 x 2 TB = 18 TB, together costing €684.93 (as of Feb 2016).
- I use 8 of the disks in the RAID; one disk is put aside for a rainy day
- in stock: Toshiba DT01ACA S/N: X5RAD3XGSTZ5
Xsilence until 2020
Incident of 28.12.2017
[243680.637402] aacraid: Host adapter abort request (0,2,3,0)
[243691.068772] sd 0:2:3:0: [sdi] tag#1 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[243691.068778] sd 0:2:3:0: [sdi] tag#1 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[243691.068786] blk_update_request: I/O error, dev sdi, sector 2064
[243691.068788] md: super_written gets error=-5
[243691.068793] md/raid:md127: Disk failure on sdi1, disabling device.
                md/raid:md127: Operation continuing on 7 devices.
[243801.115324] aacraid: Host adapter abort request timed out
[243801.115334] aacraid: Host adapter abort request (0,2,3,0)
[243801.115384] aacraid: Host adapter reset request. SCSI hang ?
[243921.593220] aacraid: Host adapter reset request timed out
[243921.593230] sd 0:2:3:0: Device offlined - not ready after error recovery
[243921.593233] sd 0:2:3:0: Device offlined - not ready after error recovery
[243921.593248] sd 0:2:3:0: [sdi] tag#8 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[243921.593252] sd 0:2:3:0: [sdi] tag#8 CDB: Read(10) 28 00 04 a0 c4 00 00 02 00 00
[243921.593256] blk_update_request: I/O error, dev sdi, sector 77644800
[243921.593289] sd 0:2:3:0: [sdi] tag#11 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[243921.593292] sd 0:2:3:0: [sdi] tag#11 CDB: Read(10) 28 00 04 a0 c6 00 00 02 00 00
[243921.593294] blk_update_request: I/O error, dev sdi, sector 77645312
[416403.254386] hrtimer: interrupt took 29227 ns
[853039.443372] sd 0:2:3:0: rejecting I/O to offline device
[853039.443402] sd 0:2:3:0: rejecting I/O to offline device
[853039.443411] sd 0:2:3:0: rejecting I/O to offline device
[853039.443418] sd 0:2:3:0: rejecting I/O to offline device
[853039.443426] sd 0:2:3:0: rejecting I/O to offline device
[853039.443433] sd 0:2:3:0: rejecting I/O to offline device
[853039.443440] sd 0:2:3:0: rejecting I/O to offline device
[853039.443448] sd 0:2:3:0: rejecting I/O to offline device
[853039.443455] sd 0:2:3:0: rejecting I/O to offline device
[853039.443633] sd 0:2:3:0: rejecting I/O to offline device
[853039.443646] sd 0:2:3:0: rejecting I/O to offline device
[853039.443653] sd 0:2:3:0: rejecting I/O to offline device
[853039.443660] sd 0:2:3:0: rejecting I/O to offline device
[853039.443667] sd 0:2:3:0: rejecting I/O to offline device
[853039.443674] sd 0:2:3:0: rejecting I/O to offline device
[853039.443681] sd 0:2:3:0: rejecting I/O to offline device
[853039.443687] sd 0:2:3:0: rejecting I/O to offline device
- I wanted to find out the disk's serial ID, but hwinfo --disk only reported this for the disk:
28: IDE 23.0: 10600 Disk
[Created at block.245]
Unique ID: ipPt.uEhVIzZ7wdA
Parent ID: B35A.VPIkJrtnW73
SysFS ID: /class/block/sdi
SysFS BusID: 0:2:3:0
SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
Hardware Class: disk
Model: "WDC WD20PURX-64P"
Vendor: "WDC"
Device: "WD20PURX-64P"
Revision: "0A80"
Driver: "aacraid", "sd"
Driver Modules: "aacraid", "sd_mod"
Device File: /dev/sdi
Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
Device Number: block 8:128-8:143
Drive status: no medium
Config Status: cfg=new, avail=yes, need=no, active=unknown
Attached to: #15 (Serial Attached SCSI controller)
- but it should have shown:
28: IDE 23.0: 10600 Disk
[Created at block.245]
Unique ID: ipPt.dZvPpEVVaL9
Parent ID: B35A.VPIkJrtnW73
SysFS ID: /class/block/sdi
SysFS BusID: 0:2:3:0
SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
Hardware Class: disk
Model: "WDC WD20PURX-64P"
Vendor: "WDC"
Device: "WD20PURX-64P"
Revision: "0A80"
Serial ID: "WD-WCC4M0XEZ7CH"
Driver: "aacraid", "sd"
Driver Modules: "aacraid", "sd_mod"
Device File: /dev/sdi
Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
Device Number: block 8:128-8:143
Geometry (Logical): CHS 243201/255/63
Size: 3907029168 sectors a 512 bytes
Capacity: 1863 GB (2000398934016 bytes)
Config Status: cfg=new, avail=yes, need=no, active=unknown
Attached to: #15 (Serial Attached SCSI controller)
- so I pull out the disk "WD-WCC4M0XEZ7CH" and replace it with:
28: IDE 23.0: 10600 Disk
[Created at block.245]
Unique ID: ipPt.IyRYgsTsxUD
Parent ID: B35A.VPIkJrtnW73
SysFS ID: /class/block/sdi
SysFS BusID: 0:2:3:0
SysFS Device Link: /devices/pci0000:00/0000:00:01.1/0000:02:00.0/host0/target0:2:3/0:2:3:0
Hardware Class: disk
Model: "TOSHIBA DT01ACA2"
Vendor: "TOSHIBA"
Device: "DT01ACA2"
Revision: "ABB0"
Serial ID: "X5RAD3XGS"
Driver: "aacraid", "sd"
Driver Modules: "aacraid", "sd_mod"
Device File: /dev/sdi
Device Files: /dev/sdi, /dev/disk/by-id/scsi-330000d170092e908, /dev/disk/by-id/scsi-SATA_TOSHIBA_DT01ACA2_X5RAD3XGS, /dev/disk/by-id/wwn-0x30000d170092e908, /dev/disk/by-path/pci-0000:02:00.0-scsi-0:2:3:0
Device Number: block 8:128-8:143
Geometry (Logical): CHS 243201/255/63
Size: 3907029168 sectors a 512 bytes
Capacity: 1863 GB (2000398934016 bytes)
Config Status: cfg=new, avail=yes, need=no, active=unknown
Attached to: #15 (Serial Attached SCSI controller)
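The `/dev/disk/by-id` link names in the hwinfo output already encode the serial. A small sketch (my own helper; it assumes the serial is the part after the last underscore, which holds for the serials on this server since they contain no underscore themselves):

```python
# Sketch: recover the serial number from a /dev/disk/by-id symlink name
# of the form scsi-SATA_<model>_<serial>, instead of relying on hwinfo.
def serial_from_by_id(name: str) -> str:
    return name.rsplit("_", 1)[1]

print(serial_from_by_id("scsi-SATA_WDC_WD20PURX-64P_WD-WCC4M0XEZ7CH"))  # WD-WCC4M0XEZ7CH
print(serial_from_by_id("scsi-SATA_TOSHIBA_DT01ACA2_X5RAD3XGS"))        # X5RAD3XGS
```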
- let's check what the status of the array is:
raib2:~ # mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Fri Oct 28 11:41:55 2016
Raid Level : raid6
Array Size : 11720294400 (11177.34 GiB 12001.58 GB)
Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
Raid Devices : 8
Total Devices : 7
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Thu Dec 28 14:39:27 2017
State : clean, degraded
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 512K

Consistency Policy : bitmap

Name : raib2:0 (local to host raib2)
UUID : 500aa0db:5aca5187:5617c3ff:dc97c2c4
Events : 10316

Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
- 0 0 4 removed
5 8 113 5 active sync /dev/sdh1
6 8 97 6 active sync /dev/sdg1
7 8 81 7 active sync /dev/sdf1
- so the defective device is now definitely "removed"!
- then it is enough to add a spare:
mdadm /dev/md127 --add-spare /dev/sdi1
- after the rebuild - which the above command starts automatically, since a device is "missing" - it is automatically promoted to a full "U" device!
Incident of 06.03.2018
A
Feb 10 09:15:25 raib2 kernel: ata5.00: exception Emask 0x0 SAct 0x38 SErr 0x0 action 0x0
Feb 10 09:15:25 raib2 kernel: ata5.00: irq_stat 0x40000008
Feb 10 09:15:25 raib2 kernel: ata5.00: failed command: READ FPDMA QUEUED
Feb 10 09:15:25 raib2 kernel: ata5.00: cmd 60/f0:20:00:d2:11/01:00:00:00:00/40 tag 4 ncq dma 253952 in res 51/40:80:70:d2:11/00:01:00:00:00/40 Emask 0x409 (media error) <F>
Feb 10 09:15:25 raib2 kernel: ata5.00: status: { DRDY ERR }
Feb 10 09:15:25 raib2 kernel: ata5.00: error: { UNC }
Feb 10 09:15:25 raib2 kernel: ata5.00: configured for UDMA/133
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 Sense Key : Medium Error [current]
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 10 09:15:25 raib2 kernel: sd 5:0:0:0: [sdd] tag#4 CDB: Read(10) 28 00 00 11 d2 00 00 01 f0 00
Feb 10 09:15:25 raib2 kernel: blk_update_request: I/O error, dev sdd, sector 1167984
Feb 10 09:15:25 raib2 kernel: ata5: EH complete
Feb 10 09:15:29 raib2 kernel: ata5.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
Feb 10 09:15:29 raib2 kernel: ata5.00: irq_stat 0x40000008
Feb 10 09:15:29 raib2 kernel: ata5.00: failed command: READ FPDMA QUEUED
Feb 10 09:15:29 raib2 kernel: ata5.00: cmd 60/08:08:70:d2:11/00:00:00:00:00/40 tag 1 ncq dma 4096 in res 51/40:08:70:d2:11/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Feb 10 09:15:29 raib2 kernel: ata5.00: status: { DRDY ERR }
Feb 10 09:15:29 raib2 kernel: ata5.00: error: { UNC }
Feb 10 09:15:29 raib2 kernel: ata5.00: configured for UDMA/133
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 Sense Key : Medium Error [current]
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 10 09:15:29 raib2 kernel: sd 5:0:0:0: [sdd] tag#1 CDB: Read(10) 28 00 00 11 d2 70 00 00 08 00
Feb 10 09:15:29 raib2 kernel: blk_update_request: I/O error, dev sdd, sector 1167984
Feb 10 09:15:29 raib2 kernel: ata5: EH complete
Feb 10 09:15:45 raib2 kernel: md/raid:md127: read error corrected (8 sectors at 1165936 on sdd1)
- smartd noticed the read error as well
Feb 10 09:42:03 raib2 smartd[2004]: Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors
Feb 10 09:42:03 raib2 smartd[2004]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 253 to 200
Feb 10 09:42:04 raib2 smartd[2004]: Device: /dev/sdd [SAT], ATA error count increased from 0 to 2
- it is the disk "ata-TOSHIBA_DT01ACA200_Y5GHNDBTS"
- mdadm never announced that a disk had just been kicked out
- on a reboot it then simply said
Feb 10 13:00:02 raib2 mdadm[3784]: NewArray event detected on md device /dev/md127
Feb 10 13:00:02 raib2 mdadm[3784]: DegradedArray event detected on md device /dev/md127
- my solution was to bring 2 spares into the array:
- 2018: installed 2 disks for 130 euros in total:
- scsi-SATA_WDC_WD20EFRX-68E_WD-WCC4M4KXAVNT
- scsi-SATA_TOSHIBA_HDWD120_X7T0V1LAS
mdadm /dev/md127 --add-spare /dev/sdf1
mdadm /dev/md127 --add-spare /dev/sdg1
- After the first command the recovery started immediately, just as I had expected
- I had a quick look; it was still at "0%"
- The 2nd command took very long - I think 40 seconds - until it was accepted, but after that everything was OK (recovery plus one spare!)
- But then a further fault occurred
B
Mar 07 19:24:34 raib2 kernel: md: md127: recovery done.
Mar 07 19:24:35 raib2 kernel: md: recovery of RAID array md127
- so immediately after the first recovery finished, a 2nd one started, again concerning RAID drive 6 - very strange - looks to me like a bug in the md software
- I will simply sit out the recovery and then turn the removed drive into a spare!
Hardware change on 26.03.2018
- Removed the "Adaptec" since I have to cool it actively
- Installed an AOC SAS MC
- Bought:
- SFF-8087 to 4x SATA cable
- SFF-8087 to 4x SAS cable with 5.25" power connector
- This made it possible to install the mispurchased SAS disk "" that had been lying around
- Now 11 disks in the system
- RAID grown to 9 disks
- Number of spares raised to 2
- all disks housed inside the case again
Incident of 30.09.2019
- After the reboot all disks are present, but "sdk1" has stolen the role (5) from "sdi1"
- Now it thinks this is a raid0 system, and it sits at inactive
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 11
Persistence : Superblock is persistent
State : inactive
Name : raib2:0 (local to host raib2)
UUID : 500aa0db:5aca5187:5617c3ff:dc97c2c4
Events : 61369

Number Major Minor RaidDevice
- 8 17 - /dev/sdb1
- 8 33 - /dev/sdc1
- 8 49 - /dev/sdd1
- 8 65 - /dev/sde1
- 8 81 - /dev/sdf1
- 8 97 - /dev/sdg1
- 8 113 - /dev/sdh1
- 8 129 - /dev/sdi1
- 8 145 - /dev/sdj1
- 8 161 - /dev/sdk1
- 8 177 - /dev/sdl1
- With mdadm --examine /dev/sd*1 I looked at every single role
- it turned out "k" was in fact no longer part of the array, but carried the same identity "5" as "i"
- So the first idea was to switch "k" off completely!! "WD-WCC4M0SC7AR1"
- OK, with that the "evil" drive was at least gone, but the problem was not solved
- Then I re-created the array:
- mdadm --create --assume-clean --verbose /dev/md0 --level=6 --raid-devices=9 /dev/sd[cdefghijk]1
- I then saw that a read check yields practically 100% errors
- I saw that it had not used the old order of the disks at all, which is bad
- So nothing fit together; ext4.fsck reported millions of filesystem errors, it was hopeless
- I lost this array because I did not investigate further how to keep the partitions' roles across a "re-create"; with that it would surely have gone well
- Not too bad: it was only a backup system
- The new raid dates from "Mon Sep 30 11:27:05 2019"
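The lesson from this lost array can be sketched mechanically: before a re-create, read each partition's old "Device Role" with `mdadm --examine` and pass the partitions to `--create` in exactly that order. The role values below are hypothetical examples of what `--examine` would report, not the real ones from this incident:

```python
# Sketch: order partitions by their old "Device Role : Active device N"
# (hypothetical example values) before rebuilding the --create command line.
roles = {
    "/dev/sdc1": 3, "/dev/sdd1": 0, "/dev/sde1": 5, "/dev/sdf1": 1,
    "/dev/sdg1": 7, "/dev/sdh1": 2, "/dev/sdi1": 6, "/dev/sdj1": 4,
    "/dev/sdk1": 8,
}
ordered = [dev for dev, _ in sorted(roles.items(), key=lambda kv: kv[1])]
print("mdadm --create --assume-clean --level=6 --raid-devices=9 /dev/md0 \\")
print("      " + " ".join(ordered))
```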
Incident of 27.11.2019
- Y5GHNDBTS, B2: the Toshiba is causing trouble again after its re-installation in 2018
- it was also responsible for the rebuild of the array
[Wed Nov 27 14:54:39 2019] ata5.00: exception Emask 0x0 SAct 0x7ff0003f SErr 0x0 action 0x0
[Wed Nov 27 14:54:39 2019] ata5.00: irq_stat 0x40000008
[Wed Nov 27 14:54:39 2019] ata5.00: failed command: READ FPDMA QUEUED
[Wed Nov 27 14:54:39 2019] ata5.00: cmd 60/40:00:00:d8:50/05:00:00:00:00/40 tag 0 ncq dma 688128 in res 51/40:40:00:d8:50/00:05:00:00:00/40 Emask 0x409 (media error) <F>
[Wed Nov 27 14:54:39 2019] ata5.00: status: { DRDY ERR }
[Wed Nov 27 14:54:39 2019] ata5.00: error: { UNC }
[Wed Nov 27 14:54:39 2019] ata5.00: configured for UDMA/133
[Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 Sense Key : Medium Error [current]
[Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Nov 27 14:54:39 2019] sd 5:0:0:0: [sde] tag#0 CDB: Read(10) 28 00 00 50 d8 00 00 05 40 00
[Wed Nov 27 14:54:39 2019] blk_update_request: I/O error, dev sde, sector 5298176
[Wed Nov 27 14:54:39 2019] ata5: EH complete
- As a routine measure I ran a complete read check
- sde then stood out with assorted read errors, which I saw in dmesg
- the read-mismatch count had climbed to 24
- sde's temperature was 4 degrees above the average of the other disks
- md apparently resolved various read errors by overwriting
- I interrupted the "check" (I think via "frozen") and did a --replace
[Wed Nov 27 14:36:08 2019] md/raid:md0: read error corrected (8 sectors at 5257216 on sde1)
[Wed Nov 27 14:36:08 2019] md/raid:md0: read error corrected (8 sectors at 5260120 on sde1)
[Wed Nov 27 14:39:48 2019] md/raid:md0: read error corrected (8 sectors at 17007032 on sde1)
[Wed Nov 27 14:39:48 2019] md/raid:md0: read error corrected (8 sectors at 17000048 on sde1)
[Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17007096 on sde1)
[Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17018104 on sde1)
[Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17018232 on sde1)
[Wed Nov 27 14:39:49 2019] md/raid:md0: read error corrected (8 sectors at 17018256 on sde1)
[Wed Nov 27 14:41:17 2019] md: md0: data-check interrupted.
[Wed Nov 27 14:41:28 2019] md/raid:md0: read error corrected (8 sectors at 18543120 on sde1)
[Wed Nov 27 14:53:13 2019] md: recovery of RAID array md0
[Wed Nov 27 14:55:10 2019] md/raid:md0: read error corrected (8 sectors at 5282632 on sde1)
[Wed Nov 27 14:55:10 2019] md/raid:md0: read error corrected (8 sectors at 5282640 on sde1)
[Wed Nov 27 14:55:11 2019] md/raid:md0: read error corrected (8 sectors at 5282392 on sde1)
[Wed Nov 27 14:56:56 2019] md/raid:md0: read error corrected (8 sectors at 17023768 on sde1)
[Wed Nov 27 14:59:22 2019] md/raid:md0: read error corrected (8 sectors at 18601648 on sde1)
[Wed Nov 27 14:59:22 2019] md/raid:md0: read error corrected (8 sectors at 18604832 on sde1)
[Wed Nov 27 14:59:29 2019] md/raid:md0: read error corrected (8 sectors at 18604992 on sde1)
[Wed Nov 27 14:59:48 2019] md/raid:md0: read error corrected (8 sectors at 18598848 on sde1)
[Wed Nov 27 14:59:49 2019] md/raid:md0: read error corrected (8 sectors at 18598856 on sde1)
[Wed Nov 27 14:59:49 2019] md/raid:md0: read error corrected (8 sectors at 18598880 on sde1)
[Wed Nov 27 14:59:49 2019] md/raid:md0: read error corrected (8 sectors at 18598928 on sde1)
[Wed Nov 27 15:00:13 2019] md/raid:md0: read error corrected (8 sectors at 18612776 on sde1)
- this disk goes to the recycler for good
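What the "read error corrected" lines mean can be illustrated in a few lines (my own sketch, not md's actual code): when a read fails, md rebuilds the bad chunk from the remaining chunks of the stripe plus parity, then rewrites it.

```python
# Sketch: reconstruct a lost chunk from the surviving chunks plus XOR
# parity (P), the same principle md uses to "correct" a read error.
import functools
import operator

def xor(blocks):
    # byte-wise XOR across equally sized blocks
    return bytes(functools.reduce(operator.xor, t) for t in zip(*blocks))

data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
p = xor(data)                          # parity over the stripe
rebuilt = xor([data[0], data[2], p])   # data[1] is unreadable: recover it
assert rebuilt == data[1]
print("rebuilt chunk matches")
```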
Hardware change on 28.11.2019
- Scrapped WCC4M0SC7AR1
- Scrapped Y5GHNDBTSTZ5
- Installed the 3rd cage; as a result the front panel can no longer be used
- Newly bought Z8K7WMMAS as a spare (sde)
Incident of 29.11.2019
- Scrubbing turns up errors
- After the "check" I am now doing a "repair" run
raib2:~ # cat /sys/block/md0/md/mismatch_cnt
88
raib2:~ # cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[0] sdg1[3] sdi1[4] sdk1[8] sdh1[9] sdj1[7] sde1[10](S) sdd1[6] sdb1[1] sdc1[5]
      13673676800 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [==>..................]  resync = 14.1% (276989284/1953382400) finish=203.8min speed=137064K/sec
      bitmap: 0/15 pages [0KB], 65536KB chunk

unused devices: <none>
- Result of a further "repair" run after the "88" errors
- This time there were "0" errors; the fault is resolved
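The resync line from /proc/mdstat can be cross-checked with a little parsing (my own sketch; the regex only assumes the line format shown above):

```python
# Sketch: verify the percentage and ETA that md prints in /proc/mdstat.
import re

line = "resync = 14.1% (276989284/1953382400) finish=203.8min speed=137064K/sec"
done, total = map(int, re.search(r"\((\d+)/(\d+)\)", line).groups())
speed = int(re.search(r"speed=(\d+)K/sec", line).group(1))

print(round(100 * done / total, 1), "%")             # 14.2 % (md truncates to 14.1)
print(round((total - done) / speed / 60, 1), "min")  # 203.8 min, matching "finish="
```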
Incident of 11.12.2019
- After a crash of the host controller software in kernel 5.3, the following picture presented itself
- The kernel was very new - so I installed openSUSE 15.1 again
- The array was inactive, and the level was reported as "raid0"
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 10
Persistence : Superblock is persistent
State : inactive
Working Devices : 10
Name : raib2:0 (local to host raib2)
UUID : d014324b:85ea6d08:42120868:6465e2b2
Events : 12812

Number Major Minor RaidDevice
- 8 161 - /dev/sdk1
- 8 145 - /dev/sdj1
- 8 129 - /dev/sdi1
- 8 113 - /dev/sdh1
- 8 97 - /dev/sdg1
- 8 81 - /dev/sdf1
- 8 65 - /dev/sde1
- 8 49 - /dev/sdd1
- 8 33 - /dev/sdc1
- 8 17 - /dev/sdb1
- It was no longer "md0" but "md127"
- All 11 drives had NO role, just "-"
- I ran "examine" on all drives - some showed "AAAAAAAAA", some ".AA.AAAAA"
- But the core data, i.e. role and raid6, was all intact!
- so apart from that everything actually looked fine!
mdadm --stop /dev/md127
mdadm --assemble --force --uuid d014324b:85ea6d08:42120868:6465e2b2 /dev/md0
- Without "force" it would not work; nearly all drives were busy
- Hurray, the array was back 100% (even "clean"), even without --run
- I distrusted the whole thing and ran a read check first
mdadm --action=check /dev/md0

[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294080-1061294088
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294088-1061294096
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294096-1061294104
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294104-1061294112
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294112-1061294120
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294120-1061294128
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294128-1061294136
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294136-1061294144
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294144-1061294152
[Wed Dec 11 18:44:52 2019] md0: mismatch sector in range 1061294152-1061294160

mismatch_count=13776

mdadm --action=repair /dev/md0
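To put the mismatch count into perspective, a short back-of-the-envelope sketch (my own arithmetic; it assumes the count is in 512-byte sectors, consistent with the 8-sector ranges the log reports):

```python
# Sketch: how much data 13776 reported mismatch sectors actually cover.
mismatch_sectors = 13776

blocks_4k = mismatch_sectors // 8            # the log reports 8-sector (4 KiB) ranges
mib = mismatch_sectors * 512 / 2**20

print(blocks_4k)       # 1722 4-KiB ranges
print(round(mib, 2))   # 6.73 MiB
```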
Störung 17.12.2019
[Tue Dec 17 11:24:44 2019] sas: Enter sas_scsi_recover_host busy: 6 failed: 6 [Tue Dec 17 11:24:44 2019] sas: trying to find task 0xffff8808518fec00 [Tue Dec 17 11:24:44 2019] sas: sas_scsi_find_task: aborting task 0xffff8808518fec00 [Tue Dec 17 11:24:44 2019] sas: sas_scsi_find_task: task 0xffff8808518fec00 is aborted [Tue Dec 17 11:24:44 2019] sas: sas_eh_handle_sas_errors: task 0xffff8808518fec00 is aborted [Tue Dec 17 11:24:44 2019] sas: trying to find task 0xffff88085292dc00 [Tue Dec 17 11:24:44 2019] sas: sas_scsi_find_task: aborting task 0xffff88085292dc00 [Tue Dec 17 11:25:04 2019] ../drivers/scsi/mvsas/mv_sas.c 1330:TMF task[1] timeout. [Tue Dec 17 11:25:04 2019] ../drivers/scsi/mvsas/mv_sas.c 1555:mvs_abort_task:rc= 5 [Tue Dec 17 11:25:04 2019] sas: sas_scsi_find_task: querying task 0xffff88085292dc00 [Tue Dec 17 11:25:25 2019] ../drivers/scsi/mvsas/mv_sas.c 1330:TMF task[80] timeout. [Tue Dec 17 11:25:25 2019] ../drivers/scsi/mvsas/mv_sas.c 1477:mvs_query_task:rc= 5 [Tue Dec 17 11:25:25 2019] sas: sas_scsi_find_task: task 0xffff88085292dc00 failed to abort [Tue Dec 17 11:25:25 2019] sas: task 0xffff88085292dc00 is not at LU: I_T recover [Tue Dec 17 11:25:25 2019] sas: I_T nexus reset for dev 5000cca028b1d079 [Tue Dec 17 11:25:27 2019] ../drivers/scsi/mvsas/mv_sas.c 1435:mvs_I_T_nexus_reset for device[1]:rc= 0 [Tue Dec 17 11:25:27 2019] sas: task done but aborted [Tue Dec 17 11:25:27 2019] BUG: unable to handle kernel paging request at ffffffff815a1d70 [Tue Dec 17 11:25:27 2019] IP: native_queued_spin_lock_slowpath+0x161/0x190 [Tue Dec 17 11:25:27 2019] PGD 200e067 P4D 200e067 PUD 200f063 PMD 14000e1 [Tue Dec 17 11:25:27 2019] Oops: 0003 [#1] SMP PTI [Tue Dec 17 11:25:27 2019] CPU: 2 PID: 247 Comm: scsi_eh_0 Not tainted 4.12.14-lp151.28.36-default #1 openSUSE Leap 15.1 [Tue Dec 17 11:25:27 2019] Hardware name: Supermicro Super Server/X11SSL-F, BIOS 2.2a 05/24/2019 [Tue Dec 17 11:25:27 2019] task: ffff880853ce0100 task.stack: ffffc900038dc000 [Tue Dec 17 
11:25:27 2019] RIP: 0010:native_queued_spin_lock_slowpath+0x161/0x190 [Tue Dec 17 11:25:27 2019] RSP: 0018:ffffc900038dfcb0 EFLAGS: 00010082 [Tue Dec 17 11:25:27 2019] RAX: ffffffff815a1d70 RBX: ffff8808518fec00 RCX: ffff8808779250c0 [Tue Dec 17 11:25:27 2019] RDX: 0000000000002048 RSI: 000000008125d290 RDI: ffff8808518fec08 [Tue Dec 17 11:25:27 2019] RBP: ffff8808529c0000 R08: 00000000000c0000 R09: 000000061a54a000 [Tue Dec 17 11:25:27 2019] R10: ffff8808529c0f40 R11: 0000000000000001 R12: ffff8808529c00b0 [Tue Dec 17 11:25:27 2019] R13: 0000000000000002 R14: 0000000000000002 R15: ffff8808539b8c00 [Tue Dec 17 11:25:27 2019] FS: 0000000000000000(0000) GS:ffff880877900000(0000) knlGS:0000000000000000 [Tue Dec 17 11:25:27 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Tue Dec 17 11:25:27 2019] CR2: ffffffff815a1d70 CR3: 000000000200a001 CR4: 00000000003606e0 [Tue Dec 17 11:25:27 2019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [Tue Dec 17 11:25:27 2019] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [Tue Dec 17 11:25:27 2019] Call Trace: [Tue Dec 17 11:25:27 2019] _raw_spin_lock+0x1d/0x20 [Tue Dec 17 11:25:27 2019] mvs_slot_complete+0x8c/0x5d0 [mvsas] [Tue Dec 17 11:25:27 2019] mvs_int_rx+0x68/0x130 [mvsas] [Tue Dec 17 11:25:27 2019] mvs_do_release_task+0x37/0x100 [mvsas] [Tue Dec 17 11:25:27 2019] mvs_release_task+0xf9/0x110 [mvsas] [Tue Dec 17 11:25:27 2019] mvs_I_T_nexus_reset+0xac/0xc0 [mvsas] [Tue Dec 17 11:25:27 2019] sas_scsi_recover_host+0x27d/0xb20 [libsas] [Tue Dec 17 11:25:27 2019] ? __pm_runtime_resume+0x54/0x70 [Tue Dec 17 11:25:27 2019] ? scsi_try_target_reset+0x90/0x90 [Tue Dec 17 11:25:27 2019] scsi_error_handler+0xc7/0x5c0 [Tue Dec 17 11:25:27 2019] ? __schedule+0x287/0x830 [Tue Dec 17 11:25:27 2019] ? scsi_eh_get_sense+0x200/0x200 [Tue Dec 17 11:25:27 2019] kthread+0x10d/0x130 [Tue Dec 17 11:25:27 2019] ? 
kthread_create_worker_on_cpu+0x50/0x50 [Tue Dec 17 11:25:27 2019] ret_from_fork+0x35/0x40 [Tue Dec 17 11:25:27 2019] Code: c3 f3 90 4c 8b 09 4d 85 c9 74 f6 eb d2 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 c0 50 02 00 48 03 04 d5 00 e5 ed 81 <48> 89 08 8b 41 08 85 c0 75 09 f3 90 8b 41 08 85 c0 74 f7 4c 8b [Tue Dec 17 11:25:27 2019] Modules linked in: scsi_transport_iscsi af_packet br_netfilter bridge stp llc iscsi_ibft iscsi_boot_sysfs dmi_sysfs msr raid456 async_raid6_recov async_memcpy libcrc32c async_pq async_xor xor async_tx raid6_pq nls_iso8859_1 nls_cp437 vfat fat ipmi_ssif joydev iTCO_wdt iTCO_vendor_support hid_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd pcspkr md_mod igb i2c_i801 ptp pps_core dca mei_me usbhid mei intel_pch_thermal ie31200_edac fan thermal ipmi_si ipmi_devintf ipmi_msghandler video button pcc_cpufreq ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm xhci_pci xhci_hcd usbcore drm mvsas drm_panel_orientation_quirks [Tue Dec 17 11:25:27 2019] libsas ahci libahci scsi_transport_sas sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs [Tue Dec 17 11:25:27 2019] CR2: ffffffff815a1d70 [Tue Dec 17 11:25:27 2019] ---[ end trace 4a47a2eaecf2db5d ]--- [Tue Dec 17 11:25:27 2019] RIP: 0010:native_queued_spin_lock_slowpath+0x161/0x190 [Tue Dec 17 11:25:27 2019] RSP: 0018:ffffc900038dfcb0 EFLAGS: 00010082 [Tue Dec 17 11:25:27 2019] RAX: ffffffff815a1d70 RBX: ffff8808518fec00 RCX: ffff8808779250c0 [Tue Dec 17 11:25:27 2019] RDX: 0000000000002048 RSI: 000000008125d290 RDI: ffff8808518fec08 [Tue Dec 17 11:25:27 2019] RBP: ffff8808529c0000 R08: 00000000000c0000 R09: 000000061a54a000 [Tue Dec 17 11:25:27 2019] R10: ffff8808529c0f40 R11: 0000000000000001 R12: ffff8808529c00b0 [Tue Dec 17 11:25:27 2019] R13: 0000000000000002 R14: 0000000000000002 R15: 
ffff8808539b8c00 [Tue Dec 17 11:25:27 2019] FS: 0000000000000000(0000) GS:ffff880877900000(0000) knlGS:0000000000000000 [Tue Dec 17 11:25:27 2019] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [Tue Dec 17 11:25:27 2019] CR2: ffffffff815a1d70 CR3: 000000000200a001 CR4: 00000000003606e0 [Tue Dec 17 11:25:27 2019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [Tue Dec 17 11:25:27 2019] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
- nach Controller-Tausch, war leider die SAS Platte nicht mehr anschliessbar
/dev/md127: Version : 1.2 Raid Level : raid0 Total Devices : 9 Persistence : Superblock is persistent State : inactive Working Devices : 9 Name : raib2:0 (local to host raib2) UUID : d014324b:85ea6d08:42120868:6465e2b2 Events : 13670 Number Major Minor RaidDevice - 8 145 - /dev/sdj1 - 8 129 - /dev/sdi1 - 8 113 - /dev/sdh1 - 8 97 - /dev/sdg1 - 8 81 - /dev/sdf1 - 8 65 - /dev/sde1 - 8 49 - /dev/sdd1 - 8 33 - /dev/sdc1 - 8 17 - /dev/sdb1
Upgrade 19.12.2019
- Using a SAS-to-SATA adapter I was able to take the SAS disk back into the array as a spare
- The SATA cable comes from the Adaptec, since I suspect that this controller copes better with a SAS disk
- So I removed a SATA connector from another disk and attached it to the SAS adapter
- The SATA port now needed in its place is provided by a 4-port PCI Express host adapter card
Incident 16.05.2020
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#4 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#0 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#0 CDB: Read(10) 28 00 8b fa f5 c8 00 00 08 00
[Sat May 16 16:41:05 2020] print_req_error: I/O error, dev sdd, sector 2348479944
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#4 CDB: Write(10) 2a 00 8b 8d 2d 00 00 00 90 00
[Sat May 16 16:41:05 2020] print_req_error: I/O error, dev sdd, sector 2341285120
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#9 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
[Sat May 16 16:41:05 2020] sd 0:2:3:0: [sdd] tag#9 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sat May 16 16:41:05 2020] print_req_error: I/O error, dev sdd, sector 2064
[Sat May 16 16:41:05 2020] md/raid:md127: Disk failure on sdd1, disabling device.
- sdd failed (WD-WCC4M0SC7C9R)
- The spare sdb (P6K4TJSV) goes into recovery
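While the spare rebuilds, the progress can be read from /proc/mdstat. A minimal sketch; the sample text below stands in for the live file so the snippet is self-contained, and the percentage and speed figures are illustrative, not from this rebuild:

```shell
# Sketch: extract the rebuild progress from a /proc/mdstat snapshot.
# The sample stands in for `cat /proc/mdstat`; numbers are made up.
mdstat_sample='md127 : active raid6 sdb1[10] sdc1[1] sdd1[13]
      13673676800 blocks super 1.2 level 6, 512k chunk [9/8] [UUUUUUUU_]
      [=>...................]  recovery =  7.4% (145000000/1953382400) finish=250.1min speed=120000K/sec'
# Pick the "recovery = N%" token out of the status line.
progress=$(printf '%s\n' "$mdstat_sample" | grep -o 'recovery = *[0-9.]*%' | grep -o '[0-9.]*%')
echo "rebuild at $progress"
```

Against the live array the same pipeline would simply read `/proc/mdstat` instead of the sample variable.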
Incident 05.06.2020
[Fri Jun 5 12:35:52 2020] md: kicking non-fresh sda1 from array!
- A resync with the spare took place; afterwards everything seemed normal
- I then took a look at sda1:
raib2:~ # mdadm --examine /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d014324b:85ea6d08:42120868:6465e2b2
           Name : raib2:0  (local to host raib2)
  Creation Time : Mon Sep 30 11:27:05 2019
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 3906764943 sectors (1862.89 GiB 2000.26 GB)
     Array Size : 13673676800 KiB (12.73 TiB 14.00 TB)
  Used Dev Size : 3906764800 sectors (1862.89 GiB 2000.26 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262064 sectors, after=143 sectors
          State : active
    Device UUID : cc7b9892:466e800c:75e17d58:48b3ccce

Internal Bitmap : 8 sectors from superblock
    Update Time : Sat May 23 15:59:44 2020
  Bad Block Log : 512 entries available at offset 16 sectors
       Checksum : cee8350f - correct
         Events : 22668

         Layout : left-symmetric
     Chunk Size : 512K

    Device Role : Active device 2
    Array State : AAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
- I can't make sense of it; my reaction:
smartctl --test long /dev/sda     (completed successfully!)
mdadm --zero-superblock /dev/sda1
mdadm /dev/md127 --add-spare /dev/sda1
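The sequence above can be wrapped in a small guard so the stale superblock is only zeroed once the SMART long self-test has actually passed. A hedged sketch: `readd_disk` is a hypothetical helper of my own, not anything mdadm or smartctl provide, and the example call stays commented out because zeroing a superblock is destructive:

```shell
# Hypothetical helper sketching the sequence used above: start a long
# SMART self-test, and only on a clean verdict wipe the stale md
# superblock and re-add the partition as a spare.
readd_disk() {
    disk="$1"; part="$2"; array="$3"
    smartctl --test=long "$disk" || return 1
    # ...wait for the self-test to finish, then check its verdict:
    smartctl -l selftest "$disk" | grep -q 'Completed without error' || return 1
    mdadm --zero-superblock "$part" && mdadm "$array" --add-spare "$part"
}
# Example (destructive, do not run blindly):
# readd_disk /dev/sda /dev/sda1 /dev/md127
```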
Incident 17.07.2020
[Fri Jul 17 16:50:19 2020] md: kicking non-fresh sdi1 from array!
[Fri Jul 17 16:50:19 2020] md: kicking non-fresh sdg1 from array!
- I believe a reboot resolved it; as far as I remember there were no particular problems, but I am not sure
Incident 01.09.2020
- sdg, running as a spare, is producing odd messages in dmesg
- I expect to procure two new disks in October
Rebuild 10.09.2020
- Removed: Aerocool Strike-X One PC case (ATX, 7x 3.5" internal, 9x 5.25" external, 2x USB 2.0), black
- Removed: XigmaTek 4-in-3 HDD cage
- Moved into a Fractal Design Define 7 in its storage layout
- Photos of all disks added
Check 2023
- Hm, strange: the spare was gone; I made sdj1 a spare again
--detail
- Status as of 01.10.2020
/dev/md0:
           Version : 1.2
     Creation Time : Mon Sep 30 11:27:05 2019
        Raid Level : raid6
        Array Size : 13673676800 (12.73 TiB 14.00 TB)
     Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
      Raid Devices : 9
     Total Devices : 10
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Oct 2 13:06:14 2020
             State : active
    Active Devices : 9
   Working Devices : 10
    Failed Devices : 0
     Spare Devices : 1

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : raib2:0  (local to host raib2)
              UUID : d014324b:85ea6d08:42120868:6465e2b2
            Events : 36592

    Number   Major   Minor   RaidDevice State
      11       8      129        0      active sync   /dev/sdi1   P6K4TJSV
      14       8        1        1      active sync   /dev/sda1   Z4Z2W81E
      12       8       97        2      active sync   /dev/sdg1   M0SC7C9R
       3       8      161        3      active sync   /dev/sdk1   Z4Z2XNWC
       4       8      113        4      active sync   /dev/sdh1   7T0V1LAS
       5       8       17        5      active sync   /dev/sdb1   5RAD3XGS
      13       8       49        6      active sync   /dev/sdd1   Z4Z32SNR
       7       8       81        7      active sync   /dev/sdf1   5RAD2GGS
      10       8       33        8      active sync   /dev/sdc1   8K7WMMAS

       9       8      145        -      spare         /dev/sdj1   M4KXAVNT
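Checks like the one above (all nine members active, one spare present) can also be scripted. A sketch that counts member states in a saved, deliberately abbreviated device table, so it runs without the array being present; against the live host one would pipe `mdadm --detail /dev/md0` in instead:

```shell
# Sketch: count active and spare members in a saved `mdadm --detail`
# device table (three rows abbreviated from the output above).
detail_sample='   11       8      129        0      active sync   /dev/sdi1
   14       8        1        1      active sync   /dev/sda1
    9       8      145        -      spare         /dev/sdj1'
active=$(printf '%s\n' "$detail_sample" | grep -c 'active sync')
spares=$(printf '%s\n' "$detail_sample" | grep -c ' spare ')
echo "active=$active spare=$spares"
```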