左耳朵耗子: Some Thoughts on the GitLab Accidental Database Deletion Incident

聊聊架构 · WeChat official account · Architecture · 2017-02-02 20:24


Because GitLab made the full details of the incident public, it also received a lot of outside help. Simon Riggs, CTO of 2ndQuadrant, published a post on his blog, Dataloss at GitLab, with some very good advice:

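The replication hang seen in PostgreSQL 9.6 may be caused by bugs, which are being fixed.
A replication lag of 4 GB is normal for PostgreSQL and is not a problem.
Stopping a standby node cleanly makes the primary release its WALSender connections automatically, so there should be no need to reconfigure the primary's max_wal_senders parameter. However, when a standby is stopped, the primary's replication connections are not released right away, and a newly started standby consumes additional connections. He considers GitLab's setting of 32 connections too high; 2 to 4 is usually enough.
Also, GitLab's earlier setting of max_connections=8000 was too high; lowering it to 2000 is reasonable.
pg_basebackup first takes a checkpoint on the primary and only then starts copying data; this step takes about 4 minutes.
Deleting a database directory by hand is very dangerous and should be left to a program. He recommends the newly released repmgr.
Restoring backups is just as important, so it should also be handled by a dedicated tool. He recommends barman (which supports S3).
Testing backup and restore is a very important process.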
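
The two numeric recommendations above (2 to 4 for max_wal_senders, 2000 for max_connections) lend themselves to an automated check. Below is a minimal Python sketch, not something from Simon Riggs or GitLab: the function name review_settings, the RECOMMENDED_MAX table, and the sample text are all hypothetical, and the parser only handles simple `name = integer` lines rather than the full postgresql.conf syntax.

```python
import re

# Upper bounds taken from the advice quoted above; treat them as rules of
# thumb, not hard limits. The table itself is an illustrative assumption.
RECOMMENDED_MAX = {
    "max_wal_senders": 4,
    "max_connections": 2000,
}

# Matches simple "name = 123" lines only (a simplification of the real syntax).
_SETTING = re.compile(r"^\s*(\w+)\s*=\s*'?(\d+)'?")


def review_settings(conf_text):
    """Return a warning for each setting above its suggested upper bound."""
    warnings = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        match = _SETTING.match(line)
        if not match:
            continue
        name, value = match.group(1), int(match.group(2))
        limit = RECOMMENDED_MAX.get(name)
        if limit is not None and value > limit:
            warnings.append(
                f"{name} = {value} exceeds the suggested maximum of {limit}"
            )
    return warnings


# Hypothetical sample using the values mentioned in the text.
sample = """
max_connections = 8000   # value before the incident
max_wal_senders = 32
"""
for warning in review_settings(sample):
    print(warning)
```

A check like this could run in configuration review or CI, so that an unusually high value is questioned before it reaches production.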

From the look of it, part of the cause is probably that the GitLab team is not very familiar with PostgreSQL.

GitLab then opened a series of issues on its site. They are listed under Write post-mortem (the list may continue to be updated).

infrastructure#1094 – Update PS1 across all hosts to more clearly differentiate between hosts and environments
infrastructure#1095 – Prometheus monitoring for backups
infrastructure#1096 – Set PostgreSQL’s max_connections to a sane value
infrastructure#1097 – Investigate Point in time recovery & continuous archiving for PostgreSQL
infrastructure#1098 – Hourly LVM snapshots of the production databases
infrastructure#1099 – Azure disk snapshots of production databases
infrastructure#1100 – Move staging to the ARM environment
infrastructure#1101 – Recover production replica(s)
infrastructure#1102 – Automated testing of recovering PostgreSQL database backups
infrastructure#1103 – Improve PostgreSQL replication documentation/runbooks
infrastructure#1104 – Kick out SSH users inactive for N minutes
infrastructure#1105 – Investigate pgbarman for creating PostgreSQL backups
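
Several of these issues (monitoring for backups, automated testing of backup recovery) come down to noticing early when backups stop working. As a rough illustration only, the Python sketch below checks that the newest file in a backup directory is recent and non-empty. The function name check_latest_backup, its parameters, and the default thresholds are hypothetical, and a check like this does not replace an actual restore test.

```python
import os
import time


def check_latest_backup(backup_dir, max_age_seconds=24 * 3600,
                        min_size_bytes=1, now=None):
    """Return (ok, message) for the newest regular file in backup_dir.

    The thresholds are illustrative assumptions; pick values that match
    your own backup schedule and expected dump size.
    """
    now = time.time() if now is None else now
    files = [entry for entry in os.scandir(backup_dir) if entry.is_file()]
    if not files:
        return False, "no backup files found"

    newest = max(files, key=lambda entry: entry.stat().st_mtime)
    info = newest.stat()
    age = now - info.st_mtime

    # An empty or tiny file usually means the backup job failed silently.
    if info.st_size < min_size_bytes:
        return False, f"{newest.name} is only {info.st_size} bytes"
    # A stale file means the job has stopped running.
    if age > max_age_seconds:
        return False, f"{newest.name} is {age:.0f} seconds old"
    return True, f"{newest.name} looks fresh"
```

The boolean result could be exported as a metric or wired to an alert; that integration is left out here because it depends on the monitoring stack in use.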