Revision history - refs/changes/00/259800/1 - origin: https://github.com/wikimedia/operations-puppet

Revision	Author	Date	Message	Commit Date
7e8a881	Brandon Black	17 December 2015, 21:11:53 UTC	Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"" This reverts commit fea45e166fdd6d1c5fca6a1e206e6d45077c0602. Bug: T121564 Bug: T96847 Change-Id: I57b36b5c2100900e4964daf0ac279fa46b734f5f	17 December 2015, 21:34:12 UTC
fa7cafb	Brandon Black	17 December 2015, 21:11:15 UTC	Text VCL: raise hfp TTL to 601s Change-Id: I7214813555a4340582752c0dddb515d360908970	17 December 2015, 21:34:04 UTC
37cd235	Andrew Otto	17 December 2015, 21:23:31 UTC	Use new kafka role for eventlogging service eventbus configuration Change-Id: I172a07b17267ea2cb97549b691b475adc6836c2e	17 December 2015, 21:23:31 UTC
2e16178	Andrew Otto	17 December 2015, 21:05:50 UTC	Make eventlogging files consumer role manage output directory The eventlogging module class only needs to manage daemon output, not any potential consumer output. Change-Id: I37da626a3f3c9bc79668f1d3a888bcabc2424e14	17 December 2015, 21:05:50 UTC
a90eb55	Andrew Otto	17 December 2015, 20:57:21 UTC	Don't include eventlogging::deployment::source in production yet Change-Id: I2a9a84b44b23f97dcb36dc8d10076e0ac9235f28	17 December 2015, 20:57:21 UTC
8f787f6	dzahn	17 December 2015, 20:47:36 UTC	mediawiki: fix puppet-lint warnings Change-Id: I649efbb8f4b21d29f61b9068899fda0ad2994c21	17 December 2015, 20:49:31 UTC
f5685c6	Andrew Otto	16 November 2015, 22:03:56 UTC	Puppetize eventlogging-service with systemd in role::eventbus Deployment via scap3. This patch works in beta. There is still work to be done, but the patch is getting too large. Further work will proceed in new changes. No changes on eventlog1001 according to https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1506/ TODO: Move role::eventbus::eventbus back to role::eventbus when T119042 works. TODO: Use role::kafka::* to get kafka config. Bug: T118780 Change-Id: I621de844ed7a5bd1ac532b52058925350d9e5337	17 December 2015, 20:10:18 UTC
c31ba3b	Andrew Otto	16 December 2015, 19:40:07 UTC	Move role::scap::target to scap::ferm, add scap::target define I don't see any reason for this to be a role class, especially if it is specifically needed for scap deployment servers to be able to deploy. This change adds a scap::target define, which simplifies the process of adding new scap::targets. Change-Id: Ia78d44b9b56ea165e9b584f8b30c0395da490f51	17 December 2015, 19:58:24 UTC
6202b92	Brandon Black	17 December 2015, 18:45:48 UTC	Text VCL: exclude lower-layer cache hits from hfp object creation (and also, move the hfp block to common code) Note the " hit(" regex match is the new X-Cache data format recently introduced. It will take up to 30 days for all old cache objects recorded as " hit (" to expire out and make this change fully effective. We can't trust the older "hit (" style because it's not a reliable indicator of a true cache hit on a real object Change-Id: I7241260f63d9fc22c3268332c67b82b7df3be424	17 December 2015, 18:58:25 UTC
a1b7921	Brandon Black	17 December 2015, 18:35:22 UTC	Revert "VCL: grace-mode only in frontend caches" This reverts commit 0f4de6da8c85a056007b54eae4082d9bd3d71848. Change-Id: Ie4110a7354c299161acf55ab09fb5ca8f08a8de5	17 December 2015, 18:55:23 UTC
d1be20f	Brandon Black	17 December 2015, 18:35:16 UTC	Revert "post-merge syntax bugfix for be768ad7c6" This reverts commit c63778ad12af6fdc75a6d53dcc88bb1f1ca697e0. Change-Id: Ib5644167c108d7c6601a99093f07a9996d51b3c4	17 December 2015, 18:55:08 UTC
5446c6e	Jaime Crespo	17 December 2015, 18:32:52 UTC	Remove extra space on data.yaml Change-Id: I898bea40ffe481dfbf4bcc9cf528c5102cba793f	17 December 2015, 18:34:58 UTC
73b1452	Chad Horohoe	30 November 2015, 18:21:55 UTC	Elastic: move merge_threads to hiera Change-Id: Id045c9e934a50cc8bd2eaf2ebcc58344cf8709e4	17 December 2015, 18:33:07 UTC
3d1c28e	Jaime Crespo	17 December 2015, 18:16:18 UTC	New generated key for jcrespo (jynus) -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 New generated key protected by a hardware token (generated on 2015-11-18): ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCWNAMYh92QyZNjHcTyoapyWSKkQBSFJVgKWNW+5of3fiJ0frczz9R+MW2RiRPjdh2VoOzEdMboRogr7O5I1D2x07cVYpTNYEx4cPmzg7xLKUqPY0zxJGZz7g2zlXr1RtiM21MTNiG+tF1ndnB3KYa1LB9fA8pSgQkGz+UjFWGg2/LD6tLzNA8yB+MjV0X+nEtC+i58L5nchMN/m3RsyfCGOnJxPAsOCbQpolITCKSVceRPI/FvBAbaaUidL7MvfkgFTUjf+NX2b25ZdIVYD4BVGHrkw3fFQpPYdidEyLMN/wnu5leZskoOnuMzn2AgHQEBdsrKeV/umdFq3SjGJkkR -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iEYEARECAAYFAlZMk0EACgkQT7LnGA8RGa0UGgCgtxtEeGl7HoJHBXxXRM3fSXUu Cr4Anjtf64lYsQw1hy5FEoHX0xQWlLGz =oEmt -----END PGP SIGNATURE----- Change-Id: Icc484d0f713d72a45fe9ec1e84232b3c2ca212dd	17 December 2015, 18:29:22 UTC
e3602de	Chad Horohoe	04 December 2015, 21:24:28 UTC	toollabs: pep8 fixes for pretty code :) Change-Id: I7701ed0848b3a5132ff4ad2de899a1769749746c	17 December 2015, 18:27:05 UTC
5578783	Chad Horohoe	08 December 2015, 19:36:44 UTC	Gerrit: move static assets to .cache. filenames This lets Gerrit serve them with 1y expires, perfect for things like the logo and background image that never change. Change-Id: Ie5aca12dafd20ca79fb024e35c71a87818c076e2	17 December 2015, 18:16:34 UTC
a0d4e43	Brandon Black	17 December 2015, 18:13:57 UTC	Followup missing bit from 70f4366dc Need to add +chfp in both layers... Change-Id: I376dd312505ce33727dc35f04b146752feac9c8f	17 December 2015, 18:14:33 UTC
00533a9	RobH	17 December 2015, 17:58:12 UTC	reclaiming calcium to spares reclaiming a system to spares T116790 Change-Id: Iecef4e75b49867bf7427bfb43156321c518e1a14	17 December 2015, 18:11:07 UTC
087a765	Brandon Black	17 December 2015, 17:43:47 UTC	VCL: Make X-Cache more accurate/informative, hopefully Comparing obj.hits to zero isn't really a reliable indicator of much of anything, so our X-Cache hit|miss information was pretty misleading before. This new code is based on the state diagram at: https://www.varnish-software.com/book/3/VCL_Basics.html#detailed-request-flow All requests pass through one of vcl_(hit|miss|pass) (except for vcl_pipe requests, but then those don't go through deliver to set up X-Cache either). Sometimes they pass through _hit|_miss and then through _pass, but usually just one of the three. With the new info, we can interpret as follows: hit - Hit on a real cached object (not a hit_for_pass object), definitely no backend fetch happens. miss - Missed the cache lookup (no hit_for_pass here either), backend fetch definitely required. pass - Either vcl_recv returned explicit pass, or there was a cache hit on a hit_for_pass object which caused a pass, backend fetch definitely required. hit+pass - "hit" happened as above, but then vcl_hit code did an explicit "return (pass)" miss+pass - "miss" happened as above, but then vcl_miss code did an explicit "return (pass)" Additionally, on the text/mobile clusters, if vcl_fetch created a new hit_for_pass object (due to e.g. beresp.ttl <= 0s), the above will be suffixed with "+chfp" for the request that created it. Whether that object actually gets stored for other hits to reference (as opposed to being anonymous and pointless) is an open question... Also, note there's a subtle change in X-Cache output so that we can tell the difference between new info and old cached info. The space before the obj.hits count is removed: Old: X-Cache: be_cp1053 hit (53) New: X-Cache: be_cp1053 hit(53) Change-Id: Ia37b469be518bb48c948cadbd1bb80dce14ea891	17 December 2015, 18:09:14 UTC
2d7a416	Marko Obrovac	17 December 2015, 18:02:17 UTC	Mathoid: Increase the number of workers temporarily There is a bug in Mathoid that when certain TeX input is given, the worker process dies. Until we do a hotfix, increase the number of workers temporarily to 50 to minimise the chances of all workers dying at the same time. Bug: T121762 Change-Id: Ice3619c27855952d0f5723bea322ea9b04ab36ea	17 December 2015, 18:07:12 UTC
2daebcd	Chad Horohoe	02 December 2015, 16:52:57 UTC	gmond_memcached.py: fix all kinds of pep8 warnings Mostly leading whitespace Change-Id: I2ef92498ea5b3d0a9c5ba0f280dae36f094f0e0e	17 December 2015, 18:03:03 UTC
5819824	Chad Horohoe	14 December 2015, 19:53:49 UTC	More fixmes for scap/manifests/scripts.pp Change-Id: I73956e777df544e053b235bd784996d09ab005b1	17 December 2015, 17:53:45 UTC
eedf937	Erik Bernhardson	16 December 2015, 01:40:24 UTC	[elasticsearch] Collect cluster health stats about shard movement The addition of relocating/initializing/unassigned shards statistics should give us better insight into when the cluster drops a node, and how it recovers from dropping that node. I would have thought this was uncommon, but we have dropped a node twice in the last 3 days and need better monitoring about what happens. Bug: T117284 Change-Id: I69788c5455115b5aa54167facfcb2dd83954e0bc	17 December 2015, 17:50:32 UTC
df47dcb	Erik Bernhardson	17 December 2015, 17:19:50 UTC	[elastic] Record count of searches rejected due to thread pool exhaustion If the cluster is too busy it will start to reject requests, and record them here. Most of the servers report between a few thousand and 100k rejected searches (since they were restarted in September). We should record these to help keep an eye on the cluster health. Change-Id: Idefb9c622ea1d4919f8dfd2f7350eed048e7dac2	17 December 2015, 17:46:47 UTC
1707ebf	Erik Bernhardson	09 December 2015, 23:32:53 UTC	Cron job to rebuild completion indices This will run once a week at 20 after midnight UTC. According to our graphs it looks like midnight to 7am is the least busy time for the cluster. This job takes about 20 hours to run serially, with 4 way parallelism it brings us down to about 12 hours and helps ensure we are not inducing extra latency during the busiest parts of the day. This supports the new completion suggester beta feature which is going into production on wikipedias thursday, dec 17. The initial indices have already been built, this is necessary to keep them updated in the future. Bug: T112028 Change-Id: I66c2723a366e988574b46ded4e1bdd9c3188a58e	17 December 2015, 17:42:37 UTC
c127ad9	Kunal Mehta	17 December 2015, 07:52:59 UTC	extdist: Split skindist log into a separate file The cron job runs can overlap, leading to combined log files, which are annoying to read through. Change-Id: I4bd5b5e81ad35b4d234ddd280dc861de00fdfd88	17 December 2015, 17:34:52 UTC
fec20c0	Bryan Davis	17 December 2015, 16:55:10 UTC	beta: Fix logstash::cluster_hosts The beta cluster ELK servers don't need to allow Elasticsearch cluster access to the production hosts. Since this is a single node cluster it technically doesn't need to allow access to any hosts at all, but the Puppet setup needs some value for the hiera variable lookup or catalog compilation will fail. Change-Id: I34c402e18353b1726fa0ed678688ede033a19eff	17 December 2015, 17:15:32 UTC
0945fcf	Bryan Davis	17 December 2015, 16:43:14 UTC	stashbot: Add missing logstash::cluster_hosts hiera data Needed by role::logstash::elasticsearch to open firewall access between Elasticsearch nodes. Change-Id: Iedb30012cae785745a0fb7fccd0987a99b5b31de	17 December 2015, 17:07:08 UTC
176d306	BBlack	17 December 2015, 16:59:58 UTC	Revert "VCL: differentiate hit-vs-hfp in X-Cache" This reverts commit a53c26aebf44bd5a9324f4dd72aed0a747190175. Change-Id: Ic394900b00a4f31f9efe231bb985309352682b60	17 December 2015, 17:00:12 UTC
cb49334	Alexandros Kosiaris	17 December 2015, 16:56:54 UTC	cxserver: no_proxy_entry is not an instance variable So remove the @ sign. Also remove the newline trimming at the end of each element in the list Change-Id: Ia31ef18ea96c4484acd0c38c4c31f2f8363eb334	17 December 2015, 16:56:54 UTC
4dc3f16	Alexandros Kosiaris	17 December 2015, 16:46:10 UTC	cxserver: Populate no_proxy_list correctly Have no_proxy_list as an array of elements that contains the domains a proxy should not be used for. Then interpolate it in yaml and populate the stanza. Provide data via hiera Change-Id: Ia4d21fdade1ea6d4c3a87e4598c61cbb43440c9e	17 December 2015, 16:52:35 UTC
378c1ac	Alexandros Kosiaris	17 December 2015, 16:41:12 UTC	Revert "CXServer: Do not use the proxy for RESTBase and Apertium" This reverts commit 8a1289d4bd1335f4f89042addb7a4c92f5693548. Change-Id: I70efa4fa4916d26e81a135e8897afd38e867dca9	17 December 2015, 16:52:35 UTC
6e0d820	Alexandros Kosiaris	17 December 2015, 16:40:50 UTC	Revert "CXServer: s/no_proxy/no_proxy_list/ in config" This reverts commit b1c8761eda7beee02657e3627d626a7f3a47fb58. Change-Id: If0d5600cdc4fb2d8cf3e045d2eacdc899e755e2c	17 December 2015, 16:52:35 UTC
a53c26a	Brandon Black	17 December 2015, 16:39:13 UTC	VCL: differentiate hit-vs-hfp in X-Cache Change-Id: Id5a1e35a05faba249765b85ce0e2e3495bfd1cc5	17 December 2015, 16:39:13 UTC
6230b5a	Brandon Black	17 December 2015, 16:14:16 UTC	text VCL: exempt all variants of Special:Banner.* There are Special:Banner requests of the form /w/index.php?title=Special:Banner too, so they should get this treatment as well... Change-Id: I08706b44d71cf20e7704dbe6531b6b4c975cafe1	17 December 2015, 16:14:36 UTC
5afb017	Moritz Muehlenhoff	15 December 2015, 10:53:08 UTC	Stop opendj on the former labs LDAP servers Also removes the monitoring of the LDAP ports. To be merged once most instances still accessing them are fixed. Change-Id: Ibdd7d898f8debe54fab2d74cb14e352da4a25d00	17 December 2015, 15:56:07 UTC
ea0733b	addshore	17 December 2015, 12:16:50 UTC	Grafana increase homepage dash list limits The Featured Wikidata dashboard recently fell off the bottom of the list which is by default limited to 10! (as we now have more than 10 featured dashes). 14 means we can again show all featured dashboards and does not actually increase the size of the list, simply uses the empty space at the bottom... Change-Id: Ibad9153dd52836042157e22a1bbab121c6b828f1	17 December 2015, 15:44:32 UTC
a0bd3f2	Jaime Crespo	17 December 2015, 15:33:01 UTC	Fix typo on I8f72fda4983 s/codfw/eqiad/. No server was harmed in the process. Change-Id: I12ad84b9ad81ffbd9a6de30a78e114fadd603c5d	17 December 2015, 15:36:11 UTC
eed3f9f	cpettet	17 December 2015, 15:33:01 UTC	phabricator: log format to account for x-client-ip Bug: T114014 Change-Id: I852ba89c486e4a70676b5a2cb400931dd45eb86a	17 December 2015, 15:33:01 UTC
9557fe9	Jaime Crespo	17 December 2015, 14:48:16 UTC	Upgrading and reconfiguring mysql on db1031 and x1 codfw x1-slave has been depooled. Change-Id: I8f72fda49831d4dfa78c3f9361fd5a39d619e703 References: T120122	17 December 2015, 15:24:28 UTC
3586aa7	cpettet	17 December 2015, 15:06:29 UTC	phabricator: start using x-client-ip As of 400e9873dfc8fc3728227cc30643833525eae914 we are now limiting logs held about user activity to 30 days. Upstream also agreed to stop storing IP information in the long lived transaction tables. Bug: T114014 Change-Id: I44fd3b63178bff07300ad2d2e7d86ffd6ad686c5	17 December 2015, 15:17:43 UTC
c63778a	Brandon Black	17 December 2015, 15:16:28 UTC	post-merge syntax bugfix for be768ad7c6 Change-Id: I33a01d1f45dd32b3381246e1780c572fd5e410a8	17 December 2015, 15:17:11 UTC
0f4de6d	Brandon Black	17 December 2015, 14:33:39 UTC	VCL: grace-mode only in frontend caches We've observed an issue where, due to the fact that we pass requests through 2-3 distinct layers of varnish cache, if the lower layers (backends) use grace-mode and serve slightly-expired content, this triggers the next cache up the stack receiving such a response to create a 120s hit-for-pass object due to beresp.ttl <= 0s. This is one possible solution: simply don't allow expired "grace" responses in backend caches, only in the frontends directly facing the client. Change-Id: I32d4f25756b1b36588de461e831e4f68311b1d52	17 December 2015, 15:07:52 UTC
f841dba	Faidon Liambotis	17 December 2015, 15:00:18 UTC	labs: widen access.conf exception to everything LOCAL Commit ad2874ee9cc74585a1f685e836852e5c3ab26676 added an access.conf exception for cron, that was broken since the latest round of PAM reorganization. The commit message for the above mentioned that "[t]his will fix cron, but of course there is still a possibility we'll see effects of our access.conf configuration elsewhere (e.g. if we were using atd)". Unfortunately that turned out to be the case, with at least "su" known to be broken. Instead of playing whack-a-mole with various different commands, use pam_access' LOCAL directive to allow everything locally-originated on the system. This should fix all of the effects we're seeing, while at the same time giving us the necessary protection we require for ssh. Bug: T121765 Change-Id: Id741e12d5cafed0a91710bd22ecff3b89c59e994	17 December 2015, 15:07:18 UTC
b1c8761	Marko Obrovac	17 December 2015, 15:00:56 UTC	CXServer: s/no_proxy/no_proxy_list/ in config Change-Id: Icc22acf760226ee627ba6e07067b0f7676e9c68b	17 December 2015, 15:01:56 UTC
94a0b50	Brandon Black	17 December 2015, 14:18:31 UTC	text VCL: do not create hfp objects on 5xx Change-Id: Ide2b0bfd30a910788e813b0d6dc88ee2654fb339	17 December 2015, 15:01:20 UTC
8a1289d	Marko Obrovac	17 December 2015, 14:57:20 UTC	CXServer: Do not use the proxy for RESTBase and Apertium Change-Id: Ice049a1e0b49fa3d4d7cc58c3b7c0437d71fa241	17 December 2015, 14:57:20 UTC
ed45cd5	Kartik Mistry	17 December 2015, 14:35:36 UTC	CX: Fix dictionary YAML config Change-Id: Id1e8057d911a8346bd7dcfb781d39704d247f9c7	17 December 2015, 14:38:22 UTC
3f9615a	Andrew Bogott	17 December 2015, 14:29:23 UTC	Revert "Role::db:: was renamed to role::redisdb." It was renamed! but then it was removed shortly afterwards. This reverts commit a23bf88b795f573897305ed105ad31bf2c3a8402. Change-Id: I776608357ca2dab9c3378aab1e7ad33452df5477	17 December 2015, 14:29:31 UTC
52343f3	Alexandros Kosiaris	17 December 2015, 14:20:41 UTC	cxserver: quote no strings in registry yaml ruby yaml library casts no to false causing puppet failures. force cast to string by single quoting Change-Id: Ib202ea81f5db0054546215e8297c83c00f4f7195	17 December 2015, 14:23:22 UTC
abc4c94	Alexandros Kosiaris	17 December 2015, 14:13:44 UTC	cxserver: Fix the cxserver registry structure Change-Id: I6588cb37e58215c5d7b38557845e6d3610a9854d	17 December 2015, 14:15:27 UTC
a23bf88	andrewbogott	16 December 2015, 12:50:31 UTC	Role::db:: was renamed to role::redisdb. Change-Id: Ibfd446e2d78ca61854312f71a9697b9221728496	17 December 2015, 13:59:42 UTC
d298fd2	Kartik Mistry	17 December 2015, 13:47:22 UTC	CX: Use registry from hieradata Change-Id: I25ddfa697b7e73123312f0df75638da30d2107a5	17 December 2015, 13:55:58 UTC
ae1c471	Alexandros Kosiaris	17 December 2015, 12:52:20 UTC	cxserver: also fix the icinga LVS check Change-Id: Ic7798c8420878b5e95c3fe87860e7d9c2b7e9834	17 December 2015, 12:54:29 UTC
e9e85f3	Moritz Muehlenhoff	16 December 2015, 17:58:33 UTC	Bump connection limit to 8192 After merging 2e851a77a09a9fb042f88f5beb87fb2dd1b04127, the labs instances now again primarily connect to seaborgium: It is listed first in ldap.conf and nslcd.conf and both NSS and nslcd primarily use the first server and only fall back to the second if the first isn't reachable. Before that patch, the server-side idle timeout kicked in and a client which was e.g. connected to seaborgium reconnected to serpens. As such labs instances essentially connected to serpens and seaborgium round-robin every ten minutes. Grafana showed approx 1.2k connections each. Now seaborgium handles about 2.5k connections and serpens only a few hundred. This only leaves about 1.5k connections until slapd reaches the fd limit again, so bump the connection limit to 8192 to add more margin for possible connection spikes. Change-Id: Ib21f32721687c8328406dca90fb2b94a59bf6dc4	17 December 2015, 12:46:28 UTC
c45dc47	Alexandros Kosiaris	17 December 2015, 12:43:26 UTC	cxserver: Move pybal configuration to monitor /_info Change-Id: I870bd2c322e77168d27163d80c0484322b884baf	17 December 2015, 12:43:56 UTC
18df1b2	Kartik Mistry	04 November 2015, 06:29:27 UTC	service-runner migration for cxserver Bug: T117657 Depends: Ia5e0950b314f54e10e50230af9a3761be6a8ee0a Change-Id: I72f84ec489686cd38f22dbfc150ec12f1d8dcf87	17 December 2015, 12:13:52 UTC
8a19bf7	Bryan Davis	16 December 2015, 22:57:19 UTC	Add ferm to role::puppet::self Puppet uses 8140 for inbound requests from puppet agents. See https://docs.puppetlabs.com/pe/latest/install_system_requirements.html#firewall-configuration Change-Id: I6074c4a815209cef371b2d64cd0ea12989db8718	17 December 2015, 11:55:45 UTC
eeee3fa	YuviPanda	16 December 2015, 22:37:09 UTC	labs: Remove nfs_mounts params These have been unused for a few months now Change-Id: I421171eb9d062c235a16c6486e2cd82ea57e9bb7	17 December 2015, 01:09:50 UTC
ebcf848	YuviPanda	16 December 2015, 21:45:19 UTC	ores: Stop using aof for redis persistence The defaults already include rdb, so just use that. Right now we ended up using both rdb and aof, which isn't that useful in our case. Bug: T121658 Change-Id: I56f99f487e35e443862fef3e13ec884242c05661	17 December 2015, 01:09:26 UTC
d0ac42b	RobH	14 December 2015, 19:32:00 UTC	lists.wikimedia.org certificate update old certificate expires on 2016-01-30. This certificate cannot simply merge, but must roll into place at the same time as the private key file is updated. Please see the task for details and do not merge without full understanding of other updates needed. Bug:T120237 Change-Id: I068103f76c06371f6ae2e0596155815e1b9a382a	16 December 2015, 23:01:49 UTC
d86c848	dzahn	16 December 2015, 19:15:49 UTC	icinga: disable paging for test hosts This depends on I63b5a42c4d1ddd7, but after that is merged this is to disable all paging for services on test machines with one simple hiera change, without having to add special cases to each monitor::service that might appear somewhere in roles that are used on prod and test machines. Change-Id: Idbc4a54ac9199299fe3865dfcf5332fd70f6ba20	16 December 2015, 22:59:31 UTC
40e760d	RobH	14 December 2015, 19:39:06 UTC	new librenms.wikimedia.org certificate (renewal replacement) new librenms.wikimedia.org certificate, old expires on 2016-01-11. This certificate cannot simply merge, but must roll into place at the same time as the private key file is updated. Please see the task for details and do not merge without full understanding of other updates needed. T120235 Change-Id: I5e81c667f179e510bc4dc2886b163ab8a478760f	16 December 2015, 22:49:21 UTC
7389943	BryanDavis	16 December 2015, 22:12:15 UTC	Revert "tin,mira: move base::firewall to deployment role" This reverts commit 6012ee00dcdb1f73dac286b0bdead435f8fcf1ea. Change-Id: I6df14de00ca8915016b02cce0409950f23f2d8fb	16 December 2015, 22:17:46 UTC
fbaf706	cmjohnson	16 December 2015, 21:56:50 UTC	Removing dhcp entries for ores1001 and ores1002. Swapping for better suited h/w. Change-Id: I5de4fbd8246c5f11769602437bab4a7ae12a4fde	16 December 2015, 21:56:50 UTC
400e987	cpettet	16 December 2015, 21:30:39 UTC	phabricator: garbage collect user logs at 30 days Bug: T114014 Change-Id: I8d9f26e55065efe14cc164d99996f8c041f1fdbd	16 December 2015, 21:32:02 UTC
3e7a8c5	YuviPanda	16 December 2015, 21:11:22 UTC	Use Mwextension instead of the old name Mw-extension Change-Id: I2a238107eba77ea595f5468662f81ddabcad4bda	16 December 2015, 21:18:23 UTC
9838d74	andrewbogott	16 December 2015, 05:19:12 UTC	wikidatabuilder: Use require_package to avoid duplicate package conflicts. Change-Id: Iaabcd4f0d66c1a716be2574f5e32d9671b1202f5	16 December 2015, 05:19:12 UTC
6377970	dzahn	15 December 2015, 19:03:17 UTC	icinga: add logic to avoid paging for test machines Add logic to icinga to let us disable paging for test machines, without having to make edits in puppet manifests for each monitored service. Machines that are for testing could be excluded from paging by just setting "do_paging: false" in hiera for the machine or role. This is in response to the comment by Andrew on Ic4bec99c5ea1d3e0 to try and find a global solution for this issue and Alex' comments on the former PS. Change-Id: I63b5a42c4d1ddd723e994330645bf1894036db9c	16 December 2015, 20:38:43 UTC
f929d95	dzahn	16 December 2015, 19:45:00 UTC	phabricator: lower max execution time to 10s Change-Id: Ic8c1acd399f301621190f71fa297231916f226d7	16 December 2015, 19:49:37 UTC
bceddff	RobH	16 December 2015, 18:16:52 UTC	setting auth1001 install params setting auth1001 install params and normalizing file formatting T121655 Change-Id: Id99c17a6cebafb074051d93f328d7454ce8dfe56	16 December 2015, 18:16:52 UTC
ee23321	Alexandros Kosiaris	11 December 2015, 18:32:20 UTC	diamond: Add openldap collector Add an openldap collector allowing to query cn=Monitor stats from diamond and store them. This assumes we get a new user in both ldap and OIT ldap servers named diamond that has read only access to cn=monitor. The access part is done in this patch but the user needs to be created before it is merged Change-Id: Ia9fe25e5e6e6516e63bb452454fd883d7b72f5d9	16 December 2015, 17:32:15 UTC
f9d0dcd	Jaime Crespo	16 December 2015, 16:19:40 UTC	Reconfigure mysql at db1041 and all s7 codfw slaves Change-Id: I904f9ee810225ad437b55799554fefc1e61c0e67 References: T120122	16 December 2015, 16:19:40 UTC
fea45e1	Brandon Black	16 December 2015, 15:46:04 UTC	Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" This reverts commit 97046b8435d496c177237d3dd6ff1cc3daecbec2. Bug: T121564 Bug: T96847 Change-Id: I1d8e4b11267e924edeb4e99f187d3125437b56ae	16 December 2015, 15:47:46 UTC
fe5ea83	Andrew Otto	16 December 2015, 15:44:19 UTC	Fix group prefix for jmx metrics Change-Id: I87381ca8dbbf23aa537410bf08931d89b7bbadac	16 December 2015, 15:44:19 UTC
39eba27	Andrew Otto	16 December 2015, 15:39:41 UTC	Don't puppetize codfw main kafka cluster yet, we need a zookeeper cluster Change-Id: I0fad442d4260ce5f4bde2ea79cd14da71ac36cb8	16 December 2015, 15:39:41 UTC
bf6b3ef	Andrew Otto	16 December 2015, 15:29:11 UTC	Set group_prefix with Kafka cluster_name for jmx metrics, make some variables local Change-Id: Ide129a6867d450019d4634f7c80dc870fedddf73	16 December 2015, 15:36:42 UTC
4cf9f1d	Andrew Otto	16 December 2015, 15:13:37 UTC	Set up proper kafka log dirs for main clusters Change-Id: I65eff76a6944aaae216dd983eb71d3c7ca290758	16 December 2015, 15:19:24 UTC
4c9dd8d	Moritz Muehlenhoff	03 December 2015, 09:36:18 UTC	Uninstall ecryptfs-utils This gets pulled in by package dependencies on some Ubuntu installs (several db* servers and hooft), but is not actually used (I checked with salt that the ecryptfs kernel module isn't loaded on any of those machines). ecryptfs-utils was susceptible to local privilege escalation in the past and since we don't use it, let's uninstall it to minimise risks. Change-Id: I59a5667fa244913796dc990dc0d9138639f7a0ef	16 December 2015, 15:15:24 UTC
0b99b86	Andrew Otto	16 December 2015, 15:11:39 UTC	Fixing one more broker name in codfw Change-Id: Ib47aaf49bbfb59ac512167491345e69adbf8d75f	16 December 2015, 15:11:39 UTC
7c7b5bc	Andrew Otto	16 December 2015, 15:10:48 UTC	Fixing broker names for main-codfw cluster config Change-Id: I7e04b79f3ba98bff28b6d905253dec687b0bd04b	16 December 2015, 15:11:03 UTC
c113b6f	Andrew Otto	16 December 2015, 15:07:38 UTC	Fix undefined variable access Change-Id: I14c6ec89ff1f4b57c857c58fc758855063b0857f	16 December 2015, 15:08:54 UTC
64aada6	Andrew Otto	08 December 2015, 20:53:07 UTC	Using more generic roles for kafka classes, configuring new main brokers kafka[12]00[12] This will eventually also deprecate role::analytics::kafka::* in favor of role::kafka::analytics::* role::kafka::analytics::* is not yet included anywhere, but will be after this puppetization is applied and verified to work on the new brokers. Bug: T120957 Bug: T121553 Bug: T121558 Change-Id: Ifec423daa5d9b2a3d3e6e4b0bd12dda5639b8594	16 December 2015, 14:59:26 UTC
2e851a7	Moritz Muehlenhoff	15 December 2015, 14:56:11 UTC	Set idle_timelimit for nslcd Commit b771a328fe26531f9329250ac799f88a402960be enabled a slapd idle_timeout of ten minutes. nslcd keeps the LDAP connection open for reuse, but since the server side now terminates the connection, nslcd needs to reconnect and spews log messages such as nslcd[24996]: [b9ad17] <shadow="jmm"> ldap_search_ext() failed: Can't contact LDAP server: Connection timed out nslcd[24996]: [b9ad17] <shadow="jmm"> connected to LDAP server ldap://ldap-labs.codfw.wikimedia.org:389 The idle_timelimit option initiates a connection termination after nine minutes of inactivity. This still allows nslcd to reuse the connection in brief periods of inactivity, and still closes them before they are shutdown from the server-side. http://lists.arthurdejong.org/nss-pam-ldapd-users/2014/msg00111.html has some references from the upstream author of nslcd. Change-Id: I27525a22385b3c995e037244f354c00452cd8bc9	16 December 2015, 14:46:12 UTC
80e8afa	Marko Obrovac	16 December 2015, 14:23:21 UTC	service::node: Configure automatic service restarts with init_restart This commit introduces a new parameter to service::node - init_restart - which controls whether the init system (Upstart, SystemD) should automatically respawn the service in case its process dies. By default, they're allowed to do so. Also, set init_restart = false for RESTBase. Change-Id: I9dfc37a5d7eb9a18063389892d26c8e4aebd276c	16 December 2015, 14:40:02 UTC
d3e1f70	Marko Obrovac	16 December 2015, 08:35:04 UTC	RESTBase: disable firejail RESTBase hasn't been tested with firejail yet, so disable it temporarily until we determine it's safe to run it so. This commit also reintroduces the firejail conditional parameter to service::node because of RESTBase. Its default value is true, and should not be changed for new services. Note: to merge in tandem with Ica246d4e4a2c551ddf47334762a821536eccb307 Bug: T118401 Change-Id: I2e81a9078b6cf4f6e417a56483c65a6cf2df1007	16 December 2015, 13:44:20 UTC
02c2006	Marko Obrovac	09 December 2015, 15:35:24 UTC	RESTBase: Switch to service::node There are many shared features between the restbase (w/ restbase::monitoring) class and the service::node define. So, instead of duplicating work to keep all service-runner-based services updated, make RESTBase use the service::node define as well. Bug: T118401 Change-Id: Ica246d4e4a2c551ddf47334762a821536eccb307	16 December 2015, 13:38:12 UTC
f769d87	Jaime Crespo	16 December 2015, 12:01:53 UTC	Reconfiguring mysqls db1022 and s6 codfw servers Change-Id: Id0bb2a9495a96f39d3a17ff9c7572dc68c2f8f72 References: T120122	16 December 2015, 12:01:53 UTC
0ff7a4f	YuviPanda	16 December 2015, 11:03:10 UTC	labstore: Run sync-exports in start-nfs too It too seems to require a nfs-kernel-server restart, so roll that into one. This has bitten us during two outages Change-Id: Ia5c09cf27289c6a9e3b47e87a11269c321c4e6c3	16 December 2015, 11:09:42 UTC
d461404	YuviPanda	16 December 2015, 10:59:05 UTC	labstore: Activate volumes before mounting them Change-Id: Iacf2739576ad0cd54fae124f5d4cb3591440fc2f	16 December 2015, 11:09:13 UTC
184fac1	YuviPanda	16 December 2015, 10:38:02 UTC	labstore: Enable snapshots immediately after creation Change-Id: I78deb15174859560757310b51fa4ad9d290302b6	16 December 2015, 10:47:24 UTC
b9ca80e	Ariel T. Glenn	16 December 2015, 10:44:57 UTC	remove ariel's non-yubi ssh key Change-Id: Icad269aaae089c05ca200fe95a974eaa997b0dde	16 December 2015, 10:44:57 UTC
fc083e4	Ariel T. Glenn	16 December 2015, 10:38:51 UTC	allow salt master to handle more than 1024 conns Change-Id: Idbaa8460c5f99e219acc9cdf40c2c4793b7e6857	16 December 2015, 10:39:29 UTC
ad2874e	Faidon Liambotis	16 December 2015, 10:26:03 UTC	labs: add an access.conf stanza to always allow cron Since 623d8760cccca15f7798c1e9c22956507cba1676, we (silently) moved pam_access.so to an account type under common-account, while previously it was under an auth type in common-auth plus an account type under the sshd PAM configuration. This has had various repercussions, depending on whether 93eb9c8fb09968027eb735af8ea19516b307b5f4 was applied (reverted and re-reverted). With the current state of affairs, the primary effect is that common-account is included by cron, and cron checks against the account type. Thus, cron now suddenly checks users against access.conf. In turn this means that the default labs-restrict-from/to rules suddenly are applied for cron jobs, which under the default configuration means that only users with a membership to the Labs project have a working cron. This can be not the case in various circumstances (e.g. system users) but it was particularly visible in the Tools project, where each tool runs as a separate "system" user that is NOT in the project-tools group. The new PAM configuration is not wrong per se, so instead of reverting it, add an access.conf rule that allows every user on the system to run cron, and evaluate that early. This will fix cron, but of course there is still a possibility we'll see effects of our access.conf configuration elsewhere (e.g. if we were using atd). pam_access errors in auth.log would be a good indication of such a breakage. Change-Id: Ic0e80114995b0046a6901a885557d2c7e1d4df22	16 December 2015, 10:33:03 UTC
71a9141	YuviPanda	16 December 2015, 08:29:50 UTC	labstore: Better error-checking(?) for start-nfs For a 'critical' script ignoring failures seems bad Change-Id: Id84b184c339bc578f9a591d2cedacb798b0d796e	16 December 2015, 10:17:41 UTC
05fa7b9	YuviPanda	16 December 2015, 08:00:32 UTC	labstore: Skip activating snapshots by default Change-Id: I031cfa3a9bdd7522612088f68a74fe17eae64259	16 December 2015, 10:17:38 UTC
b487734	cmjohnson	16 December 2015, 07:26:54 UTC	changing kafka1001 and 1002 to install jessie Change-Id: I2ea25dd6eb8b030275869f4f32735a42f53719ad	16 December 2015, 07:26:54 UTC
86619ba	Andrew Bogott	16 December 2015, 03:51:45 UTC	Revert "Revert "Reorder modules in common-account"" this may have made some small things better but it made some other big things much much worse. Putting this back in until we have a subtler patch for the smaller issues. This reverts commit b1121caee13d97fb8d737cc0fbcb78b6e19d9a73. Change-Id: Iea15980492d33c743d9e9453464520daa81a6b57	16 December 2015, 04:02:58 UTC
d9a74bc	andrewbogott	15 December 2015, 18:03:46 UTC	clean-pam-config: move backupfiles to a different dir Bug: T121533 Change-Id: I2ba6994850442a483ba903eadc342c5ebd018c42	16 December 2015, 01:18:44 UTC
bf61260	YuviPanda	16 December 2015, 00:11:39 UTC	tools: No more puppet client class Missed this in the removal since ES was a pending patch Change-Id: I21fc1fe1a6859d13ee27d8cf74711b94754f0230	16 December 2015, 00:12:12 UTC
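
The X-Cache format described in commit 087a765 above (states hit, miss, pass, hit+pass, miss+pass, an optional "+chfp" suffix, and the old "hit (53)" versus new "hit(53)" spelling) can be illustrated with a small parser. This is only a sketch, not code from the repository: the function name, the returned field names, and the exact position of "+chfp" relative to the count are assumptions, since the log shows only the two hit examples.

```python
import re

# Assumed shape of one X-Cache entry: "<host> <state>[ ](<hits>)".
# Only "be_cp1053 hit (53)" (old) and "be_cp1053 hit(53)" (new) appear
# in the commit message; everything else here is an illustrative guess.
ENTRY = re.compile(
    r"^(?P<host>\S+) (?P<state>[a-z+]+)(?:(?P<space> ?)\((?P<hits>\d+)\))?$"
)


def parse_x_cache_entry(entry):
    """Split one X-Cache entry into host, state, hit count and format flags."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError("unrecognised X-Cache entry: %r" % entry)

    # Assumption: "+chfp" is appended to the state word itself.
    parts = match.group("state").split("+")
    created_hfp = parts[-1] == "chfp"
    if created_hfp:
        parts = parts[:-1]

    hits = match.group("hits")
    return {
        "host": match.group("host"),
        "state": "+".join(parts),
        "created_hfp": created_hfp,
        "hits": int(hits) if hits is not None else None,
        # A space before the count marks the older, less reliable format.
        "legacy_format": match.group("space") == " ",
    }


if __name__ == "__main__":
    print(parse_x_cache_entry("be_cp1053 hit(53)"))
    print(parse_x_cache_entry("be_cp1053 hit (53)"))
```

The `legacy_format` flag mirrors the note in commit 6202b92 that the older "hit (" style is not a reliable indicator of a true cache hit.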
