https://github.com/wikimedia/operations-puppet

Revision Message Commit Date
7e8a881 Revert "Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends"" This reverts commit fea45e166fdd6d1c5fca6a1e206e6d45077c0602. Bug: T121564 Bug: T96847 Change-Id: I57b36b5c2100900e4964daf0ac279fa46b734f5f 17 December 2015, 21:34:12 UTC
fa7cafb Text VCL: raise hfp TTL to 601s Change-Id: I7214813555a4340582752c0dddb515d360908970 17 December 2015, 21:34:04 UTC
37cd235 Use new kafka role for eventlogging service eventbus configuration Change-Id: I172a07b17267ea2cb97549b691b475adc6836c2e 17 December 2015, 21:23:31 UTC
2e16178 Make eventlogging files consumer role manage output directory The eventlogging module class only needs to manage daemon output, not any potential consumer output. Change-Id: I37da626a3f3c9bc79668f1d3a888bcabc2424e14 17 December 2015, 21:05:50 UTC
a90eb55 Don't include eventlogging::deployment::source in production yet Change-Id: I2a9a84b44b23f97dcb36dc8d10076e0ac9235f28 17 December 2015, 20:57:21 UTC
8f787f6 mediawiki: fix puppet-lint warnings Change-Id: I649efbb8f4b21d29f61b9068899fda0ad2994c21 17 December 2015, 20:49:31 UTC
f5685c6 Puppetize eventlogging-service with systemd in role::eventbus Deployment via scap3. This patch works in beta. There is still work to be done, but the patch is getting too large. Further work will proceed in new changes. No changes on eventlog1001 according to https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1506/ TODO: Move role::eventbus::eventbus back to role::eventbus when T119042 works. TODO: Use role::kafka::* to get kafka config. Bug: T118780 Change-Id: I621de844ed7a5bd1ac532b52058925350d9e5337 17 December 2015, 20:10:18 UTC
c31ba3b Move role::scap::target to scap::ferm, add scap::target define I don't see any reason for this to be a role class, especially if it is specifically needed for scap deployment servers to be able to deploy. This change adds a scap::target define, which simplifies the process of adding new scap::targets. Change-Id: Ia78d44b9b56ea165e9b584f8b30c0395da490f51 17 December 2015, 19:58:24 UTC
6202b92 Text VCL: exclude lower-layer cache hits from hfp object creation (and also, move the hfp block to common code) Note the " hit(" regex match is the new X-Cache data format recently introduced. It will take up to 30 days for all old cache objects recorded as " hit (" to expire out and make this change fully effective. We can't trust the older "hit (" style because it's not a reliable indicator of a true cache hit on a real object Change-Id: I7241260f63d9fc22c3268332c67b82b7df3be424 17 December 2015, 18:58:25 UTC
a1b7921 Revert "VCL: grace-mode only in frontend caches" This reverts commit 0f4de6da8c85a056007b54eae4082d9bd3d71848. Change-Id: Ie4110a7354c299161acf55ab09fb5ca8f08a8de5 17 December 2015, 18:55:23 UTC
d1be20f Revert "post-merge syntax bugfix for be768ad7c6" This reverts commit c63778ad12af6fdc75a6d53dcc88bb1f1ca697e0. Change-Id: Ib5644167c108d7c6601a99093f07a9996d51b3c4 17 December 2015, 18:55:08 UTC
5446c6e Remove extra space on data.yaml Change-Id: I898bea40ffe481dfbf4bcc9cf528c5102cba793f 17 December 2015, 18:34:58 UTC
73b1452 Elastic: move merge_threads to hiera Change-Id: Id045c9e934a50cc8bd2eaf2ebcc58344cf8709e4 17 December 2015, 18:33:07 UTC
3d1c28e New generated key for jcrespo (jynus) -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 New generated key protected by a hardware token (generated on 2015-11-18): ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCWNAMYh92QyZNjHcTyoapyWSKkQBSFJVgKWNW+5of3fiJ0frczz9R+MW2RiRPjdh2VoOzEdMboRogr7O5I1D2x07cVYpTNYEx4cPmzg7xLKUqPY0zxJGZz7g2zlXr1RtiM21MTNiG+tF1ndnB3KYa1LB9fA8pSgQkGz+UjFWGg2/LD6tLzNA8yB+MjV0X+nEtC+i58L5nchMN/m3RsyfCGOnJxPAsOCbQpolITCKSVceRPI/FvBAbaaUidL7MvfkgFTUjf+NX2b25ZdIVYD4BVGHrkw3fFQpPYdidEyLMN/wnu5leZskoOnuMzn2AgHQEBdsrKeV/umdFq3SjGJkkR -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iEYEARECAAYFAlZMk0EACgkQT7LnGA8RGa0UGgCgtxtEeGl7HoJHBXxXRM3fSXUu Cr4Anjtf64lYsQw1hy5FEoHX0xQWlLGz =oEmt -----END PGP SIGNATURE----- Change-Id: Icc484d0f713d72a45fe9ec1e84232b3c2ca212dd 17 December 2015, 18:29:22 UTC
e3602de toollabs: pep8 fixes for pretty code :) Change-Id: I7701ed0848b3a5132ff4ad2de899a1769749746c 17 December 2015, 18:27:05 UTC
5578783 Gerrit: move static assets to *.cache.* filenames This lets Gerrit serve them with 1y expires, perfect for things like the logo and background image that never change. Change-Id: Ie5aca12dafd20ca79fb024e35c71a87818c076e2 17 December 2015, 18:16:34 UTC
a0d4e43 Followup missing bit from 70f4366dc Need to add +chfp in both layers... Change-Id: I376dd312505ce33727dc35f04b146752feac9c8f 17 December 2015, 18:14:33 UTC
00533a9 reclaiming calcium to spares reclaiming a system to spares T116790 Change-Id: Iecef4e75b49867bf7427bfb43156321c518e1a14 17 December 2015, 18:11:07 UTC
087a765 VCL: Make X-Cache more accurate/informative, hopefully Comparing obj.hits to zero isn't really a reliable indicator of much of anything, so our X-Cache hit|miss information was pretty misleading before. This new code is based on the state diagram at: https://www.varnish-software.com/book/3/VCL_Basics.html#detailed-request-flow All requests pass through one of vcl_(hit|miss|pass) (except for vcl_pipe requests, but then those don't go through deliver to set up X-Cache either). Sometimes they pass through _hit|_miss and then through _pass, but usually just one of the three. With the new info, we can interpret as follows: hit - Hit on a real cached object (not a hit_for_pass object), definitely no backend fetch happens. miss - Missed the cache lookup (no hit_for_pass here either), backend fetch definitely required. pass - Either vcl_recv returned explicit pass, or there was a cache hit on a hit_for_pass object which caused a pass, backend fetch definitely required. hit+pass - "hit" happened as above, but then vcl_hit code did an explicit "return (pass)" miss+pass - "miss" happened as above, but then vcl_miss code did an explicit "return (pass)" Additionally, on the text/mobile clusters, if vcl_fetch created a new hit_for_pass object (due to e.g. beresp.ttl <= 0s), the above will be suffixed with "+chfp" for the request that created it. Whether that object actually gets stored for other hits to reference (as opposed to being anonymous and pointless) is an open question... Also, note there's a subtle change in X-Cache output so that we can tell the difference between new info and old cached info. The space before the obj.hits count is removed: Old: X-Cache: be_cp1053 hit (53) New: X-Cache: be_cp1053 hit(53) Change-Id: Ia37b469be518bb48c948cadbd1bb80dce14ea891 17 December 2015, 18:09:14 UTC
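The X-Cache states described in 087a765 can be read mechanically. The following is a minimal Python sketch, not code from the repository: the function name, the returned keys, and the assumption that every entry ends in a parenthesised count are illustrative. Only the state names, the "+chfp" suffix, and the old/new spacing difference come from the commit message.

```python
import re

# Hypothetical parser for one X-Cache entry such as "be_cp1053 hit(53)".
# Assumption: each entry is "<host> <state>[+pass][+chfp](<count>)", with
# the old format carrying a space before the opening parenthesis.
_ENTRY = re.compile(
    r"^(?P<host>\S+) "
    r"(?P<state>hit|miss|pass)"
    r"(?P<then_pass>\+pass)?"
    r"(?P<chfp>\+chfp)?"
    r"(?P<space> ?)\((?P<count>\d+)\)$"
)


def parse_x_cache_entry(entry):
    """Split one X-Cache entry into its parts, or return None if it does not match."""
    m = _ENTRY.match(entry.strip())
    if m is None:
        return None
    return {
        "host": m.group("host"),
        "state": m.group("state"),                    # hit, miss, or pass
        "then_pass": m.group("then_pass") is not None,  # explicit return (pass)
        "created_hfp": m.group("chfp") is not None,     # "+chfp" suffix present
        "count": int(m.group("count")),
        "new_format": m.group("space") == "",          # "hit(53)" vs "hit (53)"
    }
```

For example, `parse_x_cache_entry("be_cp1053 hit (53)")` reports `new_format` as false, which marks the value as coming from an object cached before this change.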
2d7a416 Mathoid: Increase the number of workers temporarily There is a bug in Mathoid that when certain TeX input is given, the worker process dies. Until we do a hotfix, increase the number of workers temporarily to 50 to minimise the chances of all workers dying at the same time. Bug: T121762 Change-Id: Ice3619c27855952d0f5723bea322ea9b04ab36ea 17 December 2015, 18:07:12 UTC
2daebcd gmond_memcached.py: fix all kinds of pep8 warnings Mostly leading whitespace Change-Id: I2ef92498ea5b3d0a9c5ba0f280dae36f094f0e0e 17 December 2015, 18:03:03 UTC
5819824 More fixmes for scap/manifests/scripts.pp Change-Id: I73956e777df544e053b235bd784996d09ab005b1 17 December 2015, 17:53:45 UTC
eedf937 [elasticsearch] Collect cluster health stats about shard movement The addition of relocating/initializing/unassigned shards statistics should give us better insight into when the cluster drops a node, and how it recovers from dropping that node. I would have thought this was uncommon, but we have dropped a node twice in the last 3 days and need better monitoring about what happens. Bug: T117284 Change-Id: I69788c5455115b5aa54167facfcb2dd83954e0bc 17 December 2015, 17:50:32 UTC
df47dcb [elastic] Record count of searches rejected due to thread pool exhaustion If the cluster is too busy it will start to reject requests, and record them here. Most of the servers report between a few thousand and 100k rejected searches (since they were restarted in September). We should record these to help keep an eye on the cluster health. Change-Id: Idefb9c622ea1d4919f8dfd2f7350eed048e7dac2 17 December 2015, 17:46:47 UTC
1707ebf Cron job to rebuild completion indices This will run once a week at 20 after midnight UTC. According to our graphs it looks like midnight to 7am is the least busy time for the cluster. This job takes about 20 hours to run serially; 4-way parallelism brings it down to about 12 hours and helps ensure we are not inducing extra latency during the busiest parts of the day. This supports the new completion suggester beta feature which is going into production on Wikipedias on Thursday, Dec 17. The initial indices have already been built; this is necessary to keep them updated in the future. Bug: T112028 Change-Id: I66c2723a366e988574b46ded4e1bdd9c3188a58e 17 December 2015, 17:42:37 UTC
c127ad9 extdist: Split skindist log into a separate file The cron job runs can overlap, leading to combined log files, which are annoying to read through. Change-Id: I4bd5b5e81ad35b4d234ddd280dc861de00fdfd88 17 December 2015, 17:34:52 UTC
fec20c0 beta: Fix logstash::cluster_hosts The beta cluster ELK servers don't need to allow Elasticsearch cluster access to the production hosts. Since this is a single node cluster it technically doesn't need to allow access to any hosts at all, but the Puppet setup needs some value for the hiera variable lookup or catalog compilation will fail. Change-Id: I34c402e18353b1726fa0ed678688ede033a19eff 17 December 2015, 17:15:32 UTC
0945fcf stashbot: Add missing logstash::cluster_hosts hiera data Needed by role::logstash::elasticsearch to open firewall access between Elasticsearch nodes. Change-Id: Iedb30012cae785745a0fb7fccd0987a99b5b31de 17 December 2015, 17:07:08 UTC
176d306 Revert "VCL: differentiate hit-vs-hfp in X-Cache" This reverts commit a53c26aebf44bd5a9324f4dd72aed0a747190175. Change-Id: Ic394900b00a4f31f9efe231bb985309352682b60 17 December 2015, 17:00:12 UTC
cb49334 cxserver: no_proxy_entry is not an instance variable So remove the @ sign. Also remove the newline trimming at the end of each element in the list Change-Id: Ia31ef18ea96c4484acd0c38c4c31f2f8363eb334 17 December 2015, 16:56:54 UTC
4dc3f16 cxserver: Populate no_proxy_list correctly Have no_proxy_list as an array of elements that contains the domains a proxy should not be used for. Then interpolate it in yaml and populate the stanza. Provide data via hiera Change-Id: Ia4d21fdade1ea6d4c3a87e4598c61cbb43440c9e 17 December 2015, 16:52:35 UTC
378c1ac Revert "CXServer: Do not use the proxy for RESTBase and Apertium" This reverts commit 8a1289d4bd1335f4f89042addb7a4c92f5693548. Change-Id: I70efa4fa4916d26e81a135e8897afd38e867dca9 17 December 2015, 16:52:35 UTC
6e0d820 Revert "CXServer: s/no_proxy/no_proxy_list/ in config" This reverts commit b1c8761eda7beee02657e3627d626a7f3a47fb58. Change-Id: If0d5600cdc4fb2d8cf3e045d2eacdc899e755e2c 17 December 2015, 16:52:35 UTC
a53c26a VCL: differentiate hit-vs-hfp in X-Cache Change-Id: Id5a1e35a05faba249765b85ce0e2e3495bfd1cc5 17 December 2015, 16:39:13 UTC
6230b5a text VCL: exempt all variants of Special:Banner.* There are Special:Banner requests of the form /w/index.php?title=Special:Banner too, so they should get this treatment as well... Change-Id: I08706b44d71cf20e7704dbe6531b6b4c975cafe1 17 December 2015, 16:14:36 UTC
5afb017 Stop opendj on the former labs LDAP servers Also removes the monitoring of the LDAP ports. To be merged once most instances still accessing them are fixed. Change-Id: Ibdd7d898f8debe54fab2d74cb14e352da4a25d00 17 December 2015, 15:56:07 UTC
ea0733b Grafana increase homepage dash list limits The Featured Wikidata dashboard recently fell off the bottom of the list which is by default limited to 10! (as we now have more than 10 featured dashes). 14 means we can again show all featured dashboards and does not actually increase the size of the list, simply uses the empty space at the bottom... Change-Id: Ibad9153dd52836042157e22a1bbab121c6b828f1 17 December 2015, 15:44:32 UTC
a0bd3f2 Fix typo on I8f72fda4983 s/codfw/eqiad/. No server was harmed in the process. Change-Id: I12ad84b9ad81ffbd9a6de30a78e114fadd603c5d 17 December 2015, 15:36:11 UTC
eed3f9f phabricator: log format to account for x-client-ip Bug: T114014 Change-Id: I852ba89c486e4a70676b5a2cb400931dd45eb86a 17 December 2015, 15:33:01 UTC
9557fe9 Upgrading and reconfiguring mysql on db1031 and x1 codfw x1-slave has been depooled. Change-Id: I8f72fda49831d4dfa78c3f9361fd5a39d619e703 References: T120122 17 December 2015, 15:24:28 UTC
3586aa7 phabricator: start using x-client-ip As of 400e9873dfc8fc3728227cc30643833525eae914 we are now limiting logs held about user activity to 30 days. Upstream also agreed to stop storing IP information in the long lived transaction tables. Bug: T114014 Change-Id: I44fd3b63178bff07300ad2d2e7d86ffd6ad686c5 17 December 2015, 15:17:43 UTC
c63778a post-merge syntax bugfix for be768ad7c6 Change-Id: I33a01d1f45dd32b3381246e1780c572fd5e410a8 17 December 2015, 15:17:11 UTC
0f4de6d VCL: grace-mode only in frontend caches We've observed an issue where, due to the fact that we pass requests through 2-3 distinct layers of varnish cache, if the lower layers (backends) use grace-mode and serve slightly-expired content, this triggers the next cache up the stack receiving such a response to create a 120s hit-for-pass object due to beresp.ttl <= 0s. This is one possible solution: simply don't allow expired "grace" responses in backend caches, only in the frontends directly facing the client. Change-Id: I32d4f25756b1b36588de461e831e4f68311b1d52 17 December 2015, 15:07:52 UTC
f841dba labs: widen access.conf exception to everything LOCAL Commit ad2874ee9cc74585a1f685e836852e5c3ab26676 added an access.conf exception for cron, that was broken since the latest round of PAM reorganization. The commit message for the above mentioned that "[t]his will fix cron, but of course there is still a possibility we'll see effects of our access.conf configuration elsewhere (e.g. if we were using atd)". Unfortunately that turned out to be the case, with at least "su" known to be broken. Instead of playing whack-a-mole with various different commands, use pam_access' LOCAL directive to allow everything locally-originated on the system. This should fix all of the effects we're seeing, while at the same time giving us the necessary protection we require for ssh. Bug: T121765 Change-Id: Id741e12d5cafed0a91710bd22ecff3b89c59e994 17 December 2015, 15:07:18 UTC
b1c8761 CXServer: s/no_proxy/no_proxy_list/ in config Change-Id: Icc22acf760226ee627ba6e07067b0f7676e9c68b 17 December 2015, 15:01:56 UTC
94a0b50 text VCL: do not create hfp objects on 5xx Change-Id: Ide2b0bfd30a910788e813b0d6dc88ee2654fb339 17 December 2015, 15:01:20 UTC
8a1289d CXServer: Do not use the proxy for RESTBase and Apertium Change-Id: Ice049a1e0b49fa3d4d7cc58c3b7c0437d71fa241 17 December 2015, 14:57:20 UTC
ed45cd5 CX: Fix dictionary YAML config Change-Id: Id1e8057d911a8346bd7dcfb781d39704d247f9c7 17 December 2015, 14:38:22 UTC
3f9615a Revert "Role::db:: was renamed to role::redisdb." It was renamed! but then it was removed shortly afterwards. This reverts commit a23bf88b795f573897305ed105ad31bf2c3a8402. Change-Id: I776608357ca2dab9c3378aab1e7ad33452df5477 17 December 2015, 14:29:31 UTC
52343f3 cxserver: quote no strings in registry yaml ruby yaml library casts no to false causing puppet failures. force cast to string by single quoting Change-Id: Ib202ea81f5db0054546215e8297c83c00f4f7195 17 December 2015, 14:23:22 UTC
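The failure mode in 52343f3 is that an unquoted `no` is read as a boolean rather than a string. A minimal Python sketch of the fix follows. The helper name and word list are assumptions for illustration, not code from the repository, and the set of spellings a given YAML parser actually converts varies by parser.

```python
# YAML 1.1 boolean spellings, compared case-insensitively. This is
# deliberately broader than any single parser: over-quoting is harmless,
# while under-quoting turns a string into a boolean.
YAML11_BOOL_WORDS = {"y", "n", "yes", "no", "true", "false", "on", "off"}


def quote_if_yaml_bool(value):
    """Single-quote a scalar that a YAML 1.1 parser could read as a boolean."""
    if value.lower() in YAML11_BOOL_WORDS:
        return "'" + value + "'"
    return value
```

Applied to a list of plain strings such as `["nb", "no"]`, only `no` is quoted, which forces the parser to keep it as a string.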
abc4c94 cxserver: Fix the cxserver registry structure Change-Id: I6588cb37e58215c5d7b38557845e6d3610a9854d 17 December 2015, 14:15:27 UTC
a23bf88 Role::db:: was renamed to role::redisdb. Change-Id: Ibfd446e2d78ca61854312f71a9697b9221728496 17 December 2015, 13:59:42 UTC
d298fd2 CX: Use registry from hieradata Change-Id: I25ddfa697b7e73123312f0df75638da30d2107a5 17 December 2015, 13:55:58 UTC
ae1c471 cxserver: also fix the icinga LVS check Change-Id: Ic7798c8420878b5e95c3fe87860e7d9c2b7e9834 17 December 2015, 12:54:29 UTC
e9e85f3 Bump connection limit to 8192 After merging 2e851a77a09a9fb042f88f5beb87fb2dd1b04127, the labs instances now again primarily connect to seaborgium: It is listed first in ldap.conf and nslcd.conf and both NSS and nslcd primarily use the first server and only fall back to the second if the first isn't reachable. Before that patch, the server-side idle timeout kicked in and a client which was e.g. connected to seaborgium reconnected to serpens. As such labs instances essentially connected to serpens and seaborgium round-robin every ten minutes. Grafana showed approx 1.2k connections each. Now seaborgium handles about 2.5k connections and serpens only a few hundred. This only leaves about 1.5k connections until slapd reaches the fd limit again, so bump the connection limit to 8192 to add more margin for possible connection spikes. Change-Id: Ib21f32721687c8328406dca90fb2b94a59bf6dc4 17 December 2015, 12:46:28 UTC
c45dc47 cxserver: Move pybal configuration to monitor /_info Change-Id: I870bd2c322e77168d27163d80c0484322b884baf 17 December 2015, 12:43:56 UTC
18df1b2 service-runner migration for cxserver Bug: T117657 Depends: Ia5e0950b314f54e10e50230af9a3761be6a8ee0a Change-Id: I72f84ec489686cd38f22dbfc150ec12f1d8dcf87 17 December 2015, 12:13:52 UTC
8a19bf7 Add ferm to role::puppet::self Puppet uses 8140 for inbound requests from puppet agents. See https://docs.puppetlabs.com/pe/latest/install_system_requirements.html#firewall-configuration Change-Id: I6074c4a815209cef371b2d64cd0ea12989db8718 17 December 2015, 11:55:45 UTC
eeee3fa labs: Remove nfs_mounts params These have been unused for a few months now Change-Id: I421171eb9d062c235a16c6486e2cd82ea57e9bb7 17 December 2015, 01:09:50 UTC
ebcf848 ores: Stop using aof for redis persistence The defaults already include rdb, so just use that. Right now we ended up using both rdb and aof, which isn't that useful in our case. Bug: T121658 Change-Id: I56f99f487e35e443862fef3e13ec884242c05661 17 December 2015, 01:09:26 UTC
d0ac42b lists.wikimedia.org certificate update old certificate expires on 2016-01-30. This certificate cannot simply merge, but must roll into place at the same time as the private key file is updated. Please see the task for details and do not merge without full understanding of other updates needed. Bug:T120237 Change-Id: I068103f76c06371f6ae2e0596155815e1b9a382a 16 December 2015, 23:01:49 UTC
d86c848 icinga: disable paging for test hosts This depends on I63b5a42c4d1ddd7, but after that is merged this is to disable all paging for services on test machines with one simple hiera change, without having to add special cases to each monitor::service that might appear somewhere in roles that are used on prod and test machines. Change-Id: Idbc4a54ac9199299fe3865dfcf5332fd70f6ba20 16 December 2015, 22:59:31 UTC
40e760d new librenms.wikimedia.org certificate (renewal replacement) new librenms.wikimedia.org certificate, old expires on 2016-01-11. This certificate cannot simply merge, but must roll into place at the same time as the private key file is updated. Please see the task for details and do not merge without full understanding of other updates needed. T120235 Change-Id: I5e81c667f179e510bc4dc2886b163ab8a478760f 16 December 2015, 22:49:21 UTC
7389943 Revert "tin,mira: move base::firewall to deployment role" This reverts commit 6012ee00dcdb1f73dac286b0bdead435f8fcf1ea. Change-Id: I6df14de00ca8915016b02cce0409950f23f2d8fb 16 December 2015, 22:17:46 UTC
fbaf706 Removing dhcp entries for ores1001 and ores1002. Swapping for better suited h/w. Change-Id: I5de4fbd8246c5f11769602437bab4a7ae12a4fde 16 December 2015, 21:56:50 UTC
400e987 phabricator: garbage collect user logs at 30 days Bug: T114014 Change-Id: I8d9f26e55065efe14cc164d99996f8c041f1fdbd 16 December 2015, 21:32:02 UTC
3e7a8c5 Use Mwextension instead of the old name Mw-extension Change-Id: I2a238107eba77ea595f5468662f81ddabcad4bda 16 December 2015, 21:18:23 UTC
9838d74 wikidatabuilder: Use require_package to avoid duplicate package conflicts. Change-Id: Iaabcd4f0d66c1a716be2574f5e32d9671b1202f5 16 December 2015, 05:19:12 UTC
6377970 icinga: add logic to avoid paging for test machines Add logic to icinga to let us disable paging for test machines, without having to make edits in puppet manifests for each monitored service. Machines that are for testing could be excluded from paging by just setting "do_paging: false" in hiera for the machine or role. This is in response to the comment by Andrew on Ic4bec99c5ea1d3e0 to try and find a global solution for this issue and Alex' comments on the former PS. Change-Id: I63b5a42c4d1ddd723e994330645bf1894036db9c 16 December 2015, 20:38:43 UTC
f929d95 phabricator: lower max execution time to 10s Change-Id: Ic8c1acd399f301621190f71fa297231916f226d7 16 December 2015, 19:49:37 UTC
bceddff setting auth1001 install params setting auth1001 install params and normalizing file formatting T121655 Change-Id: Id99c17a6cebafb074051d93f328d7454ce8dfe56 16 December 2015, 18:16:52 UTC
ee23321 diamond: Add openldap collector Add an openldap collector allowing to query cn=Monitor stats from diamond and store them. This assumes we get a new user in both ldap and OIT ldap servers named diamond that has read only access to cn=monitor. The access part is done in this patch but the user needs to be created before it is merged Change-Id: Ia9fe25e5e6e6516e63bb452454fd883d7b72f5d9 16 December 2015, 17:32:15 UTC
f9d0dcd Reconfigure mysql at db1041 and all s7 codfw slaves Change-Id: I904f9ee810225ad437b55799554fefc1e61c0e67 References: T120122 16 December 2015, 16:19:40 UTC
fea45e1 Revert "cache_text/mobile: send randomized pass traffic directly to t1 backends" This reverts commit 97046b8435d496c177237d3dd6ff1cc3daecbec2. Bug: T121564 Bug: T96847 Change-Id: I1d8e4b11267e924edeb4e99f187d3125437b56ae 16 December 2015, 15:47:46 UTC
fe5ea83 Fix group prefix for jmx metrics Change-Id: I87381ca8dbbf23aa537410bf08931d89b7bbadac 16 December 2015, 15:44:19 UTC
39eba27 Don't puppetize codfw main kafka cluster yet, we need a zookeeper cluster Change-Id: I0fad442d4260ce5f4bde2ea79cd14da71ac36cb8 16 December 2015, 15:39:41 UTC
bf6b3ef Set group_prefix with Kafka cluster_name for jmx metrics, make some variables local Change-Id: Ide129a6867d450019d4634f7c80dc870fedddf73 16 December 2015, 15:36:42 UTC
4cf9f1d Set up proper kafka log dirs for main clusters Change-Id: I65eff76a6944aaae216dd983eb71d3c7ca290758 16 December 2015, 15:19:24 UTC
4c9dd8d Uninstall ecryptfs-utils This gets pulled in by package dependencies on some Ubuntu installs (several db* servers and hooft), but is not actually used (I checked with salt that the ecryptfs kernel module isn't loaded on any of those machines). ecryptfs-utils was susceptible to local privilege escalation in the past and since we don't use it, let's uninstall it to minimise risks. Change-Id: I59a5667fa244913796dc990dc0d9138639f7a0ef 16 December 2015, 15:15:24 UTC
0b99b86 Fixing one more broker name in codfw Change-Id: Ib47aaf49bbfb59ac512167491345e69adbf8d75f 16 December 2015, 15:11:39 UTC
7c7b5bc Fixing broker names for main-codfw cluster config Change-Id: I7e04b79f3ba98bff28b6d905253dec687b0bd04b 16 December 2015, 15:11:03 UTC
c113b6f Fix undefined variable access Change-Id: I14c6ec89ff1f4b57c857c58fc758855063b0857f 16 December 2015, 15:08:54 UTC
64aada6 Using more generic roles for kafka classes, configuring new main brokers kafka[12]00[12] This will eventually also deprecate role::analytics::kafka::* in favor of role::kafka::analytics::* role::kafka::analytics::* is not yet included anywhere, but will be after this puppetization is applied and verified to work on the new brokers. Bug: T120957 Bug: T121553 Bug: T121558 Change-Id: Ifec423daa5d9b2a3d3e6e4b0bd12dda5639b8594 16 December 2015, 14:59:26 UTC
2e851a7 Set idle_timelimit for nslcd Commit b771a328fe26531f9329250ac799f88a402960be enabled a slapd idle_timeout of ten minutes. nslcd keeps the LDAP connection open for reuse, but since the server side now terminates the connection, nslcd needs to reconnect and spews log messages such as nslcd[24996]: [b9ad17] <shadow="jmm"> ldap_search_ext() failed: Can't contact LDAP server: Connection timed out nslcd[24996]: [b9ad17] <shadow="jmm"> connected to LDAP server ldap://ldap-labs.codfw.wikimedia.org:389 The idle_timelimit option initiates a connection termination after nine minutes of inactivity. This still allows nslcd to reuse the connection in brief periods of inactivity, and still closes them before they are shutdown from the server-side. http://lists.arthurdejong.org/nss-pam-ldapd-users/2014/msg00111.html has some references from the upstream author of nslcd. Change-Id: I27525a22385b3c995e037244f354c00452cd8bc9 16 December 2015, 14:46:12 UTC
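The timing relationship in 2e851a7 is that the client-side idle limit must expire before the server-side one, so nslcd closes idle connections itself instead of finding them dropped. A small Python sketch of that check follows. The function and constant names are hypothetical, and only the ten-minute and nine-minute values come from the commit message.

```python
# Illustrative values taken from the commit message, in seconds.
SLAPD_IDLE_TIMEOUT = 10 * 60   # server side: ten minutes
NSLCD_IDLE_TIMELIMIT = 9 * 60  # client side: nine minutes


def client_closes_first(client_idle_seconds, server_idle_seconds):
    """Return True when the client idle limit is set and expires before the server's."""
    return 0 < client_idle_seconds < server_idle_seconds
```

With the values above, `client_closes_first(NSLCD_IDLE_TIMELIMIT, SLAPD_IDLE_TIMEOUT)` holds. Equal timeouts would not satisfy the check, since the server could still close the connection first.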
80e8afa service::node: Configure automatic service restarts with init_restart This commit introduces a new parameter to service::node - init_restart - which controls whether the init system (Upstart, SystemD) should automatically respawn the service in case its process dies. By default, they're allowed to do so. Also, set init_restart = false for RESTBase. Change-Id: I9dfc37a5d7eb9a18063389892d26c8e4aebd276c 16 December 2015, 14:40:02 UTC
d3e1f70 RESTBase: disable firejail RESTBase hasn't been tested with firejail yet, so disable it temporarily until we determine it's safe to run it so. This commit also reintroduces the firejail conditional parameter to service::node because of RESTBase. Its default value is true, and should not be changed for new services. Note: to merge in tandem with Ica246d4e4a2c551ddf47334762a821536eccb307 Bug: T118401 Change-Id: I2e81a9078b6cf4f6e417a56483c65a6cf2df1007 16 December 2015, 13:44:20 UTC
02c2006 RESTBase: Switch to service::node There are many shared features between the restbase (w/ restbase::monitoring) class and the service::node define. So, instead of duplicating work to keep all service-runner-based services updated, make RESTBase use the service::node define as well. Bug: T118401 Change-Id: Ica246d4e4a2c551ddf47334762a821536eccb307 16 December 2015, 13:38:12 UTC
f769d87 Reconfiguring mysqls db1022 and s6 codfw servers Change-Id: Id0bb2a9495a96f39d3a17ff9c7572dc68c2f8f72 References: T120122 16 December 2015, 12:01:53 UTC
0ff7a4f labstore: Run sync-exports in start-nfs too It too seems to require a nfs-kernel-server restart, so roll that into one. This has bitten us during two outages Change-Id: Ia5c09cf27289c6a9e3b47e87a11269c321c4e6c3 16 December 2015, 11:09:42 UTC
d461404 labstore: Activate volumes before mounting them Change-Id: Iacf2739576ad0cd54fae124f5d4cb3591440fc2f 16 December 2015, 11:09:13 UTC
184fac1 labstore: Enable snapshots immediately after creation Change-Id: I78deb15174859560757310b51fa4ad9d290302b6 16 December 2015, 10:47:24 UTC
b9ca80e remove ariel's non-yubi ssh key Change-Id: Icad269aaae089c05ca200fe95a974eaa997b0dde 16 December 2015, 10:44:57 UTC
fc083e4 allow salt master to handle more than 1024 conns Change-Id: Idbaa8460c5f99e219acc9cdf40c2c4793b7e6857 16 December 2015, 10:39:29 UTC
ad2874e labs: add an access.conf stanza to always allow cron Since 623d8760cccca15f7798c1e9c22956507cba1676, we (silently) moved pam_access.so to an account type under common-account, while previously it was under an auth type in common-auth plus an account type under the sshd PAM configuration. This has had various repercussions, depending on whether 93eb9c8fb09968027eb735af8ea19516b307b5f4 was applied (reverted and re-reverted). With the current state of affairs, the primary effect is that common-account is included by cron, and cron checks against the account type. Thus, cron now suddenly checks users against access.conf. In turn this means that the default labs-restrict-from/to rules suddenly are applied for cron jobs, which under the default configuration means that only users with a membership to the Labs project have a working cron. This can be not the case in various circumstances (e.g. system users) but it was particularly visible in the Tools project, where each tool runs as a separate "system" user that is NOT in the project-tools group. The new PAM configuration is not wrong per se, so instead of reverting it, add an access.conf rule that allows every user on the system to run cron, and evaluate that early. This will fix cron, but of course there is still a possibility we'll see effects of our access.conf configuration elsewhere (e.g. if we were using atd). pam_access errors in auth.log would be a good indication of such a breakage. Change-Id: Ic0e80114995b0046a6901a885557d2c7e1d4df22 16 December 2015, 10:33:03 UTC
71a9141 labstore: Better error-checking(?) for start-nfs For a 'critical' script ignoring failures seems bad Change-Id: Id84b184c339bc578f9a591d2cedacb798b0d796e 16 December 2015, 10:17:41 UTC
05fa7b9 labstore: Skip activating snapshots by default Change-Id: I031cfa3a9bdd7522612088f68a74fe17eae64259 16 December 2015, 10:17:38 UTC
b487734 changing kafka1001 and 1002 to install jessie Change-Id: I2ea25dd6eb8b030275869f4f32735a42f53719ad 16 December 2015, 07:26:54 UTC
86619ba Revert "Revert "Reorder modules in common-account"" this may have made some small things better but it made some other big things much much worse. Putting this back in until we have a subtler patch for the smaller issues. This reverts commit b1121caee13d97fb8d737cc0fbcb78b6e19d9a73. Change-Id: Iea15980492d33c743d9e9453464520daa81a6b57 16 December 2015, 04:02:58 UTC
d9a74bc clean-pam-config: move backupfiles to a different dir Bug: T121533 Change-Id: I2ba6994850442a483ba903eadc342c5ebd018c42 16 December 2015, 01:18:44 UTC
bf61260 tools: No more puppet client class Missed this in the removal since ES was a pending patch Change-Id: I21fc1fe1a6859d13ee27d8cf74711b94754f0230 16 December 2015, 00:12:12 UTC