Large-Scale Internet Services

The paper is selected from http://pages.cs.wisc.edu/~remzi/Classes/739/Fall2016/.

On Designing and Deploying Internet-Scale Services - James Hamilton – Windows Live Services Platform

1. Three tenets 
	a. Expect failures. 
		i. Failures may cause dependent components to fail. 
	b. Keep things simple. 
		i. Simple things are easier to get right. 
		ii. Avoid unnecessary dependencies. 
		iii. Simple installation.
		iv. Failure isolation. A single server failure has no impact on other data centers. 
	c. Automate everything. 
		i. People make mistakes. 
2. Deploy an operations-friendly service
	a. Overall Application Design
		i. when the system fails, the tendency is to look first to operations, but most issues have their genesis in design and development
		ii. simplicity is the key to efficient operations
			1) Design for failure. The entire service must be capable of surviving failures without human administrative interaction. To test the failure path --> just hard-fail it. 
			2) Redundancy and fault recovery
				a) Is the operations team willing and able to bring down any server in the service at any time without draining the workload first? 
				b) Security threat modeling
					i) identify each possible security threat and implement enough mitigation for each
				c) Document all conceivable component failure modes and combinations. 
					i) make sure that the service can continue to operate without unacceptable loss in service quality.
					ii) Rare combinations of errors can become commonplace. 
			3) Commodity hardware slice
				a) large clusters of commodity servers cost far less than a small number of large servers 
				b) I/O is the constraint. Server performance continues to increase much faster than I/O performance --> a smaller server is a more balanced system for a given amount of disk
				c) power consumption scales linearly with servers but cubically with clock frequency --> higher-performance servers are much more expensive to operate
				d) a small server failure affects only a small fraction of the overall service workload
			4) Single-version software
				a) target a single internal deployment
				b) previous versions don't have to be supported for a decade
					i) The most economic services don't give customers control over the version they run and only host one version. This requires: 
						One. care in not changing the user experience from release to release
						Two. a willingness to allow customers that need this level of control to either host internally or switch to an application service provider 
			5) Multi-tenancy
				a) hosting all companies or end users of a service in the same service without physical isolation
				b) Single tenancy: segregation of groups of users in an isolated cluster
			6) Quick service health check
				a) the service's version of a build verification test
				b) ensure that the service isn't broken in any substantive way (see the sketch below)
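A minimal sketch of what such a quick health check could look like, assuming an HTTP service; the /healthz path, the port, and the specific checks are illustrative choices, not from the paper.

```python
# Quick health-check endpoint sketch (path, port, and checks are placeholders).
from http.server import BaseHTTPRequestHandler, HTTPServer

def critical_dependencies_ok() -> bool:
    # Placeholder: verify database, queue, and cache connectivity here.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz" and critical_dependencies_ok():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(503)  # signal "broken in a substantive way"
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```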
			7) Develop in the full environment
				a) unit-test components, but also test the full service with the component changes in place 
			8) zero trust of underlying components
				a) assume that underlying components will fail 
			9) understand access patterns 
				a) "What impacts will this feature have on the rest of the infrastructure"
				b) measure and validate the feature for load when live
			10) Version everything
				a) Expect a mixed version environment
				b) run single-version software, but multiple versions will be live across production and test (see the sketch below)
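As a rough illustration of running with mixed versions live, here is a hypothetical request parser that accepts both the current and the previous wire format; the field names and version numbers are made up for the example.

```python
# Sketch: accept both the current (v2) and previous (v1) message format while a
# mixed-version fleet is live. Field names and version numbers are hypothetical.
def parse_request(msg: dict) -> dict:
    version = msg.get("version", 1)
    if version == 2:
        return {"user": msg["user_id"], "action": msg["action"]}
    if version == 1:
        # Older senders still in production use "uid"; keep supporting it until
        # there is no chance of rolling back to the v1 format.
        return {"user": msg["uid"], "action": msg["action"]}
    raise ValueError(f"unsupported message version: {version}")

print(parse_request({"version": 1, "uid": "alice", "action": "login"}))
print(parse_request({"version": 2, "user_id": "alice", "action": "login"}))
```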
			11) Keep the unit/functional tests from the last release
				a) Keep n-1 version tests
			12) Avoid single points of failure
				a) Prefer stateless implementations. Don't affinitize requests to particular servers. Static allocation (for example, by hashing) is bad because it cannot adapt when servers are added or fail. 
				b) Use fine-grained partitioning (where related individual tuples, e.g., cliques of friends, are co-located in the same partition) and don't support cross-partition operations, to allow efficient scaling across many database servers (see the partition-map sketch below). 
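A sketch of the partitioning idea, under the assumption of a lookup-based partition map: keys hash to one of many fine-grained partitions, and partitions (not keys) are assigned to servers, so assignments can change without the static-allocation problem. All names and counts are illustrative.

```python
import zlib

NUM_PARTITIONS = 1024  # many small partitions, far more than servers

# partition -> server assignment; held in a replicated store in a real system
partition_map = {p: f"server-{p % 3}" for p in range(NUM_PARTITIONS)}

def partition_of(key: str) -> int:
    # key -> partition uses a stable hash, so data never needs re-hashing
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def server_for(key: str) -> str:
    # partition -> server is a lookup, not a static allocation, so it can change
    return partition_map[partition_of(key)]

def reassign(partition: int, new_server: str) -> None:
    # move one fine-grained partition off a failed or overloaded server
    partition_map[partition] = new_server

print(server_for("user:42"))
reassign(partition_of("user:42"), "server-9")
print(server_for("user:42"))
```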
	b. Automatic Management and Provisioning
		i. it can be hard because human judgment is sometimes needed (dependency)
		ii. Be restartable and redundant
			1) persistent state stored redundantly
		iii. Support geo-distribution
			1) support running across several hosting data centers. 
		iv. Automatic provisioning and installation
		v. Configuration and code as a unit
			1) code and configuration as a single unit
			2) operations deploys them as a unit
			3) services should treat config and code as a unit
			4) an audit log is required if a config change must be made in production (see the sketch below)
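One possible way to picture "configuration and code as a unit", sketched with hypothetical manifest fields: the deployable bundles a code version with its config and a digest, and any emergency config change in production goes through an audit log.

```python
# Sketch: code + config as one deployable, auditable unit. Manifest fields and
# the audit-log shape are assumptions, not from the paper.
import hashlib, json, time

def make_deployment_unit(code_version: str, config: dict) -> dict:
    config_blob = json.dumps(config, sort_keys=True)
    return {
        "code_version": code_version,
        "config": config,
        "config_digest": hashlib.sha256(config_blob.encode()).hexdigest(),
    }

AUDIT_LOG = []

def emergency_config_change(unit: dict, key: str, value, operator: str) -> dict:
    # If config must change in production, record who changed what and when.
    AUDIT_LOG.append({"time": time.time(), "operator": operator,
                      "key": key, "old": unit["config"].get(key), "new": value})
    return make_deployment_unit(unit["code_version"], {**unit["config"], key: value})

unit = make_deployment_unit("build-1234", {"max_connections": 100})
unit = emergency_config_change(unit, "max_connections", 50, operator="oncall")
print(unit["config_digest"], AUDIT_LOG)
```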
		vi. Manage server roles or personalities rather than servers 
		vii. Multi-system failures are common
		viii. Recover at the service level 
			1) handle failures and correct errors at the service level with full context rather than in lower software levels 
		ix. Never rely on local storage for non-recoverable information 
			1) duplicate all the non-ephemeral service state 
		x. keep deployment simple 
			1) file copy, minimal external dependencies. 
		xi. fail services regularly
			1) if you're unwilling to fail the service regularly, you aren't confident it will survive failures
	c. Dependency Management
		i. Expect latency. Calls to external components may take a long time to complete. 
			1) set timeout
			2) operational idempotency allows the restart of the requests after timeout even though those requests may have partially or even fully completed. 
			3) ensure all restarts are reported, and bound restarts to avoid a repeatedly failing request consuming resources (see the sketch below) 
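A sketch of the timeout / idempotency / bounded-restart combination, assuming an HTTP dependency; the URL, the Idempotency-Key header, and the limits are assumptions rather than anything prescribed by the paper.

```python
# Sketch: call an external component with a timeout, retry a bounded number of
# times, and pass an idempotency key so a retried request that already
# (partially) completed stays safe. URL, header name, and limits are assumptions.
import uuid
import urllib.request

MAX_ATTEMPTS = 3          # bound restarts so a failing request can't loop forever
TIMEOUT_SECONDS = 2.0     # don't wait indefinitely on a slow dependency

def call_dependency(payload: bytes, url: str = "http://billing.internal/charge") -> bytes:
    idempotency_key = str(uuid.uuid4())   # same key on every retry of this request
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            req = urllib.request.Request(url, data=payload,
                                         headers={"Idempotency-Key": idempotency_key})
            with urllib.request.urlopen(req, timeout=TIMEOUT_SECONDS) as resp:
                return resp.read()
        except OSError as exc:            # timeouts and connection errors
            print(f"attempt {attempt} failed: {exc}")  # report every restart
    raise RuntimeError("dependency unavailable after bounded retries")
```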
		ii. Isolate failures 
			1) avoid cascading failures 
		iii. Use shipping and proven components 
			1) stable version of software and hardware 
		iv. Implement inter-service monitoring and alerting 
			1) need to know when a dependent service is overloading 
		v. Dependent services require the same design point 
		vi. Decouple components 
			1) ensure that components can continue operation, perhaps in a degraded mode, during failures of other components. For example, maintain a session key and refresh it every N hours (see the sketch below) 
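The session-key example might look roughly like this: the key is refreshed every N hours, but if the auth component is down the cached key keeps being used, so this component degrades rather than failing along with it. The class, the callable, and the interval are hypothetical.

```python
# Sketch: cached session key refreshed every N hours; stale key is reused when
# the auth component is down (degraded mode). Names and intervals are assumptions.
import time

REFRESH_INTERVAL = 6 * 3600  # "N hours"

class SessionKeyCache:
    def __init__(self, fetch_key):
        self._fetch_key = fetch_key      # callable that talks to the auth service
        self._key = None
        self._fetched_at = 0.0

    def key(self) -> str:
        if self._key is None or time.time() - self._fetched_at > REFRESH_INTERVAL:
            try:
                self._key = self._fetch_key()
                self._fetched_at = time.time()
            except Exception:
                if self._key is None:
                    raise                # no cached key yet: cannot degrade gracefully
                # auth service down: keep operating with the stale cached key

        return self._key

cache = SessionKeyCache(fetch_key=lambda: "key-from-auth-service")
print(cache.key())
```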
	d. Release Cycle and Testing 
		i. Invest in engineering
			1) Services that don't think big to start with will be scrambling to catch up later 
		ii. Support version roll-back
	iii. Maintain forward and backward compatibility
			1) Changes to the interfaces between components are all potential risks. Don't rip out support for old file formats until there is no chance of a rollback to that old format in the future
		iv. Single-server deployment
			1) The entire service must be easy to host on a single system --> for unit testing 
		v. Stress test for load
		vi. Perform capacity and performance testing prior to new releases 
			1) do this at the service level.
		vii. Build and deploy shallowly and iteratively
			1) get a skeleton version of the full service working at an early stage
		viii. test with real data
		ix. Run system-level acceptance test
			1) sanity check
		x. test and develop in full environments
			1) use the same data collection and mining techniques used in production 
3. Graceful Degradation and Admission Control
	a. A big red switch. 
		i. a designed and tested action that can be taken when the service is no longer able to meet its SLA (service-level agreement).
		ii. keep the vital processing progressing while shedding or delaying some non-critical workload. 
		iii. Determine what is minimally required if the system is in trouble, and implement and test the option to shut off the non-essential services when that happens (see the sketch below)
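A sketch of a big red switch, assuming work items carry a criticality flag: when operations flips the switch, non-critical work is shed or deferred while vital processing continues. The flag and task shape are invented for illustration.

```python
# Sketch: one designed-and-tested switch that sheds non-critical work while
# vital processing keeps running. The "critical" flag and task shape are invented.
deferred = []                 # non-critical work parked for later
BIG_RED_SWITCH = True         # flipped by operations when the SLA can't be met

def process(task: dict) -> None:
    print("processed:", task["name"])

def handle(task: dict) -> None:
    if BIG_RED_SWITCH and not task.get("critical", False):
        deferred.append(task)       # shed/delay non-critical workload
        return
    process(task)                   # vital processing always proceeds

handle({"name": "charge customer", "critical": True})
handle({"name": "rebuild recommendations", "critical": False})
print("deferred:", [t["name"] for t in deferred])
```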
	b. Control admission
		i. if the current load cannot be processed on the system, taking on more work only makes the user experience worse. Example: email --> stop queuing and don't accept more mail into the system 
		ii. Serve premium customers ahead of non-premium customers (see the sketch below) 
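A sketch of admission control with a premium tier, assuming queue depth is the saturation signal; the thresholds and the reserve fraction are arbitrary illustrative values.

```python
# Sketch: reject new work at the front door when saturated, and reserve the
# last slice of capacity for premium customers. Thresholds are assumptions.
MAX_QUEUE_DEPTH = 10_000
PREMIUM_RESERVE = 0.2   # last 20% of capacity reserved for premium customers

def admit(queue_depth: int, is_premium: bool) -> bool:
    if queue_depth >= MAX_QUEUE_DEPTH:
        return False                          # stop accepting mail/work entirely
    if queue_depth >= MAX_QUEUE_DEPTH * (1 - PREMIUM_RESERVE):
        return is_premium                     # only premium customers get in
    return True

print(admit(5_000, is_premium=False))   # True: normal operation
print(admit(9_000, is_premium=False))   # False: remaining capacity is reserved
print(admit(9_000, is_premium=True))    # True: premium admitted first
```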
	c. Meter admission. 
		i. modification of the admission control point
		ii. be able to bring the system back up slowly. Ramp up: 1 user, 10 users, 100 users (see the sketch below).
		iii. Ways to notify users 
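A sketch of metering admission back in after an outage: the admitted fraction of users grows on a fixed schedule instead of letting the full load return at once. The schedule, step length, and user-id gate are assumptions.

```python
# Sketch: ramp admitted users back up in steps after recovery. The fractions,
# step length, and per-user gate are illustrative assumptions.
import time

RAMP_FRACTIONS = [0.001, 0.01, 0.1, 1.0]  # fraction of users admitted per step
STEP_SECONDS = 300                        # widen the gate every 5 minutes

def admitted(user_id: int, recovery_started_at: float) -> bool:
    # Pick the current ramp step from elapsed time since recovery began.
    step = min(int((time.time() - recovery_started_at) // STEP_SECONDS),
               len(RAMP_FRACTIONS) - 1)
    # Deterministic per-user gate: the same users stay admitted as the gate widens.
    return (user_id % 1000) / 1000.0 < RAMP_FRACTIONS[step]

start = time.time()
print(admitted(0, start))    # in the first tiny cohort right after recovery
print(admitted(777, start))  # most users are still held back at the first step
```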

The conclusion: Spending time engineering the system at the beginning is worth it.