SRE cheatsheet: 2017

Thursday, September 21, 2017

DevOps and Site Reliability Engineering (SRE)

As we all know, the Computer Age and the Internet Age have both profoundly impacted the world of commerce. As customer experience changes, led by internet giants, IT operations change accordingly to support new processes. Not so long ago, new product development could mostly be decoupled from operations. Of course, there were some connections: factories had to retool their machinery if changes were made. Yet the nature of physical products allowed development and operations to drift apart.

With the explosion of cyber property in the last few decades, though, the product mix has changed. Digital products represent a large and growing part of global offerings. Such a product is expected to be always reliable and accessible from anywhere, by anyone, at any time. Recent offerings from major cloud providers advertise simplicity in supporting this notion. In reality, everything is still technically grounded (servers need to physically be somewhere). To meet market expectations, development has to work closely with operations.

For a simple example, consider a buyer-seller connection service. In the 1970s, perhaps there was a weekly publication of sellers in a relatively small geographic area. Buyers couldn’t directly compete, because the seller could only handle one caller at a time. Today, hundreds of remote buyers can compete directly and instantly, and the seller never has to negotiate with a single one if s/he doesn’t want to. For the retail equivalent (mail order catalogues), in today’s system, there might not be a human involved between the factory and the customer’s house at all.

Similar transformations abound in a plethora of industries. Cyber products are entirely new, and, because reliability and security are paramount, development simply cannot remain decoupled from operations. Moreover, simple yet powerful upgrades from development can be applied with minimal interference and downtime, so why wouldn’t operations departments cooperate with development teams to enhance the customer experience?

What is DevOps? What is SRE?

DevOps — an organizational model that encourages communication, empathy, and ownership throughout the company

SRE — an organizational model to reconcile the opposing incentives of operations and development teams within an organization

These two terms are widely used and broadly applied. Sometimes too broadly. The term Site Reliability Engineering was born at Google, the brain child of Ben Treynor. It, like DevOps, is a blend of operations and development. The most important aspects, similarly to DevOps, are automating operations processes and increasing collaboration. This is especially important in globally-scaled, always-on-demand services, because not all errors and issues can (or even should) be handled by humans. We humans have better things to do.

SRE aims to provide availability, performance, change management, emergency response, and capacity planning. Each of these factors is essential to global-grade services, because the software landscape sees intense competition. A couple days of downtime can mean customers flowing to competitors. This brave new world requires new techniques.

A One-Paragraph Primer on Reliability Terminology

Any operations student would know that there are two parts to reliability: Mean Time to Repair (MTTR) and Mean Time Between Failures (MTBF). The former is how long a system is in error before it is fixed, and the latter is how long the interval is between failures. These two concepts work together, and a balance between them is the golden goal.
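As a rough illustration of how the two metrics combine, the sketch below uses the common steady-state availability formula, MTBF / (MTBF + MTTR). It assumes MTBF is measured as the average uptime between failures; the function name and the sample numbers are hypothetical and not taken from any particular tool.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given average uptime between
    failures (MTBF) and average time to repair (MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails about every 999 hours, takes 1 hour to fix.
slow_repair = availability(mtbf_hours=999.0, mttr_hours=1.0)

# Same failure frequency, but repairs are automated and take 36 seconds.
fast_repair = availability(mtbf_hours=999.0, mttr_hours=0.01)

print(f"slow repair: {slow_repair:.4%}")
print(f"fast repair: {fast_repair:.4%}")
```

Either lengthening the time between failures or shortening the repair time raises availability, which is why the two are best considered together.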

Traditional Operations and Development Interaction

The traditional interaction between development teams and operations teams is polarized. On one end, the development team is tasked with creating new features and attracting customers as much as possible. New features attract customers, so more new features mean more attraction. Unfortunately, this sometimes leads development teams to publish updates and features before they are thoroughly tested. It also leads to frustrated operations teams when the service goes down.

Conversely, the operations team is tasked with running the service once it has been approved and established. The ops team doesn’t want more work than is essential, so it encourages longer and more rigorous testing periods before release. This leads to long lead times and frustrated development teams who just want to push out the newest, coolest features.

Is there no middle ground?

The conflict between dev and ops can be palpable. Sometimes responsibility for code is even hidden from operations to limit fallout onto one person, which is known as information hiding. This is not an efficient or well-oiled system. How can we reconcile the seemingly opposite goals of development and of operations? In SRE, the answer is the “error budget”.

According to the creators of SRE, a 100% reliability rate is unlikely, and maybe not even desirable. 99.9% reliability is indistinguishable from 100% for the userbase. Maybe 99% is your target. It depends on the users and what level of reliability they are willing to accept. This level is defined multilaterally (see “Moral Authority”).

Whatever your target, the difference between your target and 100% is the “error budget”. The development team may produce code that has an error rate up to the budget. That means they can do less testing or roll out less stable features, as long as they don’t surpass the budgeted downtime. Once the downtime allowance is surpassed, all future launches must be blocked, including major ones, until the budget is “earned back” with performance that is better than the target reliability rate.

This small but brilliant change has interesting consequences. The dev team attempts to code for low native error rates, because they want to use their budget on more interesting and fun features, not the foundational code. Furthermore, the dev team starts to self-police, because they want to conserve the budget for worthwhile launches, not consume it on errors in basic features. Finally, there is no blame or info hiding, because everyone agreed to the budget in the first place. This leads to empathy and communication between teams, replacing the sometimes hostile environment of the traditional dev-ops relationship.
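The budget arithmetic described above can be sketched in a few lines. This is a minimal illustration, assuming downtime is measured in minutes over a fixed window (30 days here); the `ErrorBudget` class, its field names, and the sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    target: float          # agreed reliability target, e.g. 0.999
    period_minutes: float  # length of the measurement window

    @property
    def allowed_downtime(self) -> float:
        """Budget = (100% - target), expressed in minutes of the window."""
        return (1.0 - self.target) * self.period_minutes

    def remaining(self, downtime_minutes: float) -> float:
        """Minutes of budget left; negative once the budget is overspent."""
        return self.allowed_downtime - downtime_minutes

    def launches_allowed(self, downtime_minutes: float) -> bool:
        """Launches are blocked once downtime surpasses the allowance."""
        return downtime_minutes <= self.allowed_downtime


# Hypothetical example: 99.9% target over a 30-day window.
budget = ErrorBudget(target=0.999, period_minutes=30 * 24 * 60)
print(f"allowed downtime: {budget.allowed_downtime:.1f} minutes")
print("launch after 40 min of downtime?", budget.launches_allowed(40))
print("launch after 50 min of downtime?", budget.launches_allowed(50))
```

With these assumed numbers the allowance is roughly 43 minutes per window, so a launch would go ahead after 40 minutes of accumulated downtime and be blocked after 50.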

Moral Authority

In an organization, especially in the tech world, it is imperative that employees believe in their leadership. A rogue team is disastrous, and sabotage is a real threat. Whence stems the moral authority for SRE? This lies in the budgeting process. Development, operations, product managers, and organizational management agree to Service Level Agreements (SLAs), which state the minimum uptime (which necessarily stipulates the maximum downtime) that is acceptable to customers.

This is the foundation for the budget. If customers are willing to accept 99.5% uptime, then the budget is 0.5%. And since the development team has agreed to this level, they have no authority to challenge SRE blocking their launches if the budget is spent. Everyone agrees beforehand, so there is no political jockeying once the system is live.

Monitoring, Reliability, and Service Degradation

A public-facing system will inevitably be down sometimes. Even if the MTTR is extremely short and unnoticeable by customers, the system has still failed. This is the reason monitoring (and preferably automated monitoring) is essential.

According to Treynor, there are three parts to monitoring. First is logging, which is mundane and mainly for diagnostic and research purposes later. This isn’t meant to be read continuously, only used as a tool for later review, if necessary. Then there are tickets, for which humans must take action, but maybe not immediately. Then there are alerts, such as when the service is offline for most customers — these require immediate human response, likely in the form of an emergency or crisis response team.
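The three tiers above amount to a simple routing decision. The sketch below is one hypothetical way to express it, assuming each monitoring event can be tagged with whether a human must act and whether the action is urgent; the `Output` enum and `route` function are illustrative names, not part of any monitoring product.

```python
from enum import Enum


class Output(Enum):
    LOG = "log"        # recorded for later diagnosis, nobody is notified
    TICKET = "ticket"  # a human must act, but not immediately
    ALERT = "alert"    # a human must act right now


def route(needs_human: bool, urgent: bool) -> Output:
    """Pick the monitoring output for an event."""
    if not needs_human:
        return Output.LOG
    return Output.ALERT if urgent else Output.TICKET


# Hypothetical events:
print(route(needs_human=False, urgent=False))  # routine diagnostic entry
print(route(needs_human=True, urgent=False))   # e.g. disk filling slowly
print(route(needs_human=True, urgent=True))    # e.g. service offline
```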

Most error handling should be automated, and this is an area where machines fix themselves. The more machines fix themselves, the better. This quick, automatic repair is related to reliability via Mean Time to Repair (MTTR). If service problems occur but the MTTR is a few milliseconds (because computers are fixing themselves), then the users will never notice. That means dev has more available budget, a good incentive to develop automated error-handling systems.
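A very small sketch of the "machines fix themselves" idea follows. It assumes the system exposes some health check and some automated repair action (for example, restarting a process); both are passed in as plain callables, and the `self_heal` name and the attempt limit are hypothetical.

```python
from typing import Callable


def self_heal(
    is_healthy: Callable[[], bool],
    repair: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Try automated repair up to max_attempts times.

    Returns True if the system is healthy (possibly after repairs),
    False if automation gave up and a human should be involved.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        repair()
    return is_healthy()


# Hypothetical service that comes back after one restart.
state = {"up": False, "restarts": 0}


def check() -> bool:
    return state["up"]


def restart() -> None:
    state["restarts"] += 1
    state["up"] = True


print("recovered automatically:", self_heal(check, restart))
```

When the function returns False, the event would move from the automated path to a ticket or an alert.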

Now, what should be done when the MTTR is longer than a few milliseconds? Many errors will be on back-end systems, and with replication, there may be no discernible issue for the front-end site or service. If, however, issues apparent to the consumer are inevitable, it is best to engineer for “graceful degradation”. This just means you don’t want your service blacking out completely, but rather slowing down or lowering service quality. A complete blackout with a completely unreachable or unresponsive service will cause customer backlash. Degraded service will cause annoyance, but probably not drive customers away. This can be accomplished via a microservice architecture, as one service going down does not take down the entire service.
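Graceful degradation can be as simple as falling back to a cheaper answer when a dependency fails. The sketch below assumes a hypothetical storefront whose recommendation service is down; all function names and the sample data are made up for illustration.

```python
from typing import Callable, List, Tuple


def call_with_fallback(
    primary: Callable[[], List[str]],
    fallback: Callable[[], List[str]],
) -> Tuple[List[str], bool]:
    """Return (result, degraded). Use the fallback if the primary fails."""
    try:
        return primary(), False
    except Exception:
        # The dependency is unavailable: serve lower-quality content
        # instead of failing the whole page.
        return fallback(), True


def personalized_recommendations() -> List[str]:
    # Stand-in for a call to a separate service that is currently down.
    raise ConnectionError("recommendation service unreachable")


def static_bestsellers() -> List[str]:
    # Precomputed list that needs no live dependency.
    return ["item-1", "item-2", "item-3"]


items, degraded = call_with_fallback(personalized_recommendations, static_bestsellers)
print(items, "(degraded)" if degraded else "")
```

The customer sees a less tailored page rather than an error, which matches the "annoyance, not backlash" trade-off described above.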

From the customer viewpoint, many errors with a short MTTR are probably better than long but infrequent errors, because short-MTTR errors are often eliminated before customers even notice. On the other hand, if a firm doesn’t implement a system for handling these errors, the exact opposite is desirable: one long outage means one long fix, not an endless stream. Hence, to reconcile this conflict, it is strongly suggested to create a system to handle issues. And when the company scales, it is all but imperative to automate, because problems will inevitably outstrip operations headcount.

Why is SRE important?

All organizations want to provide excellent service to users. All organizations have organizational structure, and sometimes that structure includes competing teams and incentives. SRE attempts to eliminate one major issue, especially in modern organizational structures. Chaos behind the scenes will eventually lead to chaos on the front-end, where customers can indirectly observe the Pyrrhic war between development and operations end in a spectacular implosion of the service (and the customer base).

How is SRE related to DevOps?

The first and most obvious way it is related is in using software techniques in operations. But that is trivial, especially in tech companies, because modern operations departments all rely on software to some degree. Both also foster inter-team communication.

However, DevOps encourages communication between teams across the organization, while SRE encourages communication between the development and operations teams. DevOps is concerned with broad empathy and ownership (even involving sales and marketing), while SRE tends to focus on only development and operations. Furthermore, in DevOps, the development team will feel responsible for the life of the product, while in SRE, dev might self-police, but the ultimate operations responsibility still lies with operations.

There are yet more similarities, though, such as the tendency to automate as much of the operations process as possible, including continuous delivery procedures: dev teams under an SRE model might roll out small updates to stay under the error budget, while dev teams under a DevOps model tend to make small updates for easier monitoring and bug identification. Both encourage scalability, such that products not only have solid foundations and native code, but that base product can expand with the business.

As with anything in organizational management, these terms are not mutually exclusive, and they do not have to be separated. Furthermore, each company has its own unique culture and needs, so applying aspects of DevOps and aspects of SRE simultaneously is not taboo. In fact, it is viewed positively. Innovative companies always look for the best aspect of something, extract that best aspect, and adapt and apply that aspect to their own needs. Don’t be afraid to be unique, and certainly don’t be afraid to stand on the shoulders of giants.

Provided by: Forthscale systems, cloud experts
Also published @ Forthscale medium account

Thursday, June 29, 2017

Petya / NotPetya

Just last month the WannaCry ransomware spread to hundreds of thousands of machines and set off a global panic. The worm-style infection relied on a leaked NSA tool (EternalBlue) that allowed it to spread rapidly across the Internet. Microsoft had already patched the underlying vulnerability on supported systems in March 2017 (security bulletin MS17-010), and shortly after the attack began it released emergency patches even for systems long past their patch lifetimes (Windows XP, anyone?).

A mere month later, the NotPetya malware burst onto the scene. Petya has been around since early 2016, and this outbreak is not actually Petya. However, it shares many similarities, hence the preliminary label as “Petya” and subsequently “NotPetya”. The attack bears resemblance to WannaCry in that it exploits EternalBlue, which, unfortunately, has not been patched on many systems because companies and individuals have decided uptime is more important. They effectively gambled with their data, and some of them have lost.

This ransomware hasn’t spread like WannaCry, but it also uses a more sophisticated infection technique and the encryption stage is more interesting as well. Essentially, NotPetya’s developers learned from WannaCry’s mistakes and made some clever enhancements.

The malware has hit giants like Merck, Maersk, the advertising firm WPP, and Rosneft (the Russian energy behemoth). The way NotPetya spreads is likely a big reason major firms and big networks are targeted as opposed to just anyone.

Who’s Affected?

The most affected are those without any type of malware protection and who skip critical OS updates for Windows. It is hard to imagine that anyone (and especially companies) hasn’t updated their systems after the carnage wreaked by WannaCry, but there are certainly people who haven’t.

Users of old protocols and techniques, like Server Message Block version 1 (SMBv1), are highly vulnerable, as this is the protocol that EternalBlue exploits.

And since this malware spreads within a network rather than jumping around the Internet, it is more likely large organizations are going to be targeted, because they have much bigger networks to infect. Furthermore, these companies have HR and customer service departments that often download attachments from unknown sources. Such activities make them prime targets for this kind of ransomware attack.

The Infection Process

NotPetya first attempts to use the EternalBlue security hole. It exploits Microsoft’s Server Message Block version 1 (SMBv1), which is generally used for allowing file and printer sharing and miscellaneous communications tasks. The latest version is v3, and unless there is a specific need to use SMBv1, it should not be used. EternalBlue is just one compelling reason to ditch it. However, since this vulnerability has been addressed in updates and patches, the malware has other vectors for infection.

Assuming the SMBv1 exploit fails, the ransomware attempts to use PsExec (to run processes on connected computers). It also scans memory for any user credentials, which are then used in conjunction with Windows Management Instrumentation Command Line (WMIC). Using WMIC affords NotPetya the ability to infect even patched Windows 10 machines, because WMIC is a legitimate network tool for administrators.

With that in mind, any computer that has administrator rights on a network can infect the entire network, whether it is a patched network or not.

How it Spreads

The main entry point is through a malicious file downloaded by a network user. As HR personnel tend to receive a lot of email with attachments, this is one of the identified avenues of attack. Once the malicious file is downloaded, it can use the exploits listed above to spread on the network – this is a good reason to target big companies (they have a lot more computers on their network than Jack who lives down the street).

Another major avenue of injection is through malicious code in Microsoft Office files. Auto-running macros can download the infection whenever an offending file is opened. And, without singling out any one weak point, it has been reported that the MeDoc software widely used in Ukraine has been an involuntary delivery system.

The Encryption Process

NotPetya not only encrypts your files, it scrambles the boot sector of your hard drive, so it isn’t even possible to boot past the ransom message. This also prevents any offline tampering (as opposed to WannaCry, which could be investigated offline), since there’s no way to even look at the encrypted files. Furthermore, it seems system logs are wiped to make it that much harder to crack the malware.

In order to enforce the MBR (master boot record) encryption, the machine is forced to restart within an hour (otherwise it may take weeks for that part of the encryption, as many machines are powered on for weeks at a time with no restarts).

Prevention of the Virus

It goes without saying that one should not be downloading random files from the Internet without knowing the sender. In certain roles, though, it can be difficult to adhere to this rule.

Another tenet of cybersecurity is having some sort of antivirus and anti-malware software. Most of the major names in cybersecurity claim they protect against the execution of NotPetya. So having some sort of antivirus will be helpful in preventing infection.

Another very important aspect is keeping software up-to-date. Updating software from trusted vendors like Microsoft is the best way to cut off a major avenue of attack (like leveraging EternalBlue). If the update cannot be applied, networks should at least attempt to disable SMBv1 to prevent spread through that vulnerability.

A Kill Switch? Maybe a “Vaccine”

If you have been infected or are at major risk thereof, one known “vaccine” is to create the file C:\Windows\perfc. Once the file is created, you should set it to read-only. Apparently NotPetya scans the computer for this file, and if it is found, it halts the encryption process.

Note, however, that this is not a “kill switch” like the one that was possible with WannaCry. This is being termed a “vaccine”, because the machine can be infected, but its data remains unscrambled. It doesn’t kill the propagation of the virus, because the virus remains on the system.

The greatest drawback with this vaccine is the file must be created for each machine on a network for the entire network to be vaccinated. It’s a very simple fix for one machine, but can be a headache on a network with thousands of machines. Regardless, this is one possible approach to prevent your files from being locked.
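For a single machine, the file creation described above could be scripted roughly as follows. This is a hedged sketch: the function name and its parameters are hypothetical, writing to C:\Windows requires administrator rights, and rolling it out to every machine would still need whatever deployment tooling the network already uses.

```python
import os
import stat
from pathlib import Path


def create_vaccine_file(directory: str = r"C:\Windows", name: str = "perfc") -> Path:
    """Create an empty file and mark it read-only.

    The defaults match the path reported as the NotPetya "vaccine";
    pass another directory to try the function safely elsewhere.
    """
    path = Path(directory) / name
    if not path.exists():
        path.touch()
    # Clear the write bits. On Windows this sets the read-only attribute.
    os.chmod(path, stat.S_IREAD)
    return path
```

Calling `create_vaccine_file()` with no arguments targets the path named above; running the function again on a machine that already has the file leaves it in place and simply reapplies the read-only flag.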

What to do if your files are encrypted

If it has come to your computer booting up with a ransom message, you only have one option to get the data back from that machine. Unfortunately, it means paying the ransom, which most experts and cyber-security defence teams advise against. Even more unfortunate for those affected, the email address provided in the ransom message has reportedly been taken offline.

A much better solution is to have your data backed up somewhere else. If you are practising basic data maintenance, you shouldn’t lose any of your data to this attack. If your data is backed up, this is more an inconvenience than a company killer.

Ransomware or “Wiper”?

Unfortunately for those that have been impacted, NotPetya seems to be a wiper and not ransomware. According to both Kaspersky and Comae Technologies, the encrypted files are not recoverable, even by the attacker. That means even if the payment is made, no key can be distributed to reverse the encryption (not that a victim could contact the attackers, because their contact email address has been disabled). This implies the attack was meant to be destructive and not financially driven. It could be that a well-financed state actor is behind the attack, and they already have plenty of funds. Following so closely on the heels of WannaCry, the media reported the attack as ransomware and shifted the focus from a possible nation state attack to a rogue group of criminals looking for a quick financial payoff. Watch out for more info in the near future concerning a nation’s involvement.

Some More Technical Info from around the Web

Kaspersky has a page with some information on the detection its software generates. There is also a short bit of advice for users. If you are interested in exactly who has been attacked, Avira has compiled a (probably non-exhaustive) list of user language settings on compromised machines. As reported elsewhere, it is largely Russian and Ukrainian machines and disproportionately Windows 7 running Service Pack 1. And Symantec has published an article with a good overview of the infection vectors and the impacted file extensions. A not unexpected spoiler? You probably use solely these file extensions. Finally, if you’ve decided to kill SMBv1 manually, this is Microsoft’s tutorial for all of their OSes.

Click here to see Checkpoint forensic analysis

Have more questions? Give us a shout.

Provided by: Forthscale systems, cloud experts
