Tuesday, May 17, 2022

Solving error: Your current user or role does not have access to Kubernetes objects on this EKS cluster.

When trying to access an EKS cluster with kubectl, you might get an error similar to:

Your current user or role does not have access to Kubernetes objects on this EKS cluster
This may be due to the current user or role not having Kubernetes RBAC permissions to describe cluster resources or not having an entry in the cluster’s auth config map

This can happen, for example, on Terraform-created clusters, or when a new user joins the organization.

What happened is that EKS, being an Amazon product, by default relies on AWS IAM for RBAC, and the role you currently use was not granted access to the cluster.

You can see the identity mappings on your cluster with:

eksctl get iamidentitymapping --cluster YOUR_CLUSTER --region=YOUR_REGION

You can add the needed role using eksctl (no kubectl needed, since these mappings are applied before you have cluster access):

eksctl create iamidentitymapping \
 --cluster YOUR_CLUSTER \
 --arn arn:aws:iam::123456:role/YOUR_ROLE \
 --username admin \
 --group system:masters

And you can delete the mappings you no longer use with:

eksctl delete iamidentitymapping \
 --cluster YOUR_CLUSTER \
 --arn arn:aws:iam::123456:role/YOUR_ROLE
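Under the hood, eksctl edits the aws-auth ConfigMap in the kube-system namespace. The mapping created above ends up as an entry roughly like this (a sketch; the account id and role name are the placeholders from the command above, and the exact layout may differ slightly between eksctl versions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456:role/YOUR_ROLE
      username: admin
      groups:
        - system:masters
```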

Wednesday, August 26, 2020

Kubeflow Istio configuration for trustworthy JWTs on Rancher 2.x


For some reason, some of the default feature gates are not turned on in Rancher.
So to deploy Kubeflow, or any workload that uses Istio 1.3.1 with SDS enabled, you need to enable the TokenRequest and TokenRequestProjection feature gates.

Issue symptoms:

  1. istio-pilot and everything dependent on it will fail to start in the Kubeflow deployment.
  2. Pod events / logs similar to: MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the API server does not have TokenRequest endpoints enabled

How to prepare Rancher for Istio 1.3 and up (tested on 2.x)

Option 1, use server configuration file (yaml edit)

  1. Log in to your Rancher 2.x UI
  2. Select the relevant cluster
  3. Open the cluster's options and choose Edit
  4. Under "Cluster Options", choose "Edit as YAML"
  5. Go to "kube-api:"
    and add:
           service-account-issuer: "kubernetes.default.svc"
           service-account-signing-key-file: "/etc/kubernetes/ssl/kube-service-account-token-key.pem"
  6. Save the file / configuration
The cluster will reconfigure.
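Put together, the relevant part of the cluster YAML should look roughly like this (a sketch based on the steps above; the surrounding keys follow the RKE cluster-options layout and may differ slightly between Rancher versions):

```yaml
services:
  kube-api:
    extra_args:
      service-account-issuer: "kubernetes.default.svc"
      service-account-signing-key-file: "/etc/kubernetes/ssl/kube-service-account-token-key.pem"
```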

Option 2, feature gates flags via Rancher API

Follow the instructions in this thread.


Friday, November 01, 2019

mounting AWS (Amazon Web Services) EFS on Linux Ubuntu 18.04

Amazon Elastic File System (Amazon EFS) is scalable file storage for EC2 and services that run on EC2 (for example, Kubernetes clusters). The device is accessible on Linux via the NFS protocol and can be used by multiple instances and pods at the same time.
For more information on EFS visit AWS documentation.

Step one: Gather information
In our case it is pretty straightforward: an Ubuntu instance in the same VPC as the EFS, and the DNS name of the file system we want to access. The DNS name follows this convention:

file-system-id.efs.aws-region.amazonaws.com

The exact name is available in the AWS console under the file system's DNS name, or via the CLI.

Step two: Install the NFS Client for Linux

sudo apt-get update
sudo apt-get install nfs-common

Step three: Mount the file system on EC2 instance.
Create (if you don't already have one) a mount point for the EFS:

sudo mkdir -p /mnt/efs-mount-point

Mount the EFS share on the instance

sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport mount-target-DNS:/   /mnt/efs-mount-point

Now we have a mounted Amazon EFS file system on our Ubuntu EC2 instance.
Keep in mind that this mount doesn't persist across reboots. If you want it to be permanently accessible, you have to add it to /etc/fstab.
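A minimal /etc/fstab entry matching the mount options above might look like this (mount-target-DNS is the same placeholder as in the mount command; adding _netdev tells the system to wait for networking before mounting):

```
mount-target-DNS:/  /mnt/efs-mount-point  nfs4  nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev  0  0
```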

Common error:

efs mount.nfs: Connection timed out

This error can occur because the EC2 instance's security group, the mount target's security groups, or the file system's access settings are not configured properly.

For more troubleshooting tips, see the AWS EFS documentation.

Thursday, August 08, 2019

How to Install Terraform 0.12 on Ubuntu 18.04

As of this writing (8/8/2019), Terraform is not packaged in an official apt repository. There is an option to install it with Snap, but be careful: it will probably be an older version. When I checked, it was v0.11.11.
If you do want to install it with snap, run:

$  snap install terraform

To install the latest version follow this procedure.

You might want to update your system just in case:

sudo apt-get update

Since you are getting the Terraform binary from the official HashiCorp site, you will need both the wget and unzip packages, unless they are already installed:

sudo apt-get install wget unzip

The last step is to download and unzip the Terraform package (you can find the latest release here).

wget https://releases.hashicorp.com/terraform/0.12.6/terraform_0.12.6_linux_amd64.zip
sudo unzip ./terraform_0.12.6_linux_amd64.zip -d /usr/local/bin/

Check that it is installed:

$ terraform -v

You are all done.
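The fixed-version commands above can be parameterized, so upgrading later only means changing one variable (a sketch; TF_VERSION is an assumption, substitute whatever release you want from releases.hashicorp.com):

```shell
#!/bin/sh
# TF_VERSION is an assumption: set it to the release you want to install.
TF_VERSION=0.12.6
ZIP="terraform_${TF_VERSION}_linux_amd64.zip"

# Download the release archive and unpack the binary into /usr/local/bin.
wget "https://releases.hashicorp.com/terraform/${TF_VERSION}/${ZIP}"
sudo unzip "./${ZIP}" -d /usr/local/bin/
```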

Provided by: Forthscale systems, cloud experts

Tuesday, February 26, 2019

Getting AWS EC2 instance id (instanceid) from within the ec2 instance

In general, you can get a lot of instance metadata by querying the instance metadata service at http://169.254.169.254/latest/meta-data/
That includes the instance id.

On a generic Linux system, you can get the ID either using curl:
curl http://169.254.169.254/latest/meta-data/instance-id
or wget:
wget -q -O - http://169.254.169.254/latest/meta-data/instance-id

If your instance is based on Amazon Linux or has cloud-utils installed, you can also run:
ec2-metadata -i
for the instance id.

More documentation on instance metadata is available in the AWS EC2 documentation.


Monday, November 19, 2018

HTTP 409 while provisioning Google Cloud SQL instance

While creating a new Google Cloud SQL instance, be careful not to reuse an instance name (master or replica) that was recently used. How recent? Up to two months.

Errors you might encounter:
ERROR: (gcloud.sql.instances.create) Resource in project [Project name] is the subject of a conflict: The instance or operation is not in an appropriate state to handle the request.
HTTP 409


Thursday, September 21, 2017

DevOps and Site Reliability Engineering (SRE)

As we all know, the Computer Age and the Internet Age have both profoundly impacted the world of commerce. As customer experience changes, led by internet giants, IT operations change accordingly to support new processes. Not so long ago, new product development could mostly be decoupled from operations. Of course, there were some connections: factories had to retool their machinery if changes were made. Yet the nature of physical products allowed development and operations to drift apart.

With the explosion of cyber property in the last few decades, though, the product mix has changed. Digital products represent a large and growing part of global offerings. An expectation from such a product is to be always-reliable, accessible from anywhere by anyone at any time. Recent offerings from major cloud providers advertise simplicity in supporting this notion. In reality, everything is still technically grounded (servers need to physically be somewhere). To meet market expectations development has to work closely with operations.

For a simple example, consider a buyer-seller connection service. In the 1970s, perhaps there was a weekly publication of sellers in a relatively small geographic area. Buyers couldn’t directly compete, because the seller could only handle one caller at a time. Today, hundreds of remote buyers can compete directly and instantly, and the seller never has to negotiate with a single one if s/he doesn’t want to. For the retail equivalent (mail order catalogues), in today’s system, there might not be a human involved between the factory and the customer’s house at all.

Similar transformations abound in a plethora of industries. Cyber products are entirely new, and, because reliability and security are paramount, development simply cannot remain decoupled from operations. Moreover, simple yet powerful upgrades from development can be applied with minimal interference and downtime, so why wouldn’t operations departments cooperate with development teams to enhance the customer experience?

What is DevOps? What is SRE?

DevOps — an organizational model that encourages communication, empathy, and ownership throughout the company
SRE — an organizational model to reconcile the opposing incentives of operations and development teams within an organization

These two terms are widely used and broadly applied. Sometimes too broadly. The term Site Reliability Engineering was born at Google, the brainchild of Ben Treynor. It, like DevOps, is a blend of operations and development. The most important aspects, similarly to DevOps, are automating operations processes and increasing collaboration. This is especially important in globally-scaled, always-on-demand services, because not all errors and issues can (or even should) be handled by humans. We humans have better things to do.

SRE aims to provide availability, performance, change management, emergency response, and capacity planning. Each of these factors is essential to global-grade services, because the software landscape sees intense competition. A couple days of downtime can mean customers flowing to competitors. This brave new world requires new techniques.

A One-Paragraph Primer on Reliability Terminology

Any operations student would know that there are two parts to reliability: Mean Time to Repair (MTTR) and Mean Time Between Failures (MTBF). The former is how long a system is in error before it is fixed, and the latter is how long the interval is between failures. These two concepts work together, and striking a balance between them is the goal.

Traditional Operations and Development Interaction

The traditional interaction between development teams and operations teams is bipolar. On one end, the development team is tasked with creating new features and attracting as many customers as possible. New features are an attractor, and hence more new features amplify the attraction. Unfortunately, this sometimes leads development teams to publish updates and features before they are thoroughly tested. It also leads to frustrated operations teams when the service goes down.

Conversely, the operations team is tasked with running the service once it has been approved and established. The ops team doesn’t want more work than is essential, so it encourages longer and more rigorous testing periods before release. This leads to long lead times and frustrated development teams who just want to push out the newest, coolest features.

Is there no middle ground?

The conflict between dev and ops can be palpable. Sometimes responsibility for code is even hidden from operations to limit fallout onto one person, which is known as information hiding. This is not an efficient or well-oiled system. How can we reconcile the seemingly opposite goals of development and of operations? In SRE, the term is “error budget”.

According to the creators of SRE, a 100% reliability rate is unlikely, and maybe not even desirable. 99.9% reliability is indistinguishable from 100% for the userbase. Maybe 99% is your target. It depends on the users and what level of reliability they are willing to accept. This level is defined multilaterally (see “Moral Authority”).

Whatever your target, the difference between your target and 100% is the “error budget”. The development team may produce code that has an error rate up to the budget. That means they can do less testing or roll out less stable features, as long as they don’t surpass the budgeted downtime. Once the downtime allowance is surpassed, all future launches must be blocked, including major ones, until the budget is “earned back” with performance that is better than the target reliability rate.
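As a quick worked example, the budget arithmetic is simple: with a 99.9% target over a 30-day month (43,200 minutes), the budget is 0.1% of that time:

```shell
# Convert an SLA target (percent) into a monthly downtime budget in minutes.
# 30 days * 24 hours * 60 minutes = 43200 minutes in the month.
awk -v target=99.9 'BEGIN {
  printf "allowed downtime: %.1f minutes/month\n", (100 - target) / 100 * 30 * 24 * 60
}'
# prints: allowed downtime: 43.2 minutes/month
```

A 99% target, by the same arithmetic, buys the dev team roughly 432 minutes (just over 7 hours) of downtime per month.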

This small but brilliant change has interesting consequences. The dev team attempts to code for low native error rates, because they want to use their budget on more interesting and fun features, not the foundational code. Furthermore, the dev team starts to self-police, because they want to conserve the budget for worthwhile launches, not consume it on errors in basic features. Finally, there is no blame or info hiding, because everyone agreed to the budget in the first place. This leads to empathy and communication between teams, replacing the sometimes hostile environment of the traditional dev-ops relationship.

Moral Authority

In an organization, especially in the tech world, it is imperative that employees believe in their leadership. A rogue team is disastrous, and sabotage is a real threat. Whence stems the moral authority for SRE? This lies in the budgeting process. Development, operations, product managers, and organizational management agree to Service Level Agreements (SLAs), which state the minimum uptime (which necessarily stipulates the maximum downtime) that is acceptable to customers.

This is the foundation for the budget. If customers are willing to accept 99.5% uptime, then the budget is 0.5%. And since the development team has agreed to this level, they have no authority to challenge SRE blocking their launches if the budget is spent. Everyone agrees beforehand, so there is no political jockeying once the system is live.

Monitoring, Reliability, and Service Degradation

A public-facing system will inevitably be down sometimes. Even if the MTTR is extremely short and unnoticeable by customers, the system has still failed. This is the reason monitoring (and preferably automated monitoring) is essential.

According to Treynor, there are three parts to monitoring. First is logging, which is mundane and mainly for diagnostic and research purposes later. This isn’t meant to be read continuously, only used as a tool for later review, if necessary. Then there are tickets, for which humans must take action, but maybe not immediately. Then there are alerts, such as when the service is offline for most customers — these require immediate human response, likely in the form of an emergency or crisis response team.

Most error handling should be automated, and this is an area where machines fix themselves. The more machines fix themselves, the better. This quick, automatic repair is related to reliability via Mean Time to Repair (MTTR). If service problems occur but the MTTR is a few milliseconds (because computers are fixing themselves), then the users will never notice. That means dev has more available budget, a good incentive to develop automated error-handling systems.

Now, what to do when the MTTR is longer than a few milliseconds? Many errors will be on back-end systems, and with replication, there may be no discernible issue for the front-end site or service. If, however, issues apparent to the consumer are inevitable, it is best to engineer for "graceful degradation". This just means you don't want your service blacking out completely, but maybe slowing down or lowering service quality. A complete blackout with a completely unreachable or unresponsive service will cause customer backlash. Degraded service will cause annoyance, but probably not drive customers away. This can be accomplished via a microservice architecture, as one service going down does not take down the entire offering.

From the customer viewpoint, lots of short-MTTR errors are probably better than long but infrequent errors, because short-MTTR errors are often eliminated before customers even notice. On the other hand, if a firm doesn't implement a system for these errors, the exact opposite is desirable: one long outage means one long fix, not an endless stream. Hence, to reconcile this conflict, it is strongly suggested to create a system to handle issues. And when the company scales, it is all but imperative to automate, because problems will inevitably outstrip operations headcount.

Why is SRE important?

All organizations want to provide excellent service to users. All organizations have organizational structure, and sometimes that structure includes competing teams and incentives. SRE attempts to eliminate one major issue, especially in modern organizational structures. Chaos behind the scenes will eventually lead to chaos on the front-end, where customers can indirectly observe the Pyrrhic war between development and operations end in a spectacular implosion of the service (and the customer base).

How is SRE related to DevOps?

The first and most obvious way it is related is in using software techniques in operations. But that is trivial, especially in tech companies, because modern operations departments all rely on software to some degree. Both also foster inter-team communication.

However, DevOps encourages communication between teams across the organization, while SRE encourages communication between the development and operations teams. DevOps is concerned with broad empathy and ownership (even involving sales and marketing), while SRE tends to focus on only development and operations. Furthermore, in DevOps, the development team will feel responsible for the life of the product, while in SRE, dev might self-police, but the ultimate operations responsibility still lies with operations.

There are yet more similarities, though, such as the tendency to automate as much of the operations process as possible, including continuous delivery procedures: dev teams under an SRE model might roll out small updates to stay under the error budget, while dev teams under a DevOps model tend to make small updates for easier monitoring and bug identification. Both encourage scalability, such that products not only have solid foundations and native code, but that base product can expand with the business.

As with anything in organizational management, these terms are not mutually exclusive, and they do not have to be separated. Furthermore, each company has its own unique culture and needs, so applying aspects of DevOps and aspects of SRE simultaneously is not taboo. In fact, it is viewed positively. Innovative companies always look for the best aspect of something, extract that best aspect, and adapt and apply that aspect to their own needs. Don’t be afraid to be unique, and certainly don’t be afraid to stand on the shoulders of giants.

Also published @ Forthscale medium account
