<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Jeff's Chronicles]]></title><description><![CDATA[Practical insights on DevOps, DataOps and AIOps. Hands-on experimentation on the evolving.]]></description><link>https://blog.kakarukeys.quest/</link><image><url>https://blog.kakarukeys.quest/favicon.png</url><title>Jeff&apos;s Chronicles</title><link>https://blog.kakarukeys.quest/</link></image><generator>Ghost 5.82</generator><lastBuildDate>Thu, 29 Jan 2026 03:23:31 GMT</lastBuildDate><atom:link href="https://blog.kakarukeys.quest/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Learn Kubernetes with k3s - part5 - Helm]]></title><description><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/learn-kubernetes-k3s-part4-services/" rel="noreferrer">part 4</a>.</p><p>Previously, I have used <strong>helm</strong>, the Kubernetes package manager to install an nginx application. To list all the releases</p><pre><code class="language-bash">$ helm list
NAME      	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART        	APP VERSION
my-release	default  	1       	2024-03-19 22:58:08.794126266 +0800 +08	deployed	nginx-15.14.</code></pre>]]></description><link>https://blog.kakarukeys.quest/learn-kubernetes-k3s-p5-helm/</link><guid isPermaLink="false">6602b91869b40300070d5115</guid><category><![CDATA[k8s]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Wed, 27 Mar 2024 15:10:42 GMT</pubDate><content:encoded><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/learn-kubernetes-k3s-part4-services/" rel="noreferrer">part 4</a>.</p><p>Previously, I have used <strong>helm</strong>, the Kubernetes package manager to install an nginx application. To list all the releases</p><pre><code class="language-bash">$ helm list
NAME      	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART        	APP VERSION
my-release	default  	1       	2024-03-19 22:58:08.794126266 +0800 +08	deployed	nginx-15.14.0	1.25.4
</code></pre>
<p>We can see the only release, <code>my-release</code>, in the list. To get the release notes that were displayed on the initial installation</p><pre><code class="language-bash">$ helm get notes my-release
NOTES:
CHART NAME: nginx
CHART VERSION: 15.14.0
APP VERSION: 1.25.4
...
</code></pre>
<p>To change the application&apos;s config, modify <code>values.yaml</code> to</p><pre><code class="language-bash">$ cat values.yaml
service:
  type: LoadBalancer
</code></pre>
<p>and run</p><pre><code class="language-bash">$ helm upgrade my-release \
  oci://registry-1.docker.io/bitnamicharts/nginx \
  -f values.yaml
</code></pre>
<p>We see a confirmation message of the upgrade, and the content of the release notes has changed too. Run</p><pre><code class="language-bash">$ kubectl get services/my-release-nginx
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
my-release-nginx   LoadBalancer   10.43.206.241   &lt;pending&gt;     80:30080/TCP   6d22h
</code></pre>
<p>The service type of <code>my-release-nginx</code> has changed from NodePort to LoadBalancer. Note that this way of resource provisioning is a break from the imperative model of <code>kubectl create</code>, as we now manage the application config in code: our <code>values.yaml</code> file.</p><p>To roll back</p><pre><code class="language-bash">$ helm rollback my-release 1
$ kubectl get services/my-release-nginx
NAME               TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
my-release-nginx   NodePort   10.43.206.241   &lt;none&gt;        80:30080/TCP   6d22h
</code></pre>
<p>We see that we are back to the configuration of revision 1 (Helm records the rollback itself as a new revision). </p><p>Now we are using the imperative model again, since the actual application config no longer aligns with what is in the code. We will find a more permanent solution to enforce IaC in the future.</p><p>Finally, to uninstall the application</p><pre><code class="language-bash">$ helm uninstall my-release
release &quot;my-release&quot; uninstalled
</code></pre>
<p>You can find helm charts of many open-source applications on the public registry <a href="https://artifacthub.io/?ref=blog.kakarukeys.quest" rel="noreferrer">Artifact Hub</a>. Some charts can be installed directly from an OCI link, as we did with nginx. </p><p>Others require you to add a repository first. For example, to install <a href="https://artifacthub.io/packages/helm/milvus/milvus?ref=blog.kakarukeys.quest" rel="noreferrer">milvus db</a>,</p><pre><code class="language-bash">$ helm repo add milvus https://milvus-io.github.io/milvus-helm/

$ helm repo list
NAME  	URL                                     
milvus	https://milvus-io.github.io/milvus-helm/

$ helm repo update

$ helm upgrade --install my-release milvus/milvus
Release &quot;my-release&quot; does not exist. Installing it now.
NAME: my-release
LAST DEPLOYED: Wed Mar 27 22:51:20 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
</code></pre>
<p>An organization who deploys frequently to Kubernetes would probably use a private registry, in which case, you need to login to the registry first, by <code>helm registry login</code> before you run <code>helm repo add</code>.</p><h3 id="summary">Summary</h3><p>A Helm chart is a collection of YAML files that describe a related set of Kubernetes resources, such as deployments, services, secrets, configmaps, and ingresses. It packages all the necessary resources for an application into a single unit.</p><p>When you install a Helm chart, Helm renders all the YAML manifests in the chart and deploys them to the Kubernetes cluster together. This ensures that all the required resources are created together.</p><p>Helm charts use templating to reduce duplication of YAML code. This allows you to define common configurations in one place and reuse them across different environments or applications.</p><p>To be continued in part 6, the common Helm chart parameters.</p>]]></content:encoded></item><item><title><![CDATA[Private PyPI repository on AWS - part 2]]></title><description><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/private-pypi-repository-aws-p1/" rel="noreferrer">part 1</a>.</p><p>After creating the repository, we can test it by publishing a package to it and then installing that package in another project. 
This can be done using <strong>poetry</strong>, the python package manager.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://python-poetry.org/?ref=blog.kakarukeys.quest"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Poetry - Python dependency management and packaging made easy</div><div class="kg-bookmark-description">Python</div></div></a></figure>]]></description><link>https://blog.kakarukeys.quest/private-pypi-repository-aws-p2/</link><guid isPermaLink="false">6601873869b40300070d5087</guid><category><![CDATA[AWS DevOps]]></category><category><![CDATA[Python]]></category><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Mon, 25 Mar 2024 15:14:56 GMT</pubDate><content:encoded><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/private-pypi-repository-aws-p1/" rel="noreferrer">part 1</a>.</p><p>After creating the repository, we can test it by publishing a package to it and then installing that package in another project. 
This can be done using <strong>poetry</strong>, the python package manager.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://python-poetry.org/?ref=blog.kakarukeys.quest"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Poetry - Python dependency management and packaging made easy</div><div class="kg-bookmark-description">Python dependency management and packaging made easy</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://python-poetry.org/images/favicon-origami-32.png" alt><span class="kg-bookmark-author">Python dependency management and packaging made easy</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://python-poetry.org/images/logo-origami.svg" alt></div></a></figure><h2 id="create-and-publish-a-package">Create and publish a package</h2><p>Create a project containing the following files, and define the dependencies with poetry commands</p><pre><code class="language-bash">.
&#x2514;&#x2500;&#x2500; my_package
&#xA0;&#xA0;&#xA0; &#x251C;&#x2500;&#x2500; hello.py
&#xA0;&#xA0;&#xA0; &#x2514;&#x2500;&#x2500; __init__.py

$ cat my_package/hello.py 
def f():
    return 4
    
$ pyenv local 3.9.18
$ poetry init
$ poetry add requests boto3
</code></pre>
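<p>Before publishing, it is worth a quick smoke test that the package imports cleanly. This is not part of the original walkthrough, just a sanity check; it recreates the two files above so the snippet is self-contained:</p>

```shell
# Recreate the sample package layout from above, then import it.
mkdir -p my_package
printf 'def f():\n    return 4\n' > my_package/hello.py
touch my_package/__init__.py
python3 -c 'from my_package.hello import f; print(f())'
```

<p>It should print <code>4</code>, the value returned by <code>f()</code>.</p>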
<p>We have used <strong>pyenv</strong> to set the virtual environment python version.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/pyenv/pyenv?ref=blog.kakarukeys.quest"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - pyenv/pyenv: Simple Python version management</div><div class="kg-bookmark-description">Simple Python version management. Contribute to pyenv/pyenv development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">pyenv</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/199057bf921950113e00686e8a859e3a7ccb77b638f23a2716b61af782144a06/pyenv/pyenv" alt></div></a></figure><p>To add our private PyPI repository to poetry</p><pre><code class="language-bash">$ poetry config repositories.private-pypi \
    https://example-111122223333.d.codeartifact.us-west-2.amazonaws.com/private-pypi/
</code></pre>
<p>The URL is the endpoint URL from the Terraform outputs in <a href="https://blog.kakarukeys.quest/private-pypi-repository-aws-p1/" rel="noreferrer">part 1</a>.</p><p>To connect to the PyPI repository, we need to get an authorization token, which is valid for only 12 hours. The IAM user we use for requesting the token requires a special IAM permission, <code>sts:GetServiceBearerToken</code>.</p><p>I don&apos;t like to mess around with my root user, so I created a test user without a login profile just for testing.</p><pre><code class="language-hcl">module &quot;test_user&quot; {
  source  = &quot;terraform-aws-modules/iam/aws//modules/iam-user&quot;
  version = &quot;5.37.0&quot;

  name          = &quot;test_user&quot;
  force_destroy = true

  create_iam_user_login_profile = false
  create_iam_access_key         = true
}

resource &quot;aws_iam_policy&quot; &quot;allow_sts_get_service_bearer_token&quot; {
  name        = &quot;AllowSTSGetServiceBearerToken&quot;
  description = &quot;Policy to allow sts:GetServiceBearerToken&quot;

  policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [
      {
        Effect   = &quot;Allow&quot;
        Action   = &quot;sts:GetServiceBearerToken&quot;
        Resource = &quot;*&quot;
      },
    ]
  })
}

resource &quot;aws_iam_policy_attachment&quot; &quot;attach_policy_to_user&quot; {
  name       = &quot;AttachPolicyToUser&quot;
  users      = [module.test_user.iam_user_name]
  policy_arn = aws_iam_policy.allow_sts_get_service_bearer_token.arn
}

output &quot;iam_access_key_id&quot; {
  description = &quot;iam_access_key_id&quot;
  value       = module.test_user.iam_access_key_id
}

output &quot;iam_access_key_secret&quot; {
  description = &quot;iam_access_key_secret&quot;
  value       = module.test_user.iam_access_key_secret
  sensitive   = true
}
</code></pre>
<p>Remember to add this IAM user&apos;s ARN to <code>codeartifact_readwrite_access_arns</code>.</p><pre><code class="language-hcl">codeartifact_readwrite_access_arns  = [
  module.example_github_oidc_role.arn,
  module.test_user.iam_user_arn,
]
</code></pre>
<p>After terraform plan &amp; apply, run the following to get the value for <code>iam_access_key_secret</code></p><pre><code class="language-bash">$ terraform output iam_access_key_secret
</code></pre>
<p>Set up the AWS command line with <code>iam_access_key_id</code> and <code>iam_access_key_secret</code>.</p><pre><code class="language-bash">$ aws configure
</code></pre>
<p>Now we can request a token.</p><pre><code class="language-bash">$ TOKEN=`aws codeartifact get-authorization-token --domain example --domain-owner 111122223333 --query authorizationToken --output text`
$ poetry config http-basic.private-pypi aws $TOKEN
</code></pre>
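<p>Under the hood this is plain HTTP Basic authentication: poetry sends the literal username <code>aws</code> with the token as the password, base64-encoded into the Authorization header. An illustration with a placeholder token (not a real one):</p>

```shell
# HTTP Basic credentials are "user:password" base64-encoded; CodeArtifact
# expects the literal username "aws" and the authorization token as the password.
TOKEN=example-token   # placeholder; a real token comes from get-authorization-token
AUTH=$(printf 'aws:%s' "$TOKEN" | base64)
echo "Authorization: Basic $AUTH"
```

<p>poetry does this encoding for you; the point is that the token is a bearer credential and should be treated like a password.</p>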
<p>We can then build the project (producing an sdist and a wheel under <code>dist/</code>) and publish the resulting packages</p><pre><code class="language-bash">$ poetry build -v
$ poetry publish -v --repository=private-pypi
</code></pre>
<p>Log in to the AWS console, open the CodeArtifact service, browse into the <code>private-pypi</code> repository, and you should see</p>
<figure class="kg-card kg-image-card"><img src="https://blog.kakarukeys.quest/content/images/2024/03/my_package.png" class="kg-image" alt loading="lazy" width="1924" height="798" srcset="https://blog.kakarukeys.quest/content/images/size/w600/2024/03/my_package.png 600w, https://blog.kakarukeys.quest/content/images/size/w1000/2024/03/my_package.png 1000w, https://blog.kakarukeys.quest/content/images/size/w1600/2024/03/my_package.png 1600w, https://blog.kakarukeys.quest/content/images/2024/03/my_package.png 1924w" sizes="(min-width: 720px) 720px"></figure><p>To be continued in part 3, installing from private PyPI.</p>]]></content:encoded></item><item><title><![CDATA[Private PyPI repository on AWS - part 1]]></title><description><![CDATA[<p>Development teams often find themselves needing to share common code across different projects. This is a common challenge in software development.</p><p>Reusing code not only saves time but also ensures consistency and improves software reliability. </p><p>Consider the scenario where a development team has a set of utility functions that are</p>]]></description><link>https://blog.kakarukeys.quest/private-pypi-repository-aws-p1/</link><guid isPermaLink="false">65e875f969b40300070d4c6a</guid><category><![CDATA[AWS DevOps]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Sun, 24 Mar 2024 01:44:42 GMT</pubDate><content:encoded><![CDATA[<p>Development teams often find themselves needing to share common code across different projects. This is a common challenge in software development.</p><p>Reusing code not only saves time but also ensures consistency and improves software reliability. </p><p>Consider the scenario where a development team has a set of utility functions that are used across multiple projects. 
These functions could be anything from data processing utilities to logging and error handling functions.</p><p>The team needs a way to share these functions without duplicating the code in every project.</p><h2 id="the-different-approaches">The Different Approaches</h2><p>There are several ways to share common code among projects, each with its own set of pros and cons:</p><ul><li><strong>Copy-pasting the common code in every project</strong>: This is the simplest approach but can lead to code duplication and inconsistencies. It&apos;s also difficult to maintain and update the code across all projects.</li><li><strong>Using <code>pip install</code> directly from GitHub repositories</strong>: This method is straightforward but can be slow, especially for large projects. It also requires SSH credentials for private repositories, adding an extra layer of complexity.</li><li><strong>Using git submodules</strong>: This approach allows you to include a repository as a submodule in another repository, making it easy to keep track of changes. However, it adds complexity when managing changes across repositories and can lead to issues with dependency management.</li></ul><h2 id="turning-common-code-into-installable-packages">Turning Common Code into Installable Packages</h2><p>A more scalable and maintainable solution is to turn the common code into installable packages. This approach allows teams to easily share, update, and manage common code across projects. </p><p>However, the initial step of setting up a central artifact repository can be time-consuming. This is an upfront cost that the team needs to bear.</p><h2 id="aws-codeartifact">AWS CodeArtifact</h2><p>If you&apos;re using AWS, AWS CodeArtifact provides a solution to host a PyPI repository. AWS CodeArtifact is a fully managed artifact repository service that makes it easy to store, publish, and share software packages used in your software development process. 
</p><p>It handles TLS, authentication, and authorization, and uses IAM to control access. The service is serverless, meaning you only pay for what you use.</p><h2 id="creating-a-private-pypi-repository">Creating a Private PyPI repository</h2><p>To create an AWS CodeArtifact repository, you can use Terraform. </p><h3 id="create-a-domain">Create a domain</h3><pre><code class="language-hcl">locals {
  codeartifact_domain_name     = &quot;example&quot;
  codeartifact_repository_name = &quot;private-pypi&quot;

  codeartifact_readonly_access_arns = [
    module.iam_role.arn,
    module.example_glue_job_orchestrator_role.arn,
  ]

  codeartifact_readwrite_access_arns = [
    module.example_github_oidc_role.arn
  ]

  codeartifact_packages_arn = &quot;arn:aws:codeartifact:${local.region}:${data.aws_caller_identity.current.account_id}:package/${local.codeartifact_domain_name}/${local.codeartifact_repository_name}/*&quot;
}

resource &quot;aws_codeartifact_domain&quot; &quot;example&quot; {
  domain = &quot;example&quot;
}

data &quot;aws_iam_policy_document&quot; &quot;example_domain_policy_document&quot; {
  statement {
    effect    = &quot;Allow&quot;
    actions   = [&quot;codeartifact:GetAuthorizationToken&quot;]
    resources = [aws_codeartifact_domain.example.arn]

    principals {
      type = &quot;AWS&quot;
      
      identifiers = concat(
        local.codeartifact_readonly_access_arns,
        local.codeartifact_readwrite_access_arns,
      )
    }
  }
}

resource &quot;aws_codeartifact_domain_permissions_policy&quot; &quot;example_domain_permissions_policy&quot; {
  domain          = aws_codeartifact_domain.example.domain
  policy_document = data.aws_iam_policy_document.example_domain_policy_document.json
}
</code></pre>
<p>I have defined two access groups:</p><ol><li>those who can perform pip install from the private PyPI.</li><li>those who can do the above and publish new packages into the private PyPI.</li></ol><p>Both groups require an authorization token (HTTP Basic Auth) to connect to the service.</p><h3 id="create-a-repository-within-the-domain">Create a repository within the domain</h3><p>Then we create the repository and grant the necessary permissions to the two groups at the repository and package levels.</p><pre><code class="language-hcl">resource &quot;aws_codeartifact_repository&quot; &quot;private_pypi&quot; {
  repository  = &quot;private-pypi&quot;
  description = &quot;Private PyPI repository&quot;
  domain      = aws_codeartifact_domain.example.domain
}

data &quot;aws_iam_policy_document&quot; &quot;private_pypi_policy_document&quot; {
  statement {
    effect    = &quot;Allow&quot;
    actions   = [&quot;codeartifact:ReadFromRepository&quot;]
    resources = [aws_codeartifact_repository.private_pypi.arn]

    principals {
      type = &quot;AWS&quot;
      
      identifiers = concat(
        local.codeartifact_readonly_access_arns,
        local.codeartifact_readwrite_access_arns,
      )
    }
  }

  statement {
    effect = &quot;Allow&quot;

    actions = [
      &quot;codeartifact:GetPackageVersionAsset&quot;,
      &quot;codeartifact:GetPackageVersionReadme&quot;,
    ]

    resources = [local.codeartifact_packages_arn]

    principals {
      type = &quot;AWS&quot;
      
      identifiers = concat(
        local.codeartifact_readonly_access_arns,
        local.codeartifact_readwrite_access_arns,
      )
    }
  }

  statement {
    effect    = &quot;Allow&quot;
    actions   = [&quot;codeartifact:PublishPackageVersion&quot;]
    resources = [local.codeartifact_packages_arn]

    principals {
      type        = &quot;AWS&quot;
      identifiers = local.codeartifact_readwrite_access_arns
    }
  }
}

resource &quot;aws_codeartifact_repository_permissions_policy&quot; &quot;private_pypi_permissions_policy&quot; {
  repository      = aws_codeartifact_repository.private_pypi.repository
  domain          = aws_codeartifact_domain.example.domain
  policy_document = data.aws_iam_policy_document.private_pypi_policy_document.json
}
</code></pre>
<h3 id="output-the-repository-endpoint-url">Output the repository endpoint URL</h3><pre><code class="language-hcl">data &quot;aws_codeartifact_repository_endpoint&quot; &quot;private_pypi_endpoint&quot; {
  domain     = aws_codeartifact_domain.example.domain
  repository = aws_codeartifact_repository.private_pypi.repository
  format     = &quot;pypi&quot;
}

output &quot;private_pypi_endpoint_url&quot; {
  description = &quot;Private PyPI repository endpoint URL&quot;
  value       = data.aws_codeartifact_repository_endpoint.private_pypi_endpoint.repository_endpoint
}
</code></pre>
<p>Create the resources with</p><pre><code class="language-bash">$ terraform plan
$ terraform apply
</code></pre>
<p>and take note of the endpoint URL in the outputs. It looks like this:</p>
<p><code>https://example-111122223333.d.codeartifact.us-west-2.amazonaws.com/private-pypi</code></p>
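<p>The hostname encodes the CodeArtifact domain (<code>example</code>) and the domain owner's AWS account id (<code>111122223333</code>), the two values needed again when requesting an authorization token. As a sketch, they can be pulled back out with plain shell parameter expansion (splitting on the last dash, since the account id never contains one):</p>

```shell
# Split the endpoint hostname back into CodeArtifact domain and owner account id.
URL="https://example-111122223333.d.codeartifact.us-west-2.amazonaws.com/private-pypi"
host=${URL#https://}     # drop the scheme
host=${host%%.*}         # first hostname label: example-111122223333
DOMAIN=${host%-*}        # before the last dash: example
OWNER=${host##*-}        # after the last dash: 111122223333
echo "$DOMAIN $OWNER"
```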
<p>To be continued in <a href="https://blog.kakarukeys.quest/private-pypi-repository-aws-p2/">part 2</a>, publishing packages to the private PyPI repository.</p>
]]></content:encoded></item><item><title><![CDATA[Learn Kubernetes with k3s - part4 - Services]]></title><description><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part3-kubectl/" rel="noreferrer">part 3</a>.</p><p>Previously I have explained what Kubernetes services are. I have used <code>kubectl</code> to reveal the services running on my cluster (in <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part2-networking/" rel="noreferrer">part2</a>).</p><pre><code class="language-bash">$ kubectl get services -A
NAMESPACE     NAME             TYPE           CLUSTER-IP    EXTERNAL-IP                 PORT(S)                      AGE
default       kubernetes       ClusterIP      10.43.0.1     &lt;</code></pre>]]></description><link>https://blog.kakarukeys.quest/learn-kubernetes-k3s-part4-services/</link><guid isPermaLink="false">65f8490d69b40300070d4f33</guid><category><![CDATA[k8s]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Tue, 19 Mar 2024 15:29:55 GMT</pubDate><content:encoded><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part3-kubectl/" rel="noreferrer">part 3</a>.</p><p>Previously I have explained what Kubernetes services are. I have used <code>kubectl</code> to reveal the services running on my cluster (in <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part2-networking/" rel="noreferrer">part2</a>).</p><pre><code class="language-bash">$ kubectl get services -A
NAMESPACE     NAME             TYPE           CLUSTER-IP    EXTERNAL-IP                 PORT(S)                      AGE
default       kubernetes       ClusterIP      10.43.0.1     &lt;none&gt;                      443/TCP                      7d1h
kube-system   kube-dns         ClusterIP      10.43.0.10    &lt;none&gt;                      53/UDP,53/TCP,9153/TCP       7d1h
kube-system   metrics-server   ClusterIP      10.43.22.2    &lt;none&gt;                      443/TCP                      7d1h
kube-system   traefik          LoadBalancer   10.43.158.9   192.168.1.53,192.168.1.54   80:30824/TCP,443:30872/TCP   7d1h
</code></pre>
<p>Let&apos;s explore the different types of Kubernetes services.</p><h2 id="clusterip">ClusterIP</h2>
<p>Try connecting to the metrics-server ClusterIP service.</p><p>On tab 1, run:</p>
<pre><code class="language-bash">$ kubectl port-forward svc/metrics-server 8443:443 -n kube-system
Forwarding from 127.0.0.1:8443 -&gt; 10250
Forwarding from [::1]:8443 -&gt; 10250
</code></pre>
<p>On tab 2:</p>
<pre><code class="language-bash">$ curl -k https://localhost:8443/
{
  &quot;kind&quot;: &quot;Status&quot;,
  &quot;apiVersion&quot;: &quot;v1&quot;,
  &quot;metadata&quot;: {},
  &quot;status&quot;: &quot;Failure&quot;,
  &quot;message&quot;: &quot;forbidden: User \&quot;system:anonymous\&quot; cannot get path \&quot;/\&quot;&quot;,
  &quot;reason&quot;: &quot;Forbidden&quot;,
  &quot;details&quot;: {},
  &quot;code&quot;: 403
}
</code></pre>
<p>403 Error! Nonetheless, the connection works. </p><blockquote><strong>ClusterIP</strong>: exposes the service on a cluster-internal IP. This type makes the service only reachable from within the cluster. <em>This is the default service type.</em></blockquote><p>On the cluster, you can connect to metrics-server directly by its hostname.</p><pre><code class="language-bash">$ kubectl run curl --image=curlimages/curl --restart=Never --rm -it -- sh
If you don&apos;t see a command prompt, try pressing enter.
~ $ curl -k https://metrics-server.kube-system
{
  &quot;kind&quot;: &quot;Status&quot;,
  &quot;apiVersion&quot;: &quot;v1&quot;,
  &quot;metadata&quot;: {},
  &quot;status&quot;: &quot;Failure&quot;,
  &quot;message&quot;: &quot;forbidden: User \&quot;system:anonymous\&quot; cannot get path \&quot;/\&quot;&quot;,
  &quot;reason&quot;: &quot;Forbidden&quot;,
  &quot;details&quot;: {},
  &quot;code&quot;: 403
}~ $ exit
pod &quot;curl&quot; deleted
</code></pre>
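<p>The short name <code>metrics-server.kube-system</code> works because cluster DNS names every service as service.namespace, which the pod's DNS search list expands to the fully qualified form service.namespace.svc.cluster.local (assuming the default cluster domain, which is configurable). A sketch of the expansion:</p>

```shell
# Build the fully qualified DNS name for a service, assuming the
# default cluster domain "cluster.local" (the k3s default; configurable).
SERVICE=metrics-server
NAMESPACE=kube-system
echo "$SERVICE.$NAMESPACE.svc.cluster.local"
```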
<h2 id="loadbalancer">LoadBalancer</h2>
<p>We have tried connecting to the traefik LoadBalancer service in <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part2-networking/" rel="noreferrer">part2</a>.</p><pre><code class="language-bash">$ curl -k https://192.168.1.53
404 page not found

$ curl -k https://192.168.1.54
404 page not found
</code></pre>
<blockquote><strong>LoadBalancer</strong>: exposes the service externally using a load balancer. Traffic is routed to the pods of a cluster-internal service. The load balancer tracks the endpoints associated with the service and balances traffic accordingly.</blockquote><p>k3s uses a load balancer solution called <a href="https://docs.k3s.io/networking?ref=blog.kakarukeys.quest#service-load-balancer" rel="noreferrer">ServiceLB</a>. It exposes the underlying service at selected nodes on a defined port range.</p><p>It appears we would need DNS to round-robin the external traffic to these nodes to complete the load-balancing act, which is a hassle. On a managed Kubernetes cluster such as EKS or GKE, the load balancer is provisioned by the cloud provider and reachable at a stable address.</p><h2 id="nodeport">NodePort</h2>
<p>We don&apos;t have a NodePort service for now. <a href="https://helm.sh/docs/intro/install/?ref=blog.kakarukeys.quest#through-package-managers" rel="noreferrer">Install Helm</a>; we will use it to install an nginx server as a NodePort service.</p><blockquote><strong>Helm</strong> is a package manager for Kubernetes that makes deploying and managing Kubernetes applications easy. It automates the creation, packaging, configuration, and deployment of Kubernetes applications.</blockquote><blockquote>A <strong>Helm chart</strong> is a package of pre-configured Kubernetes resources. It comprises templates for the Kubernetes manifest files that are usually involved in deploying complicated distributed applications or services.</blockquote><p>Create a file <code>values.yaml</code> with the following values:</p><pre><code class="language-yaml">service:
  type: NodePort
  nodePorts:
    http: 30080
</code></pre>
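<p>The value 30080 is not arbitrary: by default the kube-apiserver only allocates NodePorts from the range 30000-32767 (the <code>--service-node-port-range</code> default), so a port outside that range would be rejected. A quick sanity check before committing a port to <code>values.yaml</code>:</p>

```shell
# Validate a candidate NodePort against the default allocation range.
# 30000-32767 is the kube-apiserver default; clusters can override it.
PORT=30080
if [ "$PORT" -ge 30000 ] && [ "$PORT" -le 32767 ]; then
  echo "ok: $PORT is inside the default NodePort range"
else
  echo "rejected: $PORT is outside 30000-32767"
fi
```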
<p>You can check the parameters on the <a href="https://artifacthub.io/packages/helm/bitnami/nginx?ref=blog.kakarukeys.quest#traffic-exposure-parameters" rel="noreferrer">nginx helm chart page</a>. Run the helm command as given on the page with an additional <code>-f values.yaml</code> flag to use the values we just created.</p><pre><code class="language-bash">$ helm install my-release \
      oci://registry-1.docker.io/bitnamicharts/nginx \
      -f values.yaml
$ kubectl get services
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes         ClusterIP      10.43.0.1       &lt;none&gt;        443/TCP        8d
my-release-nginx   NodePort       10.43.206.241   &lt;none&gt;        80:30080/TCP   9m30s
</code></pre>
<p>After that, test the service with</p><pre><code class="language-bash">$ curl http://192.168.1.53:30080
&lt;!DOCTYPE html&gt;
&lt;html&gt;
......
$ curl http://192.168.1.54:30080
&lt;!DOCTYPE html&gt;
&lt;html&gt;
......
</code></pre>
<blockquote><strong>NodePort</strong>: exposes the service on <strong>each Node&apos;s IP</strong> at a static port (the NodePort). A ClusterIP service, to which the NodePort service will route, is automatically created. This makes the service reachable from outside the cluster.</blockquote><p>NodePort does not perform any load balancing. It simply directs the traffic from whichever node the client connected to. This can cause some nodes to get overwhelmed with requests while other nodes remain idle. There is no mechanism in NodePort services to distribute load evenly across multiple nodes.</p><h2 id="externalname">ExternalName</h2>
<p>Create a file <code>service.yaml</code> with the following values:</p><pre><code class="language-yaml">apiVersion: v1
kind: Service 
metadata:
  name: my-google
  namespace: default
spec:
  type: ExternalName
  externalName: google.com
</code></pre>
<p>Run</p><pre><code class="language-bash">$ kubectl apply -f service.yaml
$ kubectl get services
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes         ClusterIP      10.43.0.1       &lt;none&gt;        443/TCP        8d
my-release-nginx   NodePort       10.43.206.241   &lt;none&gt;        80:30080/TCP   29m
my-google          ExternalName   &lt;none&gt;          google.com    &lt;none&gt;         77s
</code></pre>
<p>To test the service</p><pre><code class="language-bash">$ kubectl run curl --image=curlimages/curl \
      --restart=Never --rm -it -- \
      curl -k https://my-google
&lt;!DOCTYPE html&gt;
&lt;html lang=en&gt;
......
</code></pre>
<blockquote><strong>ExternalName</strong>: maps the service to the contents of the externalName field (e.g. foo.bar.example.com), by returning a CNAME record with its value. No proxying or load-balancing is set up. This allows external services to be referenced.</blockquote><p>To be continued in <a href="https://blog.kakarukeys.quest/learn-kubernetes-k3s-p5-helm/" rel="noreferrer">part 5</a>, more about Helm.</p>]]></content:encoded></item><item><title><![CDATA[Learn Kubernetes with k3s - part3 - kubectl]]></title><description><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part2-networking" rel="noreferrer">part 2</a>.</p><p><strong>kubectl</strong> is a command line tool that allows users to run commands against Kubernetes clusters. Specifically, kubectl enables communication between users and the control plane via Kubernetes API. So the mastery of kubectl is essential for handling real-world cluster operations and troubleshooting.</p>]]></description><link>https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part3-kubectl/</link><guid isPermaLink="false">65f536bc69b40300070d4e61</guid><category><![CDATA[k8s]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Sat, 16 Mar 2024 14:19:21 GMT</pubDate><content:encoded><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part2-networking" rel="noreferrer">part 2</a>.</p><p><strong>kubectl</strong> is a command line tool that allows users to run commands against Kubernetes clusters. Specifically, kubectl enables communication between users and the control plane via Kubernetes API. So the mastery of kubectl is essential for handling real-world cluster operations and troubleshooting. </p><p>Below is a table showing the common kubectl commands.</p>
<!--kg-card-begin: html-->
<table>
<thead>
  <tr>
    <th>Operation</th>
    <th>Subcommand</th>
    <th>Example</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>configuring access</td>
    <td>kubectl config</td>
    <td>set-cluster my-cluster</td>
  </tr>
  <tr>
    <td>listing and reading resources</td>
    <td>kubectl get<br>kubectl describe<br>kubectl logs</td>
    <td>pods -A<br>pod/abc<br>pod/abc</td>
  </tr>
  <tr>
    <td>creating resources</td>
    <td>kubectl create<br>kubectl apply</td>
    <td>namespace abc<br>-f my-manifest.yaml</td>
  </tr>
  <tr>
    <td>updating resources</td>
    <td>kubectl edit<br>kubectl replace<br>kubectl apply<br>kubectl scale</td>
    <td>svc/docker-registry<br>-f nginx-deployment.yaml<br>-f my-manifest.yaml<br>--replicas=3 rs/foo</td>
  </tr>
  <tr>
    <td>deleting resources</td>
    <td>kubectl delete</td>
    <td>pod unwanted --now</td>
  </tr>
  <tr>
    <td>running pods</td>
    <td>kubectl run<br>kubectl attach<br>kubectl exec</td>
    <td>nginx --image=nginx<br>my-pod -it<br>my-pod -- ls /</td>
  </tr>
  <tr>
    <td>connecting to resources</td>
    <td>kubectl port-forward<br>kubectl cp</td>
    <td>svc/my-service 5000<br>my-pod:/foo bar</td>
  </tr>
  <tr>
    <td>check resource usage<br>(requires metrics server)</td>
    <td>kubectl top</td>
    <td>nodes</td>
  </tr>
</tbody>
</table>

<!--kg-card-end: html-->
<h2 id="explanation">Explanation</h2>
<p>No doubt there is a lot to digest. I will explain some of the concepts involved here, and leave the rest for you to research on your own.</p><h3 id="namespace">Namespace</h3>
<p>A Kubernetes <strong>namespace</strong> provides a scope for names of resources within a Kubernetes cluster. It allows the isolation of groups of resources and management of cluster components like Pods, Services, and Deployments.</p><p>Namespaces are intended for use in environments with many users spread across multiple teams or projects. They prevent name collisions that can occur when different teams are using resources with the same name. Resources within one namespace are not visible to other namespaces unless steps are explicitly taken to allow that.</p><p>To get resources from all namespaces</p><pre><code class="language-bash">$ kubectl get pods -A
</code></pre>
<h3 id="pod">Pod</h3>
<p>A Kubernetes <strong>pod</strong> is the basic building block of applications in Kubernetes. It is a group of one or more containers (such as Docker containers), with shared storage/network resources, and a specification for how to run the containers.</p><p>Pods allow containers to run together on the same machine and share resources. This means containers within a pod can communicate easily using localhost.</p><h3 id="service">Service</h3>
<p>A Kubernetes <strong>service</strong> is an abstraction layer that defines a logical set of pods and enables external traffic exposure and load balancing to those pods, regardless of where on the cluster they are scheduled.</p><p>Services define rules about how groups of pods can be accessed (e.g. randomly, via round-robin load balancing). This provides a single IP address or DNS name endpoint for reaching multiple pods.</p><p>There are different types of services that support different methods of service discovery and load balancing: ClusterIP, NodePort, LoadBalancer.</p><h3 id="deployment">Deployment</h3>
<p>A Kubernetes <strong>deployment</strong> is an object that manages pods and replica sets in Kubernetes. It provides declarative updates for pods and replica sets.</p><p>Deployments are responsible for creating and updating pods based on the provided configuration, and support rolling out new updates or rolling back to previous revisions.</p><p>They provide self-healing capabilities by automatically replacing pods that fail or are terminated. This ensures the application is always available.</p><h2 id="aliases">Aliases</h2>
<p>You can set aliases in <code>~/.zshrc</code> or <code>~/.bashrc</code> to make typing kubectl commands a little faster. This is convenient for day-to-day work and for passing the CKA certification exam.</p>
<p>For example:</p>
<pre><code class="language-bash">alias kc=&apos;kubectl&apos;
alias kns=&apos;kubectl config set-context --current --namespace&apos;
alias kbb=&apos;kubectl run busybox --image=busybox --restart=Never -it --&apos;
</code></pre>
<p>So that we can do</p>
<pre><code class="language-bash">$ kc get pods      # get all pods
$ kns my-namespace # set default namespace
$ kbb ls -la       # run ad hoc linux commands
</code></pre>
<h2 id="infrastructure-as-code">Infrastructure as Code</h2>
<p>Do we need to memorize all kubectl commands out there? Not necessarily. In fact, you are almost never going to use the resource creation/update/delete commands.</p><p>The best practice is to manage and provision resources through code, not imperative commands. This is known as <strong>infrastructure as code</strong> (IaC). </p><p>IaC allows infrastructure to be treated as software which can be developed, tested, and reviewed like any other code. Creating Kubernetes resources with <code>kubectl create</code> or <code>kubectl apply</code> goes against the spirit of IaC. </p><p>So you can think of certain kubectl commands as relics of the past, although the CKA certification exam tests for them. Still, it&apos;s best to gain some exposure; who knows, you might need them for troubleshooting one day.</p><p>To be continued in <a href="https://blog.kakarukeys.quest/learn-kubernetes-k3s-part4-services/" rel="noreferrer">part 4</a>, about Kubernetes services.</p>
              &quot;--write-kubeconfig-mode 644 &quot;
             f&quot;--node-external-ip={server_ip} &quot;
              &quot;--flannel-external-ip &quot;
              &quot;</code></pre>]]></description><link>https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part2-networking/</link><guid isPermaLink="false">65edb7ed69b40300070d4d6f</guid><category><![CDATA[k8s]]></category><category><![CDATA[Linux administration]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Wed, 13 Mar 2024 12:39:15 GMT</pubDate><content:encoded><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/learn-kubernetes-k3s-p1-install/" rel="noreferrer">part 1</a>.</p><p>Let us look closely at the commands that we have used to install k3s.</p><pre><code class="language-python">master1.sudo( &quot;curl -sfL https://get.k3s.io | sh -s - server &quot;
              &quot;--write-kubeconfig-mode 644 &quot;
             f&quot;--node-external-ip={server_ip} &quot;
              &quot;--flannel-external-ip &quot;
              &quot;--flannel-backend=wireguard-native &quot;
             f&quot;--tls-san {server_external_hostname}&quot;)
</code></pre>
<p>The first flag, <code>--write-kubeconfig-mode 644</code> tells the process to produce a <code>kubeconfig</code> file in <code>644</code> mode, so that we can read it and copy it to our workstation to make the first connection to the cluster.</p>
<blockquote>
<p>A <code>kubeconfig</code> file is used to configure access to Kubernetes clusters. It contains information such as authentication details, clusters, users and namespaces.</p>
</blockquote>
<p>Next,</p>
<pre><code class="language-python">f&quot;--node-external-ip={server_ip} &quot;
 &quot;--flannel-external-ip &quot;
 &quot;--flannel-backend=wireguard-native &quot;
</code></pre>
<p>These flags set the node external IP to the server&apos;s static IP (we&apos;ve set that to <code>192.168.1.53</code>) and configure the networking plugin (flannel) to use that for routing.</p>
<p>There is a corresponding flag in the worker node command:</p>
<pre><code class="language-python">f&quot;--node-external-ip={cxn.host}&quot;
</code></pre>
<p>To test whether the cluster networking is working, run the following commands:</p>
<pre><code class="language-bash">$ curl -k https://192.168.1.53
404 page not found

$ curl -k https://192.168.1.54
404 page not found
</code></pre>
<p>You should see a 404 response for each.</p><p>This is because there is a traefik reverse-proxy service running on ports 80 and 443 of every node, installed as part of the cluster setup. We have not yet defined any routing rules, hence the 404 errors.</p><p>Run this command to see all services running on the cluster:</p><pre><code class="language-bash">$ kubectl get service -A
NAMESPACE     NAME             TYPE           CLUSTER-IP    EXTERNAL-IP                              PORT(S)                      AGE
default       kubernetes       ClusterIP      10.43.0.1     &lt;none&gt;                                   443/TCP                      71m
kube-system   kube-dns         ClusterIP      10.43.0.10    &lt;none&gt;                                   53/UDP,53/TCP,9153/TCP       71m
kube-system   metrics-server   ClusterIP      10.43.22.2    &lt;none&gt;                                   443/TCP                      71m
kube-system   traefik          LoadBalancer   10.43.158.9   192.168.1.53,192.168.1.54                80:30824/TCP,443:30872/TCP   71m
</code></pre>
<p>Traefik is in the last row of the table.</p><p>If you did not get a reply for any curl command, it means the networking has failed. This was the case when I used the default installation command from the <a href="https://docs.k3s.io/quick-start?ref=blog.kakarukeys.quest" rel="noreferrer">official quick-start guide</a>.</p><p>I had to reinstall k3s with the above-mentioned flags to make networking work.</p><p>Note the special flag <code>--flannel-backend=wireguard-native</code></p><blockquote>Use WireGuard to encapsulate and encrypt network traffic. May require additional kernel modules.</blockquote><p>Sometimes the default installation command is not able to figure out which networking interface should be used for traffic routing. We point it at the right interface using these explicit flags.</p><p>For your curiosity, here are <a href="https://docs.k3s.io/installation/network-options?ref=blog.kakarukeys.quest#flannel-options" rel="noreferrer">all the different options for flannel</a>.</p><p>To fiddle with these, we need a way to uninstall and reinstall k3s on all nodes. Add the following fabric tasks:</p><pre><code class="language-python">@task
def uninstall_agents(c):
    ThreadingGroup(*all_workers).sudo(&quot;/usr/local/bin/k3s-agent-uninstall.sh&quot;)


@task
def uninstall_server(c):
    Connection(&quot;k3s-master1&quot;).sudo(&quot;/usr/local/bin/k3s-uninstall.sh&quot;)
</code></pre>
<pre><code class="language-bash">$ fab uninstall-agents
$ fab uninstall-server
</code></pre>
<p>You can then use these tasks to test which networking options work for your setup.</p><p>Read this in-depth article &quot;<a href="https://medium.com/itnext/deciphering-the-kubernetes-networking-maze-navigating-load-balance-bgp-ipvs-and-beyond-7123ef428572?ref=blog.kakarukeys.quest" rel="noreferrer">Deciphering the Kubernetes Networking Maze: Navigating Load-Balance, BGP, IPVS and Beyond</a>&quot; to understand more about Kubernetes networking.</p><p>The next flag, <code>--tls-san {server_external_hostname}</code> tells the process to generate a TLS certificate with <code>server_external_hostname</code> (<code>example.com</code>) as a SAN, so that we can access the cluster through the external hostname for convenience.</p><p>Next,</p>
<pre><code class="language-python">master1.get(&quot;/etc/rancher/k3s/k3s.yaml&quot;, local=&quot;kubeconfig&quot;)

subprocess.run([
    &quot;sed&quot;, &quot;-i&quot;, f&quot;s/127.0.0.1/{server_external_hostname}/g&quot;, &quot;kubeconfig&quot;
])
subprocess.run([&quot;chmod&quot;, &quot;600&quot;, &quot;kubeconfig&quot;])
</code></pre>
<p>These lines copy <code>kubeconfig</code> to the workstation, replace the loopback address with the external hostname, and restrict the file mode to <code>600</code>.</p>
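<p>As an aside, shelling out to <code>sed -i</code> is not portable (the BSD variant on macOS expects a suffix argument). A pure-Python equivalent of the two subprocess calls could look like this (a sketch; the helper name is mine, not part of the original fabfile):</p>

```python
from pathlib import Path

def rewrite_kubeconfig(path: str, hostname: str) -> None:
    """Point kubeconfig at the external hostname and restrict its permissions."""
    cfg = Path(path)
    # Equivalent of: sed -i "s/127.0.0.1/{hostname}/g" kubeconfig
    cfg.write_text(cfg.read_text().replace("127.0.0.1", hostname))
    # Equivalent of: chmod 600 kubeconfig
    cfg.chmod(0o600)
```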
<p>And finally the installation command at the worker node,</p><pre><code class="language-python">node_token = master1.sudo(
    &quot;cat /var/lib/rancher/k3s/server/node-token&quot;, 
    hide=True
).stdout.strip()
for cxn in ThreadingGroup(*all_workers):        
    cxn.sudo( &quot;curl -sfL https://get.k3s.io | &quot;
             f&quot;K3S_URL=https://{server_ip}:6443 &quot;
             f&quot;K3S_TOKEN={node_token} &quot;
......
</code></pre>
<p>We recover the node token from the server node, and use it to configure the worker node, as instructed in <a href="https://docs.k3s.io/quick-start?ref=blog.kakarukeys.quest">the official quick-start guide</a>.</p>
<p>To be continued in <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part3-kubectl/">part 3</a>, about <code>kubectl</code>.</p>
]]></content:encoded></item><item><title><![CDATA[Learn Kubernetes with k3s - part1 - Installation]]></title><description><![CDATA[<p>In 2024, as far as learning Kubernetes is concerned we have a myriad of tools at our disposal:</p><ol><li>mini distributions such as <a href="https://minikube.sigs.k8s.io/docs/?ref=blog.kakarukeys.quest" rel="noreferrer">minikube</a>, <a href="https://kind.sigs.k8s.io/?ref=blog.kakarukeys.quest" rel="noreferrer">kind</a>, <a href="https://k0sproject.io/?ref=blog.kakarukeys.quest" rel="noreferrer">k0s</a>, <a href="https://k3s.io/?ref=blog.kakarukeys.quest" rel="noreferrer">k3s</a>, etc.</li><li>Q&amp;A AI: <a href="https://chat.openai.com/?ref=blog.kakarukeys.quest" rel="noreferrer">ChatGPT</a>, <a href="https://gemini.google.com/app?ref=blog.kakarukeys.quest" rel="noreferrer">Gemini</a>, <a href="https://chat.mistral.ai/?ref=blog.kakarukeys.quest" rel="noreferrer">Mistral</a>, etc.</li></ol><p>It makes me wonder if one can learn Kubernetes administration and application development in <strong>under</strong></p>]]></description><link>https://blog.kakarukeys.quest/learn-kubernetes-k3s-p1-install/</link><guid isPermaLink="false">65e71ec869b40300070d4c11</guid><category><![CDATA[k8s]]></category><category><![CDATA[Linux administration]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Fri, 08 Mar 2024 13:56:55 GMT</pubDate><content:encoded><![CDATA[<p>In 2024, as far as learning Kubernetes is concerned we have a myriad of tools at our disposal:</p><ol><li>mini distributions such as <a href="https://minikube.sigs.k8s.io/docs/?ref=blog.kakarukeys.quest" rel="noreferrer">minikube</a>, <a href="https://kind.sigs.k8s.io/?ref=blog.kakarukeys.quest" rel="noreferrer">kind</a>, <a href="https://k0sproject.io/?ref=blog.kakarukeys.quest" rel="noreferrer">k0s</a>, <a href="https://k3s.io/?ref=blog.kakarukeys.quest" rel="noreferrer">k3s</a>, etc.</li><li>Q&amp;A AI: <a href="https://chat.openai.com/?ref=blog.kakarukeys.quest" 
rel="noreferrer">ChatGPT</a>, <a href="https://gemini.google.com/app?ref=blog.kakarukeys.quest" rel="noreferrer">Gemini</a>, <a href="https://chat.mistral.ai/?ref=blog.kakarukeys.quest" rel="noreferrer">Mistral</a>, etc.</li></ol><p>It makes me wonder if one can learn Kubernetes administration and application development in <strong>under 1 week. </strong>I did it in 3 months many years ago.</p><p>There is a systematic approach I have adopted successfully for many learning tasks: </p><blockquote>Hack a mini project and ask AI models the right questions.</blockquote><p>I believe that, done right, it can help you quickly grasp Kubernetes&apos; core concepts, setting you on a path to becoming a proficient user.</p><p>Let&apos;s get started hacking with k3s.</p><h2 id="k3s">K3s</h2>
<p>K3s is a lightweight Kubernetes distribution that aims to simplify deploying and managing Kubernetes clusters. It is developed by <strong>Rancher Labs</strong> and is certified by the Cloud Native Computing Foundation (CNCF) as being fully compatible with Kubernetes.</p><p>K3s packages Kubernetes into a single binary file that is only around 45MB in size, much smaller than a full Kubernetes installation. It is designed for running Kubernetes on low-resource environments like edge/IoT devices.</p><p>Even though it is lightweight, k3s maintains full Kubernetes functionality and is interoperable with tools that work with regular Kubernetes, making it an excellent choice for hands-on learning and experimentation.</p><h2 id="getting-the-servers">Getting the Servers</h2>
<p>There are many options; I leave it to you to decide.</p><ol><li>Get instances from a cloud provider</li><li><a href="https://medium.com/thinkport/how-to-build-a-raspberry-pi-kubernetes-cluster-with-k3s-76224788576c?ref=blog.kakarukeys.quest" rel="noreferrer">Set up Raspberry Pis at home</a></li><li>Install guest OSes with <a href="https://www.vagrantup.com/?ref=blog.kakarukeys.quest" rel="noreferrer">Vagrant / Virtualbox</a></li><li>Or (if you happen to use a Linux desktop) use your OS directly</li></ol><p>I chose option 3 and installed two guests on two home PCs separately, one as the server and the other as the agent (worker node). Both host machines are on the same network.</p><p>The <code>Vagrantfile</code> I used looks like this:</p><pre><code class="language-ruby">Vagrant.configure(&quot;2&quot;) do |config|
  config.vm.hostname = &quot;k3s-master1&quot;
  config.vm.box = &quot;generic/ubuntu2204&quot;
  config.vm.network &quot;public_network&quot;, ip: &quot;192.168.1.53&quot;, bridge: &quot;en0: Ethernet&quot;

  config.vm.provider &quot;virtualbox&quot; do |v|
    v.memory = 8000
    v.cpus = 2
  end
end
</code></pre>
<p>I omit the <code>Vagrantfile</code> for the worker node as it is only slightly different. Note that I have configured static IPs for both nodes.</p><p>To spin the guest nodes up, run this command on both hosts</p><pre><code class="language-bash">$ vagrant up
</code></pre>
<h2 id="install-k3s">Install K3s</h2>
<p>At this point, I have two vagrant nodes up. To install k3s, I use <a href="https://www.fabfile.org/?ref=blog.kakarukeys.quest" rel="noreferrer">Fabric</a>, an automation tool, to write the <a href="https://docs.k3s.io/quick-start?ref=blog.kakarukeys.quest" rel="noreferrer">necessary installation tasks</a> in a <code>fabfile.py</code>.</p><p>First some settings:</p><p>SSH config:</p>
<pre><code class="language-bash">$ cat ~/.ssh/config 
Host k3s-master1
	User vagrant
	HostName 192.168.1.53

Host k3s-worker1
	User vagrant
	HostName 192.168.1.54
</code></pre>
<p><code>fabfile.py</code> opening:</p>
<pre><code class="language-python">import subprocess

from fabric import ThreadingGroup, Connection, task
from fabric.exceptions import GroupException

ips = {
    &quot;k3s-master1&quot;: &quot;192.168.1.53&quot;,
    &quot;k3s-worker1&quot;: &quot;192.168.1.54&quot;,
}
server_ip = ips[&quot;k3s-master1&quot;]
server_external_hostname = &quot;example.com&quot;
all_workers = [host for host in ips if &quot;worker&quot; in host]
</code></pre>
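<p>Note the last line: <code>all_workers</code> is derived by filtering the hostnames in <code>ips</code>, so the fabfile scales to more nodes just by extending the dict. A quick illustration (the second worker is hypothetical):</p>

```python
# Inventory with one extra worker added to the original dict
ips = {
    "k3s-master1": "192.168.1.53",
    "k3s-worker1": "192.168.1.54",
    "k3s-worker2": "192.168.1.55",  # hypothetical new node
}

# Same comprehension as in the fabfile: keep only the worker hostnames
all_workers = [host for host in ips if "worker" in host]
print(all_workers)  # ['k3s-worker1', 'k3s-worker2']
```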
<p><code>example.com</code> is a domain I registered for the k3s API server so that I can use it to access my cluster over the internet. I have set a DNS A record pointing it to the public IP address of my router.</p><p>Then the task to upgrade the OS on both nodes (optional)</p><pre><code class="language-python">@task
def upgrade_os(c):
    all_hosts = ThreadingGroup(*ips.keys())
    all_hosts.sudo(&quot;apt update -y&quot;)
    all_hosts.sudo(&quot;apt upgrade -y&quot;)

    try:
        all_hosts.sudo(&quot;reboot&quot;)
    except GroupException:
        pass
</code></pre>
<p>Run this command to execute the task</p><pre><code class="language-bash">$ fab upgrade-os
</code></pre>
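<p>A side note on the <code>try/except</code> in the task above: <code>reboot</code> drops the SSH connection, so Fabric raises a <code>GroupException</code>, which the task deliberately swallows. If you want to block until the nodes are reachable again before the next task, a small helper (an illustrative sketch, not part of the original fabfile) can poll the SSH port:</p>

```python
import socket
import time

def wait_for_ssh(host: str, port: int = 22, timeout: float = 300.0) -> bool:
    """Poll until the host accepts TCP connections on the SSH port, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return True  # port is open again
        except OSError:
            time.sleep(2)  # connection refused or timed out; retry shortly
    return False
```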
<p>Then the tasks to install k3s, first on the server node, then on the worker node</p><pre><code class="language-python">@task
def install_server(c):
    master1 = Connection(&quot;k3s-master1&quot;)

    master1.sudo( &quot;curl -sfL https://get.k3s.io | sh -s - server &quot;
                  &quot;--write-kubeconfig-mode 644 &quot;
                 f&quot;--node-external-ip={server_ip} &quot;
                  &quot;--flannel-external-ip &quot;
                  &quot;--flannel-backend=wireguard-native &quot;
                 f&quot;--tls-san {server_external_hostname}&quot;)

    master1.get(&quot;/etc/rancher/k3s/k3s.yaml&quot;, local=&quot;kubeconfig&quot;)

    subprocess.run([
        &quot;sed&quot;, &quot;-i&quot;, f&quot;s/127.0.0.1/{server_external_hostname}/g&quot;, &quot;kubeconfig&quot;
    ])
    subprocess.run([&quot;chmod&quot;, &quot;600&quot;, &quot;kubeconfig&quot;])


@task
def install_agents(c):
    master1 = Connection(&quot;k3s-master1&quot;)

    node_token = master1.sudo(
        &quot;cat /var/lib/rancher/k3s/server/node-token&quot;, 
        hide=True
    ).stdout.strip()

    for cxn in ThreadingGroup(*all_workers):        
        cxn.sudo( &quot;curl -sfL https://get.k3s.io | &quot;
                 f&quot;K3S_URL=https://{server_ip}:6443 &quot;
                 f&quot;K3S_TOKEN={node_token} &quot;
                  &quot;sh -s - agent &quot;
                 f&quot;--node-external-ip={cxn.host}&quot;)
</code></pre>
<p>Run these commands to execute the tasks</p><pre><code class="language-bash">$ fab install-server
$ fab install-agents
</code></pre>
<h2 id="test-the-k3s-installation">Test the k3s Installation</h2>
<pre><code class="language-bash">$ mv kubeconfig ~/.kube/config
</code></pre>
<p>Note: the <code>mv</code> command will overwrite your existing <code>config</code> file.</p>
<p>Install the <a href="https://kubernetes.io/docs/tasks/tools/?ref=blog.kakarukeys.quest">kubectl command</a>, enable port forwarding from your home router to the server node on port 6443, then run</p>
<pre><code class="language-bash">$ kubectl get nodes
NAME          STATUS   ROLES                  AGE   VERSION
k3s-master1   Ready    control-plane,master   67m   v1.28.7+k3s1
k3s-worker1   Ready    &lt;none&gt;                 63m   v1.28.7+k3s1
</code></pre>
<p>You should see both nodes in Ready state.</p>
<p>Check all pods are running fine</p><pre><code class="language-bash">$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS       AGE
kube-system   helm-install-traefik-8frcc                0/1     Completed   2              12h
kube-system   helm-install-traefik-crd-gbnjw            0/1     Completed   0              12h
kube-system   svclb-traefik-5d5abbc4-95jnq              2/2     Running     2 (10h ago)    12h
kube-system   traefik-f4564c4f4-hxws8                   1/1     Running     1 (10h ago)    12h
kube-system   coredns-6799fbcd5-zvmtd                   1/1     Running     1 (10h ago)    12h
kube-system   local-path-provisioner-6c86858495-gbnjx   1/1     Running     2 (10h ago)    12h
kube-system   metrics-server-78bcc4784b-gv7kp           1/1     Running     2 (10h ago)    12h
kube-system   svclb-traefik-5d5abbc4-sfww6              2/2     Running     6 (4m8s ago)   11h
kube-system   svclb-traefik-5d5abbc4-7bqn9              2/2     Running     0              7h59m
</code></pre>
<p>Try creating a pod</p><pre><code class="language-bash">$ kubectl run busybox --image=busybox --restart=Never -it -- uname -r
5.15.0-100-generic
</code></pre>
<p>We now have a working k3s cluster!</p><p>To be continued in <a href="https://blog.kakarukeys.quest/learn-kubernetes-with-k3s-part2-networking/" rel="noreferrer">part 2</a>, networking.</p>]]></content:encoded></item><item><title><![CDATA[GitHub Actions Amazon ECR Integration - part2]]></title><description><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/github-actions-amazon-elastic-container-registry-p1/" rel="noreferrer">part 1</a>.</p><p>In this post, I will show how to use a GitHub Actions workflow to automatically build and publish docker images to an ECR repository on a push or pull request merge.</p>
<p>We need a dummy project for illustration. I have <a href="https://python-poetry.org/docs/basic-usage/?ref=blog.kakarukeys.quest">bootstrapped my</a></p>]]></description><link>https://blog.kakarukeys.quest/github-actions-amazon-elastic-container-registry-p2/</link><guid isPermaLink="false">65d0a99c69b40300070d4983</guid><category><![CDATA[AWS DevOps]]></category><category><![CDATA[Github Actions]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Sun, 18 Feb 2024 12:45:00 GMT</pubDate><content:encoded><![CDATA[<p>This post is the continuation of <a href="https://blog.kakarukeys.quest/github-actions-amazon-elastic-container-registry-p1/" rel="noreferrer">part 1</a>.</p><p>In this post, I will show how to use a GitHub Actions workflow to automatically build and publish docker images to an ECR repository on a push or pull request merge.</p>
<p>We need a dummy project for illustration. I have <a href="https://python-poetry.org/docs/basic-usage/?ref=blog.kakarukeys.quest">bootstrapped my project using poetry</a>, my favourite python build tool.</p>
<p>The project structure looks like</p>
<pre><code class="language-bash">$ tree -I &quot;.git|.terraform|terraform|tests|.pytest_cache&quot; -a
.
&#x251C;&#x2500;&#x2500; Dockerfile
&#x251C;&#x2500;&#x2500; .github
&#x2502;   &#x2514;&#x2500;&#x2500; workflows
&#x2502;       &#x2514;&#x2500;&#x2500; ci.yml
&#x251C;&#x2500;&#x2500; .gitignore
&#x251C;&#x2500;&#x2500; poetry.lock
&#x251C;&#x2500;&#x2500; pyproject.toml
&#x251C;&#x2500;&#x2500; .python-version
&#x2514;&#x2500;&#x2500; src
    &#x2514;&#x2500;&#x2500; handler.py
</code></pre>
<p>The project definition metadata, <code>pyproject.toml</code>:</p>
<pre><code class="language-toml">[tool.poetry]
name = &quot;learn-gha-ecr&quot;
version = &quot;0.1.0&quot;
description = &quot;&quot;

[tool.poetry.dependencies]
python = &quot;^3.11&quot;
boto3 = &quot;^1.34.11&quot;
jsonpickle = &quot;^3.0.2&quot;

[tool.poetry.group.dev.dependencies]
pytest = &quot;^8.0.0&quot;

[build-system]
requires = [&quot;poetry-core&quot;]
build-backend = &quot;poetry.core.masonry.api&quot;
</code></pre>
<p>The lambda handler, <code>handler.py</code>:</p>
<pre><code class="language-python">import os
import logging
import jsonpickle
import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

client = boto3.client(&apos;lambda&apos;)

def lambda_handler(event, context):
    logger.info(&apos;## ENVIRONMENT VARIABLES\r&apos; + jsonpickle.encode(dict(**os.environ)))
    logger.info(&apos;## EVENT\r&apos; + jsonpickle.encode(event))
    logger.info(&apos;## CONTEXT\r&apos; + jsonpickle.encode(context))

    response = client.get_account_settings()
    logger.info(response[&apos;AccountUsage&apos;])
</code></pre>
<p>At the moment the project isn&apos;t doing any real work, except logging some information when the lambda function executes. I proceed to write the GitHub Actions workflow file for the project, <code>.github/workflows/ci.yml</code>.</p><p>On any push to the <code>main</code> branch, checking out the code,</p>
<pre><code class="language-yaml">name: Lambda Application CI workflow

on:
  push:
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-latest

    permissions:
      id-token: write
      contents: read

    env:
      AWS_REGION: &lt;region&gt;
      AWS_ACCOUNT_ID: &lt;account-id&gt;
      BUILD_ROLE_NAME: GHA-Build-Role

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
</code></pre>
<p>Take note of the environment variable <code>BUILD_ROLE_NAME</code>: this is the name of the IAM role I created in <a href="https://blog.kakarukeys.quest/github-actions-amazon-elastic-container-registry-p1/">part 1</a>.</p>
<p>Installing Python and the Poetry build tool,</p>
<pre><code class="language-yaml">- name: Read Python version
  id: read-python-version
  run: printf &quot;version=$(cat .python-version)\n&quot; &gt;&gt; &quot;$GITHUB_OUTPUT&quot;

- name: Install Python
  id: install-python
  uses: actions/setup-python@v5
  with:
    python-version: &quot;${{ steps.read-python-version.outputs.version }}&quot;

- name: Install Poetry
  uses: snok/install-poetry@v1
  with:
    version: 1.7.1
    virtualenvs-in-project: true
</code></pre>
<p>Installing the project dependencies,</p>
<pre><code class="language-yaml">- name: Restore Python virtual environment from cache
  uses: actions/cache@v4
  with:
    path: .venv
    key: venv-${{ runner.os }}-${{ steps.install-python.outputs.python-version }}-${{ hashFiles(&apos;poetry.lock&apos;) }}

- name: Install project dependencies
  run: |
    poetry install --no-interaction --no-root
    poetry self add poetry-plugin-export
    poetry export --without-hashes --without dev -f requirements.txt &gt; requirements.txt
</code></pre>
<p>Here we use the caching function of GitHub Actions to speed things up. The cache is invalidated only if <code>poetry.lock</code>&apos;s content or the Python version has changed.</p>
<p>We also export the project dependencies to a <code>requirements.txt</code> file for Docker to use later.</p>
<p>Running unit tests,</p>
<pre><code class="language-yaml">- name: Run tests
  run: |
    poetry run pytest
</code></pre>
<p>The first run will break, as there aren&apos;t any tests written yet. You may want to add some dummy tests to fix the build failure.</p>
<p>Building and publishing docker images</p>
<pre><code class="language-yaml">- name: Get temporary credentials from AWS STS
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-region: ${{ env.AWS_REGION }}
    role-to-assume: &quot;arn:aws:iam::${{ env.AWS_ACCOUNT_ID }}:role/${{ env.BUILD_ROLE_NAME }}&quot;
    role-session-name: run-${{ github.run_id }}

- name: Login to Amazon ECR
  id: login-ecr
  uses: aws-actions/amazon-ecr-login@v2

- name: Build and publish docker image
  id: build-publish-docker
  env:
    DOCKER_REGISTRY_ENDPOINT: ${{ steps.login-ecr.outputs.registry }}
  run: |
    version=`poetry version`
    short_version=`poetry version --short`
    IMAGE_NAME=${version% $short_version}
    IMAGE_TAG=$short_version-build$GITHUB_RUN_NUMBER
    docker build --tag $DOCKER_REGISTRY_ENDPOINT/$IMAGE_NAME:$IMAGE_TAG .
    docker push --all-tags $DOCKER_REGISTRY_ENDPOINT/$IMAGE_NAME
    printf &quot;IMAGE_NAME=$IMAGE_NAME\nIMAGE_TAG=$IMAGE_TAG\n&quot; &gt;&gt; &quot;$GITHUB_OUTPUT&quot;
</code></pre>
<p>It is important here that you have created the necessary AWS resources with Terraform; otherwise these steps will fail.</p>
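<p>The shell in the build step deserves a note: <code>poetry version</code> prints e.g. <code>learn-gha-ecr 0.1.0</code>, and the parameter expansion <code>${version% $short_version}</code> strips the trailing version number, leaving the package name as the image name. A Python rendition of the same name/tag derivation (illustrative only; the function name is mine):</p>

```python
def image_coordinates(version_line: str, run_number: int) -> tuple[str, str]:
    """Split `poetry version` output into an image name and a build-stamped tag."""
    name, short_version = version_line.rsplit(" ", 1)
    return name, f"{short_version}-build{run_number}"

# e.g. for GitHub run number 42
print(image_coordinates("learn-gha-ecr 0.1.0", 42))  # ('learn-gha-ecr', '0.1.0-build42')
```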
<p>Post-build steps to tag the GitHub repo on a successful build.</p>
<pre><code class="language-yaml">    outputs:
      IMAGE_NAME: ${{ steps.build-publish-docker.outputs.IMAGE_NAME }}
      IMAGE_TAG: ${{ steps.build-publish-docker.outputs.IMAGE_TAG }}

  post-build:
    needs: build
    runs-on: ubuntu-latest

    permissions:
      contents: write

    env:
      IMAGE_NAME: &quot;${{ needs.build.outputs.IMAGE_NAME }}&quot;
      IMAGE_TAG: &quot;${{ needs.build.outputs.IMAGE_TAG }}&quot;

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Tag Github repo
        run: |
          GIT_TAG=&quot;v${{ env.IMAGE_TAG }}&quot;
          git config --global user.email &quot;github-action@users.noreply.github.com&quot;
          git config --global user.name &quot;Github Action&quot;
          git tag -a &quot;$GIT_TAG&quot; -m &quot;tagged by GHA $GIT_TAG&quot;
          git push origin &quot;$GIT_TAG&quot;
</code></pre>
<p>The post-build job is separate because it requires elevated permissions.</p>
<p>Bonus: trigger a deployment using a workflow in another GitHub repo. You need a <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens?ref=blog.kakarukeys.quest">personal access token</a> for this step to work.</p>
<pre><code class="language-yaml">- name: Invoke CD workflow
  uses: actions/github-script@v7
  with:
    github-token: ${{ secrets.PAT_WORKFLOW_DISPATCH }}
    script: |
      github.rest.actions.createWorkflowDispatch({
        owner: context.repo.owner,
        repo: &apos;infra_setup&apos;,
        ref: &apos;main&apos;,
        workflow_id: &apos;cd.yml&apos;,
        inputs: {
          env: &apos;dev&apos;,
          images: &apos;{&quot;${{ env.IMAGE_NAME }}&quot;: &quot;${{ env.IMAGE_TAG }}&quot;}&apos;
        }
      })
</code></pre>
<p>Once this is done, commit and push the changes to <code>main</code> branch to trigger the workflow. Enjoy coding.</p>]]></content:encoded></item><item><title><![CDATA[GitHub Actions Amazon ECR Integration - part1]]></title><description><![CDATA[<p>How do you publish docker images to an Amazon ECR repository from a GitHub Actions workflow? (<strong>without using any long-lived tokens</strong>)</p><p>GitHub Actions (GHA) is commonly used as a continuous integration tool, and docker image is a common format for distributing and deploying container-based software. AWS is a popular IaaS</p>]]></description><link>https://blog.kakarukeys.quest/github-actions-amazon-elastic-container-registry-p1/</link><guid isPermaLink="false">65d0aaad69b40300070d4999</guid><category><![CDATA[AWS DevOps]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[Github Actions]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Sat, 17 Feb 2024 12:47:08 GMT</pubDate><content:encoded><![CDATA[<p>How do you publish docker images to an Amazon ECR repository from a GitHub Actions workflow? (<strong>without using any long-lived tokens</strong>)</p><p>GitHub Actions (GHA) is commonly used as a continuous integration tool, and docker image is a common format for distributing and deploying container-based software. AWS is a popular IaaS platform that offers a pay-on-use container registry service.</p><p>If you use all three, it&apos;s important to know how to set up your AWS environment in a secured way to allow seamless publishing of docker images. 
You need</p><ol><li>an Amazon ECR repository</li><li>a GitHub OpenID Connect provider set up in IAM</li><li>an assumable IAM role with</li><li>a policy that allows login from GHA and docker image push to the ECR repository.</li></ol><p>I will be using Terraform, my main IaC tool, and <a href="https://registry.terraform.io/namespaces/terraform-aws-modules?ref=blog.kakarukeys.quest" rel="noreferrer">Terraform AWS modules</a> to make my code succinct and maintainable.</p><hr><h2 id="the-ecr-repo">The ECR repo</h2>
<p>Assuming the docker images will be used in AWS Lambda functions, I have defined a local variable <code>ecr-repo-names</code> to list the repo names. I have also defined <code>repository_lambda_read_access_arns</code>, which lists the Lambda functions that need image pull permission from the created ECR repos.</p>
<pre><code class="language-hcl">locals {
  ecr-repo-names = [
    &quot;example&quot;,
  ]
}

module &quot;ecr-repos&quot; {
  source  = &quot;terraform-aws-modules/ecr/aws&quot;
  version = &quot;~&gt; 1.6.0&quot;

  for_each = toset(local.ecr-repo-names)

  repository_name = each.value
  repository_lambda_read_access_arns = [
    &quot;arn:aws:lambda:${var.region}:${data.aws_caller_identity.current.account_id}:function:*&quot;
  ]

  repository_lifecycle_policy = jsonencode({
    rules = [{
      rulePriority = 1
      description = &quot;Expire images older than 1 year&quot;
      action = {
        type = &quot;expire&quot;
      }
      selection = {
        tagStatus = &quot;any&quot;
        countType = &quot;sinceImagePushed&quot;
        countUnit = &quot;days&quot;
        countNumber = 365
      }
    }]
  })
}
</code></pre>
<p>The repository lifecycle policy keeps the repository&apos;s storage usage under control by expiring images after a year.</p>
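<p>For reference, the <code>jsonencode(...)</code> call above renders to a JSON document in the shape ECR expects for lifecycle policies, roughly:</p>
<pre><code class="language-json">{
  &quot;rules&quot;: [
    {
      &quot;rulePriority&quot;: 1,
      &quot;description&quot;: &quot;Expire images older than 1 year&quot;,
      &quot;selection&quot;: {
        &quot;tagStatus&quot;: &quot;any&quot;,
        &quot;countType&quot;: &quot;sinceImagePushed&quot;,
        &quot;countUnit&quot;: &quot;days&quot;,
        &quot;countNumber&quot;: 365
      },
      &quot;action&quot;: {
        &quot;type&quot;: &quot;expire&quot;
      }
    }
  ]
}
</code></pre>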
<h2 id="github-oidc-provider">GitHub OIDC provider</h2>
<p>This will allow IAM to trust GitHub for authentication, so that long-lived tokens are not required.</p>
<pre><code class="language-hcl">locals {
  github-repos = [
    &quot;kakarukeys/example&quot;,
  ]
}

module &quot;iam_github_oidc_provider&quot; {
  source    = &quot;terraform-aws-modules/iam/aws//modules/iam-github-oidc-provider&quot;
  version   = &quot;~&gt; 5.33.1&quot;
}
</code></pre>
<p>GitHub sends AWS certain OIDC subject claims during authentication, and it is important that you define proper IAM trust policies so that only the right GitHub repositories can publish docker images, based on those claims!</p>
<blockquote>
<p>&#x1F4CE; An OIDC provider authenticates the user&apos;s identity and allows them to securely access other applications (when those applications are pre-configured to trust the provider).<br>
The OIDC provider does three main things:</p>
<ol>
<li>It performs the actual authentication of the user (e.g. by verifying their username and password)</li>
<li>It issues JSON Web Tokens (JWTs) containing information about the user&apos;s identity to other applications so they can recognize who the user is without storing passwords.</li>
<li>It protects the user&apos;s sensitive information like passwords and only shares necessary details about the user (like name, email) with other applications according to the user&apos;s consent.</li>
</ol>
</blockquote>
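<p>To make the claims concrete, here is a trimmed, hypothetical example of the token payload GitHub would present to AWS for a push to <code>main</code> (the <code>sub</code> claim is what IAM matches against):</p>
<pre><code class="language-json">{
  &quot;sub&quot;: &quot;repo:kakarukeys/example:ref:refs/heads/main&quot;,
  &quot;aud&quot;: &quot;sts.amazonaws.com&quot;,
  &quot;repository&quot;: &quot;kakarukeys/example&quot;,
  &quot;repository_owner&quot;: &quot;kakarukeys&quot;,
  &quot;ref&quot;: &quot;refs/heads/main&quot;
}
</code></pre>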
<h2 id="assumable-iam-role-and-policies">Assumable IAM role and policies</h2>
<p>The IAM policy is based on <a href="https://github.com/aws-actions/amazon-ecr-login?ref=blog.kakarukeys.quest#ecr-private">the <code>amazon-ecr-login</code> action&apos;s recommendations</a>, with one exception: I have additionally allowed GHA to perform the <code>lambda:GetLayerVersion</code> action so that it can download and install Lambda extensions.</p>
<pre><code class="language-hcl">data &quot;aws_iam_policy_document&quot; &quot;ecr-publish-policy-document&quot; {
  statement {
    effect = &quot;Allow&quot;
  
    actions = [
      &quot;ecr:GetAuthorizationToken&quot;,
      &quot;lambda:GetLayerVersion&quot;,
    ]

    resources = [&quot;*&quot;]
  }

  statement {
    effect = &quot;Allow&quot;

    actions = [
      &quot;ecr:BatchCheckLayerAvailability&quot;,
      &quot;ecr:BatchGetImage&quot;,
      &quot;ecr:CompleteLayerUpload&quot;,
      &quot;ecr:GetDownloadUrlForLayer&quot;,
      &quot;ecr:InitiateLayerUpload&quot;,
      &quot;ecr:PutImage&quot;,
      &quot;ecr:UploadLayerPart&quot;,
    ]

    resources = [for r in module.ecr-repos : r.repository_arn]
  }
}

resource &quot;aws_iam_policy&quot; &quot;ecr-publish-policy&quot; {
  name        = &quot;ecr-publish-policy&quot;
  description = &quot;Allow principal to publish images into selected ECR repos&quot;
  policy = data.aws_iam_policy_document.ecr-publish-policy-document.json
}

module &quot;gha-build-role&quot; {
  source  = &quot;terraform-aws-modules/iam/aws//modules/iam-github-oidc-role&quot;
  version = &quot;~&gt; 5.33.1&quot;

  # specify for Github Enterprise
  # audience     = &quot;https://mygithub.com/&lt;GITHUB_ORG&gt;&quot;
  # provider_url = &quot;mygithub.com/_services/token&quot;

  subjects = [
    for repo in local.github-repos : 
    &quot;repo:${repo}:ref:refs/heads/main&quot;
  ]

  name = &quot;GHA-Build-Role&quot;
  description = &quot;IAM role for Github Action to assume, for performing tasks related to project build&quot;
  max_session_duration = 3600   # secs

  policies = {
    EcrPublish = aws_iam_policy.ecr-publish-policy.arn
  }
}
</code></pre>
<p>The assumable role is defined with a <a href="https://registry.terraform.io/modules/terraform-aws-modules/iam/aws/latest/submodules/iam-github-oidc-role?ref=blog.kakarukeys.quest">dedicated AWS Terraform module called <code>iam-github-oidc-role</code></a> (a lot of boilerplate is abstracted away!). It allows GHA to call the AWS STS service to assume the role, which we have authorized for publishing images.</p>
<p>Pay attention to <code>subjects</code>, which defines the repos and branches for which the role can be assumed for docker image publishing.</p>
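<p>If you need to cover more than the <code>main</code> branch, the <code>subjects</code> list accepts other claim patterns too. A sketch with hypothetical variants (wildcards are matched with <code>StringLike</code> by the module):</p>
<pre><code class="language-hcl">subjects = [
  &quot;repo:kakarukeys/example:ref:refs/heads/main&quot;,     # a specific branch
  &quot;repo:kakarukeys/example:ref:refs/tags/*&quot;,         # any tag
  &quot;repo:kakarukeys/example:environment:production&quot;,  # a deployment environment
]
</code></pre>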
<hr><h2 id="testing">Testing</h2>
<p>That&apos;s all. Use these scripts in a Terraform project, create a plan and apply it to create the necessary resources. Output the role ARN with</p><pre><code class="language-hcl">output &quot;gha_build_role_arn&quot; {
  description = &quot;arn of GHA Build Role&quot;
  value       = module.gha-build-role.arn
}
</code></pre>
<p>You can then build a docker image with a <code>Dockerfile</code> you have and test the integration with</p><pre><code class="language-bash">$ docker build -t &lt;account-id&gt;.dkr.ecr.&lt;region&gt;.amazonaws.com/example:0.1.0 .

$ aws ecr get-login-password --region &lt;region&gt; | docker login \
    --username AWS \
    --password-stdin &lt;account-id&gt;.dkr.ecr.&lt;region&gt;.amazonaws.com

$ docker push &lt;account-id&gt;.dkr.ecr.&lt;region&gt;.amazonaws.com/example:0.1.0
</code></pre>
<p>Note that the above commands actually use your AWS profile user&apos;s permissions (likely a root user) to interact with ECR. The next step would be to use the role ARN in an actual GHA workflow. </p><p>To be continued in <a href="https://blog.kakarukeys.quest/github-actions-amazon-elastic-container-registry-p2/" rel="noreferrer">part 2</a>.</p>]]></content:encoded></item><item><title><![CDATA[My experiment with AI in DevOps]]></title><description><![CDATA[<p>You have heard how <a href="https://github.com/features/copilot?ref=blog.kakarukeys.quest" rel="noreferrer">the latest advances in AI can help with software development</a>. A natural question would be if AI could be integrated with DevOps to revolutionize the way infrastructure is managed and deployed.</p><p>So I did a simple experiment. </p><p><em>Can AI help automate some of my daily tasks?</em></p>]]></description><link>https://blog.kakarukeys.quest/experiment-with-ai-devops/</link><guid isPermaLink="false">65792575ca8f7900070de48d</guid><category><![CDATA[Terraform]]></category><category><![CDATA[AWS DevOps]]></category><category><![CDATA[LLM]]></category><dc:creator><![CDATA[Jeff Wong]]></dc:creator><pubDate>Wed, 13 Dec 2023 03:31:01 GMT</pubDate><content:encoded><![CDATA[<p>You have heard how <a href="https://github.com/features/copilot?ref=blog.kakarukeys.quest" rel="noreferrer">the latest advances in AI can help with software development</a>. A natural question would be if AI could be integrated with DevOps to revolutionize the way infrastructure is managed and deployed.</p><p>So I did a simple experiment. </p><p><em>Can AI help automate some of my daily tasks?</em></p><p>I use <strong>Terraform</strong> as my major IaC (Infrastructure as Code) tool. At the very least, if I feed the model with detailed instructions, can it write my IaC code flawlessly?</p><h2 id="the-llm-models-i-used">The LLM models I used</h2>
<p>I did some research, and found 3 cutting-edge language models suitable for Terraform coding:</p><ul>
<li><a href="https://kagi.com/fastgpt?ref=blog.kakarukeys.quest">Kagi FastGPT</a></li>
<li><a href="https://www.phind.com/search?home=true&amp;ref=blog.kakarukeys.quest">Phind</a></li>
<li><a href="https://chat.openai.com/?ref=blog.kakarukeys.quest">OpenAI ChatGPT</a></li>
</ul>
<p>I was on a budget, so I picked the free tier for each. Kagi FastGPT is a nimble research tool that gives superb results, as it performs <a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/?ref=blog.kakarukeys.quest" rel="noreferrer">RAG</a> with Kagi&apos;s search index.</p><p>Phind does RAG too, and its underlying model was fine-tuned on code, so it is marketed as a coding assistant. As for OpenAI ChatGPT: you know what it is and what it can do.</p><h2 id="the-task-to-do">The task to do</h2>
<p>I used the 3 models to research and later write Terraform and Python scripts, to set up an AWS Lambda function that opens and parses any new csv file in an existing S3 bucket, and appends the rows to a Redshift table.</p><p>The Lambda function is to be triggered by an EventBridge event which is emitted when a new file is added to the S3 bucket.</p><p>This is a very simplistic but realistic task, and to limit the scope, I did not instruct the models to generate the deployment script. I took care of the deployment myself, manually packaging the python code and uploading it to another S3 bucket to be picked up by the Lambda function.</p><p>But I expected the models to generate <strong>proper IAM roles and policies</strong>, so that the Lambda function had proper access to the various AWS services for its task. A DevOps engineer would find this part the most time-consuming.</p><h2 id="the-research-phase">The research phase</h2>
<p>I tried to imagine myself as a junior / intermediate DevOps engineer using AI (it was harder than I thought). So I asked the basic questions, and then the detailed questions. For example:</p><ol>
<li>what are the common ways where AWS lambda functions are used in data engineering?</li>
<li>can AWS EventBridge do what AWS Cloudwatch Events do?</li>
<li>how to deploy a lambda function that requires external python libraries as project dependencies?</li>
<li>what are the managed policies available for lambda function to write to Amazon Cloudwatch Logs?</li>
</ol>
<p>To my pleasant surprise, the models got most of the answers and facts correct. If one missed something, the other models usually filled in.</p><p>The RAG models, Kagi FastGPT and Phind, did slightly better than OpenAI ChatGPT 3.5. I expected this, because it is common sense that the answers will be better if you feed the models the relevant technical documentation.</p><p>One downside was that the models did not have a strong opinion on which approach was optimal or the best practice. They gave command-line or console instructions in some answers, but to be fair, I was conducting research and so did not prompt the models for instructions conforming to IaC.</p><h2 id="the-coding-phase">The coding phase</h2>
<p>After studying the answers from the research phase, I formulated an approach and created the necessary prompts. I started asking the models to generate code.</p><blockquote>
<p>write a Terraform script to achieve the following:</p>
<ol>
<li>set up an EventBridge event and rule to track changes in an existing S3 bucket: adding of new csv file</li>
<li>when a new csv file is added in the bucket, trigger a lambda function to run</li>
<li>the lambda function should parse the file and append the rows in the csv file to a Redshift table, using Python runtime.</li>
<li>the lambda function will be uploaded to another existing S3 bucket to be picked up by AWS</li>
<li>create a proper role with proper managed policies so that the lambda function has access to read from S3, and write to Redshift.</li>
</ol>
</blockquote>
<blockquote>
<p>write some python code to achieve the following:</p>
<ol>
<li>make a connection to an Amazon Redshift Serverless database. assuming the script will be run with appropriate IAM role to make the connection.</li>
<li>given a csv DictReader object, read all dictionary objects from it.</li>
<li>write all dictionary objects into a database table, appending to it.</li>
</ol>
</blockquote>
<p>I asked the models to generate a single script instead of a whole project (I did not think the latter was possible). I already had a project boilerplate that I could use.</p><h2 id="the-challenges-of-ai">The challenges of AI</h2>
<p>While AI demonstrated speed and versatility in providing code snippets, it struggled to deliver optimal solutions. I found that errors and inaccuracies often permeated the generated code.</p><p>In some cases, lines of code were complete fabrications or <strong>&quot;hallucinations&quot;</strong> &#x2014; instances where the AI produces made-up results but presents them as if they were true. You need considerable DevOps knowledge in order to spot such fabrications.</p><p>A model gave an incorrect <code>aws_lambda_permission</code> resource block in its generated code:</p>
<pre><code class="language-hcl">resource &quot;aws_lambda_permission&quot; &quot;s3_event_permission&quot; {
  statement_id  = &quot;AllowExecutionFromS3Event&quot;
  action        = &quot;lambda:InvokeFunction&quot;
  function_name = aws_lambda_function.csv_parser_lambda.function_name
  principal     = &quot;s3.amazonaws.com&quot;
  source_arn    = aws_s3_bucket.existing_bucket.arn
}
</code></pre>
<p>The principal should be <code>events.amazonaws.com</code>, since the invocation comes from the EventBridge rule, not directly from S3.</p>
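<p>A corrected sketch: besides the principal, <code>source_arn</code> should also reference the EventBridge rule (named <code>csv_added</code> here, a hypothetical resource declared elsewhere) rather than the S3 bucket:</p>
<pre><code class="language-hcl">resource &quot;aws_lambda_permission&quot; &quot;eventbridge_permission&quot; {
  statement_id  = &quot;AllowExecutionFromEventBridge&quot;
  action        = &quot;lambda:InvokeFunction&quot;
  function_name = aws_lambda_function.csv_parser_lambda.function_name
  principal     = &quot;events.amazonaws.com&quot;
  source_arn    = aws_cloudwatch_event_rule.csv_added.arn
}
</code></pre>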
<p>Another glaring weakness of AI is its inability to leverage reusable third-party libraries and modules, which often causes the generated code to be very verbose.</p><p>In my experiment, the generated terraform script consisted of many individual resource blocks. AI failed to make use of any <a href="https://github.com/terraform-aws-modules?ref=blog.kakarukeys.quest" rel="noreferrer">official AWS terraform modules</a>.</p><p>Additional prompts and follow-up questioning could not get the models to use AWS terraform modules. It seemed the models had not been trained on data that included those modules.</p><p>In the generated python script, AI opted to use the <code>INSERT</code> SQL statement instead of the more efficient <code>COPY</code>.</p><pre><code class="language-python">with open(download_path, &apos;r&apos;) as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)  # Skip header

    for row in csv_reader:
        # Assuming your Redshift table has columns col1, col2, col3
        cursor.execute(
            &quot;INSERT INTO your_redshift_table (col1, col2, col3) VALUES (%s, %s, %s)&quot;, 
            (row[0], row[1], row[2])
        )

redshift_conn.commit()
cursor.close()
redshift_conn.close()
</code></pre>
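<p>For comparison, the more efficient route is Redshift&apos;s <code>COPY</code> command, which bulk-loads the csv straight from S3 instead of inserting row by row. A sketch with hypothetical bucket, table and role names:</p>
<pre><code class="language-sql">COPY your_redshift_table
FROM &apos;s3://existing-bucket/new-file.csv&apos;
IAM_ROLE &apos;arn:aws:iam::123456789012:role/RedshiftCopyRole&apos;
FORMAT AS CSV
IGNOREHEADER 1;
</code></pre>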
<p>DevOps principles emphasize modularity and efficiency. Unfortunately, AI failed to adequately grasp these aspects.</p><h2 id="conclusion">Conclusion</h2>
<p>As the experiment unfolded, it became evident that while AI holds promise in augmenting DevOps workflows, it is not yet mature enough to replace human expertise entirely. A DevOps expert brings not just technical proficiency but also contextual understanding and oversight, qualities that AI struggles to replicate.</p><p>Despite its limitations, AI proved its ability to generate ideas swiftly and in abundance. This presents an opportunity for human experts to leverage AI as a tool for ideation and exploration, allowing them to sift through generated concepts and decide on their feasibility.<br></p>]]></content:encoded></item></channel></rss>