AWS CodeBuild: The missing link for deployment pipelines in AWS

This is a follow-up to my AWSAdvent article "Serverless everything: One-button serverless deployment pipeline for a serverless app", which extends the example deployment pipeline with AWS CodeBuild.

Deployment pipelines are very common today, as they are usually part of a continuous delivery/deployment workflow. While it's possible to use projects like Jenkins or Concourse for those pipelines, I prefer using managed services in order to minimize operations and maintenance, so I can concentrate on generating business value. Luckily, AWS has a service called CodePipeline which makes it easy to create deployment pipelines with several stages and actions, such as downloading the source code from GitHub and executing build steps.

For the build steps, there are several options, like invoking an external Jenkins job or Solano CI. But if you want to stay in AWS land, your options were quite limited until recently. The only pure AWS option for CodePipeline build steps (without adding operational overhead, e.g. managing servers or containers) was invoking Lambda functions, which has several drawbacks, all of which I experienced:

Using Lambda as Build Steps

5 minutes maximum execution time

Lambda functions have a limit of 5 minutes which means that the process gets killed if it exceeds the timeout. Longer tests or builds might get aborted and thus result in a failing deployment pipeline. A possible workaround would be to split the steps into smaller units, but that is not always possible.

Build tool usage

The NodeJS 4.3 runtime in Lambda has the npm command preinstalled, but it needs several hacks to work. For example, the Lambda runtime is a read-only file system except for /tmp, so in order to use NPM, you need to fake the HOME and point it to /tmp. Another example is that you need to find out where the preinstalled NPM version lives (check out my older article on NPM in Lambda).

Artifact handling

CodePipeline works with so-called artifacts: build steps can have several input and output artifacts each. These are stored in S3 and thus have to be either downloaded (input artifacts) or uploaded (output artifacts) by a build step. In a Lambda build step, this has to be done manually, which means you have to use the S3 SDK of the runtime for artifact handling.
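
Roughly, a Lambda build step has to do something like the following by hand (a simplified sketch; error handling, unzipping and the temporary artifact credentials are left out):

var AWS = require('aws-sdk');
var s3 = new AWS.S3();
var codepipeline = new AWS.CodePipeline();

exports.handler = function(event, context, callback) {
  var job = event['CodePipeline.job'];
  var artifact = job.data.inputArtifacts[0].location.s3Location;

  // download the input artifact (a ZIP file) from the artifact store in S3
  s3.getObject({Bucket: artifact.bucketName, Key: artifact.objectKey}, function(err, data) {
    if (err) return callback(err);
    // ... unzip, run the actual build, zip and upload the output artifact ...
    // and finally report the result back to CodePipeline
    codepipeline.putJobSuccessResult({jobId: job.id}, callback);
  });
};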

NodeJS for synchronous code

When you want to use the preinstalled NPM in Lambda, you need to use the NodeJS 4.3 runtime. At least I did not manage to get the NPM version that ships with the Lambda Python runtime working. So I was stuck with programming in NodeJS. And writing synchronous code in NodeJS was no fun for me: I had to learn how promises work for code that would have been a few lines of Python or Bash. Looking back, if there were still no CodeBuild service, I would rather invoke a Bash or Python script from within the NodeJS runtime in order to avoid writing async code for synchronous program sequences.
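
A sketch of that workaround: keep the Lambda handler a thin wrapper and shell out to a synchronous script (build.sh is a hypothetical script shipped with the function code):

var child_process = require('child_process');
var path = require('path');

exports.handler = function(event, context, callback) {
  try {
    // run the actual build steps synchronously in Bash instead of async JS
    var output = child_process.execSync('bash ' + path.join(__dirname, 'build.sh'),
      {encoding: 'utf-8'});
    callback(null, output);
  } catch (err) {
    callback(err);
  }
};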

Lambda function deployment

The code for Lambda functions is usually packed as a ZIP file and stored in an S3 bucket. The location of the ZIP file is then referenced in the Lambda function. This is how it looks in CloudFormation, the Infrastructure-as-Code service from AWS:

LambdaFunction:
  Type: AWS::Lambda::Function
  Properties:
    Code:
      S3Bucket: !Ref DeploymentLambdaFunctionsBucket
      S3Key: !Ref DeploymentLambdaFunctionsKey

That means there has to be yet another build and deployment procedure which packs the Lambda function code and uploads it to S3. A lot of complexity for a build script which is usually a few lines of shell code, if you ask me.
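
For illustration, such a procedure is often not much more than a few shell commands like these (the bucket name and file layout are made up):

zip -r function.zip index.js node_modules/
aws s3 cp function.zip s3://my-deployment-lambda-functions-bucket/function.zip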

By the way, there is actually a workaround: in CloudFormation, it's possible to specify the code of the Lambda function inline in the template, like this:

LambdaFunctionWithInlineCode:
  Type: AWS::Lambda::Function
  Properties:
    Code:
      ZipFile: |
        exports.handler = function(event, context) {
          ...
        }

While this has the advantage that the pipeline and the build step code are now in one place (the CloudFormation template), it comes at the cost of losing IDE features for the function code, like syntax checking and highlighting. Another point: the inline code is limited to 4096 characters, a limit which can be reached rather fast. Also, the CloudFormation templates tend to become very long and confusing. In the end, using inline code just felt awkward to me.

No AWS CLI installed in Lambda

Last but not least, there is no AWS CLI installed in the Lambda runtime, which makes typical build-step tasks, like uploading directories to S3, really hard, because they have to be done in the programming runtime. What would be a one-liner with the AWS CLI can mean much more overhead and many more lines of code in e.g. NodeJS or Python.
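
To illustrate the difference: this is the kind of AWS CLI one-liner meant here (the bucket name is made up), which in a Lambda build step would translate into listing the directory, guessing content types and uploading every single file via the SDK:

aws s3 sync frontend/ s3://my-website-bucket/ --delete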

At the recent re:Invent conference, AWS announced CodeBuild, which is a build service, very much like a managed version of Jenkins, but fully integrated into the AWS ecosystem. Here are a few highlights:

  • Fully integrated into AWS CodePipeline: CodePipeline is the "deployment pipeline" service from AWS and supports CodeBuild as an action in a deployment pipeline. It also means that CodePipeline can check out code from e.g. a GitHub repository first, save it as an output artifact and pass it to CodeBuild, so that the entire artifact handling is managed, with no (un)zipping and S3 juggling necessary.
  • Managed build system based on Docker containers: First, you don't need to take care of any Docker management. Second, you can either use AWS-provided images, which cover a range of operating systems (e.g. Amazon Linux and Ubuntu) with several pre-built environments, e.g. NodeJS, Python or Go (http://docs.aws.amazon.com/codebuild/latest/userguide/build-env-ref.html), or you can bring your own container (I did not try that out yet).
  • Fully supported by CloudFormation, the Infrastructure-as-Code service from AWS: You can codify CodeBuild projects so that they are fully automated, and reproducible without any manual and error-prone installation steps. Together with CodePipeline they form a powerful unit to express entire code pipelines as code which further reduces total cost of ownership.
  • A YAML DSL which describes the build steps (as a list of shell commands) as well as the output artifacts of the build.

Another great point is that the provided images are very similar to the Lambda runtimes (based on Amazon Linux), so they are predestined for tasks like packing and testing Lambda function code (ZIP files).
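
For example, a minimal buildspec that tests and packs a Lambda function ZIP might look roughly like this (a sketch; the backend directory, test script and ZIP file name are assumptions):

version: 0.1
phases:
  install:
    commands:
      - cd backend && npm install
  build:
    commands:
      - cd backend && npm test
      - cd backend && zip -r ../backend.zip .
artifacts:
  files:
    - backend.zip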

CodeBuild in action

So, what are the particular advantages of using CodeBuild vs. Lambda in CodePipeline? Have a look at this Pull Request: it replaces the former Lambda-based approach with CodeBuild in the project I set up for my AWS Advent article. Several hundred lines of JavaScript got replaced by a few lines of CodeBuild YAML. Here is what a sample build file looks like:

version: 0.1
phases:
  install:
    commands:
      - npm install -g serverless
      - cd backend && npm install
  build:
    commands:
      - "cd backend && serverless deploy"
      - "cd backend && aws cloudformation describe-stacks --stack-name $(serverless info | grep service: | cut -d' ' -f2)-$(serverless info | grep stage: | cut -d' ' -f2) --query 'Stacks[0].Outputs[?OutputKey==`ServiceEndpoint`].OutputValue' --output text > ../service_endpoint.txt"
artifacts:
  files:
    - frontend/**/*
    - service_endpoint.txt

This example shows a buildspec.yml with two main sections: phases and artifacts:

  • phases, as the name suggests, lists the phases of the build. The predefined phase names actually have no special meaning, and you can put as many arbitrary commands into them as you like. The example executes several shell commands: first, in the install phase, the installation of the serverless NPM package, followed by the build phase, which runs the Serverless framework (serverless deploy). Lastly, it runs a more complex command to save the output of a CloudFormation stack into a file called service_endpoint.txt; that file is later picked up as an output artifact.
  • artifacts lists the directories and files which CodeBuild will provide as the output artifact. Used in combination with CodePipeline, this gives seamless integration into the pipeline, and you can use the artifact as input for another pipeline stage or action. In this example, the frontend folder and the mentioned service_endpoint.txt file are declared as output artifacts.

The artifacts section can also be omitted if there are no artifacts at all.

Now that we have learned the basics of the buildspec.yml file, let's see how this integrates with CloudFormation:

CodeBuild and CloudFormation

CloudFormation provides the resource type AWS::CodeBuild::Project to describe CodeBuild projects - an example follows:

DeployBackendBuild:
  Type: AWS::CodeBuild::Project
  Properties:
    Artifacts:
      Type: CODEPIPELINE
    Environment:
      ComputeType: BUILD_GENERAL1_SMALL
      Image: aws/codebuild/eb-nodejs-4.4.6-amazonlinux-64:2.1.3
      Type: LINUX_CONTAINER
    Name: !Sub ${AWS::StackName}DeployBackendBuild
    ServiceRole: !Ref DeployBackendBuildRole
    Source:
      Type: CODEPIPELINE
      BuildSpec: |
        version: 0.1
        ...

This example creates a CodeBuild project which integrates into a CodePipeline (Type: CODEPIPELINE) and which uses an AWS-provided image for NodeJS runtimes. The advantage is that e.g. NPM is preinstalled. The Source section declares that the source code for the build also comes from CodePipeline. The BuildSpec specifies an inline build specification (e.g. the one shown above).

You could also specify that CodeBuild should look for a buildspec.yml in the provided source artifact rather than providing one inline in the project specification.
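
In that case the project boils down to something like this (a sketch; when no BuildSpec property is given, CodeBuild looks for a buildspec.yml at the root of the input artifact):

DeployBackendBuild:
  Type: AWS::CodeBuild::Project
  Properties:
    ...
    Source:
      Type: CODEPIPELINE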

CodeBuild and CodePipeline

Last but not least, let’s have a look at how CodePipeline and CodeBuild integrate by using an excerpt from the CloudFormation template which describes the pipeline as code:

Pipeline:
  Type: AWS::CodePipeline::Pipeline
  Properties:
    ...
    Stages:
      - Name: Source
        Actions:
          - Name: Source
            InputArtifacts: []
            ActionTypeId:
              Category: Source
              Owner: ThirdParty
              Version: 1
              Provider: GitHub
            OutputArtifacts:
              - Name: SourceOutput
      - Name: DeployApp
        Actions:
          - Name: DeployBackend
            ActionTypeId:
              Category: Build
              Owner: AWS
              Version: 1
              Provider: CodeBuild
            OutputArtifacts:
              - Name: DeployBackendOutput
            InputArtifacts:
              - Name: SourceOutput
            Configuration:
              ProjectName: !Ref DeployBackendBuild
            RunOrder: 1

This code describes a pipeline with two stages: the first stage checks out the source code from a Git repository, and the second stage is the interesting one here. It contains a CodeBuild action which takes SourceOutput as its input artifact, which ensures that the commands specified in the build spec of the referenced DeployBackendBuild CodeBuild project can operate on the source. DeployBackendBuild is the sample project we looked at in the previous section.

The Code

The full CloudFormation template describing the pipeline is on GitHub. You can actually test it out by yourself by following the instructions in the original article.

Summary

Deployment pipelines are as valuable as the software itself, as they ensure reliable deployments, experimentation and fast time-to-market. So why shouldn't we treat them like software, namely as code? With CodeBuild, AWS completed the toolchain of building blocks necessary to codify and automate the setup of deployment pipelines for our software:

  • no complexity from setting up and maintaining third-party services
  • no error-prone manual steps
  • no management of your own infrastructure, like Docker clusters as “build farms”
  • no bloated Lambda functions for build steps

This article showcases a CloudFormation template which should help readers get started with their own CloudFormation/CodePipeline/CodeBuild combo, which provisions within minutes. There are no excuses anymore for manual and/or undocumented software deployments within AWS ;-)

Website now powered by Hexo, AWS CloudFront and S3

Over the past days I moved my blog over to AWS CloudFront and S3, powered by the static blog generator Hexo.

Here are a few highlights:

  • The source code of the website is now open source on GitHub.
  • The infrastructure for the website is automated and codified by a CloudFormation template.
  • The website is secured via HTTPS thanks to CloudFront and the AWS Certificate Manager.
  • The build of the website is entirely codified and automated with AWS CodePipeline and CodeBuild (see the CloudFormation template for details).
  • The website and building infrastructure are serverless. No servers, VMs or containers to manage.
  • Major performance enhancements since the website is now static and powered by a CDN.

New AWS CloudFormation YAML syntax and variable substitution in action

I've been using CloudFormation YAML syntax for a while now with Ansible and the Serverless framework, which would convert the YAML to JSON before uploading the template. That already gave me the YAML advantages of e.g. code comments and not having to care about commas.

A few days ago, AWS announced native YAML support for CloudFormation templates, in addition to the existing JSON format.

And along with that they added new shorthand syntax for several functions.

Let’s go through a template which I created not only in order to get used to the new syntax :)

Injecting “arguments” to inline Lambda functions

One of the real powers of Lambda and CloudFormation is that you can use Lambda to add almost any missing functionality to CloudFormation (e.g. via custom resources), or to create small functions, without having to maintain another deployment workflow for the function code (in this example I created a Lambda function which polls some web services and writes the result into a CloudWatch custom metric).

The interesting part is how AccessPointName is injected into the Lambda function (in this example some Python code). We are making use of the new shorthand substitution syntax here, which allows us to replace CloudFormation references with their values:

CheckProgram:
  Type: AWS::Lambda::Function
  Properties:
    Code:
      ZipFile: !Sub |
        ...
        def handler(event, context):  
          ...
          found_access_points = [access_point for access_point in api_result["allTheRouters"] if access_point["name"] == "${AccessPointName}"]

In this example, the variable "AccessPointName" then gets substituted with its value (in this particular case a stack parameter). Please also mind the "|", which is not special CloudFormation syntax but plain multi-line YAML syntax.

Throughout the template you can find other usage examples of the new substitution syntax, for example a cron job built with CloudWatch Events whose rate is taken from a stack parameter:

CheckProgramTrigger:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: !Sub rate(${CheckRateMinutes} minutes)
    Targets:
      - Arn:
          !GetAtt [CheckProgram, Arn]
        Id: InvokeLambda

Referencing with the !Ref and !GetAtt shortcuts

Another feature addition is a shorthand syntax for Ref and Fn::GetAtt calls.

AccessPointOfflineAlertTopic:
  Type: AWS::SNS::Topic
  Properties:
    Subscription:
      - Endpoint: !Ref NotificationEmail
        Protocol: email

This example creates an SNS topic with an email subscription whose endpoint is once again a CloudFormation template parameter.
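
For comparison, here is roughly what the same references look like in the long form that was needed before the shorthand syntax (just a sketch of the relevant properties):

Subscription:
  - Endpoint:
      Ref: NotificationEmail
    Protocol: email

# and the !GetAtt from the CheckProgramTrigger example above:
Arn:
  Fn::GetAtt: [CheckProgram, Arn]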

Recap

With the new syntax it's now possible to write CloudFormation templates in YAML, and we have nice shortcuts for commonly used functions. My personal highlight is the shorthand substitution syntax, especially when used with inline Lambda functions.

How to install and use a newer version (3.x) of NPM on AWS Lambda.

My current experiment is to build a serverless deployment pipeline (with AWS CodePipeline) which uses AWS Lambda for the build steps. One step involves invoking NPM to build a static website out of JavaScript components (which would then be deployed to an S3 bucket in a later step).

Ok, so let's go ahead and look at what is actually installed in the preconfigured Node 4.3 environment on AWS Lambda. First we want to find out if NPM is already installed, so we just create a new Lambda function which invokes a find command. Here is the source code:

exports.handler = (event, context, callback) => {  
  var child_process = require('child_process'); 
  console.log(child_process.execSync('find /usr -name npm -type f', {encoding: 'utf-8'}));   
}; 

And, voila, we found something, here is the output:

/usr/local/lib64/node-v4.3.x/lib/node_modules/npm/bin/npm

So let’s try to execute it!

console.log(child_process.execSync('/usr/local/lib64/node-v4.3.x/lib/node_modules/npm/bin/npm version', {encoding: 'utf-8'}));

And here is the output:

module.js:327
    throw err;
    ^

Error: Cannot find module '/usr/local/lib64/node-v4.3.x/lib/node_modules/npm/bin/node_modules/npm/bin/npm-cli.js'
    at Function.Module._resolveFilename (module.js:325:15)
    at Function.Module._load (module.js:276:25)
    at Function.Module.runMain (module.js:441:10)
    at startup (node.js:134:18)
    at node.js:962:3

    at checkExecSyncError (child_process.js:464:13)
    at Object.execSync (child_process.js:504:13)
    at exports.handler (/var/task/index.js:4:29)

Ok, that doesn’t look good, does it? Actually the ‘node_modules/npm/bin/node_modules/npm/bin/npm-cli.js’ part looks broken.

Ok, so my next step was to find the correct path to npm-cli.js, so I have a chance to call it without the apparently broken executable wrapper:

console.log(child_process.execSync('find /usr -type f -name npm-cli.js', {encoding: 'utf-8'}));


/usr/local/lib64/node-v4.3.x/lib/node_modules/npm/bin/npm-cli.js

So let’s try to call it directly:

console.log(child_process.execSync('node /usr/local/lib64/node-v4.3.x/lib/node_modules/npm/bin/npm-cli.js version', {encoding: 'utf-8'}));

gives us:

{ npm: '2.14.12',  ... }

Yay! We got NPM working!

But NAY, it’s an old version!

So let's go ahead and try to install a newer version! Lambda gives us a writable /tmp, so we could use that as a target directory. NPM actually wants to do a lot of stuff in the $HOME directory (e.g. create cache dirs), but that is not writable within a Lambda environment.

So my “hack” was to set the $HOME to /tmp, and then install a newer version of NPM into it (by using the --prefix option):

process.env.HOME = '/tmp';
console.log(child_process.execSync('node /usr/local/lib64/node-v4.3.x/lib/node_modules/npm/bin/npm-cli.js install npm --prefix=/tmp --progress=false', {encoding: 'utf-8'}));
console.log(child_process.execSync('node /tmp/node_modules/npm/bin/npm-cli.js version', {encoding: 'utf-8'}));

Ok, NPM got installed and is ready to use!

npm@3.10.5 ../../tmp/node_modules/npm

The last step is to symlink the npm wrapper so it can be used without hassle. And actually many build systems seem to expect a working npm executable:

var fs = require('fs');  // required for the mkdir/symlink calls below
fs.mkdirSync('/tmp/bin');
fs.symlinkSync('/tmp/node_modules/npm/bin/npm-cli.js', '/tmp/bin/npm');
process.env.PATH = '/tmp/bin:' + process.env.PATH;
console.log(child_process.execSync('npm version', {encoding: 'utf-8'}));

And here we go! Now it's possible to use an up-to-date version of NPM within a Lambda function.
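
Putting the pieces together, a consolidated handler could look roughly like this (a sketch based on the paths and versions found above):

var child_process = require('child_process');
var fs = require('fs');

// path of the preinstalled (old) npm-cli.js found above
var NPM_CLI = '/usr/local/lib64/node-v4.3.x/lib/node_modules/npm/bin/npm-cli.js';

exports.handler = function(event, context, callback) {
  // /tmp is the only writable location, so pretend it is our HOME
  process.env.HOME = '/tmp';
  // start from a clean state, as Lambda containers might get reused
  child_process.execSync('rm -rf /tmp/.npm /tmp/node_modules /tmp/bin');
  // use the old npm to install a current npm into /tmp
  child_process.execSync('node ' + NPM_CLI + ' install npm --prefix=/tmp --progress=false',
    {encoding: 'utf-8'});
  // expose it as a plain `npm` executable on the PATH
  fs.mkdirSync('/tmp/bin');
  fs.symlinkSync('/tmp/node_modules/npm/bin/npm-cli.js', '/tmp/bin/npm');
  process.env.PATH = '/tmp/bin:' + process.env.PATH;
  callback(null, child_process.execSync('npm version', {encoding: 'utf-8'}));
};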

Some additional learnings:

  • NPM needs a lot of memory, so I configured the Lambda function with max memory of 1500MB RAM. Otherwise it seems to misbehave or garbage collect a lot.
  • You should start with a clean /tmp before installing NPM in order to avoid side effects, as containers might get reused by Lambda. This step did it for me:
child_process.execSync('rm -fr /tmp/.npm');  
// ... npm install steps ...
  • Downloading and installing NPM every time the build step is executed makes it more flaky (remember the fallacies of distributed computing!). It also reduces the available execution time by about 10 seconds (the time it takes to download and install NPM). One could package the installed NPM version as its own Lambda function in order to decouple it, but that's a topic for another blog post.

Keeping your Pocket list clean with pocketcleaner and AWS Lambda

Over the last years, my Pocket reading queue got longer and longer; it actually contained stuff dating back to 2013. Over time I realized I would never ever be able to catch up with it again.

Some days ago I found out that Daniel (mrtazz) developed a nice tool named pocketcleaner which archives Pocket entries that are too old. I thought, "Hey great, that's a solution to my problem, but how do I execute it?". People who know me might already have an idea :) I don't like servers in terms of infrastructure that I have to maintain. So I thought: AWS Lambda to the rescue!

And here it is: an Ansible playbook which sets up a Lambda function that downloads, configures and executes the Go binary. It can be triggered by an AWS event timer. No servers, just a few cents per month (at most!) for AWS traffic and Lambda execution costs.

Simple service discovery using AWS Private Hosted Zones

A rather simple but effective and easy-to-set-up service discovery (SD) mechanism with near-zero maintenance costs can be built by utilizing the AWS Private Hosted Zone (PHZ) feature. PHZs allow you to connect a Route53 hosted zone to a VPC, which in turn means that DNS records in that zone are only visible to the attached VPCs.

Before digging deeper into the topic, let's try to find a definition for 'simple service discovery'. I'd say in 99% of cases, service discovery is something like "I am an application called myapp, please give me (for example) my database and cache endpoints, and service Y which I rely on". So the service consumer and the service announcer need to speak a common language, and no manual human interaction should be needed. This is at least how Wikipedia defines service discovery protocols:

Service discovery protocols (SDP) are network protocols which allow automatic detection of devices and services offered by these devices on a computer network. Service discovery requires a common language to allow software agents to make use of one another’s services without the need for continuous user intervention.

So back to the topic. You might think: why not use Consul, Etcd, SkyDNS, etc.?

“no software is better than no software” — rtomayko

You are not done once you have installed the software. You might need to package, configure, monitor, upgrade and sometimes deeply understand and debug it as well. I for one simply love it when my service provider does this for me (and Route53 actually has a very good uptime SLA, beat that!) so I can concentrate on adding value for my customers.

“However, service discovery apps introduce more complexity, magic, and point of failures, so don’t use them unless you absolutely need to.”

This is another point. Keeping it simple is hard, and an art. I learned the hard way that I should try to avoid more complex tools and processes as long as possible. Once you have introduced complexity, it's hard to remove again, because you or other people might have built even more complex stuff on top of it.

Ok, we are almost done with my 'total cost of ownership' preaching. Another aspect of keeping it simple and lean, for me, is to use as much infrastructure as possible from my IaaS provider: for example databases (RDS), caches (ElastiCache), queues and storage (e.g. S3). Those services usually don't have a native interface to announce themselves to Consul, Etcd etc., so one would need to write some glue which takes events from the IaaS provider, filters them and then announces changes to the SD cluster.1

Ok, so how can we achieve service discovery with the AWS building blocks, and especially Private Hosted Zones?

The first thing to do is to create a new Private Hosted Zone and associate it with your VPC. In our example we'll call it snakeoil.prod.internal, indicating that it is the internal DNS for our snakeoil company in our prod environment (which implies that other environments, e.g. staging or development, reside in other VPCs).

Ok, nothing really special. Now we can add our first resource record to the hosted zone and resolve it, e.g. cache-myapp, indicating it's the cache endpoint for my app myapp. We will use CloudFormation with troposphere as a preprocessor to create an ElastiCache cluster and its PHZ announcement:

from troposphere import GetAtt, Join, Template, elasticache, route53

template = Template()

PrivateHostedZone = "snakeoil.prod.internal."

app_elasticache = elasticache.CacheCluster(...)
template.add_resource(app_elasticache)

# announce the cluster endpoint in the Private Hosted Zone as a CNAME
app_elasticache_private_hosted_zone_dns_rr = route53.RecordSetType(
    "SessionClusterPHZEndpoint",
    HostedZoneName=PrivateHostedZone,
    Name="cache-myapp.%s" % (PrivateHostedZone),
    Type="CNAME",
    ResourceRecords=[Join("", [GetAtt(app_elasticache, "ConfigurationEndpoint.Address"), "."])],
    TTL="60"
)
template.add_resource(app_elasticache_private_hosted_zone_dns_rr)

This snippet creates a CNAME in the PHZ which points to the ElastiCache cluster endpoint.

This is what it actually looks like when we resolve it from an EC2 instance within the VPC:

$ host cache-myapp.snakeoil.prod.internal
cache-myapp.snakeoil.prod.internal is an alias for app-x.z7iqq9.cfg.use1.cache.amazonaws.com
app-x.z7iqq9.cfg.use1.cache.amazonaws.com has address 192.0.2.1

But wait, do we now need to specify the entire PHZ domain (snakeoil.prod.internal) every time we want to look up the service? Wouldn't it be great if we could just look up cache-myapp, so our application does not need to know in which zone or environment it is running (the principle of least knowledge)?!

This is where DHCP option sets come into play: we can just create a new one which includes snakeoil.prod.internal as the domain name (search domain).
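
A minimal troposphere sketch of that option set and its VPC association could look like this (the "Vpc" resource name and the reuse of the template object from above are assumptions):

from troposphere import Ref, ec2

dhcp_options = ec2.DHCPOptions(
    "InternalDHCPOptions",
    DomainName="snakeoil.prod.internal",
    DomainNameServers=["AmazonProvidedDNS"],  # keep the VPC-provided resolver
)
template.add_resource(dhcp_options)

# associate the option set with the VPC so instances pick it up via DHCP
template.add_resource(ec2.VPCDHCPOptionsAssociation(
    "InternalDHCPOptionsAssociation",
    DhcpOptionsId=Ref(dhcp_options),
    VpcId=Ref("Vpc"),
))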

Once we have associated our VPC with this DHCP option set, we can omit the domain part, as it's now part of the search domain (propagated via DHCP):

$ host cache-myapp
cache-myapp is an alias for app-x.z7iqq9.cfg.use1.cache.amazonaws.com.
app-x.z7iqq9.cfg.use1.cache.amazonaws.com has address 192.0.2.1

Now we can just hardcode the service endpoint in our software (or its configuration), for example like this:

$client = new Memcached();
$client->addServer('cache-myapp', $server_port);

No need for configuration management like Puppet or Chef, no need for service discovery tools (Consul etc.), and no need for glue software (e.g. confd). The contract between the service consumer and the announcer is the service name.

Hint: We could theoretically add even more granularity by creating a VPC for every (application, environment) tuple we have. This would eventually lead to a scheme where the app would only need to look up database, cache and service-y, so even the name of the app could be omitted in the 'search query'. But the VPC networking overhead might not be worth it. You have to decide which trade-off to make.

Warning 1: Route53 propagation times

During my research I found out that it takes approximately 40 seconds for Route53 to propagate changes. So if you rely on real-time changes, you should rather look into more sophisticated approaches like Consul, Etcd, SkyDNS etc. I guess AWS will improve propagation delays over time.

Another issue is the default SOA TTL set by AWS: it's 900 seconds by default, which is effectively the negative-cache TTL. That means once you have requested a record which is not propagated yet, you have to wait up to 15 minutes until the negative cache expires. I would recommend setting it to a low value like 10-60 seconds.

Warning 2: DNS and Networking

“Everything is a Freaking DNS problem” Kris Buytaert

DNS is a network protocol and as a result is constrained by the fallacies of distributed computing. DNS queries are usually not cached on Linux distros by default, but luckily there are caching solutions available. We are currently using nscd, but dnsmasq is another option. I would recommend installing one of those to make your system more resilient against networking or DNS problems.

Recap

Service discovery can be made arbitrarily complex, but it can also be kept simple using the building blocks AWS gives us. The demonstrated pattern can be used for almost anything that just connects to an endpoint.

I am planning to write follow-up blog posts on more sophisticated service discovery with SRV records, on using TXT records for storing configuration/credentials, and even on feature ramp-ups within the PHZ. Stay tuned!

Acknowledgement

The basic idea of doing discovery by just resolving bare hostnames was initially brought to me by my fellow co-worker Ingo Oeser who successfully used this kind of discovery at his former employer.

He pointed out that those setups included DNSSEC as well in order to prevent DHCP and/or DNS spoofing. We currently don’t consider this a problem in an AWS VPC.

1 It looks like HashiCorp can integrate IaaS components with their autodiscovery by using their paid product 'Atlas' as a bridge between Terraform and Consul, but I didn't validate this hypothesis.

devopsdays Ghent recap

!!! ATTENTION: Highly unstructured braindump content !!!

Day 1

The Self-Steering Organization: From Cybernetics to DevOps and Beyond

Nice intro into cybernetics and systems theory. Nothing really new for me, as I'm a little bit into systems theory. Keywords: autopoiesis, systems theory, cybernetics, empathy, feedback.

Ceci n’est pas #devops

  • “DevOps is culture, anyone who says differently is selling something. Tools are necessary but not sufficient. Talking about DevOps is not DevOps.”
  • fun experiment: replace every occurrence of "DevOps" with "empathy" and see what happens ;-) (reminded me of the "butt plugin")

Cognitive Biases in Tech: Awareness of our own bugs in decision making

  • Talk is mainly backed by the book “Thinking, fast and slow”
  • Brain is divided into System 1 and System 2
  • System 2 gets tired easily: make important decisions in the morning (e. g. monolith vs. micro-service), postpone trivial ones to the evening (e. g. what to cook for dinner)
  • great hints for better post mortems

5 years of metrics and monitoring

  • great recap on infoq
  • You really have to focus on how to visualize stuff. Looks like there needs to be expertise for this in a company which wants to call itself "metrics-driven" or "data-driven"
  • We have to be aware of alert fatigue:
    • noise vs. signal
    • not reacting to alerts anymore, because “they will self-heal anyway in a few minutes” (we call this “troll-alert” internally, which is a very bad description for an alert coming from a non-human system which is apparently not able to troll)

Ignites

Repository as a deployment artifact - Inny So

  • talking about apt.ly - application+environment as atomic release tags

Day 2

Running a fully self-organizing/self-managing team (or company)

  • good recap at infoq
  • interesting open allocation work model, but with managers, feedback loops, retrospectives and planning meetings. They call it “self-selection”
  • it’s sometimes better to let people go instead of trying to keep them
  • people need explicit constraints to work in, otherwise they don't know their own and others' boundaries

[Automation with humans in mind: making complex systems predictable, reliable and humane](https://gist.github.com/s0enke/0ac2f6a0cce307d9cddc)

Open spaces

Internal PaaS

I hosted a session on "Why/How to build an internal PaaS". The reason for doing this is to build a foundation for (micro-)services: feature teams should be able to easily deploy new services (time to market < 1 hour). They should not have to care about building their own infrastructure for deployment of applications in different languages (PHP, Ruby, Java, Python, Go …), metrics, monitoring, databases, caches etc.

So I had a quick look at e.g. flynn.io or deis.io, which claim to do what I want, and I hoped someone actually using tools like that might be in the room.

The session itself was a bit clumsy: I guess I couldn't explain my problem well enough, or it actually is not a problem. Or the problem is too new, as there was no one in the room who had more than 2 microservices deployed.

But anyway, talking about what I want to achieve actually helped me to shape my thoughts.

Microservices - What is important for Ops
  • Session hosted by MBS
  • If a company wants to migrate to / implement microservices, an Ops team should insist on 12-factor-app style in order to have standardization
  • Have a look at Simian Army which has been implemented to mitigate common bad practices in microservice architecture, e. g. make everyone aware of fallacies of distributed computing.
  • Not really for Ops, but for devs:
    • EBI / hexagonal programming style from the beginning on, so it doesn't matter (in theory) whether the architecture is monolithic or service-oriented; in theory it's easy to switch
    • Jeff Bezos Rules
    • Generally having a look at Domain-Driven Design and orienting the design towards it (e. g. using repositories and entities instead of ActiveRecord)

All the videos

on ustream

Other Recaps

External MySQL slaves with RDS reloaded

In an earlier post I demonstrated a way to connect an external slave to a running RDS instance. Later, AWS added native support for importing and exporting via replication.

In my case, several problems popped up:

  • My initial blog post did not show how to start from an existing data set, e. g. how to do a mysqldump and import it
  • RDS does not allow --master-data mysqldumps, as "FLUSH TABLES WITH READ LOCK" is forbidden in RDS
  • So we do not have any chance to get the exact starting point for reading the binlog on the slave.

The RDS documentation describes how to export data, but not a longer-lasting soft migration. For me it's critical to have working replication to an on-premise MySQL server over several months, not only for a few hours to export my data. Actually, we are migrating into AWS and have to connect our old replication chain to the AWS RDS master.

Another point: the RDS documentation is unclear and even buggy. For example, it states:

Run the MySQL SHOW SLAVE STATUS statement against the MySQL instance running external to Amazon RDS, and note the master_host, master_port, master_log_file, and read_master_log_pos values.

But then

Specify the master_host, master_port, master_log_file, and read_master_log_pos values you got from the Mysql SHOW SLAVE STATUS statement you ran on the RDS read replica.

Ok, to which master shall I connect? The MySQL instance outside of RDS should not have any master host data set yet, because it's a fresh instance. And the master host on the read replica is a private RDS network address, so we could never connect to that from our VPC.

Next point: RDS lets us set a binlog retention time, which is NULL by default. That means binlogs are purged as fast as possible. We had the following case with an externally connected slave: the slave disconnected because of some network problem and could not reconnect for some hours. In the meantime, the RDS master had already purged the binary logs, and thus the slave could not replicate anymore:

Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

So I was forced to find a solution to set up a fresh external slave from an existing RDS master, and without any downtime of the master, because it's mission-critical!

First, I contemplated a plan with master downtime, in order to exercise the simple case.

Here is the plan:

  1. Set binlog retention on the RDS master to a high value so you are armed against potential network failures:

    > call mysql.rds_set_configuration('binlog retention hours', 24*14);  
    Query OK, 0 rows affected (0.10 sec)  
    
    > call mysql.rds_show_configuration;  
    +------------------------+-------+------------------------------------------------------------------------------------------------------+  
    | name                   | value | description                                                                                          |  
    +------------------------+-------+------------------------------------------------------------------------------------------------------+  
    | binlog retention hours | 336   | binlog retention hours specifies the duration in hours before binary logs are automatically deleted. |  
    +------------------------+-------+------------------------------------------------------------------------------------------------------+  
    1 row in set (0.14 sec)  
    
  2. Deny all application access to the RDS database so no new writes can happen and the binlog position stays the same. Do that by removing inbound port 3306 access rules (except your admin connection) from the security groups attached to your RDS instance. Write them down because you have to re-add them later. At this time your master is “offline”.

  3. Get the current binlog file and position from the master; do it at least twice and wait some seconds in between in order to validate that it does not change anymore. Also check with SHOW PROCESSLIST that you and rdsadmin are the only users connected to the RDS master.
  4. Get a mysqldump (without locking which is forbidden by RDS, as stated above):

    $ mysqldump -h <read replica endpoint> -u <user> -p<password> --single-transaction --routines --triggers --databases <list of databases> | gzip > mydump.sql.gz
    
  5. rsync/scp to slave

  6. call STOP SLAVE on your broken or new external slave
  7. Import dump
  8. Set binlog position on the external slave (I assume the remaining slave settings, e. g. credentials, are already set up).

    CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin-changelog.021761', MASTER_LOG_POS=120
    
  9. Re-add RDS master ingress security rules (or at least add the inbound security rule which allows the external slave to connect to the RDS master).

  10. Start external slave. The slave should now catch up with the RDS master.
  11. Re-add remaining RDS master security group ingress rules if any.

Ok, now we know how to do it with downtime. This might be OK for testing and staging environments, but not for production databases.

How can we do it without downtime of the RDS master?

The AWS manual says we should create an RDS read replica and mysqldump the read replica instead of the master, but it is unclear and buggy when it comes to obtaining the master binlog position.

But using a read replica is actually the first correct step.

So here is my alternative plan:

Spin up a read replica and stop the replication manually:

> CALL mysql.rds_stop_replication;  
+---------------------------+  
| Message                   |  
+---------------------------+  
| Slave is down or disabled |  
+---------------------------+  
1 row in set (1.10 sec)

Now we can see which master binlog position the slave is currently at via the Exec_Master_Log_Pos variable. This is the pointer into the binlog of the RDS master, and thus we now know the exact position from which to start after setting up our new external slave. The second value we need to know is the binlog file name, which is Relay_Master_Log_File - for example:

Relay_Master_Log_File: mysql-bin-changelog.022019  
  Exec_Master_Log_Pos: 422

As the MySQL documentation states:

The position in the current master binary log file to which the SQL thread has read and executed, marking the start of the next transaction or event to be processed. You can use this value with the CHANGE MASTER TO statement's MASTER_LOG_POS option when starting a new slave from an existing slave, so that the new slave reads from this point. The coordinates given by (Relay_Master_Log_File, Exec_Master_Log_Pos) in the master's binary log correspond to the coordinates given by (Relay_Log_File, Relay_Log_Pos) in the relay log.

Now we have the two values we need, and we have a consistent state to create a dump from, because the read replica has stopped replicating.

$ mysqldump -h <read replica endpoint> -u <user> -p<password> --single-transaction --routines --triggers --databases <list of databases> | gzip > mydump.sql.gz

Now follow steps 5-8 and 10 from above.
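
Applied to the example values above, step 8 (and step 10) on the external slave would then look roughly like this (host and credentials are assumed to be configured already, as noted in step 8):

CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin-changelog.022019', MASTER_LOG_POS=422;
START SLAVE;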

By now you should have a running external slave which is connected to the RDS master. You may also delete the RDS read replica again.

Happy replicating!

Replicating AWS RDS MySQL databases to external slaves

Update: Using an external slave with an RDS master is now possible, as well as using RDS as a slave with an external master.

Connecting external MySQL slaves to AWS RDS MySQL instances is one of the most wanted features, for example to enable migration strategies into and out of RDS or to support strange replication chains for legacy apps. Listening to binlog updates is also a great way to update search indexes or invalidate caches.

As of now, with the release of MySQL 5.6 in RDS, it is possible to access binary logs from outside RDS. What Amazon does not mention is the possibility of connecting external slaves to RDS.

Here is the proof of concept (the details of how to set up master/slave replication are not the focus here :-)).

First, we create a new database in RDS, somewhat like this:

soenke♥kellerautomat:~$ rds-create-db-instance soenketest --backup-retention-period 1 --db-name testing --db-security-groups soenketesting --db-instance-class db.m1.small --engine mysql --engine-version 5.6.12 --master-user-password testing123 --master-username root --allocated-storage 5 --region us-east-1 
DBINSTANCE  soenketest  db.m1.small  mysql  5  root  creating  1  ****  n  5.6.12  general-public-license
      SECGROUP  soenketesting  active
      PARAMGRP  default.mysql5.6  in-sync
      OPTIONGROUP  default:mysql-5-6  in-sync  

So first let's check if binlogs are enabled on the newly created RDS database:

master-mysql> show variables like 'log_bin';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+
1 row in set (0.12 sec)

master-mysql> show master status;
+----------------------------+----------+--------------+------------------+-------------------+
| File                       | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+----------------------------+----------+--------------+------------------+-------------------+
| mysql-bin-changelog.000060 |      120 |              |                  |                   |
+----------------------------+----------+--------------+------------------+-------------------+
1 row in set (0.12 sec)

Great! Let's have another check with the mysqlbinlog tool, as described in the RDS docs.

But first we have to create a user on the RDS instance which will be used by the connecting slave.

master-mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'slavepass';
Query OK, 0 rows affected (0.13 sec)

master-mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
Query OK, 0 rows affected (0.12 sec)

Now let's have a look at the binlog:

soenke♥kellerautomat:~$ mysqlbinlog -h soenketest.something.us-east-1.rds.amazonaws.com -u repl -pslavepass --read-from-remote-server -t mysql-bin-changelog.000060
...
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
CREATE USER 'repl'@'%' IDENTIFIED BY PASSWORD '*809534247D21AC735802078139D8A854F45C31F3'
/*!*/;
# at 582
#130706 20:12:02 server id 933302652  end_log_pos 705 CRC32 0xc2729566  Query   thread_id=66    exec_time=0     error_code=0
SET TIMESTAMP=1373134322/*!*/;
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%'
/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

As we can see, even the grants have been written to the RDS binlog. Great! Now let's try to connect a real slave! Just set up a vanilla MySQL server somewhere (local, Vagrant, whatever) and assign a server-id to the slave. RDS uses some (apparently) random server-ids like 1517654908 or 933302652, so I currently don't know how to make sure there are no conflicts with external slaves. This might be one of the reasons AWS doesn't advertise the fact that connecting slaves is actually possible.

After setting the server-id and optionally a database to replicate:

server-id       =  12345678
replicate-do-db=soenketesting

let's restart the slave DB and try to connect it to the master:

slave-mysql> change master to master_host='soenketest.something.us-east-1.rds.amazonaws.com', master_password='slavepass', master_user='repl', master_log_file='mysql-bin-changelog.000067', master_log_pos=0;
Query OK, 0 rows affected, 2 warnings (0.07 sec)

slave-mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

And BAM, it’s replicating:

slave-mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: soenketest.something.us-east-1.rds.amazonaws.com
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin-changelog.000068
          Read_Master_Log_Pos: 422
               Relay_Log_File: mysqld-relay-bin.000004
                Relay_Log_Pos: 595
        Relay_Master_Log_File: mysql-bin-changelog.000068
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: soenketesting
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 422
              Relay_Log_Space: 826
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 933302652
                  Master_UUID: ec0eef96-a6e9-11e2-bdf0-0015174ecc8e
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
1 row in set (0.00 sec)

So let's issue some statements on the master:

master-mysql> create database soenketesting;
Query OK, 1 row affected (0.12 sec)
master-mysql> use soenketesting
Database changed
master-mysql> create table example (id int, data varchar(100));
Query OK, 0 rows affected (0.19 sec)

And it’s getting replicated:

slave-mysql> use soenketesting;
Database changed
slave-mysql> show create table example\G
*************************** 1. row ***************************
       Table: example
Create Table: CREATE TABLE `example` (
  `id` int(11) DEFAULT NULL,
  `data` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)