SSL and Consul Service Discovery

Nebula is an open source CICD pipeline. It was developed using Jenkins, Ansible, Packer, Python, and AWS as an end-to-end CICD product, distributed across individual self-healing clouds, so it is recoverable and massively scalable, and it spins down in AWS to minimize cost when idling.

About four weeks ago I started bringing all of the products together on a single instance, creating Nebula-in-a-Box, or Orion. This would give us a test bed for pushing out the open sourced version of this product and also a way to hand it to someone with minimal prep and setup: just point your GitHub notifyCommits service at it and build. To do this I disassembled the Ansible roles out of the Jenkins builds themselves and created the beginnings of an installer that re-assembles multiple roles with sane defaults but also allows every repo to be explicitly defined. Then I created the Nebula-in-a-Box build to bring all of these pieces together on a single AMI (and eventually a single Docker image, or VirtualBox…).

That splitting out of the Ansible roles was the first re-architecture. The second was to automatically configure and bring up Consul + Vault with working Consul service discovery. That’s relatively easy on the single instance, but one of the requirements was the ability to bring up AWS cloud agents at will and as needed and have them also participate in the Consul service discovery, thus allowing us to do things like set the Vault URL to “vault.service.default.consul” and the Nebula utility URL to “nebula.service.default.consul”. It looks like that will continue to work as this gets spread out onto individual instances, and eventually it forms the pattern for a consolidated service discovery domain for the robust multiple-cloud build farms. The AMIs have to carry at-boot scripts that create and configure the token and encrypt values for the local Consul, and the controller has to place config files on agents when they are launched; those run at agent startup, re-configure and reset the Consul and DNS configs, and have the agent automatically join the existing cluster before it reports ready to run.
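The join itself is simple once the shared encrypt and token values exist. Something along these lines would do it; this is only a sketch of the idea, not the actual role, and controller_addr is just a stand-in for however the controller’s private address gets handed to the agent at launch:

    # at agent startup: point the local consul agent at the controller's existing cluster
    # (controller_addr is a hypothetical variable; the real plays inject the address at launch)
    - name: agent at-boot | tell the local consul agent where the existing cluster lives
      lineinfile:
        path: /etc/consul.conf
        regexp: '"retry_join":'
        line: "  \"retry_join\": [\"{{ controller_addr }}\"],"
        insertafter: '"bind_addr":'

    # assumption: consul runs under a service unit named "consul"
    - name: agent at-boot | restart consul so it joins the existing cluster
      service:
        name: consul
        state: restarted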

This also required carefully configuring the controller instance, the main instance, to own leadership of the Consul cluster rather than expecting leadership to be determined by a quorum of three or more servers. Much of the time there is only a single controller, with agents spun up as needed and then dropped. No quorum. No survivability, except that the intent is that these are throw-away stateless instances: one dies, just bring up another and go.
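In Consul terms that boils down to a couple of keys on the controller’s side. Here is a sketch, assuming the controller reuses the same /etc/consul.conf template the agents get (the agent version appears below): flip “server” and “bootstrap” to true so a lone node elects itself leader immediately instead of waiting for a quorum.

    # controller at-boot: a single node acts as the whole consul server side,
    # so it bootstraps and elects itself leader with no quorum
    - name: controller at-boot | mark this node as the consul server
      lineinfile:
        path: /etc/consul.conf
        regexp: '"server":'
        line: "  \"server\": true,"

    - name: controller at-boot | allow a single server to bootstrap itself
      lineinfile:
        path: /etc/consul.conf
        regexp: '"bootstrap":'
        line: "  \"bootstrap\": true,"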

In developing that, the parts that failed were the ones at the very end, after the AMI is built: the execution of the at-boot scripts, first on the controller (the jenkins+consul+vault+nebula api instance), then at launch of an agent from that controller. When something goes wrong at that point, the cycle is: fix it in the script on the agent instance, verify it works, place it and check it in for the build, rebuild the blank AMI, spin up the instance from that new AMI, launch an agent, verify… Using chopsticks to manipulate chopsticks to manipulate chopsticks, basically.

Got that all working. Except – except that the Ansible plays we had been using to bring up Vault and Consul on the fly formatted the SSL cert used by Vault and Consul incorrectly, and the alt_names and CN for the cert came out completely unusable. Nothing lined up with the instance, and Vault immediately refused to accept the cert.

Consul service discovery by default works across datacenters. The URL Consul automatically creates for a service is name.service.datacenter.consul (in this case the datacenter is “default”). I wanted a certificate that aligned with Consul service discovery, and that datacenter string was the unique key.

That datacenter string shows up in consul.conf:

{
  "ui": false,
  "disable_remote_exec": true,
  "domain": "consul.",
  "data_dir": "/opt/consul/data",
  "log_level": "INFO",
  "server": false,
  "client_addr": "0.0.0.0",
  "bind_addr": "0.0.0.0",
  "datacenter": "default",
  "bootstrap": false,
  "encrypt": "{{ cryptvalue.stdout }}",
  "rejoin_after_leave": true,
  "leave_on_terminate": true,
  "acl_datacenter": "{{ uniqueid.stdout }}",
  "acl_default_policy": "allow",
  "acl_down_policy": "allow",
  "acl_master_token": "{{ tokenvalue.stdout }}",
  "ports": { "dns": 53, "https": 8543 },
  "cert_file": "/etc/pki/tls/certs/consul_cert.pem",
  "key_file": "/etc/pki/tls/private/consul_key.pem"
}
"datacenter": "default",

is the key…

That’s the default in consul.service.default.consul for example.
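To make it concrete, that string is what ends up in every DNS lookup the services do against the local agent (which is listening on port 53 per the config above). Purely as an illustration:

    # illustrative only: resolve vault through consul's dns interface
    - name: sanity check | look up vault via consul service discovery
      command: dig +short @127.0.0.1 -p 53 vault.service.default.consul
      register: vault_lookup
      changed_when: false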

The first step to using it is to generate a unique 32-character string (Ansible):

    # generate random consul datacenter name
    - name: standalone at-boot | generate random datacenter name for use by consul service discovery
      shell: "cat /dev/urandom | tr -dc 'a-zA-Z' | fold -w 32 | head -n 1"
      register: uniqueid
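Nothing else uses it yet at this point; if you want to eyeball what came out while debugging, a throwaway debug task will do (illustrative only, not part of the build):

    # illustrative only: show the random datacenter name the rest of the plays will use
    - name: standalone at-boot | show generated datacenter name
      debug:
        msg: "consul datacenter will be {{ uniqueid.stdout }}"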

Then generate the cert…

    - name: standalone at-boot | generate the cert + key for consul
      command: >
        openssl req -new -nodes -x509
        -subj "/CN=consul.service.{{ uniqueid.stdout }}.consul"
        -days 3650
        -keyout /etc/pki/tls/private/consul_key.pem
        -out /etc/pki/tls/certs/consul_cert.pem
        -extensions v3_ca
      args:
        creates: /etc/pki/tls/certs/consul_cert.pem
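A quick way to confirm the naming came out right, again just an illustrative check rather than part of the build, is to print the subject back; it should read CN=consul.service.<that random string>.consul:

    # illustrative only: print the subject of the freshly generated cert
    - name: standalone at-boot | verify the cert subject lines up with consul service discovery
      command: openssl x509 -noout -subject -in /etc/pki/tls/certs/consul_cert.pem
      register: cert_subject
      changed_when: false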

and then drop that into consul.conf:

    # add uniqueid from above for datacenter string
    - name: standalone at-boot | insert uniqueid.stdout as datacenter string
      lineinfile:
        path: /etc/consul.conf
        regexp: '"datacenter":'
        line: "  \"datacenter\": \"{{ uniqueid.stdout }}\","

And with that we get a certificate with *.service.randomstring.consul as the naming. Vault, and anyone talking with it, is now happy.

— doug