ansible Troubleshooting

I am working through an ansible play setup where a notifyCommit from GitHub (or a curl from the command line…) goes to a Jenkins controller, which pulls a Jenkinsfile out of the repo to guide a pipeline build; that Jenkinsfile first calls an installer.yml play.

That installer construct is a git submodule in the application repo, which keeps the code stable across all of the builds.

It brings in and assembles the role(s) needed for a build, and writes an inventory-packages file into /etc/ on the instance image, listing both what it was configured to install and what actually came through.
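
A rough sketch of that manifest-writing step, assuming a hypothetical installed_packages list variable and file name (this is not the actual installer code):

- name: installer | record configured and installed packages
  copy:
    dest: /etc/inventory-packages        # hypothetical file name
    content: |
      {% for pkg in installed_packages %}
      {{ pkg }}
      {% endfor %}
    owner: root
    mode: "0644"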

Then packer is called by Jenkins in the Jenkinsfile; packer provisions with ansible, up to a point. Part of those plays is installing yet more ansible plays that run at boot of the instance and do the final on-the-fly configuration for keys, secrets, ssl certs, and consul service discovery.
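
The mechanics of that can be as simple as staging the plays on the image and wiring rc.local to run them at boot; roughly along these lines, where the paths and playbook name are my placeholders rather than the real layout:

- name: at-boot | stage the boot-time playbooks on the image
  copy:
    src: at-boot/                        # placeholder directory of boot-time plays
    dest: /opt/at-boot/

- name: at-boot | run the staged playbook from rc.local at first boot
  lineinfile:
    path: /etc/rc.local
    line: "ansible-playbook -c local /opt/at-boot/standalone.yml >> /var/log/at-boot.log 2>&1"
    insertbefore: "^exit 0"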

The at-boot play also imports playbooks used for consul + vault instantiation.
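
In the play itself that is just top-level import_playbook statements, something like (the playbook names are placeholders):

- import_playbook: consul.yml
- import_playbook: vault.yml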

I found that troubleshooting through layer after layer after layer, trying to trace the threads of files and manipulations through, was time consuming. The Jenkinsfile and then the packer and ansible plays were relatively direct: I log onto the agent instance building the job and go look at the build. packer creates an output file that can be tailed using:

cd workspace
tail -f `du -ak | grep output | awk '{ print $2 }'`

From that you can follow along directly. I use “- name: name of playbook or other significant string | actions being taken” to make each play clear in the logs.
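
For example, a task might be named so the log line reads “installer | install base packages”; the task body and the base_packages variable here are just my stand-ins to show the pattern:

- name: installer | install base packages
  package:
    name: "{{ base_packages }}"    # stand-in variable
    state: present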

Once you get to the at-boot plays, though, all you can do is log on afterwards and look at the aftermath. I found a better path.

I set

- name: standalone | wait...
  pause:
    seconds:  120

This was originally added to move execution away from boot – to let things settle before instantiating vault+consul on the box. I used to put the at-boot run into rc.local with an “at +2 <” start, forcing it into the at mechanism to run 2 minutes later. That mostly seemed to work, but having the pause inside the play worked better in this case. The pause also added useful output to the logging:

1540307739,,ui,message,    us-west-1-dev: TASK [get github access | pause 42 seconds] ************************************
1540307739,,ui,message,    us-west-1-dev: Pausing for 42 seconds
1540307739,,ui,message,    us-west-1-dev: (ctrl+C then ‘C’ = continue early%!(PACKER_COMMA) ctrl+C then ‘A’ = abort)

If this play were run on the command line, you could continue early with ctrl-C then C. In the logs you just get the pause logged.

To work through the development of the at-boot piece(s) of this, I added this 2-minute pause. Then, once the image is created, I launch the test instance, log on as soon as ssh is available, and kill off the script. Then I go into the playbook and add

- name: debug | pause to go see
  pause:
    prompt: "take a look..."

just past each step I want to verify.

All of this will eventually have test coverage – that’s on the map – but it’s a few months out.

With these pauses added, I run the playbook using

ansible-playbook -c local name_of_playbook.yml

and then, as each pause triggers, I go look at the result in a separate window. Thus I can step through breakpoints in the plays, see if they are really doing what was expected, and fix and rework on the fly. The final test is to incorporate all those changes into a new image and allow it to boot and run the scripts without interaction. But this approach allows fixing the pieces that look like they should have worked but have a missing letter, or path element, or permissions issue, or whatever. That can all get resolved interactively, in the environment where the play will actually run, fixing and then re-running as needed.

— doug