Network Brouhaha: Networking, Cloud, Automation, Infrastructure, Containers and General Geekery (http://www.networkbrouhaha.com/)

Cloud Director V to T Migration Videos

Recently I recorded a couple of videos with my teammate, Joseph Polcar, on Cloud Director V to T migration. The first video provides an overview of the migration tool, walks through running and evaluating an assessment, and covers the other steps needed to prepare for a migration. The second video provides an overview of the YAML configuration file used by the migration tool, a walkthrough of what happens during each phase of the migration, and a demonstration of how to perform a rollback. Hopefully you find these videos helpful. Feel free to leave any questions in the comments, or contact me on LinkedIn. ✌️

Fri, 31 Mar 2023 00:00:00 +0000 http://www.networkbrouhaha.com/2023/03/vcd-v2t-videos/
Introducing IP Spaces for VMware Cloud Director <p class="center"><a href="/resources/2023/01/sd-computer-network.png" class="drop-shadow"><img src="/resources/2023/01/sd-computer-network.png" alt="" /></a></p> <p>Welcome! This blog post is about a new feature in <a href="https://www.vmware.com/products/cloud-director.html">VMware Cloud Director</a> (VCD), IP Spaces. As a VMware employee, I want to make it clear that the thoughts and opinions expressed in this post are my own and do not necessarily reflect the position of my employer. With that out of the way, let’s try to wrap our heads around IP Spaces in Cloud Director!</p> <p>I find myself asking the question “Why?” frequently in customer conversations (shout out to <a href="https://www.ted.com/talks/simon_sinek_how_great_leaders_inspire_action">Simon Sinek</a> and the <a href="https://simonsinek.com/books/start-with-why/">Golden Circle</a>!). In this blog post, my goal is to get to the “why” of IP Spaces. I will touch on the “how” and “what”, but these are fully covered in the Cloud Director documentation and other blog posts, which are linked at the bottom of this post.</p> <p class="center"><a href="/resources/2023/01/golden-circle2.png" class="drop-shadow"><img src="/resources/2023/01/golden-circle2.png" alt="" width="400" /></a> <br /><em>Simon Sinek’s Golden Circle</em></p> <h1 id="the-background">The Background</h1> <p>When backed by NSX-V, IP address management in Cloud Director is simple. The typical architecture consists of an external network to which tenant edge gateways connect. The provider specifies a block of usable IPs that can be assigned to the external interface of each edge. If needed, additional IPs can be pulled from the block and assigned to the edge external interface for NAT, Load Balancing VIPs, VPN endpoints, etc. 
Everything the tenant needs to connect to the outside world can be accomplished by assigning one or more IPs to an edge interface, and routing is very simple.</p> <p class="center"><a href="/resources/2023/01/vcd-nsxv-connectivity.png" class="drop-shadow"><img src="/resources/2023/01/vcd-nsxv-connectivity.png" alt="" /></a> <br /><em>Cloud Director External Connectivity with NSX-V</em></p> <p>External connectivity is quite different when Cloud Director is backed by NSX-T. External networking is provided via a T0 Gateway, which is created by the provider and imported into Cloud Director. Each tenant edge gateway is a T1 router that is connected to the T0 (or in some cases, a T0 VRF). Addresses used by the tenant are no longer assigned to an interface, but rather assigned via an endpoint IP, which is essentially a loopback address assigned to the T1. Since there are now multiple hops to get from the data center network, through the T0, to the tenant T1, dynamic routing (e.g. BGP) is typically used to advertise the endpoint IPs that are assigned to the T1. These endpoint IPs can be used to SNAT workloads to the internet or terminate IPsec tunnels, providing very similar functionality to what is available in NSX-V.</p> <p>This change in behavior led to IP address sprawl, and providers struggled to keep track of which tenants were using which IPs. To address this challenge, IP Spaces was born.</p> <p class="center"><a href="/resources/2023/01/vcd-nsxt-connectivity.png" class="drop-shadow"><img src="/resources/2023/01/vcd-nsxt-connectivity.png" alt="" /></a> <br /><em>Cloud Director External Connectivity with NSX-T</em></p> <h1 id="ip-spaces-overview">IP Spaces Overview</h1> <p>In VCD 10.4.1, there is a new configuration section to define IP Spaces. IP Spaces can be Public, Private, or Shared. Public IP Spaces are defined by the provider and specify what public IPs can be consumed by tenants. 
Private IP Spaces are defined by the tenant and are intended to simplify the process of connecting a tenant virtual data center (VDC) to a corporate WAN. Shared IP Spaces are similar to Private IP Spaces, but give providers a streamlined way to deliver dedicated services to tenants, such as NTP, software repos, managed services, etc.</p> <p>The scopes of an IP Space define which networks are internal and which are external, or in other words, which networks are local to VCD, and which are remote. If you are familiar with the old Cisco terminology for NAT, think inside and outside networks. Relating this to NAT is helpful because that is one of the primary reasons that these scopes are defined. In future VCD releases, this information may be used to automatically create NAT and NONAT rules to simplify the configuration of typical architectures.</p> <p>Rounding out the concepts included in an IP Space are IP ranges, IP prefixes, and quota settings. IP ranges can be supplied in list form or CIDR notation and must be within the range defined as the internal scope. Tenants can request individual IPs out of the range to assign for services like NAT or a load balancer VIP. IP prefixes are also constrained to the internal scope, and they define specific subnets that tenants can consume. Quota settings define how many individual IPs and prefixes each tenant can use.</p> <h1 id="the-why">The Why</h1> <p>Defining these parameters – IP Space type, scope, ranges, prefixes, and quotas – provides VCD with far more information than was available with the basic IP address management in previous versions. Providers have fine-grained control over exactly which IP addresses and ranges tenants are allowed to consume. This also means that future VCD releases will have enough information to potentially configure NAT/NONAT rules, firewall rules, and BGP policy (prefix lists/filtering/etc.) for a variety of common topologies. 
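</p>

<p>The containment rule described in the overview (tenant IP ranges and prefixes must fall within the internal scope, and quotas cap how much a tenant can claim) can be sketched with Python’s <code>ipaddress</code> module. This is purely illustrative with made-up addresses, not VCD code:</p>

```python
import ipaddress

def within_scope(scope: str, candidate: str) -> bool:
    """Return True if a candidate prefix or /32 falls inside the internal scope CIDR."""
    scope_net = ipaddress.ip_network(scope)
    return ipaddress.ip_network(candidate).subnet_of(scope_net)

# Hypothetical internal scope and tenant requests
internal_scope = "192.0.2.0/24"
requests = ["192.0.2.64/26", "192.0.2.10/32", "198.51.100.0/28"]
for r in requests:
    print(r, "allowed" if within_scope(internal_scope, r) else "rejected")
```

<p>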
The initial release of IP Spaces is just the beginning, providing a much more manageable and coherent IP address management system for providers and tenants. I am looking forward to seeing what other new capabilities will be unlocked as this feature evolves.</p> <h1 id="helpful-links">Helpful Links</h1> <p>Release Notes: <a href="https://docs.vmware.com/en/VMware-Cloud-Director/10.4.1/rn/vmware-cloud-director-1041-release-notes/index.html">https://docs.vmware.com/en/VMware-Cloud-Director/10.4.1/rn/vmware-cloud-director-1041-release-notes/index.html</a></p> <p>Documentation: <a href="https://docs.vmware.com/en/VMware-Cloud-Director/10.4/VMware-Cloud-Director-Tenant-Portal-Guide/GUID-FB230D89-ACBC-4345-A11A-D099D359ED1B.html">https://docs.vmware.com/en/VMware-Cloud-Director/10.4/VMware-Cloud-Director-Tenant-Portal-Guide/GUID-FB230D89-ACBC-4345-A11A-D099D359ED1B.html</a></p> <p>Other blog posts on IP Spaces:</p> <ul> <li>New Networking Features in VMware Cloud Director 10.4.1: <a href="https://fojta.wordpress.com/2022/12/16/new-networking-features-in-vmware-cloud-director-10-4-1/">https://fojta.wordpress.com/2022/12/16/new-networking-features-in-vmware-cloud-director-10-4-1/</a></li> <li>IP Spaces in VMware Cloud Director 10.4.1 – Part 1 – Introduction &amp; Public IP Spaces: <a href="https://kiwicloud.ninja/?p=69005">https://kiwicloud.ninja/?p=69005</a></li> <li>IP Spaces in VMware Cloud Director 10.4.1 – Part 2 – Private IP Spaces: <a href="https://kiwicloud.ninja/?p=69028">https://kiwicloud.ninja/?p=69028</a></li> <li>IP Spaces in VMware Cloud Director 10.4.1 – Part 3 – Tenant Experience, Compatibility &amp; Summary: <a href="https://kiwicloud.ninja/?p=69044">https://kiwicloud.ninja/?p=69044</a></li> </ul> <h1 id="notes">Notes</h1> <p>The <a href="/resources/2023/01/sd-computer-network.png">two</a> <a href="/resources/2023/01/golden-circle2.png">images</a> at the top of this post were made using <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Stable 
Diffusion</a>, an AI image generator. The first was generated by a prompt to create a picture with computer networking and clouds. The second was created by modifying a <a href="/resources/2023/01/golden-circle.png">simple diagram</a> using pix2pix and img2img. I find it weird, and I like it.</p> Tue, 31 Jan 2023 00:00:00 +0000 http://www.networkbrouhaha.com/2023/01/vcd-intro-ip-spaces/ 2022 Update: Simple Cloud Automation with VCD, Terraform, ZeroTier and Slack <p>In 2018 I wrote a blog titled <a href="https://networkbrouhaha.com/2018/03/vcd-terraform-example/">Simple cloud automation with vCD, Terraform, ZeroTier and Slack</a>. A lot has changed since I wrote that post, so it’s time to update it. The goal is still the same: deploy a VM (inside a vApp) in Cloud Director and automate network connectivity with ZeroTier. Slack is used to monitor the progress and display the IP address assigned by ZeroTier. Overall, I want to be able to deploy a VM that has outbound internet connectivity and be able to connect to it without having to configure any firewall rules, NAT, or SSL/IPsec VPN.</p> <p>I did make some adjustments to my approach while preparing to write this post. Instead of relying on <a href="https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vm_admin.doc/GUID-58E346FF-83AE-42B8-BE58-253641D257BC.html">Guest Customization</a> with VMware tools, I chose to use <a href="https://cloud-init.io/">cloud-init</a>. This went so poorly that I wrote a dedicated post on it 😂: <a href="https://networkbrouhaha.com/2022/03/cloud-init-vcd/">Using cloud-init for Customization with VCD and Terraform</a>. 
VCD also has a completely different <a href="https://registry.terraform.io/providers/vmware/vcd/latest">Terraform provider</a> than the one I demoed in 2018, which I will dig into at the end of this post.</p> <h1 id="tools-used-and-prerequisites">Tools Used and Prerequisites</h1> <ul> <li><a href="https://www.vmware.com/products/cloud-director.html">VMware Cloud Director</a> - VMware’s cloud service delivery platform, typically used by service providers in the VMware Cloud Provider Program. I used VCD 10.3 in my lab when using the Terraform code you will see below.</li> <li><a href="https://terraform.io/">HashiCorp Terraform</a> - An open-source tool written in Go, Terraform allows users to define infrastructure as code. Many public cloud <a href="https://registry.terraform.io/browse/providers">providers</a> are supported in Terraform, as well as on-prem infrastructure like <a href="https://registry.terraform.io/providers/hashicorp/vsphere/latest">vSphere</a> and <a href="https://registry.terraform.io/providers/vmware/nsxt/latest">NSX-T</a>. The Terraform provider for VCD is available at <a href="https://registry.terraform.io/providers/vmware/vcd/latest">https://registry.terraform.io/providers/vmware/vcd/latest</a>.</li> <li><a href="https://www.zerotier.com/">ZeroTier</a> - The ZeroTier docs state that “ZeroTier is a smart Ethernet switch for planet Earth.” ZeroTier uses an agent to provide connectivity between endpoints connected to the same ZeroTier network. Anyone can create a free account on the ZeroTier website and create multiple networks. Endpoints connected to ZeroTier are managed through the web portal (or API). In other words, ZeroTier is a simple, free<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, fast<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> VPN. 
If you’re wondering how ZeroTier works, check out their awesome <a href="https://docs.zerotier.com/zerotier/manual">whitepaper</a>. My friend and uber-network nerd <a href="https://twitter.com/showipintbri">Tony Efantis</a> provides a deep dive into ZeroTier on YouTube: <a href="https://www.youtube.com/watch?v=Lao9T_RQTak">https://www.youtube.com/watch?v=Lao9T_RQTak</a></li> <li><a href="https://slack.com/">Slack</a> - I’m assuming everyone is familiar with Slack by now. For this example, Slack is used to provide visibility into the process of connecting a new VM to ZeroTier. Slack’s free tier is great for testing simple automation and receiving notifications via webhooks.</li> <li><a href="https://github.com/">GitHub</a> - I’m hosting scripts on GitHub, but any web host could fill this need. If you choose another host, you should still use Git for version control for Terraform code and other scripts. The current script I’m using is at <a href="https://github.com/shamsway/zerotier-installer">https://github.com/shamsway/zerotier-installer</a>, and it is a simplified and modified version of the install script provided by ZeroTier at <a href="https://install.zerotier.com/">https://install.zerotier.com/</a>.</li> </ul> <p>Before deploying anything with Terraform, I installed ZeroTier on my local workstation, uploaded an Ubuntu cloud image OVA to my VCD catalog, and configured an incoming webhook for Slack. 
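</p>

<p>A Slack incoming webhook is just an HTTP POST of a small JSON document with a <code>text</code> field. A minimal sketch with Python’s standard library (the webhook URL below is a placeholder; use the one Slack generates for your workspace):</p>

```python
import json
import urllib.request

def build_slack_message(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST that a Slack incoming webhook expects: JSON with a "text" field."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def notify_slack(webhook_url: str, text: str) -> bytes:
    """Send the message. Slack replies with the body b"ok" on success."""
    with urllib.request.urlopen(build_slack_message(webhook_url, text)) as resp:
        return resp.read()

# Example (placeholder URL):
# notify_slack("https://hooks.slack.com/services/T000/B000/XXXX", "New VM is online")
```

<p>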
My VCD environment is preconfigured to allow outbound internet traffic, but nothing else.</p> <h1 id="terraform-example">Terraform Example</h1> <p>Below is the <code class="language-plaintext highlighter-rouge">main.tf </code>file to create a vApp, attach an existing Org network to the vApp, and clone a VM into the vApp using cloud-init for customization.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">terraform</span> <span class="p">{</span> <span class="nx">required_providers</span> <span class="p">{</span> <span class="nx">vcd</span> <span class="p">=</span> <span class="p">{</span> <span class="nx">source</span> <span class="p">=</span> <span class="s2">"vmware/vcd"</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="k">variable</span> <span class="s2">"ztnetwork"</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="nx">string</span> <span class="nx">description</span> <span class="p">=</span> <span class="s2">"ZeroTier Network to join"</span> <span class="p">}</span> <span class="k">variable</span> <span class="s2">"ztapi"</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="nx">string</span> <span class="nx">sensitive</span> <span class="p">=</span> <span class="kc">true</span> <span class="nx">description</span> <span class="p">=</span> <span class="s2">"ZeroTier API Access Token"</span> <span class="p">}</span> <span class="k">variable</span> <span class="s2">"slack_webhook_url"</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="nx">string</span> <span class="nx">description</span> <span class="p">=</span> <span class="s2">"Slack webhook URL"</span> <span class="nx">default</span> <span class="p">=</span> <span class="s2">""</span> <span class="p">}</span> <span class="k">variable</span> <span 
class="s2">"vcd_vm_name"</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="nx">string</span> <span class="nx">description</span> <span class="p">=</span> <span class="s2">"Name of new vApp created from template"</span> <span class="p">}</span> <span class="k">resource</span> <span class="s2">"vcd_vapp"</span> <span class="s2">"ubuntu"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="s2">"my-org"</span> <span class="nx">vdc</span> <span class="p">=</span> <span class="s2">"my-vdc"</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"ubuntu"</span> <span class="nx">power_on</span> <span class="p">=</span> <span class="kc">true</span> <span class="p">}</span> <span class="k">resource</span> <span class="s2">"vcd_vapp_org_network"</span> <span class="s2">"ubuntu-network"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="s2">"my-org"</span> <span class="nx">vdc</span> <span class="p">=</span> <span class="s2">"my-vdc"</span> <span class="nx">vapp_name</span> <span class="p">=</span> <span class="nx">vcd_vapp</span><span class="p">.</span><span class="nx">ubuntu</span><span class="p">.</span><span class="nx">name</span> <span class="nx">org_network_name</span> <span class="p">=</span> <span class="s2">"org-network"</span> <span class="p">}</span> <span class="k">resource</span> <span class="s2">"vcd_vapp_vm"</span> <span class="s2">"ubuntu"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="s2">"my-org"</span> <span class="nx">vdc</span> <span class="p">=</span> <span class="s2">"my-vdc"</span> <span class="nx">vapp_name</span> <span class="p">=</span> <span class="s2">"ubuntu"</span> <span class="nx">catalog_name</span> <span class="p">=</span> <span class="s2">"my-catalog"</span> <span class="nx">template_name</span> <span class="p">=</span> <span 
class="s2">"ubuntu-2110-cloud"</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"ubuntu-vm"</span> <span class="nx">memory</span> <span class="p">=</span> <span class="mi">4096</span> <span class="nx">cpus</span> <span class="p">=</span> <span class="mi">1</span> <span class="nx">os_type</span> <span class="p">=</span> <span class="s2">"ubuntu64Guest"</span> <span class="nx">power_on</span> <span class="p">=</span> <span class="kc">true</span> <span class="nx">network</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="s2">"org"</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"org-network"</span> <span class="nx">ip_allocation_mode</span> <span class="p">=</span> <span class="s2">"MANUAL"</span> <span class="nx">ip</span> <span class="p">=</span> <span class="s2">"192.168.1.10"</span> <span class="p">}</span> <span class="nx">guest_properties</span> <span class="p">=</span> <span class="p">{</span> <span class="s2">"user-data"</span> <span class="p">=</span> <span class="nx">base64encode</span><span class="p">(</span><span class="nx">templatefile</span><span class="p">(</span><span class="s2">"cloud-config.yaml"</span><span class="p">,</span> <span class="p">{</span> <span class="nx">ztnetwork</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">ztnetwork</span><span class="p">,</span> <span class="nx">ztapi</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">ztapi</span><span class="p">,</span> <span class="nx">slack_webhook_url</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">slack_webhook_url</span><span class="p">,</span> <span class="nx">hostname</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">vcd_vm_name</span> <span class="p">}))</span> <span 
class="p">}</span> <span class="p">}</span> </code></pre></div></div> <p>Most of this is straightforward, but the magic happens in the <code class="language-plaintext highlighter-rouge">guest_properties</code> block of the <code class="language-plaintext highlighter-rouge">vcd_vapp_vm</code> resource. The <code class="language-plaintext highlighter-rouge">user-data</code> property contains a base 64 encoded version of my cloud-init configuration. You can see that the <code class="language-plaintext highlighter-rouge">templatefile()</code> function is used to insert some values needed for the ZeroTier install script: the ZeroTier network to connect to, an API key for ZeroTier, the webhook URL for Slack, and the VM hostname.</p> <p>Here is my cloud-config.yaml, which performs the customization of the VM upon first boot:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#cloud-config</span> <span class="na">hostname</span><span class="pi">:</span> <span class="s">${hostname}</span> <span class="na">users</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">ubuntu</span> <span class="na">sudo</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">ALL=(ALL)</span><span class="nv"> </span><span class="s">NOPASSWD:ALL"</span><span class="pi">]</span> <span class="na">groups</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">sudo</span><span class="pi">]</span> <span class="na">shell</span><span class="pi">:</span> <span class="s">/bin/bash</span> <span class="na">ssh_authorized_keys</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">ssh-rsa alongstringthatisansshkey</span> <span class="na">manage_resolv_conf</span><span class="pi">:</span> <span class="no">true</span> <span class="na">packages</span><span class="pi">:</span> <span 
class="pi">-</span> <span class="s">python3-pip</span> <span class="pi">-</span> <span class="s">jq</span> <span class="na">runcmd</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">export ZTNETWORK=${ztnetwork}</span> <span class="pi">-</span> <span class="s">export ZTAPI=${ztapi}</span> <span class="pi">-</span> <span class="s">export SLACK_WEBHOOK_URL=${slack_webhook_url}</span> <span class="pi">-</span> <span class="s">wget https://raw.githubusercontent.com/shamsway/zerotier-installer/master/zerotier-installer.sh</span> <span class="pi">-</span> <span class="s">chmod +x zerotier-installer.sh</span> <span class="pi">-</span> <span class="s">./zerotier-installer.sh</span> <span class="pi">-</span> <span class="s">rm zerotier-installer.sh</span> <span class="na">final_message</span><span class="pi">:</span> <span class="s2">"</span><span class="s">The</span><span class="nv"> </span><span class="s">system</span><span class="nv"> </span><span class="s">is</span><span class="nv"> </span><span class="s">ready</span><span class="nv"> </span><span class="s">and</span><span class="nv"> </span><span class="s">prepped</span><span class="nv"> </span><span class="s">(took</span><span class="nv"> </span><span class="s">$UPTIME</span><span class="nv"> </span><span class="s">seconds)"</span> </code></pre></div></div> <p>This cloud-init config will configure the local ubuntu user with sudo privileges, disable password-based logins, add my desired SSH key and install some necessary packages. 
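</p>

<p>Before this file reaches the VM, Terraform renders and encodes it: the <code>templatefile()</code> plus <code>base64encode()</code> combination in <code>main.tf</code> amounts to roughly the following Python. This is only a sketch with a trimmed-down template and hypothetical values, not part of the actual workflow:</p>

```python
import base64
from string import Template

# A trimmed stand-in for cloud-config.yaml; ${hostname} mirrors the Terraform template variable
cloud_config = "#cloud-config\nhostname: ${hostname}\n"

def render_user_data(template: str, **values: str) -> str:
    """Substitute ${var} placeholders and base64-encode the result,
    like templatefile() followed by base64encode()."""
    rendered = Template(template).substitute(values)  # string.Template also uses ${var} syntax
    return base64.b64encode(rendered.encode("utf-8")).decode("ascii")

user_data = render_user_data(cloud_config, hostname="ubuntu-vm")
print(user_data)  # the value that lands in the "user-data" guest property
```

<p>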
The <code class="language-plaintext highlighter-rouge">runcmd</code> block is the bit that actually downloads my ZeroTier installer from GitHub and executes it, connecting the VM to my ZeroTier network and providing output to Slack.</p> <p>Now, let’s see this in action.</p> <h1 id="workflow">Workflow</h1> <p>The output from <code class="language-plaintext highlighter-rouge">terraform apply</code> looks just as you’d expect if you’ve ever seen Terraform run:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Plan: 3 to add, 0 to change, 0 to destroy. Do you want to perform these actions? Terraform will perform the actions described above. Only 'yes' will be accepted to approve. Enter a value: yes vcd_vapp.ubuntu-zt: Creating... vcd_vapp.ubuntu-zt: Still creating... [10s elapsed] vcd_vapp.ubuntu-zt: Creation complete after 16s [id=urn:vcloud:vapp:db4d4ee7-b171-45dc-a98a-67cd717db127] vcd_vapp_org_network.ubuntu-zt-network: Creating... vcd_vapp_vm.ubuntu: Creating... vcd_vapp_org_network.ubuntu-zt-network: Creation complete after 5s [id=urn:vcloud:network:1b61037f-dc6d-4ae5-aefc-59962de1e647] vcd_vapp_vm.ubuntu: Still creating... [10s elapsed] [snip] vcd_vapp_vm.ubuntu: Creation complete after 1m58s [id=urn:vcloud:vm:d20caca3-8b80-45da-8435-c4d44c988ccb] Apply complete! Resources: 3 added, 0 changed, 0 destroyed. </code></pre></div></div> <p>VCD creates the vApp, clones a template VM into the vApp, and powers it on. When the VM boots, cloud-init runs and executes each step specified in cloud-config.yaml, which will ultimately connect the new VM to my ZeroTier network. API calls are used to authorize the new VM to connect to my ZeroTier network automatically, so I don’t have to go in and manually accept the new VM in the ZeroTier portal. 
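</p>

<p>The authorization call itself is small. As I understand the ZeroTier Central API, it is a POST to the network member endpoint setting <code>config.authorized</code>; the endpoint shape here is from memory, so verify against the current API docs before relying on it:</p>

```python
import json
import urllib.request

API_BASE = "https://api.zerotier.com/api/v1"

def build_authorize_request(network_id: str, node_id: str, token: str) -> urllib.request.Request:
    """Build the request that marks a member as authorized on a ZeroTier network."""
    body = json.dumps({"config": {"authorized": True}}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/network/{network_id}/member/{node_id}",
        data=body,
        headers={"Authorization": f"token {token}", "Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical IDs; to actually send it:
# req = build_authorize_request("8056c2e21c000001", "abcdef1234", "MY_API_TOKEN")
# urllib.request.urlopen(req)
```

<p>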
The process of connecting the VM to ZeroTier is output to Slack, and once complete I can grab the provided IP and immediately connect to the new VM.</p> <p class="center"><a href="/resources/2022/03/vcd-automation-slack.png" class="drop-shadow"><img src="/resources/2022/03/vcd-automation-slack.png" alt="" /></a></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>user@ubuntu:~$ ssh [email protected] The authenticity of host '172.29.189.205 (172.29.189.205)' can't be established. ECDSA key fingerprint is SHA256:sOGaDtQ6D6bvIhmr/YhKt6Olt9EsVNRNGAomfVuIW1o. Are you sure you want to continue connecting (yes/no/[fingerprint])? yes Warning: Permanently added '172.29.189.205' (ECDSA) to the list of known hosts. Welcome to Ubuntu 21.10 (GNU/Linux 5.13.0-28-generic x86_64) * Documentation: https://help.ubuntu.com * Management: https://landscape.canonical.com * Support: https://ubuntu.com/advantage System information as of Wed Mar 16 23:47:03 UTC 2022 System load: 0.03 Processes: 138 Usage of /: 23.0% of 9.52GB Users logged in: 0 Memory usage: 6% IPv4 address for ens192: 192.168.1.10 Swap usage: 0% IPv4 address for ztmjfe5xok: 172.29.189.205 The programs included with the Ubuntu system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright. Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law. To run a command as administrator (user "root"), use "sudo &lt;command&gt;". See "man sudo_root" for details. ubuntu@ubuntu-impish-21:~$ ping google.com PING google.com (142.250.191.238) 56(84) bytes of data. 
64 bytes from ord38s32-in-f14.1e100.net (142.250.191.238): icmp_seq=1 ttl=113 time=3.54 ms 64 bytes from ord38s32-in-f14.1e100.net (142.250.191.238): icmp_seq=2 ttl=113 time=3.60 ms </code></pre></div></div> <p>Notice that SSH key-based authentication is used instead of a password, which is common practice for instances running in the cloud.</p> <p>So there it is - a VM deployed into VCD and automatically connected to ZeroTier, making it available without having to configure any sort of inbound firewall rules, NAT, or IPSec/SSL VPN.</p> <h1 id="state-of-the-vcd-terraform-provider-in-2022">State of the VCD Terraform Provider in 2022</h1> <p>When I wrote about this in 2018, the VCD Terraform provider was written by HashiCorp and was based on a go library named <code class="language-plaintext highlighter-rouge">govcloudair</code>. This library was not maintained by VMware and it was not actively developed, meaning that the VCD provider supported a limited number of features. I am happy to report that the <a href="https://registry.terraform.io/providers/vmware/vcd/latest">current VCD provider</a> is in a much better state. The provider is actively developed by VMware along with the underlying go library, <a href="https://github.com/vmware/go-vcloud-director">go-vcloud-director</a>. As of March 2022, there were <strong>over 2 million installs</strong> of the VCD Terraform provider, and new features are being added regularly. Many of the workarounds and caveats I mentioned in my 2018 post are no longer required. Huzzah!</p> <p class="center"><img src="https://media.giphy.com/media/d7qN2d6ktQphUeDoQ4/giphy.gif" alt="" /></p> <h1 id="final-thoughts">Final Thoughts</h1> <p>Here are a few random thoughts/potential improvements:</p> <ul> <li>This same workflow could be used in any cloud environment. It would require outbound internet access to be enabled, and cloud-init is well supported across cloud providers. 
Each cloud provider’s Terraform provider documentation should contain examples for using cloud-init.</li> <li>Cloud-init could be used to install ZeroTier and send the output to Slack, but I didn’t want to spend the time to convert my install script. Initially, I used a script hosted on GitHub because there was a limit on the size of a script that can be used with Guest Customization, but cloud-init does not have that limit. I may convert my install script over to cloud-init at a later date.</li> <li>The ZeroTier install script uses <a href="https://github.com/philippbosch/slack-webhook-cli">https://github.com/philippbosch/slack-webhook-cli</a> to send messages to Slack, which requires Python to be installed. Installing Python adds time to the process. Sending messages to Slack is just a webhook, so a bash script could be used instead. This would remove the requirement to install Python and the whole process would be a bit faster.</li> </ul> <h1 id="resources">Resources</h1> <ul> <li>VCD Terraform provider: <a href="https://registry.terraform.io/providers/vmware/vcd/latest">https://registry.terraform.io/providers/vmware/vcd/latest</a></li> <li>Go-vcloud-director library: <a href="https://github.com/vmware/go-vcloud-director">https://github.com/vmware/go-vcloud-director</a></li> <li>ZeroTier documentation: <a href="https://docs.zerotier.com/zerotier/manual/">https://docs.zerotier.com/zerotier/manual/</a></li> <li>ZeroTier overview on Wikipedia: <a href="https://en.wikipedia.org/wiki/ZeroTier">https://en.wikipedia.org/wiki/ZeroTier</a></li> <li>How Does ZeroTier Actually Work? <a href="https://www.youtube.com/watch?v=Lao9T_RQTak">https://www.youtube.com/watch?v=Lao9T_RQTak</a></li> </ul> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1" role="doc-endnote"> <p>GPL license / Up to 100 devices / Requires license to embed in commercial products. 
<a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2" role="doc-endnote"> <p>Quick setup, but actual traffic may proxy through ZeroTier servers. There is no throughput guarantee. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div> Thu, 10 Mar 2022 00:00:00 +0000 http://www.networkbrouhaha.com/2022/03/vcd-verraform-example/ Using cloud-init for Customization with VCD and Terraform <p>Recently I decided to update a blog post I wrote in 2018, <a href="https://networkbrouhaha.com/2018/03/vcd-terraform-example/">Simple cloud automation with vCD, Terraform, ZeroTier and Slack</a>. At a very high level, this blog post walks through deploying a vApp to VCD that is customized to run a script at first boot. In the original blog post, I relied on <a href="https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vm_admin.doc/GUID-58E346FF-83AE-42B8-BE58-253641D257BC.html">Guest Customization</a> with VMware tools to accomplish this. For a variety of reasons - primarily curiosity - I decided to use <a href="https://cloud-init.io/">cloud-init</a> to run the script instead. Cloud-init is quite flexible and well supported, but in hindsight, my choice led me down quite a rabbit hole. This post covers the details of how cloud-init reads its configuration through VMware tools, tips for troubleshooting cloud-init, and some other lessons learned along the way. Of course, I’ll share a working example that deploys a vApp to VCD using cloud-init for customization.</p> <p>The act that set the stage for this post is something I have done many times: I uploaded an Ubuntu ISO to a VCD catalog and used it to create a vApp. That vApp, and the single VM it contained, would be added to the same VCD catalog as a vApp template. 
This was my first mistake, but it took me several hours to figure out why.</p> <p class="center"><img src="https://media.giphy.com/media/xUPGcl3ijl0vAEyIDK/giphy.gif" alt="" /></p> <p>Before we get into that, let’s level set on how cloud-init works.</p> <h1 id="the-basics-of-cloud-init">The Basics of cloud-init</h1> <p>Here is how cloud-init describes itself:</p> <p>“Cloud-init is the industry standard multi-distribution method for cross-platform cloud instance initialization. It is supported across all major public cloud providers, provisioning systems for private cloud infrastructure, and bare-metal installations.” -<a href="https://cloudinit.readthedocs.io/">https://cloudinit.readthedocs.io/</a></p> <p>Taking a look at <a href="https://cloudinit.readthedocs.io/en/latest/topics/examples.html">the provided configuration examples</a> makes it clear what the capabilities are:</p> <ul> <li>Add/configure users</li> <li>Create files</li> <li>Install or update software</li> <li>Configure networking</li> <li>Configure Certificate Authorities</li> <li>Run scripts/arbitrary commands</li> <li>And <a href="https://cloudinit.readthedocs.io/en/latest/topics/modules.html">much more</a></li> </ul> <p>The typical scenario for cloud-init is that a config file is supplied when a server boots, then read and executed by cloud-init. The cloud-init docs refer to the config file as <code class="language-plaintext highlighter-rouge">user-data</code>. So, how is <code class="language-plaintext highlighter-rouge">user-data</code> supplied? The details vary, but a datasource is the vehicle that delivers configuration files to cloud-init. 
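Before digging into datasources, it helps to see what <code class="language-plaintext highlighter-rouge">user-data</code> actually looks like: a YAML file that starts with a <code class="language-plaintext highlighter-rouge">#cloud-config</code> header. Here is a minimal, hypothetical sketch (the user name, key, and commands are placeholders, not from this post) that exercises a few of the capabilities listed above:

```yaml
#cloud-config
# Hypothetical minimal user-data: create a user, install a package,
# and run a command at first boot.
users:
  - name: demo
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA_placeholder_key demo@example.com
packages:
  - htop
runcmd:
  - echo "cloud-init finished" >> /var/tmp/first-boot.log
```

cloud-init only applies this file on first boot, which is why the delivery mechanism — the datasource — matters so much.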
Cloud-init supports several <a href="https://cloudinit.readthedocs.io/en/latest/topics/datasources.html">datasources</a> to deliver <code class="language-plaintext highlighter-rouge">user-data</code> (there are datasources available for major cloud providers), but in a VMware environment the most promising options are <a href="https://cloudinit.readthedocs.io/en/latest/topics/datasources/ovf.html">OVF</a> and <a href="https://cloudinit.readthedocs.io/en/latest/topics/datasources/vmware.html">VMware</a>.</p> <ul> <li>The <a href="https://cloudinit.readthedocs.io/en/latest/topics/datasources/vmware.html">VMware datasource docs</a> state that it supports <code class="language-plaintext highlighter-rouge">GuestInfo</code> keys for supplying <code class="language-plaintext highlighter-rouge">user-data</code>. <code class="language-plaintext highlighter-rouge">GuestInfo</code> is metadata in the form of key/value pairs set in a VM’s <code class="language-plaintext highlighter-rouge">extraConfig</code> property, which can be read by VMware tools. As long as this metadata can be set via the VCD Terraform provider, this sounds like the datasource that would be used by cloud-init.</li> <li>The <a href="https://cloudinit.readthedocs.io/en/latest/topics/datasources/ovf.html">OVF datasource docs</a> state “The OVF Datasource provides a datasource for reading data from on an Open Virtualization Format ISO transport.” That sounds less promising. I’m not interested in building an ISO to bootstrap cloud-init.</li> </ul> <p>Cue my surprise when I finally got cloud-init working, and the logs indicated that it used the OVF datasource. 
The datasource used by cloud-init can be checked with the <code class="language-plaintext highlighter-rouge">cloud-id</code> command, and this was the output I received:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ubuntu@ubuntu-impish-21:~$ cloud-id ovf </code></pre></div></div> <p>Since all of the cloud-init code is available on GitHub, it’s not too difficult to see how the various data sources work. After a bit of snooping, it’s clear that the OVF datasource also reads the <code class="language-plaintext highlighter-rouge">extraConfig</code> metadata through VMware tools. In this case, it appears that the cloud-init docs are out of date. That was one of many valuable lessons during this process. Let me share two important ones with you.</p> <h1 id="lesson-1-check-github-issues">Lesson #1: Check GitHub issues</h1> <p>The VCD Terraform Provider docs have a <a href="https://registry.terraform.io/providers/vmware/vcd/latest/docs/guides/vm_guest_customization">section on guest customization</a>, but it doesn’t mention cloud-init specifically. It does show an example of configuring metadata with the provider, so I felt confident that I could supply cloud-init <code class="language-plaintext highlighter-rouge">user-data</code> with that method. I mentioned in the intro that I made a mistake by attempting to use cloud-init with an Ubuntu server that I built from an ISO. I’m quite sure there is a way to make it work, but I kept hitting roadblocks. Had I skimmed the resolved issues in the VCD Terraform Provider repo, I would have found <a href="https://github.com/vmware/terraform-provider-vcd/issues/667#issuecomment-844030920">this helpful comment</a>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The problem that I had was the OVA machine I tried to use. A standard version of Ubuntu. 
First part to make this working correctly is to download the cloud image at: http://cloud-images.ubuntu.com/ </code></pre></div></div> <p>The commenter then goes on to provide a working example of using cloud-init with the VCD Terraform Provider. Normally I do a search through GitHub issues when I’m troubleshooting something. In this case, inexplicably, I did not. If I had read that comment first, I would have saved a lot of time. However, I would not have learned so many useful strategies for troubleshooting cloud-init.</p> <p class="center"><img src="https://media.giphy.com/media/3o7aD4ubUVr8EkgQF2/giphy.gif" alt="" /></p> <h1 id="lesson-2-use-a-cloud-image">Lesson #2: Use a Cloud Image</h1> <p>I was aware cloud images existed, but I was set in my ways. I’d used a bootable ISO to build a Linux VM template so many times and I didn’t consider that there was an easier option. I also assumed cloud images were purely for cloud providers, and I didn’t bother to check if there was a VMware flavor available. Lesson learned. There’s a great post on using the Ubuntu cloud image on vSphere here: <a href="https://d-nix.nl/2021/04/using-the-ubuntu-cloud-image-in-vmware/">https://d-nix.nl/2021/04/using-the-ubuntu-cloud-image-in-vmware/</a>. That only covers the vSphere side of things, but that post is a great explainer.</p> <h1 id="deploying-and-customizing-a-vcd-vapp-with-terraform">Deploying and Customizing a VCD vApp with Terraform</h1> <p>With those (rather obvious) lessons learned, <strong>let’s do this thing</strong>.</p> <p class="center"><img src="https://media.giphy.com/media/tyxovVLbfZdok/giphy.gif" alt="" /></p> <p>You will need the following:</p> <ul> <li>A <code class="language-plaintext highlighter-rouge">cloud-config.yaml</code> file, containing the cloud-init <code class="language-plaintext highlighter-rouge">user-data</code>. The file extension is a clue that this is a YAML-formatted file. 
If you have cloud-init installed locally, you can verify that it is a valid config with <code class="language-plaintext highlighter-rouge">cloud-init devel schema -c cloud-config.yaml</code>. I highly recommend that you do this.</li> <li>A cloud image OVA downloaded on your local workstation. For Ubuntu, these are available at <a href="http://cloud-images.ubuntu.com/">http://cloud-images.ubuntu.com/</a></li> </ul> <h2 id="creating-a-catalog">Creating a Catalog</h2> <p>Creating a catalog in VCD with Terraform is pretty simple. Here is an example:</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"vcd_catalog"</span> <span class="s2">"mycatalog"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="s2">"my-org"</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"my-catalog"</span> <span class="nx">description</span> <span class="p">=</span> <span class="s2">"Catalog created by Terraform"</span> <span class="nx">delete_recursive</span> <span class="p">=</span> <span class="s2">"true"</span> <span class="nx">delete_force</span> <span class="p">=</span> <span class="s2">"true"</span> <span class="p">}</span> </code></pre></div></div> <h2 id="uploading-an-ova-to-a-catalog">Uploading an OVA to a Catalog</h2> <p>Similarly, adding the cloud image OVA to the new catalog is straightforward. 
The upload time will be dependent on the bandwidth available, but the Ubuntu 21.10 cloud image is only about 540 MB.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"vcd_catalog_item"</span> <span class="s2">"ubuntu-2110-cloud"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="nx">vcd_catalog</span><span class="p">.</span><span class="nx">mycatalog</span><span class="p">.</span><span class="nx">org</span> <span class="nx">catalog</span> <span class="p">=</span> <span class="nx">vcd_catalog</span><span class="p">.</span><span class="nx">mycatalog</span><span class="p">.</span><span class="nx">name</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"ubuntu-2110-cloud"</span> <span class="nx">description</span> <span class="p">=</span> <span class="s2">"Ubuntu 21.10 cloud image"</span> <span class="nx">ova_path</span> <span class="p">=</span> <span class="s2">"./impish-server-cloudimg-amd64.ova"</span> <span class="nx">upload_piece_size</span> <span class="p">=</span> <span class="mi">10</span> <span class="p">}</span> </code></pre></div></div> <h2 id="deploying-the-vapp">Deploying the vApp</h2> <p>This is the final step, and it requires a few different Terraform resources, but it’s not too difficult to follow.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"vcd_vapp"</span> <span class="s2">"ubuntu"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="s2">"my-org"</span> <span class="nx">vdc</span> <span class="p">=</span> <span class="s2">"my-vdc"</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"ubuntu"</span> <span class="nx">power_on</span> <span class="p">=</span> <span class="kc">true</span> <span class="p">}</span> 
<span class="k">resource</span> <span class="s2">"vcd_vapp_org_network"</span> <span class="s2">"ubuntu-network"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="s2">"my-org"</span> <span class="nx">vdc</span> <span class="p">=</span> <span class="s2">"my-vdc"</span> <span class="nx">vapp_name</span> <span class="p">=</span> <span class="nx">vcd_vapp</span><span class="p">.</span><span class="nx">ubuntu</span><span class="p">.</span><span class="nx">name</span> <span class="nx">org_network_name</span> <span class="p">=</span> <span class="s2">"org-network"</span> <span class="p">}</span> <span class="k">resource</span> <span class="s2">"vcd_vapp_vm"</span> <span class="s2">"ubuntu"</span> <span class="p">{</span> <span class="nx">org</span> <span class="p">=</span> <span class="s2">"my-org"</span> <span class="nx">vdc</span> <span class="p">=</span> <span class="s2">"my-vdc"</span> <span class="nx">vapp_name</span> <span class="p">=</span> <span class="nx">vcd_vapp</span><span class="p">.</span><span class="nx">ubuntu</span><span class="p">.</span><span class="nx">name</span> <span class="nx">catalog_name</span> <span class="p">=</span> <span class="s2">"my-catalog"</span> <span class="nx">template_name</span> <span class="p">=</span> <span class="s2">"ubuntu-2110-cloud"</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"ubuntu-vm"</span> <span class="nx">memory</span> <span class="p">=</span> <span class="mi">4096</span> <span class="nx">cpus</span> <span class="p">=</span> <span class="mi">1</span> <span class="nx">os_type</span> <span class="p">=</span> <span class="s2">"ubuntu64Guest"</span> <span class="nx">power_on</span> <span class="p">=</span> <span class="kc">true</span> <span class="nx">network</span> <span class="p">{</span> <span class="nx">type</span> <span class="p">=</span> <span class="s2">"org"</span> <span class="nx">name</span> <span class="p">=</span> <span 
class="s2">"org-network"</span> <span class="nx">ip_allocation_mode</span> <span class="p">=</span> <span class="s2">"MANUAL"</span> <span class="nx">ip</span> <span class="p">=</span> <span class="s2">"192.168.1.10"</span> <span class="p">}</span> <span class="nx">guest_properties</span> <span class="p">=</span> <span class="p">{</span> <span class="s2">"user-data"</span> <span class="p">=</span> <span class="nx">base64encode</span><span class="p">(</span><span class="nx">file</span><span class="p">(</span><span class="s2">"cloud-config.yaml"</span><span class="p">))</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <ul> <li>The <code class="language-plaintext highlighter-rouge">vcd_vapp</code> resource creates the new vApp that contains a single VM running the cloud image template in my catalog</li> <li>The <code class="language-plaintext highlighter-rouge">vcd_vapp_org_network</code> resource attaches an existing org network to the new vApp</li> <li>The <code class="language-plaintext highlighter-rouge">vcd_vapp_vm </code>resource provides all of the configuration for the single VM that will be in the new vApp, including the cloud-init <code class="language-plaintext highlighter-rouge">user-data</code></li> </ul> <p>Most of the config in the <code class="language-plaintext highlighter-rouge">vcd_vapp_vm </code>resource is what you’d expect - compute, memory, and networking settings. The <code class="language-plaintext highlighter-rouge">guest_properties</code> section is the important bit. It configures the <code class="language-plaintext highlighter-rouge">extraConfig</code> property on the VM, which is where cloud-init will read the <code class="language-plaintext highlighter-rouge">user-data</code> from. Notice that <a href="https://www.terraform.io/language/functions/file">file()</a> reads the <code class="language-plaintext highlighter-rouge">cloud-config.yaml</code> file, and the <a href="https://www.terraform.io/language/functions/base64encode">base64encode()</a> function converts its contents into a single, long, encoded string. 
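If you want to eyeball that encoded payload outside of Terraform, the same transformation is easy to reproduce with standard shell tools. This quick sketch uses a throwaway two-line config, not the post's actual file:

```shell
# Reproduce Terraform's base64 encoding of a cloud-config file with plain
# shell tools, then decode it again to show the round trip is lossless.
cat > cloud-config.yaml <<'EOF'
#cloud-config
hostname: demo-vm
EOF

# Encode and strip newlines so the result is one long string, similar to
# what ends up in the VM's guest properties.
encoded=$(base64 cloud-config.yaml | tr -d '\n')
echo "$encoded"

# Decode to inspect what cloud-init will actually receive.
printf '%s' "$encoded" | base64 -d
```

Decoding the string you pull back from the VM's guest properties the same way is a quick check that the right config actually made it onto the VM.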
This is how cloud-init expects the <code class="language-plaintext highlighter-rouge">user-data</code> to be passed over.</p> <p>If you have values in your <code class="language-plaintext highlighter-rouge">cloud-config.yaml</code> file that you need to change on the fly, like credentials or API keys, you can use the <a href="https://www.terraform.io/language/functions/templatefile">templatefile()</a> function to insert those values into the config file before encoding it. Keep in mind that <code class="language-plaintext highlighter-rouge">user-data</code> may contain sensitive data, and base64 is trivial to decode. In a production environment, you should remove the <code class="language-plaintext highlighter-rouge">user-data</code> from the VM after first boot.</p> <p>I traveled down a winding road to get here, but I finally assembled all of the pieces needed to do what I originally set out to do: <a href="/2022/03/vcd-verraform-example/">update an old blog post</a>. If all you needed was some tips on using cloud-init with Terraform and VCD, you can go along your merry way. Stick around if you want some tips on troubleshooting cloud-init.</p> <h1 id="troubleshooting-cloud-init">Troubleshooting cloud-init</h1> <p>Here are some basic troubleshooting steps for cloud-init with vSphere/VCD:</p> <ul> <li>Make sure you have a recent version of VMware Tools installed. This is required to read the metadata associated with the VM.</li> <li>Make sure you are using a cloud image <em>or</em> you have taken the steps to ensure that your VM is properly configured to work with cloud-init. You can see an example of this with the <code class="language-plaintext highlighter-rouge">govc</code> tool at <a href="https://github.com/vmware/govmomi/blob/master/govc/USAGE.md#vmchange">https://github.com/vmware/govmomi/blob/master/govc/USAGE.md#vmchange</a>.</li> <li>Verify that VMware Tools is able to access VM metadata. 
You can use the command <code class="language-plaintext highlighter-rouge">vmware-rpctool 'info-get guestinfo.ovfEnv' </code>to check this. If the command returns a slew of XML, it is working as expected.</li> <li>Verify the VM metadata. You can view this in vSphere by browsing to the <code class="language-plaintext highlighter-rouge">VM -&gt; Settings -&gt; vApp Options</code>. Base64 encoded <code class="language-plaintext highlighter-rouge">user-data</code> should be visible under the properties section, and you can click the “View OVF Environment” button to see the XML formatted version of the metadata. This is the same information you should see from running the <code class="language-plaintext highlighter-rouge">vmware-rpctool</code> command on the VM. You can also view these properties in VCD by viewing the Guest Properties section in the VM properties.</li> <li>Check the cloud-init logs at <code class="language-plaintext highlighter-rouge">/var/log/cloud-init.log</code> and <code class="language-plaintext highlighter-rouge">/var/log/cloud-init-output.log</code> for errors and warnings.</li> <li>Run <code class="language-plaintext highlighter-rouge">cloud-id</code> to verify that the correct datasource is being used. If the output is <code class="language-plaintext highlighter-rouge">fallback</code> or <code class="language-plaintext highlighter-rouge">none</code>, cloud-init was not able to detect the datasource.</li> <li><code class="language-plaintext highlighter-rouge">ds-identify</code> is used by cloud-init to find all available datasources. Check the logs at <code class="language-plaintext highlighter-rouge">/run/cloud-init/ds-identify.log</code> to see why the desired datasource is not found.</li> <li>While troubleshooting, you can completely reset cloud-init with <code class="language-plaintext highlighter-rouge">sudo cloud-init clean --logs</code>, and reboot to have cloud-init run again. 
This saves time over redeploying a template.</li> </ul> <h1 id="resources">Resources</h1> <ul> <li>Terraform VCD provider: <a href="https://registry.terraform.io/providers/vmware/vcd/3.5.1">https://registry.terraform.io/providers/vmware/vcd/3.5.1</a></li> <li><code class="language-plaintext highlighter-rouge">vcd_catalog </code>resource: <a href="https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/catalog">https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/catalog</a></li> <li><code class="language-plaintext highlighter-rouge">vcd_catalog_item </code>resource: <a href="https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/catalog_item">https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/catalog_item</a></li> <li><code class="language-plaintext highlighter-rouge">vcd_vapp </code>resource: <a href="https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/vapp">https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/vapp</a></li> <li><code class="language-plaintext highlighter-rouge">vcd_vapp_org_network </code>resource: <a href="https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/vapp_org_network">https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/vapp_org_network</a></li> <li><code class="language-plaintext highlighter-rouge">vcd_vapp_vm </code>resource: <a href="https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/vapp_vm">https://registry.terraform.io/providers/vmware/vcd/latest/docs/resources/vapp_vm</a></li> <li>OVF Runtime Environment: <a href="https://williamlam.com/2012/06/ovf-runtime-environment.html">https://williamlam.com/2012/06/ovf-runtime-environment.html</a></li> <li>Using the Ubuntu Cloud Image in VMware: <a href="https://d-nix.nl/2021/04/using-the-ubuntu-cloud-image-in-vmware/">https://d-nix.nl/2021/04/using-the-ubuntu-cloud-image-in-vmware/</a></li> <li>Terraform, vSphere, 
and Cloud-Init oh my! <a href="https://grantorchard.com/terraform-vsphere-cloud-init/">https://grantorchard.com/terraform-vsphere-cloud-init/</a></li> <li>Cloud-init config examples: <a href="https://cloudinit.readthedocs.io/en/latest/topics/examples.html">https://cloudinit.readthedocs.io/en/latest/topics/examples.html</a></li> </ul> Thu, 10 Mar 2022 00:00:00 +0000 http://www.networkbrouhaha.com/2022/03/cloud-init-vcd/ http://www.networkbrouhaha.com/2022/03/cloud-init-vcd/ Intro to Google Cloud VMware Engine – Common Networking Scenarios <p>This post will cover some common networking scenarios in Google Cloud VMware Engine (GCVE), like exposing a VM via public IP, accessing cloud-native services, and configuring a basic load balancer in NSX-T. I’ll also recap some important and useful features in GCP and GCVE. There is a lot of material covered, so I’ve provided a table of contents to allow you to skip to the topic you’re interested in.</p> <div style="position: relative;"> <a href="#toc-skipped" class="screen-reader-only">Skip table of contents</a> </div> <h1 class="no_toc" id="table-of-contents">Table of Contents</h1> <ul id="markdown-toc"> <li><a href="#creating-workload-segments-in-nsx-t" id="markdown-toc-creating-workload-segments-in-nsx-t">Creating Workload Segments in NSX-T</a></li> <li><a href="#exposing-a-vm-via-public-ip" id="markdown-toc-exposing-a-vm-via-public-ip">Exposing a VM via Public IP</a> <ul> <li><a href="#creating-firewall-rules" id="markdown-toc-creating-firewall-rules">Creating Firewall Rules</a></li> </ul> </li> <li><a href="#load-balancing-with-nsx-t" id="markdown-toc-load-balancing-with-nsx-t">Load Balancing with NSX-T</a></li> <li><a href="#accessing-cloud-native-services" id="markdown-toc-accessing-cloud-native-services">Accessing Cloud-Native Services</a> <ul> <li><a href="#google-private-access" id="markdown-toc-google-private-access">Google Private Access</a></li> </ul> </li> <li><a href="#viewing-routing-information" 
id="markdown-toc-viewing-routing-information">Viewing Routing Information</a> <ul> <li><a href="#vpc-routes" id="markdown-toc-vpc-routes">VPC Routes</a></li> <li><a href="#vpc-network-peering-routes" id="markdown-toc-vpc-network-peering-routes">VPC Network Peering Routes</a></li> <li><a href="#nsx-t" id="markdown-toc-nsx-t">NSX-T</a></li> </ul> </li> <li><a href="#vpn-connectivity" id="markdown-toc-vpn-connectivity">VPN Connectivity</a></li> <li><a href="#dns-notes" id="markdown-toc-dns-notes">DNS Notes</a></li> <li><a href="#wrap-up" id="markdown-toc-wrap-up">Wrap Up</a></li> <li><a href="#helpful-resources" id="markdown-toc-helpful-resources">Helpful Resources</a></li> </ul> <div id="toc-skipped"></div> <p><strong>Other posts in this series:</strong></p> <ul> <li><a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a></li> <li><a href="/2021/02/gcp-vpc-to-gcve/">Connecting a VPC to GCVE</a></li> <li><a href="/2021/03/gcve-bastion/">Bastion Host Access with IAP</a></li> <li><a href="/2021/03/gcve-network-overview/">Network and Connectivity Overview</a></li> <li><a href="/2021/04/gcve-hcx-config/">HCX Configuration</a></li> </ul> <h1 id="creating-workload-segments-in-nsx-t">Creating Workload Segments in NSX-T</h1> <p>Your GCVE SDDC initially comes with networking pre-configured, and you don’t need to worry about configuring and trunking VLANs. Instead, any new networking configuration will be done in NSX-T. 
If you are new to NSX-T, the GCVE documentation <a href="https://cloud.google.com/vmware-engine/docs/networking/howto-create-vlan-subnet">covers creating new workload segments</a>, which should be your first step before creating or migrating any VMs to your GCVE SDDC.</p> <p class="center"><a href="/resources/2021/05/68_gcve_diagram_1.png" class="drop-shadow"><img src="/resources/2021/05/68_gcve_diagram_1.png" alt="" /></a></p> <p>This diagram represents the initial setup of my GCVE environment, and I will be building on this example over the following sections. If you’ve been following along with this blog series, this should look familiar. You can see a “Customer Data Center” on the left, which in my case is a lab, but it could be any environment connected to GCP via Cloud VPN or Cloud Interconnect. There is also a VPC peered with my GCVE environment, which is where my bastion host is running.</p> <p>I’ve created a workload segment, <code class="language-plaintext highlighter-rouge">192.168.83.0/24</code>, and connected three Ubuntu Linux VMs to it. A few essential steps must be completed outside of NSX-T when new segments are created while using VPC peering or dynamic routing over Cloud VPN or Cloud Interconnect.</p> <p class="center"><a href="/resources/2021/05/52_vpc_peering_imported_edited.png" class="drop-shadow"><img src="/resources/2021/05/52_vpc_peering_imported_edited.png" alt="" /></a></p> <p>First, you must have <code class="language-plaintext highlighter-rouge">Import/export custom routes</code> enabled in private service access for the VPC peered with GCVE. Custom routes are covered in my previous post, <a href="/2021/02/gcp-vpc-to-gcve/">Connecting a VPC to GCVE</a>. 
Notice that my newly created segment shows up under <code class="language-plaintext highlighter-rouge">Imported Routes</code>.</p> <p class="center"><a href="/resources/2021/05/50_cloud_router_adv_edited.png" class="drop-shadow"><img src="/resources/2021/05/50_cloud_router_adv_edited.png" alt="" /></a></p> <p>Second, any workload segments must be added as a custom IP range to any Cloud Router participating in BGP peering to advertise routes back to your environment. This would apply to both Cloud Interconnect and Cloud VPN, where BGP is used to provide dynamic routing. Configuring this will ensure that the workload subnet will be advertised to your environment. More information can be found <a href="https://cloud.google.com/vmware-engine/docs/networking/howto-connect-to-onpremises#end-to-end_connectivity_and_routing_considerations">here</a>.</p> <p>NSX-T has an excellent <a href="https://registry.terraform.io/providers/vmware/nsxt/latest/docs">Terraform provider</a>, and I have already covered several GCP Terraform examples in previous posts. My recommendation is to add new NSX-T segments via Terraform and add the custom subnet advertisement for the segment to any Cloud Routers via Terraform in the same workflow. This way, you will be sure you never forget to update your Cloud Router advertisements after adding a new segment.</p> <h1 id="exposing-a-vm-via-public-ip">Exposing a VM via Public IP</h1> <p>Let’s add an application into the mix. I have a test webserver running on <code class="language-plaintext highlighter-rouge">VM1</code> that I want to expose to the internet.</p> <p class="center"><a href="/resources/2021/05/69_gcve_diagram_2.png" class="drop-shadow"><img src="/resources/2021/05/69_gcve_diagram_2.png" alt="" /></a></p> <p>In GCVE, public IPs are not assigned directly to a VM. Instead, public IPs are allocated through the GCVE portal and assigned to the private IP of the relevant VM. 
This creates a simple destination NAT from the allocated public IP to the internal private IP.</p> <p class="center"><a href="/resources/2021/05/54_allocate_public_ip.png" class="drop-shadow"><img src="/resources/2021/05/54_allocate_public_ip.png" alt="" /></a></p> <p>Browse to <code class="language-plaintext highlighter-rouge">Network &gt; Public IPs</code> and click <code class="language-plaintext highlighter-rouge">Allocate</code> to allocate a public IP. You will be prompted to supply a name and the region for the public IP. Click <code class="language-plaintext highlighter-rouge">Submit</code>, and you will be taken back to the <code class="language-plaintext highlighter-rouge">Public IPs</code> page. This page will now show the public IP that has been allocated. The internal address it is assigned to is listed under the <code class="language-plaintext highlighter-rouge">Attached Address</code> column.</p> <p>You can find more information on public IPs in the <a href="https://cloud.google.com/vmware-engine/docs/concepts-public-ip-address">GCVE documentation</a>.</p> <h2 id="creating-firewall-rules">Creating Firewall Rules</h2> <p class="center"><a href="/resources/2021/05/55_create_fw_table.png" class="drop-shadow"><img src="/resources/2021/05/55_create_fw_table.png" alt="" /></a></p> <p>GCVE also includes a firewall beyond the NSX-T boundary, so it will need to be configured to allow access to the public IP that was just allocated. To do this, browse to <code class="language-plaintext highlighter-rouge">Network &gt; Firewall tables</code> and click <code class="language-plaintext highlighter-rouge">Create new firewall table</code>. 
Provide a name for the firewall table and click <code class="language-plaintext highlighter-rouge">Add Rule</code>.</p> <p class="center"><a href="/resources/2021/05/56_create_fw_rule.png" class="drop-shadow"><img src="/resources/2021/05/56_create_fw_rule.png" alt="" /></a></p> <p>Configure the rule to allow the desired traffic, choosing <code class="language-plaintext highlighter-rouge">Public IP</code> as the destination. Choose the newly allocated public IP from the dropdown, and click <code class="language-plaintext highlighter-rouge">Done</code>.</p> <p class="center"><a href="/resources/2021/05/57_firewall_config.png" class="drop-shadow"><img src="/resources/2021/05/57_firewall_config.png" alt="" /></a></p> <p>The new firewall table will be displayed. Click <code class="language-plaintext highlighter-rouge">Attached Subnets</code>, then <code class="language-plaintext highlighter-rouge">Attach to a Subnet</code>. This will attach the firewall table to a network.</p> <p class="center"><a href="/resources/2021/05/58_attach_fw_edited.png" class="drop-shadow"><img src="/resources/2021/05/58_attach_fw_edited.png" alt="" /></a></p> <p>Choose your SDDC along with <code class="language-plaintext highlighter-rouge">System management</code> from the <code class="language-plaintext highlighter-rouge">Select a Subnet</code> dropdown, and click <code class="language-plaintext highlighter-rouge">Save</code>. <code class="language-plaintext highlighter-rouge">System management</code> is the correct subnet to use when applying the firewall table to traffic behind NSX-T per the GCVE documentation.</p> <p class="center"><a href="/resources/2021/05/61_ubuntu_webserver_edited.png" class="drop-shadow"><img src="/resources/2021/05/61_ubuntu_webserver_edited.png" alt="" /></a></p> <p>I am now able to access my test webserver via the allocated public IP. Huzzah! 
More information on firewall tables can be found in the <a href="https://cloud.google.com/vmware-engine/docs/concepts-firewall-tables">GCVE documentation</a>.</p> <h1 id="load-balancing-with-nsx-t">Load Balancing with NSX-T</h1> <p>Now that the test webserver is working as expected, it’s time to implement a load balancer in NSX-T. Keep in mind that GCP also has a <a href="https://cloud.google.com/load-balancing/docs/load-balancing-overview">native load balancing service</a>, but that is beyond the scope of this post.</p> <p class="center"><a href="/resources/2021/05/70_gcve_diagram_3.png" class="drop-shadow"><img src="/resources/2021/05/70_gcve_diagram_3.png" alt="" /></a></p> <p>Public IPs can be assigned to any private IP, not just IPs assigned to VMs. For this example, I’ll configure the NSX-T load balancer and move the previously allocated public IP to the load balancer VIP. There are several steps needed to create a load balancer, so let’s dive in.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_1.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_1.png" alt="" /></a></p> <p>The first step is to create a new load balancer via the <code class="language-plaintext highlighter-rouge">Load Balancing</code> screen in NSX-T Manager. Provide a name, choose a size, and the tier 1 router to host the load balancer. Click <code class="language-plaintext highlighter-rouge">Save</code>. Now, expand the <code class="language-plaintext highlighter-rouge">Virtual Servers</code> section and click <code class="language-plaintext highlighter-rouge">Set Virtual Servers</code>.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_2.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_2.png" alt="" /></a></p> <p>This is where the virtual server IP (VIP) will be configured, along with a backing server pool. Provide a name and internal IP for the VIP. 
I used an IP that lives in the same segment as my servers, but you could create a dedicated segment for your VIP. Click the dropdown under <code class="language-plaintext highlighter-rouge">Server Pool</code> and click <code class="language-plaintext highlighter-rouge">Create New</code>.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_3.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_3.png" alt="" /></a></p> <p>Next, provide a name for your server pool, and choose a load balancing algorithm. Click <code class="language-plaintext highlighter-rouge">Select Members</code> to add VMs to the pool.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_4_edited.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_4_edited.png" alt="" /></a></p> <p>Click <code class="language-plaintext highlighter-rouge">Add Member</code> to add a new VM to the pool and provide the internal IP and port. Rinse and repeat until you’ve added all of the relevant VMs to your virtual server pool, then click <code class="language-plaintext highlighter-rouge">Apply</code>.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_5.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_5.png" alt="" /></a></p> <p>You’ll be taken back to the server pool screen, where you can add a monitor to check the health of the VMs in your pool. Click <code class="language-plaintext highlighter-rouge">Set Monitors</code> to choose a monitor.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_6_edited.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_6_edited.png" alt="" /></a></p> <p>My pool members are running a simple webserver on port 80, so I’m using the <code class="language-plaintext highlighter-rouge">default-http-lb-monitor</code>. 
After choosing the appropriate monitor, click <code class="language-plaintext highlighter-rouge">Apply</code>.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_7_edited.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_7_edited.png" alt="" /></a></p> <p>Review the settings for the VIP and click <code class="language-plaintext highlighter-rouge">Close</code>.</p> <p class="center"><a href="/resources/2021/05/62_web_lb_8_edited.png" class="drop-shadow"><img src="/resources/2021/05/62_web_lb_8_edited.png" alt="" /></a></p> <p>Finally, click <code class="language-plaintext highlighter-rouge">Save</code> to apply the new settings to your load balancer.</p> <p class="center"><a href="/resources/2021/05/63_edit_public_ip.png" class="drop-shadow"><img src="/resources/2021/05/63_edit_public_ip.png" alt="" /></a></p> <p>The last step is to browse to <code class="language-plaintext highlighter-rouge">Network &gt; Public IPs</code> in the GCVE portal and edit the existing public IP allocation. Update the name as appropriate, and change the attached local address to the load balancer VIP. No firewall rules need to be changed since the traffic is coming in over the same port (<code class="language-plaintext highlighter-rouge">tcp/80</code>).</p> <p class="center"><a href="/resources/2021/05/64_lb_test.gif" class="drop-shadow"><img src="/resources/2021/05/64_lb_test.gif" alt="" /></a></p> <p>Browsing to the allocated public IP and pressing refresh a few times shows that our load balancer is working as expected!</p> <h1 id="accessing-cloud-native-services">Accessing Cloud-Native Services</h1> <p>The last addition to this example is to include a GCP cloud-native service. I’ve chosen to use Cloud Storage because it is a simple example, and it provides incredible utility. 
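</p>

<p>Before wiring up Cloud Storage, a quick aside: in keeping with the “automation first” theme of this series, the load balancer assembled by hand in NSX-T Manager above can also be declared as code. The sketch below uses the <a href="https://registry.terraform.io/providers/vmware/nsxt/latest/docs">vmware/nsxt</a> Terraform provider. The resource types are real, but every display name, IP address, and the Tier-1 gateway name are placeholders from my lab, so treat this as a starting point and verify the arguments against the provider documentation.</p>

<div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Look up the Tier-1 gateway and a built-in TCP application profile
data "nsxt_policy_tier1_gateway" "t1" {
  display_name = "Tier1" # placeholder: your Tier-1 gateway name
}

data "nsxt_policy_lb_app_profile" "tcp" {
  type = "TCP"
}

# Server pool with two web servers, mirroring the pool built in the UI
resource "nsxt_policy_lb_pool" "web_pool" {
  display_name = "Web-Servers"
  algorithm    = "ROUND_ROBIN"

  member {
    display_name = "web01"
    ip_address   = "10.30.28.10" # placeholder pool member IPs
    port         = "80"
  }

  member {
    display_name = "web02"
    ip_address   = "10.30.28.11"
    port         = "80"
  }
}

# The load balancer itself, attached to the Tier-1 gateway
resource "nsxt_policy_lb_service" "web_lb" {
  display_name      = "Web-LB"
  connectivity_path = data.nsxt_policy_tier1_gateway.t1.path
  size              = "SMALL"
}

# Virtual server tying the VIP, pool, and load balancer together
resource "nsxt_policy_lb_virtual_server" "web_vip" {
  display_name             = "Web-VIP"
  ip_address               = "10.30.28.100" # placeholder VIP
  ports                    = ["80"]
  service_path             = nsxt_policy_lb_service.web_lb.path
  pool_path                = nsxt_policy_lb_pool.web_pool.path
  application_profile_path = data.nsxt_policy_lb_app_profile.tcp.path
}
</code></pre></div></div>

<p>A health monitor can be attached to the pool as well, via the pool’s <code class="language-plaintext highlighter-rouge">active_monitor_path</code>-style argument; check the provider docs for the exact name in your version.</p>

<p>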
This diagram illustrates my desired configuration.</p> <p class="center"><a href="/resources/2021/05/72_gcve_diagram_4.png" class="drop-shadow"><img src="/resources/2021/05/72_gcve_diagram_4.png" alt="" /></a></p> <p>My goal is to stage a simple static website in a Google Storage bucket, then mount the bucket as a read-only filesystem on each of my webservers. The bucket will be mounted to <code class="language-plaintext highlighter-rouge">/var/www/html</code> and will replace the testing page that had been staged on each server. You may be thinking, “This is crazy. Why not serve the static site directly from Google Storage?!” This is a valid question, and my response is that this is merely an example, not necessarily a best practice. I could have chosen to use Google Filestore instead of Google Storage as well. This illustrates that there is more than one way to do many things in the cloud.</p> <p>The first step is to create a Google Storage bucket, which I completed with this simple Terraform code:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">provider</span> <span class="s">"google"</span> <span class="p">{</span> <span class="n">project</span> <span class="o">=</span> <span class="n">var</span><span class="p">.</span><span class="n">project</span> <span class="n">region</span> <span class="o">=</span> <span class="n">var</span><span class="p">.</span><span class="n">region</span> <span class="n">zone</span> <span class="o">=</span> <span class="n">var</span><span class="p">.</span><span class="n">zone</span> <span class="p">}</span> <span class="n">resource</span> <span class="s">"google_storage_bucket"</span> <span class="s">"melliott-vmw-static-site"</span> <span class="p">{</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"melliott-vmw-static-site"</span> <span class="n">location</span> <span class="o">=</span> <span class="s">"US"</span> <span 
class="n">force_destroy</span> <span class="o">=</span> <span class="n">true</span> <span class="n">storage_class</span> <span class="o">=</span> <span class="s">"STANDARD"</span> <span class="p">}</span> <span class="n">resource</span> <span class="s">"google_storage_bucket_acl"</span> <span class="s">"melliott-vmw-static-site-acl"</span> <span class="p">{</span> <span class="n">bucket</span> <span class="o">=</span> <span class="n">google_storage_bucket</span><span class="p">.</span><span class="n">melliott</span><span class="o">-</span><span class="n">vmw</span><span class="o">-</span><span class="n">static</span><span class="o">-</span><span class="n">site</span><span class="p">.</span><span class="n">name</span> <span class="n">role_entity</span> <span class="o">=</span> <span class="p">[</span> <span class="s">"OWNER:[email protected]"</span> <span class="p">]</span> <span class="p">}</span> </code></pre></div></div> <p>Next, I found a simple static website example, which I stored in the bucket and modified for my needs. After staging this, I completed the following steps on each webserver to mount the bucket.</p> <ul> <li>Install the Google Cloud SDK (<a href="https://cloud.google.com/sdk/docs/install">https://cloud.google.com/sdk/docs/install</a>)</li> <li>Install gcsfuse (<a href="https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/installing.md">https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/installing.md</a>), which is used to mount Google Storage buckets in linux via <a href="https://en.wikipedia.org/wiki/Filesystem_in_Userspace">FUSE</a></li> <li>Authenticate to Google Cloud with <code class="language-plaintext highlighter-rouge">gcloud auth application-default login</code>. This will provide a URL that will need to be pasted into a browser to complete authentication. 
The verification code returned will then need to be pasted back into the prompt on the webserver.</li> <li>Remove existing files in <code class="language-plaintext highlighter-rouge">/var/www/html</code></li> <li>Mount the bucket as a read-only filesystem with <code class="language-plaintext highlighter-rouge">gcsfuse -o allow_other -o ro [bucket-name] /var/www/html</code></li> </ul> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@ubuntu:/var/www# gcsfuse <span class="nt">-o</span> allow_other <span class="nt">-o</span> ro melliott-vmw-static-site /var/www/html 2021/05/04 16:19:10.680365 Using mount point: /var/www/html 2021/05/04 16:19:10.686743 Opening GCS connection... 2021/05/04 16:19:11.037846 Mounting file system <span class="s2">"melliott-vmw-static-site"</span>... 2021/05/04 16:19:11.042605 File system has been successfully mounted. root@ubuntu:/var/www# root@ubuntu:/var/www# root@ubuntu:/var/www# <span class="nb">ls</span> /var/www/html assets error images index.html LICENSE.MD README.MD </code></pre></div></div> <p>After mounting the bucket and running an <code class="language-plaintext highlighter-rouge">ls</code> on <code class="language-plaintext highlighter-rouge">/var/www/html</code>, I can see that my static website is mounted correctly.</p> <p class="center"><a href="/resources/2021/05/73_static_website.png" class="drop-shadow"><img src="/resources/2021/05/73_static_website.png" alt="" /></a></p> <p>Browsing to the public IP fronting my load balancer VIP now displays my static website, hosted in a Google Storage bucket. Pretty snazzy!</p> <h2 id="google-private-access">Google Private Access</h2> <p>My GCVE environment has internet access enabled, so native services are accessed via the internet gateway. 
If you don’t want to allow internet access for your environment, you can still access native services via <a href="https://cloud.google.com/vpc/docs/configure-private-google-access">Private Google Access</a>. Much of the GCP documentation for this feature focuses on access to Google APIs from locations other than GCVE, but it is not too difficult to apply these practices to GCVE.</p> <p class="center"><a href="/resources/2021/05/71_vpc_private_google_access.png" class="drop-shadow"><img src="/resources/2021/05/71_vpc_private_google_access.png" alt="" /></a></p> <p>Private Google Access is primarily enabled via DNS, but you still need to enable the feature on the relevant subnets in your VPC. The domain names used for this service are <code class="language-plaintext highlighter-rouge">private.googleapis.com</code> and <code class="language-plaintext highlighter-rouge">restricted.googleapis.com</code>. I was able to resolve both of these from my GCVE VMs, but my VMs are configured to use the resolvers in my GCVE environment. If you cannot resolve these hostnames, make sure you are using the GCVE DNS servers. As a reminder, these server addresses can be found under <code class="language-plaintext highlighter-rouge">Private Cloud DNS Servers</code> on the summary page for your GCVE cluster. You can find more information on Private Google Access <a href="https://cloud.google.com/vpc/docs/configure-private-google-access">here</a>.</p> <h1 id="viewing-routing-information">Viewing Routing Information</h1> <p>Knowing where to find routing tables is incredibly helpful when troubleshooting connectivity issues. There are a handful of places to look in GCP and GCVE to find this information.</p> <h2 id="vpc-routes">VPC Routes</h2> <p>You can view routes for a VPC in the GCP portal by browsing to <code class="language-plaintext highlighter-rouge">VPC networks</code>, clicking on the desired VPC, then clicking on the <code class="language-plaintext highlighter-rouge">Routes</code> tab.
If you are using VPC peering, you will notice a message that says, “<em>This VPC network has been configured to import custom routes using VPC Network Peering. Any imported custom dynamic routes are omitted from this list, and some route conflicts might not be resolved. Please refer to the VPC Network Peering section for the complete list of imported custom routes, and the <a href="https://cloud.google.com/vpc/docs/routes?authuser=1#routeselection">routing order</a> for information about how GCP resolves conflicts.</em>” Basically, this message says that you will not see routes for your GCVE environment in this table.</p> <h2 id="vpc-network-peering-routes">VPC Network Peering Routes</h2> <p>To see routes for your GCVE environment, browse to <code class="language-plaintext highlighter-rouge">VPC Network Peering</code> and choose the <code class="language-plaintext highlighter-rouge">servicenetworking-googleapis-com</code> entry for your VPC. You will see routes for your GCVE environment under <code class="language-plaintext highlighter-rouge">Imported Routes</code> and any subnets in your VPC under <code class="language-plaintext highlighter-rouge">Exported Routes</code>. 
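</p>

<p>The import/export behavior of the peering can itself be managed as code. If routes you expect are missing from these lists, check the custom route flags on the peering. Here is a hedged sketch using the <code class="language-plaintext highlighter-rouge">google_compute_network_peering_routes_config</code> resource from the hashicorp/google Terraform provider; the network name is a placeholder from my environment.</p>

<div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Ensure custom routes are exchanged over the GCVE service networking peering
resource "google_compute_network_peering_routes_config" "gcve_peering" {
  peering = "servicenetworking-googleapis-com"
  network = "gcve-usw2" # placeholder: your VPC name

  import_custom_routes = true
  export_custom_routes = true
}
</code></pre></div></div>

<p>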
You can also view these routes using the <code class="language-plaintext highlighter-rouge">gcloud</code> tool.</p> <ul> <li>View imported routes: <code class="language-plaintext highlighter-rouge">gcloud compute networks peerings list-routes servicenetworking-googleapis-com --network=[VPC Name] --region=[REGION] --direction=INCOMING</code></li> <li>View exported routes: <code class="language-plaintext highlighter-rouge">gcloud compute networks peerings list-routes servicenetworking-googleapis-com --network=[VPC Name] --region=[REGION] --direction=OUTGOING</code></li> </ul> <p>Example results:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>melliott@melliott-a01 gcp-bucket % gcloud compute networks peerings list-routes servicenetworking-googleapis-com <span class="nt">--network</span><span class="o">=</span>gcve-usw2 <span class="nt">--region</span><span class="o">=</span>us-west2 <span class="nt">--direction</span><span class="o">=</span>INCOMING DEST_RANGE TYPE NEXT_HOP_REGION PRIORITY STATUS 192.168.80.0/29 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.80.0/29 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.80.16/29 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.80.16/29 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.80.8/29 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.80.8/29 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.80.112/28 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.80.112/28 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 10.30.28.0/24 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 10.30.28.0/24 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.81.0/24 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.81.0/24 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.83.0/24 DYNAMIC_PEERING_ROUTE us-west2 0 accepted 192.168.83.0/24 DYNAMIC_PEERING_ROUTE us-west2 0 accepted </code></pre></div></div> <h2 id="nsx-t">NSX-T</h2> <p>Routing and forwarding tables can be downloaded from the NSX-T
manager web interface or via API. It’s also reasonably easy to grab the routing table with PowerCLI. The following example displays the routing table from the T0 router in my GCVE environment.</p> <div class="language-powershell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Import-Module</span><span class="w"> </span><span class="nx">VMware.PowerCLI</span><span class="w"> </span><span class="n">Connect-NsxtServer</span><span class="w"> </span><span class="nt">-Server</span><span class="w"> </span><span class="nx">my-nsxt-manager.gve.goog</span><span class="w"> </span><span class="nv">$t0s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Get-NsxtPolicyService</span><span class="w"> </span><span class="nt">-Name</span><span class="w"> </span><span class="nx">com.vmware.nsx_policy.infra.tier0s</span><span class="w"> </span><span class="nv">$t0_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nv">$t0s</span><span class="o">.</span><span class="nf">list</span><span class="p">()</span><span class="o">.</span><span class="nf">results</span><span class="o">.</span><span class="nf">display_name</span><span class="w"> </span><span class="nv">$t0</span><span class="o">.</span><span class="nf">list</span><span class="p">(</span><span class="nv">$t0_name</span><span class="p">)</span><span class="o">.</span><span class="nf">results</span><span class="o">.</span><span class="nf">route_entries</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Select-Object</span><span class="w"> </span><span class="nx">network</span><span class="p">,</span><span class="nx">next_hop</span><span class="p">,</span><span class="nx">route_type</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Sort-Object</span><span class="w"> </span><span class="nt">-Property</span><span class="w"> 
</span><span class="nx">network</span><span class="w"> </span><span class="n">network</span><span class="w"> </span><span class="nx">next_hop</span><span class="w"> </span><span class="nx">route_type</span><span class="w"> </span><span class="o">-------</span><span class="w"> </span><span class="o">--------</span><span class="w"> </span><span class="o">----------</span><span class="w"> </span><span class="mf">0.0</span><span class="o">.</span><span class="nf">0</span><span class="o">.</span><span class="nf">0</span><span class="n">/0</span><span class="w"> </span><span class="nx">192.168.81.225</span><span class="w"> </span><span class="nx">t0s</span><span class="w"> </span><span class="mf">0.0</span><span class="o">.</span><span class="nf">0</span><span class="o">.</span><span class="nf">0</span><span class="n">/0</span><span class="w"> </span><span class="nx">192.168.81.241</span><span class="w"> </span><span class="nx">t0s</span><span class="w"> </span><span class="mf">10.30</span><span class="o">.</span><span class="nf">28</span><span class="o">.</span><span class="nf">0</span><span class="n">/24</span><span class="w"> </span><span class="nx">169.254.160.3</span><span class="w"> </span><span class="nx">t1c</span><span class="w"> </span><span class="mf">10.30</span><span class="o">.</span><span class="nf">28</span><span class="o">.</span><span class="nf">0</span><span class="n">/24</span><span class="w"> </span><span class="nx">169.254.160.3</span><span class="w"> </span><span class="nx">t1c</span><span class="w"> </span><span class="mf">169.254</span><span class="o">.</span><span class="nf">0</span><span class="o">.</span><span class="nf">0</span><span class="n">/24</span><span class="w"> </span><span class="nx">t0c</span><span class="w"> </span><span class="mf">169.254</span><span class="o">.</span><span class="nf">160</span><span class="o">.</span><span class="nf">0</span><span class="n">/31</span><span class="w"> </span><span class="nx">t0c</span><span 
class="w"> </span><span class="mf">169.254</span><span class="o">.</span><span class="nf">160</span><span class="o">.</span><span class="nf">0</span><span class="n">/31</span><span class="w"> </span><span class="nx">t0c</span><span class="w"> </span><span class="mf">169.254</span><span class="o">.</span><span class="nf">160</span><span class="o">.</span><span class="nf">2</span><span class="n">/31</span><span class="w"> </span><span class="nx">t0c</span><span class="w"> </span><span class="mf">169.254</span><span class="o">.</span><span class="nf">160</span><span class="o">.</span><span class="nf">2</span><span class="n">/31</span><span class="w"> </span><span class="nx">t0c</span><span class="w"> </span><span class="mf">192.168</span><span class="o">.</span><span class="nf">81</span><span class="o">.</span><span class="nf">224</span><span class="n">/28</span><span class="w"> </span><span class="nx">t0c</span><span class="w"> </span><span class="mf">192.168</span><span class="o">.</span><span class="nf">81</span><span class="o">.</span><span class="nf">240</span><span class="n">/28</span><span class="w"> </span><span class="nx">t0c</span><span class="w"> </span><span class="mf">192.168</span><span class="o">.</span><span class="nf">83</span><span class="o">.</span><span class="nf">0</span><span class="n">/24</span><span class="w"> </span><span class="nx">169.254.160.1</span><span class="w"> </span><span class="nx">t1c</span><span class="w"> </span><span class="mf">192.168</span><span class="o">.</span><span class="nf">83</span><span class="o">.</span><span class="nf">0</span><span class="n">/24</span><span class="w"> </span><span class="nx">169.254.160.1</span><span class="w"> </span><span class="nx">t1c</span><span class="w"> </span></code></pre></div></div> <h1 id="vpn-connectivity">VPN Connectivity</h1> <p>I haven’t talked much about VPNs in this blog series, but it is an important component that deserves more attention. 
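</p>

<p>At a high level, a Cloud VPN (HA VPN) connection comes down to a handful of resources. The abbreviated Terraform sketch below uses real resource types from the hashicorp/google provider, but every name, address, and ASN is a placeholder, and only one tunnel is shown — a production setup needs a second tunnel (and ideally a second Cloud Router interface and BGP peer) for redundancy.</p>

<div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code># HA VPN gateway and Cloud Router in the VPC that is peered with GCVE
resource "google_compute_ha_vpn_gateway" "vpn_gw" {
  name    = "gcve-vpn-gw"
  network = "gcve-usw2" # placeholder VPC name
  region  = "us-west2"
}

resource "google_compute_router" "vpn_router" {
  name    = "gcve-vpn-router"
  network = "gcve-usw2"
  region  = "us-west2"

  bgp {
    asn = 64514 # placeholder private ASN for the Cloud Router
  }
}

# Represents the on-prem VPN device
resource "google_compute_external_vpn_gateway" "on_prem" {
  name            = "on-prem-gw"
  redundancy_type = "SINGLE_IP_INTERNALLY_REDUNDANT"

  interface {
    id         = 0
    ip_address = "203.0.113.10" # placeholder on-prem public IP
  }
}

# One IPsec tunnel; add a second for redundancy
resource "google_compute_vpn_tunnel" "tunnel0" {
  name                            = "tunnel0"
  region                          = "us-west2"
  vpn_gateway                     = google_compute_ha_vpn_gateway.vpn_gw.id
  vpn_gateway_interface           = 0
  peer_external_gateway           = google_compute_external_vpn_gateway.on_prem.id
  peer_external_gateway_interface = 0
  router                          = google_compute_router.vpn_router.id
  shared_secret                   = "use-a-variable-here"
  ike_version                     = 2
}

# BGP session over the tunnel
resource "google_compute_router_interface" "if0" {
  name       = "if-tunnel0"
  router     = google_compute_router.vpn_router.name
  region     = "us-west2"
  ip_range   = "169.254.0.1/30" # link-local BGP addressing
  vpn_tunnel = google_compute_vpn_tunnel.tunnel0.name
}

resource "google_compute_router_peer" "peer0" {
  name            = "peer-tunnel0"
  router          = google_compute_router.vpn_router.name
  region          = "us-west2"
  peer_ip_address = "169.254.0.2"
  peer_asn        = 64515 # placeholder on-prem ASN
  interface       = google_compute_router_interface.if0.name
}
</code></pre></div></div>

<p>In practice, pull the shared secret, addresses, and ASNs from variables rather than hard-coding them.</p>

<p>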
Provisioning a VPN to GCP is an easy way to connect to your GCVE environment if you are waiting on a Cloud Interconnect to be installed. It can also be used as backup connectivity if your primary connection fails. NSX-T can terminate an IPsec VPN, but I would recommend using <a href="https://cloud.google.com/network-connectivity/docs/vpn/concepts/overview">Cloud VPN</a> instead. This will ensure you have connectivity to any GCP-based resources along with GCVE.</p> <p>I’ve put together some example Terraform code to provision the necessary VPN-related resources in GCP. The example code is available at <a href="https://github.com/shamsway/gcp-terraform-examples">https://github.com/shamsway/gcp-terraform-examples</a> in the <code class="language-plaintext highlighter-rouge">gcve-ha-vpn</code> subdirectory. Using this example will create the minimum configuration needed to stand up a VPN to GCP/GCVE. It is assumed that you have already created a VPC and <a href="https://networkbrouhaha.com/2021/02/gcp-vpc-to-gcve/">configured peering with your GCVE cluster</a>. This example does not create a redundant VPN solution, but it can be easily extended to do so by creating a secondary Cloud Router, interface, and BGP peer. You can find more information on HA VPN topologies in the <a href="https://cloud.google.com/network-connectivity/docs/vpn/concepts/topologies">GCP documentation</a>. After using the example code, you will still need to configure the VPN settings at your site. Google provides configuration examples for several different vendors at <a href="https://cloud.google.com/network-connectivity/docs/vpn/how-to/interop-guides">Using third-party VPNs with Cloud VPN</a>. I’ve written previously about VPNs for cloud connectivity, as well as other connection methods, in <a href="/2020/11/cloud-connectivity-101/">Cloud Connectivity 101</a>.</p> <h1 id="dns-notes">DNS Notes</h1> <p>I’ve saved the most important topic for last.
DNS is a crucial component when operating in the cloud, so here are a few tips and recommendations to make sure you’re successful. <a href="https://cloud.google.com/dns">Cloud DNS</a> has a 100% uptime SLA, which is not something you see very often. This service is so vital to GCP that Google has essentially guaranteed that it will always be available. That is the type of guarantee that provides peace of mind, especially when you will have so many other services and applications relying on it.</p> <p>In terms of GCVE, you must be able to properly resolve the hostnames for vCenter, NSX, HCX, and other applications deployed in your environment. These topics are covered in detail at these links:</p> <ul> <li><a href="https://cloud.google.com/vmware-engine/docs/networking/howto-dns-on-premises">Configuring DNS for management appliance access</a></li> <li><a href="https://cloud.google.com/vmware-engine/docs/networking/howto-dns-profiles">Creating and applying DNS profiles</a></li> <li><a href="https://cloud.google.com/vmware-engine/docs/vmware-platform/howto-identity-sources">Configuring authentication using Active Directory</a></li> </ul> <p>The basic gist is this: the DNS servers running in your GCVE environment will be able to resolve A records for the management applications running in GCVE (vCenter, NSX, HCX, etc.). If you have <a href="/2021/02/gcp-vpc-to-gcve/">configured VPC peering with GCVE</a>, Cloud DNS will be automatically configured to forward requests to the GCVE DNS servers for any <code class="language-plaintext highlighter-rouge">gve.goog</code> hostname. This will allow you to resolve GCVE-related A records from your VPC or bastion host. The last step is to make sure that you can properly resolve GCVE-related hostnames in your local environment. If you are using Windows Server for DNS, you need to configure a conditional forwarder for <code class="language-plaintext highlighter-rouge">gve.goog</code>, using the DNS servers running in GCVE.
Other scenarios, like configuring BIND, are covered in the documentation links above.</p> <h1 id="wrap-up">Wrap Up</h1> <p>This is a doozy of a post, so I won’t waste too many words here. I genuinely hope you enjoyed this blog series. There will definitely be more GCVE-related blogs in the future, and you can hit me up any time <a href="https://twitter.com/NetworkBrouhaha">@NetworkBrouhaha</a> and let me know what topics you’d like to see covered. Thanks for reading!</p> <h1 id="helpful-resources">Helpful Resources</h1> <ul> <li><a href="https://cloud.google.com/vmware-engine/docs">Google Cloud VMware Engine documentation</a></li> <li><a href="https://cloud.google.com/architecture/private-cloud-networking-for-vmware-engine">Private cloud networking for Google Cloud VMware Engine</a> Whitepaper</li> <li><a href="https://cloud.google.com/vmware-engine/docs/workloads/howto-migrate-vms-using-hcx">Migrating VMware VMs using VMware HCX</a></li> <li><a href="https://cloud.vmware.com/community/2021/02/25/introducing-google-cloud-vmware-engine-logical-design-poster-workload-mobility/">Google Cloud VMware Engine Logical Design Poster for Workload Mobility</a></li> <li><a href="https://cloud.google.com/dns">Cloud DNS</a></li> <li><a href="https://cloud.google.com/storage/docs/gcs-fuse">Cloud Storage FUSE</a></li> <li><a href="https://github.com/GoogleCloudPlatform/gcsfuse">gcsfuse</a></li> <li><a href="https://cloud.google.com/sdk/docs/install">Installing Google Cloud SDK</a></li> <li><a href="https://cloud.google.com/network-connectivity/docs/vpn">Cloud VPN documentation</a></li> <li><a href="https://cloud.google.com/community/tutorials/deploy-ha-vpn-with-terraform">Tutorial: Deploy HA VPN with Terraform</a></li> <li><a href="https://cloud.google.com/network-connectivity/docs/vpn/concepts/topologies">Cloud VPN Topologies</a></li> <li><a href="https://cloud.google.com/network-connectivity/docs/vpn/how-to/interop-guides">Using third-party VPNs with Cloud VPN</a></li> <li><a
href="https://cloud.google.com/blog/products/compute/how-to-use-multi-vpcs-with-google-cloud-vmware-engine">How to use multi-VPC networking in Google Cloud VMware Engine</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs">Google Cloud Platform Provider</a> for Terraform</li> <li>My <a href="https://github.com/shamsway/gcp-terraform-examples">GCP Terraform Examples</a></li> </ul> <p>You can find a hands-on lab for GCVE by browsing to <a href="https://labs.hol.vmware.com/">https://labs.hol.vmware.com/</a> and searching for <code class="language-plaintext highlighter-rouge">HOL-2179-01-ISM</code>.</p> Tue, 04 May 2021 00:00:00 +0000 http://www.networkbrouhaha.com/2021/05/gcve-networking-scenarios/ http://www.networkbrouhaha.com/2021/05/gcve-networking-scenarios/ Intro to Google Cloud VMware Engine – HCX Configuration <p>Now that we have an SDDC running in Google Cloud VMware Engine, it is time to migrate some workloads into the cloud! <a href="https://cloud.vmware.com/vmware-hcx">VMware HCX</a> will be the tool I use to migrate Virtual Machines to GCVE. If you recall from the first post in this series, HCX was included in our SDDC deployment, so there is no further configuration needed in GCVE for HCX. The GCVE docs <a href="https://cloud.google.com/vmware-engine/docs/workloads/howto-migrate-vms-using-hcx#prepare-for-hcx-manager-installation-on-premises">cover installing and configuring the on-prem components for HCX</a>, so I’m not going to cover those steps in this post. As with previous posts, I will be taking an “automation first” approach to configuring HCX with Terraform.
All of the code referenced in this post is available at <a href="https://github.com/shamsway/gcp-terraform-examples">https://github.com/shamsway/gcp-terraform-examples</a> in the <code class="language-plaintext highlighter-rouge">gcve-hcx</code> sub-directory.</p> <p><strong>Other posts in this series:</strong></p> <ul> <li><a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a></li> <li><a href="/2021/02/gcp-vpc-to-gcve/">Connecting a VPC to GCVE</a></li> <li><a href="/2021/03/gcve-bastion/">Bastion Host Access with IAP</a></li> <li><a href="/2021/03/gcve-network-overview/">Network and Connectivity Overview</a></li> <li><a href="/2021/05/gcve-networking-scenarios/">Common Networking Scenarios</a></li> </ul> <p>Before we look at configuring HCX with Terraform, there are a few items to consider. The provider I’m using to configure HCX, <a href="https://registry.terraform.io/providers/adeleporte/hcx/">adeleporte/hcx</a>, is a community provider. It is not supported by VMware. It is also under active development, so you may run across a bug or some outdated documentation. In my testing of the provider, I have found that it works well for an environment with a single service mesh but needs some improvements to support environments with multiple service meshes.</p> <p>Part of the beauty of open-source software is that anyone can contribute code. If you would like to submit an issue to track a bug, update documentation, or add new functionality, cruise over to the <a href="https://github.com/adeleporte/terraform-provider-hcx">GitHub repo</a> to get started.</p> <h1 id="hcx-configuration-with-terraform">HCX Configuration with Terraform</h1> <p>Configuring HCX involves configuring network profiles and a compute profile, which are then referenced in a service mesh configuration. The service mesh facilitates the migration of VMs to and from the cloud. 
The <a href="https://docs.vmware.com/en/VMware-HCX/4.0/hcx-user-guide/GUID-5D2F1312-EB62-4B25-AF88-9ADE129EDB57.html">HCX documentation</a> describes these components in detail, and I recommend reading through the user guide if you plan on performing a migration of any scale.</p> <p>The example Terraform code linked at the beginning of the post will do the following:</p> <ul> <li>Create a <a href="https://docs.vmware.com/en/VMware-HCX/4.0/hcx-user-guide/GUID-4BA6FBD4-ED66-4BE0-A216-6F6FFE1E8A20.html">site pairing</a> between your on-premises data center and your GCVE SDDC</li> <li>Add two <a href="https://docs.vmware.com/en/VMware-HCX/4.0/hcx-user-guide/GUID-184FCA54-D0CB-4931-B0E8-A81CD6120C52.html">network profiles</a>, one for management traffic and another for vMotion traffic. Network profiles for uplink and replication traffic can also be created, but in this example, I will use the management network for those functions.</li> <li>Create a <a href="https://docs.vmware.com/en/VMware-HCX/4.0/hcx-user-guide/GUID-BBAC979E-8899-45AD-9E01-98A132CE146E.html">compute profile</a> consisting of the network profiles created, and other parameters specific to your environment, like the datastore in use.</li> <li>Create a <a href="https://docs.vmware.com/en/VMware-HCX/4.0/hcx-user-guide/GUID-46AED982-8ED2-4CB1-807E-FEFD18FAC0DD.html">service mesh</a> between your on-prem data center and GCVE SDDC. This links the two compute profiles at each site for migration and sets other parameters, like the HCX features to enable.</li> <li><a href="https://docs.vmware.com/en/VMware-HCX/4.0/hcx-user-guide/GUID-DD9C3316-D01C-4088-B3EA-84ADB9FED573.html">Extend a network</a> from your on-prem data center into your GCVE SDDC.</li> </ul> <p>After Terraform completes the configuration, you will be able to migrate VMs from your on-prem data center into your GCVE SDDC. 
To get started, clone the example repo with <code class="language-plaintext highlighter-rouge">git clone https://github.com/shamsway/gcp-terraform-examples.git</code>, then change to the <code class="language-plaintext highlighter-rouge">gcve-hcx</code> sub-directory. You will find these files:</p> <ul> <li><code class="language-plaintext highlighter-rouge">main.tf</code> – Contains the primary Terraform code to complete the steps mentioned above</li> <li><code class="language-plaintext highlighter-rouge">variables.tf</code> – Defines the input variables that will be used in <code class="language-plaintext highlighter-rouge">main.tf</code></li> </ul> <p>Let’s take a look at the code that makes up this example.</p> <h2 id="maintf-contents">main.tf Contents</h2> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">terraform</span> <span class="p">{</span> <span class="nx">required_providers</span> <span class="p">{</span> <span class="nx">hcx</span> <span class="p">=</span> <span class="p">{</span> <span class="nx">source</span> <span class="p">=</span> <span class="s2">"adeleporte/hcx"</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <p>Unlike previous examples, this one does not start with a <code class="language-plaintext highlighter-rouge">provider</code> block. 
Instead, this <code class="language-plaintext highlighter-rouge">terraform</code> block will download and install the <code class="language-plaintext highlighter-rouge">adeleporte/hcx</code> provider from <code class="language-plaintext highlighter-rouge">registry.terraform.io</code>, which is a handy shortcut for installing community providers.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">provider</span> <span class="s2">"hcx"</span> <span class="p">{</span> <span class="nx">hcx</span> <span class="p">=</span> <span class="s2">"https://your.hcx.url"</span> <span class="nx">admin_username</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">hcx_admin_username</span> <span class="nx">admin_password</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">hcx_admin_password</span> <span class="nx">username</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">hcx_username</span> <span class="nx">password</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">hcx_password</span> <span class="p">}</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">provider</code> block specifies the URL for your HCX appliance, along with admin credentials (those used to access the appliance management UI over port 9443) and user credentials for the standard HCX UI. During my testing, I had to use an IP address instead of an FQDN for my HCX appliance. Note that this example has the URL specified directly in the code instead of using a variable. 
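</p>

<p>If you would rather not hardcode the appliance URL, you could promote it to an input variable. This is a sketch of my own, not part of the example repo — the <code class="language-plaintext highlighter-rouge">hcx_url</code> variable name is hypothetical and would need a matching entry in <code class="language-plaintext highlighter-rouge">variables.tf</code>:</p>

```tf
# Hypothetical variable, added to variables.tf
variable "hcx_url" {
  description = "URL (or IP address) of the on-prem HCX appliance"
  type        = string
}

# The provider block in main.tf would then reference it
provider "hcx" {
  hcx            = var.hcx_url
  admin_username = var.hcx_admin_username
  admin_password = var.hcx_admin_password
  username       = var.hcx_username
  password       = var.hcx_password
}
```

<p>The value could then be supplied with a <code class="language-plaintext highlighter-rouge">TF_VAR_hcx_url</code> environment variable, the same way the credentials are passed in.</p>

<p>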
You will need to edit <code class="language-plaintext highlighter-rouge">main.tf</code> to set this value, along with a few other values that you will see below.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"hcx_site_pairing"</span> <span class="s2">"gcve"</span> <span class="p">{</span> <span class="nx">url</span> <span class="p">=</span> <span class="s2">"https://gcve.hcx.url"</span> <span class="nx">username</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">gcve_hcx_username</span> <span class="nx">password</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">gcve_hcx_password</span> <span class="p">}</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">hcx_site_pairing</code> resource creates a site pairing between your on-prem and GCVE-based HCX appliances. This allows both HCX appliances to exchange information about their local environments and is a prerequisite to creating the service mesh. I used the FQDN of the HCX server running in GCVE for the <code class="language-plaintext highlighter-rouge">url</code> parameter, but I had previously configured DNS resolution between my lab and my GCVE environment. 
You can find the IP and FQDN of your HCX server in GCVE by browsing to <code class="language-plaintext highlighter-rouge">Resources &gt; [Your SDDC] &gt; vSphere Management Network</code>.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"hcx_network_profile"</span> <span class="s2">"net_management_gcve"</span> <span class="p">{</span> <span class="nx">site_pairing</span> <span class="p">=</span> <span class="nx">hcx_site_pairing</span><span class="p">.</span><span class="nx">gcve</span> <span class="nx">network_name</span> <span class="p">=</span> <span class="s2">"Management network name"</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"Management network profile name"</span> <span class="nx">mtu</span> <span class="p">=</span> <span class="mi">1500</span> <span class="nx">ip_range</span> <span class="p">{</span> <span class="nx">start_address</span> <span class="p">=</span> <span class="s2">"172.17.10.10"</span> <span class="nx">end_address</span> <span class="p">=</span> <span class="s2">"172.17.10.13"</span> <span class="p">}</span> <span class="nx">gateway</span> <span class="p">=</span> <span class="s2">"172.17.10.1"</span> <span class="nx">prefix_length</span> <span class="p">=</span> <span class="mi">24</span> <span class="nx">primary_dns</span> <span class="p">=</span> <span class="s2">"172.17.10.2"</span> <span class="nx">secondary_dns</span> <span class="p">=</span> <span class="s2">"172.17.10.3"</span> <span class="nx">dns_suffix</span> <span class="p">=</span> <span class="s2">"yourcompany.biz"</span> <span class="p">}</span> </code></pre></div></div> <p>This block and the block immediately following it add new network profiles to your local HCX server. Network profiles specify a local network to use for specific traffic (management, uplink, vMotion, or replication) as well as an IP range reserved for use by HCX appliances. 
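</p>

<p>The companion vMotion profile follows the same pattern. Here is a sketch — the network names and addresses are placeholders to adapt to your environment, and DNS settings are typically unnecessary for a vMotion network:</p>

```tf
# vMotion network profile, referenced later by the compute profile
resource "hcx_network_profile" "net_vmotion_gcve" {
  site_pairing = hcx_site_pairing.gcve
  network_name = "vMotion network name"
  name         = "vMotion network profile name"
  mtu          = 1500

  # Small IP range reserved for the HCX appliances
  ip_range {
    start_address = "172.17.11.10"
    end_address   = "172.17.11.13"
  }

  gateway       = "172.17.11.1"
  prefix_length = 24
}
```

<p>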
For smaller deployments, it is OK to use one network profile for multiple traffic types. This example creates a management network profile, which will also be used for uplink and replication traffic, and another profile dedicated for vMotion.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"hcx_compute_profile"</span> <span class="s2">"compute_profile_1"</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"SJC-CP"</span> <span class="nx">datacenter</span> <span class="p">=</span> <span class="s2">"San Jose"</span> <span class="nx">cluster</span> <span class="p">=</span> <span class="s2">"Compute Cluster"</span> <span class="nx">datastore</span> <span class="p">=</span> <span class="s2">"comp-vsanDatastore"</span> <span class="nx">depends_on</span> <span class="p">=</span> <span class="p">[</span> <span class="nx">hcx_network_profile</span><span class="p">.</span><span class="nx">net_management_gcve</span><span class="p">,</span> <span class="nx">hcx_network_profile</span><span class="p">.</span><span class="nx">net_vmotion_gcve</span> <span class="p">]</span> <span class="nx">management_network</span> <span class="p">=</span> <span class="nx">hcx_network_profile</span><span class="p">.</span><span class="nx">net_management_gcve</span><span class="p">.</span><span class="nx">id</span> <span class="nx">replication_network</span> <span class="p">=</span> <span class="nx">hcx_network_profile</span><span class="p">.</span><span class="nx">net_management_gcve</span><span class="p">.</span><span class="nx">id</span> <span class="nx">uplink_network</span> <span class="p">=</span> <span class="nx">hcx_network_profile</span><span class="p">.</span><span class="nx">net_management_gcve</span><span class="p">.</span><span class="nx">id</span> <span class="nx">vmotion_network</span> <span class="p">=</span> <span 
class="nx">hcx_network_profile</span><span class="p">.</span><span class="nx">net_vmotion_gcve</span><span class="p">.</span><span class="nx">id</span> <span class="nx">dvs</span> <span class="p">=</span> <span class="s2">"nsx-overlay-transportzone"</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"INTERCONNECT"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"WANOPT"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"VMOTION"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"BULK_MIGRATION"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"NETWORK_EXTENSION"</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">hcx_compute_profile</code> resource defines the compute, storage, and networking components at the local site that will participate in a service mesh. Compute and storage settings are defined at the beginning of the block. The management profile previously created is also specified for the replication and uplink networks. Finally, the <code class="language-plaintext highlighter-rouge">service</code> statements define which HCX features are enabled for the compute profile. If you attempt to enable a feature that you are not licensed for, Terraform will return an error.</p> <p>There are two things to note with this resource. First, the <code class="language-plaintext highlighter-rouge">dvs</code> parameter is not accurately named. 
It would be more accurate to name this parameter <code class="language-plaintext highlighter-rouge">network_container</code> or something similar. In this example, I am referencing an NSX transport zone instead of a DVS. This is a valid setup as long as you have NSX registered with your HCX server, so some work is needed to update this provider to reflect that capability. Second, I’ve added a <code class="language-plaintext highlighter-rouge">depends_on</code> statement. I noticed during my testing that this provider would occasionally attempt to remove resources out of order, which ultimately would cause <code class="language-plaintext highlighter-rouge">terraform destroy</code> to fail. Using the <code class="language-plaintext highlighter-rouge">depends_on</code> statement fixes this issue, but some additional logic will need to be added to the provider to better understand resource dependencies. I’ve also added <code class="language-plaintext highlighter-rouge">depends_on</code> statements to the following blocks for the same reason.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"hcx_service_mesh"</span> <span class="s2">"service_mesh_1"</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"Service Mesh Name"</span> <span class="nx">site_pairing</span> <span class="p">=</span> <span class="nx">hcx_site_pairing</span><span class="p">.</span><span class="nx">gcve</span> <span class="nx">local_compute_profile</span> <span class="p">=</span> <span class="nx">hcx_compute_profile</span><span class="p">.</span><span class="nx">compute_profile_1</span><span class="p">.</span><span class="nx">name</span> <span class="nx">remote_compute_profile</span> <span class="p">=</span> <span class="s2">"GCVE Compute Profile"</span> <span class="nx">depends_on</span> <span class="p">=</span> <span class="p">[</span> <span
class="nx">hcx_compute_profile</span><span class="p">.</span><span class="nx">compute_profile_1</span> <span class="p">]</span> <span class="nx">app_path_resiliency_enabled</span> <span class="p">=</span> <span class="kc">false</span> <span class="nx">tcp_flow_conditioning_enabled</span> <span class="p">=</span> <span class="kc">false</span> <span class="nx">uplink_max_bandwidth</span> <span class="p">=</span> <span class="mi">10000</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"INTERCONNECT"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"WANOPT"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"VMOTION"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"BULK_MIGRATION"</span> <span class="p">}</span> <span class="nx">service</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="s2">"NETWORK_EXTENSION"</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">hcx_service_mesh</code> resource is where the magic happens. This block creates the service mesh between your on-prem data center and your GCVE SDDC by deploying multiple appliances at both sites and building encrypted tunnels between them. Once this process is complete, you will be able to migrate VMs into GCVE. Notice that the configuration is relatively basic, referencing the site pairing and local compute profile configured by Terraform. 
You will need to know the name of the compute profile in GCVE, but if you are using the default configuration, it should be <code class="language-plaintext highlighter-rouge">GCVE Compute Profile</code>. Similar to the compute profile, the <code class="language-plaintext highlighter-rouge">service</code> parameters define which features are enabled on the service mesh. Typically, the services enabled in your compute profile should match the services enabled in your service mesh.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"hcx_l2_extension"</span> <span class="s2">"l2_extension_1"</span> <span class="p">{</span> <span class="nx">site_pairing</span> <span class="p">=</span> <span class="nx">hcx_site_pairing</span><span class="p">.</span><span class="nx">gcve</span> <span class="nx">service_mesh_id</span> <span class="p">=</span> <span class="nx">hcx_service_mesh</span><span class="p">.</span><span class="nx">service_mesh_1</span><span class="p">.</span><span class="nx">id</span> <span class="nx">source_network</span> <span class="p">=</span> <span class="s2">"Name of local network to extend"</span> <span class="nx">network_type</span> <span class="p">=</span> <span class="s2">"NsxtSegment"</span> <span class="nx">depends_on</span> <span class="p">=</span> <span class="p">[</span> <span class="nx">hcx_service_mesh</span><span class="p">.</span><span class="nx">service_mesh_1</span> <span class="p">]</span> <span class="nx">destination_t1</span> <span class="p">=</span> <span class="s2">"Tier1"</span> <span class="nx">gateway</span> <span class="p">=</span> <span class="s2">"192.168.10.1"</span> <span class="nx">netmask</span> <span class="p">=</span> <span class="s2">"255.255.255.0"</span> <span class="p">}</span> </code></pre></div></div> <p>This final block is optional but helpful in testing a migration. 
This block extends a network from your data center into GCVE using HCX Network Extension. This example extends an NSX segment, but the <a href="https://registry.terraform.io/providers/adeleporte/hcx/latest/docs/resources/l2_extension">hcx_l2_extension resource documentation</a> provides the parameters needed to extend a DVS-based network. You will need to know the name of the tier 1 router in GCVE you wish to connect this network to.</p> <h2 id="variables-used">Variables Used</h2> <p>The following input variables are required for this example:</p> <ul> <li><code class="language-plaintext highlighter-rouge">hcx_admin_username</code>: Username for on-prem HCX appliance management. Default value is <code class="language-plaintext highlighter-rouge">admin</code>.</li> <li><code class="language-plaintext highlighter-rouge">hcx_admin_password</code>: Password for on-prem HCX appliance management</li> <li><code class="language-plaintext highlighter-rouge">hcx_username</code>: Username for on-prem HCX instance</li> <li><code class="language-plaintext highlighter-rouge">hcx_password</code>: Password for on-prem HCX instance</li> <li><code class="language-plaintext highlighter-rouge">gcve_hcx_username</code>: Username for GCVE HCX instance. Default value is <code class="language-plaintext highlighter-rouge">[email protected]</code></li> <li><code class="language-plaintext highlighter-rouge">gcve_hcx_password</code>: Password for GCVE HCX instance</li> </ul> <h3 id="using-environment-variables">Using Environment Variables</h3> <p>You can use the following commands on macOS or Linux to provide these variable values via environment variables. 
This is a good practice when passing credentials to Terraform.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">TF_VAR_hcx_admin_username</span><span class="o">=</span><span class="s1">'admin'</span> <span class="nb">export </span><span class="nv">TF_VAR_hcx_admin_password</span><span class="o">=</span><span class="s1">'password'</span> <span class="nb">export </span><span class="nv">TF_VAR_hcx_username</span><span class="o">=</span><span class="s1">'[email protected]'</span> <span class="nb">export </span><span class="nv">TF_VAR_hcx_password</span><span class="o">=</span><span class="s1">'password'</span> <span class="nb">export </span><span class="nv">TF_VAR_gcve_hcx_username</span><span class="o">=</span><span class="s1">'[email protected]'</span> <span class="nb">export </span><span class="nv">TF_VAR_gcve_hcx_password</span><span class="o">=</span><span class="s1">'password'</span> </code></pre></div></div> <p>You can use the <code class="language-plaintext highlighter-rouge">unset</code> command to remove these environment variables, if necessary.</p> <h2 id="initializing-and-running-terraform">Initializing and Running Terraform</h2> <p>See the <a href="https://github.com/shamsway/gcp-terraform-examples/blob/main/gcve-hcx/README.md">README</a> included in the example repo for the steps required to initialize and run Terraform. This is the same process as previous examples.</p> <h1 id="final-thoughts">Final Thoughts</h1> <p>It feels good to finally be able to migrate some workloads into our GCVE environment! Admittedly, this example is a bit of a stretch and may not be useful for all HCX users. My team works heavily with HCX, and we are frequently standing up or removing an HCX service mesh for various environments. This provider will be a huge time saver for us and will be especially valuable once there are a few fixes and improvements. 
Configuring HCX via the UI is an excellent option for new users, but once you are standing up your tenth service mesh, it becomes apparent that using Terraform is much quicker than clicking through several dialogs. I also believe that seeing the HCX configuration represented in Terraform code provides an excellent overview of all of the configuration needed, and how the different components stack together like Legos to form a complete service mesh.</p> <p>What about automating the actual migration of VMs? This example prepares our environment for migration, but automating VM migration is better suited to a different tool than Terraform. Luckily, there are plenty of HCX-specific cmdlets in <a href="https://developer.vmware.com/powercli">PowerCLI</a>. Check out these <a href="https://blogs.vmware.com/PowerCLI/2019/02/getting-started-hcx-module.html">existing</a> <a href="https://code.vmware.com/samples?categories=Sample&amp;tags=HCX">resources</a> for some examples of using PowerCLI with HCX.</p> <p>This blog series is approaching its conclusion, but in my next post I’ll dive into configuring some common network use cases, like exposing a VM to the internet and configuring a load balancer in GCVE.</p> <h1 id="helpful-links">Helpful Links</h1> <ul> <li><a href="https://cloud.google.com/vmware-engine/docs/workloads/howto-migrate-vms-using-hcx">Migrating VMware VMs using VMware HCX</a></li> <li><a href="https://labs.hol.vmware.com/HOL/catalogs/lab/8843">Google Cloud VMware Engine Overview</a> Hands-on Lab, which includes HCX configuration.</li> <li><a href="https://registry.terraform.io/providers/adeleporte/hcx/">adeleporte/hcx</a> community Terraform provider for HCX</li> <li><a href="https://registry.terraform.io/providers/adeleporte/hcx/latest/docs/guides/lab">HCX Lab - Full HCX Connector configuration</a> Terraform example</li> <li><a href="https://registry.terraform.io/providers/adeleporte/hcx/latest/docs/resources/site_pairing">hcx_site_pairing 
Resource</a></li> <li><a href="https://registry.terraform.io/providers/adeleporte/hcx/latest/docs/resources/network_profile">hcx_network_profile Resource</a></li> <li><a href="https://registry.terraform.io/providers/adeleporte/hcx/latest/docs/resources/compute_profile">hcx_compute_profile Resource</a></li> <li><a href="https://registry.terraform.io/providers/adeleporte/hcx/latest/docs/resources/service_mesh">hcx_service_mesh Resource</a></li> <li><a href="https://registry.terraform.io/providers/adeleporte/hcx/latest/docs/resources/l2_extension">hcx_l2_extension Resource</a></li> <li><a href="https://blogs.vmware.com/PowerCLI/2019/02/getting-started-hcx-module.html">Getting Started with the PowerCLI HCX Module</a></li> <li><a href="https://code.vmware.com/samples?categories=Sample&amp;tags=HCX">PowerCLI Example Scripts for HCX</a></li> </ul> <h1 id="screenshots">Screenshots</h1> <p>Below are screenshots from HCX showing the results of running this Terraform example in my lab, for reference. 
I have modified the example code to match the configuration of my lab environment.</p> <p class="center"><a href="/resources/2021/04/39_hcx_np.png" class="drop-shadow"><img src="/resources/2021/04/39_hcx_np.png" alt="" /></a> HCX Network Profiles</p> <p class="center"><a href="/resources/2021/04/40_hcx_cp.png" class="drop-shadow"><img src="/resources/2021/04/40_hcx_cp.png" alt="" /></a> HCX Compute Profile</p> <p class="center"><a href="/resources/2021/04/41_hcx_sm_edited.png" class="drop-shadow"><img src="/resources/2021/04/41_hcx_sm_edited.png" alt="" /></a> HCX Service Mesh</p> <p class="center"><a href="/resources/2021/04/42_hcx_sm_appliance_details.png" class="drop-shadow"><img src="/resources/2021/04/42_hcx_sm_appliance_details.png" alt="" /></a> HCX Service Mesh Appliance Details</p> <p class="center"><a href="/resources/2021/04/43_hcx_ne_edited.png" class="drop-shadow"><img src="/resources/2021/04/43_hcx_ne_edited.png" alt="" /></a> HCX Network Extension</p> <p class="center"><a href="/resources/2021/04/45_hcx_vmotion_edited.png" class="drop-shadow"><img src="/resources/2021/04/45_hcx_vmotion_edited.png" alt="" /></a> HCX vMotion Test</p> Mon, 12 Apr 2021 00:00:00 +0000 http://www.networkbrouhaha.com/2021/04/gcve-hcx-config/ http://www.networkbrouhaha.com/2021/04/gcve-hcx-config/ Intro to Google Cloud VMware Engine – Network and Connectivity Overview <p>In previous posts, I’ve shown you how to deploy an SDDC in Google Cloud VMware Engine, connect the SDDC to a VPC, and deploy a bastion host for managing your environment. In this post, we’ll take a pause on deploying anything new to take a closer look at our SDDC. 
This post will provide an overview of the SDDC’s networking configuration and capabilities, and how to connect to it from an external site.</p> <p><strong>Other posts in this series:</strong></p> <ul> <li><a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a></li> <li><a href="/2021/02/gcp-vpc-to-gcve/">Connecting a VPC to GCVE</a></li> <li><a href="/2021/03/gcve-bastion/">Bastion Host Access with IAP</a></li> <li><a href="/2021/04/gcve-hcx-config/">HCX Configuration</a></li> <li><a href="/2021/05/gcve-networking-scenarios/">Common Networking Scenarios</a></li> </ul> <h1 id="sddc-networking-overview">SDDC Networking Overview</h1> <p class="center"><a href="/resources/2021/03/gcve_arch.png" class="drop-shadow"><img src="/resources/2021/03/gcve_arch.png" alt="" /></a> Google Cloud VMware Engine Overview by Google, licensed under <a href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a></p> <p>An SDDC running in GCVE consists of VMware vSphere, vCenter, vSAN, NSX-T, and optionally HCX, all running on top of Google Cloud infrastructure. Let’s take a peek at an SDDC deployment.</p> <h3 id="vds-and-n-vds-configuration">VDS and N-VDS Configuration</h3> <p class="center"><a href="/resources/2021/03/25_gcve_dvs_edited.png" class="drop-shadow"><img src="/resources/2021/03/25_gcve_dvs_edited.png" alt="" /></a></p> <p>Configuration of the single VDS in the SDDC is basic and is used to provide connectivity for HCX. The VLANs listed are locally significant to Google’s infrastructure and not something we need to worry about.</p> <p class="center"><a href="/resources/2021/03/26_gcve_virtual_switches_edited.png" class="drop-shadow"><img src="/resources/2021/03/26_gcve_virtual_switches_edited.png" alt="" /></a></p> <p>The virtual switch settings for one of the ESXi hosts provide a better picture of the networking landscape. Here we can see the vanilla VDS alongside the N-VDS managed by NSX-T. 
Almost all of the networking configuration we will perform will be in NSX-T, but I wanted to show the underlying configuration for curious individuals.</p> <p class="center"><a href="/resources/2021/03/36_nsxt_nvds_visual_edited.png" class="drop-shadow"><img src="/resources/2021/03/36_nsxt_nvds_visual_edited.png" alt="" /></a></p> <p>We’ll look at NSX-T further below, but this screenshot from NSX-T is a simple visualization of the N-VDS deployed.</p> <h3 id="vmkernel-and-vmnic-configuration">VMkernel and vmnic Configuration</h3> <p class="center"><a href="/resources/2021/03/28_gcve_vmk_edited.png" class="drop-shadow"><img src="/resources/2021/03/28_gcve_vmk_edited.png" alt="" /></a></p> <p>VMkernel configuration is straightforward, with dedicated adapters for management, vSAN, and vMotion. The IP addresses correspond with the management, vSAN, and vMotion subnets that were automatically created when the SDDC was deployed.</p> <p class="center"><a href="/resources/2021/03/27_gcve_phys_adapters_edited.png" class="drop-shadow"><img src="/resources/2021/03/27_gcve_phys_adapters_edited.png" alt="" /></a></p> <p>There are four 25 Gbps vmnics (physical adapters) in each host, providing an aggregate of 100 Gbps per host. Two vmnics are dedicated to the VDS, and two are dedicated to the N-VDS.</p> <h3 id="nsx-t-configuration">NSX-T Configuration</h3> <p class="center"><a href="/resources/2021/03/30_gcve_t0_bgp.png" class="drop-shadow"><img src="/resources/2021/03/30_gcve_t0_bgp.png" alt="" /></a></p> <p>The out-of-the-box NSX-T configuration for GCVE should look very familiar to you if you have ever deployed <a href="https://www.vmware.com/products/cloud-foundation.html">VMware Cloud Foundation</a>. 
The T0 router has redundant BGP connections to Google’s infrastructure.</p> <p class="center"><a href="/resources/2021/03/31_gcve_nsx_firewall.png" class="drop-shadow"><img src="/resources/2021/03/31_gcve_nsx_firewall.png" alt="" /></a></p> <p>There are no NAT rules configured, and the firewall has a default <code class="language-plaintext highlighter-rouge">allow any any</code> rule. This may not be what you were expecting, but by the end of this post, it should be more clear. We will look at traffic flows in the <strong>SDDC Networking Capabilities</strong> section below.</p> <p class="center"><a href="/resources/2021/03/32_gcve_tzs.png" class="drop-shadow"><img src="/resources/2021/03/32_gcve_tzs.png" alt="" /></a></p> <p>The configured transport zones consist of three VLAN TZs, and a single overlay TZ. The VLAN TZs facilitate the plumbing between the T0 router and Google infrastructure for BGP peering. The <code class="language-plaintext highlighter-rouge">TZ-OVERLAY</code> zone is where workload segments will be placed.</p> <p class="center"><a href="/resources/2021/03/35_gcve_edge_nodes_edited.png" class="drop-shadow"><img src="/resources/2021/03/35_gcve_edge_nodes_edited.png" alt="" /></a></p> <p>Finally, there is one edge cluster consisting of two edge nodes to host the NSX-T logical routers.</p> <h1 id="sddc-networking-capabilities">SDDC Networking Capabilities</h1> <p>Now that we’ve peeked behind the curtain, let’s talk about what you can actually <em>do</em> with your SDDC. 
This is by no means an exhaustive list, but here are some common use cases:</p> <ul> <li>Create workload segments in NSX-T</li> <li>Expose VMs or services to the internet via public IP</li> <li>Leverage NSX-T load balancing capabilities</li> <li>Create north-south firewall policies with the NSX-T gateway firewall</li> <li>Create east-west firewall policies (i.e., micro-segmentation) with the NSX-T distributed firewall</li> <li>Access and consume Google Cloud native services</li> <li>Migrate VMs from your on-prem data center to your GCVE SDDC with VMware HCX</li> </ul> <p>I will be covering many of these topics in future posts, including automation examples. Next, let’s look at the options for ingress and egress traffic.</p> <h3 id="egress-traffic">Egress Traffic</h3> <p class="center"><a href="/resources/2021/03/gcve_egress.png" class="drop-shadow"><img src="/resources/2021/03/gcve_egress.png" alt="" /></a> Google Cloud VMware Engine Egress Traffic Flows by Google, licensed under <a href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a></p> <p>One of the strengths of GCVE is that it provides you with options. As you can see on this diagram, you have three options for egress traffic:</p> <ol> <li>Egress through the GCVE internet gateway</li> <li>Egress through an attached VPC</li> <li>Egress through your on-prem data center via Cloud Interconnect or Cloud VPN</li> </ol> <p>In <a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a>, I walked through the steps to enable <code class="language-plaintext highlighter-rouge">Internet Access</code> and <code class="language-plaintext highlighter-rouge">Public IP Service</code> for your SDDC. This is all that is needed to provide egress internet access through the internet gateway. 
Internet-bound traffic will be routed from the T0 router to the internet gateway, which NATs all traffic behind a public IP.</p> <p>Egress through an attached VPC or on-prem data center requires additional steps that are beyond the scope of this post, but I will provide documentation links for these scenarios at the end of the post.</p> <h3 id="ingress-traffic">Ingress Traffic</h3> <p class="center"><a href="/resources/2021/03/gcve_ingress.png" class="drop-shadow"><img src="/resources/2021/03/gcve_ingress.png" alt="" /></a> Google Cloud VMware Engine Ingress Traffic Flows by Google, licensed under <a href="https://creativecommons.org/licenses/by/3.0/">CC BY 3.0</a></p> <p>Ingress traffic to GCVE follows paths similar to egress traffic. You can ingress via the public IP service, a connected VPC, or through your on-prem data center. Using the public IP service is the least complicated option and requires that you’ve enabled <code class="language-plaintext highlighter-rouge">Public IP Service</code> for your SDDC.</p> <p class="center"><a href="/resources/2021/03/37_allocate_public_ip.png" class="drop-shadow"><img src="/resources/2021/03/37_allocate_public_ip.png" alt="" /></a></p> <p>Public IPs are not assigned directly to VMs. Instead, a public IP is allocated and NATed to a private IP in your SDDC. You can allocate a public IP in the GCVE portal by supplying a name for the IP allocation, the region, and the private address.</p> <h1 id="connecting-to-your-sddc">Connecting to your SDDC</h1> <p>My previous post, <a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a>, outlines the steps to set up client VPN access to your SDDC, and <a href="/2021/03/gcve-bastion/">Bastion Host Access with IAP</a> provides an example bastion host setup for managing your SDDC. These are “day 1” options for connectivity, so you will likely need some other method to connect your on-prem data center to your GCVE SDDC. 
I covered cloud connectivity options in <a href="/2020/11/cloud-connectivity-101/">Cloud Connectivity 101</a>, and many of the methods outlined in that post are available for connecting to GCVE. Today, your options are to use <a href="https://cloud.google.com/network-connectivity/docs/interconnect">Cloud Interconnect</a> or an IPSec tunnel via <a href="https://cloud.google.com/network-connectivity/docs/vpn/concepts/overview">Cloud VPN</a> or <a href="https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/administration/GUID-A8B113EC-3D53-41A5-919E-78F1A3705F58.html">NSX-T IPSec VPN</a>.</p> <p>In our lab, we are lucky to have a connection to <a href="https://www.megaport.com/">Megaport</a>, so I am using Partner Interconnect for my testing with GCVE. This is a very easy solution for connecting to the cloud, and their documentation provides simple step-by-step instructions to get up and running. Once complete, BGP peering will be established between the Megaport Cloud Router and a Google Cloud Router.</p> <h3 id="advertising-routes-to-gcve">Advertising Routes to GCVE</h3> <p class="center"><a href="/resources/2021/03/38_cloud_router_custom_ip_range_edited.png" class="drop-shadow"><img src="/resources/2021/03/38_cloud_router_custom_ip_range_edited.png" alt="" /></a></p> <p>VPC peering in Google Cloud does not support transitive routing. This means that I had to add a custom advertised IP range for my GCVE subnets to the Google Cloud Router. After adding this configuration, I was able to ping IPs in my SDDC. 
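</p> <p>If you prefer the <code class="language-plaintext highlighter-rouge">gcloud</code> CLI over the console for this step, a custom advertised range can be added to the Cloud Router with something like the following sketch. The router name, region, and CIDR are placeholders, not values from my lab:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Switch the Cloud Router to custom advertisement mode, keep advertising
# its local subnets, and additionally advertise the GCVE range (placeholders)
gcloud compute routers update my-cloud-router \
  --region=us-west2 \
  --advertisement-mode=custom \
  --set-advertisement-groups=all_subnets \
  --set-advertisement-ranges=10.100.0.0/16
</code></pre></div></div> <p>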
You will need to <a href="https://cloud.google.com/vmware-engine/docs/networking/howto-dns-on-premises">configure your DNS server to resolve queries for <code class="language-plaintext highlighter-rouge">gve.goog</code></a> to be able to access vCenter, NSX and HCX by their hostnames.</p> <h3 id="icmp-in-gcve">ICMP in GCVE</h3> <p>One nuance in GCVE that threw me off is that ICMP is not supported by the internal load balancer, which is in the path for egress traffic if you are using the internet gateway. Trying to ping 8.8.8.8 will fail, even if your SDDC is correctly connected to the internet. To test internet connectivity from a VM in your SDDC, use another tool like <code class="language-plaintext highlighter-rouge">curl</code> or follow the instructions <a href="https://www.xmodulo.com/how-to-install-tcpping-on-linux.html">here</a> to install <code class="language-plaintext highlighter-rouge">tcpping</code> for testing.</p> <h1 id="next-steps">Next Steps</h1> <p>Next, we will stage our SDDC networking segments and connect HCX to begin migrating workloads to GCVE. I highly recommend you read the <a href="https://cloud.google.com/solutions/private-cloud-networking-for-vmware-engine">Private cloud networking for Google Cloud VMware Engine</a> whitepaper, which goes into many of the subjects I’ve touched on in this blog in greater detail.</p> Thu, 18 Mar 2021 00:00:00 +0000 http://www.networkbrouhaha.com/2021/03/gcve-network-overview/ http://www.networkbrouhaha.com/2021/03/gcve-network-overview/ Intro to Google Cloud VMware Engine – Bastion Host Access with IAP <p>Welcome back! This post will build on the previous posts in this series by deploying a Windows Server 2019 bastion host to manage our Google Cloud VMware Engine (GCVE) SDDC. Access to the bastion host will be provided with <a href="https://cloud.google.com/iap">Identity-Aware Proxy</a> (IAP). 
Everything will be deployed and configured with Terraform, and all of the code referenced in this post is available at <a href="https://github.com/shamsway/gcp-terraform-examples">https://github.com/shamsway/gcp-terraform-examples</a> in the <code class="language-plaintext highlighter-rouge">gcve-bastion-iap</code> sub-directory.</p> <p><strong>Other posts in this series:</strong></p> <ul> <li><a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a></li> <li><a href="/2021/02/gcp-vpc-to-gcve/">Connecting a VPC to GCVE</a></li> <li><a href="/2021/03/gcve-network-overview/">Network and Connectivity Overview</a></li> <li><a href="/2021/04/gcve-hcx-config/">HCX Configuration</a></li> <li><a href="/2021/05/gcve-networking-scenarios/">Common Networking Scenarios</a></li> </ul> <h1 id="identity-aware-proxy-overview">Identity-Aware Proxy Overview</h1> <p>Standing up initial cloud connectivity is challenging. I walked through the steps to deploy a client VPN in <a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a>, but this post will show how to use IAP as a method for accessing a new bastion host. Using IAP means that the bastion host will be accessible without having to configure a VPN or expose it to the internet. I am a massive fan of this approach, and while there are some tradeoffs to discuss, it is simpler and more secure than traditional access methods.</p> <p>IAP can be used to access various resources, including App Engine and GKE. Accessing the bastion host over RDP (TCP port 3389) will be accomplished using <a href="https://cloud.google.com/iap/docs/using-tcp-forwarding">IAP for TCP forwarding</a>. Once configured, IAP will allow us to establish a connection to our bastion host over an encrypted tunnel on demand. Configuring this feature will require some specific IAM roles, as well as some firewall rules in your VPC. 
If you have <code class="language-plaintext highlighter-rouge">Owner</code> permissions in your GCP project, then you’re good to go. Otherwise, you will need the following roles assigned to complete the tasks outlined in the rest of this post:</p> <ul> <li>Compute Admin (<code class="language-plaintext highlighter-rouge">roles/compute.admin</code>)</li> <li>Service Account Admin (<code class="language-plaintext highlighter-rouge">roles/iam.serviceAccountAdmin</code>)</li> <li>Service Account User (<code class="language-plaintext highlighter-rouge">roles/iam.serviceAccountUser</code>)</li> <li>IAP Policy Admin (<code class="language-plaintext highlighter-rouge">roles/iap.admin</code>)</li> <li>IAP settings Admin (<code class="language-plaintext highlighter-rouge">roles/iap.settingsAdmin</code>)</li> <li>IAP-secured Tunnel User (<code class="language-plaintext highlighter-rouge">roles/iap.tunnelResourceAccessor</code>)</li> <li>Service Networking Admin (<code class="language-plaintext highlighter-rouge">roles/servicenetworking.networksAdmin</code>)</li> <li>Project IAM Admin (<code class="language-plaintext highlighter-rouge">roles/resourcemanager.projectIamAdmin</code>)</li> </ul> <p>The VPC firewall will need to allow traffic sourced from <code class="language-plaintext highlighter-rouge">35.235.240.0/20</code>, which is the range that IAP uses for TCP forwarding. 
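</p> <p>The Terraform code below creates this rule for you, but if you want to create it by hand, a minimal sketch with <code class="language-plaintext highlighter-rouge">gcloud</code> looks like this (the network name and target tag are placeholders for your environment):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Allow IAP's TCP forwarding range to reach tagged instances (placeholders)
gcloud compute firewall-rules create allow-iap-ingress \
  --network=my-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:3389,tcp:22 \
  --source-ranges=35.235.240.0/20 \
  --target-tags=bastion
</code></pre></div></div> <p>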
This rule can be further limited to specific TCP ports, like 3389 for RDP or 22 for SSH.</p> <h1 id="bastion-host-deployment-with-terraform">Bastion Host Deployment with Terraform</h1> <p>The example Terraform code linked at the beginning of the post will do the following:</p> <ul> <li>Create a <a href="https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances">service account</a>, which will be associated with the bastion host</li> <li>Create a Windows Server 2019 instance, which will be used as a bastion host</li> <li>Create <a href="https://cloud.google.com/iap/docs/using-tcp-forwarding#create-firewall-rule">firewall rules</a> for accessing the bastion host via IAP, and accessing resources from the bastion host</li> <li>Assign <a href="https://cloud.google.com/iap/docs/using-tcp-forwarding#grant-permission">IAM roles needed for IAP</a></li> <li>Set a password on the bastion host using the <code class="language-plaintext highlighter-rouge">gcloud</code> tool</li> </ul> <p>After Terraform completes configuration, you will be able to use the <code class="language-plaintext highlighter-rouge">gcloud</code> tool to enable TCP forwarding for RDP. Once connected to the bastion host, you will be able to log into your GCVE-based vSphere portal. To get started, clone the example repo with <code class="language-plaintext highlighter-rouge">git clone https://github.com/shamsway/gcp-terraform-examples.git</code>, then change to the <code class="language-plaintext highlighter-rouge">gcve-bastion-iap</code> sub-directory. 
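</p> <p>The clone-and-enter steps look like this:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Clone the example repo and enter the bastion example directory
git clone https://github.com/shamsway/gcp-terraform-examples.git
cd gcp-terraform-examples/gcve-bastion-iap
</code></pre></div></div> <p>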
You will find these files:</p> <ul> <li><code class="language-plaintext highlighter-rouge">main.tf</code> – Contains the primary Terraform code to complete the steps mentioned above</li> <li><code class="language-plaintext highlighter-rouge">variables.tf</code> – Defines the input variables that will be used in <code class="language-plaintext highlighter-rouge">main.tf</code></li> <li><code class="language-plaintext highlighter-rouge">terraform.tfvars</code> – Supplies values for the input variables defined in <code class="language-plaintext highlighter-rouge">variables.tf</code></li> <li><code class="language-plaintext highlighter-rouge">outputs.tf</code> – Defines the output variables to be returned from <code class="language-plaintext highlighter-rouge">main.tf</code></li> </ul> <p>Let’s take a closer look at what is happening in each of these files.</p> <h2 id="maintf-contents">main.tf Contents</h2> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">provider</span> <span class="s2">"google"</span> <span class="p">{</span> <span class="nx">project</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">project</span> <span class="nx">region</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">region</span> <span class="nx">zone</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">zone</span> <span class="p">}</span> <span class="k">data</span> <span class="s2">"google_compute_network"</span> <span class="s2">"network"</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">network_name</span> <span class="p">}</span> <span class="k">data</span> <span class="s2">"google_compute_subnetwork"</span> <span class="s2">"subnet"</span> <span class="p">{</span> <span 
class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">subnet_name</span> <span class="nx">region</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">region</span> <span class="p">}</span> </code></pre></div></div> <p>Just like the example from my <a href="/2021/02/gcp-vpc-to-gcve/">last post</a>, <code class="language-plaintext highlighter-rouge">main.tf</code> begins with a <code class="language-plaintext highlighter-rouge">provider</code> block to define the Google Cloud project, region, and zone in which Terraform will create resources. The following data blocks, <code class="language-plaintext highlighter-rouge">google_compute_network.network</code> and <code class="language-plaintext highlighter-rouge">google_compute_subnetwork.subnet</code>, reference an existing VPC network and subnetwork. These data blocks will provide parameters necessary for creating a bastion host and firewall rules.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_service_account"</span> <span class="s2">"bastion_host"</span> <span class="p">{</span> <span class="nx">project</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">project</span> <span class="nx">account_id</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">service_account_name</span> <span class="nx">display_name</span> <span class="p">=</span> <span class="s2">"Service Account for Bastion"</span> <span class="p">}</span> </code></pre></div></div> <p>The first resource block creates a new <a href="https://cloud.google.com/compute/docs/access/service-accounts">service account</a> that will be associated with our bastion host instance.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="k">resource</span> <span class="s2">"google_compute_instance"</span> <span class="s2">"bastion_host"</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">name</span> <span class="nx">machine_type</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">machine_type</span> <span class="nx">boot_disk</span> <span class="p">{</span> <span class="nx">initialize_params</span> <span class="p">{</span> <span class="nx">image</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">image</span> <span class="p">}</span> <span class="p">}</span> <span class="nx">network_interface</span> <span class="p">{</span> <span class="nx">subnetwork</span> <span class="p">=</span> <span class="k">data</span><span class="p">.</span><span class="nx">google_compute_subnetwork</span><span class="p">.</span><span class="nx">subnet</span><span class="p">.</span><span class="nx">self_link</span> <span class="nx">access_config</span> <span class="p">{}</span> <span class="p">}</span> <span class="nx">service_account</span> <span class="p">{</span> <span class="nx">email</span> <span class="p">=</span> <span class="nx">google_service_account</span><span class="p">.</span><span class="nx">bastion_host</span><span class="p">.</span><span class="nx">email</span> <span class="nx">scopes</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">scopes</span> <span class="p">}</span> <span class="nx">tags</span> <span class="p">=</span> <span class="p">[</span><span class="kd">var</span><span class="p">.</span><span class="nx">tag</span><span class="p">]</span> <span class="nx">labels</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">labels</span> <span class="nx">metadata</span> <span 
class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">metadata</span> <span class="p">}</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">google_compute_instance.bastion_host</code> block creates the bastion host. There are a few things to take note of in this block. <code class="language-plaintext highlighter-rouge">subnetwork</code> is set based on one of the data blocks at the beginning of <code class="language-plaintext highlighter-rouge">main.tf</code>, <code class="language-plaintext highlighter-rouge">data.google_compute_subnetwork.subnet.self_link</code>. The <code class="language-plaintext highlighter-rouge">self_link</code> property provides a unique reference to the subnet that Terraform will use when submitting the API call to create the bastion host. Similarly, the service account created by <code class="language-plaintext highlighter-rouge">google_service_account.bastion_host</code> is assigned to the bastion host.</p> <p><code class="language-plaintext highlighter-rouge">tags</code>, <code class="language-plaintext highlighter-rouge">labels</code>, and <code class="language-plaintext highlighter-rouge">metadata</code> all serve similar, but distinct, purposes. <code class="language-plaintext highlighter-rouge">tags</code> are network tags, which will be used in firewall rules. <code class="language-plaintext highlighter-rouge">labels</code> are informational data that can be used for organizational or billing purposes. 
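</p> <p>As a hypothetical illustration of the distinction, values for these three arguments might look like the following. These values are invented for this sketch, not taken from the example repo:</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tags   = ["bastion"]            # network tags, matched by firewall rules
labels = {
  owner        = "gcve-team"    # informational, e.g. for billing reports
  created_with = "terraform"
}
metadata = {
  # a PowerShell script run at first boot
  "windows-startup-script-ps1" = "Write-Output 'first boot'"
}
</code></pre></div></div> <p>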
<code class="language-plaintext highlighter-rouge">metadata</code> has numerous uses, the most common of which is supplying a boot script that the instance will run on first boot.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_compute_firewall"</span> <span class="s2">"allow_from_iap_to_bastion"</span> <span class="p">{</span> <span class="nx">project</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">project</span> <span class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">fw_name_allow_iap_to_bastion</span> <span class="nx">network</span> <span class="p">=</span> <span class="k">data</span><span class="p">.</span><span class="nx">google_compute_network</span><span class="p">.</span><span class="nx">network</span><span class="p">.</span><span class="nx">self_link</span> <span class="nx">allow</span> <span class="p">{</span> <span class="nx">protocol</span> <span class="p">=</span> <span class="s2">"tcp"</span> <span class="nx">ports</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"3389"</span><span class="p">]</span> <span class="p">}</span> <span class="c1"># https://cloud.google.com/iap/docs/using-tcp-forwarding#before_you_begin</span> <span class="c1"># This range is needed to allow IAP to access the bastion host</span> <span class="nx">source_ranges</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"35.235.240.0/20"</span><span class="p">]</span> <span class="nx">target_tags</span> <span class="p">=</span> <span class="p">[</span><span class="kd">var</span><span class="p">.</span><span class="nx">tag</span><span class="p">]</span> <span class="p">}</span> <span class="k">resource</span> <span class="s2">"google_compute_firewall"</span> <span class="s2">"allow_access_from_bastion"</span> <span 
class="p">{</span> <span class="nx">project</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">project</span> <span class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">fw_name_allow_mgmt_from_bastion</span> <span class="nx">network</span> <span class="p">=</span> <span class="k">data</span><span class="p">.</span><span class="nx">google_compute_network</span><span class="p">.</span><span class="nx">network</span><span class="p">.</span><span class="nx">self_link</span> <span class="nx">allow</span> <span class="p">{</span> <span class="nx">protocol</span> <span class="p">=</span> <span class="s2">"icmp"</span> <span class="p">}</span> <span class="nx">allow</span> <span class="p">{</span> <span class="nx">protocol</span> <span class="p">=</span> <span class="s2">"tcp"</span> <span class="nx">ports</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"22"</span><span class="p">,</span> <span class="s2">"80"</span><span class="p">,</span> <span class="s2">"443"</span><span class="p">,</span> <span class="s2">"3389"</span><span class="p">]</span> <span class="p">}</span> <span class="c1"># Allow management traffic from bastion</span> <span class="nx">source_tags</span> <span class="p">=</span> <span class="p">[</span><span class="kd">var</span><span class="p">.</span><span class="nx">tag</span><span class="p">]</span> <span class="p">}</span> </code></pre></div></div> <p>The next two blocks create firewall rules: one for accessing the bastion host via IAP, and the other for accessing resources from the bastion host. 
<code class="language-plaintext highlighter-rouge">google_compute_firewall.allow_from_iap_to_bastion</code> allows traffic from <code class="language-plaintext highlighter-rouge">35.235.240.0/20</code> on <code class="language-plaintext highlighter-rouge">tcp/3389</code> to instances that have the same network tag as the one that was assigned to the bastion host. <code class="language-plaintext highlighter-rouge">google_compute_firewall.allow_access_from_bastion</code> allows traffic sourced from the bastion host, matched by the same network tag, to everything else in our project on common management ports/protocols.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_iap_tunnel_instance_iam_binding"</span> <span class="s2">"enable_iap"</span> <span class="p">{</span> <span class="nx">project</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">project</span> <span class="nx">zone</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">zone</span> <span class="nx">instance</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">name</span> <span class="nx">role</span> <span class="p">=</span> <span class="s2">"roles/iap.tunnelResourceAccessor"</span> <span class="nx">members</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">members</span> <span class="nx">depends_on</span> <span class="p">=</span> <span class="p">[</span><span class="nx">google_compute_instance</span><span class="p">.</span><span class="nx">bastion_host</span><span class="p">]</span> <span class="p">}</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">google_iap_tunnel_instance_iam_binding.enable_iap</code> block assigns the <code class="language-plaintext 
highlighter-rouge">roles/iap.tunnelResourceAccessor</code> IAM role to the accounts defined in the <code class="language-plaintext highlighter-rouge">members</code> variable. This value can be any valid IAM member, such as a specific user account or a group. This role is required to be able to access the bastion host via IAP.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_service_account_iam_binding"</span> <span class="s2">"bastion_sa_user"</span> <span class="p">{</span> <span class="nx">service_account_id</span> <span class="p">=</span> <span class="nx">google_service_account</span><span class="p">.</span><span class="nx">bastion_host</span><span class="p">.</span><span class="nx">id</span> <span class="nx">role</span> <span class="p">=</span> <span class="s2">"roles/iam.serviceAccountUser"</span> <span class="nx">members</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">members</span> <span class="p">}</span> </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">google_service_account_iam_binding.bastion_sa_user</code> block allows accounts specified in the <code class="language-plaintext highlighter-rouge">members</code> variable to use the newly created service account via the <code class="language-plaintext highlighter-rouge">Service Account User</code> role (<code class="language-plaintext highlighter-rouge">roles/iam.serviceAccountUser</code>). This allows the users or groups defined in the <code class="language-plaintext highlighter-rouge">members</code> variable to access all of the resources that the service account has rights to access. 
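</p> <p>To illustrate what this Terraform binding does, here is a rough <code class="language-plaintext highlighter-rouge">gcloud</code> equivalent as a one-off command. The service account email and member are placeholders built from the example tfvars values:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Grant a user permission to act as the bastion service account (placeholders)
gcloud iam service-accounts add-iam-policy-binding \
  bastion-sa@your-gcp-project.iam.gserviceaccount.com \
  --member="user:admin@example.com" \
  --role="roles/iam.serviceAccountUser"
</code></pre></div></div> <p>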
More information on this can be found <a href="https://cloud.google.com/iam/docs/service-accounts#user-role">here</a>.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_project_iam_member"</span> <span class="s2">"bastion_sa_bindings"</span> <span class="p">{</span> <span class="nx">for_each</span> <span class="p">=</span> <span class="nx">toset</span><span class="p">(</span><span class="kd">var</span><span class="p">.</span><span class="nx">service_account_roles</span><span class="p">)</span> <span class="nx">project</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">project</span> <span class="nx">role</span> <span class="p">=</span> <span class="nx">each</span><span class="p">.</span><span class="nx">key</span> <span class="nx">member</span> <span class="p">=</span> <span class="s2">"serviceAccount:</span><span class="k">${</span><span class="nx">google_service_account</span><span class="p">.</span><span class="nx">bastion_host</span><span class="p">.</span><span class="nx">email</span><span class="k">}</span><span class="s2">"</span> <span class="p">}</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">google_project_iam_member.bastion_sa_bindings</code> completes the IAM-related configuration by granting roles defined in the <code class="language-plaintext highlighter-rouge">service_account_roles</code> variable to the service account. This service account is assigned to the bastion host, which defines what the bastion host can do. 
The default roles assigned are listed below, but they can be modified in <code class="language-plaintext highlighter-rouge">variables.tf</code>.</p> <ul> <li>Log Writer (<code class="language-plaintext highlighter-rouge">roles/logging.logWriter</code>)</li> <li>Monitoring Metric Writer (<code class="language-plaintext highlighter-rouge">roles/monitoring.metricWriter</code>)</li> <li>Monitoring Viewer (<code class="language-plaintext highlighter-rouge">roles/monitoring.viewer</code>)</li> <li>Compute OS Login (<code class="language-plaintext highlighter-rouge">roles/compute.osLogin</code>)</li> </ul> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"time_sleep"</span> <span class="s2">"wait_60_seconds"</span> <span class="p">{</span> <span class="nx">create_duration</span> <span class="p">=</span> <span class="s2">"60s"</span> <span class="nx">depends_on</span> <span class="p">=</span> <span class="p">[</span><span class="nx">google_compute_instance</span><span class="p">.</span><span class="nx">bastion_host</span><span class="p">]</span> <span class="p">}</span> <span class="k">data</span> <span class="s2">"external"</span> <span class="s2">"gcloud_set_bastion_password"</span> <span class="p">{</span> <span class="nx">program</span> <span class="p">=</span> <span class="p">[</span><span class="s2">"bash"</span><span class="p">,</span> <span class="s2">"-c"</span><span class="p">,</span> <span class="s2">"gcloud compute reset-windows-password </span><span class="k">${</span><span class="kd">var</span><span class="p">.</span><span class="nx">name</span><span class="k">}</span><span class="s2"> --user=</span><span class="k">${</span><span class="kd">var</span><span class="p">.</span><span class="nx">username</span><span class="k">}</span><span class="s2"> --format=json --quiet"</span><span class="p">]</span> <span class="nx">depends_on</span> <span class="p">=</span> <span 
class="p">[</span><span class="nx">time_sleep</span><span class="p">.</span><span class="nx">wait_60_seconds</span><span class="p">]</span> <span class="p">}</span> </code></pre></div></div> <p>These final two blocks are what I refer to as “cool Terraform tricks.” The point of these blocks is to set the password on the bastion host. There are a few ways to do this, but unfortunately, there is no way to set a Windows instance password with a native Terraform resource. Instead, an <code class="language-plaintext highlighter-rouge">external</code> data source is used to run the appropriate <code class="language-plaintext highlighter-rouge">gcloud</code> command, with JSON formatted results returned (this is a requirement of the <code class="language-plaintext highlighter-rouge">external</code> data source). The password cannot be set until the bastion host is fully booted, so <code class="language-plaintext highlighter-rouge">data.external.gcloud_set_bastion_password</code> depends on <code class="language-plaintext highlighter-rouge">time_sleep.wait_60_seconds</code>, which is a simple 60-second timer that gives the bastion host time to boot up before the <code class="language-plaintext highlighter-rouge">gcloud</code> command is run.</p> <p>There is a chance that 60 seconds may not be long enough for the bastion host to boot. If you receive an error stating that the instance is not ready for use, you have two options:</p> <ol> <li>Run <code class="language-plaintext highlighter-rouge">terraform destroy</code> to remove the bastion host. 
Edit <code class="language-plaintext highlighter-rouge">main.tf</code> and increase the <code class="language-plaintext highlighter-rouge">create_duration</code> to a higher value, then run <code class="language-plaintext highlighter-rouge">terraform apply</code> again.</li> <li>Run the <code class="language-plaintext highlighter-rouge">gcloud compute reset-windows-password</code> command manually</li> </ol> <p>Ideally, the password reset functionality would be built into the Google Cloud Terraform provider, and I wouldn’t be surprised to see it added in the future. If you’re reading this post in 2022 or beyond, it’s probably worth a quick investigation to see if this has happened.</p> <h2 id="outputtf-contents">outputs.tf Contents</h2> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">output</span> <span class="s2">"bastion_username"</span> <span class="p">{</span> <span class="nx">value</span> <span class="p">=</span> <span class="k">data</span><span class="p">.</span><span class="nx">external</span><span class="p">.</span><span class="nx">gcloud_set_bastion_password</span><span class="p">.</span><span class="nx">result</span><span class="p">.</span><span class="nx">username</span> <span class="p">}</span> <span class="k">output</span> <span class="s2">"bastion_password"</span> <span class="p">{</span> <span class="nx">value</span> <span class="p">=</span> <span class="k">data</span><span class="p">.</span><span class="nx">external</span><span class="p">.</span><span class="nx">gcloud_set_bastion_password</span><span class="p">.</span><span class="nx">result</span><span class="p">.</span><span class="nx">password</span> <span class="p">}</span> </code></pre></div></div> <p>These two outputs are the results of running the <code class="language-plaintext highlighter-rouge">gcloud</code> command. Once Terraform has completed running, it will display the username and password set on the bastion host. 
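</p> <p>If you want to keep the password out of the console output, the password output block from the repo can be amended like this (a sketch):</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code>output "bastion_password" {
  value     = data.external.gcloud_set_bastion_password.result.password
  sensitive = true  # redacted in CLI output, but still stored in state
}
</code></pre></div></div> <p>With the output marked sensitive, the value can still be read on demand with <code class="language-plaintext highlighter-rouge">terraform output bastion_password</code>. 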
A password is sensitive data, so if you want to prevent it from being displayed, add <code class="language-plaintext highlighter-rouge">sensitive = true</code> to the <code class="language-plaintext highlighter-rouge">bastion_password</code> output block. Output values are stored in the Terraform state file, so you should take precautions to protect the state file from unauthorized access. Additional information on Terraform outputs is available <a href="https://www.terraform.io/docs/language/values/outputs.html">here</a>.</p> <h2 id="terraformtfvars-contents">terraform.tfvars Contents</h2> <p><code class="language-plaintext highlighter-rouge">terraform.tfvars</code> is the file that supplies values for the variables referenced in <code class="language-plaintext highlighter-rouge">main.tf</code>. All you need to do is supply the desired values for your environment, and you are good to go. Note that the variables below are all examples, so simply copying and pasting may not lead to the desired result.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">members</span> <span class="o">=</span> <span class="p">[</span><span class="s">"user:user@example.com"</span><span class="p">]</span> <span class="n">project</span> <span class="o">=</span> <span class="s">"your-gcp-project"</span> <span class="n">region</span> <span class="o">=</span> <span class="s">"us-west2"</span> <span class="n">zone</span> <span class="o">=</span> <span class="s">"us-west2-a"</span> <span class="n">service_account_name</span> <span class="o">=</span> <span class="s">"bastion-sa"</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"bastion-vm"</span> <span class="n">username</span> <span class="o">=</span> <span class="s">"bastionuser"</span> <span class="n">labels</span> <span class="o">=</span> <span class="p">{</span> <span class="n">owner</span> <span class="o">=</span> <span class="s">"GCVE 
Team"</span><span class="p">,</span> <span class="n">created_with</span> <span class="o">=</span> <span class="s">"terraform"</span> <span class="p">}</span> <span class="n">image</span> <span class="o">=</span> <span class="s">"gce-uefi-images/windows-2019"</span> <span class="n">machine_type</span> <span class="o">=</span> <span class="s">"n1-standard-1"</span> <span class="n">network_name</span> <span class="o">=</span> <span class="s">"gcve-usw2"</span> <span class="n">subnet_name</span> <span class="o">=</span> <span class="s">"gcve-usw2-mgmt"</span> <span class="n">tag</span> <span class="o">=</span> <span class="s">"bastion"</span> </code></pre></div></div> <p>Additional information on the variables used is available in <a href="https://github.com/shamsway/gcp-terraform-examples/blob/main/gcve-bastion-iap/README.md">README.md</a>. You can also find information on these variables, including their default values should one exist, in <code class="language-plaintext highlighter-rouge">variables.tf</code>.</p> <h2 id="initializing-and-running-terraform">Initializing and Running Terraform</h2> <p>Terraform will use <a href="https://cloud.google.com/sdk/gcloud/reference/auth/application-default">Application Default Credentials</a> to authenticate to Google Cloud. Assuming you have the <code class="language-plaintext highlighter-rouge">gcloud</code> CLI tool installed, you can set these by running <code class="language-plaintext highlighter-rouge">gcloud auth application-default login</code>. Additional information on authentication can be found in the <a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/getting_started">Getting Started with the Google Provider</a> Terraform documentation. 
To run the Terraform code, follow the steps below.</p> <p><strong>Following these steps will create resources in your Google Cloud project, and you will be billed for them.</strong></p> <ol> <li>Run <code class="language-plaintext highlighter-rouge">terraform init</code> and ensure no errors are displayed</li> <li>Run <code class="language-plaintext highlighter-rouge">terraform plan</code> and review the changes that Terraform will perform</li> <li>Run <code class="language-plaintext highlighter-rouge">terraform apply</code> to apply the proposed configuration changes</li> </ol> <p>Should you wish to remove everything created by Terraform, run <code class="language-plaintext highlighter-rouge">terraform destroy</code> and answer <code class="language-plaintext highlighter-rouge">yes</code> when prompted. This will only remove the bastion host and related configuration created by Terraform. Your GCVE environment will have to be deleted using <a href="https://cloud.google.com/vmware-engine/docs/private-clouds/howto-delete-private-cloud">these instructions</a>, if desired.</p> <h1 id="accessing-the-bastion-host-with-iap">Accessing the Bastion Host with IAP</h1> <p>Now, you should have a fresh Windows Server 2019 instance running in Google Cloud to serve as a bastion host. Use this command to create a tunnel to the bastion host:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcloud compute start-iap-tunnel <span class="o">[</span>bastion-host-name] 3389 <span class="nt">--zone</span> <span class="o">[</span>zone] </code></pre></div></div> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/03/20_gcloud_iap_tunnel.png" alt="" class="drop-shadow" /></p> <p>You will see a message that says <code class="language-plaintext highlighter-rouge">Listening on port [random number]</code>. This random high port is proxied to your bastion host port 3389. 
Fire up your favorite RDP client and connect to <code class="language-plaintext highlighter-rouge">localhost:[random number]</code>. Login with the credentials that were output from running Terraform. Once you’re able to connect to the bastion host, install the vSphere-compatible browser of your choice, along with any other management tools you may need.</p> <p>If you’re a Windows user, there is an IAP-enabled RDP client available <a href="https://github.com/GoogleCloudPlatform/iap-desktop">here</a>.</p> <h1 id="accessing-gcve-resources-from-the-bastion-host">Accessing GCVE Resources from the Bastion Host</h1> <p>Open the GCVE portal, browse to <code class="language-plaintext highlighter-rouge">Resources</code>, and click on your SDDC, then <code class="language-plaintext highlighter-rouge">vSphere Management Network</code>. This will display the hostnames for your vCenter, NSX and HCX instances. Copy the hostname for vCenter and paste it into a browser in your bastion host to verify you can access your SDDC.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/03/21_cloud_dns_forwarding_edited.png" alt="" class="drop-shadow" /> <em>Cloud DNS forwarding config to enable resolution of GCVE resources</em></p> <p>Access to GCVE from your VPC is made possible by private service access and a DNS forwarding configuration in Cloud DNS. The DNS forwarding configuration enables name resolution from your VPC for resources in GCVE. It is automatically created in Cloud DNS when private service access is configured between your VPC and GCVE. This is a relatively new feature and a nice improvement. 
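Since the forwarding zone is created automatically, there is nothing for you to configure here, but for illustration, a Cloud DNS private forwarding zone like this can also be described in Terraform. This is a hypothetical sketch — the zone name, domain, network path, and resolver address below are placeholders, not values taken from GCVE:

```tf
# Hypothetical sketch of a Cloud DNS private forwarding zone. GCVE creates
# the real zone automatically when private service access is configured;
# the dns_name, network_url, and resolver IP here are placeholders.
resource "google_dns_managed_zone" "gcve_forwarding" {
  name       = "gcve-forwarding"
  dns_name   = "example.gve.goog."
  visibility = "private"

  private_visibility_config {
    networks {
      network_url = "projects/your-gcp-project/global/networks/gcve-usw2"
    }
  }

  forwarding_config {
    target_name_servers {
      ipv4_address = "192.168.80.2" # placeholder GCVE DNS resolver
    }
  }
}
```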
Previously, name resolution for GCVE required manually changing resolvers on your bastion host or configuring a standalone DNS server.</p> <h1 id="wrap-up">Wrap Up</h1> <p>A quick recap of everything we’ve accomplished if you’ve been following this blog series from the beginning:</p> <ul> <li>Deployed an SDDC in GCVE</li> <li>Created a new VPC and configured private service access to your SDDC</li> <li>Deployed a bastion host in your VPC, accessible via IAP</li> </ul> <p>Clearly, we are just getting started! My next post will look at configuring Cloud Interconnect and standing up an HCX service mesh. With that in place, we can begin migrating some workloads into our SDDC.</p> <h1 id="terraform-documentation-links">Terraform Documentation Links</h1> <ul> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_reference">Google Provider Configuration Reference</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_network">google_compute_network Data Source</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_subnetwork">google_compute_subnetwork Data Source</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/google_service_account">google_service_account Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance">google_compute_instance Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_firewall">google_compute_firewall Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/iap_tunnel_instance_iam">google_iap_tunnel_instance_iam_binding Resource</a></li> <li><a 
href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/google_service_account_iam">google_service_account_iam_binding Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/google_project_iam">google_project_iam_member Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep">time_sleep Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/data_source">external Data Source</a></li> </ul> Wed, 03 Mar 2021 00:00:00 +0000 http://www.networkbrouhaha.com/2021/03/gcve-bastion/ http://www.networkbrouhaha.com/2021/03/gcve-bastion/ Intro to Google Cloud VMware Engine – Connecting a VPC to GCVE <p>My <a href="/2021/02/gcve-sddc-with-hcx/">previous post</a> walked through deploying an SDDC in Google Cloud VMware Engine (GCVE). This post will show the process of connecting a VPC to your GCVE environment, and we will use Terraform to do the vast majority of the work. The diagram below shows the basic concept of what I will be covering in this post. Once connected, you will be able to communicate from your VPC to your SDDC and vice versa. 
If you would like to complete this process using the cloud console instead of Terraform, see <a href="https://cloud.google.com/vmware-engine/docs/networking/howto-setup-private-service-access">Setting up private service access</a> in the VMware Engine documentation.</p> <p><strong>Other posts in this series:</strong></p> <ul> <li><a href="/2021/02/gcve-sddc-with-hcx/">Deploying a GCVE SDDC with HCX</a></li> <li><a href="/2021/03/gcve-bastion/">Bastion Host Access with IAP</a></li> <li><a href="/2021/03/gcve-network-overview/">Network and Connectivity Overview</a></li> <li><a href="/2021/04/gcve-hcx-config/">HCX Configuration</a></li> <li><a href="/2021/05/gcve-networking-scenarios/">Common Networking Scenarios</a></li> </ul> <p class="center"><a href="/resources/2021/02/gcve-vpc-peeing.png" class="drop-shadow"><img src="/resources/2021/02/gcve-vpc-peeing.png" alt="" /></a></p> <p>I’m assuming you have a working SDDC deployed in VMware Engine and some basic knowledge of how Terraform works so you can use the provided Terraform examples. If you have not yet deployed an SDDC, please do so before continuing. If you need to get up to speed with Terraform, browse over to <a href="https://learn.hashicorp.com/terraform">https://learn.hashicorp.com/terraform</a>. All of the code referenced in this post will be available at <a href="https://github.com/shamsway/gcp-terraform-examples">https://github.com/shamsway/gcp-terraform-examples</a> in the <code class="language-plaintext highlighter-rouge">gcve-network</code> sub-directory. 
You will need to have git installed to clone the repo, and I highly recommend using <a href="https://github.com/microsoft/vscode">Visual Studio Code</a> with the Terraform add-on installed to view the files.</p> <h1 id="private-service-access-overview">Private Service Access Overview</h1> <p>GCVE SDDCs can establish connectivity to native GCP services with <a href="https://cloud.google.com/vpc/docs/private-services-access">private services access</a>. This feature can be used to establish connectivity from a VPC to a third-party “service producer,” but in this case, it will simply plumb connectivity between our VPC and SDDC. Configuring private services access requires allocating one or more reserved ranges that cannot be used in your local VPC network. In this case, we will supply the ranges that we have allocated for our VMware Engine SDDC networks. Doing this prevents issues with overlapping IP ranges.</p> <h1 id="leveraging-terraform-for-configuration">Leveraging Terraform for Configuration</h1> <p>I have provided Terraform code that will do the following:</p> <ul> <li>Create a VPC network</li> <li>Create a subnet in the new VPC network that will be used to communicate with GCVE</li> <li>Create two Global Address pools that will be used to reserve addresses used in GCVE</li> <li>Create a private connection in the new VPC, using the two Global Address pools as reserved ranges</li> <li>Enable import and export of custom routes for the VPC</li> </ul> <p>After Terraform completes configuration, you will be able to establish peering with the new VPC in GCVE. To get started, clone the example repo with <code class="language-plaintext highlighter-rouge">git clone https://github.com/shamsway/gcp-terraform-examples.git</code>, then change to the <code class="language-plaintext highlighter-rouge">gcve-network</code> sub-directory. 
You will find these files:</p> <ul> <li><code class="language-plaintext highlighter-rouge">main.tf</code> – Contains the primary Terraform code to complete the steps mentioned above</li> <li><code class="language-plaintext highlighter-rouge">variables.tf</code> – Defines the input variables that will be used in <code class="language-plaintext highlighter-rouge">main.tf</code></li> <li><code class="language-plaintext highlighter-rouge">terraform.tfvars</code> – Supplies values for the input variables defined in <code class="language-plaintext highlighter-rouge">variables.tf</code></li> </ul> <p>Let’s take a look at what is happening in <code class="language-plaintext highlighter-rouge">main.tf</code>, then we will supply the necessary variables in <code class="language-plaintext highlighter-rouge">terraform.tfvars</code> and run Terraform. You will see <code class="language-plaintext highlighter-rouge">var.[name]</code> appear over and over in the code, as this is how Terraform references variables. You may think it would be easier to place the desired values directly into <code class="language-plaintext highlighter-rouge">main.tf</code> instead of defining and supplying variables, but it is worth the time to get used to using variables with Terraform. 
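To make the relationship between these three files concrete, here is a minimal sketch of how one value (the <code class="language-plaintext highlighter-rouge">region</code> variable used in this example) flows through them:

```tf
# variables.tf — declares the variable, optionally with a default
variable "region" {
  description = "Region where resources will be created"
  type        = string
  default     = "us-west2"
}

# terraform.tfvars — supplies a value for your environment:
#   region = "us-west2"

# main.tf — references the value wherever it is needed:
#   region = var.region
```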
Hardcoding values in your code is rarely a good idea, and most Terraform code that I have consumed from other authors uses variables heavily.</p> <h2 id="maintf-contents">main.tf Contents</h2> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">provider</span> <span class="s2">"google"</span> <span class="p">{</span> <span class="nx">project</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">project</span> <span class="nx">region</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">region</span> <span class="nx">zone</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">zone</span> <span class="p">}</span> </code></pre></div></div> <p>The file begins with a provider block, which is common in Terraform. This block defines the Google Cloud project, region, and zone in which Terraform will create resources. The values used are specified in <code class="language-plaintext highlighter-rouge">terraform.tfvars</code>, which is the same method we will use throughout this example.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_compute_network"</span> <span class="s2">"vpc_network"</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">network_name</span> <span class="nx">description</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">network_descr</span> <span class="nx">auto_create_subnetworks</span> <span class="p">=</span> <span class="kc">false</span> <span class="p">}</span> </code></pre></div></div> <p>The first resource block creates a new VPC in the project specified in the provider block. 
Setting <code class="language-plaintext highlighter-rouge">auto_create_subnetworks</code> to <code class="language-plaintext highlighter-rouge">false</code> specifies that we want a custom VPC instead of auto-creating subnets for each region.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_compute_subnetwork"</span> <span class="s2">"vpc_subnet"</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">subnet_name</span> <span class="nx">ip_cidr_range</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">subnet_cidr</span> <span class="nx">region</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">region</span> <span class="nx">network</span> <span class="p">=</span> <span class="nx">google_compute_network</span><span class="p">.</span><span class="nx">vpc_network</span><span class="p">.</span><span class="nx">id</span> <span class="p">}</span> </code></pre></div></div> <p>The next block creates a subnet in the newly created VPC. 
Notice that the last line references <code class="language-plaintext highlighter-rouge">google_compute_network.vpc_network.id</code> for the network value, meaning that it uses the ID value of the VPC created by Terraform.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_compute_global_address"</span> <span class="s2">"private_ip_alloc_1"</span> <span class="p">{</span> <span class="nx">name</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">reserved1_name</span> <span class="nx">address</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">reserved1_address</span> <span class="nx">purpose</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">address_purpose</span> <span class="nx">address_type</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">address_type</span> <span class="nx">prefix_length</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">reserved1_address_prefix_length</span> <span class="nx">network</span> <span class="p">=</span> <span class="nx">google_compute_network</span><span class="p">.</span><span class="nx">vpc_network</span><span class="p">.</span><span class="nx">id</span> <span class="p">}</span> </code></pre></div></div> <p>This block and the following block (<code class="language-plaintext highlighter-rouge">google_compute_global_address.private_ip_alloc_2</code>) create a private IP allocation used for the private services configuration.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_service_networking_connection"</span> <span class="s2">"gcve-psa"</span> <span class="p">{</span> <span 
class="nx">network</span> <span class="p">=</span> <span class="nx">google_compute_network</span><span class="p">.</span><span class="nx">vpc_network</span><span class="p">.</span><span class="nx">id</span> <span class="nx">service</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">service</span> <span class="nx">reserved_peering_ranges</span> <span class="p">=</span> <span class="p">[</span><span class="nx">google_compute_global_address</span><span class="p">.</span><span class="nx">private_ip_alloc_1</span><span class="p">.</span><span class="nx">name</span><span class="p">,</span> <span class="nx">google_compute_global_address</span><span class="p">.</span><span class="nx">private_ip_alloc_2</span><span class="p">.</span><span class="nx">name</span><span class="p">]</span> <span class="nx">depends_on</span> <span class="p">=</span> <span class="p">[</span><span class="nx">google_compute_network</span><span class="p">.</span><span class="nx">vpc_network</span><span class="p">]</span> <span class="p">}</span> </code></pre></div></div> <p>These last two blocks are where things get interesting. The block above configures the private services connection using the VPC network and private IP allocation created by Terraform. <code class="language-plaintext highlighter-rouge">Service</code> is a specific string, <code class="language-plaintext highlighter-rouge">servicenetworking.googleapis.com</code>, since Google is the service provider in this scenario. This value is set in <code class="language-plaintext highlighter-rouge">terraform.tfvars</code>, as we will see in a moment. 
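The second allocation block is omitted above, but judging from the <code class="language-plaintext highlighter-rouge">reserved2_*</code> variables defined in <code class="language-plaintext highlighter-rouge">terraform.tfvars</code>, it presumably mirrors the first:

```tf
# Presumed shape of the second reserved range, mirroring private_ip_alloc_1
# with the reserved2_* variables from terraform.tfvars.
resource "google_compute_global_address" "private_ip_alloc_2" {
  name          = var.reserved2_name
  address       = var.reserved2_address
  purpose       = var.address_purpose
  address_type  = var.address_type
  prefix_length = var.reserved2_address_prefix_length
  network       = google_compute_network.vpc_network.id
}
```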
If you find this confusing, check the available documentation for this resource, and it should help you to understand it.</p> <div class="language-tf highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"google_compute_network_peering_routes_config"</span> <span class="s2">"peering_routes"</span> <span class="p">{</span> <span class="nx">peering</span> <span class="p">=</span> <span class="kd">var</span><span class="p">.</span><span class="nx">peering</span> <span class="nx">network</span> <span class="p">=</span> <span class="nx">google_compute_network</span><span class="p">.</span><span class="nx">vpc_network</span><span class="p">.</span><span class="nx">name</span> <span class="nx">import_custom_routes</span> <span class="p">=</span> <span class="kc">true</span> <span class="nx">export_custom_routes</span> <span class="p">=</span> <span class="kc">true</span> <span class="nx">depends_on</span> <span class="p">=</span> <span class="p">[</span><span class="nx">google_service_networking_connection</span><span class="p">.</span><span class="nx">gcve</span><span class="err">-</span><span class="nx">psa</span><span class="p">]</span> <span class="p">}</span> </code></pre></div></div> <p>The final block enables the import and export of custom routes for our VPC peering configuration.</p> <p>Note that the final two blocks contain an argument that none of the others do: <code class="language-plaintext highlighter-rouge">depends_on</code>. The Terraform documentation describes <code class="language-plaintext highlighter-rouge">depends_on</code> in-depth <a href="https://www.terraform.io/docs/language/meta-arguments/depends_on.html">here</a>, but basically, this is a hint for Terraform to describe resources that rely on each other. Typically, Terraform can determine this automatically, but there are occasional cases where this statement needs to be used. 
Running <code class="language-plaintext highlighter-rouge">terraform destroy</code> without this argument in place may lead to errors, as Terraform could delete the VPC before removing the private services connection or route peering configuration.</p> <h2 id="terraformtfvars-contents">terraform.tfvars Contents</h2> <p><code class="language-plaintext highlighter-rouge">terraform.tfvars</code> is the file that defines all the variables that are referenced in <code class="language-plaintext highlighter-rouge">main.tf</code>. All you need to do is supply the desired values for your environment, and you are good to go. Note that the variables below are all examples, so simply copying and pasting may not lead to the desired result.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">project</span> <span class="o">=</span> <span class="s">"your-gcp-project"</span> <span class="n">region</span> <span class="o">=</span> <span class="s">"us-west2"</span> <span class="n">zone</span> <span class="o">=</span> <span class="s">"us-west2-a"</span> <span class="n">network_name</span> <span class="o">=</span> <span class="s">"gcve-usw2"</span> <span class="n">network_descr</span> <span class="o">=</span> <span class="s">"Network for testing of GCVE in USW2"</span> <span class="n">subnet_name</span> <span class="o">=</span> <span class="s">"gcve-usw2-mgmt"</span> <span class="n">subnet_cidr</span> <span class="o">=</span> <span class="s">"192.168.82.0/24"</span> <span class="n">reserved1_name</span> <span class="o">=</span> <span class="s">"gcve-managemnt-ip-alloc"</span> <span class="n">reserved1_address</span> <span class="o">=</span> <span class="s">"192.168.80.0"</span> <span class="n">reserved1_address_prefix_length</span> <span class="o">=</span> <span class="mi">23</span> <span class="n">reserved2_name</span> <span class="o">=</span> <span class="s">"gcve-workload-ip-alloc"</span> <span 
class="n">reserved2_address</span> <span class="o">=</span> <span class="s">"192.168.84.0"</span> <span class="n">reserved2_address_prefix_length</span> <span class="o">=</span> <span class="mi">23</span> <span class="n">address_purpose</span> <span class="o">=</span> <span class="s">"VPC_PEERING"</span> <span class="n">address_type</span> <span class="o">=</span> <span class="s">"INTERNAL"</span> <span class="n">service</span> <span class="o">=</span> <span class="s">"servicenetworking.googleapis.com"</span> <span class="n">peering</span> <span class="o">=</span> <span class="s">"servicenetworking-googleapis-com"</span> </code></pre></div></div> <p>Additional information on the variables used is available in <a href="https://github.com/shamsway/gcp-terraform-examples/blob/main/gcve-vpc-peering/README.md">README.md</a>. You can also find information on these variables, including their default values should one exist, in <code class="language-plaintext highlighter-rouge">variables.tf</code>.</p> <h2 id="initializing-and-running-terraform">Initializing and Running Terraform</h2> <p>Terraform will use <a href="https://cloud.google.com/sdk/gcloud/reference/auth/application-default">Application Default Credentials</a> to authenticate to Google Cloud. Assuming you have the <code class="language-plaintext highlighter-rouge">gcloud</code> CLI tool installed, you can set these by running <code class="language-plaintext highlighter-rouge">gcloud auth application-default login</code>. Additional information on authentication can be found in the <a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/getting_started">Getting Started with the Google Provider</a> Terraform documentation. 
To run the Terraform code, follow the steps below.</p> <p><strong>Following these steps will create resources in your Google Cloud project, and you will be billed for them.</strong></p> <ol> <li>Run <code class="language-plaintext highlighter-rouge">terraform init</code> and ensure no errors are displayed</li> <li>Run <code class="language-plaintext highlighter-rouge">terraform plan</code> and review the changes that Terraform will perform</li> <li>Run <code class="language-plaintext highlighter-rouge">terraform apply</code> to apply the proposed configuration changes</li> </ol> <p>Should you wish to remove everything created by Terraform, run <code class="language-plaintext highlighter-rouge">terraform destroy</code> and answer <code class="language-plaintext highlighter-rouge">yes</code> when prompted. This will only remove the VPC network and related configuration created by Terraform. Your GCVE environment will have to be deleted using <a href="https://cloud.google.com/vmware-engine/docs/private-clouds/howto-delete-private-cloud">these instructions</a>, if desired.</p> <h1 id="review-vpc-configuration">Review VPC Configuration</h1> <p>Once <code class="language-plaintext highlighter-rouge">terraform apply</code> completes, you can see the results in the <a href="https://console.cloud.google.com/">Google Cloud Console</a>.</p> <p class="center"><a href="/resources/2021/02/network_allocated_ips_edited.png" class="drop-shadow"><img src="/resources/2021/02/network_allocated_ips_edited.png" alt="" /></a></p> <p>IP ranges allocated for use in GCVE are reserved.</p> <p class="center"><a href="/resources/2021/02/network_service_connection_edited.png" class="drop-shadow"><img src="/resources/2021/02/network_service_connection_edited.png" alt="" /></a></p> <p>Private service access is configured.</p> <p class="center"><a href="/resources/2021/02/network_peering_edited.png" class="drop-shadow"><img src="/resources/2021/02/network_peering_edited.png" alt="" /></a></p> 
<p>Import and export of custom routes on the <code class="language-plaintext highlighter-rouge">servicenetworking-googleapis-com</code> private connection is enabled.</p> <h1 id="complete-peering-in-gcve">Complete Peering in GCVE</h1> <p>The final step is to create the private connection in the VMware Engine portal. You will need the following information to configure the private connection.</p> <ul> <li>Project ID (found under <code class="language-plaintext highlighter-rouge">Project info</code> on the console dashboard.) <code class="language-plaintext highlighter-rouge">Project ID</code> may be different from <code class="language-plaintext highlighter-rouge">Project Name</code>, so verify you are gathering the correct information.</li> <li>Project Number (also found under <code class="language-plaintext highlighter-rouge">Project info</code> on the console dashboard.)</li> <li>Name of the VPC (<code class="language-plaintext highlighter-rouge">network_name</code> in your <code class="language-plaintext highlighter-rouge">variables.tf</code> file.)</li> <li>Peered project ID from VPC Network Peering screen</li> </ul> <p>Save all of these values somewhere handy, and follow these steps to complete peering:</p> <p class="center"><a href="/resources/2021/02/15b_add_private_connection_edited.png" class="drop-shadow"><img src="/resources/2021/02/15b_add_private_connection_edited.png" alt="" /></a></p> <ol> <li>Open the VMware Engine portal, and browse to <code class="language-plaintext highlighter-rouge">Network &gt; Private connection</code>.</li> <li>Click <code class="language-plaintext highlighter-rouge">Add network connection</code> and paste the required values. 
Supply the peered project ID in the <code class="language-plaintext highlighter-rouge">Tenant project ID</code> field, VPC name in the <code class="language-plaintext highlighter-rouge">Peer VPC ID</code> field, and complete the remaining fields.</li> <li>Choose the region your VMware Engine private cloud is deployed in, and click <code class="language-plaintext highlighter-rouge">submit</code>.</li> </ol> <p class="center"><a href="/resources/2021/02/16_add_private_connection_edited.png" class="drop-shadow"><img src="/resources/2021/02/16_add_private_connection_edited.png" alt="" /></a></p> <p>After a few moments, <code class="language-plaintext highlighter-rouge">Region Status</code> should show a status of <code class="language-plaintext highlighter-rouge">Connected</code>. Your VMware Engine private cloud is now peered with your Google Cloud VPC. You can verify peering is working by checking the routing table of your VPC.</p> <h1 id="verify-vpc-routing-table">Verify VPC Routing Table</h1> <p>Once peering is completed, you should see routes for networks in your GCVE SDDC in your VPC routing table. You can view these routes in the cloud console or with:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> gcloud compute networks peerings list-routes servicenetworking-googleapis-com --network=[VPC Name] --region=[Region name] --direction=incoming </code></pre></div></div> <p class="center"><a href="/resources/2021/02/19_gcloud_routes_output.png" class="drop-shadow"><img src="/resources/2021/02/19_gcloud_routes_output.png" alt="" /></a> Verifying routes with the gcloud cli</p> <p class="center"><a href="/resources/2021/02/17_peering_imported_routes_edited.png" class="drop-shadow"><img src="/resources/2021/02/17_peering_imported_routes_edited.png" alt="" /></a> Viewing routes in the console</p> <h1 id="wrap-up">Wrap Up</h1> <p>Well, that was fun! 
You should now have established connectivity between your VMware Engine SDDC and your Google Cloud VPC, but we are only getting started. My next post will cover creating a bastion host in GCP to manage your GCVE environment, and I may take a look at Cloud DNS as well.</p> <p>This post comes at a good time, as Google has just announced <a href="https://cloud.google.com/blog/products/compute/whats-new-in-google-cloud-vmware-engine-in-february-2021">several enhancements to GCVE</a>, including multiple VPC peering. I’m planning on exploring these enhancements in future posts.</p> <h1 id="terraform-documentation-links">Terraform Documentation Links</h1> <ul> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_reference">Google Provider Configuration Reference</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_network">google_compute_network Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_subnetwork">google_compute_subnetwork Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_global_address">google_compute_global_address Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/service_networking_connection">google_service_networking_connection Resource</a></li> <li><a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_network_peering_routes_config">google_compute_network_peering_routes_config Resource</a></li> </ul> Fri, 19 Feb 2021 00:00:00 +0000 http://www.networkbrouhaha.com/2021/02/gcp-vpc-to-gcve/ http://www.networkbrouhaha.com/2021/02/gcp-vpc-to-gcve/ Intro to Google Cloud VMware Engine - Deploying a GCVE SDDC with HCX <p>Welcome to the first post in a new series focusing on <a href="https://cloud.google.com/vmware-engine">Google Cloud 
VMware Engine</a> (GCVE)! This first post will walk through prerequisites, deploying an SDDC with VMware HCX, and accessing vCenter via VPN Gateway (i.e., OpenVPN).</p> <p>Before we dive into deploying an SDDC, I want to set expectations for this blog series. My goal when working in the cloud is to create, modify and destroy resources programmatically. My tool of choice is <a href="https://www.terraform.io/">Terraform</a>, but I will also use CLI-based tools like <a href="https://cloud.google.com/sdk/gcloud">gcloud</a>. Occasionally I will inspect API calls directly and make them with Python or <a href="https://github.com/curl/curl">cURL</a>. I have found that learning a product’s API is an excellent way to master it. Cloud consoles (GUIs) are adequate when getting started, but interfacing with the API, whether through Terraform or an SDK, is how these platforms are designed to work.</p> <p>This first post will be different from the others because the GCVE API documentation is not yet public, nor is there any Terraform functionality available to create or destroy GCVE resources. API documentation and Terraform for GCVE are coming, so when they are available, I will certainly blog about it! For now, I will walk through the GCVE GUI to detail SDDC and VPN gateway creation.
Have no fear – there will be plenty of Terraform in future posts.</p> <p><strong>Other posts in this series:</strong></p> <ul> <li><a href="/2021/02/gcp-vpc-to-gcve/">Connecting a VPC to GCVE</a></li> <li><a href="/2021/03/gcve-bastion/">Bastion Host Access with IAP</a></li> <li><a href="/2021/03/gcve-network-overview/">Network and Connectivity Overview</a></li> <li><a href="/2021/04/gcve-hcx-config/">HCX Configuration</a></li> <li><a href="/2021/05/gcve-networking-scenarios/">Common Networking Scenarios</a></li> </ul> <h1 id="prerequisites-for-creating-a-gcve-sddc">Prerequisites for Creating a GCVE SDDC</h1> <p>If you’ve read any of my previous blog posts on cloud networking, you will already know that the most important thing to do before deploying anything into the cloud is rigorous planning. Deploying an SDDC in GCVE is no different. You will need to designate several <em>unique</em> IP ranges to be used for SDDC infrastructure and workloads, ensure the proper firewall ports are allowed to manage your SDDC, and prepare your GCP environment before deploying an SDDC. All of these prerequisites are detailed in the <a href="https://cloud.google.com/vmware-engine/docs/quickstart-prerequisites">GCVE prerequisites documentation</a>, which I highly recommend reading. Google’s documentation is thorough, and there is nothing better than reading through all of the docs if you want to understand how this solution works. Here is an overview of the required steps:</p> <ul> <li>Plan the IP ranges you will use with GCVE. These are all <a href="https://en.wikipedia.org/wiki/Private_network">RFC 1918 private addresses</a>. You will need ranges for each of the following: <ul> <li><strong>vSphere and vSAN</strong> (/21 - /24 accepted). Depending on the size of the range you choose, it will be divided into additional subnets for management, vMotion, vSAN, and NSX. 
Details on the layout for these subnets are available <a href="https://cloud.google.com/vmware-engine/docs/concepts-vlans-subnets#management_network_cidr_range_breakdown">here</a>.</li> <li><strong>HCX</strong> (/27 or higher)</li> <li><strong>Edge Services</strong>, required for client VPN and internet access (/26)</li> <li><strong>Client subnet</strong>, assigned to clients connecting via VPN Gateway (/24)</li> <li><strong>Workload subnets</strong>, which will be configured in NSX-T after your SDDC is deployed. These are entirely up to you to determine, but my advice is to reserve plenty of IPs to use.</li> </ul> </li> <li>Ensure your local firewall is configured for communication with vCenter and workload VMs. Ports used for communication are documented in the <a href="https://cloud.google.com/vmware-engine/docs/quickstart-prerequisites#firewall-port-requirements">prerequisites</a>.</li> <li>Enable the VMware Engine API in your Google Cloud Project</li> <li>Enable the VMware Engine <a href="https://cloud.google.com/vmware-engine/quotas">node quota</a></li> </ul> <p>Once these are completed, you are ready to create your SDDC!</p> <h1 id="creating-a-gcve-sddc">Creating a GCVE SDDC</h1> <p>To create a GCVE SDDC, browse to <code class="language-plaintext highlighter-rouge">Compute &gt; VMware Engine</code> in the GCP Console. This will bring you to the GCVE homepage.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/01_create_private_cloud_edited.png" alt="" class="drop-shadow" /></p> <p>Click <code class="language-plaintext highlighter-rouge">Create a Private Cloud</code> to get started.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/02_create_private_cloud.png" alt="" class="drop-shadow" /></p> <p>Specify your cloud name, location, node count, and predetermined network ranges. If you cannot choose your desired region, ensure you have requested VMware Engine nodes quota for that region. 
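Before you type in those predetermined network ranges, it is worth double-checking the planning work above programmatically. Here is a minimal Python sketch using the standard library `ipaddress` module; the ranges shown are invented examples, not recommendations:

```python
import ipaddress

# Hypothetical ranges earmarked during planning (substitute your own)
ranges = {
    "vsphere_vsan": "192.168.0.0/21",
    "hcx": "192.168.8.0/27",
    "edge_services": "192.168.9.0/26",
    "client_subnet": "192.168.10.0/24",
    "workloads": "10.10.0.0/16",
}

nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}

# Every range must be RFC 1918 private address space
for name, net in nets.items():
    assert net.is_private, f"{name} is not RFC 1918"

# No two ranges may overlap each other
names = list(nets)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        assert not nets[a].overlaps(nets[b]), f"{a} overlaps {b}"

print("All ranges are private and non-overlapping")
```

A few seconds of validation here can save a redeploy later, since the vSphere/vSAN range cannot be changed after the private cloud is created.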
Your quota will also determine how many nodes you can request. The minimum node count is three nodes. After clicking <code class="language-plaintext highlighter-rouge">Review and Create</code>, you will be shown a confirmation page. Review your choices and click <code class="language-plaintext highlighter-rouge">Create</code>.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/04_create_private_cloud_edited.png" alt="" class="drop-shadow" /></p> <p>You will be taken to a summary page for your new cluster once provisioning begins. Note that the state is <code class="language-plaintext highlighter-rouge">Provisioning</code> in the screenshot above, and it will take between 30 minutes and 2 hours to complete. My experience has been that it takes just over 30 minutes to provision an SDDC, which is pretty impressive. You can click on the <code class="language-plaintext highlighter-rouge">Activity</code> tab to view recent events, tasks, and alerts. Drilling into those will provide specifics on any activity in your SDDC, including the provisioning process.
To do this, browse to <code class="language-plaintext highlighter-rouge">Network &gt; Regional</code> settings in the GCVE portal, and click <code class="language-plaintext highlighter-rouge">Add Region</code>.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/06_region_edge_services.png" alt="" class="drop-shadow" /></p> <p>Choose the region where your SDDC is deployed and enable <code class="language-plaintext highlighter-rouge">Internet Access</code> and <code class="language-plaintext highlighter-rouge">Public IP Service</code>. Supply the Edge Services range you earmarked during planning and click <code class="language-plaintext highlighter-rouge">Submit</code>. Enabling these services will take 10-15 minutes. Once complete, they will show as <code class="language-plaintext highlighter-rouge">Enabled</code> on the Regional Settings page. Enabling these settings will allow Public IPs to be allocated to your SDDC, which is a requirement for deploying a VPN Gateway. 
To begin the deployment, browse to <code class="language-plaintext highlighter-rouge">Network &gt; VPN Gateways</code> and click <code class="language-plaintext highlighter-rouge">Create New VPN Gateway</code>.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/08_create_vpn_gw.png" alt="" class="drop-shadow" /></p> <p>Supply the name for the VPN gateway and the client subnet reserved during planning and click <code class="language-plaintext highlighter-rouge">Next</code>.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/09_create_vpn_gw_edited.png" alt="" class="drop-shadow" /></p> <p>Choose specific users to grant VPN access, or enable <code class="language-plaintext highlighter-rouge">Automatically add all users</code>, and click <code class="language-plaintext highlighter-rouge">Next</code>.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/10_create_vpn_gw.png" alt="" class="drop-shadow" /></p> <p>Next, specify which networks to make accessible over VPN. I opted to add all subnets automatically. Click <code class="language-plaintext highlighter-rouge">Next</code>, and a summary screen will be displayed. Verify your choice and click <code class="language-plaintext highlighter-rouge">Submit</code> to create the VPN Gateway.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/12_create_vpn_gw_edited.png" alt="" class="drop-shadow" /></p> <p>You will be returned to the VPN Gateways page, and the new VPN gateway will have a status of <code class="language-plaintext highlighter-rouge">Creating</code>. 
Once the status shows as <code class="language-plaintext highlighter-rouge">Operational</code>, click on the new VPN gateway.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/13_create_vpn_gw_edited.png" alt="" class="drop-shadow" /></p> <p>Click <code class="language-plaintext highlighter-rouge">Download my VPN configuration</code> to download a ZIP file containing pre-configured OpenVPN profiles for the VPN gateway. Profiles for connecting via UDP/1194 and TCP/443 are available. Choose whichever is your preference and import it into OpenVPN, then connect. In the GCVE portal, browse to <code class="language-plaintext highlighter-rouge">Resources</code> and click on your SDDC.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/13_launch_vsphere_edited.png" alt="" class="drop-shadow" /></p> <p>Finally, you can click <code class="language-plaintext highlighter-rouge">Launch vSphere Client</code>. Log in with username <code class="language-plaintext highlighter-rouge">[email protected]</code> and password <code class="language-plaintext highlighter-rouge">VMwareEngine123!</code>. Huzzah! You are now free to explore your newly created SDDC in GCVE. Your first task should be updating the password for the <code class="language-plaintext highlighter-rouge">[email protected]</code> account.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/02/14_launch_vsphere_edited.png" alt="" class="drop-shadow" /></p> <h1 id="wrap-up">Wrap Up</h1> <p>As you can see, deploying an SDDC in GCVE is easier than setting up client VPN access. Now, a standalone SDDC is cool, but in the next post we will look at connecting it to a VPC. This will be almost entirely automated with Terraform, apart from a tiny bit of work that needs to be done in the GCVE portal.
Later posts will cover creating a bastion host, connecting with Cloud VPN and Cloud Interconnect, configuring HCX for workload migration, and all sorts of other use cases. Are you using GCVE? If so, please reach out to me on Twitter (<a href="https://www.twitter.com/networkbrouhaha">@NetworkBrouhaha</a>) and let me know what topics you’d like to see covered.</p> Thu, 04 Feb 2021 00:00:00 +0000 http://www.networkbrouhaha.com/2021/02/gcve-sddc-with-hcx/ http://www.networkbrouhaha.com/2021/02/gcve-sddc-with-hcx/ Cloud Connectivity 202 - Extending Layer 2 Into the Cloud <p class="center"><img src="https://networkbrouhaha.com/resources/2021/01/dragons.jpg" alt="" /></p> <p>In this post, I will talk about extending layer 2 into the cloud, including scenarios for when it is a good idea, and the numerous dangers involved. I let out a good, long sigh after writing that sentence. Beware, there are dragons ahead.</p> <p>If you’ve been around networks long enough, you’ve probably seen a network taken to its knees by a loop. The <a href="https://en.wikipedia.org/wiki/Spanning_Tree_Protocol">Spanning Tree Protocol</a> (STP) exists to prevent this issue, but a simple misconfiguration can prevent STP from doing its job. I’ve personally seen a hospital network taken down for days by a network loop, and I’ve heard many similar stories from other network engineers. A lot of time has been spent coming up with alternatives to STP-based networks, and for good reason. A major part of the CCIE Data Center curriculum I studied was alternatives to STP like TRILL and VXLAN EVPN with MP-BGP. Here’s the point: extending layer 2 can be dangerous, especially when done without precautions in place.</p> <h1 id="why-do-we-need-layer-2">Why Do We Need Layer 2?</h1> <p>The very first commercially available Ethernet standard, <a href="https://en.wikipedia.org/wiki/10BASE5">10Base5</a>, used a coaxial cable as a shared medium.
Multiple devices could be attached to the same cable, and each was identified by its MAC address. Later Ethernet standards kept backward compatibility with this model. Although significant improvements have been made over the years, much of the complexity of layer 2 forwarding semantics stems from this initial design.</p> <p>Devices can be connected to a hub or a switch without any sort of central coordination. Neighbors are discovered by broadcasting ARP requests, which makes building an ad-hoc network very easy. Things get messier when a broadcast packet enters a looped network. There is no <a href="https://en.wikipedia.org/wiki/Time_to_live">time to live</a> (TTL) value associated with a layer 2 frame, so it will be forwarded and reforwarded ad infinitum, resulting in a <a href="https://en.wikipedia.org/wiki/Broadcast_storm">broadcast storm</a>. Before we even discuss extending layer 2, it’s important to know that a network can be completely crushed if STP is improperly configured and a loop is introduced.</p> <h1 id="risks-of-extending-layer-2">Risks Of Extending Layer 2</h1> <p>Before cloud providers were available, extending layer 2 was typically accomplished over a Data Center Interconnect (DCI) that could serve as a trunk, so one or more VLANs could be extended across the link. The risk here is from fate-sharing. Before extending layer 2, any issue would be confined to one site. Once layer 2 is extended, a problem at one site will extend to the other. There are many good reasons why cloud providers use availability zones, and <a href="https://en.wikipedia.org/wiki/Fate-sharing">fate-sharing</a> is one of them. The other risk, of course, is that the larger the layer 2 domain, the more opportunities there are to introduce a loop.</p> <p>Beyond the risks of creating a network outage, extending layer 2 also has some basic disadvantages.
A default gateway can only exist at one site, so routed traffic between hosts at the site without the gateway may be <a href="https://en.wikipedia.org/wiki/Anti-tromboning">tromboned</a> across the DCI. If using an overlay to facilitate layer 2 extension, you need to be mindful of the implications it has for MTU across the link. “Silent hosts”, or hosts that don’t properly respond to ARP requests, can also be problematic in this scenario.</p> <h1 id="reasons-for-extending-layer-2">Reasons For Extending Layer 2</h1> <p>Hopefully, I’ve made a good case for why you should be very careful when extending layer 2. There are some good reasons to do so, but it is never something I would recommend doing for a long period of time. The main use cases I see for extending layer 2 are data center evacuation and migrating to the cloud while preserving assigned IP addresses. These are good use cases for layer 2 extension since the extension will be finite. Once the evacuation or migration is complete, the extension can be removed. I have seen layer 2 extensions used for disaster recovery purposes, and I would urge anyone who wants to do this to exercise extreme caution. Indefinite layer 2 extension is a recipe for trouble.</p> <h1 id="methods-for-extending-layer-2-to-the-cloud">Methods for Extending Layer 2 to the Cloud</h1> <p>In my post on <a href="/2020/11/cloud-connectivity-101/">cloud connectivity</a>, I listed the typical methods for connecting to a cloud provider. You will notice that all of the connections are based on layer 3, apart from a layer 2 VPN, which rides on top of a layer 3 connection. The prior approach of extending VLANs over a circuit simply isn’t an option in the cloud. In the same post, I pointed out that most clouds don’t use traditional layer 2 forwarding in their networks. This certainly presents a problem when trying to extend layer 2 as well!
The best solution available is to use an overlay, like <a href="https://en.wikipedia.org/wiki/Virtual_Extensible_LAN">VXLAN</a> or <a href="https://en.wikipedia.org/wiki/Generic_Networking_Virtualization_Encapsulation">GENEVE</a>, in your cloud of choice. This is where VMware-powered cloud offerings shine since they leverage NSX-T in the cloud-based SDDC. NSX-T uses GENEVE to emulate the needed layer 2 functions, and it provides a layer 2 VPN (L2VPN) appliance to perform the extension. <a href="https://cloud.vmware.com/vmware-hcx">VMware HCX</a> is also compatible with these solutions and provides layer 2 extension for migrated workloads.</p> <p>NSX-T L2VPN and HCX provide guidelines in their documentation that should be carefully studied before deployment. It is also important to know that you cannot use NSX-T L2VPN and HCX at the same time. The HCX documentation states “Virtual machine networks should only be extended with a single solution. For example, HCX Network Extension or NSX L2 VPN can be used to provide connectivity, but both should not be used simultaneously. Using multiple bridging solutions simultaneously can result in a network outage.” Since HCX supports extension to multiple sites, you must deploy the service meshes in a way that does not introduce a loop. See this diagram:</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2021/01/hcx-loop2.png" alt="" /></p> <p>Other solutions exist beyond VMware products, like <a href="https://en.wikipedia.org/wiki/Locator/Identifier_Separation_Protocol">LISP</a>, but I have not personally seen them used. Ultimately you are restricted to solutions supported by your cloud of choice, which are typically delivered in the form of supported third-party network appliances. If you’re aware of another viable method for extending layer 2 into the cloud, please leave a comment or send me a message on Twitter.
I’d love to see what others are using to accomplish this.</p> <h1 id="alternatives-to-extending-layer-2">Alternatives To Extending Layer 2</h1> <p>Imagine a world where there was never a hard-coded IP address anywhere. Addresses are assigned automatically, and DNS is instantly updated whenever an address is assigned or changed. Changing addresses is a non-issue since everything relies on DNS name resolution instead of an IP address. Sound too good to be true? It’s not! This is possible today, and it has been for years. Cloud-native networking works exactly in this manner, but it is possible to accomplish on any network, with some effort.</p> <p>For a variety of reasons, enterprise networks have not operated this way, and it is the primary driver behind the desire to extend layer 2. IPs are frequently hard-coded instead of relying on DHCP for address assignment and DNS for name resolution and service discovery. Running a network is difficult work, and forward-thinking practices like the ones I’m describing aren’t always prioritized. It’s easy to decry this as laziness, or poor planning, but I spent years working in networks similar to what I’ve described. I understand the amount of effort it takes to change the way a network fundamentally works, especially when you’re on a small team managing a huge network. If you never need to migrate a workload to another location, it may not be worth the effort.</p> <p>The primary alternative to extending layer 2 is to use layer 3 routed connectivity. This is exactly how the cloud was designed to work: it’s why there isn’t any real concept of layer 2 in the cloud, and why cloud providers don’t allow you to extend your VLANs onto their network. Unfortunately, this concept is difficult to swallow if you have a network like the one I describe above - heavily reliant on hard-coded addresses in many places.
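The DNS-first habit described above is easy to adopt in application code: resolve a name at connect time instead of baking in an address. A tiny Python sketch (using <code class="language-plaintext highlighter-rouge">localhost</code> as a stand-in for a hypothetical internal service record):

```python
import socket

def resolve(host, port):
    """Look the service up in DNS at connect time instead of hard-coding an IP."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address
    return sorted({info[4][0] for info in infos})

# "localhost" stands in for an internal record like db.example.internal
print(resolve("localhost", 5432))
```

If the service moves to another subnet, site, or cloud, only the DNS record changes; every client that resolves at connect time follows it automatically.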
If this is the case, you will need to work with management to acquire the resources needed to convert to a “cloud friendly” network on premises before moving workloads to the cloud. This may be a small effort or a massive one, depending on the size of your network, but it will position you to be better prepared for the future. The cloud is here to stay, and it’s a wonderful environment to work in once you get the hang of it.</p> <h1 id="wrap-up">Wrap Up</h1> <p>Whether or not it’s a good idea to extend layer 2 is a debate that has been going on for years, and I doubt it will stop any time soon. The guidance I will leave you with is to use layer 3 everywhere you can, and extend layer 2 only if you must. Do everything you can to configure your applications to use DNS instead of hard-coded IPs. Study cloud-native networks and start embracing those concepts in your own network. By doing this, you will be much better prepared for migrating your workloads to the cloud or moving to a hybrid cloud environment.</p> Mon, 11 Jan 2021 00:00:00 +0000 http://www.networkbrouhaha.com/2021/01/cloud-connectivity-202/ http://www.networkbrouhaha.com/2021/01/cloud-connectivity-202/ Cloud Connectivity 201 - Reliable Connectivity <p class="center"><img src="https://networkbrouhaha.com/resources/2020/12/clouds-trees.png" alt="" /></p> <p>My <a href="/2020/11/cloud-connectivity-101/">last post</a> laid out several options for connecting to the cloud. In this post, I’ll dive into reliable connectivity to the cloud. In the real world, circuits drop, transceivers fail, and software bugs cause lockups or unexpected reboots. This is one of the fundamental tenets of working with technology.
If you expect <a href="https://en.wikipedia.org/wiki/High_availability#Percentage_calculation">five nines</a> of uptime, you must plan for component failures and unexpected outages.</p> <p>When it comes to networking, reliability is typically achieved with <a href="https://en.wikipedia.org/wiki/Redundancy_(engineering)">redundancy</a> and <a href="https://en.wikipedia.org/wiki/Resilience_(network)">resiliency</a>. Redundancy means two or more components (e.g. switches, routers, circuits) are in place to tolerate the failure of any single device. Resiliency means that the overall system continues to operate when a component failure occurs. This usually requires some configuration on the part of the operator to ensure an automated switchover occurs, and hopefully testing of every failure scenario to ensure that switchover works as expected. Having two of everything doesn’t provide any benefit if the redundant component doesn’t take over when the primary fails!</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/12/first-rule-government-spending.png" alt="" /></p> <p>The rest of this post will focus on the specifics of reliable cloud connectivity, but it is important to consider that reliability is constructed in stages. I’m assuming that your existing network infrastructure is already reliable. If that is not the case, it won’t matter how awesome your cloud connectivity is. You must build upon a stable foundation.</p> <h1 id="software-defined-wan-sd-wan">Software Defined WAN (SD-WAN)</h1> <p>In my last post, I said that SD-WAN “may be the most exciting advancement in the world of networking in the past decade.” I’m referring to SD-WAN in general terms since there is a lot of vendor-specific secret sauce baked into the various offerings. With that in mind, here are the things that SD-WAN gets right when it comes to reliability:</p> <ul> <li>Redundancy is <em>assumed</em>. 
Reference architectures for SD-WAN assume there are at least two paths available for connectivity. This may be as simple as a primary internet circuit and an LTE connection for backup.</li> <li>Failover is <em>automatic</em>. SD-WAN devices constantly verify connectivity between the participating edge devices, as well as the health of that connection. When a connection goes down, traffic is automatically moved to another available connection. Even more impressively, if a connection is experiencing packet loss or high latency, traffic can be migrated to prevent a performance hit. With a traditional solution, this is a difficult problem to detect unless you are on top of your monitoring game.</li> </ul> <p>SD-WAN is still gaining traction, but the technology is exciting. There are several other benefits in terms of security, manageability, monitoring, and automation. If you’re building a new solution, or looking to replace aging edge hardware, SD-WAN deserves a hard look. You can build a fast, fault tolerant connection to the cloud, and even replace legacy WAN technologies like MPLS. The most important thing to consider with cloud connectivity via SD-WAN is that your chosen vendor has supported software appliances in your cloud(s) of choice. Read through their cloud reference architecture to ensure they meet your requirements. Cloud-based SD-WAN appliances are software-based, so they will have bandwidth limitations that should be taken into account as well.</p> <h1 id="dynamic-routing-and-automated-failover">Dynamic Routing and Automated Failover</h1> <p>If you’re not in a position to roll out SD-WAN, you will need to use the tried-and-true networking technologies that have been around for decades. Dynamic routing is the primary tool used to provide reliable network connectivity and facilitate traffic failover during an outage. 
There are several dynamic routing protocols that can achieve this, but when it comes to cloud connectivity you will be using <a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol">BGP</a>. Why not <a href="https://en.wikipedia.org/wiki/Open_Shortest_Path_First">OSPF</a> or <a href="https://en.wikipedia.org/wiki/Enhanced_Interior_Gateway_Routing_Protocol">EIGRP</a>? These protocols rely on IP broadcast or multicast to find neighbor routers and form peering adjacencies, and are generally intended for use within a LAN. BGP peering is established using <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol">TCP</a>, and it was designed to operate over a WAN or the Internet. It is, in fact, the duct tape and baling wire that holds the internet together!</p> <p>BGP requires a few bits of information to get working. First, you will need to enable the BGP process on your router and assign an Autonomous System Number (ASN), which will be explained in the next section. Next, you will define which networks will be advertised by BGP. The final step is to configure a neighbor router to peer with, along with the ASN of that router. Once the peering relationship is formed, your router receives all of the known BGP routes from the neighbor.</p> <p>In some cases, BGP relies on another routing protocol to be able to operate. When a router tries to establish a peering relationship, the TCP packets it sends need to be able to make it to the intended neighbor router. When using BGP to connect to the cloud, the routers involved are typically assigned addresses from within a /30 (or /31) subnet. This means no routing is required for the two devices to communicate – they are both connected to the same subnet and can communicate directly. 
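Putting the earlier steps together (enable the BGP process with an ASN, define the networks to advertise, configure a neighbor across the /30), a minimal generic Cisco-style sketch might look like this; the ASN, prefix, and peer address are purely illustrative:

```
router bgp 65010
 ! Advertise the on-prem prefix to the cloud peer
 network 10.1.0.0 mask 255.255.255.0
 ! Peer across the /30 with the cloud-side router (GCP, for example, peers from ASN 16550)
 neighbor 169.254.100.2 remote-as 16550
```

The exact syntax varies by vendor, but every implementation needs these same three pieces of information.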
Whether this communication happens over an IPsec tunnel or a point-to-point circuit doesn’t really matter, as long as the two devices can talk to each other.</p> <p>The magic happens when BGP peering is established over multiple paths to the same destination, which is what provides both redundancy and resiliency. We’ll get into the details in a moment, but this is the way we can automatically fail over if one link goes down. BGP will see the destination subnets advertised from multiple neighbors, and it will pick which path is the “best” based on an algorithm. If a circuit goes down and the neighbor is no longer reachable, BGP will choose the next best path, and traffic will be forwarded accordingly. These paths can be the same type of connection, or different. Here are a few examples:</p> <ul> <li>Redundant, route-based IPsec VPNs. Route-based VPNs allow BGP peering across them, providing a failover method if one tunnel fails.</li> <li>A combination of a direct connection and a route-based VPN as a backup</li> <li>Multiple direct connections</li> </ul> <p>If you’re using a solution like Megaport for connectivity, their website has several <a href="https://docs.megaport.com/connections/common-scenarios/">example network diagrams</a> for redundant connectivity.</p> <h2 id="autonomous-system-numbers-asns">Autonomous System Numbers (ASNs)</h2> <p>BGP uses the term “Autonomous System” to represent an entity or location, and each autonomous system is assigned a number (ASN). There are a few different flavors of ASNs, but for our purpose, we only need to worry about public and private ASNs. Much like IP addresses, public ASNs must be assigned by a <a href="https://en.wikipedia.org/wiki/Regional_Internet_registry">regional internet registry (RIR)</a>, while private ASNs (64512-65535) can be used freely within a private network. Each ASN in the routing topology must be unique, so you will need to plan your ASN usage to prevent overlaps, just as you would with your IP ranges.
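A quick way to enforce that planning discipline is a small helper that checks a candidate ASN against the private range and against ASNs already in use; the numbers below are invented for illustration:

```python
# 16-bit private ASN range, per the text above (64512-65535 inclusive)
PRIVATE_ASN_RANGE = range(64512, 65536)

# Hypothetical: ASNs already assigned somewhere in your topology
in_use = {64512, 64513, 65515}

def plan_asn(candidate):
    """Validate a candidate private ASN before assigning it."""
    if candidate not in PRIVATE_ASN_RANGE:
        raise ValueError(f"{candidate} is not a 16-bit private ASN")
    if candidate in in_use:
        raise ValueError(f"{candidate} is already in use")
    in_use.add(candidate)
    return candidate

print(plan_asn(64900))  # → 64900
```

Keeping this inventory in source control alongside your IP plan makes overlaps easy to catch before they reach a router.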
In some cases, a cloud provider will allow you to specify a private ASN for your cloud resources, and in other cases, it is set by the provider and cannot be changed. Check your cloud provider’s documentation to see what ASN they use so you can plan accordingly. If you have been assigned a public ASN then your life is a bit easier, since it is unique to your organization. Some commonly used ASNs are listed below.</p> <table> <thead> <tr> <th><strong>Cloud Provider</strong></th> <th><strong>ASN</strong></th> <th><strong>Notes</strong></th> </tr> </thead> <tbody> <tr> <td>AWS</td> <td>7224<br />64512</td> <td>ASN 7224 is used for Direct Connect peering. ASN 64512 is the default ASN for VGW or Direct Connect Gateway, but you can specify a different private ASN when those resources are created.</td> </tr> <tr> <td>Azure</td> <td>8074<br />8075<br />12076<br />65515-65520</td> <td>ASN 12076 is used for ExpressRoute peering. ASNs 65515-65520 are reserved by Azure and should not be used on the customer side.</td> </tr> <tr> <td>GCP</td> <td>16550</td> <td>ASN 16550 is used for Interconnect peering.</td> </tr> <tr> <td>Oracle</td> <td>31898<br />64555</td> <td>ASN 31898 is used for Private and Public peering. ASN 64555 is reserved by Oracle and should not be used on the customer side.</td> </tr> <tr> <td> </td> <td>23456<br />4294967295<br />64496-64511<br />65535-65551</td> <td>These ASNs are reserved by <a href="https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority">IANA</a> and should not be used on the customer side.</td> </tr> </tbody> </table> <p>Each route that BGP learns also includes the AS Path needed to reach that destination. The path is the list of ASNs that are traversed to reach the destination. Consider this diagram:</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/12/as_path_ex1.png" alt="" /></p> <p>The AS path from router A to router D is <code class="language-plaintext highlighter-rouge">20 30</code>.
Likewise, the AS path from router D to router A is <code class="language-plaintext highlighter-rouge">20 10</code>. The AS path is an important part of the algorithm BGP uses to choose the best path to a destination, which we will learn more about in a moment. It is also used to prevent loops. If a BGP router receives a route with its own AS in the path, the route is discarded as a loop-prevention mechanism.</p> <h2 id="internal-bgp-ibgp-vs-external-bgp-ebgp">Internal BGP (iBGP) vs External BGP (eBGP)</h2> <p>I’m not going to get too in-depth on the inner workings of BGP, but it is worth knowing the difference between internal BGP (iBGP) and external BGP (eBGP). iBGP is the term for BGP peering between two routers within the same autonomous system. eBGP is the term for peering between two different autonomous systems. In the example diagram above, routers B and C are connected via iBGP, and the remaining connections are eBGP. Connections to a cloud provider will always use eBGP. If you are running iBGP within your network, then there is a good chance you already know more about BGP than I’m going to cover in this post!</p> <h2 id="bgp-best-path-algorithm">BGP Best Path Algorithm</h2> <p>If you’ve ever studied for a CCNA, you may get cold chills reading the phrase “BGP Best Path Algorithm”. Immediately your brain recalls the “N WLLA OMNI” acronym used to remember the steps in the algorithm, although you may struggle to remember what the letters stand for. For our purposes, the primary metric we need to worry about is the “A” – AS path length. As long as the other metrics are equal, which they typically are, BGP simply counts the number of autonomous systems it has to traverse to reach a destination to determine the AS path length. The path with the least number of autonomous systems traversed is considered the best path. 
Whether or not this path is the best-performing or lowest-latency path is another question altogether, but these are not metrics that BGP considers as part of its algorithm. Here is another example topology to consider:</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/12/as_path_ex2.png" alt="" /></p> <p>There are now two paths from router A to router D, <code class="language-plaintext highlighter-rouge">20 30</code> and <code class="language-plaintext highlighter-rouge">40 50 30</code>. The latter may be higher bandwidth or more reliable, but BGP will pick the path through AS 20 since that will result in the shortest AS Path (<code class="language-plaintext highlighter-rouge">20 30</code>).</p> <p>One tool that can be used to influence BGP path selection is AS prepending. AS prepending means you artificially add additional AS numbers into the path. These prepending rules are applied to a neighbor relationship between routers and should use the ASN of the local system. Prepending an ASN other than your own may have unintended consequences. AS prepending may be performed outbound on routes being advertised to neighbors or inbound on routes being received from neighbors.</p> <p>Here is an example to illustrate this concept. If router D (AS 30) prefers traffic from router A (AS 10) to arrive over the link with router F (AS 50), it can prepend <code class="language-plaintext highlighter-rouge">30 30</code> to the AS path it advertises to router C (AS 20). This new path will be advertised to router A, and the two paths router A will see to router D are <code class="language-plaintext highlighter-rouge">20 30 30 30</code> and <code class="language-plaintext highlighter-rouge">40 50 30</code>. Now, the best path is through AS 40 and 50, since it is the shorter AS path. Using AS prepending is a way to influence the routing topology to behave differently from the default BGP behavior.
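To make the arithmetic concrete, here is a toy Python sketch of this AS-path-length comparison. It ignores every other BGP attribute, which the real algorithm only does when those attributes are equal, and the AS numbers match the example topology above.

```python
# Toy BGP best-path comparison using only AS-path length
# (assumes all other attributes -- weight, local preference,
# origin, MED, etc. -- are equal, as in the example above).
def best_path(paths):
    """Pick the advertisement with the shortest AS path."""
    return min(paths, key=len)

# Two paths from router A to router D, as in the diagram.
paths = [[20, 30], [40, 50, 30]]
assert best_path(paths) == [20, 30]

# Router D prepends its own ASN (30) twice toward AS 20, so the
# path via AS 20 grows to four hops and loses the comparison.
prepended = [[20, 30, 30, 30], [40, 50, 30]]
assert best_path(prepended) == [40, 50, 30]
```

The prepended path loses purely because it is longer; nothing in the comparison knows whether the newly preferred path is faster or slower.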
You will frequently see this practice referred to as “traffic engineering”, although AS prepending is just one of many tools available to manipulate BGP.</p> <h2 id="equal-cost-multipath-ecmp">Equal Cost Multipath (ECMP)</h2> <p>By default, BGP will calculate a single path to each destination. As I mentioned earlier, that path may change if a failure happens somewhere in the network, which is exactly what we want to see. But what about the scenario where we have two high bandwidth links to a destination? It may feel like a waste of money to have an expensive circuit provisioned to merely serve as a backup. This is where Equal Cost Multipath (ECMP) comes in. ECMP allows for two or more “equal cost” routes to be installed in the routing table. ECMP is used frequently in networking to leverage multiple links at the same time, but different routing protocols handle it differently. There may even be differences in behavior between hardware vendors. Generally with BGP, ECMP has to be enabled, so refer to the vendor documentation to figure out the right knob to turn.</p> <p>When properly configured, ECMP will allow you to take full advantage of the available links. A common scenario is two direct connections to a cloud provider (ideally purchased from two different carriers for diversity). When both links are working, traffic is transferred across either link based on a “hash”. Usually this is done by reading the source and destination addresses and port numbers of a packet and feeding them into a predefined calculation to produce a hash value. Even values would be assigned to one link, and odd to the other, providing a rudimentary way of balancing traffic across the links. If one circuit fails, that route is removed from the routing table, and all traffic would traverse the remaining link. 
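The hash-based link selection and failover described above can be sketched as a toy model. Real devices hash in hardware with vendor-specific inputs; the link names and flow values below are made up for illustration.

```python
import hashlib

# Toy ECMP model: the 5-tuple fields of a packet feed a hash,
# and the result selects one of the equal-cost links.
def pick_link(src, dst, sport, dport, proto, links):
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return links[digest % len(links)]

links = ["carrier-a", "carrier-b"]

# Packets from the same flow always hash to the same link,
# so per-flow packet ordering is preserved.
first = pick_link("10.0.1.5", "172.16.0.9", 49152, 443, "tcp", links)
assert pick_link("10.0.1.5", "172.16.0.9", 49152, 443, "tcp", links) == first

# If one circuit fails, its route is withdrawn from the routing
# table and every flow lands on the remaining link.
assert pick_link("10.0.1.5", "172.16.0.9", 49152, 443, "tcp", ["carrier-b"]) == "carrier-b"
```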
Redundant <em>and</em> cost-effective!</p> <h2 id="additional-considerations">Additional Considerations</h2> <p>By now you’ve probably read all you care to about BGP, but there are a few more components I’d like to mention.</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection">Bi-directional Forwarding Detection (BFD)</a> is a network protocol that is complementary to BGP. By default, BGP can take several seconds or minutes to detect a failure. BFD can be used to speed up failure detection. If you cannot tolerate an outage of more than a few seconds, configure BFD along with BGP.</li> <li>Prefix lists define a list of routes, and they can be used to filter the routes that are either advertised or received by BGP. Applying prefix lists to your BGP neighbors is a best practice. By default, your equipment will receive all routes advertised by its BGP neighbors, and you certainly want to prevent unintended routes from being advertised into your network. Refer to your network vendor documentation for prefix-list syntax and application.</li> <li>BGP peering over a public network can be a security risk since attackers could use the advertised routing information to perform reconnaissance on your network, or attempt to disrupt or hijack the session. To mitigate this, BGP sessions can be authenticated with a pre-shared key. This is also considered a best practice, and you will need to read the docs to determine how this is configured on your equipment. To prevent unnecessary troubleshooting, I typically try to stand up a BGP peering connection without authentication. I’ll go back and enable authentication once I’ve confirmed it’s working as expected.</li> </ul> <h2 id="wrap-up">Wrap Up</h2> <p>Multiple redundant links with automated failover are your goal if you want reliable connectivity to the cloud. If it’s up to me, I’m using SD-WAN everywhere that I can. All of the complicated bits of BGP are abstracted away, leaving only the benefits.
There is always a trade-off, which in this case is vendor lock-in, and perhaps cost depending on how large your network is. But BGP has been around for a long time, and it’s not going anywhere. There are plenty of network engineers who have spent hundreds of hours designing and supporting BGP, so you will seldom have to worry about finding someone who can provide support for your environment. Choose the solution that works best for you, plan, design, deploy, test, and test again. If everything goes as planned, you will have rock-solid connectivity to the cloud of your choice.</p> Tue, 08 Dec 2020 00:00:00 +0000 http://www.networkbrouhaha.com/2020/12/cloud-connectivity-201/ http://www.networkbrouhaha.com/2020/12/cloud-connectivity-201/ Cloud Connectivity 101 <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/lightning.png" alt="" /></p> <p>With my <a href="/2020/11/network-princples-cloud/">prior post</a> in mind, we can look at the various methods available for connecting to the cloud. This isn’t intended to be an exhaustive list, but it should cover the vast majority of your options.</p> <h2 id="public-ip">Public IP</h2> <p><strong>Pros</strong>: Ubiquitous, inexpensive<br /> <strong>Cons</strong>: Potentially insecure, no performance guarantees<br /> <strong>What you’ll need</strong>: A business-class internet circuit<br /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/cloud-public-ip.png" alt="" /></p> <p>Regardless of the other methods listed, you’ll connect to a cloud provider over Public IP the first time you connect to their console to set up your account. Depending on what you host in the cloud, this may be the only connectivity you need. It’s the easiest way to get to the cloud, but also the least secure.
If you’re transferring sensitive information over the public internet, make sure you leverage modern application-based encryption, whether over TLS/HTTPS, SSH or some other protocol.</p> <p>Assuming we’re talking about <a href="https://en.wikipedia.org/wiki/Infrastructure_as_a_service">Infrastructure as a Service (IaaS)</a> and running virtual instances in the cloud, there are variations between providers around how public IPs are assigned. It’s important to understand how your provider assigns addresses, and how they recommend exposing them to the internet. Frequently a load balancer is used for this, and it’s a good idea to use one even if you’re starting small. It’s easier to start with a load balancer than move to one later.</p> <p>Public connectivity is not well suited for the typical backend administrative access you’d see in a traditional data center. No security team would recommend exposing all your servers to the internet. Addressing this will likely require adding in one of the connection methods listed below, like a VPN.</p> <p>Cloud providers allow tight integration between resources you deploy and their hosted DNS solution, but it will still be up to you to configure it properly. Take time to learn the various network services provided, as they will be similar to the network appliances you deploy on prem, but the capabilities and requirements may be different from what you’re used to. 
As I mentioned previously, there are many options for applying security policy in the cloud, so developing a solid security strategy should be one of the first things you do.</p> <p>The remaining connection methods, apart from Direct Connection and Network as a Service, build on top of Public IP connectivity.</p> <h2 id="traditional-vpn">Traditional VPN</h2> <p><strong>Pros</strong>: Secures sensitive traffic<br /> <strong>Cons</strong>: Requires additional hardware or software, potential performance bottleneck, does not scale well<br /> <strong>What you’ll need</strong>: Hardware or software capable of terminating an IPSec or SSL VPN tunnel<br /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/cloud-trad-ipsec.png" alt="" /></p> <p>Layering a <a href="https://en.wikipedia.org/wiki/Virtual_private_network">VPN</a> on top of Public IP connectivity will provide a secure connection into your cloud resources. All cloud providers support <a href="https://en.wikipedia.org/wiki/IPsec">IPSec</a> tunnels, and most support establishing <a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol">BGP</a> peering over the tunnel to exchange routes. If you need client-based SSL VPN you will need to deploy a network appliance in your cloud environment, but there are <a href="https://aws.amazon.com/marketplace/b/Network-Infrastructure/2649366011">numerous</a> <a href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/category/networking?subcategories=firewalls&amp;page=1">options</a> <a href="https://console.cloud.google.com/marketplace/browse?filter=category:networking">available</a>.</p> <p>Using a VPN is the easiest way to build a <a href="https://en.wikipedia.org/wiki/Cloud_computing#Hybrid_cloud">hybrid cloud</a> solution, and it will give you a secure way to access any instances you’ve deployed via their private IP. 
If you were paying attention to what I wrote earlier in my previous post, you know that you should have a good IP addressing scheme that includes non-overlapping IP ranges for everything you’re deploying on prem and in the cloud. This becomes even more important if you’re deploying resources into multiple regions in a single cloud provider, or multiple cloud providers.</p> <p>VPNs, like most technologies, evolve over time, so if you have an aging VPN solution in your environment you may want to consider replacing it with a modern one. Policy-based IPSec VPNs have been around for decades, but cloud providers are encouraging route-based VPNs, which is what you’ll have to use if you want dynamic routing over VPN. As your environment scales, dynamic routing becomes increasingly important, so if you aren’t comfortable with BGP now is the time to learn it.</p> <h2 id="layer-2-vpn">Layer 2 VPN</h2> <p><strong>Pros</strong>: IP Mobility<br /> <strong>Cons</strong>: Requires additional hardware or software, potential performance bottleneck, not recommended for long term deployments<br /> <strong>What you’ll need</strong>: Specialized hardware or software capable of building an L2 tunnel<br /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/cloud-l2vpn.png" alt="" /></p> <p>A Layer 2 VPN is used to span a Layer 2 segment, typically a VLAN, across a WAN link. I don’t have scientific data to back this up, but I’d bet a milkshake that if you asked a hundred network engineers if spanning Layer 2 across sites is a good idea, ninety-nine of them would say no. Spanning layer 2 across sites, or over a VPN, introduces complexity and does not scale well. There are some good reasons to do it, but I would never recommend it as a long-term solution.</p> <p>Normally, a layer 2 VPN is used to migrate existing VMs to another site or cloud provider, while preserving assigned IP addresses. Common scenarios are disaster recovery and data center evacuation. 
I won’t go on another rant about the importance of DNS, but you can see why I climbed up on that soapbox. Putting that aside, if everyone involved knows the risks introduced by stretched L2, and it’s temporary, it can be a handy tool.</p> <p>There are a handful of options for L2 VPN, including <a href="https://cloud.vmware.com/vmware-hcx">VMware HCX</a> and <a href="https://docs.vmware.com/en/VMware-NSX-T-Data-Center/2.5/administration/GUID-86C8D6BB-F185-46DC-828C-1E1876B854E8.html">NSX L2VPN</a>. These have been verified to work in supported cloud providers, but if you choose another solution, be careful to make sure that it is supported by your cloud provider. There is no traditional layer 2 forwarding in most native cloud provider networks, so an overlay like <a href="https://en.wikipedia.org/wiki/Virtual_Extensible_LAN">VXLAN</a> or <a href="https://en.wikipedia.org/wiki/Generic_Networking_Virtualization_Encapsulation">GENEVE</a> is used to emulate layer 2 semantics. These overlays encapsulate packets in UDP, so large packets will be fragmented when transmitted over a WAN link. Unless there is a solution for local traffic egress, there will be tromboning of traffic across the L2VPN for any remote endpoints to reach their default gateway.</p> <p>My advice is to stick to routed layer 3 traffic if possible, even if it takes some work to get there. 
L2VPN is a tool that can be deployed if absolutely necessary.</p> <h2 id="sd-wan">SD-WAN</h2> <p><strong>Pros</strong>: Flexibility, scalability, potential cost savings<br /> <strong>Cons</strong>: Requires additional hardware or software, potential for vendor lock-in<br /> <strong>What you’ll need</strong>: Hardware or software capable of building an SD-WAN mesh, like <a href="https://www.vmware.com/products/sd-wan-by-velocloud.html">VeloCloud</a><br /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/cloud-sdwan.png" alt="" /></p> <p><a href="https://en.wikipedia.org/wiki/SD-WAN">Software Defined WAN (SD-WAN)</a> may be the most exciting advancement in the world of networking in the past decade. Building and maintaining VPN connections is a tough job, especially at scale. SD-WAN makes this process much simpler, since all the heavy lifting of creating tunnels, monitoring connectivity, and intelligently routing traffic between locations is handled by a controller. There are potential cost savings as well. Many businesses have replaced their expensive MPLS networks with SD-WAN meshes running over redundant internet connections.</p> <p>Currently SD-WAN is not standardized, and each vendor offering has its own unique feature set. You will need to do some homework on your own to find the solution that works best for your environment and cloud providers. 
If you’re considering hybrid-cloud or multi-cloud deployments, you should certainly look at SD-WAN for connecting your environments.</p> <h2 id="direct-connection">Direct Connection</h2> <p><strong>Pros</strong>: High-bandwidth, low-latency cloud connectivity<br /> <strong>Cons</strong>: Cost<br /> <strong>What you’ll need</strong>: A point-to-point circuit from a local telco that provides connectivity to the cloud provider of your choice, and a router or firewall capable of terminating the circuit<br /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/cloud-directconnect.png" alt="" /></p> <p>One of the challenges of working with the cloud is nomenclature. Cloud providers offer so many capabilities, and each offering has a name that has been carefully crafted by their marketing department. You may see a direct connection to a cloud provider referred to as Direct Connect, Express Route, Cloud Interconnect, Fast Connect, or Direct Express Cloud Bonanza. Okay, the last one is fake, but you get the point.</p> <p>While there is some variation between how the various cloud providers handle direct connections, this is a straightforward path to the cloud. If you are in a standalone data center, you will likely be working with your local telco to provision a circuit from your data center to the cloud provider of your choice. If you have an existing MPLS network, you may be able to have a “leg” connected to a cloud provider as an alternative. Many colocation facilities are offering direct circuits or cross-connects to the closest geographic cloud regions, so check your colocation offerings if that is where your equipment resides.</p> <p>Review your cloud provider’s documentation for the technical requirements and ordering process for a direct connection. Once the physical circuit is installed, there will be a setup process to complete in the cloud provider portal. Depending on whether you want connection to private resources (e.g. 
virtual instances deployed with private addresses) or public resources (e.g. provider offerings like object storage), you will need to follow the provider documentation to set up routing across your connection. Most likely this will involve bringing up a BGP peering with the provider between your network and theirs. You may be required to have your own <a href="https://en.wikipedia.org/wiki/Autonomous_system_(Internet)">Autonomous System Number (ASN)</a> and dedicated public IP address range to access public resources over a direct connection.</p> <p>I will explore the specifics of the various cloud provider direct connection options in a future post.</p> <h2 id="network-as-a-service-naas">Network as a Service (NaaS)</h2> <p><strong>Pros</strong>: High-bandwidth, low-latency cloud connectivity, and the ability to dynamically provision cloud connections<br /> <strong>Cons</strong>: Cost, only available in limited locations<br /> <strong>What you’ll need</strong>: A cross connect to the NaaS provider, and compatible hardware to terminate the connection<br /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/cloud-naas.png" alt="" /></p> <p>The last connection method I’ll mention is what some refer to as <a href="https://en.wikipedia.org/wiki/Network_as_a_service">Network as a Service (NaaS)</a>. This is similar to a direct connection, but with much more flexibility. <a href="https://www.megaport.com/">Megaport</a> and <a href="https://www.equinix.com/interconnection-services/cloud-exchange-fabric/">Equinix Cloud Exchange Fabric</a> are two examples of this type of service. Typically, you will need to be in a colocation facility to connect to a NaaS provider. 
If you’re in a standalone data center, you could provision a circuit to your closest NaaS provider and connect via that method.</p> <p>Once physically connected to your network, NaaS providers allow you to dynamically provision virtual circuits to cloud providers, managed service providers (MSPs), other data centers, or directly to an ISP. The strength of this solution is in its flexibility. Many NaaS providers provide an API to provision virtual circuits, meaning you could dynamically create and destroy connections to various cloud providers as needed.</p> <p>If you are looking for high-speed, low-latency connectivity to the cloud, and NaaS is available to you, it’s a great choice.</p> <h2 id="wrap-up">Wrap Up</h2> <p>I’ll be exploring additional topics pertaining to cloud connectivity in future posts, but I hope this is a helpful rundown of the options. To recap, you should have a fully developed plan before you provision any sort of cloud connectivity. The actual connectivity, whether it be over the internet, VPN, or direct connection, will depend on several factors. Budget is likely the biggest hurdle for most, and you will pay for better performance.</p> <p>When making your decision, consider the words of <a href="https://twitter.com/SharpNetwork/status/1326917912347734025">Eyvonne Sharp</a>, “A myopic focus on cost, instead of business value, is the bane of IT. It is also a harbinger of irrelevance.”</p> Tue, 17 Nov 2020 00:00:00 +0000 http://www.networkbrouhaha.com/2020/11/cloud-connectivity-101/ http://www.networkbrouhaha.com/2020/11/cloud-connectivity-101/ Network Principles for Cloud Connectivity <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/clouds.png" alt="" /></p> <p>Over the past few years, I’ve encountered all sorts of confusion about how cloud resources should be architected, accessed, and consumed. In this post, which is the first in a series, I will walk through some networking basics relevant to cloud connectivity. 
The <a href="/2020/11/cloud-connectivity-101/">next post</a> covers methods for connecting to cloud providers, and subsequent posts will dive deeper into specific topics.</p> <h1 id="network-basics">Network Basics</h1> <p>The internet runs on – you guessed it – the <a href="https://en.wikipedia.org/wiki/Internet_Protocol">Internet Protocol (IP)</a>, which is what you’ll use to connect to the various cloud providers. Overall, we’re running out of <a href="https://en.wikipedia.org/wiki/IPv4_address_exhaustion">unallocated public IPv4 addresses</a>, although that shortage doesn’t seem to <a href="https://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html">apply to cloud providers just yet</a>. IPv4 is well understood, and it is not going anywhere for many years, but now is the time to consider your IPv6 strategy. There is no reason to be afraid of IPv6, and it eliminates the need for many of the workarounds put in place for IPv4 over the past decades, like <a href="https://en.wikipedia.org/wiki/Network_address_translation">Network Address Translation (NAT)</a>.</p> <p>Overwhelmingly, the problem that comes up again and again when it comes to cloud connectivity is basic routing and overlapping IP address spaces. With IPv4, this is a planning exercise that many people fail to do. With IPv6, this problem disappears, due to the massive number of unique addresses available. The crux of this issue is that every router that moves packets across a network can only have one entry in its routing table for each destination subnet. When all subnets are unique, this is easily accomplished, but almost every corporate network is using <a href="https://en.wikipedia.org/wiki/Private_network">RFC 1918 private addresses</a> somewhere. To put it simply, if you’re using 10.0.1.0/24 on your corporate network, you can’t use that specific range anywhere else.
If you do, anything deployed in that duplicate network will not be able to communicate with the original range.</p> <p>One trick commonly deployed to address overlapping network ranges is to use <a href="https://en.wikipedia.org/wiki/Longest_prefix_match">longest prefix match</a>. Basically, this means that the route to the most specific network is the one that will be used. For example, if a router has valid routes for 10.0.0.0/8, 10.0.1.0/24, and 10.0.1.0/30, the last route is the most specific. This is easy to spot, as its prefix length, /30, is a larger number than those of the other routes listed. Some newer network fabrics even create “host routes” (/32 in IPv4 or /128 in IPv6) for every endpoint. Since these are the most specific routes possible, they always “win”, and they allow for great flexibility in terms of where endpoints are connected. In the world of networking, every decision like this involves a tradeoff. In this case, the tradeoff is flexibility versus a potentially large routing table. Luckily, we are in an era where network hardware is not as resource constrained as it was in decades past, so having hundreds of thousands of known routes is less of a risk. I’m placing a bet that we’ll see this approach used more and more. Regardless, this is a tool you can use when designing your network to make efficient use of your IP address ranges. If I’m able to get across one point through this post, it’s the importance of planning ahead. Plan your IP addressing scheme, for both IPv4 and IPv6, since it’s inevitable that you will need to use both eventually.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/11/lpm-example.png" alt="" /></p> <p>One last point to mention is the importance of DNS. In a perfect world, no one would ever memorize an IP address (unless it’s for a DNS server, and 8.8.8.8 is mercifully easy to remember). Unfortunately, this is not the case.
Hardcoding IP addresses is common practice, and I still run across people who don’t “trust” DNS or claim that some obscure application doesn’t support it. Name resolution and service discovery are incredibly important functions in the cloud, to the point that DNS has a 100% uptime SLA with some providers. In many cases, DNS is the only service that is guaranteed to be available all the time. If everything on your network is using DNS hostnames, and DNS is accurate and easy to update, changing the IP address of an endpoint becomes a much easier task. DNS has been around almost as long as the internet itself, so it’s high time that we do the heavy lifting to use DNS hostnames everywhere instead of IP addresses.</p> <h1 id="security-and-statefulness">Security and Statefulness</h1> <p>I would be remiss to write about cloud connectivity without mentioning security. Figuring out how to allow legitimate traffic through firewalls and Access Control Lists (ACLs) is a joy that every network engineer gets to experience, and the sheer number of malicious actors means we all must keep security top of mind when planning and operating our networks. Securing networks means adding complexity, and, like an IP addressing scheme, needs careful planning.</p> <p>While properly securing networks is critical, it’s important to understand exactly how it’s done, and what effect it has on our network. IP was designed with the <a href="https://en.wikipedia.org/wiki/End-to-end_principle">end-to-end principle</a> in mind. The original designers envisioned networks that purely moved packets, with any necessary intelligence implemented at the endpoints. A destination address would be all that is needed to get a packet where it needs to go. While that is not the reality of modern networking, it is still a worthy goal.</p> <p>Network appliances, like firewalls, typically introduce some sort of state tracking.
This is why some firewalls are referred to as “stateful firewalls” - they track the connection state of traffic flows as they traverse the firewall, and they use that state to determine whether to forward additional traffic. NAT Gateways and Load Balancers track state for similar reasons. As more state is stored on network appliances, we move further and further away from the ideal of the end-to-end principle. This isn’t necessarily a bad thing, but it is worth understanding, as well as architecting networks that minimize the reliance on state stored in appliances. I have seen massive web applications that are completely reliant on proprietary load balancer features. This is a situation that should be avoided. Just because you can use your network to solve a specific problem doesn’t mean that you should.</p> <p>In terms of security policy, my best advice is to choose specific points for policy enforcement, and make sure they are well understood. In a typical enterprise network, this is usually obvious since policy enforcement happens at the firewall, although this has started to change with the advent of micro-segmentation. Cloud providers provide several options for applying security policy. Using AWS as an example, security policy can be applied at the instance (VM) level with <a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html">security groups</a>, or at the subnet level with <a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html">network ACLs</a>. Whichever method you choose, stay consistent, and document your security practices. A good rule of thumb is to apply security policy as close to the endpoint as possible.</p> <h1 id="wrap-up">Wrap Up</h1> <p>Certainly, there are many technical requirements to consider when it comes to connecting to the cloud. I hope you will consider the planning and design requirements first, to put yourself in a position for success. 
Networking basics, like IP schemes and routing, as well as a good understanding of network security, and how state affects your traffic flows are subjects every technical cloud consumer needs to understand.</p> <p>Please leave a comment or <a href="https://www.twitter.com/networkbrouhaha">reach out to me on Twitter</a> and let me know if there is anything I’ve missed! My next post will cover the various methods available for connecting to the cloud.</p> Thu, 12 Nov 2020 00:00:00 +0000 http://www.networkbrouhaha.com/2020/11/network-princples-cloud/ http://www.networkbrouhaha.com/2020/11/network-princples-cloud/ Running Ansible Playbooks with GitHub Actions <p>I recently co-presented a session titled <a href="https://bit.ly/3g51lBR">Codify Your Environment with Terraform and Ansible</a> at the inaugural <a href="https://forward.rubrik.com/">Rubrik Forward Digital Summit</a>. My demo used <a href="https://github.com/features/actions">GitHub Actions</a> to run a number of Ansible playbooks. The code is hosted at <a href="https://github.com/rfitzhugh/Forward-2020-Codify-Your-Environment">https://github.com/rfitzhugh/Forward-2020-Codify-Your-Environment</a>. One of my main takeaways for those attending the session is that <a href="https://en.wikipedia.org/wiki/Continuous_integration">Continuous Integration</a> (CI) tools like GitHub Actions can be used to test and run automation against your infrastructure fairly easily. Let’s take a look.</p> <h1 id="the-github-actions-flow">The GitHub Actions Flow</h1> <p>CI tools were created to make life easier for developers by allowing them to test changes to their code rapidly. Modern CI tools leverage cloud infrastructure and containers to run these tests when code is committed. 
A typical workflow looks something like this:</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/05/ci1-bg.png" alt="" /></p> <p>Normally, someone creates a branch on a Git repository, commits some changes, and opens a Pull Request. This alerts the repo owner(s) that there are proposed code changes. A pipeline is usually triggered when the pull request is opened to run tests against the new code. The test result is returned when completed, and is a major consideration as to whether or not the new code is merged. Another pipeline may be triggered after the pull request is merged, or at other points in the process. The timing for when pipelines run is specified in the CI pipeline configuration, and varies by project and need.</p> <p>Rubrik Chief Technologist Chris Wahl wrote a <a href="https://wahlnetwork.com/2020/05/12/continuous-integration-with-github-actions-and-terraform/">snazzy blog post</a> that demonstrates this workflow with Terraform. His post goes into greater detail on GitHub Actions than I will, so if you’re new to the tool then I recommend taking a moment to read his post before continuing.</p> <p>When it comes to using this process to execute Ansible Playbooks against your on-prem infrastructure, the process is slightly different. It looks like this:</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/05/ci2-bg.png" alt="" /></p> <p>The beginning is the same, up to where the CI pipeline is triggered to run. In this case, we only want the pipeline to run <em>after</em> a Pull Request is merged into the master branch. The master branch should be considered the source of truth for your infrastructure, and changes must never be made directly to the master branch. All modifications to the Playbooks in the repo will be proposed via Pull Request, reviewed, commented on and ultimately accepted or denied. Only after the PR is accepted and merged to master will the Ansible Playbook(s) be executed. 
This is an important concept to grasp: you certainly do not want unintended changes to be made because a pipeline ran unexpectedly.</p> <p>The other difference from the first example is that a local runner is used to execute the pipeline instead of a Docker container. This runner is installed in a Virtual Machine, in this case in our lab, so it has connectivity to all of the infrastructure that we’d like to automate. Instructions for adding a local runner to a repo and making it available to GitHub Actions are located here: <a href="https://help.github.com/en/actions/hosting-your-own-runners/adding-self-hosted-runners">https://help.github.com/en/actions/hosting-your-own-runners/adding-self-hosted-runners</a>. Notice the ominous “<strong>Warning</strong>: We recommend that you do not use self-hosted runners with public repositories.” This is indeed sound advice - some additional security considerations will be covered near the end of this post.</p> <h1 id="configuring-github-actions-for-automation">Configuring GitHub Actions for Automation</h1> <p>A basic GitHub Actions configuration file needs to contain three things:</p> <ul> <li>When to Run</li> <li>Where to Run</li> <li>What to Run</li> </ul> <p>There are, of course, a bazillion different ways this can be configured. The examples below will focus on the Ansible workflow described above. A full configuration can be found here: <a href="https://github.com/rfitzhugh/Forward-2020-Codify-Your-Environment/blob/master/.github/workflows/run-playbooks.yml">https://github.com/rfitzhugh/Forward-2020-Codify-Your-Environment/blob/master/.github/workflows/run-playbooks.yml</a>.</p> <h2 id="when-to-run">When to Run</h2> <p>For many CI pipelines, this will be when a Pull Request is opened, but as mentioned above, that won’t work for our case. We only want the pipeline to run after a Pull Request is <em>merged</em> (i.e. approved). 
Here’s what that configuration looks like:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>on: push: branches: - master </code></pre></div></div> <p class="center"><sub><sup><strong>Do this</strong></sup></sub></p> <p>The “push” event is triggered when code is merged into the master branch. The first time I tried to figure out how to set this configuration, I had trouble finding the correct keywords. I ended up digging around in issues and forums, and found this solution:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>on: pull_request: branches: [ master ] types: [closed] </code></pre></div></div> <p class="center"><sub><sup><strong>Don’t do this</strong></sup></sub></p> <p>While this does execute a pipeline after a Pull Request is merged, it will also run the pipeline when the PR is simply <em>closed</em>. Closing the PR means no code was merged in, hence no changes, but it still results in the pipeline running when you wouldn’t expect it to. Hopefully this wouldn’t be a major problem since Ansible is idempotent, and the repo still matches the running state of your infrastructure. Still, stick with the first example. Do <em>not</em> use <code class="language-plaintext highlighter-rouge">on: pull_request</code> for your automation CI pipeline!</p> <h2 id="where-to-run">Where to Run</h2> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jobs: run-playbooks: runs-on: self-hosted </code></pre></div></div> <p>There is one job (“run-playbooks”) in this example, made up of one or more steps. This example instructs GitHub Actions to execute the steps on your self-hosted runner. The main reason for this is that the runner has connectivity to the infrastructure you’re attempting to automate. You also have the ability to pre-stage any necessary dependencies on the runner. 
This may be Ansible collections, Python modules, or other prerequisites.</p> <h2 id="what-to-run">What to Run</h2> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>steps: - uses: actions/checkout@v2 - name: Run Ansible Playbook run: ansible-playbook your_playbook.yaml </code></pre></div></div> <p>Finally, the files from the repo are checked out via Git, and the Ansible playbook is fired off. This can be repeated, as necessary, if there are multiple playbooks. If Ansible returns an error, the pipeline aborts and the repo owners will be notified that the pipeline failed. If you are running multiple playbooks, you may have a workflow that is partially complete. In this case, there is probably some clean-up to do. Revert things to their previous state, and troubleshoot the error that was returned before trying again.</p> <h1 id="security-considerations">Security Considerations</h1> <p>First, you are placing sensitive information about your environment in a Git repository. There is no reason I can think of to use a public repo for this. Private repos are now free on GitHub, so use that capability and keep prying eyes away from your carefully designed playbooks.</p> <p>Second, whatever you put in your CI pipeline will be executed <em>in your environment, on your local runner</em>. This is why all Pull Requests should be reviewed by someone other than the original author. Have a second set of eyes review any proposed changes, and if possible, test thoroughly in a sandbox environment before proposing a Pull Request.</p> <p>Third, rely on <a href="https://help.github.com/en/actions/configuring-and-managing-workflows/creating-and-storing-encrypted-secrets">GitHub Secrets</a> to store passwords, tokens, and other sensitive information. The full configuration linked above demonstrates how to use secrets associated with a repo, and inject them into environment variables on the local runner so Ansible can reference them. 
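As a rough sketch of what that injection can look like, a step can map a repo secret into an environment variable (the secret name below is illustrative, not one defined in the linked repo):

```yaml
steps:
  - uses: actions/checkout@v2
  - name: Run Ansible Playbook
    run: ansible-playbook your_playbook.yaml
    env:
      # Hypothetical secret; define it under the repo's
      # Settings -> Secrets before referencing it here
      ANSIBLE_NET_PASSWORD: ${{ secrets.ANSIBLE_NET_PASSWORD }}
```

The secret is decrypted only at run time, and the playbook can read it with an `ansible.builtin.env` lookup or a `lookup('env', ...)` expression rather than storing the password in the repo.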
Keep in mind that some secrets may be displayed in your CI logs if you have verbose logging enabled, so use caution.</p> <p>I’ve seen recommendations to build an “emergency off switch” that can abort automation workflows from running. This could be a firewall rule that blocks all communication to and from the local runner, or a script that immediately powers it off. An abort mechanism is worth considering as you increase your reliance on automation.</p> <h1 id="wrap-up">Wrap Up</h1> <p>Back in 2018 when I wrote my review of <a href="https://networkbrouhaha.com/2018/03/network-automation-book-review/">Network Programmability and Automation</a>, one of my few complaints about the book was that it was light on how CI pipelines are used within an automation workflow. I hope this post was helpful in shedding some light on this topic. There are many different available tools and approaches to this, so you may end up with a different way of tackling this in your own environment. I’d love to hear how you’re managing automation, or if there is anything I can add to my approach to improve it. Hit me up any time on Twitter <a href="https://www.twitter.com/networkbrouhaha">@NetworkBrouhaha</a>.</p> Tue, 19 May 2020 00:00:00 +0000 http://www.networkbrouhaha.com/2020/05/ansible-with-github-actions/ http://www.networkbrouhaha.com/2020/05/ansible-with-github-actions/ Budget Bliss with AWS Lambda and the You Need A Budget API <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/logos.png" alt="" /></p> <p>I’ve been a crappy budgeter my whole life. It’s a chore, it’s difficult to manage, and for some reason there is never quite enough money. I’ve tried several products over the years, like Microsoft Money and Mint. I was happy with Mint for a long time, but it never quite fit the bill (forgive the pun). About a year ago I heard about <a href="https://www.youneedabudget.com/">You Need A Budget</a> (YNAB), and I was immediately intrigued. 
The YNAB “method” made more sense to me than other tools out there, and it offers a robust and well-documented <a href="https://api.youneedabudget.com/">API</a>. A little poking around in the API docs confirmed that it would be fairly easy to set up a system I had wanted for many years: a <a href="https://en.wikipedia.org/wiki/Positive_feedback">positive feedback loop</a> of daily budget updates. My vision was to get a text message once or twice a day with the amount of money left in some specific budget categories.</p> <p>There is a lot of research on feedback loops, both positive and negative, with regard to human behavior. A simple example is a speed trailer placed in a neighborhood. Research shows that the simple act of displaying someone’s speed causes them to slow down. I had the idea that the same principle would apply to budgeting. If my wife and I had a better idea of where things were at budget-wise on a daily basis, we could make smarter spending decisions. I knew this would require some diligence since every transaction has to be assigned a category and approved within YNAB before it is reflected in the budget. This is not too difficult to keep up with since the text message would also serve as a reminder to get into YNAB and categorize any new transactions.</p> <h1 id="enter-aws-lambda">Enter AWS Lambda</h1> <p>If you’re even remotely paying attention to technology, you’ve heard of “serverless” or “Functions as a Service” (FaaS). <a href="https://aws.amazon.com/lambda/">AWS Lambda</a> is Amazon’s offering in this space, and it has quickly gained adoption due to its ease of use and comprehensive functionality. It’s also cheap - there is no charge when a Lambda function is not in use, and even if I ran my budget alert twice a day that is a max of 62 function executions a month. (Spoiler alert: I’ve had this running for over a month and my AWS bill has gone up by about $0.15. 
That includes the fees to send text messages via Amazon SNS.)</p> <p>Using Python to interact with a REST API is easy. <a href="https://www.getpostman.com/">Postman</a> will <a href="https://learning.getpostman.com/docs/postman/sending_api_requests/generate_code_snippets/">generate the code</a> needed to make a call in a matter of clicks. From there, it is just a matter of executing the code, and sending the results in a text message. There are many ways to automate text messaging, but I chose to use <a href="https://aws.amazon.com/sns/">Amazon SNS</a> due to its simple integration with Lambda and low cost.</p> <h1 id="the-setup">The Setup</h1> <p>This example will pull the current amount of money left in one budget category via a Lambda function communicating with the YNAB API, and send it to one or more people via SMS. It’s not too difficult to expand this example to send more budget category balances, if desired. Personally, I’ve used AWS for years to host my personal DNS zones via Route53, and I have some home movies stored in Glacier. If you’ve never used AWS before, now is a good time to create an account using <a href="https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/">these instructions</a>.</p> <h2 id="ynab">YNAB</h2> <p>There are few requirements for YNAB beyond having completed the initial setup and created a budget. You will need an API key, and to obtain that you must have an account with a username and password. If you are using your Google account (or some other service) to log in to YNAB, you will need to create a password before obtaining your key.</p> <p>Instructions for obtaining a key, along with API documentation, are available at <a href="https://api.youneedabudget.com/">https://api.youneedabudget.com/</a>. 
To create a key, follow these steps:</p> <ol> <li><a href="https://app.youneedabudget.com/settings">Sign in to the YNAB web app</a> and go to the “My Account” page and then to the “Developer Settings” page.</li> <li>Under the “Personal Access Tokens” section, click “New Token”, enter your password and click “Generate” to get an access token.</li> <li>Open a terminal window and run this (the <code class="language-plaintext highlighter-rouge">-i</code> flag tells curl to include the response headers): <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -i -H "Authorization: Bearer &lt;ACCESS_TOKEN&gt;" https://api.youneedabudget.com/v1/budgets </code></pre></div> </div> </li> <li>You should receive a result beginning with <code class="language-plaintext highlighter-rouge">HTTP/1.1 200 OK</code> followed by a JSON payload with your budget information.</li> <li>Save your API key in a safe place, like 1Password or LastPass. You will need this key when you are configuring your Lambda function. <strong>This key is essentially the same as your username and password, so don’t share it with anyone</strong>. If you share code on GitHub, remember to remove your key!</li> </ol> <p>Now, use the API explorer to grab the IDs from YNAB to use in your Lambda script. You will need to find the ID for the budget category (or categories) you want to track.</p> <ol> <li>From <a href="https://api.youneedabudget.com/">https://api.youneedabudget.com/</a>, click “API Endpoints” in the top right of the page.</li> <li>Scroll down to “Categories” and click the lock icon beside <code class="language-plaintext highlighter-rouge">GET /budgets/{budget_id}/categories</code></li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/ynab01.png" alt="" /></p> <ol start="3"> <li>When prompted, paste in your API key, click “Authorize”, then close the authorization page. 
Do not click “Logout”.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/ynab02.png" alt="" height="50%" width="50%" /></p> <ol start="4"> <li>Click the <code class="language-plaintext highlighter-rouge">GET /budgets/{budget_id}/categories</code> line to expand the information for this API endpoint. Click “Try it out”.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/ynab03.png" alt="" /></p> <ol start="5"> <li>For budget_id enter <code class="language-plaintext highlighter-rouge">last-used</code>. If you have multiple budgets this will load the categories for the last budget used. I only have one budget, so this will always default to the correct budget.</li> <li>Click “Execute”</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/ynab04.png" alt="" /></p> <ol start="7"> <li>Scroll down and examine the response. You will see your budget categories displayed in JSON format. Find the category you want to track, and save the ID listed. Below is a screenshot of one budget category, with the ID partially obfuscated. You should see this pattern repeated for every budget category you have.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/ynab05.png" alt="" /></p> <h2 id="amazon-sns">Amazon SNS</h2> <p>Log into the <a href="https://console.aws.amazon.com/">AWS Console</a>, type “SNS” into the “Find Services” bar, and click “Simple Notification Service” when it comes up. Within SNS, there are two sections to configure: Subscriptions and Topics.</p> <ol> <li>Click “Topics”, then “Create Topic”</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/sns01.png" alt="" /></p> <ol start="2"> <li>Provide a name for your topic as well as a display name, and click “Create topic”. I used “ynab-alerts” for both. Note the “ARN” provided after the topic is created. 
You will need this value for your Lambda function.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/sns02.png" alt="" /></p> <ol start="3"> <li>Click “Subscriptions”, then “Create Subscription”</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/sns03.png" alt="" /></p> <ol start="4"> <li>Choose the ARN for the topic that you just created, set “Protocol” to “SMS”, and plug in your cell phone number for “Endpoint”. You must follow the format given, e.g. +15555551234. Click “Create subscription”. Amazon will send a confirmation text message to confirm that the owner of the phone number entered does indeed wish to participate in the subscription. Without that confirmation step, you would have a very inexpensive way to spam friends (or enemies) with all sorts of texts!</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/sns04.png" alt="" height="50%" width="50%" /></p> <ol start="5"> <li>Repeat step four for any additional cell phones that you want to send budget updates to.</li> </ol> <h2 id="aws-lambda">AWS Lambda</h2> <p>Click the “Services” dropdown at the top of the page, type “Lambda” and hit enter. This is where we will configure the function used to retrieve information from the YNAB API, and send it to a cell phone via SMS.</p> <ol> <li>Click “Create function” on the Lambda dashboard. Leave “Author from scratch” selected, provide a name for your function, choose “Python 3.7” as the runtime, and click “Create function”. Note that the default execution role is sufficient and does not need to be changed. You will now be at the configuration page for your function.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda01.png" alt="" /></p> <ol start="2"> <li>Click “Add trigger” and choose “CloudWatch Events”. Click the dropdown under “Rule” and choose “Create a new rule”. Provide a name (e.g. 
“YNAB-schedule”) and a Schedule expression, and uncheck “Enable trigger” for now. This is a <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html#CronExpressions">cron-formatted rule</a>, evaluated in UTC, so you can specify the times that work best for you to get alerts. I get alerts at noon and 8:00pm, so my expression looks like <code class="language-plaintext highlighter-rouge">cron(0 16,00 * * ? *)</code>. Once your schedule is set, click “Add”.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda02.png" alt="" /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda03.png" alt="" /></p> <ol start="3"> <li>You should be back on the Lambda function configuration page. If “CloudWatch Events” is still highlighted in the designer, click on the Lambda function name so that the code editor is displayed.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda04.png" alt="" /></p> <ol start="4"> <li>Scroll down to view the built-in code editor. In another tab, open the example code from this <a href="https://gist.github.com/shamsway/49e2a7f32a18cc9563c50cd1ba59f2ae">GitHub Gist</a>. Copy that code and paste it into the Lambda code editor. Replace the placeholders for API Key (line 6), Budget category (line 7), and SNS Topic ARN (line 31). The editor will save code automatically, but to be safe click File-&gt;Save.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda05.png" alt="" /></p> <ol start="5"> <li>At the top of the page, click the dropdown beside the “Test” button and choose “Configure test events”. Provide an event name (e.g. “Test”) and click “Create”. 
Now click the “Test” button, and if everything is working as expected, you will receive a text with the amount of money left in your budget category!</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda08.png" alt="" /></p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda06.png" alt="" /></p> <ol start="6"> <li>Feel free to adjust the script to change wording or check additional categories based on the steps outlined. It will require some Python knowhow to adjust the existing script and add the logic and outputs for additional categories, but I’m happy to help you if needed. Once you are satisfied with the alert you’re getting when the function runs, click on “CloudWatch Events” in the “Designer” section, then click the slider to enable your schedule. You will now receive texts matching your defined schedule.</li> </ol> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/lambda07.png" alt="" /></p> <h1 id="final-thoughts">Final Thoughts</h1> <p>Here’s a screenshot of the text that my wife and I get twice a day. I’m tracking four different categories, but the code is a bit hacky so I’ve opted not to share it.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/ynab-alert.png" alt="" height="50%" width="50%" /></p> <p>Here is my monthly AWS bill with the relevant services highlighted. Clearly this is easy to fit into the budget.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2020/01/aws-bill.png" alt="" /></p> <p>I hope this is a helpful exercise for other YNAB users out there, or at least an example showing the power of APIs and serverless functions. There are many APIs available for consumption, so with a little imagination you can build all sorts of useful tools by following this example. I chose not to share the script I’m using to monitor multiple categories because it is fairly ugly and could use some enhancements. 
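For readers who would rather sketch their own version, here is a minimal, hypothetical outline of the moving parts. The milliunits conversion reflects how the YNAB API reports amounts ($1.00 is represented as 1000 milliunits), but the environment variable names, category handling, and overall structure are illustrative assumptions, not the code from the Gist.

```python
# Hypothetical sketch of the Lambda function; names and env vars are
# placeholders, not the author's actual script.
import json
import os
import urllib.request

YNAB_API = "https://api.youneedabudget.com/v1"

def milliunits_to_dollars(milliunits: int) -> str:
    """YNAB reports currency amounts in milliunits: $1.00 == 1000."""
    return f"${milliunits / 1000:,.2f}"

def get_category_balance(api_key: str, category_id: str) -> str:
    """Fetch one category from the YNAB API and format its balance."""
    req = urllib.request.Request(
        f"{YNAB_API}/budgets/last-used/categories/{category_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        category = json.load(resp)["data"]["category"]
    return f"{category['name']}: {milliunits_to_dollars(category['balance'])} left"

def lambda_handler(event, context):
    import boto3  # bundled with the Lambda Python runtime
    message = get_category_balance(
        os.environ["YNAB_API_KEY"], os.environ["YNAB_CATEGORY_ID"]
    )
    boto3.client("sns").publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"], Message=message
    )
    return {"statusCode": 200}
```

Reading the key, category ID, and topic ARN from environment variables (configurable on the Lambda function's configuration page) is one way to keep them out of the code itself, rather than pasting them inline as the Gist's placeholders suggest.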
Ultimately I’d like to trigger this function with an API call, and pass in the categories to track via a JSON payload. This would allow me to schedule the same twice-daily messages, but I could also use the script to check categories from Alexa or some other method.</p> <p>Did I miss anything? Please comment below with questions, or find me on Twitter at <a href="https://twitter.com/NetworkBrouhaha">@NetworkBrouhaha</a>.</p> Wed, 22 Jan 2020 00:00:00 +0000 http://www.networkbrouhaha.com/2020/01/budget-bliss/ http://www.networkbrouhaha.com/2020/01/budget-bliss/ I'm joining Rubrik <p>This post will be short and sweet. I’m joining Rubrik as a Technical Marketing Engineer, focusing on Networking and Security.</p> <p>Why:</p> <ul> <li>To challenge myself and learn new things</li> <li>To contribute some of my networking (and other) knowledge</li> <li>To focus my energy on a solid product that I believe in</li> <li>To work with some incredibly bright minds</li> <li>Rubrik has a growing customer base with increasingly complex networks</li> </ul> <p>Leaving SIS was a difficult decision. I made some great friends there and genuinely hope those friendships continue. I have nothing but kind words for the SIS folks, and this new direction isn’t because of anything negative happening there. Simply put, an opportunity to work at Rubrik is one that is too good to pass up. They are disruptive in their market, and by my appraisal they are doing things the right way. No one has anything but good things to say about their leadership, and they have a compelling story about their product.</p> <p>I can’t wait to see what the next few months will bring. I’m excited to learn everything about Rubrik and get to know my new team members. 
Unfortunately this means that I’ll have to pause work on my “Hybrid Home Lab” setup, but I will continue that effort as soon as I can.</p> Mon, 22 Oct 2018 00:00:00 +0000 http://www.networkbrouhaha.com/2018/10/im-joining-rubrik/ http://www.networkbrouhaha.com/2018/10/im-joining-rubrik/ VMworld 2018 Recap <p>My journey to VMworld 2018 began in an unexpected way - a <a href="https://twitter.com/hcmccain/status/1023994969810460674">tweet from Chris McCain</a>.</p> <blockquote> <p>u a CCIE? Apply for a full VMworld pass.</p> </blockquote> <p>What did I have to lose? Never mind that VMworld was only a month away, and I had no approval to actually travel to Las Vegas. I filled out a little survey about how much I love NSX, and pressed submit. <em>Fast forward a week, a wild email appears:</em></p> <blockquote> <p>Thanks for filling out the application, I’d love to offer you the NSX Mindset scholarship to VMworld!</p> </blockquote> <p>Uh oh. That is not what I expected. After a good amount of scrambling on my part, and some very gracious actions by my employer, I was approved for travel and lodging in Las Vegas for VMworld 2018. What an exciting and unexpected turn of events! I registered, made arrangements, and started prepping to head to a conference I never expected to attend. With CiscoLive fresh on my mind, I made a concerted effort to keep my schedule realistic. I definitely wanted plenty of time to meet new people, hit some hands-on labs, and some down time so I didn’t exhaust myself. I exercised self-control while scheduling sessions, but it was not easy. There was a long list of options that piqued my interest.</p> <p>It’s no secret that my roots are in networking. I’ve spent plenty of time in the compute and storage silos, and I attend my local VMUG, but I am an “outsider” when it comes to the #vCommunity. I expected that VMworld would be a lot like CiscoLive in terms of form and function. 
I found this to be mostly true, but there are some differences that I will call out throughout this post.</p> <p>August crept by slowly, but the time finally came to board a plane bound for Las Vegas. After checking into my hotel and grabbing my VMworld badge, it was time for <a href="https://blog.vmunderground.com/opening-acts-2018/">Opening Acts</a>, followed by <a href="https://blog.vmunderground.com/vmunderground-2018/">VMunderground</a>. This was the first of several differences I noticed between VMworld and CiscoLive. If VMworld were a planet, it would have several orbiting moons that represent all the community events happening in conjunction with it. vDodgeball, vSoccer, vFit runs, vBeards gatherings - there is something for everyone. From what I can tell, these all have roots in VMUG (or vBrownbag). VMware made a smart decision in supporting and empowering VMUG leaders. It has spawned a vibrant community, and it sets VMworld apart from other events.</p> <p>Opening Acts was a great way to kick off VMworld. The panel on <a href="https://www.youtube.com/watch?v=D2CMVJQPZio">“Beating IT Burnout”</a> was a highlight, and it was fun watching my friend <a href="https://twitter.com/tbgree00">Thom</a> up on stage. <a href="https://twitter.com/MindfulAlicia">Alicia Preston</a> spoke about practicing mindfulness to combat burnout. This presentation spawned several other hallway conversations throughout the week. If you missed it, take the time to watch. VMunderground was also a great time, and I got the opportunity to meet several folks that I would continue to see at blogger tables and the VMTN area. I definitely recommend this event for anyone that is new to VMworld.</p> <h3 id="sessions">Sessions</h3> <p>Overall the session content was very good, and I was surprised at the depth of the networking material. In general, I found the sessions to be a bit more technical at CLUS than VMworld, but not by much. 
One thing I missed from CLUS was having access to a copy of the slides for each session. There were several times a presenter blew past a slide that I wanted to digest a bit more. Having the slides also keeps people from feeling like they have to snap a picture of every one. Here are my highlights:</p> <ul> <li><a href="https://videos.vmworld.com/searchsite/2018/videoplayer/18995">NSX Mindset: Clouds Collide, Opportunity Strikes (NET1919BU)</a> - This is not a technical talk, but I’d recommend it to anyone working in IT. Chris McCain is a fantastic presenter and could probably work the motivational speaker circuit.</li> <li><a href="https://videos.vmworld.com/searchsite/2018/videoplayer/20207">Kubernetes NSX-T Deep Dive (NET1677BU)</a> - I’ve spent plenty of hours trying to detangle networking in Kubernetes. This presentation lays out k8s topics and constructs in an easy-to-understand way, and makes a great case for NSX-T as one of the best ways to “do networking” in Kubernetes.</li> <li><a href="https://videos.vmworld.com/searchsite/2018/videoplayer/22674">Next-Generation Reference Design with NSX-T Data Center: Part 1 (NET1561BU)</a></li> <li><a href="https://videos.vmworld.com/searchsite/2018/videoplayer/22675">Next-Generation Reference Design with NSX-T Data Center: Part 2 (NET1562BU)</a></li> <li><a href="https://videos.vmworld.com/searchsite/2018/videoplayer/23018">VMware Cloud on AWS with NSX: Use Cases, Design, and Implementation (NET1327BU)</a> - Good overview of networking in VMWonAWS, plus a preview of things to come with NSX-T support.</li> </ul> <h3 id="keynotes-and-announcements">Keynotes and announcements</h3> <p>Honestly, I don’t really care about keynotes at conferences. The only ones I’m truly interested in are the non-technical ones, a la Michio Kaku &amp; Amy Webb at CLUS, and Malala Yousafzai at VMworld. All of the announcements are already well covered, so I’m not going to generate yet another list. 
I was absolutely thrilled at the opportunity to hear Malala speak, and I give VMware major credit for bringing her to speak, along with committing to supporting her charity. There were some grumbles about the increased security, but in my opinion it was all worth it. I am so inspired by this young woman and her commitment to fighting for education for girls everywhere. Someone - I’m not saying who - recorded her talk on Periscope, and you can watch <em>here</em>. Thinking about it still gives me all the feels.</p> <h3 id="parties">Parties</h3> <p>Maybe it’s because VMworld is in Las Vegas, but it would be an understatement to say that there were lots of parties going on. My MO for conferences is to treat them like work. I’m there to learn, and my employer is paying for me to be there. However, there were a few baller parties that are worth mentioning.</p> <ul> <li>Rubrik had the party of the week in my opinion. RUN-DMC <em>and</em> The Roots?! It was non-stop awesome and I danced my butt off. I have been a fan of The Roots since 1996, and I had only seen them live once. I made my way to the front of the stage and enjoyed a once-in-a-lifetime show. RUN-DMC was also great and Jam Master Jay’s son is a hell of a turntablist. Kudos to Rubrik for throwing a great party. <a href="https://www.youtube.com/playlist?list=PLmyCQ1p5hbAgWITKwFW6HEYAGk21b7OPQ">Here are a couple videos I took from the party</a></li> <li>VMfest was, in my opinion, a fun time. Several people I talked to skipped the party altogether. I’ve read comments from many people that thought it was terrible. <a href="http://www.royalmachinesmusic.com/home/">Royal Machines</a> was an unpopular choice for a band - I was disappointed when I saw the announcement. If this hadn’t been my first VMworld I might have skipped the party as well, but I decided to go into it with an open mind. When I walked in, there were <em>long</em> lines for food trucks scattered around the entrance area. 
I have no idea why people were waiting, as there was food available in several other places. I never had to wait in line for a drink all night. The theme was four different environments: tropical, desert, jungle, and aquatic. Maybe this turned people off - I thought it was an original idea and the decorations were well done. Royal Machines were a pleasant surprise. I’m a sucker for a good cover band, and it was a fun show. They completely embraced the ridiculousness of who they are. Dave Navarro is a rock god - it was a pleasure to watch him play. Mark McGrath understands that everyone thinks he’s a joke, and he is still willing to get out on the stage and give it his all. He gets my respect for that. Macy Gray covering Radiohead: Awesome. Sebastian Bach covering Ozzy: Awesome. Robin Zander in general: Awesome. Surprise appearance by DMC: Awesome. <a href="https://www.youtube.com/playlist?list=PLmyCQ1p5hbAgazhLfv2Lvu5iwhEPIrKd3">Videos from the show</a> / <a href="http://www.royalmachinesmusic.com/home/latest/events/vmware-las-vegas/">Setlist</a></li> <li>The NSBU threw an “NSX Mindset” party at the <a href="https://1923lv.com">1923 Bourbon Bar</a>. The place was packed, and rockin’. I truly wish I could have stayed longer, but I did not want to experience FUTURE:NET with a hangover. I did the responsible (i.e. boring) thing and slipped out early.</li> </ul> <h3 id="futurenet">FUTURE:NET</h3> <p>Future:net is a one-day “conference within a conference”, described as a “discussion on the future of networking with industry leaders and visionaries”. It is invite-only, and I was lucky to receive an invitation along with my scholarship. I first heard about this event on <a href="https://packetpushers.net">Packet Pushers</a>, and I was immediately intrigued. Of everything I had scheduled at VMworld, I was most excited for this event, and it did not disappoint. The event took place all day Thursday, and there was a welcome reception Wednesday evening. 
I considered skipping the reception, and I’m glad I walked over to The Four Seasons instead of taking a nap. The first person I met leads networking teams at Google. Not long after that, Pat Gelsinger showed up. I was standing right beside him as he and Greg Ferro made a bet about the SD-WAN industry.</p> <blockquote> <p>I just made a SD-WAN bet with @pgelsinger that NSX Velocloud and Cisco Viptela will NOT have 70% market share by this time next year. Tell me I am wrong ? https://twitter.com/etherealmind/status/1035011002855636992</p> </blockquote> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/09/pgelsinger.jpg" alt="" height="50%" width="50%" /></p> <p>Thursday the conference kicked off with breakfast and a live recording of a Packet Pushers podcast, which was a real treat to watch. I have been a loyal listener for many years, but I had never gotten the chance to meet Greg, Ethan and Drew. After breakfast, the presentations began, and the first presenter was a professor from Cornell discussing blockchain. Of every presenter on the agenda I was least excited for this talk - I feel like we’ve all heard more than enough about blockchain already. I was completely wrong, and it may have been my favorite talk of the day. <a href="https://twitter.com/el33th4xor">Emin Gun Sirer</a> delivered a fascinating talk about why blockchain as a technology is much more interesting than cryptocurrencies.</p> <p>I live-tweeted the event and this blog is already long enough, so you can see my thoughts and others here: <a href="https://twitter.com/search?f=tweets&amp;vertical=default&amp;q=%23futurenet18&amp;src=typd">#FutureNET18</a>. You can also find coverage in <a href="https://packetpushers.net/podcast/weekly-show-406-updates-and-introspection/">Packet Pushers Weekly Episode 406</a> and <a href="https://packetpushers.net/podcast/network-break-200-vmware-navigates-multicloud-perils-and-opportunities/">Network Break 200</a>. 
I will try to write some more words about this event later - it really deserves its own blog post. Needless to say, I was honored to attend and it was one of the highlights of my week.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/09/packetpushers.png" alt="" height="50%" width="50%" /></p> <h3 id="closing-thoughts">Closing thoughts</h3> <ul> <li>As I mentioned, I’m not going to regurgitate all of the announcements from VMworld. Here are a few links if you still need to catch up. <ul> <li>https://www.vmware.com/products/whats-new.html?src=so_5a314d05e49f5&amp;cid=70134000001SkJn</li> <li>https://anthonyspiteri.net/vmworld-2018-recap-part-1-major-announcement-breakdown/</li> <li>https://anthonyspiteri.net/vmworld-2018-recap-part-2-community-and-veeam-recap/</li> </ul> </li> <li>VMworld has a little ways to go in terms of organization. Compared to CLUS, registration was a hot mess. Cisco Live is a larger conference, and Cisco clearly throws a <em>lot</em> of resources at it. There are other small things I missed, like tables in the breakout rooms. Would this stuff prevent me from coming again? Probably not. VMware does a very good job with this conference, but they could take a couple pages out of Cisco’s playbook.</li> <li>There was a question thrown out in the Packet Pushers Slack: If you went to VMworld this year, would you go again? My answer is probably. I’m not sure if it’s an event that I would need to hit every year, but I really enjoyed my experience. The only thing that bothered me was the location. I am not a fan of Las Vegas. Everything is too expensive. Everything is over the top. There are times when I’m mildly amused, but they are few and far between. I am <em>not</em> the morality police and I am not interested in judging anyone, but being in Vegas pushes me to the edge. It makes me feel icky. 
I’ve made no decision on whether I’ll request to go to San Francisco in 2019, but I’ll seriously consider it.</li> <li>Some genius at DEF CON was handing out “blockchains” - miniature cinder blocks on a dogtag chain. I found this to be incredibly punny, so I gathered the necessary materials and brought some with me to Vegas. I figured it would be a fun way to break the ice and meet new people, and I was not wrong. Everyone loved them, and I met so many people that I would not have met otherwise. I wish I knew who came up with the original idea so I could give him/her credit. Having something fun to share is an awesome way to meet people, especially if you’re a newcomer. I’m already thinking about ways to expand on this idea if I make it to VMworld in San Francisco.</li> </ul> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/09/blockchain.png" alt="" /></p> <ul> <li>If you’re in Vegas and you don’t get a meal at Hash House a Go Go, you’re losing at life.</li> </ul> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/09/hashhouse.png" alt="" height="40%" width="40%" /></p> Thu, 13 Sep 2018 00:00:00 +0000 http://www.networkbrouhaha.com/2018/09/vmworld-2018-recap/ http://www.networkbrouhaha.com/2018/09/vmworld-2018-recap/ Hybrid Home Lab Pt. 1 <p>Over the last few weeks I’ve been working on standing up my version of a “real” lab. I’ve got enough information together to start writing some blog posts, so let’s dive right in. Previously, my home lab was just a custom-built Linux server with plenty of memory and software RAID. This was enough to do some small-scale network labs and run the few applications I needed, but it really doesn’t qualify as a true home lab. There’s no way for me to work with a vSphere or KVM cluster, let alone NSX-v or NSX-T. 
I’ve laid out a few goals for my “Hybrid Home Lab”:</p> <ul> <li>On Prem Resources <ul> <li>2x UCS C220 M3</li> <li>Re-purpose existing server as a home NAS <ul> <li>Utilize hardware RAID and serve LUNs via iSCSI or NFS</li> </ul> </li> <li><a href="https://www.ubnt.com/edgemax/edgeswitch-16-xg/">Ubiquiti EdgeSwitch 16XG</a></li> <li><a href="https://www.ubnt.com/edgemax/edgerouter-poe/">Ubiquiti EdgeRouter PoE</a></li> <li>Purpose: Compute Virtualization Lab (vSphere or KVM), Network Virtualization Lab (NSX-V, NSX-T, EVE-NG, VIRL, GNS3), Kubernetes backup cluster</li> </ul> </li> <li>Cloud Resources <ul> <li>Hosted in <a href="https://www.vmware.com/products/vcloud-director.html">vCloud Director</a></li> <li><a href="https://rancher.com/blog/2018/2018-05-01-rancher-ga-announcement-sheng-liang/">Rancher 2.0</a> Kubernetes cluster</li> <li><a href="https://opnsense.org">OPNsense</a> firewall <ul> <li>Also provides <a href="https://www.zerotier.com">ZeroTier</a> VPN/SD-WAN and <a href="https://haproxy.org">HAproxy</a> load balancing</li> <li>Replaces NSX edge in vCD</li> </ul> </li> <li><a href="https://www.gluster.org">Gluster</a> for persistent Kubernetes storage</li> <li>Purpose: Learn Kubernetes, deliver applications independent of on prem resources, test OPNsense as a “cloud router” and ZeroTier for hybrid cloud scenarios <ul> <li>Applications I’ll try to run: Gitlab, Netbox, Zabbix, Grafana, MariaDB/Postgres, StackStorm, other automation tools, and custom</li> </ul> </li> </ul> </li> </ul> <p>I will be publishing detailed blog posts on the setup of these components - stay tuned!</p> <h1 id="but-why">But, Why?</h1> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/08/ytho.jpg" alt="" height="35%" width="35%" /></p> <p class="center">(my daughter approves the use of this meme)</p> <p><strong>Why vCD?</strong> I have access to a vCD lab at work. 
I have to keep a small footprint, but this is much more economical than using another cloud provider. We’ve run vCD at my <a href="https://thinksis.com">day job</a> for quite a while, and I’ve become fond of it. It’s come a <em>long</em> way since we initially deployed it, and it continues to improve. <a href="https://twitter.com/search?f=tweets&amp;vertical=default&amp;q=%23LongLiveVCD">#LongLiveVCD</a></p> <p><strong>Why Rancher?</strong> This is another product that we’re using at work, so I have some motivation to learn it. It definitely is “training wheels” for Kubernetes, and I’m already getting the itch to experiment with vanilla Kubernetes or OpenShift. For now it does what I need it to, and it’s not terribly difficult to take all my YAML files and load them in another Kubernetes cluster later.</p> <p><strong>Why are you running stateful applications in Kubernetes?</strong> I understand that Kubernetes is mainly for stateless applications and microservices, but it does support stateful workloads. This is a lab, and sometimes it is fun to push the limits.</p> <p><strong>Why Gluster?</strong> Persistent storage in Kubernetes is a PITA if you’re not using one of the major cloud providers, or leveraging storage that provides a Kubernetes plugin. <a href="https://github.com/heketi/heketi">Heketi</a> provides an API interface for GlusterFS that Kubernetes can leverage. I’ll provide more information in a later blog post, but this was the easiest way to provide redundant persistent storage for my Rancher cluster.</p> <p><strong>Why OPNsense?</strong> Yes, vCD provides an NSX edge. In vCD 9.1, it is full-featured and suitable for most workloads. I’m a network nerd, so this is one of the areas where I want more flexibility than what NSX can provide. 
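As an aside on the Gluster/Heketi piece above: what makes dynamic provisioning work is that Heketi fronts the Gluster cluster with a small REST API, and Kubernetes asks it for a volume of a given size whenever a claim needs one. Here is a rough sketch of that request, with a made-up lab address for the Heketi endpoint and Heketi’s JWT authentication omitted for brevity:

```python
import json
import urllib.request

# Hypothetical lab address -- substitute your own Heketi endpoint.
HEKETI_URL = "http://heketi.lab.example:8080"

def build_volume_request(size_gb, replicas=3):
    """Payload for Heketi's volume-create endpoint: a replicated
    Gluster volume of the requested size (in GB)."""
    return {
        "size": size_gb,
        "durability": {
            "type": "replicate",
            "replicate": {"replica": replicas},
        },
    }

def create_volume(size_gb):
    """POST the request to Heketi, which carves the volume out of
    the Gluster cluster (auth omitted; real setups sign a JWT)."""
    req = urllib.request.Request(
        f"{HEKETI_URL}/volumes",
        data=json.dumps(build_volume_request(size_gb)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)
```

Once the glusterfs provisioner is wired up in the cluster, a plain PersistentVolumeClaim triggers this same kind of call automatically, which is why it ends up being the easiest path to redundant storage.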
The <a href="https://opnsense.org/about/features/">feature list</a> for OPNsense is impressive, and most importantly for me, it has built-in support for ZeroTier.</p> <p><strong>Why ZeroTier?</strong> Please see my previous post on <a href="/2018/03/vcd-terraform-example/">cloud automation</a>. Future posts will go into more detail on this as well.</p> <h1 id="show-me-the-diagram">Show me the diagram</h1> <p>IP addresses have been changed to protect the innocent.</p> <p class="center"><a href="https://networkbrouhaha.com/resources/2018/08/hybrid_lab_diagram.png" height="75%" width="75%"><img src="https://networkbrouhaha.com/resources/2018/08/hybrid_lab_diagram.png" alt="hybrid lab diagram" /></a></p> <p class="center">(Click to embiggen)</p> Tue, 21 Aug 2018 00:00:00 +0000 http://www.networkbrouhaha.com/2018/08/hybrid-home-lab-pt1/ http://www.networkbrouhaha.com/2018/08/hybrid-home-lab-pt1/ CLUS 2018 recap <p>For the first time in seven years, I had the opportunity to travel to Cisco Live 2018 in Orlando, FL. In this belated blog post, I’ve got a few thoughts, a few tips, and a bit of geeking out.</p> <p>There’s a thrill to registering for Cisco Live: scheduling sessions, RSVPing to party invites, planning to meet friends, and booking flights. The most important part, by far, is creating a reasonable schedule. CLUS is a marathon, not a sprint, and you have to be careful to not overburden yourself. I was at packed 8:00am sessions every day but Thursday, and up fairly late most nights. There is simply too much to do. 
Below is a list of sessions I attended, to get an idea of my week.</p> <ul> <li>[BRKSDN-2262] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKSDN-2262#/session/1516099602451001CqMa">Open Source for Networking: The FD.io/VPP example</a></li> <li>[DEVNET-1293] <a href="https://www.ciscolive.com/global/on-demand-library/?search=DEVNET-1293#/session/1509733975288001YGm4">Cisco UCS Automation and orchestration with Ansible</a></li> <li>[BRKDCN-2035] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKDCN-2035#/session/1509501687106001POqy">VXLAN BGP EVPN based Multi-Site</a></li> <li>[DEVNET-2644] <a href="https://www.ciscolive.com/global/on-demand-library/?search=DEVNET-2644#/session/15111940816080019wR4">Open Network Automation Platform</a> (ONAP)</li> <li>[BRKDCN-3040] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKDCN-3040#/session/1509501655684001PLO4">Troubleshooting VxLAN BGP EVPN</a></li> <li>[DEVNET-1296] <a href="https://www.ciscolive.com/global/on-demand-library/?search=DEVNET-1296#/session/1510584364275001jLdB">Building a NetDevOps CICD Pipeline with OpenSource</a></li> <li>[BRKSDN-2115] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKSDN-2115#/session/1512002243477001x6sa">Introduction to Containers and Container Networking</a></li> <li>[BRKDCN-3001] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKDCN-3001#/session/1512769713770001R5Fc">Leveraging Micro Segmentation to Build Comprehensive Data Center Security Architecture</a></li> <li>[BRKRST-3310] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKRST-3310#/session/1518011397038001CXX2">Troubleshooting OSPF</a></li> <li>[BRKCLD-3440] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKCLD-3440#/session/1511296161600001A5Dh">Multicloud Networking – Design &amp; Deployment</a></li> <li>[BRKDCN-2125] <a 
href="https://www.ciscolive.com/global/on-demand-library/?search=BRKDCN-2125#/session/1509501687216001Pex1">Overlay Management and Visibility with VXLAN</a></li> <li>[DEVNET-1365] <a href="https://www.ciscolive.com/global/on-demand-library/?search=DEVNET-1365#/session/1499457537273001QPDr">DevNet Workshop- Vagrant Up for the Network Engineer (NX-OS, IOS-XE, IOS-XR)</a></li> <li>[DEVNET-2076] <a href="https://www.ciscolive.com/global/on-demand-library/?search=DEVNET-2076#/session/1510880880567001k3i2">Continuous Integration and Testing for Networks with Ansible</a></li> <li>[BRKSEC-2010] <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKSEC-2010#/session/1509501665659001Pw8M">Talos Insights: The State of Cyber Security</a></li> <li>[KEYGEN-1003] <a href="https://www.ciscolive.com/global/on-demand-library/?search=KEYGEN-1003#/session/1520266383574001fJzL">Closing Keynote: What Science Can Tell Us About Our Future</a></li> </ul> <p>Here is the approach I took to building my schedule:</p> <ul> <li>I went through the course catalog, filtering by technology, and marked every interesting course as a favorite. All favorites are saved, so you can go back and watch recordings for sessions you missed once they’re posted.</li> <li>I noted 5-6 “must-attend” sessions, and registered for them as soon as registration opened.</li> <li>Filtering by time slot and favorite sessions, I filled up the rest of my schedule. I didn’t worry about leaving time for lunch at this stage.</li> <li>After some internal deliberation, I dropped between a quarter and a third of the courses I’d registered for. This provided time to eat, rest, socialize, and attend some of the “meatspace only” opportunities (DevNet, Walk-in Self-Paced Labs, Tweetups, etc.)</li> </ul> <p>I knew I’d made good picks when I walked into my first session and sat down behind Terry Slattery and Wendell Odom. 
My favorite session was <a href="https://www.ciscolive.com/global/on-demand-library/?search=BRKRST-3310#/session/1518011397038001CXX2">Troubleshooting OSPF</a>, by Nick Russo. The room was packed, and Nick put on a master class. If you missed it, do yourself a favor and watch it now. You don’t need to be an OSPF guru to keep up, but I’m willing to bet that even the most seasoned CCIE R&amp;S will gain something from this session. Overall the session content across the board was top notch, with only a couple sessions that I found mildly disappointing at worst.</p> <p>Almost every session recording is <a href="https://www.ciscolive.com/global/on-demand-library/">posted online</a>, so there is no reason to have Cisco Live session FOMO. Most of us go to CLUS to learn the latest and greatest in our chosen technology stacks, but I find far greater value in the human connections I formed. I’m an extrovert, so being surrounded by a throng of people gives me energy. As I walked down the halls I would look around and think to myself, “Yes, these are my people!”</p> <p>I made a concerted effort to connect with as many online friends and personal inspirations as I could. 
Here’s an incomplete list of folks I was either able to meet or learn from: <a href="https://rule11.tech">Russ White</a>, <a href="https://twitter.com/bcjordo">Jordan Martin</a>, <a href="https://twitter.com/SharpNetwork">Eyvonne Sharp</a>, <a href="https://www.netcraftsmen.com/team/terrance-slattery/">Terry Slattery</a> (plus many other NetCraftsmen I sat in sessions with), <a href="https://twitter.com/Wendellodom">Wendell Odom</a>, <a href="https://twitter.com/ScottMorrisCCIE">Scott Morris</a>, <a href="https://twitter.com/CCIE21921">Lukas Krattiger</a>, <a href="https://twitter.com/hfpreston">Hank Preston</a>, <a href="https://twitter.com/jedelman8">Jason Edelman</a>, <a href="https://twitter.com/nickrusso42518">Nick Russo</a>, <a href="https://twitter.com/danieldibswe">Daniel Dibb</a>, <a href="https://twitter.com/dmfigol">Dmitry Figol</a>, <a href="https://twitter.com/kmcnam1">Katherine McNamara</a>, <a href="https://www.networkingwithfish.com">Denise Fishburne</a>, <a href="https://www.linkedin.com/in/humphreycheung/">Humphrey Cheung</a>, <a href="https://twitter.com/theLANtamer">Quentin Demmon</a> and <a href="https://twitter.com/showipintbri">Tony Efantis</a>, not to mention all the fine folks I met from <a href="https://www.meetup.com/routergods/">RouterGods</a>. This is a prolific group of networkers. If you want to improve yourself, what better way is there than learning from people like this? I’m also a believer in spreading gratitude, so I made sure to personally thank folks that had helped me grow technically and professionally. Every single person I thanked seemed genuinely appreciative to hear it. There’s never any harm in spreading the love!</p> <p>My only regret is that I did not hunt down <a href="https://twitter.com/Drew_CM">Drew Conry-Murray</a>, as I am an avid <a href="https://packetpushers.net">Packet Pushers</a> listener and I love <a href="https://packetpushers.net/series/network-break-podcast-post/">The Network Break</a>. 
Hopefully I can remedy this next year!</p> <p>I have to give special attention to the <a href="https://developer.cisco.com">DevNet Zone</a>, and the folks that put it all together. This area was filled with some of the best content of the conference. Network Automation, Programming, APIs and the future of Networking in general were on full display. There were hands-on labs and experts willing to whiteboard anything you wanted to discuss. Watching Wendell Odom geek out like the rest of us as Hank Preston presented on NetDevOps was a particularly cool moment.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/07/networkcicd.png" alt="" height="50%" width="50%" /></p> <p>You’ll notice from the list of sessions above that I only attended one keynote. There were DevNet sessions that I wanted to attend instead, and the keynotes are posted online, so it wasn’t a tough decision. The closing keynote, featuring <a href="https://amywebb.io">Amy Webb</a> and <a href="http://mkaku.org">Dr. Michio Kaku</a>, is a different story. By Thursday I was running on fumes, so I took the day easy. About an hour before the closing keynote, I made my way towards the entrance and saw a huge line had already formed. I had no interest in standing for an hour, so I found an empty seat nearby and waited for the doors to open. For some reason they didn’t open the doors where folks had queued - they opened the doors <em>directly</em> behind the seat I was sitting in. I was surprised and felt bad for the people that had been waiting in line, but I’m no dummy. I grabbed my stuff, walked in, and got seated in the front row almost directly in front of the stage. Talk about good luck! To top it off, as I was sitting there, one of my tweets was flashed up on the uber-displays. It was an amazing and surreal way to end CLUS. 
Both Amy and Michio gave great keynotes to wrap up CLUS.</p> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/07/clustweet.png" alt="" height="75%" width="75%" /></p> <h2 id="closing-thoughts-and-tips">Closing thoughts and tips</h2> <p>I had a great time at Cisco Live 2018. It was so fulfilling to meet and hang out with everyone, learn new things, explore the DevNet Zone/World of Solutions, and attend several great parties. I will admit to feeling somewhat overwhelmed the whole time I was there. There is something bright and shiny to grab your attention at every turn. Keeping up with Twitter is a job in itself, and the Cisco Social Media team really deserves kudos for the great job they do during CLUS. However, I cannot disagree with anything Tom Hollingsworth wrote in his <a href="https://networkingnerd.net/2018/06/22/finding-value-in-cisco-live-2018/">Cisco Live Recap</a>. CLUS is a great event, but there will always be ways to improve and provide better value. In the end, like most things, you will get out of it what you put into it.</p> <p>Here are a few random tips to wrap up this post:</p> <ul> <li>Take breaks - you will need time to decompress.</li> <li>Stay hydrated.</li> <li>Come prepared to learn a lot, and keep a notebook handy. You may find yourself wanting to take notes when least expected.</li> <li>Put yourself out there. Go out of your way to introduce yourself to peers in sessions, during meals, and at parties. Bring business cards.</li> <li>If you’re social, hit the Tweetups. This is a great place to meet people and network.</li> <li>Go easy at the parties. 
You’ll do yourself no favors by trying to make it through the next day hungover.</li> <li>HAVE FUN.</li> </ul> <p class="center"><img src="https://networkbrouhaha.com/resources/2018/02/drink_route_tr.png" alt="" height="25%" width="25%" /></p> Mon, 23 Jul 2018 00:00:00 +0000 http://www.networkbrouhaha.com/2018/07/CLUS-recap/ http://www.networkbrouhaha.com/2018/07/CLUS-recap/