Automate your Cloud - CloudFormation Templates

I recently had to build a CloudFormation template that provisions a complete stack: a VPC and subnets, an S3 bucket, a Data Pipeline, an EMR cluster and a Redshift cluster. When I first tried this, CloudFormation did not support EMR directly, so I had to use a workaround; AWS has since started supporting EMR resources directly. I enjoyed building this stack with a CloudFormation template, so I thought I'd use what I learned to write yet another post introducing CloudFormation and share some of the templates with the world. So here goes:

CloudFormation is a wonderful automation service from AWS that lets developers describe the service or application they want as a template. It then creates “stacks” from those templates, which makes provisioning quick and reliable. CloudFormation has two parts:

Template: a JSON text file that declares the AWS resources the application requires.

Stack: a running instance of a template, created when the template is submitted to CloudFormation, which then creates all the specified resources. When a stack is deleted, all of its resources are, by default, deleted automatically as well.

Here is a sample structure of a CloudFormation template (JSON file):

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "A simple description of what this template does",
  "Parameters": {
    //variables that can be passed to the template during deployment.
  },
  "Mappings": {
    //keys which match to a corresponding set of named values
  },
  "Resources": {
    //aws resources that will be deployed by the stack
  },
  "Outputs": {
    //values that can be returned back after successful completion of the stack
  }
}
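
The `//` comments in the skeleton above are only for illustration: strict JSON has no comment syntax, so a real template must leave them out. As a minimal sketch, assuming Python 3 and only the standard library, a template can be checked locally before it is uploaded. The function name `check_template` and both sample strings are hypothetical; the check only confirms that the text parses and that the one required section, Resources, is present.

```python
import json

def check_template(text):
    """Parse template text and return a list of problems (empty if none found)."""
    try:
        template = json.loads(text)
    except json.JSONDecodeError as err:
        return ["not valid JSON: line %d, column %d: %s"
                % (err.lineno, err.colno, err.msg)]
    if not isinstance(template, dict):
        return ["top level must be a JSON object"]
    problems = []
    # Resources is the only section a template cannot do without.
    if not template.get("Resources"):
        problems.append("missing or empty Resources section")
    return problems

# A tiny, well-formed template (hypothetical example).
good = """{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "VPC": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}}
  }
}"""

# A missing comma between two sections, an easy mistake when editing by hand.
bad = '{"Mappings": {} "Resources": {}}'

good_problems = check_template(good)
bad_problems = check_template(bad)
print(good_problems)
print(bad_problems)
```

This catches syntax slips only; it says nothing about whether the resource types or properties are ones CloudFormation accepts.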

With CloudFormation, many AWS resources can be automated: from networking (VPCs, subnets, route tables, internet gateways etc.) to compute (EC2 instances, Lambda functions etc.) to databases (RDS, Redshift) to storage (S3).

In this blog post, we are going to walk through a collection of sample CloudFormation templates.

Template for AWS Network Infrastructure

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "AWS CF Template to create VPC, Subnet, Security Group, Route Table and Internet Gateway",
  "Parameters": {
    "AvailabilityZone": {
      "Description": "select the Availability Zone for your Deployment",
      "Type": "AWS::EC2::AvailabilityZone::Name"
    }
  },
  "Mappings": {
    "SubnetConfig": {
      "VPC": {
        "CIDR": "10.0.0.0/16"
      },
      "Subnet": {
        "CIDR": "10.0.0.0/24"
      }
    }
  },
  "Resources": {
    "VPC": {
      "Type": "AWS::EC2::VPC",
      "Properties": {
        "CidrBlock": {
          "Fn::FindInMap": [
            "SubnetConfig",
            "VPC",
            "CIDR"
          ]
        },
        "Tags": [
          {
            "Key": "Name",
            "Value": "vpc01"
          }
        ]
      }
    },
    "InternetGateway": {
      "Type": "AWS::EC2::InternetGateway"
    },
    "InternetGatewayAttachement": {
      "Type": "AWS::EC2::VPCGatewayAttachment",
      "Properties": {
        "InternetGatewayId": {
          "Ref": "InternetGateway"
        },
        "VpcId": {
          "Ref": "VPC"
        }
      }
    },
    "EC2InstanceSubnet": {
      "Type": "AWS::EC2::Subnet",
      "Properties": {
        "CidrBlock": {
          "Fn::FindInMap": [
            "SubnetConfig",
            "Subnet",
            "CIDR"
          ]
        },
        "AvailabilityZone": {"Ref": "AvailabilityZone"},
        "VpcId": {"Ref": "VPC"},
        "Tags": [
          {
            "Key": "Name",
            "Value": "subnet01"
          }
        ]
      }
    },
    "PublicSubnetsRouteTable": {
      "Type": "AWS::EC2::RouteTable",
      "Properties": {
        "VpcId": {"Ref": "VPC"}
      }
    },
    "InternetRoute": {
      "Type": "AWS::EC2::Route",
      "DependsOn": "InternetGatewayAttachement",
      "Properties": {
        "RouteTableId": {"Ref": "PublicSubnetsRouteTable"},
        "DestinationCidrBlock": "0.0.0.0/0",
        "GatewayId": {"Ref": "InternetGateway"}
      }
    },
    "AssociateRouteTableWithPublicSubnet": {
      "Type": "AWS::EC2::SubnetRouteTableAssociation",
      "Properties": {
        "RouteTableId": {"Ref": "PublicSubnetsRouteTable"},
        "SubnetId": {"Ref": "EC2InstanceSubnet"}
      }
    },
    "SecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "GroupDescription": "EC2 security group",
        "VpcId": {"Ref": "VPC"},
        "SecurityGroupIngress": [
          {
            "IpProtocol": "tcp",
            "FromPort": "80",
            "ToPort": "80",
            "CidrIp": "0.0.0.0/0"
          }
        ],
        "Tags": [
          {
            "Key": "Name",
            "Value": "sg01"
          }
        ]
      }
    }
  },
  "Outputs": {
    "VpcId": 
    {
      "Value": {"Ref": "VPC"},
      "Description": "The id of the created vpc"
    },
    "SubnetId":
    {
      "Value": {"Ref": "EC2InstanceSubnet"},
      "Description": "The id of the created subnet"
    },
    "SecurityGroupId":
    {
      "Value": {"Ref": "SecurityGroup"},
      "Description": "The id of the created security group"
    }
  }
}

After creating this JSON template, you need to upload it in the CloudFormation console.

The template creates the networking components specified. Once the stack completes successfully, its status changes to CREATE_COMPLETE and the outputs specified in the template can be seen in the Outputs tab.
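
A mistyped name inside a `Ref` only surfaces once CloudFormation validates the uploaded template, so it can help to look for such slips locally first. The following is a rough sketch, assuming Python 3 and a template already loaded into a dictionary; `find_refs` and `unresolved_refs` are hypothetical helper names, and the sample is a cut-down version of the network template with a typo added on purpose.

```python
def find_refs(node):
    """Yield every name used as {"Ref": name} anywhere in the structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                yield value
            else:
                yield from find_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)

def unresolved_refs(template):
    """Return Ref targets that are not a parameter, a resource or an AWS:: pseudo parameter."""
    declared = set(template.get("Parameters", {})) | set(template.get("Resources", {}))
    return sorted({name for name in find_refs(template)
                   if name not in declared and not name.startswith("AWS::")})

template = {
    "Parameters": {
        "AvailabilityZone": {"Type": "AWS::EC2::AvailabilityZone::Name"}
    },
    "Resources": {
        "VPC": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": "10.0.0.0/16"}},
        "EC2InstanceSubnet": {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "CidrBlock": "10.0.0.0/24",
                "AvailabilityZone": {"Ref": "AvailabilityZone"},
                # Deliberate typo: the resource is declared as "VPC".
                "VpcId": {"Ref": "Vpc"},
            },
        },
    },
}

missing = unresolved_refs(template)
print(missing)
```

The sketch looks only at `Ref`; names used in `DependsOn` or `Fn::GetAtt` would need the same kind of check.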

Template for EC2 Instance and a LoadBalancer

This sample template creates an EC2 instance and a load balancer, using the subnet ID and security group ID provided as parameters:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "AWS CloudFormation template to launch an EC2 Instance and a LoadBalancer",
  "Parameters": {
  "AvailabilityZone": {
      "Description": "select the Availability Zone to launch the Instance",
      "Type": "AWS::EC2::AvailabilityZone::Name"
    },
  "EC2InstanceType": {
      "Description": "Type of EC2 instance to launch.",
      "Type": "String",
      "Default": "t2.micro",
      "AllowedValues" : [
        "cc2.8xlarge",
        "c3.8xlarge",
        "c3.4xlarge",
        "c3.2xlarge",
        "c3.xlarge",
        "c3.large",
        "c4.8xlarge",
        "c4.4xlarge",
        "c4.2xlarge",
        "c4.xlarge",
        "c4.large",
        "r3.8xlarge",
        "r3.4xlarge",
        "r3.2xlarge",
        "r3.xlarge",
        "r3.large",
        "i2.8xlarge",
        "i2.4xlarge",
        "i2.2xlarge",
        "i2.xlarge",
        "cr1.8xlarge",
        "cg1.4xlarge",
        "m3.medium",
        "m3.large",
        "m3.xlarge",
        "m3.2xlarge",
        "hi1.4xlarge",
        "g2.2xlarge",
        "t2.micro",
        "t2.small",
        "t2.medium",
        "t2.large",
        "t2.nano",
        "d2.8xlarge",
        "d2.4xlarge",
        "d2.2xlarge",
        "d2.xlarge",
        "m4.large",
        "m4.xlarge",
        "m4.2xlarge",
        "m4.4xlarge",
        "m4.10xlarge"
      ],
      "ConstraintDescription": "must be a valid EC2 instance type."
    },
  "KeyPair" : {
    "Description" : "Amazon EC2 Key Pair",
    "Type" : "AWS::EC2::KeyPair::KeyName"
  },
  "EC2InstanceAMI": {
      "Description": "Type the Instance AMI ID",
      "Default": "ami-8fcee4e5",
      "Type": "String"
    },
  "SubnetID": {
      "Description": "Choose the SubnetID to attach with the EC2 Instance",
      "Type": "AWS::EC2::Subnet::Id"
    },
  "SecurityGroupID": {
      "Description": "Select the Security Group for EC2 Instance",
      "Type": "AWS::EC2::SecurityGroup::Id"
    }
},
"Resources":{
  "Ec2Instance" : {
      "Type" : "AWS::EC2::Instance",
      "Properties" : {
        "AvailabilityZone": {"Ref": "AvailabilityZone"},
        "InstanceType": {"Ref": "EC2InstanceType"},
        "SubnetId": {"Ref": "SubnetID"},
        "SecurityGroupIds" : [ { "Ref" : "SecurityGroupID" }],
        "ImageId": {"Ref": "EC2InstanceAMI"},
        "KeyName": {"Ref": "KeyPair"},
        "Tags": [
          {
            "Key": "Name",
            "Value": "Instance-01"
          }
        ]
      }
    },
  "ElasticLoadBalancer" : {
      "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties" : {
        "Instances" : [ { "Ref" : "Ec2Instance" } ],
        "Subnets": [{"Ref": "SubnetID"}],
        "CrossZone" : true,
        "SecurityGroups" : [ { "Ref" : "SecurityGroupID" }],
        "LoadBalancerName" : "ELB-01",
        "Tags" :[
          {
            "Key": "Name",
            "Value": "ELB-01"
          }
        ],
        "Listeners": [{
          "LoadBalancerPort": "80",
          "InstancePort": "80",
          "Protocol": "HTTP"
        }],
        "HealthCheck": {
          "Target": "HTTP:80/",
          "HealthyThreshold": "3",
          "UnhealthyThreshold": "5",
          "Interval": "30",
          "Timeout": "5"
        },
        "ConnectionDrainingPolicy": {
          "Enabled" : "true",
          "Timeout" : "60"
        }
      }
    }
  },
  "Outputs": {
    "EC2InstanceID":{
      "Value" : {"Ref": "Ec2Instance"},
      "Description" : "The Instance ID of the created Ec2 Instance"
    },
    "ELBDNS":{
      "Value" : { "Fn::GetAtt" : ["ElasticLoadBalancer", "DNSName"] },
      "Description" : "The Public DNS of the created ELB"
    }
  }
}

After the template is uploaded, the next screen asks for the parameters.
Once the stack reaches the CREATE_COMPLETE status, the EC2 instance and the Elastic Load Balancer are launched. The EC2 instance ID and the DNS name of the Elastic Load Balancer are shown in the Outputs tab.
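
The console form is driven by the template's Parameters section: defaults are pre-filled and AllowedValues become the permitted choices. As an illustration only, assuming Python 3, the same rules can be applied outside the console to build the list of `ParameterKey`/`ParameterValue` pairs that CloudFormation expects when a stack is created through the API or the AWS CLI. `build_parameters`, the shortened AllowedValues list and the key pair name `my-key` are hypothetical.

```python
import json

def build_parameters(template, values):
    """Merge supplied values with template defaults and enforce AllowedValues."""
    result = []
    for name, spec in template.get("Parameters", {}).items():
        if name in values:
            value = str(values[name])
        elif "Default" in spec:
            value = str(spec["Default"])
        else:
            raise ValueError("no value or default for parameter " + name)
        allowed = spec.get("AllowedValues")
        if allowed is not None and value not in allowed:
            raise ValueError("%s=%r is not one of the allowed values" % (name, value))
        result.append({"ParameterKey": name, "ParameterValue": value})
    return result

# Cut-down Parameters section modelled on the EC2 template above.
template = {
    "Parameters": {
        "EC2InstanceType": {
            "Type": "String",
            "Default": "t2.micro",
            "AllowedValues": ["t2.micro", "t2.small", "t2.medium"],
        },
        "KeyPair": {"Type": "AWS::EC2::KeyPair::KeyName"},
    }
}

params = build_parameters(template, {"KeyPair": "my-key"})
print(json.dumps(params, indent=2))
```

Here `EC2InstanceType` falls back to its default, while `KeyPair` has no default and must be supplied.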

Template for RedShift Cluster

This sample template creates a Redshift cluster.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "AWS CF Template to create a Redshift Cluster",
  "Parameters":{
    "DatabaseName" : {
      "Description" : "The name of the first database to be created when the cluster is created",
      "Type" : "String",
      "Default" : "dev",
      "AllowedPattern" : "([a-z]|[0-9])+"
    },
    "SubnetID" : {
      "Description" : "Subnet IDs",
      "Type" : "AWS::EC2::Subnet::Id"
    },
    "SecurityGroupID" : {
      "Description" : "Security Group IDs",
      "Type" : "AWS::EC2::SecurityGroup::Id"
    },
    "ClusterType" : {
      "Description" : "The type of cluster",
      "Type" : "String",
      "Default" : "single-node",
      "AllowedValues" : [ "single-node", "multi-node" ]
    },
    "NumberOfNodes" : {
      "Description" : "The number of compute nodes in the cluster. For multi-node clusters, the NumberOfNodes parameter must be greater than 1",
      "Type" : "Number",
      "Default" : "1"
    },
    "NodeType" : {
      "Description" : "The type of node to be provisioned",
      "Type" : "String",
      "Default" : "dc1.large",
      "AllowedValues" : ["dc1.large", "dw1.xlarge", "dw1.8xlarge", 
                         "dw2.large", "dw2.8xlarge" ]
    },
    "MasterUsername" : {
      "Description" : "The user name that is associated with the master user account for the cluster that is being created",
      "Type" : "String",
      "Default" : "defaultuser",
      "AllowedPattern" : "([a-z])([a-z]|[0-9])*"
    },
    "MasterUserPassword" :  {
      "Description" : "The password that is associated with the master user account for the cluster that is being created.",
      "Type" : "String",
      "NoEcho" : "true"
    }
  },
  "Mappings": {},
  "Conditions":{
      "IsMultiNodeCluster" : {
      "Fn::Equals" : [{ "Ref" : "ClusterType" }, "multi-node" ]        
    }
  },
  "Resources":{
    "RedshiftCluster" : {
      "Type" : "AWS::Redshift::Cluster",
      "Properties" : {
        "ClusterType" : { "Ref" : "ClusterType" },
        "NumberOfNodes" : { "Fn::If" : [ "IsMultiNodeCluster",  
                          {"Ref" : "NumberOfNodes"}, {"Ref" : "AWS::NoValue"}]},
        "NodeType" : { "Ref" : "NodeType" },
        "DBName" : { "Ref" : "DatabaseName" },
        "MasterUsername" : { "Ref" : "MasterUsername" },
        "MasterUserPassword" : { "Ref" : "MasterUserPassword" },  
        "ClusterParameterGroupName" : { "Ref": "RedshiftClusterParameterGroup" },
        "VpcSecurityGroupIds" :  [{ "Ref" : "SecurityGroupID" }] ,
        "ClusterSubnetGroupName" : { "Ref" : "RedshiftClusterSubnetGroup" }
      }
    },
    "RedshiftClusterParameterGroup" : {
      "Type" : "AWS::Redshift::ClusterParameterGroup",
      "Properties" : {
        "Description" : "Cluster parameter group",
        "ParameterGroupFamily" : "redshift-1.0",
        "Parameters" : [{
          "ParameterName" : "enable_user_activity_logging",
          "ParameterValue" : "true"
          }
        ]
      }
    },
    "RedshiftClusterSubnetGroup" : {
      "Type" : "AWS::Redshift::ClusterSubnetGroup",
      "Properties" : {
        "Description" : "Cluster subnet group",
        "SubnetIds" : [ { "Ref" : "SubnetID" } ]
      }
    }
  },
  "Outputs" : {
    "ClusterEndpoint" : {
      "Description" : "Cluster endpoint",
      "Value" : { "Fn::Join" : [ ":", [ 
                { "Fn::GetAtt" : [ "RedshiftCluster", "Endpoint.Address" ] }, 
                { "Fn::GetAtt" : ["RedshiftCluster", "Endpoint.Port" ] } ] ] }
    },
    "ClusterName" : {
      "Description" : "Name of the Cluster",
      "Value" : { "Ref" : "RedshiftCluster" }
    },
    "ParameterGroupName" : {
      "Description" : "Name of the Parameter Group",
      "Value" : { "Ref" : "RedshiftClusterParameterGroup" }
    },
    "RedshiftClusterSubnetGroupName" : {
      "Description" : "Name of the Cluster Subnet Group",
      "Value" : { "Ref" : "RedshiftClusterSubnetGroup" }
    }
  }
}

After uploading the template, the next screen asks for the parameters.
Once the values are passed to the template, the Redshift cluster is created and the outputs specified are shown in the Outputs tab of CloudFormation.
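
The Conditions section is what lets one template serve both cluster types: `IsMultiNodeCluster` compares the `ClusterType` parameter with "multi-node", and `Fn::If` then either passes `NumberOfNodes` through or substitutes `AWS::NoValue`, which removes the property altogether. The toy evaluator below, assuming Python 3, mimics that behaviour for just `Ref`, `Fn::Equals` and `Fn::If`. It is a hypothetical sketch to show the logic, not how CloudFormation is implemented.

```python
NO_VALUE = object()  # stands in for the AWS::NoValue pseudo parameter

def resolve(node, params, conditions):
    """Resolve Ref, Fn::Equals and Fn::If in a small subset of template syntax."""
    if isinstance(node, dict) and len(node) == 1:
        (key, arg), = node.items()
        if key == "Ref":
            return NO_VALUE if arg == "AWS::NoValue" else params[arg]
        if key == "Fn::Equals":
            left, right = (resolve(a, params, conditions) for a in arg)
            return left == right
        if key == "Fn::If":
            name, if_true, if_false = arg
            chosen = if_true if conditions[name] else if_false
            return resolve(chosen, params, conditions)
    if isinstance(node, dict):
        resolved = {k: resolve(v, params, conditions) for k, v in node.items()}
        # A property that resolves to AWS::NoValue is dropped entirely.
        return {k: v for k, v in resolved.items() if v is not NO_VALUE}
    if isinstance(node, list):
        return [resolve(v, params, conditions) for v in node]
    return node

condition_defs = {
    "IsMultiNodeCluster": {"Fn::Equals": [{"Ref": "ClusterType"}, "multi-node"]}
}
properties = {
    "ClusterType": {"Ref": "ClusterType"},
    "NumberOfNodes": {"Fn::If": ["IsMultiNodeCluster",
                                 {"Ref": "NumberOfNodes"},
                                 {"Ref": "AWS::NoValue"}]},
}

def render(params):
    """Evaluate the conditions first, then the properties that depend on them."""
    conditions = {name: resolve(expr, params, {})
                  for name, expr in condition_defs.items()}
    return resolve(properties, params, conditions)

single = render({"ClusterType": "single-node", "NumberOfNodes": 1})
multi = render({"ClusterType": "multi-node", "NumberOfNodes": 3})
print(single)
print(multi)
```

For a single-node cluster the result has no `NumberOfNodes` key at all, which is why the default of 1 for that parameter does no harm.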

Template for an EMR cluster

Since CloudFormation did not support EMR until recently, we had to use a workaround to provision it: a Data Pipeline, which can provision an EMR cluster. In the template below, the Data Pipeline not only creates an EMR cluster but also runs an EMR activity on that cluster.

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "AWS CF Template to create EMR Cluster using DataPipeline",
  "Parameters": {
    "AvailabilityZone": {
      "Description": "select the Availability Zone to deploy your EMR Cluster",
      "Type": "AWS::EC2::AvailabilityZone::Name"
    },
    "MasterInstanceType": {
      "Description": "Type of EC2 instance to launch as Master Node.",
      "Type": "String",
      "Default": "m1.medium",
      "AllowedValues": [
        "cc2.8xlarge",
        "c3.8xlarge",
        "c3.4xlarge",
        "c3.2xlarge",
        "c3.xlarge",
        "c3.large",
        "c4.8xlarge",
        "c4.4xlarge",
        "c4.2xlarge",
        "c4.xlarge",
        "c4.large",
        "r3.8xlarge",
        "r3.4xlarge",
        "r3.2xlarge",
        "r3.xlarge",
        "r3.large",
        "i2.8xlarge",
        "i2.4xlarge",
        "i2.2xlarge",
        "i2.xlarge",
        "cr1.8xlarge",
        "cg1.4xlarge",
        "m1.medium",
        "m1.large",
        "m1.xlarge",
        "m3.medium",
        "m3.large",
        "m3.xlarge",
        "m3.2xlarge",
        "hi1.4xlarge",
        "g2.2xlarge",
        "d2.8xlarge",
        "d2.4xlarge",
        "d2.2xlarge",
        "d2.xlarge",
        "m4.large",
        "m4.xlarge",
        "m4.2xlarge",
        "m4.4xlarge",
        "m4.10xlarge"
      ],
      "ConstraintDescription": "must be a valid EC2 instance type."
    },
    "SlaveInstanceType": {
      "Description": "Type of EC2 instance to launch as Slave Node.",
      "Type": "String",
      "Default": "m1.medium",
      "AllowedValues": [
        "cc2.8xlarge",
        "c3.8xlarge",
        "c3.4xlarge",
        "c3.2xlarge",
        "c3.xlarge",
        "c3.large",
        "c4.8xlarge",
        "c4.4xlarge",
        "c4.2xlarge",
        "c4.xlarge",
        "c4.large",
        "r3.8xlarge",
        "r3.4xlarge",
        "r3.2xlarge",
        "r3.xlarge",
        "r3.large",
        "i2.8xlarge",
        "i2.4xlarge",
        "i2.2xlarge",
        "i2.xlarge",
        "cr1.8xlarge",
        "cg1.4xlarge",
        "m1.medium",
        "m1.large",
        "m1.xlarge",
        "m3.medium",
        "m3.large",
        "m3.xlarge",
        "m3.2xlarge",
        "hi1.4xlarge",
        "g2.2xlarge",
        "d2.8xlarge",
        "d2.4xlarge",
        "d2.2xlarge",
        "d2.xlarge",
        "m4.large",
        "m4.xlarge",
        "m4.2xlarge",
        "m4.4xlarge",
        "m4.10xlarge"
      ],
      "ConstraintDescription": "must be a valid EC2 instance type."
    },
    "CoreInstanceCount": {
      "Description": "Number of Core Instances",
      "Type": "String",
      "AllowedValues": [
        "1",
        "2",
        "3",
        "4",
        "5",
        "6",
        "7",
        "8",
        "9",
        "10"
      ],
      "Default": "2"
    },
    "KeyPair": {
      "Description": "Amazon EC2 Key Pair",
      "Type": "AWS::EC2::KeyPair::KeyName"
    },
    "BucketName": {
      "Description": "Bucket name to store the Output Data",
      "Type": "String"
    },
    "SubnetID": {
      "Description": "Subnet IDs",
      "Type": "AWS::EC2::Subnet::Id"
    },
    "MasterSecurityGroupID": {
      "Description": "Select the Security Group for Master Instance",
      "Type": "AWS::EC2::SecurityGroup::Id"
    },
     "SlaveSecurityGroupID": {
      "Description": "Select the Security Group for Slave Instance",
      "Type": "AWS::EC2::SecurityGroup::Id"
    },
    "SNSTopic": {
      "Description": "The ARN of the SNS Topic",
      "Type": "String"
    }
  },
  "Mappings": {},
  "Conditions": {},
  "Resources": {
    "Datapipeline": {
      "Type": "AWS::DataPipeline::Pipeline",
      "Properties": {
        "Name": "DP01",
        "Activate": "true",
        "ParameterObjects": [],
        "ParameterValues": [],
        "PipelineObjects": [
          {
            "Id": "DefaultSchedule",
            "Name": "RunOnce",
            "Fields": [
              {
                "Key": "occurrences",
                "StringValue": "1"
              },
              {
                "Key": "startAt",
                "StringValue": "FIRST_ACTIVATION_DATE_TIME"
              },
              {
                "Key": "type",
                "StringValue": "Schedule"
              },
              {
                "Key": "period",
                "StringValue": "1 Day"
              }
            ]
          },
          {
            "Id": "Default",
            "Name": "Default",
            "Fields": [
              {
                "Key": "type",
                "StringValue": "Default"
              },
              {
                "Key": "scheduleType",
                "StringValue": "cron"
              },
              {
                "Key": "failureAndRerunMode",
                "StringValue": "CASCADE"
              },
              {
                "Key": "pipelineLogUri",
                "StringValue": {
                  "Fn::Join": [
                    "",
                    [
                      "s3://",
                      {
                        "Ref": "BucketName"
                      },
                      "/logs"
                    ]
                  ]
                }
              },
              {
                "Key": "role",
                "StringValue": "DataPipelineDefaultRole"
              },
              {
                "Key": "resourceRole",
                "StringValue": "DataPipelineDefaultResourceRole"
              },
              {
                "Key": "schedule",
                "RefValue": "DefaultSchedule"
              }
            ]
          },
          {
            "Id": "ActivityId_01",
            "Name": "emr-activity01",
            "Fields": [
              {
                "Key": "schedule",
                "RefValue": "DefaultSchedule"
              },
              {
                "Key": "runsOn",
                "RefValue": "EmrClusterId_01"
              },
              {
                "Key": "type",
                "StringValue": "EmrActivity"
              },
              {
                "Key": "step",
                "StringValue": {
                  "Fn::Join": [
                    ",",
                    [
                      "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
                      "-input",
                      "s3://elasticmapreduce/samples/wordcount/input/0001",
                      "-output",
                      {"Fn::Join":["",["s3://",{"Ref":"BucketName"},"/wordcount","/output"]]},
                      "-mapper",
                      "s3n://elasticmapreduce/samples/wordcount/wordSplitter.py",
                      "-reducer",
                      "aggregate"
                    ]
                  ]
                }
              },
              {
                "Key": "onSuccess",
                "RefValue": "mySuccessAction"
              },
              {
                "Key": "onFail",
                "RefValue": "myFailureAction"
              }
            ]
          },
          {
            "Id": "mySuccessAction",
            "Name": "SuccessNotify",
            "Fields": [
              {
                "Key": "type",
                "StringValue": "SnsAlarm"
              },
              {
                "Key": "topicArn",
                "StringValue": {"Ref": "SNSTopic"}
              },
              {
                "Key": "subject",
                "StringValue": "Success"
              },
              {
                "Key": "message",
                "StringValue": 
                "Success:The EMR activity has been successfully completed."
              }
            ]
          },
          {
            "Id": "myFailureAction",
            "Name": "FailureNotify",
            "Fields": [
              {
                "Key": "type",
                "StringValue": "SnsAlarm"
              },
              {
                "Key": "topicArn",
                "StringValue": {"Ref": "SNSTopic"}
              },
              {
                "Key": "subject",
                "StringValue": "Failure"
              },
              {
                "Key": "message",
                "StringValue": "Error: The EMR activity has failed"
              }
            ]
          },
          {
            "Id": "EmrClusterId_01",
            "Name": "emr-cluster01",
            "Fields": [
              {
                "Key": "schedule",
                "RefValue": "DefaultSchedule"
              },
              {
                "Key": "amiVersion",
                "StringValue": "2.4.8"
              },
              {
                "Key": "keyPair",
                "StringValue": {"Ref": "KeyPair"}
              },
              {
                "Key": "masterInstanceType",
                "StringValue": {"Ref": "MasterInstanceType"}
              },
              {
                "Key": "subnetId",
                "StringValue": {"Ref": "SubnetID"}
              },
              {
                "Key": "emrManagedMasterSecurityGroupId",
                "StringValue": {"Ref": "MasterSecurityGroupID"}
              },
              {
                "Key": "emrManagedSlaveSecurityGroupId",
                "StringValue": {"Ref": "SlaveSecurityGroupID"}
              },
              {
                "Key": "coreInstanceType",
                "StringValue": {"Ref": "SlaveInstanceType"}
              },
              {
                "Key": "coreInstanceCount",
                "StringValue": {"Ref": "CoreInstanceCount"}
              },
              {
                "Key": "type",
                "StringValue": "EmrCluster"
              },
              {
                "Key": "resourceRole",
                "StringValue": "DataPipelineDefaultResourceRole"
              },
              {
                "Key": "role",
                "StringValue": "DataPipelineDefaultRole"
              }
            ]
          }
        ]
      }
    }
  },
  "Outputs": {
    "DataPipelineId": {
      "Value": {
        "Ref": "Datapipeline"
      },
      "Description": "The id of the created datapipeline"
    }
  }
}

So, with this template, the EMR cluster will be created.

The cluster runs the “word count” job, which takes sample input data and stores the processed output in the S3 bucket you specified in the parameters when creating the stack.

Also, the success or failure of the EMR activity (i.e. word count) is reported through the SNS topic to the email address given when the topic was created.
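
The `step` field in the pipeline is a single comma-separated string, which the template assembles with two nested `Fn::Join` calls: an inner one with an empty delimiter for the output path and an outer one with "," for the whole step. To make the resulting value concrete, here is a small sketch, assuming Python 3, that resolves the same structure by hand. The `join` helper and the bucket name `my-output-bucket` are hypothetical.

```python
def join(node, params):
    """Resolve nested Fn::Join and Ref nodes to a plain string."""
    if isinstance(node, str):
        return node
    if "Ref" in node:
        return params[node["Ref"]]
    delimiter, parts = node["Fn::Join"]
    return delimiter.join(join(part, params) for part in parts)

step = {"Fn::Join": [",", [
    "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
    "-input", "s3://elasticmapreduce/samples/wordcount/input/0001",
    "-output", {"Fn::Join": ["", ["s3://", {"Ref": "BucketName"},
                                  "/wordcount", "/output"]]},
    "-mapper", "s3n://elasticmapreduce/samples/wordcount/wordSplitter.py",
    "-reducer", "aggregate",
]]}

resolved_step = join(step, {"BucketName": "my-output-bucket"})
print(resolved_step)
```

With that bucket name, the `-output` argument becomes `s3://my-output-bucket/wordcount/output`, so any stray whitespace inside the joined pieces would end up in the S3 path.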

Template to copy the data from S3 Bucket to the Redshift Cluster

In this template, we copy the processed data that the previous template stored in the S3 bucket into the Redshift cluster. Here is a sample CloudFormation template:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "AWS CF Template to copy data from S3 bucket to Redshift Cluster using DataPipeline",
  "Parameters":{
    "DatabaseName" : {
      "Description" : "The name of the first database to be created when the cluster is created",
      "Type" : "String",
      "Default" : "dev",
      "AllowedPattern" : "([a-z]|[0-9])+"
    },
    "SubnetID" : {
      "Description" : "Subnet IDs",
      "Type" : "AWS::EC2::Subnet::Id"
    },
    "SecurityGroupID" : {
      "Description" : "Security Group IDs",
      "Type" : "AWS::EC2::SecurityGroup::Id"
    },
    "ClusterType" : {
      "Description" : "The type of cluster",
      "Type" : "String",
      "Default" : "single-node",
      "AllowedValues" : [ "single-node", "multi-node" ]
    },
    "NumberOfNodes" : {
      "Description" : "The number of compute nodes in the cluster. For multi-node clusters, the NumberOfNodes parameter must be greater than 1",
      "Type" : "Number",
      "Default" : "1"
    },
    "NodeType" : {
      "Description" : "The type of node to be provisioned",
      "Type" : "String",
      "Default" : "dc1.large",
      "AllowedValues" : ["dc1.large", "dw1.xlarge", "dw1.8xlarge",
                         "dw2.large", "dw2.8xlarge" ]
    },
    "MasterUsername" : {
      "Description" : "The user name that is associated with the master user account for the cluster that is being created",
      "Type" : "String",
      "Default" : "defaultuser",
      "AllowedPattern" : "([a-z])([a-z]|[0-9])*"
    },
    "MasterUserPassword" :  {
      "Description" : "The password that is associated with the master user account for the cluster that is being created.",
      "Type" : "String",
      "NoEcho" : "true"
    },
    "ClusterID":{
       "Description" : "Cluster ID",
       "Type" : "String"
    },
    "SNSTopic":{
       "Description" : "The ARN of the SNS Topic",
       "Type" : "String"
    },
    "BucketName":{
       "Description" : "S3 Bucket where the data to be copied resides",
       "Type" : "String"
    },
    "AvailabilityZone": {
       "Description": "Select the Availability Zone in which to create subnet1",
       "Type": "AWS::EC2::AvailabilityZone::Name"
    }
  },
  "Mappings": {},
  "Conditions":{},
  "Resources":{
    "Datapipeline": {
       "Type": "AWS::DataPipeline::Pipeline",
       "Properties":{
       "Name": "DP02",
       "Activate": "true",
       "ParameterObjects":[],
       "ParameterValues": [],
       "PipelineObjects" :  [
        {
          "Id": "RedshiftDatabaseId1",
          "Name": "redshiftDB01",
          "Fields": [
            {
              "Key": "databaseName",
              "StringValue": {"Ref":"DatabaseName"}
            },
            {
              "Key": "username",
              "StringValue": {"Ref":"MasterUsername"}
            },
            {
              "Key": "*password",
              "StringValue": {"Ref":"MasterUserPassword"}
            },
            {
              "Key":"type",
              "StringValue":"RedshiftDatabase"
            },
            {
              "Key":"clusterId",
              "StringValue":{"Ref":"ClusterID"}
            }
          ]
        },
        {
          "Id": "Default",
          "Name": "Default",
          "Fields": [
            {
              "Key": "type",
              "StringValue": "Default"
            },
            {
              "Key": "scheduleType",
              "StringValue": "timeseries"
            },
            {
              "Key": "failureAndRerunMode",
              "StringValue": "CASCADE"
            },
            {
              "Key": "role",
              "StringValue": "DataPipelineDefaultRole"
            },
            {
              "Key": "resourceRole",
              "StringValue": "DataPipelineDefaultResourceRole"
            },
            {
              "Key": "pipelineLogUri",
              "StringValue": {"Fn::Join" : [ "", [ "s3://", 
                               { "Ref": "BucketName"},"/logs" ] ]}
            }
          ]
        },
        {
          "Id" : "RedshiftDataNodeId1",
          "Name" : "RedshiftDataNode01",
          "Fields":[
            {
              "Key": "schedule",
              "RefValue": "ScheduleId1"
            },
            {
              "Key": "tableName",
              "StringValue": "output"
            },
            {
              "Key": "createTableSql",
              "StringValue": "create table Output (Name VARCHAR(50) NOT NULL PRIMARY KEY);"
            },
            {
              "Key": "type",
              "StringValue": "RedshiftDataNode"
            },
            {
              "Key": "database",
              "RefValue": "RedshiftDatabaseId1"
            },
            {
              "Key": "onSuccess",
              "RefValue": "mySuccessAction"
            },
            {
              "Key": "onFail",
              "RefValue": "myFailureAction"
            }
          ]
        },
        {
          "Id":"mySuccessAction",
          "Name":"SuccessNotify",
          "Fields":[
            {
              "Key": "type",
              "StringValue":"SnsAlarm"
            },
            {
              "Key": "topicArn",
              "StringValue":{"Ref":"SNSTopic"}
            },
            {
              "Key": "subject",
              "StringValue":"Success"
            },
            {
              "Key": "message",
              "StringValue":"Success: The data has been successfully copied."
            }
          ]
        },
        {
          "Id":"myFailureAction",
          "Name":"FailureNotify",
          "Fields":[
            {
              "Key": "type",
              "StringValue":"SnsAlarm"
            },
            {
              "Key": "topicArn",
              "StringValue":{"Ref":"SNSTopic"}
            },
            {
              "Key": "subject",
              "StringValue":"Failure"
            },
            {
              "Key": "message",
              "StringValue":"Error: Failed to copy the data."
            }
          ]
        },
        {
          "Id" : "Ec2ResourceId1",
          "Name" : "Ec2Resource01",
          "Fields": [
            {
              "Key": "schedule",
              "RefValue": "ScheduleId1"
            },
            {
              "Key": "securityGroupIds",
              "StringValue":{"Ref": "SecurityGroupID"}
            },
            {
              "Key": "subnetId",
              "StringValue":{"Ref": "SubnetID"}
            },
            {
              "Key": "logUri",
              "StringValue":{"Fn::Join" : [ "", [ "s3://",
                            { "Ref": "BucketName"},"/logs" ] ]}
            },
            {
              "Key": "type",
              "StringValue": "Ec2Resource"
            },
            {
              "Key": "terminateAfter",
              "StringValue": "5 hours"
            },
            {
              "Key": "resourceRole",
              "StringValue": "DataPipelineDefaultResourceRole"
            },
            {
              "Key": "role",
              "StringValue": "DataPipelineDefaultRole"
            }
          ]
        },
        {
          "Id" : "ScheduleId1",
          "Name" : "Schedule01",
          "Fields": [
            {
              "Key": "startAt",
              "StringValue": "FIRST_ACTIVATION_DATE_TIME"
            },
            {
              "Key": "occurrences",
              "StringValue": "3"
            },
            {
              "Key": "type",
              "StringValue": "Schedule"
            },
            {
              "Key": "period",
              "StringValue": "15 minutes"
            }
          ]
        },
        {
          "Id" : "S3DataNodeId1",
          "Name" : "S3DataNode01",
          "Fields": [
            {
              "Key": "schedule",
              "RefValue": "ScheduleId1"
            },
            {
              "Key": "filePath",
              "StringValue": {"Fn::Join" : [ "", [ "s3://", 
                             { "Ref": "BucketName"},"/wordcount","/output" ] ]}
            },
            {
              "Key": "type",
              "StringValue": "S3DataNode"
            }
          ]
        },
        {
          "Id" : "RedshiftCopyActivityId1",
          "Name" : "RedShiftCopyActivity01",
          "Fields": [
            {
              "Key": "input",
              "RefValue": "S3DataNodeId1"
            },
            {
              "Key": "schedule",
              "RefValue": "ScheduleId1"
            },
            {
              "Key": "type",
              "StringValue": "RedshiftCopyActivity"
            },
            {
              "Key": "insertMode",
              "StringValue": "KEEP_EXISTING"
            },
            {
              "Key": "runsOn",
              "RefValue": "Ec2ResourceId1"
            },
            {
              "Key": "output",
              "RefValue": "RedshiftDataNodeId1"
            }
          ]
        }
      ]
     }
    }
   },
   "Outputs": {
     "DataPipelineId": {
        "Value": {"Ref": "Datapipeline"},
        "Description": "The id of the created datapipeline"
      }
   }
}
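
Each "RefValue" in the pipeline objects above must name the "Id" of another object in the same list (for example, "runsOn" points at "Ec2ResourceId1"). A typo there is likely to surface only when the pipeline definition is validated during stack creation, so a quick local check can be handy. The following is a minimal sketch in Python and is not part of the original template; the function name find_dangling_refs and the trimmed sample list are illustrative assumptions.

```python
import json


def find_dangling_refs(pipeline_objects):
    """Return (object Id, field Key, RefValue) triples whose RefValue matches no object Id."""
    known_ids = {obj["Id"] for obj in pipeline_objects}
    dangling = []
    for obj in pipeline_objects:
        for field in obj.get("Fields", []):
            ref = field.get("RefValue")
            if ref is not None and ref not in known_ids:
                dangling.append((obj["Id"], field["Key"], ref))
    return dangling


# A trimmed excerpt in the same shape as the template's PipelineObjects.
# Ec2ResourceId1 is deliberately left out so the check has something to report.
sample = json.loads("""
[
  {"Id": "ScheduleId1", "Name": "Schedule01",
   "Fields": [{"Key": "type", "StringValue": "Schedule"}]},
  {"Id": "S3DataNodeId1", "Name": "S3DataNode01",
   "Fields": [{"Key": "schedule", "RefValue": "ScheduleId1"},
              {"Key": "type", "StringValue": "S3DataNode"}]},
  {"Id": "RedshiftCopyActivityId1", "Name": "RedShiftCopyActivity01",
   "Fields": [{"Key": "input", "RefValue": "S3DataNodeId1"},
              {"Key": "runsOn", "RefValue": "Ec2ResourceId1"}]}
]
""")

print(find_dangling_refs(sample))
# [('RedshiftCopyActivityId1', 'runsOn', 'Ec2ResourceId1')]
```

To run it against the real template, load the file with json.load and pass in the "PipelineObjects" list of the pipeline resource. This assumes the file parses as strict JSON.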

Once the stack is created, a data pipeline is provisioned that copies data from the S3 bucket to the Redshift cluster, launching the resources it depends on.
The copy activity runs on an EC2 resource launched by Data Pipeline.
You can verify the load in the “Loads” tab of the Redshift console. The status shows “completed” once the data has been copied from the S3 bucket to the Redshift cluster.
The template also sets up SNS alarms, so the success or failure of the activity is reported by email as well. You can also verify the data in the cluster with the psql tool from your local machine.
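
For the psql check, here is a minimal sketch in Python that assembles the command line. The host, database and user values are placeholders (assumptions), 5439 is Redshift's default port, and the table name output is the "tableName" set on the RedshiftDataNode in the template. The helper name build_psql_command is hypothetical.

```python
import shlex


def build_psql_command(host, database, user, port=5439, table="output"):
    """Assemble a psql invocation that counts the rows loaded into `table`."""
    query = f"select count(*) from {table};"
    return ["psql", "-h", host, "-p", str(port), "-U", user, "-d", database, "-c", query]


# Placeholder values (assumptions): substitute your cluster endpoint and the
# DatabaseName / MasterUsername parameters you passed to the stack.
cmd = build_psql_command(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="mydb",
    user="masteruser",
)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

On a machine with psql installed and network access to the cluster, the list can be passed to subprocess.run(cmd). psql then prompts for the master user password.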

Hope you found this useful! Happy Automating :)

Priyanka Sharma

Priyanka is a Senior Cloud and DevOps Engineer. She can churn out CloudFormation templates at a moment's notice and play with Chef/Ansible. Dancing, music, badminton and word games are her hobbies.
