Amazon Redshift Tutorial (AWS)


October 5, 2019 by Daniel Andrews

Amazon Redshift Data Warehouse

  • Step 1: Create an AWS Account
  • Step 2: Set Up Billing Alarms
  • Step 3: Create a Redshift Cluster
  • Step 4: Update VPC for External Access
  • Step 5: Create an IAM User
  • Step 6: Configure DBeaver
  • Step 7: Load Data into S3
  • Step 8: Load Data into Redshift

Step 1: Create an AWS Account

  1. Go to aws.amazon.com and create an account.
    Create AWS Account
  2. Follow through with the payment information and identity verification.

Step 2: Set Up Billing Alarms

  1. Navigate to My Billing Dashboard from the top menu.
    AWS Billing and Cost Management
  2. Click on AWS Budgets.
  3. Click on Create a budget.
    AWS Budgets - Create Budget
  4. Choose your budget type. For this example, we’ll select Cost budget.
    AWS Budgets - Budget Type
  5. Fill in the details at the Set your budget step. This is where you can specify the budget period (monthly or annually), and the budget amount.
    AWS Budgets - Set Budget
  6. When you’ve set your budget, scroll to the bottom and click Configure Alerts.
  7. At the Configure alerts screen, you can choose to send alerts based on actual costs or forecasted costs.
  8. We’ll set ours to Actual Costs, with an alert threshold of 80%.
    AWS Budgets - Configure Alerts
  9. Type in the email address(es) you would like alerts to be sent to.
  10. Scroll to the bottom and click Confirm budget.
  11. Review the details on the screen, then click Create.
  12. You should now see a new entry in the AWS Budgets screen.
    AWS Budgets - Budgets Summary
  13. We’ve only created one alarm at $10, but you’ll probably want to add others at higher thresholds, e.g. one at $50 and one at $100, depending on your estimated usage and budget.
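
The same budget and alert can also be created from the AWS CLI. The command below is only a sketch: the account ID, budget name, and email address are placeholders to replace with your own values.

# Create a $10 monthly cost budget with an email alert at 80% of actual spend
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName": "monthly-cost-10-usd",
             "BudgetLimit": {"Amount": "10", "Unit": "USD"},
             "TimeUnit": "MONTHLY", "BudgetType": "COST"}' \
  --notifications-with-subscribers '[{
      "Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN",
                       "Threshold": 80, "ThresholdType": "PERCENTAGE"},
      "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@example.com"}]}]'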

Step 3: Create a Redshift Cluster

  1. Drop down the Services menu from the top of the page, and open the Amazon Redshift service
  2. Click on Quick launch cluster
    Redshift Service Dashboard
  3. Take a note of the region you’re currently set to from the top right corner. We’ll use US East (Ohio).
  4. Increase the number of nodes to 4
  5. Provide a value for the Cluster identifier field (following conventions*)
  6. Database port can be left as the default of 5439
    Quick Launch Cluster
  7. Provide a user name
  8. Provide a password, then confirm your password
  9. Click Launch cluster (charges will begin once you click this button)
    Redshift Cluster Creation Charges

*lowercase string, words separated by hyphens.
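
If you prefer to script this step, a roughly equivalent cluster can be launched from the AWS CLI. This is a sketch rather than the exact quick-launch configuration: the identifier, user name, password, database name, and node type (dc2.large) are assumptions you should adjust to your own choices.

# Launch a 4-node cluster in US East (Ohio), listening on the default Redshift port 5439
aws redshift create-cluster \
  --cluster-identifier my-redshift-cluster \
  --node-type dc2.large \
  --number-of-nodes 4 \
  --master-username awsuser \
  --master-user-password 'YourPassword1' \
  --db-name dev \
  --port 5439 \
  --region us-east-2

Charges start as soon as the cluster is created, exactly as with the console launch button.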


Step 4: Update VPC for External Access

  1. From the AWS Management Console, search for the VPC service
  2. From the VPC dashboard, select Security Groups on the left
    Virtual Private Cloud (VPC) Security Groups
  3. To grant the DBeaver application access, click Edit rules
  4. Click Add Rule and leave the type as Custom TCP Rule
  5. Set the port to be 5439
    VPC Inbound Rule Edit
  6. Set the source value to your public IPv4 address, followed by /32
  7. Save the rule
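
The same inbound rule can be added from the AWS CLI. A minimal sketch, assuming your cluster’s security group ID is sg-0123456789abcdef0 (look up the real ID on the VPC Security Groups page):

# Find your public IPv4 address
MY_IP=$(curl -s https://checkip.amazonaws.com)

# Allow inbound TCP on the Redshift port 5439 from your address only
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5439 \
  --cidr "${MY_IP}/32"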

Step 5: Create an IAM User

  1. From the AWS Management Console, search for the IAM service
  2. Select Users from the menu on the left, then click Add user
  3. Provide a user name
  4. For the access type, select Programmatic access
    IAM Service User Creation
  5. Click Next: Permissions
  6. Click on Attach existing policies directly
  7. Search for redshift. Click on the AmazonRedshiftFullAccess policy name
    IAM User Permissions
  8. Click Next: Tags
  9. Click Next: Review
  10. Review your selections, then click Create User
    IAM User Details Review
  11. Below the Success message, click on Download.csv
  12. This downloads your credentials, including access key ID and secret access key. Store this somewhere safe
  13. Return to your Amazon Redshift service dashboard
  14. Your cluster should now be available
    Amazon Redshift Cluster Available
  15. Opening your cluster reveals three dropdown menus: Cluster, Database, and Backup
  16. Clusters cannot be stopped and resumed. When you’ve finished your work, or want to stop being charged, you’ll need to:
    1. Create a snapshot of the cluster
    2. Delete the cluster
    3. Create a new cluster using the snapshot
    Delete Cluster With Snapshot
    Cluster Status Deleting

Command to generate a final snapshot, then delete a cluster (replace the placeholders with your own identifiers):
aws redshift delete-cluster --cluster-identifier <cluster-identifier> --final-cluster-snapshot-identifier <snapshot-identifier>

Command to restore a cluster from the snapshot:
aws redshift restore-from-cluster-snapshot --cluster-identifier <cluster-identifier> --snapshot-identifier <snapshot-identifier>

Snapshots incur S3 data storage charges, but this is considerably cheaper than keeping the cluster running.
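
For reference, the IAM user created at the start of this step can also be set up from the CLI. A minimal sketch (the user name is a placeholder); create-access-key prints the access key ID and secret access key, which you should store as carefully as the downloaded .csv file:

# Create the user and give it programmatic Redshift access
aws iam create-user --user-name redshift-tutorial-user
aws iam attach-user-policy \
  --user-name redshift-tutorial-user \
  --policy-arn arn:aws:iam::aws:policy/AmazonRedshiftFullAccess

# Generate the access key ID / secret access key pair
aws iam create-access-key --user-name redshift-tutorial-user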


Step 6: Configure DBeaver

  1. Download DBeaver from their website
  2. Install and open DBeaver
  3. Click on the Databases menu and select New Database Connection
  4. Click on All then select Redshift
    DBeaver - Connect to Database
  5. Click Next
  6. Click Edit Driver Settings
  7. Click Download/Update to download the latest driver
    Download AWS Redshift Driver
  8. You should now see the .jar file under the Libraries tab
    DBeaver Edit Redshift Driver
  9. Click OK
  10. On the General tab of the Connect to database window, fill in:
    Host: the part of the cluster’s JDBC URL between // and : (the endpoint, without the port or database name)
    Port: 5439
    Database: the database name chosen when the cluster was launched
    User: the master user name
    Password: the master user password
    DBeaver Redshift Driver Connection
  11. The above information should be pulled from the configuration page of your cluster in AWS
  12. Click Test connection to ensure it’s successful
  13. Click Finish
  14. Run select version(); from a query window to check the connection
  15. This will show the version of Redshift you’re running, together with the Postgres version
    DBeaver Version Check Script
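
If you’d rather check the connection from a terminal instead of DBeaver, the same details work with any Postgres-compatible client such as psql. A sketch with placeholder values (the host is your cluster endpoint from the AWS configuration page):

# Connect to the cluster and print the Redshift/Postgres version string
psql -h my-redshift-cluster.abc123xyz456.us-east-2.redshift.amazonaws.com \
     -p 5439 -d dev -U awsuser -c "select version();"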

Step 7: Load Data into S3

  1. From AWS, search for the S3 service
  2. Click on Create bucket
  3. Provide a globally unique name for your bucket
  4. Set the region to be the same as your Redshift cluster
    Create Amazon S3 Bucket
  5. In Step 3: Create a Redshift Cluster, we used US East (Ohio)
  6. Click Next
  7. For this guide, we won’t enable versioning or logging. Click Next
    Amazon S3 Bucket Versioning Logging
  8. At the Set permissions screen, select Block all public access
    Amazon S3 Bucket Permissions
  9. Click Next
  10. Click Create bucket
    Amazon S3 Create Bucket
  11. At the S3 buckets screen, click on the bucket you just created
    Amazon S3 Buckets
  12. Click Create folder
  13. Give your folder a name, then click Save
    Amazon S3 Folder
  14. Download the AWS Sample Files
  15. This will trigger the download of files for creating the database schema
  16. From your Amazon S3 bucket folder, click Upload
  17. Drag the files from your unzipped folder into the Upload window
    Amazon S3 File Upload
  18. Click Upload
  19. Each object is uploaded with a storage class of STANDARD
  20. You can find out more about what each storage class means in the AWS S3 storage classes documentation
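
The bucket, folder, and upload can also be done from the AWS CLI. A sketch with placeholder names (bucket names must be globally unique, so pick your own):

# Create the bucket in the same region as the Redshift cluster (us-east-2 is US East (Ohio))
aws s3 mb s3://my-redshift-tutorial-bucket --region us-east-2

# Upload the unzipped sample files into a folder (prefix) in the bucket
aws s3 cp ./sample-files s3://my-redshift-tutorial-bucket/sample-data/ --recursive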

Step 8: Load Data into Redshift

Schema creation

  1. Run the following SQL from DBeaver to return a list of table names in Redshift:
    select distinct(tablename) from pg_table_def where schemaname = 'public';
  2. Run the following SQL scripts to create our schema:

    CREATE TABLE part 
    (
      p_partkey     INTEGER NOT NULL,
      p_name        VARCHAR(22) NOT NULL,
      p_mfgr        VARCHAR(6),
      p_category    VARCHAR(7) NOT NULL,
      p_brand1      VARCHAR(9) NOT NULL,
      p_color       VARCHAR(11) NOT NULL,
      p_type        VARCHAR(25) NOT NULL,
      p_size        INTEGER NOT NULL,
      p_container   VARCHAR(10) NOT NULL
    );

    CREATE TABLE supplier 
    (
      s_suppkey   INTEGER NOT NULL,
      s_name      VARCHAR(25) NOT NULL,
      s_address   VARCHAR(25) NOT NULL,
      s_city      VARCHAR(10) NOT NULL,
      s_nation    VARCHAR(15) NOT NULL,
      s_region    VARCHAR(12) NOT NULL,
      s_phone     VARCHAR(15) NOT NULL
    );

    CREATE TABLE customer 
    (
      c_custkey      INTEGER NOT NULL,
      c_name         VARCHAR(25) NOT NULL,
      c_address      VARCHAR(25) NOT NULL,
      c_city         VARCHAR(10) NOT NULL,
      c_nation       VARCHAR(15) NOT NULL,
      c_region       VARCHAR(12) NOT NULL,
      c_phone        VARCHAR(15) NOT NULL,
      c_mktsegment   VARCHAR(10) NOT NULL
    );

    CREATE TABLE dwdate 
    (
      d_datekey            INTEGER NOT NULL,
      d_date               VARCHAR(19) NOT NULL,
      d_dayofweek          VARCHAR(10) NOT NULL,
      d_month              VARCHAR(10) NOT NULL,
      d_year               INTEGER NOT NULL,
      d_yearmonthnum       INTEGER NOT NULL,
      d_yearmonth          VARCHAR(8) NOT NULL,
      d_daynuminweek       INTEGER NOT NULL,
      d_daynuminmonth      INTEGER NOT NULL,
      d_daynuminyear       INTEGER NOT NULL,
      d_monthnuminyear     INTEGER NOT NULL,
      d_weeknuminyear      INTEGER NOT NULL,
      d_sellingseason      VARCHAR(13) NOT NULL,
      d_lastdayinweekfl    VARCHAR(1) NOT NULL,
      d_lastdayinmonthfl   VARCHAR(1) NOT NULL,
      d_holidayfl          VARCHAR(1) NOT NULL,
      d_weekdayfl          VARCHAR(1) NOT NULL
    );

    CREATE TABLE lineorder 
    (
      lo_orderkey          INTEGER NOT NULL,
      lo_linenumber        INTEGER NOT NULL,
      lo_custkey           INTEGER NOT NULL,
      lo_partkey           INTEGER NOT NULL,
      lo_suppkey           INTEGER NOT NULL,
      lo_orderdate         INTEGER NOT NULL,
      lo_orderpriority     VARCHAR(15) NOT NULL,
      lo_shippriority      VARCHAR(1) NOT NULL,
      lo_quantity          INTEGER NOT NULL,
      lo_extendedprice     INTEGER NOT NULL,
      lo_ordertotalprice   INTEGER NOT NULL,
      lo_discount          INTEGER NOT NULL,
      lo_revenue           INTEGER NOT NULL,
      lo_supplycost        INTEGER NOT NULL,
      lo_tax               INTEGER NOT NULL,
      lo_commitdate        INTEGER NOT NULL,
      lo_shipmode          VARCHAR(10) NOT NULL
    );

  3. Now run the table check SQL again, and we should see some results:
    select distinct(tablename) from pg_table_def where schemaname = 'public';
  4. From the AWS IAM service, click on Users and open the user you created
  5. You should already have the AmazonRedshiftFullAccess policy
  6. Click Add permissions, then Attach existing policies directly
  7. Select the AmazonS3FullAccess policy, then click Next: Review
  8. Click Add permissions
  9. We can now load data from S3 into our Redshift schema
    1. Data Load

      1. Open the credentials.csv file downloaded earlier
      2. In the commands below, replace [ACCESS-KEY-ID] with your access key ID
      3. Replace [SECRET-ACCESS-KEY] with your AWS secret key
      4. Load dwdate:
        copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate' 
        credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' 
        gzip compupdate off region 'us-west-2';
      5. Load customer:
        copy customer from 's3://awssampledbuswest2/ssbgz/customer' 
        credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' 
        gzip compupdate off region 'us-west-2';
      6. Load lineorder:
        copy lineorder from 's3://awssampledbuswest2/ssbgz/lineorder' 
        credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' 
        gzip compupdate off region 'us-west-2';
      7. Load part:
        copy part from 's3://awssampledbuswest2/ssbgz/part' 
        credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' 
        gzip compupdate off region 'us-west-2';
      8. Load supplier:
        copy supplier from 's3://awssampledbuswest2/ssbgz/supplier' 
        credentials 'aws_access_key_id=[ACCESS-KEY-ID];aws_secret_access_key=[SECRET-ACCESS-KEY]' 
        gzip compupdate off region 'us-west-2';
      9. You can check the tables are now populated using the following SQL:
        select count(*) from LINEORDER;
        select count(*) from PART;
        select count(*) from CUSTOMER;
        select count(*) from SUPPLIER;
        select count(*) from DWDATE;

        This should return the following results:

        Table       Rows
        LINEORDER   600,037,902
        PART        1,400,000
        CUSTOMER    3,000,000
        SUPPLIER    1,000,000
        DWDATE      2,556
      10. To confirm the load happened, navigate to the Loads tab of your Redshift cluster
      11. If the load fails, you can find out why it failed by running the SQL below:
        select query, substring(filename,22,25) as filename,line_number as line, 
        substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as line_text,
        substring(raw_field_value,0,15) as field_text, 
        substring(err_reason,0,45) as reason
        from stl_load_errors 
        order by query desc
        limit 10;
      12. If you encounter any issues during the data loads, the best place to look for a solution is the official AWS documentation on the COPY command and load errors.
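
      If the error points at S3 access or credentials, a quick sanity check is to confirm the sample data prefix is readable from the command line. This assumes the AWS CLI is configured (aws configure) with the access key created in Step 5:

      # List the gzipped sample files the COPY commands read from
      aws s3 ls s3://awssampledbuswest2/ssbgz/ --region us-west-2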

      Other useful SQL scripts:

      -- View the last commit
      select query, trim(filename) as file, curtime as updated
      from stl_load_commits order by updated desc;

      -- Check data distribution across slices. Each slice should have approximately the same number of records
      select slice, col, num_values, minvalue, maxvalue
      from svv_diskusage
      where name='customer' and col=0
      order by slice,col;

      -- Check for disk spills
      select query, step, rows, workmem, label, is_diskbased
      from svl_query_summary
      where query = [YOUR-QUERY-ID]
      order by workmem desc;

      -- Check distkey, sortkey, and encoding for a given table
      select "column", type, encoding, distkey, sortkey, "notnull" 
      from pg_table_def
      where tablename = 'lineorder';

Category: AWS | Tag: Amazon Redshift

About Daniel Andrews

Passionate about all things data and cloud. Specializing in Python, AWS, and DevOps, with a Master's degree in Data Science from City University, London, and a BSc in Computer Science.
