
Configure AWS Redshift to store cloud data

In this tutorial I will explain and show how to set up AWS Redshift for cloud data warehousing. Redshift is a fully managed, petabyte-scale data warehousing service introduced by Amazon Web Services. It works by combining one or more collections of computing resources called nodes, organized into a group called a cluster. Each cluster runs a Redshift engine and can contain one or more databases. At a high level, the architecture looks like this:

AWS Redshift

1. How does it work?

Under the hood, Redshift uses PostgreSQL as its core engine, so most SQL client applications can work with Redshift. Redshift can also be integrated with a wide range of applications, including BI, analytics, and ETL (Extract, Transform, Load) tools that let analysts and engineers work with the data it contains.

When a user sets up an Amazon Redshift data warehouse, the core topology for operations is called a cluster. A Redshift cluster consists of one or more compute nodes. If the user chooses more than one compute node, Redshift automatically provisions a leader node. This leader node receives requests and commands from the client side and is not billed by AWS.

Client applications communicate only with the leader node; the compute nodes beneath it are transparent to the user. When a client runs a query, the leader node parses the query and builds an optimal execution plan to run on the compute nodes, taking into account the amount of data stored on each node.
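You can ask the leader node to show the execution plan it builds by prefixing any query with EXPLAIN. A minimal sketch (the `sales` table here is just an illustration; we create it later in this tutorial):

```sql
-- Ask the leader node for its execution plan without actually running the query.
EXPLAIN
SELECT eventid, sum(pricepaid)
FROM sales
GROUP BY eventid;
```

The output lists the steps (scans, aggregations, data redistribution) that will be distributed across the compute nodes.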

In this tutorial I will show you how to set up and configure Redshift for your own use. In this example, I create an account and start with the free tier.

2. Configuration phase

2.1 Prerequisite

Before we start configuring an Amazon Redshift cluster, there are certain prerequisites that must be completed.

First, register for AWS. Then go to the IAM service to create a role that we can use for Redshift. You can follow the screenshots below:

AWS Management Console

Create a role


Amazon S3 access

Add tags

Review role settings

When you are done, you should see a screen like the one below confirming that the role was created successfully.

Redshift user created

2.2 Setting the Redshift Configuration

Once the prerequisites are in place, we can continue creating our own Redshift cluster. Search for the Redshift service in the search box and continue from there. The following is a screenshot example:

Find services

Click Create cluster and fill in the required values, noting that the Cluster permissions page is where we attach the IAM role we created earlier.

Create clusters

Cluster Configuration

Database settings and names

Redshift cluster was created successfully

Cluster permissions and overview

When done, you should end up in the Redshift dashboard as above. For this tutorial, we disable the network security layer by changing the security group. To do so, go to the bottom of the dashboard and add the Redshift port (5439 by default) on the Inbound rules tab. The following is an example:

Disable network security

When everything is ready, you should see that the new cluster you have created is now available for use.

Cluster is ready to use

3. Test phase

Let us now try to access our data warehouse. To test it, click EDITOR in the left pane, fill in the required connection variables, and click Connect to database.

Test Redshift configuration

You should be taken to an editor page. Let's start by creating our own test schema. Create a schema as shown below, then run it.

Create Redshift Database Schema
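If you prefer plain SQL over the console form, the schema can also be created directly in the query editor. A minimal sketch, assuming the schema name `imdb` that we use later in this tutorial:

```sql
-- Create the test schema and make it the default for this session.
CREATE SCHEMA IF NOT EXISTS imdb;
SET search_path = imdb;
```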

Good, let's now test from a local client. To do so, you must have either a JDBC or ODBC connection string from the Redshift console. To get that information, click the Config button on the left side of the dashboard.

Select your preferred connection type, then download the required libraries and copy the URL shown in the example below:

Select connection type

Then open one of your SQL client tools and enter the required connection variables. In our example here, we use the SQL client tool DBeaver, which can be obtained from here.

DBeaver Redshift client

Your connection should succeed as expected. If you run into an authentication issue, check your configuration under the AWS security group for more information.
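A quick way to confirm the client session is pointed at the right cluster is a simple sanity query, for example:

```sql
-- Confirm which database and user this session is connected as.
SELECT current_database(), current_user, version();
```

The version string returned by Redshift includes a Redshift build identifier, which distinguishes it from a plain PostgreSQL server.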

Let's then create a set of tables under our newly created schema. Below are the table creation statements that we will run in our cluster:

SET search_path = imdb;

create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean);

create table venue(
venueid smallint not null distkey sortkey,
venuename varchar(100),
venuecity varchar(30),
venuestate char(2),
venueseats integer);

create table category(
catid smallint not null distkey sortkey,
catgroup varchar(10),
catname varchar(10),
catdesc varchar(50));

create table dates(
dateid smallint not null distkey sortkey,
caldate date not null,
day character(3) not null,
week smallint not null,
month character(5) not null,
qtr character(5) not null,
year smallint not null,
holiday boolean default('N'));

create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);

create table listing(
listid integer not null distkey,
sellerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
numtickets smallint not null,
priceperticket decimal(8,2),
totalprice decimal(8,2),
listtime timestamp);

create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
qtysold smallint not null,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp);

The expected results are shown below:

Table creation part 1

Table diagram part 2
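To double-check that the tables all landed in the `imdb` schema, you can list them from the `pg_table_def` catalog; note that `pg_table_def` only shows tables whose schema is on your `search_path`:

```sql
-- List the tables just created; the schema must be on the search_path to appear.
SET search_path = imdb;
SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'imdb';
```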

Let us now try to load sample data into our data warehouse. For this example, I have uploaded sample data to my own S3 bucket, and I then use the script below to copy the data from the S3 files into Redshift.

Data transfer of test data

copy sales from 's3://shahril-redshift01-abcde/sales_tab.txt'
iam_role 'arn:aws:iam::325581293405:role/shahril-redshift-s3-ro-role'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'eu-west-1';

copy dates from 's3://shahril-redshift01-abcde/date2008_pipe.txt'
iam_role 'arn:aws:iam::325581293405:role/shahril-redshift-s3-ro-role'
delimiter '|' region 'eu-west-1';
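After the COPY commands finish, a quick row count confirms that the loads actually produced data; the exact counts depend on your sample files:

```sql
-- Verify that the loads produced rows in each target table.
SELECT 'sales' AS tbl, count(*) FROM sales
UNION ALL
SELECT 'dates', count(*) FROM dates;
```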

If you encounter a problem during the load, you can query the Redshift system table named stl_load_errors as below to get a hint of the problem.

select * from stl_load_errors;

Test query
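Rather than selecting every column, it is usually enough to pull the most recent errors with just the fields that explain the failure:

```sql
-- Show the latest load errors with the columns that matter most.
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 5;
```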


Finally, when everything is ready, you should be able to extract and manipulate the data with standard SQL. Below are some example queries I have used for our example.

-- Get definition for the sales table.
SELECT *
FROM pg_table_def
WHERE tablename = 'sales';

-- Find total sales on each day
SELECT b.caldate days, sum(a.qtysold) FROM sales a, dates b
WHERE a.dateid = b.dateid
group by b.caldate ;

-- Find top 10 buyers by quantity.
SELECT firstname, lastname, total_quantity
FROM (SELECT buyerid, sum(qtysold) total_quantity
FROM sales
GROUP BY buyerid
ORDER BY total_quantity desc limit 10) Q, users
WHERE Q.buyerid = userid
ORDER BY Q.total_quantity desc;

-- Find events in the 99.9 percentile in terms of all time gross sales.
SELECT eventname, total_price
FROM (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile
FROM (SELECT eventid, sum(pricepaid) total_price
FROM sales
GROUP BY eventid)) Q, event E
WHERE Q.eventid = E.eventid
AND percentile = 1
ORDER BY total_price desc;

Test result
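As one more example of joining the tables we loaded, a query along these lines ranks categories by revenue (the results naturally depend on your sample data):

```sql
-- Total revenue per category, joining sales -> event -> category.
SELECT c.catname, sum(s.pricepaid) AS total_revenue
FROM sales s
JOIN event e ON s.eventid = e.eventid
JOIN category c ON e.catid = c.catid
GROUP BY c.catname
ORDER BY total_revenue DESC;
```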

Thumbs up! We have now successfully created our own Redshift cluster for cloud data warehousing. Next, we will look at combining the existing data in the Redshift cluster with flat files using Redshift Spectrum.
