Python Data Persistence – Create Keyspace

Python Data Persistence – Create Keyspace

As mentioned above. Cassandra Query Language (CQL) is the primary tool for communication with the Cassandra database. Its syntax is strikingly similar to SQL. The first step is to create a keyspace.

A keyspace is a container of column families or tables. CQL provides CREATE KEYSPACE statement to do exactly the same – create a new keyspace. The statement defines its name and replication strategy.

Example

CREATE KEYSPACE name with replication {options}

The replication clause is mandatory and is characterized by class and replication_factor attributes. The ‘class’ decides the replication strategy to be used for the keyspace. Its value is. by default, SimpleStrategy indicating that data will be spread across the entire cluster. Another value of the class is NetworkTopologyStrategy. It is a production-ready strategy with the help of which replication factor can be set independently on each data center.

The replication_factor attribute defines the number of replicas per data center. Set its value to 3, which is considered optimum, so that the data availability is reasonably high.

The following statement creates ‘ My Key Space ’ with ‘ SimpleStrategy’ and a replication factor of 3.

cqlsh> create keyspace my keyspace with
. . . replication={'class': 'SimpleStrategy',
'replication_factor 1 : 3};

Note that, the name of keyspace is case-insensitive unless given in double quotes. CQL provides the use keyword to set a certain keyspace as current. (Similar to MySQL ‘use” statement isn’t it?). To display a list of keyspaces in the current cluster, there is DESCRIBE keyword.

cq1sh> describe keyspaces;
castest            system_auth my keyspace
system_traces
system_schema system           system_distributed
cq1sh> use my keyspace; 
cq1sh:mykeyspace>

Create New Table

As mentioned earlier, one or more column families or tables may be present in a keyspace. The CREATE TABLE command in CQL creates a new table in the current keyspace. Remember the same command was used in SQL?

The general syntax is, as follows:

Example

create table if not exists table_name 
(
col1_definition, 
col2_ definition,
. .
. .
)

Column definition contains its name and data type, optionally setting it as the primary key. The primary key can also be set after the list of columns has been defined. The ‘if not exists” clause is not mandatory but is recommended to avoid error if the table of the given name already exists.
The following statement creates the ‘Products’ table in mykeyspace.

cq1sh:mykeyspace> create table if not exists products
                           . . . (
                           . . . productID int PRIMARY KEY,
                           . . . name text,
                           . . . price int 
                           . . . ) ;

The following definition is also identical:

cq1sh:mykeyspace> create table products
                              . . . (
                              . . . productID int,
                              . . . name text,
                              . . . price int,
                              . . . primary key (productID) 
                              . . . ) ;

Partition Key

Partition key determines on which node will a certain row will be stored. If a table has a single primary key (as in the above definition), it is treated as a partition key as well. The hash value of this partition key is used to determine the node or replica on which a certain row is located. Cassandra stores row having the primary keys in a certain range on one node. For example, rows with a productID value between 1 to 100 are stored on Node A, between 2 to 200 on node B, and so on.

The primary key may comprise more than one column. In that case, the first column name acts as the partition key and subsequent columns are cluster keys. Let us change the definition of the Products table slightly as follows:

cq1sh:mykeyspace> create table products
                                    . . . (
                                    . . . productID int,
                                    . . . manufacturer text,
                                    . . . name text,
                                    . . . price int,
                                    . . . primary key(manufacturer,
productID)
                                   . . . ) ;

In this case, the ‘manufacturer’ column acts as the partition key and ‘productID’ as a cluster key. As a result, all products from the same manufacturer will reside on the same node. Hence a query to search for products from a certain manufacturer will return results faster.