Showing posts with label NoSQL-Cassandra. Show all posts

Saturday, December 13, 2014

Cassandra - Installing Cassandra

1. First, download the Cassandra tarball (I am installing on a Red Hat Linux machine) from the official Cassandra website to the location on your machine where you want to install it - http://cassandra.apache.org/download/


2. Cassandra is written in Java and requires Java 7 or above to be installed first. Make sure Java 7 or above is installed on your machine. Java 7 version numbers look like 1.7.x.x
To check which version of Java is installed, type the following in the terminal:   java -version as shown below

image
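The version check above can also be scripted. The following is a minimal Python sketch, assuming the version string has already been read from the output of java -version; the function names and the sample version strings are made up for illustration.

```python
import re


def java_major_version(version_string):
    """Return the major Java version from a string such as '1.7.0_71' or '11.0.2'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognised Java version: {version_string!r}")
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Legacy naming: 1.7.x means Java 7. Newer releases lead with the major number.
    return second if first == 1 else first


def meets_cassandra_requirement(version_string, minimum=7):
    """True when the given Java version is at least `minimum` (7 for this install)."""
    return java_major_version(version_string) >= minimum


# Hypothetical sample strings, not taken from a real machine.
for sample in ("1.6.0_45", "1.7.0_71", "11.0.2"):
    print(sample, meets_cassandra_requirement(sample))
```

Note that java -version writes its output to standard error rather than standard output, which matters if the string is captured from a script.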

3. Installing Cassandra is very simple: untarring the tarball is all it takes (all the executables and tools needed to get started with Cassandra become available), and the tarball is only around 22 MB in size.

Once the tarball is downloaded on the machine where you want to install it, create a directory named cassandra in your home directory and place the tarball in it.

To untar the tarball, use the command below:

$ tar -xvzf apache-cassandra-2.1.2-bin.tar.gz

[rvalusa@ol6-11gGG ~]$ cd cassandra/
[rvalusa@ol6-11gGG cassandra]$ ls
apache-cassandra-2.1.2-bin.tar.gz
[rvalusa@ol6-11gGG cassandra]$ tar -xvzf apache-cassandra-2.1.2-bin.tar.gz
apache-cassandra-2.1.2/bin/
apache-cassandra-2.1.2/conf/
apache-cassandra-2.1.2/conf/triggers/
apache-cassandra-2.1.2/interface/
apache-cassandra-2.1.2/javadoc/
apache-cassandra-2.1.2/javadoc/org/
apache-cassandra-2.1.2/javadoc/org/apache/
apache-cassandra-2.1.2/javadoc/org/apache/cassandra/
apache-cassandra-2.1.2/javadoc/org/apache/cassandra/auth/
apache-cassandra-2.1.2/javadoc/org/apache/cassandra/auth/class-use/
apache-cassandra-2.1.2/javadoc/org/apache/cassandra/cache/
apache-cassandra-2.1.2/javadoc/org/apache/cassandra/cache/class-use/
apache-cassandra-2.1.2/javadoc/org/apache/cassandra/cli/
apache-cassandra-2.1.2/javadoc/org/apache/cassandra/cli/class-use/
..........
..........
..........
..........

apache-cassandra-2.1.2/tools/bin/cassandra-stress.bat
apache-cassandra-2.1.2/tools/bin/cassandra-stressd
apache-cassandra-2.1.2/tools/bin/cassandra.in.bat
apache-cassandra-2.1.2/tools/bin/cassandra.in.sh
apache-cassandra-2.1.2/tools/bin/json2sstable
apache-cassandra-2.1.2/tools/bin/json2sstable.bat
apache-cassandra-2.1.2/tools/bin/sstable2json
apache-cassandra-2.1.2/tools/bin/sstable2json.bat
apache-cassandra-2.1.2/tools/bin/sstablelevelreset
apache-cassandra-2.1.2/tools/bin/sstablemetadata
apache-cassandra-2.1.2/tools/bin/sstablemetadata.bat
apache-cassandra-2.1.2/tools/bin/sstablerepairedset
apache-cassandra-2.1.2/tools/bin/sstablesplit
apache-cassandra-2.1.2/tools/bin/sstablesplit.bat
apache-cassandra-2.1.2/tools/bin/token-generator

[rvalusa@ol6-11gGG cassandra]$ ls
apache-cassandra-2.1.2  apache-cassandra-2.1.2-bin.tar.gz
[rvalusa@ol6-11gGG cassandra]$ cd apache-cassandra-2.1.2
[rvalusa@ol6-11gGG apache-cassandra-2.1.2]$ ls
bin          conf       javadoc  LICENSE.txt  NOTICE.txt  tools
CHANGES.txt  interface  lib      NEWS.txt     pylib

[rvalusa@ol6-11gGG apache-cassandra-2.1.2]$



Viewing The Main Configuration File:
Cassandra's main configuration file is named cassandra.yaml and lives in the conf directory.
Details on YAML are here - http://en.wikipedia.org/wiki/YAML.  YAML is a recursive acronym: YAML Ain't Markup Language.

Below are some of the parameters set in the cassandra.yaml config file

[rvalusa@ol6-11gGG conf]$ pwd
/home/rvalusa/cassandra/apache-cassandra-2.1.2/conf
[rvalusa@ol6-11gGG conf]$ cat cassandra.yaml

cluster_name: 'Test Cluster'
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
endpoint_snitch: SimpleSnitch
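To read these settings from a script, something like the Python sketch below may be enough. It is deliberately not a full YAML parser: it only picks up simple top-level key: value lines, and the function name is made up for illustration. A real YAML library would be the safer choice for anything beyond a quick check.

```python
def read_top_level_settings(yaml_text):
    """Collect simple top-level 'key: value' pairs (not a full YAML parser)."""
    settings = {}
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, list items, and indented (nested) lines.
        if not stripped or stripped.startswith(("#", "-")) or line[0].isspace():
            continue
        key, separator, value = line.partition(":")
        if not separator:
            continue
        # Values are kept as strings; surrounding quotes are dropped.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


# The four settings shown above, used here as sample input.
SAMPLE = """\
cluster_name: 'Test Cluster'
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
endpoint_snitch: SimpleSnitch
"""

settings = read_top_level_settings(SAMPLE)
for name in ("cluster_name", "num_tokens", "partitioner", "endpoint_snitch"):
    print(name, "=", settings[name])
```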

Providing Permissions to Cassandra directories:
[rvalusa@ol6-11gGG lib]$ su -
Password:
[root@ol6-11gGG ~]# mkdir /var/lib/cassandra   # Cassandra data files will reside here
[root@ol6-11gGG ~]# mkdir /var/log/cassandra   # Cassandra system logs will reside here
[root@ol6-11gGG ~]# chown -R rvalusa /var/lib/cassandra
[root@ol6-11gGG ~]# chown -R rvalusa /var/log/cassandra
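After changing ownership, it can be worth confirming that the user who will run Cassandra can actually write to both directories. The Python sketch below is one way to check, assuming it is run as that user; the function name is hypothetical.

```python
import os


def unusable_directories(paths):
    """Return the paths that are missing or not writable by the current user."""
    return [
        path
        for path in paths
        # A usable directory must exist and allow both writing and traversal.
        if not (os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK))
    ]


if __name__ == "__main__":
    problems = unusable_directories(["/var/lib/cassandra", "/var/log/cassandra"])
    print("missing or not writable:", problems)
```

An empty list means both directories exist and are writable by the current user.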

Cassandra - Getting Started With The Architecture

Understanding That Cassandra Is A Distributed Database
Cassandra is a distributed database in which all nodes in a cluster have the same functionality. There are no master or slave nodes, which eliminates the single point of failure. Data is replicated across the nodes for high availability.

A Cassandra cluster can easily be spread across more than one data center, allowing for high availability even if one data center goes down completely.


Cassandra Documentation is available at - http://www.datastax.com/docs

Snitch: The snitch is how the nodes in a cluster learn about the topology of the cluster.
Ways to Define Snitch:
- Dynamic Snitching: Monitors the performance of reads from the various replicas and chooses the best replica based on this history
- SimpleSnitch: For single-data center deployments only.
- RackInferringSnitch: Determines the location of nodes by rack and data center corresponding to the IP addresses.
- PropertyFileSnitch: Determines the location of nodes by rack and data center.
- GossipingPropertyFileSnitch: Automatically updates all nodes using gossip when adding new nodes.
- Ec2Snitch: Use with Amazon EC2 in a single region.
- Ec2MultiRegionSnitch: Use with Amazon EC2 in multiple regions.
- GoogleCloudSnitch
- CloudstackSnitch


Gossip: Gossip is the internal communication method that the nodes in a cluster use to talk to each other.
Every second, each node communicates with up to three other nodes, exchanging information about itself and about all the other nodes it knows of.
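To get a feel for how quickly information spreads this way, here is a toy Python simulation. It is a heavily simplified sketch, not Cassandra's gossip protocol: peers are picked uniformly at random, each node's "state" is just the set of nodes it has heard of, and the function names are made up for illustration.

```python
import random


def gossip_round(views, rng, fanout=3):
    """One round: every node swaps what it knows with up to `fanout` random peers."""
    nodes = list(views)
    for node in nodes:
        peers = [other for other in nodes if other != node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            # Both sides end up with the union of what either one knew.
            merged = views[node] | views[peer]
            views[node] = merged
            views[peer] = set(merged)


def rounds_until_converged(node_count, seed=0, max_rounds=100):
    """Count rounds until every node knows about every other node."""
    rng = random.Random(seed)
    # Each node starts out knowing only about itself.
    views = {node: {node} for node in range(node_count)}
    for round_number in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(len(known) == node_count for known in views.values()):
            return round_number
    return None  # did not converge within max_rounds


print("rounds needed for 20 nodes:", rounds_until_converged(20))
```

Because each exchange passes on everything a node has learned so far, knowledge spreads much faster than one node at a time, so only a handful of rounds are typically needed.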

For external communication, such as from an application to a Cassandra database, CQL (Cassandra Query Language) or Thrift is used.

How is data distributed across the nodes in Cassandra?
Data is distributed using a consistent hashing algorithm, which strives for an even distribution of data across the nodes in a cluster.
Rather than all of the rows of a table existing on only one node, the rows are distributed across the nodes in the cluster, in an attempt to spread the load of the table's data evenly.
For example, consider the following rows of data, to be inserted into a table within a Cassandra database. (The data will be spread across the nodes based on the hash algorithm used, as illustrated below.)

image

To distribute the rows across the nodes, a Partitioner is used.
The Partitioner uses an algorithm to determine which node a given row of data will go to.
The default partitioner in Cassandra is Murmur3Partitioner.

Murmur3Partitioner hashes the value in the first column* of the row (depending on the table definition, more than one column can be used) to generate a token, a number between -2^63 and 2^63 - 1.

Based on the hashing algorithm, the Home_ID values in the table above are turned into the following tokens:
H01033638 -> -7456322128261119046
H01545551 -> -2487391024765843411
H00999943 -> 6394005945182357732
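The idea of turning a key into a signed 64-bit token can be sketched in Python as follows. Important assumption: Python's standard library has no Murmur3, so this sketch substitutes SHA-256 as a stand-in hash. It will therefore NOT reproduce the token values listed above; it only shows the shape of the mapping.

```python
import hashlib
import struct

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def stand_in_token(partition_key):
    """Hash a key to a signed 64-bit token (SHA-256 stand-in, not Murmur3)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a big-endian signed 64-bit integer.
    (token,) = struct.unpack(">q", digest[:8])
    return token


for home_id in ("H01033638", "H01545551", "H00999943"):
    print(home_id, "->", stand_in_token(home_id))
```

The same key always hashes to the same token, which is what lets any node work out where a row belongs without consulting a central directory.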


Similarly, each node in a cluster has an endpoint value assigned to it manually, which determines which rows are distributed to that node. Each node is responsible for the token values between its endpoint and the endpoint of the previous node.

image
Therefore, the -7456322128261119046 data is owned by the -4611686018427387904 node.

image
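The ownership rule (a node owns the tokens between the previous node's endpoint and its own) can be expressed as a lookup in a sorted list. The Python sketch below assumes a four-node ring with evenly spaced endpoints; only the -4611686018427387904 endpoint appears in the text, the other three are assumptions, and the function name is made up for illustration.

```python
from bisect import bisect_left

# Assumed four-node ring; only the second endpoint is taken from the text.
ENDPOINTS = [-9223372036854775808, -4611686018427387904, 0, 4611686018427387904]


def owning_endpoint(token, endpoints=ENDPOINTS):
    """Return the endpoint of the node that owns `token`."""
    ring = sorted(endpoints)
    # First endpoint that is greater than or equal to the token.
    index = bisect_left(ring, token)
    # Tokens above the highest endpoint wrap around to the first node.
    return ring[index % len(ring)]


for token in (-7456322128261119046, -2487391024765843411, 6394005945182357732):
    print(token, "is owned by the", owning_endpoint(token), "node")
```

Under these assumed endpoints, the first token lands on the -4611686018427387904 node, matching the statement above, while the third token is larger than every endpoint and wraps around to the first node.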

Node token ranges are calculated using the below formula or Murmur3 calculator - http://www.geroba.com/cassandra/cassandra-token-calculator/

image
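For reference, evenly spaced endpoints over the Murmur3 token range can be computed with a few lines of Python. This is a sketch that assumes the goal is simply to divide the range from -2^63 to 2^63 - 1 into equal slices, one per node; the function name is hypothetical.

```python
def evenly_spaced_tokens(node_count):
    """Endpoints for `node_count` nodes spread evenly over the 64-bit token range."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    slice_width = 2**64 // node_count
    # Start at the lowest token and step one slice per node.
    return [-2**63 + i * slice_width for i in range(node_count)]


for endpoint in evenly_spaced_tokens(4):
    print(endpoint)
```

For four nodes this yields -9223372036854775808, -4611686018427387904, 0 and 4611686018427387904, which includes the -4611686018427387904 endpoint used in the ownership example.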

Replication: A Replication Factor must be specified whenever a database (a keyspace, in Cassandra terms) is defined.
The Replication Factor specifies how many copies of the data there will be within a given database.
Although 1 can be specified, it is common to specify 2, 3, or more, so that if a node goes down there is at least one other replica and the data is not lost along with the down node.
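The effect of the Replication Factor can be sketched in Python. This assumes a simple placement rule in which the first replica goes to the node that owns the token and each further replica goes to the next node around the ring; the text does not say which placement strategy is in use, and both the four-node ring and the function name are assumptions for illustration.

```python
from bisect import bisect_left


def replica_endpoints(token, endpoints, replication_factor):
    """Endpoints of the nodes that hold a copy of the row with this token."""
    ring = sorted(endpoints)
    if not 1 <= replication_factor <= len(ring):
        raise ValueError("replication factor must be between 1 and the node count")
    # The owner is the first endpoint >= token, wrapping past the end of the ring.
    start = bisect_left(ring, token) % len(ring)
    # Further replicas go to the following nodes, again wrapping around.
    return [ring[(start + i) % len(ring)] for i in range(replication_factor)]


# Assumed four-node ring with evenly spaced endpoints.
RING = [-9223372036854775808, -4611686018427387904, 0, 4611686018427387904]
print(replica_endpoints(-7456322128261119046, RING, 3))
```

With a Replication Factor of 3 on this ring, losing any single node still leaves two copies of the row available.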
 

Virtual Nodes: Virtual nodes are an alternative way to assign token ranges to nodes, and are now the default in Cassandra.
With virtual nodes, instead of being responsible for just one token range, a node is responsible for many small token ranges (by default, 256 of them).
Virtual nodes allow a high number of ranges (e.g. 512) to be assigned to a powerful computer and a lower number of ranges (e.g. 128) to a less powerful one.
Virtual nodes (aka vnodes) were created to make it easier to add new nodes to a cluster while keeping the cluster balanced.
When a new node is added, it receives many small token range slices from the existing nodes, to maintain a balanced cluster.
With the old way of static token ranges, it was common to double the number of nodes, so that the endpoint of each new node could fall halfway between two existing endpoints.
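The link between the number of vnodes and the share of data a machine ends up with can be illustrated with a small Python simulation. This sketch assumes tokens are chosen uniformly at random, which is a simplification; the node names and function names are made up for illustration.

```python
import random

RING_SIZE = 2**64
MIN_TOKEN = -2**63


def assign_vnode_tokens(tokens_per_node, seed=0):
    """Give each node its requested number of random tokens; returns {token: node}."""
    rng = random.Random(seed)
    owners = {}
    for node, count in tokens_per_node.items():
        assigned = 0
        while assigned < count:
            token = rng.randrange(MIN_TOKEN, MIN_TOKEN + RING_SIZE)
            if token not in owners:  # tokens must be unique across the ring
                owners[token] = node
                assigned += 1
    return owners


def ownership_fractions(owners):
    """Fraction of the ring owned by each node."""
    ring = sorted(owners)
    widths = {}
    for i, token in enumerate(ring):
        previous = ring[i - 1]  # for i == 0 this wraps to the highest token
        # Each token owns the range from the previous token up to itself.
        width = (token - previous) % RING_SIZE or RING_SIZE
        widths[owners[token]] = widths.get(owners[token], 0) + width
    return {node: width / RING_SIZE for node, width in widths.items()}


fractions = ownership_fractions(assign_vnode_tokens({"big": 512, "small": 128}))
for node, fraction in sorted(fractions.items()):
    print(f"{node}: {fraction:.1%} of the ring")
```

Since "big" holds four times as many tokens as "small", it should end up owning roughly four fifths of the ring, though the exact split depends on the random tokens drawn.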

Cassandra - Introduction to Cassandra

Understanding What Cassandra Is:
Cassandra is:

- Open source
- A NoSQL (Not Only SQL) database technology
- A distributed database technology
- A big data technology which provides massive scalability
- Commonly used to create a database that is spread across nodes in more than one data center, for high availability
- Based on Amazon Dynamo and Google Bigtable
- Fault tolerant
- Highly performant
- Decentralized: no single point of failure
- Durable: data is not lost even if one of the data centers goes down
- Elastic: read/write throughput increases linearly as new nodes are added to the cluster

What Cassandra Is Being Used For:

Use cases of Cassandra are listed in – planetcassandra.org/apache-cassandra-use-cases/

Companies running their applications on Apache Cassandra have realized benefits which have directly improved their business. Cassandra is capable of handling all of the big data challenges that might arise: massive scalability, an always-on architecture, high performance, strong security, and ease of management, to name a few.

- Product Catalog/Playlist - Coursera, Comcast, Netflix, Hulu, Sky, Soundcloud etc.
- Recommendation/Personalization - Bazaarvoice, Outbrain, eBay etc.
- Fraud Detection - Barracuda Networks, F-Secure etc.
- Messaging - Accenture, CallFire, eBuddy, The New York Times etc.
- IoT/Sensor Data - NASA, AppDynamics, Lucid, Aeris etc.