Quick Intro to Cassandra vs MongoDB with Python

Cassandra NoSQL

    Cassandra Conclusion:

  • “One way that Cassandra deviates from Mongo is that it offers much more control on how it’s data is laid out. Consider a scenario where we are interested in laying out large quantities of data that are related, like a friend’s list. Storing this in MongoDB can be a bit tricky – it’s not great at storing lists that are continuously growing. If you don’t store the friends in a single document, you end up risking pulling data from several different locations on disk (or on different servers) which can slow down your entire application. Under heavy load this will impact other queries being performed concurrently.”[1]
  • If you have a mature project that stores a lot of sequential data and needs to read it back later without jumping between disks, Cassandra looks like a strong candidate for queries such as:
    1. Show last 50 items for “TheMostInterestingPersonInTheWorld”: item1, item2, ..., item3000, ...
    2. Show me last comments on “TheLucasMovie”: comment1,comment2,comment3,
    3. Show water level in Louisiana RiverIoT: level at 8am,level at 8:01am,level at 8:02am, x 100-1000 locations
  • Great if you already have a data structure set up and it fits the model above. [2][3]
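
The three access patterns above share one shape: pick a single partition (a user, a movie, a river location) and read its newest entries in order. A minimal sketch of how that might be expressed in CQL through the Python driver, assuming a hypothetical water_level table whose name and columns are made up for illustration; session is expected to be a connected driver session like the one created in the Cassandra Python section:

```python
# Hypothetical table for the river IoT example: one partition per location,
# rows inside the partition kept sorted by time, newest first.
CREATE_WATER_LEVEL = """
CREATE TABLE IF NOT EXISTS water_level (
    location_id text,
    measured_at timestamp,
    level_cm double,
    PRIMARY KEY ((location_id), measured_at)
) WITH CLUSTERING ORDER BY (measured_at DESC)
"""

# The WHERE clause names exactly one partition, so the newest rows are
# read as one ordered slice instead of being gathered from many places.
LATEST_READINGS = """
SELECT measured_at, level_cm FROM water_level
WHERE location_id = %s
LIMIT %s
"""


def latest_readings(session, location_id, limit=50):
    """Return the newest `limit` readings for one location as a list."""
    return list(session.execute(LATEST_READINGS, (location_id, limit)))
```

The design choice here is the compound primary key: the first part decides which partition a row lives in, and the second part decides the order of rows inside that partition.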

MongoDB

    MongoDB Conclusion:

  • No fixed structure. Import pymongo, pick a database (db = client.myawesomedatabase), and start inserting documents. Done.
  • You have a project and you are not sure how NoSQL will handle it but you want to try it. [4]
  • You have a working process, but it has grown to a point where a traditional RDBMS can’t handle the IO load. [5]
  • You don’t have time to create table structures right now; you just want to get going and see what happens.
  • You want to find Python documentation fast and benefit from a large community’s examples.


Cassandra Python
Cassandra Code in Python; Details:
Installation:

#Add the Cassandra repo line below to /etc/apt/sources.list
deb http://www.apache.org/dist/cassandra/debian 37x main
sudo apt-get update
sudo update-alternatives --config java  #pick openjdk 8
sudo apt-get install cassandra
#status
nodetool status
nodetool info
nodetool tpstats
#python
virtualenv -p python3 env_py3
source env_py3/bin/activate
pip install cassandra-driver

Python:

from cassandra.cluster import Cluster
cluster=Cluster()
session = cluster.connect()


#https://github.com/dkoepke/cassandra-python-driver/blob/master/example.py
#IF NOT EXISTS lets the script be re-run without an "already exists" error
session.execute("CREATE KEYSPACE IF NOT EXISTS vindata WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' }")
session.execute("use vindata")
#http://www.slideshare.net/ebenhewitt/cassandra-datamodel-4985524 slide 23
session.execute("""
CREATE TABLE IF NOT EXISTS emissions (
vin text,
make text,
year text,
zip_code_of_station text,
co2 text,
year_month_key int,
PRIMARY KEY (vin)
)
""")

#https://www.youtube.com/watch?v=97VBdgIgcCU
#Load mydata

import glob
print(glob.glob("./data/*.dat"))
session.execute("use vindata")


for datafile in glob.glob("./data/*.dat"):
    #the file name carries the two-digit year and month, e.g. 1608 -> 201608
    #year_month_key is an int column, so convert before inserting
    ymk=int('20'+datafile[-12:-8])
    with open(datafile, 'r') as f:
        for row in f:
            #slice the fixed-width columns out of each line
            data={}
            data['vin']=row[:20].strip()
            data['make']=row[20:24].strip()
            data['year']=row[24:28].strip()
            data['zip_code_of_station']=row[42:47].strip()
            data['co2']=row[47:48].strip()
            data['year_month_key']=ymk
            #print(data)
            session.execute(
                """
                INSERT INTO emissions (vin, make, year, zip_code_of_station, co2, year_month_key)
                VALUES (%s,%s,%s,%s,%s,%s)
                """,
                (data['vin'],data['make'],data['year'],data['zip_code_of_station'],data['co2'],data['year_month_key'])
            )

future=session.execute_async("SELECT * FROM emissions where vin='1B4GP33R9TB205257'")
rows = future.result()
for row in rows:
    print(row)
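
The fixed-width slicing in the loader is easy to get wrong by one column, so it may help to pull it into a small function that can be tried on a single line without a running database. A minimal sketch, assuming the column offsets used in the loader above; the FIELDS table, the parse_row helper, the sample values, and the sample file name are illustrative and not part of the original script:

```python
# Column offsets copied from the loader above: name -> (start, end).
FIELDS = {
    "vin": (0, 20),
    "make": (20, 24),
    "year": (24, 28),
    "zip_code_of_station": (42, 47),
    "co2": (47, 48),
}


def parse_row(row, datafile):
    """Turn one fixed-width line into a dict ready for insertion."""
    data = {name: row[start:end].strip() for name, (start, end) in FIELDS.items()}
    # Characters -12..-8 of the file name hold the two-digit year and month;
    # the emissions table declares year_month_key as int.
    data["year_month_key"] = int("20" + datafile[-12:-8])
    return data


# Build a made-up sample line by padding each field to its column position.
sample = "1B4GP33R9TB205257".ljust(20) + "DODG" + "1996" + " " * 14 + "60601" + "P"
print(parse_row(sample, "./data/1608demo.dat"))
```

Keeping the offsets in one table means a layout change only has to be made in one place, and the same helper could feed either database.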

MongoDB and Python
MongoDB Code in Python; Details:

Installation

sudo aptitude install mongodb
sudo /etc/init.d/mongodb start
#python
virtualenv -p python3 env_py3
source env_py3/bin/activate
pip install pymongo

Python

#http://api.mongodb.com/python/current/tutorial.html
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
#create database
db = client.vindata
#create collection/table
emissions = db.emissions

#Load data from mydata
import glob
print(glob.glob("./data/*.dat"))
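for datafile in glob.glob("./data/*.dat"):
    #the file name carries the two-digit year and month, e.g. 1608 -> "201608"
    #kept as a string here, so queries must use a string as well
    ymk='20'+datafile[-12:-8]
    with open(datafile, 'r') as f:
        for row in f:
            #slice the fixed-width columns out of each line
            data={}
            data['vin']=row[:20].strip()
            data['make']=row[20:24].strip()
            data['year']=row[24:28].strip()
            data['zip_code_of_station']=row[42:47].strip()
            data['co2']=row[47:48].strip()
            data['year_month_key']=ymk
            #print(data)
            #insert() was removed in PyMongo 4; insert_one() is the replacement
            emissions.insert_one(data)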

#count() was removed in PyMongo 4; count_documents({}) counts every document
emissions.count_documents({})
emissions.find_one()
emissions.find_one({"vin":"1B4GP33R9TB205257"})
#http://altons.github.io/python/2013/01/21/gentle-introduction-to-mongodb-using-pymongo/
#https://www.youtube.com/watch?v=f7l8PTjQ160&index=4&list=PLGOsbT2r-igmFK9IKEGAnBaklqtuW7l8W
#https://www.youtube.com/watch?v=FVyIxdxsyok
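
Inserting one document per call means one round trip to the server per row, which can get slow for large files. A minimal sketch of batching the inserts, assuming PyMongo's insert_many; the insert_in_batches helper and the batch size of 1000 are hypothetical choices and not part of the original script:

```python
from itertools import islice


def insert_in_batches(collection, documents, batch_size=1000):
    """Send documents to the collection in groups of `batch_size`.

    `collection` is expected to be a PyMongo collection (any object with an
    insert_many method works). Returns the number of documents sent.
    """
    iterator = iter(documents)
    sent = 0
    while True:
        # Take the next chunk; an empty chunk means the input is exhausted.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        collection.insert_many(batch)
        sent += len(batch)
    return sent
```

It could be called as insert_in_batches(emissions, docs), where docs is a list or generator of the data dicts built in the loading loop.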

#-------BONUS--------------
import pandas
cursor=emissions.find({"year_month_key":"201608"})
result=pandas.DataFrame(list(cursor))
result.describe()
result.columns
#http://lucasmanual.com/mywiki/Pandas
#later http://alexgaudio.com/2012/07/07/monarymongopandas.html

Sources:
1. https://academy.datastax.com/mongodb-to-cassandra-migration
2. http://www.slideshare.net/nkorla1share/cass-summit-3?qid=f85a27f7-a560-48bb-9d64-6eaa91c39f24&v=&b=&from_search=8
3. https://www.youtube.com/watch?v=tg6eIht-00M
4. https://www.mongodb.com/customers/city-of-chicago
5. https://www.youtube.com/watch?v=FVyIxdxsyok