DataHub is a tool to help you datamine (crawl, parse, and load) any data.



Code Repository:

6 Rules for Data Privacy

  1. Sensitive, and possibly inaccurate, information may not be used against people in financial, political, employment, and health-care settings.
  2. No one should be forced to hide or protect information about themselves to guard against improper information use that significantly limits a person's ability to exercise his or her right to freedom of association.
  3. Implement a basic form of information accountability by tracking identifying information, that is, information that identifies a person or corporation and could be used to hold that person or corporation accountable for compliance.
  4. There should be no restriction on the use of data unless specified by law or these privacy rules.
  5. Privacy is protected not by limiting the collection of data, but by placing strict rules on how the data may be used. Data that can be used in financial, political, employment, and health-care settings cannot be used for marketing or other profiling. Strict penalties should be imposed for breach of these use limitations. Decisions in financial, political, employment, and health-care settings must be justified with reference to the specific data on which the decision was based. If the person or corporation discovers that the data is inaccurate, he or she may demand that it be corrected. Stiff financial penalties should be imposed on any agency that does not make the appropriate corrections.
  6. Achieve greater information accountability by making better use of the information that is collected, retaining the data necessary to hold data users responsible for policy compliance. Build systems that encourage compliance and maximize the possibility of accountability for violations. Technology should support the rules: users comply because they are aware of what the rules are and because they know there will be consequences, after the fact.
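Rules 3 and 5 above can be sketched as a minimal use-log: every access records who used which data and for what purpose, and uses that would repurpose regulated data (for example, health-care data used for marketing) are flagged so violators can be held accountable after the fact. All names below (UseLog, record_use, and so on) are hypothetical illustrations, not part of DataHub.

```python
# Minimal sketch of information accountability (rules 3 and 5).
# All names here are hypothetical; DataHub does not ship this API.

RESTRICTED = {"financial", "political", "employment", "health-care"}

class UseLog:
    """Append-only log of who used which data, and for what purpose."""

    def __init__(self):
        self.entries = []

    def record_use(self, user, dataset, category, purpose):
        """Record a data use; flag repurposing of restricted data.

        Returns True if the use is allowed, False if it violates the
        use limitation (rule 5): restricted-category data may not be
        used for marketing or other profiling.
        """
        allowed = not (category in RESTRICTED
                       and purpose in ("marketing", "profiling"))
        self.entries.append({
            "user": user, "dataset": dataset,
            "category": category, "purpose": purpose,
            "allowed": allowed,
        })
        return allowed

    def violations(self):
        """Entries that can be used to hold a user accountable after the fact."""
        return [e for e in self.entries if not e["allowed"]]
```

For example, recording a health-care use for underwriting is allowed, while the same data used for marketing is recorded as a violation.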

7 rules to get Meaning

Engineering Part

  1. Acquire
  2. Parse
  3. Filter
  4. Mine

Design Part

  5. Represent
  6. Refine
  7. Interact

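The engineering steps above can be sketched as a tiny pipeline. Every function name and the sample data below are illustrative assumptions, not DataHub's API:

```python
# Sketch of the Acquire -> Parse -> Filter -> Mine pipeline.
# Function names and sample data are illustrative only, not DataHub's API.

def acquire():
    """Acquire: obtain the raw data (an inline CSV stands in for a download)."""
    return "city,temp\nChicago,30\nMiami,85\nDetroit,28\n"

def parse(raw):
    """Parse: give the raw text structure (a list of dicts)."""
    lines = raw.strip().splitlines()
    header = lines[0].split(",")
    return [dict(zip(header, line.split(","))) for line in lines[1:]]

def filter_rows(rows):
    """Filter: keep only the records of interest (cold cities)."""
    return [r for r in rows if int(r["temp"]) < 50]

def mine(rows):
    """Mine: derive a statistic (average temperature of what's left)."""
    return sum(int(r["temp"]) for r in rows) / len(rows)

result = mine(filter_rows(parse(acquire())))
```

Each stage takes the previous stage's output, so stages can be developed and swapped independently; the project layout that `paster create -t datahub` generates below (crawl, parse, load) mirrors this split.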
Install DataHub

virtualenv --no-site-packages datahubENV
New python executable in datahubENV/bin/python
Installing setuptools............done.

source datahubENV/bin/activate 

tar -xzvf datahub-0.7.tar.gz

Install it:

cd datahub-0.7/
python setup.py install

Verify the datahub template is now available:

paster create --list-templates

Source Install

virtualenv --no-site-packages BASELINE
source BASELINE/bin/activate

Install Bazaar if it's not already installed on your system:

easy_install bzr

Branch the code. This pulls the full revision history; if you only want the latest revision, use a lightweight checkout (bzr checkout --lightweight) instead of branch:

bzr branch datahub_code

Install it:

cd datahub_code/trunk
python setup.py develop

Create a DataHub-based project

paster create --list-templates
paster create -t datahub

paster create -t datahub
Selected and implied templates:
  PasteScript#basic_package  A basic setuptools-enabled package
  datahub#datahub            DataHub is a tool to help you datamine(crawl, parse, and load) any data.

Enter project name: myproject
  egg:      myproject
  package:  myproject
  project:  myproject
Enter version (Version (like 0.1)) ['']: 
Enter description (One-line description of the package) ['']: my project
Enter long_description (Multi-line description (in reST)) ['']: this is a long description
Enter keywords (Space-separated keywords/tags) ['']: datahub dataprocess
Enter author (Author name) ['']: myname
Enter author_email (Author email) ['']: 
Enter url (URL of homepage) ['']: 
Enter license_name (License name) ['']: 
Enter zip_safe (True/False: if the package can be distributed as a .zip file) [False]: 
Creating template basic_package
Creating directory ./myproject
  Recursing into +package+
    Creating ./myproject/myproject/
    Copying to ./myproject/myproject/
  Copying setup.cfg to ./myproject/setup.cfg
  Copying setup.py_tmpl to ./myproject/
Creating template datahub
  Recursing into +package+
    Copying README.txt_tmpl to ./myproject/myproject/README.txt
    Recursing into crawl
      Creating ./myproject/myproject/crawl/
      Copying Readme.txt_tmpl to ./myproject/myproject/crawl/Readme.txt
      Copying to ./myproject/myproject/crawl/
      Copying to ./myproject/myproject/crawl/
      Copying download_list.txt_tmpl to ./myproject/myproject/crawl/download_list.txt
      Copying harvestman-+package+.xml to ./myproject/myproject/crawl/harvestman-myproject.xml
    Recursing into hdf5
      Creating ./myproject/myproject/hdf5/
      Copying READEM_hdf5.txt_tmpl to ./myproject/myproject/hdf5/READEM_hdf5.txt
      Copying to ./myproject/myproject/hdf5/
    Recursing into load
      Creating ./myproject/myproject/load/
      Copying to ./myproject/myproject/load/
      Copying model.template to ./myproject/myproject/load/model.template
    Recursing into parse
      Creating ./myproject/myproject/parse/
      Copying to ./myproject/myproject/parse/
    Recursing into wiki
      Creating ./myproject/myproject/wiki/
      Copying REAME.wiki_tmpl to ./myproject/myproject/wiki/
Running /home/lucas/tmp/datahubENV/bin/python egg_info
Manually creating paster_plugins.txt (deprecated! pass a paster_plugins keyword to setup() instead)
Adding datahub to paster_plugins.txt

|-- myproject
|   |-- README.txt
|   |--
|   |-- crawl
|   |   |-- Readme.txt
|   |   |--
|   |   |--
|   |   |-- download_list.txt
|   |   `-- harvestman-myproject.xml
|   |-- hdf5
|   |   |-- READEM_hdf5.txt
|   |   `--
|   |-- load
|   |   |--
|   |   `-- model.template
|   |-- parse
|   |   `--
|   `-- wiki
|       `--
|-- myproject.egg-info
|   |-- PKG-INFO
|   |-- SOURCES.txt
|   |-- dependency_links.txt
|   |-- entry_points.txt
|   |-- not-zip-safe
|   |-- paster_plugins.txt
|   `-- top_level.txt
|-- setup.cfg

Get started with your data project



cd crawl
harvestman --genconfig
# Save or edit the harvestman config file, then start downloading with the following command.
harvestman -C harvestman-myproject.xml


# Edit download_list.txt and add the URLs of the files you want to download
cd crawl
vi download_list.txt
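A minimal sketch of reading such a list, assuming one URL per line with '#' starting a comment (the exact format DataHub expects may differ):

```python
# Sketch: read a download_list.txt-style file (one URL per line,
# '#' starts a comment). The format is an assumption, not taken
# from DataHub's documentation.

def read_download_list(text):
    """Return the URLs from the file contents, skipping blanks and comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

# Downloading the listed files could then be done with the standard
# library, for example:
#   from urllib.request import urlretrieve
#   for url in read_download_list(open("download_list.txt").read()):
#       urlretrieve(url, url.rsplit("/", 1)[-1])
```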

Off Topic




MyWiki: DataHub (last edited 2013-01-17 02:31:13 by LukaszSzybalski)