Keywords: Data mining for Beginners, Data Cleaning, First Step in datamining

Main Principle

  1. What's new?
  2. What's Interesting?
  3. Predict for me.

Main Structure

  1. Customer
  2. System, System Integration
  3. Database
  4. Algorithms

Business Challenge

  1. Acquisition
  2. Conversion
  3. Average Order
  4. Retention
  5. Loyalty

== Solution ==

  1. Where is the data? You may need to start from scratch.
  2. Make the data work. Learn to think of data as a strategic asset for the business.
  3. Tie data to revenue. Drive the business from data.
  4. Don't sell data mining. Don't sell data mining capability. Sell operational capability: "I will solve your revenue problem."
  5. Work from high-risk, high-value problems down to low-risk, low-volume ones. Predict lifetime value.
  6. Solve simple problems that a human being could solve.
  7. Is the data able to drive the business?
  8. Turn data-driven capabilities into revenue-driven activities.
  9. Tie data-driven metrics to success measurement.
  10. Build operations to include data, measurability, and a data-driven measurement loop.
  11. Know the domain of application (automotive, insurance, business/customers)

Managing and Maintaining

  1. How do you update it? How do you maintain it?
  2. What happens when you use your tool six months from now? Can you still use it?
  3. Implement econometrics. Account for scarcity.

Effectiveness Measure

  1. Effectiveness measurement.
  2. Success metrics.
  3. Metrics that tell you how long to keep it alive.
  4. Metrics that tell you when to kill it.
  5. Measurement of success.
  6. Quality Assurance of Results.


  1. Find information (15%)
  2. Find direction (40%)
  3. Transaction

User Interaction Control

  1. Not enough control
  2. Too much control
  3. Find the balance.

How to get a usable list of files in a folder (ls -l)

ls -l > myfile.txt

Then open myfile.txt in vim. (Press [Esc] first to make sure you are in normal mode before typing any of the commands below.)
Data Cleaning

Data cleaning with VIM for beginners


X1222 22323 2A22 3303 0000 3334esss test 123
X2222 22353 2A22 3303 0001 3334esss tacd 456
X3222 22383 2A22 3303 0010 3334esss fals 789
X4222 22393 2A22 3303 0011 3334esss true 012

It doesn’t really matter what the data is; this example is somewhat contrived. Suppose you needed to make the following changes on each line that starts with X:

  * change the ID from X_222 to Y_223
  * swap the 4th and 5th fields
  * copy the second character of the line and insert it before the last character of the line

If it were only 4 lines, you could handle this yourself, but it would be very tedious. Suppose rather than 4 lines you had 400; it’d be much easier to automate. The best way to take care of it would be with a macro:

[esc]qa /^X[enter] i[delete]Y[right][right][right][delete]3 [esc]wwwdww[left]p 0[right]d[ctrl+v]y$[shift+p] q

That right there is a MESS, but gets the job done- it’s not something you want to repeat for fear of a typo. Notice that the first characters you typed were qa: ‘q’ starts recording, and ‘a’ is the slot we’re using to store the macro. From here we record how *we* would make the changes, making sure to keep our keystrokes to a minimum. When we’re done, we stop the macro by pressing ‘q’ again.

To run our macro on the next line, press ‘@a’ to run the newly created ‘a’ macro- it should find the next line that starts with an X (notice the /^X in our first command) and run those commands to massage our text.
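If vim isn't at hand, the same cleanup can be scripted. This Python sketch applies the three edits described above (ID rewrite, field swap, character copy), assuming space-separated fields; the reading of "X_222 to Y_223" as "first character X becomes Y, last digit of the ID becomes 3" follows the macro's keystrokes and is an assumption.

```python
def transform(line: str) -> str:
    """Apply the three edits to one X-prefixed line; other lines pass through."""
    if not line.startswith("X"):
        return line
    fields = line.split(" ")
    # 1. Rewrite the ID: X_222 -> Y_223 (first char and final digit change).
    ident = fields[0]                      # e.g. "X1222"
    fields[0] = "Y" + ident[1:-1] + "3"    # -> "Y1223"
    # 2. Swap the 4th and 5th fields.
    fields[3], fields[4] = fields[4], fields[3]
    new_line = " ".join(fields)
    # 3. Copy the line's second character and insert it before the last char.
    return new_line[:-1] + new_line[1] + new_line[-1]

print(transform("X1222 22323 2A22 3303 0000 3334esss test 123"))
# -> Y1223 22323 2A22 0000 3303 3334esss test 1213
```

Looping this over 400 lines is a three-line `for` loop, which is the same leverage the macro gives you inside the editor.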

Remember how we were talking about 400 lines like this? Even at 2 characters each, that’s 800 characters to type which is still annoying. Here’s where the magic comes in- you can record macros of macros:

qb @a@a@a@a@a@a@a@a@a@a q

Now each time you run @b, you'll run the a macro 10 times. A more efficient way to handle this would be:

@a
398@@

The first line was edited manually while recording the macro, @a plays the macro back once, and 398@@ repeats the last-run macro 398 more times, covering all 400 lines.

And there you go- a quick tour of recording macros. I’m sure there’s much more than what I’ve shown, but that’s enough to keep you busy.

Some definitions

Explicit knowledge, inferred knowledge, and embedded knowledge can all help in making decisions about matches and non-matches. Explicit knowledge considers entity attributes that are clearly present in the data instance. The challenge for customer data integration here is to identify linking fields across data sets that may be represented differently. As human beings, we may be able to look at two contact records and figure out whether they refer to the same individual. For a computer system, however, implicit/inferred knowledge means relying on more complex algorithms to compute a degree of similarity and then combining it with other inputs to derive the inference that solidifies the link.
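The "compute a degree of similarity, then combine it with other inputs" idea can be sketched with Python's stdlib fuzzy matcher. The field weights and threshold here are invented for illustration; real record-linkage systems tune these against labeled match/non-match pairs.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def likely_same_person(rec1: dict, rec2: dict, threshold: float = 0.85) -> bool:
    """Combine per-field similarities into one linkage decision."""
    name_sim = similarity(rec1["name"], rec2["name"])
    addr_sim = similarity(rec1["address"], rec2["address"])
    # Weight the name more heavily than the address (an assumed choice).
    score = 0.6 * name_sim + 0.4 * addr_sim
    return score >= threshold

a = {"name": "Robert Smith",  "address": "12 Oak St, Springfield"}
b = {"name": "Robert  Smith", "address": "12 Oak Street, Springfield"}
print(likely_same_person(a, b))
```

The two records differ in spacing and in "St" vs "Street", exactly the kind of representational difference a human resolves at a glance but a system must score.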

Mergers and Acquisitions: The data represents useful information about a company's customers, their opinions, and the products and services of the acquired company. This data is very useful in mapping the path for the newly merged entity. Without a solid data quality solution, managing the data between the parent company and the subsidiary would prove dangerous: the organization would have a difficult time identifying mutual customers, and that confusion would threaten the very relationships it had worked so hard to develop.

CRM and ERP: CRM, ERP, inventory management systems, etc. are implemented to utilize the data and add to it. The data they create and refer to reveals important purchasing patterns, dispatching options, and consumer buying trends. However, inaccurate and incomplete data is not uncommon in most data warehouses, and it has a negative effect on decision making. Frequently, the scope of the problem is such that companies have to introduce an enterprise-wide data quality program. This enables executives to access uniform data from a centralized source and be confident about the decisions taken on the basis of that data. Well-seasoned data quality experts, particularly ones experienced in turning business requirements into lasting data quality standards, are generally far more effective than part-time programmers whose job is to "fix" the data.

Multiple CRM/ERP: "Master data management" is the art of managing the exchange of information, such as customer and product information, between different computer systems. It is very difficult for large companies to maintain multiple sets of master data. Moreover, enterprises often have more than one ERP or CRM system, each different from the others. It is difficult to manage data across these systems, track its path, and maintain accuracy across the master databases. Inconsistency in corporate data can lead to duplication of effort and difficulty in analyzing business performance. Having a centralized master database makes it easier to develop marketing strategies rapidly or to push for more sales via the web. Product numbers, names, brands, formulations, etc. all constitute information that can be very difficult to utilize if it is not easily accessible.


Basically: XML for input specification; output in XML/hdf5.

interested in standard format for interoperability between QMC and QC/DFT packages.

wishes: modular, hierarchical structure; extensible, interoperability

current implementation: hybrid XML plus hdf5; XML stores summaries of sims; hdf5 contains "everything" (like ALPS?). The reason for using a binary format is its efficiency.
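The hybrid scheme can be sketched with Python's stdlib XML tools: a small human-readable XML summary that points at the bulky binary hdf5 file. The tag names, attributes, and file paths below are hypothetical, not the package's actual schema.

```python
import xml.etree.ElementTree as ET

def write_summary(path: str, sim_name: str, energy: float, h5_path: str) -> None:
    """Write a small XML summary that references the heavy HDF5 data."""
    root = ET.Element("simulation", name=sim_name)
    # Summary values live in XML, cheap to read and grep.
    ET.SubElement(root, "observable", name="energy").text = str(energy)
    # The multidimensional/tabular data lives in HDF5; XML only points to it.
    ET.SubElement(root, "data", format="hdf5", file=h5_path)
    ET.ElementTree(root).write(path, encoding="unicode")

write_summary("summary.xml", "run42", -1.2345, "run42.h5")
root = ET.parse("summary.xml").getroot()
print(root.find("observable").text)
# -> -1.2345
```

The split keeps interoperability simple: any tool can parse the XML summary, while only analysis code that needs the raw arrays has to open the hdf5 file.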

why hdf5? highly organized, hierarchical storage format; it has multidimensional arrays ("data sets") and groups (directory-like structures); the key point was hdf5's I/O efficiency for I/O-heavy applications (in particular parallel I/O); a collection of analysis tools for data manipulation and visualization; it is an open standard and platform independent.

initial problem: no fortran library (but new implementations now exist)

benchmarking was performed => it's fast and efficient

stored are, for example, mainly scalar and tabular data; averaged data for tabular data; occasionally time series (for which efficiency was critical).

cool thing about hdf5: it has conversion tools (dump multiple hdf5 data files into a merged one; the library already does a lot of the work: filters, conversion to XML, etc.)

storage backend: filesystem/directories/files

hdf5 has a stable API, proven, known-to-work technology.

MyWiki: DataMining (last edited 2009-09-06 02:49:25 by localhost)