TABLE OF CONTENTS
1 DATA MINING – A GENTLE INTRODUCTION
1.1 INTRODUCTION
1.2 DATA MINING: WHY NOW?
1.2.1 AVAILABILITY OF LARGE DATABASES – DATA WAREHOUSING
1.2.2 PRICE DROP IN DATA STORAGE AND EFFICIENT COMPUTER PROCESSING
1.2.3 NEW ADVANCEMENTS IN ANALYTICAL METHODOLOGY
1.3 BENEFITS OF DATA MINING
1.4 DATA MINING: USERS
1.5 DATA MINING: TOOLS
1.6 DATA MINING: STEPS
1.6.1 IDENTIFICATION OF PROBLEM AND DEFINING THE BUSINESS GOAL
1.6.2 DATA PROCESSING
1.6.3 DATA EXPLORATION AND DESCRIPTIVE ANALYSIS
1.6.4 DATA MINING SOLUTIONS: UNSUPERVISED LEARNING METHODS
1.6.5 DATA MINING SOLUTIONS: SUPERVISED LEARNING METHODS
1.6.6 MODEL VALIDATION
1.6.7 INTERPRET AND MAKE DECISIONS
1.7 PROBLEMS IN DATA MINING PROCESS
1.8 SAS SOFTWARE: THE LEADER IN DATA MINING
1.8.1 SEMMA – THE SAS DATA MINING PROCESS
1.8.2 SAS ENTERPRISE MINER FOR COMPREHENSIVE DATA MINING SOLUTION
1.9 USER-FRIENDLY SAS MACROS FOR DATA MINING
1.10 SUMMARY
1.11 REFERENCES
1.12 DATA MINING: FURTHER READINGS
2 PREPARING DATA FOR DATA MINING
2.1 INTRODUCTION
2.2 DATA REQUIREMENTS IN DATA MINING
2.3 IDEAL STRUCTURES OF DATA FOR DATA MINING
2.4 UNDERSTANDING THE MEASUREMENT SCALE OF VARIABLES
2.5 ENTIRE DATABASE OR REPRESENTATIVE SAMPLE
2.6 SAMPLING FOR DATA MINING
2.6.1 SAMPLE SIZE
2.7 SAS APPLICATIONS USED IN DATA PREPARATION
2.7.1 CONVERTING RELATIONAL DBMS INTO SAS DATA SETS
2.7.1.1 Instruction for extracting SAS data from Oracle database using the SAS SQL Pass-through facility
2.7.1.2 Instruction for creating SAS data set from Oracle database using SAS/ACCESS and the LIBNAME statement
2.7.2 CONVERTING PC BASED DATA FILES
2.7.2.1 Instruction for converting PC data formats to SAS data sets using the SAS IMPORT WIZARD
2.7.2.2 Converting PC data formats to SAS data sets using the “EXCELSAS” SAS macro
2.7.2.3 Steps involved in running the EXCELSAS macro
2.7.2.4 Help file for SAS Macro- EXCELSAS: Description of macro parameters
2.7.2.5 Importing an EXCEL file called ‘fraud’ to a permanent SAS dataset called ‘fraud’
2.7.3 SAS MACRO APPLICATIONS: RANDOM SAMPLING FROM THE ENTIRE DATABASE USING THE SAS MACRO ‘RANSPLIT’
2.7.3.1 Steps involved in running the RANSPLIT macro
2.7.3.2 Help file for SAS Macro- RANSPLIT: Description of macro parameters
2.7.3.3 Drawing training (400), validation (300), and test (all left over observations) samples from the permanent SAS data called ‘fraud’
2.8 SUMMARY
2.9 REFERENCES
2.10 SUGGESTED READINGS
2.11 LIST OF FIGURES
3 EXPLORATORY DATA ANALYSIS
3.1 INTRODUCTION
3.2 EXPLORING CONTINUOUS VARIABLE
3.2.1 DESCRIPTIVE STATISTICS
3.2.1.1 Measures of location or central tendency
3.2.1.2 Robust measures of location
3.2.1.3 Five-Number summary statistics
3.2.1.4 Measures of dispersion
3.2.1.5 Standard errors and confidence interval estimates
3.2.1.6 Detecting deviation from normally distributed data
3.2.2 GRAPHICAL TECHNIQUES USED IN EDA OF CONTINUOUS DATA
3.3 DATA EXPLORATION – CATEGORICAL VARIABLE
3.3.1 DESCRIPTIVE STATISTICAL ESTIMATES
3.3.2 GRAPHICAL DISPLAYS FOR CATEGORICAL DATA
3.4 SAS MACRO APPLICATIONS USED IN DATA EXPLORATION
3.4.1 EXPLORING CATEGORICAL VARIABLES USING THE SAS MACRO ‘FREQ’
3.4.1.1 Steps involved in running the FREQ macro
3.4.1.2 Help file for SAS Macro- FREQ: Description of macro parameters
3.4.1.3 Exploring categorical variables in a permanent SAS dataset ‘gf.cars93’
3.4.2 EDA ANALYSIS OF CONTINUOUS VARIABLES USING SAS MACRO ‘UNIVAR’
3.4.2.1 Steps involved in running the UNIVAR macro
3.4.2.2 Help file for SAS Macro- UNIVAR: Description of macro parameters
3.4.3 CASE STUDY 2: DATA EXPLORATION – CONTINUOUS VARIABLE
3.5 SUMMARY
3.6 REFERENCES
3.7 SUGGESTED READINGS
3.8 LIST OF FIGURES
4 UNSUPERVISED LEARNING METHODS
4.1 INTRODUCTION
4.2 APPLICATIONS OF UNSUPERVISED LEARNING METHODS
4.3 PRINCIPAL COMPONENT ANALYSIS
4.3.1 PCA TERMINOLOGY
4.4 EXPLORATORY FACTOR ANALYSIS
4.4.1 EXPLORATORY FACTOR ANALYSIS VS PCA
4.4.2 EXPLORATORY FACTOR ANALYSIS TERMINOLOGY
4.5 DISJOINT CLUSTER ANALYSIS
4.5.1 TYPES OF CLUSTER ANALYSIS
4.5.2 ‘FASTCLUS’ A SAS PROCEDURE TO PERFORM DCA
4.6 BI-PLOT DISPLAY OF PCA, EFA AND DCA RESULTS
4.7 PCA AND EFA USING SAS MACRO ‘FACTOR’
4.7.1 STEPS INVOLVED IN RUNNING THE ‘FACTOR’ MACRO
4.7.2 HELP FILE FOR SAS MACRO- FACTOR
4.7.3 CASE STUDY 1: PRINCIPAL COMPONENT ANALYSIS OF 1993 CAR ATTRIBUTE DATA
4.7.4 CASE STUDY 2: MAXIMUM LIKELIHOOD FACTOR ANALYSIS WITH ‘VARIMAX’ ROTATION OF 1993 CAR ATTRIBUTE DATA
4.8 DISJOINT CLUSTER ANALYSIS USING SAS MACRO ‘DISJCLUS’
4.8.1 STEPS INVOLVED IN RUNNING THE DISJCLUS MACRO
4.8.2 HELP FILE FOR SAS MACRO- DISJCLUS
4.8.3 CASE STUDY 3: DISJOINT CLUSTER ANALYSIS OF 1993 CAR ATTRIBUTE DATA
4.9 SUMMARY
4.10 REFERENCES
4.11 SUGGESTED READINGS
4.12 LIST OF FIGURES
5 SUPERVISED LEARNING METHODS – PREDICTION
5.1 INTRODUCTION
5.2 APPLICATIONS OF SUPERVISED PREDICTIVE METHODS
5.3 MULTIPLE LINEAR REGRESSION MODELING
5.3.1 MLR KEY CONCEPTS AND TERMINOLOGY
5.3.2 EXPLORATORY ANALYSIS USING DIAGNOSTIC PLOTS
5.3.3 MODEL SELECTION
5.3.4 VIOLATIONS OF REGRESSION MODEL ASSUMPTIONS
5.3.5 REGRESSION MODEL VALIDATION
5.4 BINARY LOGISTIC REGRESSION MODELING
5.4.1 TERMINOLOGY AND KEY CONCEPTS
5.4.2 EXPLORATORY ANALYSIS USING DIAGNOSTIC PLOTS
5.4.3 MODEL SELECTION
5.4.4 CHECKING FOR VIOLATIONS OF REGRESSION MODEL ASSUMPTIONS
5.5 MULTIPLE LINEAR REGRESSION USING SAS MACRO ‘REGDIAG’
5.5.1 STEPS INVOLVED IN RUNNING THE ‘REGDIAG’ MACRO
5.5.2 HELP FILE FOR SAS MACRO- ‘REGDIAG’
5.6 LIFT CHART USING SAS MACRO ‘LIFT’
5.6.1 STEPS INVOLVED IN RUNNING THE ‘LIFT’ MACRO
5.6.2 HELP FILE FOR USING SAS MACRO- ‘LIFT’
5.7 SCORING NEW REGRESSION DATA USING THE SAS MACRO ‘RSCORE’
5.7.1 STEPS INVOLVED IN RUNNING THE ‘RSCORE’ MACRO
5.7.2 HELP FILE FOR USING SAS MACRO- ‘RSCORE’
5.8 LOGISTIC REGRESSION USING SAS MACRO ‘LOGISTIC’
5.8.1 STEPS INVOLVED IN RUNNING THE ‘LOGISTIC’ MACRO
5.8.2 HELP FILE FOR SAS MACRO- ‘LOGISTIC’
5.9 SCORING NEW LOGISTIC REGRESSION DATA USING THE SAS MACRO ‘LSCORE’
5.9.1 STEPS INVOLVED IN RUNNING THE ‘LSCORE’ MACRO
5.9.2 HELP FILE FOR USING SAS MACRO- ‘LSCORE’
5.10 CASE STUDY 1: MODELING MULTIPLE LINEAR REGRESSION
5.11 CASE STUDY 2: MODELING MULTIPLE LINEAR REGRESSION WITH CATEGORICAL VARIABLES
5.12 CASE STUDY 3: MODELING BINARY LOGISTIC REGRESSION
5.13 SUMMARY
5.14 REFERENCES
5.15 LIST OF FIGURES
6 SUPERVISED LEARNING METHODS – CLASSIFICATION
6.1 INTRODUCTION
6.2 DISCRIMINANT ANALYSIS
6.3 STEPWISE DISCRIMINANT ANALYSIS
6.4 CANONICAL DISCRIMINANT ANALYSIS (CDA)
6.4.1 CDA ASSUMPTIONS
6.4.2 KEY CONCEPTS AND TERMINOLOGY IN CDA
6.5 DISCRIMINANT FUNCTION ANALYSIS (DFA)
6.5.1 KEY CONCEPTS AND TERMINOLOGY IN DFA
6.6 APPLICATIONS OF DISCRIMINANT ANALYSIS
6.7 CLASSIFICATION TREE BASED ON CHAID
6.7.1 KEY CONCEPTS AND TERMINOLOGY IN CLASSIFICATION TREE
6.8 APPLICATIONS OF CHAID
6.9 DISCRIMINANT ANALYSIS USING SAS MACRO ‘DISCRIM’
6.9.1 STEPS INVOLVED IN RUNNING THE ‘DISCRIM’ MACRO
6.9.2 HELP FILE FOR SAS MACRO- ‘DISCRIM’
6.10 DECISION TREE USING SAS MACRO ‘CHAID’
6.10.1 STEPS INVOLVED IN RUNNING THE ‘CHAID’ MACRO
6.10.2 HELP FILE FOR SAS MACRO- ‘CHAID’
6.11 CASE STUDY 1: CDA AND PARAMETRIC DFA
6.12 CASE STUDY 2: NON-PARAMETRIC DFA
6.13 CASE STUDY 3: CLASSIFICATION TREE USING CHAID
6.14 SUMMARY
6.15 REFERENCES
6.16 SUGGESTED READINGS
6.17 LIST OF FIGURES
7 EMERGING TECHNOLOGIES IN DATA MINING
7.1 INTRODUCTION
7.2 DATA WAREHOUSING
7.2.1 KEY CONCEPTS IN DATA WAREHOUSING FEATURES
7.3 ARTIFICIAL NEURAL NETWORK METHODS
7.4 MARKET BASKET ANALYSIS
7.4.1 BENEFITS OF MBA
7.4.2 LIMITATIONS OF MBA
7.5 SAS SOFTWARE: THE LEADER IN DATA MINING
7.6 SUMMARY
7.7 REFERENCES
7.8 FURTHER READINGS
8 APPENDIX 1: INSTRUCTION FOR USING THE SAS MACROS