Determine the attribute that influences the outcome most


I have a dataset in .csv format as shown:

NRC_CLASS,L1_MARKS_FINAL,L2_MARKS_FINAL,L3_MARKS_FINAL,S1_MARKS_FINAL,S2_MARKS_FINAL,S3_MARKS_FINAL,
FAIL,7,12,12,24,4,30,
PASS,49,36,46,51,31,56,
FAIL,59,35,42,18,18,45,
PASS,61,30,51,33,30,52,
PASS,68,30,35,53,45,54,
2,82,77,75,32,36,56,
FAIL,18,35,35,32,21,35,
2,86,56,46,44,37,60,
1,94,45,62,70,50,59,

The first column gives the overall grade:

FAIL - Fail
PASS - Pass class
1 - First class
2 - Second class
D - Distinction

This is followed by marks of each student in 6 subjects.

Is there any way I can find out which subject's performance makes the most difference to the overall outcome?

I am using Weka and have built a tree with J48.

The summary of the J48 classifier is:

=== Summary ===

Correctly Classified Instances       30503               92.5371 %
Incorrectly Classified Instances      2460                7.4629 %
Kappa statistic                          0.902 
Mean absolute error                      0.0332
Root mean squared error                  0.1667
Relative absolute error                 10.8867 %
Root relative squared error             42.7055 %
Total Number of Instances            32963 

I also discretized the marks into 10 bins with useEqualFrequency set to true. The J48 summary is now:

=== Summary ===

Correctly Classified Instances       28457               86.3301 %
Incorrectly Classified Instances      4506               13.6699 %
Kappa statistic                          0.8205
Mean absolute error                      0.0742
Root mean squared error                  0.2085
Relative absolute error                 24.3328 %
Root relative squared error             53.4264 %
Total Number of Instances            32963 
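
For readers unfamiliar with the option, equal-frequency discretization places cut points so that each bin holds roughly the same number of instances. The snippet below is only a minimal sketch of that idea on the L1 marks from the sample rows; the function names are hypothetical and Weka's Discretize filter may place its cut points differently, particularly around tied values.

```python
import bisect

def equal_frequency_cuts(values, bins):
    """Return cut points that split `values` into roughly equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for b in range(1, bins):
        idx = b * n // bins
        if idx <= 0 or idx >= n:
            continue
        left, right = ordered[idx - 1], ordered[idx]
        if left == right:
            continue  # cannot cut between tied values
        cut = (left + right) / 2
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts

def assign_bin(value, cuts):
    """Index of the bin that `value` falls into (0-based)."""
    return bisect.bisect_right(cuts, value)

# L1_MARKS_FINAL from the nine sample rows, split into 3 bins for illustration.
l1_marks = [7, 49, 59, 61, 68, 82, 18, 86, 94]
l1_cuts = equal_frequency_cuts(l1_marks, 3)
l1_bins = [assign_bin(v, l1_cuts) for v in l1_marks]
print(l1_cuts, l1_bins)
```

With only a few bins per attribute the tree sees less detail than with the raw marks, which is consistent with the drop in accuracy after discretization.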

There are 3 answers

1
Matthew Spencer On

First of all, you may need to assign a numeric value to each NRC_CLASS value (or, even better, use the actual grade out of 100) so that attribute evaluation is more meaningful.

From there, you could use attribute selection (the Select attributes tab of the Weka Explorer) to find the attributes that have the greatest influence on the overall grade. For example, CorrelationAttributeEval as the Attribute Evaluator combined with the Ranker search method should rank the attributes from most to least important.
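
As a rough illustration of the ranking idea outside Weka, the sketch below encodes NRC_CLASS as an ordinal number and ranks each subject by the absolute Pearson correlation with it. The ordinal mapping in GRADE_RANK and the function names are assumptions of mine, and this is not necessarily what CorrelationAttributeEval computes internally. It uses only the nine sample rows from the question, so the scores say nothing reliable about the full dataset.

```python
import csv
import io
import math

# Sample rows copied from the question (trailing commas dropped).
CSV_TEXT = """NRC_CLASS,L1_MARKS_FINAL,L2_MARKS_FINAL,L3_MARKS_FINAL,S1_MARKS_FINAL,S2_MARKS_FINAL,S3_MARKS_FINAL
FAIL,7,12,12,24,4,30
PASS,49,36,46,51,31,56
FAIL,59,35,42,18,18,45
PASS,61,30,51,33,30,52
PASS,68,30,35,53,45,54
2,82,77,75,32,36,56
FAIL,18,35,35,32,21,35
2,86,56,46,44,37,60
1,94,45,62,70,50,59
"""

# Assumed ordinal encoding of the class labels (worst to best).
GRADE_RANK = {"FAIL": 0, "PASS": 1, "2": 2, "1": 3, "D": 4}

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_by_correlation(csv_text):
    """Return (subject, correlation) pairs, strongest correlation first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    target = [GRADE_RANK[r["NRC_CLASS"]] for r in rows]
    subjects = [c for c in rows[0] if c != "NRC_CLASS"]
    scores = {s: pearson([float(r[s]) for r in rows], target) for s in subjects}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

ranking = rank_by_correlation(CSV_TEXT)
for name, r in ranking:
    print(f"{name}: {r:+.3f}")
```

Correlation only captures a linear relationship between one subject and the grade, so a subject that matters mainly through a pass threshold can be ranked lower than it deserves.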

Hope this helps!

0
Vera On

It seems you want to determine the relative relevance of each attribute. In this case, you need to use a weight learning algorithm. Weka has a few; I just used Relief. Go to the Select attributes tab and, in Attribute Evaluator, select ReliefFAttributeEval; it will select the Search Method for you. Select the attribute that holds the outcome class, then click Start. The results will include the ranked attributes; the highest ranked is the most relevant.
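
To give an idea of what a Relief-style weight means, here is a minimal sketch of basic Relief: for each instance it finds the nearest neighbour of the same class (hit) and the nearest neighbour of any other class (miss), then rewards attributes that differ on the miss and penalises those that differ on the hit. This is a simplification and not Weka's ReliefF, which uses several neighbours and weights the other classes by their prior probability. The function name is hypothetical and the data are just the nine sample rows from the question.

```python
def relief_weights(X, y):
    """Basic Relief weights, one per attribute (higher means more relevant)."""
    n, d = len(X), len(X[0])
    # Normalise per-attribute differences by the observed range.
    span = []
    for j in range(d):
        col = [row[j] for row in X]
        span.append((max(col) - min(col)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    used = 0
    for i in range(n):
        hits = [X[k] for k in range(n) if k != i and y[k] == y[i]]
        misses = [X[k] for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # e.g. a class that has only one instance
        hit = min(hits, key=lambda r: dist(X[i], r))
        miss = min(misses, key=lambda r: dist(X[i], r))
        for j in range(d):
            weights[j] += diff(X[i], miss, j) - diff(X[i], hit, j)
        used += 1
    return [w / used for w in weights] if used else weights

subjects = ["L1", "L2", "L3", "S1", "S2", "S3"]
marks = [
    [7, 12, 12, 24, 4, 30], [49, 36, 46, 51, 31, 56], [59, 35, 42, 18, 18, 45],
    [61, 30, 51, 33, 30, 52], [68, 30, 35, 53, 45, 54], [82, 77, 75, 32, 36, 56],
    [18, 35, 35, 32, 21, 35], [86, 56, 46, 44, 37, 60], [94, 45, 62, 70, 50, 59],
]
classes = ["FAIL", "PASS", "FAIL", "PASS", "PASS", "2", "FAIL", "2", "1"]

subject_weights = relief_weights(marks, classes)
for name, w in sorted(zip(subjects, subject_weights), key=lambda p: -p[1]):
    print(f"{name}: {w:+.3f}")
```

Unlike a plain correlation, this kind of weight looks at neighbouring instances across all attributes at once, so it can pick up attributes that only matter in combination with others.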

0
Marcus On

Given a test data set T with 25 attributes, run 25 rounds (i = 1, ..., 25); in round i, replace the values of the i-th attribute with random values (noise) and leave the other attributes untouched. Compare the test performance of each round with the baseline where no attribute was replaced, and identify the round in which the performance dropped the most.

If the largest drop occurred in, say, round 13, this indicates that attribute 13 is the most important one.
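
The procedure can be sketched as follows. This is only an illustration under assumptions: `predict` stands in for whatever trained model you have (here a hypothetical rule on the first attribute), the noise is drawn uniformly from each attribute's observed range, and each round is repeated several times and averaged to reduce the effect of chance. The toy data have 2 attributes rather than 25.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def noise_importance(predict, X, y, rng, repeats=20):
    """Return (baseline accuracy, average accuracy drop per attribute)."""
    base = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        lo, hi = min(col), max(col)
        total = 0.0
        for _ in range(repeats):
            # Replace attribute j with noise; keep the other attributes intact.
            noisy = [list(row[:j]) + [rng.uniform(lo, hi)] + list(row[j + 1:])
                     for row in X]
            total += base - accuracy(predict, noisy, y)
        drops.append(total / repeats)
    return base, drops

# Toy test set: the label depends only on attribute 0; attribute 1 is irrelevant.
rng = random.Random(0)
X_test = [[rng.uniform(0, 100), rng.uniform(0, 100)] for _ in range(200)]
y_test = [int(row[0] > 50) for row in X_test]

def predict(row):
    # Hypothetical stand-in for a trained classifier.
    return int(row[0] > 50)

base, drops = noise_importance(predict, X_test, y_test, rng)
most_important = max(range(len(drops)), key=lambda j: drops[j])
print(base, drops, most_important)
```

Note that the model is never retrained: only the test inputs are disturbed, so the drop measures how much the existing model relies on each attribute.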