I have a dataset in .csv format as shown:
NRC_CLASS,L1_MARKS_FINAL,L2_MARKS_FINAL,L3_MARKS_FINAL,S1_MARKS_FINAL,S2_MARKS_FINAL,S3_MARKS_FINAL,
FAIL,7,12,12,24,4,30,
PASS,49,36,46,51,31,56,
FAIL,59,35,42,18,18,45,
PASS,61,30,51,33,30,52,
PASS,68,30,35,53,45,54,
2,82,77,75,32,36,56,
FAIL,18,35,35,32,21,35,
2,86,56,46,44,37,60,
1,94,45,62,70,50,59,
The first column gives the overall grade:
FAIL - Fail
PASS - Pass class
1 - First class
2 - Second class
D - Distinction
This is followed by marks of each student in 6 subjects.
Is there any way I can find out which subject's performance makes the most difference to the overall outcome?
I am using Weka and used J48 to build a tree.
The summary of the J48 classifier is:
=== Summary ===
Correctly Classified Instances 30503 92.5371 %
Incorrectly Classified Instances 2460 7.4629 %
Kappa statistic 0.902
Mean absolute error 0.0332
Root mean squared error 0.1667
Relative absolute error 10.8867 %
Root relative squared error 42.7055 %
Total Number of Instances 32963
I also discretized the marks into 10 bins with useEqualFrequency set to true. The J48 summary is now:
=== Summary ===
Correctly Classified Instances 28457 86.3301 %
Incorrectly Classified Instances 4506 13.6699 %
Kappa statistic 0.8205
Mean absolute error 0.0742
Root mean squared error 0.2085
Relative absolute error 24.3328 %
Root relative squared error 53.4264 %
Total Number of Instances 32963
First of all, you may need to map each NRC_CLASS value to a number that reflects its rank (or, even better, use the actual overall grade out of 100 if you have it), so that attribute evaluation treats the outcome as ordered rather than as unrelated labels.
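As a minimal sketch of that mapping, the Python snippet below reads rows in the question's CSV layout and replaces NRC_CLASS with a numeric rank. The ordering FAIL < PASS < 2 < 1 < D and the codes 0 to 4 are my assumptions, not something taken from your data, and `GRADE_ORDER` and `load_rows` are hypothetical names used only for illustration.

```python
import csv
import io

# Assumed ranking from worst to best; the numeric codes are illustrative only.
GRADE_ORDER = {"FAIL": 0, "PASS": 1, "2": 2, "1": 3, "D": 4}

# Sample rows copied from the question (note the trailing comma on each line).
SAMPLE = """\
NRC_CLASS,L1_MARKS_FINAL,L2_MARKS_FINAL,L3_MARKS_FINAL,S1_MARKS_FINAL,S2_MARKS_FINAL,S3_MARKS_FINAL,
FAIL,7,12,12,24,4,30,
PASS,49,36,46,51,31,56,
FAIL,59,35,42,18,18,45,
PASS,61,30,51,33,30,52,
PASS,68,30,35,53,45,54,
2,82,77,75,32,36,56,
FAIL,18,35,35,32,21,35,
2,86,56,46,44,37,60,
1,94,45,62,70,50,59,
"""


def load_rows(text):
    """Return (header, rows) where each row is (numeric_grade, [marks])."""
    reader = csv.reader(io.StringIO(text))
    # The trailing comma produces one empty field at the end; drop it.
    header = [name for name in next(reader) if name]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        fields = record[:len(header)]
        grade = GRADE_ORDER[fields[0]]
        marks = [int(value) for value in fields[1:]]
        rows.append((grade, marks))
    return header, rows


if __name__ == "__main__":
    header, rows = load_rows(SAMPLE)
    for grade, marks in rows:
        print(grade, marks)
```

The same effect can be had by editing the class column before loading the file into Weka; the point is only that the outcome becomes a number whose order is meaningful.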
From there, you could use attribute selection (the Select attributes tab in the Weka Explorer) to find the attributes that have the greatest influence on the overall grade. For example, CorrelationAttributeEval as the attribute evaluator, combined with the Ranker search method, would rank the subjects from most to least correlated with the outcome.
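To illustrate the idea outside Weka, here is a rough Python sketch that ranks the six subjects by the absolute Pearson correlation between their marks and a numeric grade. It is a stand-in for the concept, not Weka's implementation; the grade ordering and codes in `GRADE_ORDER` are assumptions, and `pearson` and `rank_subjects` are hypothetical helper names. It runs on the nine sample rows from the question, which are far too few to draw conclusions from.

```python
import math

# Assumed ranking from worst to best; the numeric codes are illustrative only.
GRADE_ORDER = {"FAIL": 0, "PASS": 1, "2": 2, "1": 3, "D": 4}

SUBJECTS = [
    "L1_MARKS_FINAL", "L2_MARKS_FINAL", "L3_MARKS_FINAL",
    "S1_MARKS_FINAL", "S2_MARKS_FINAL", "S3_MARKS_FINAL",
]

# The nine sample rows from the question: (NRC_CLASS, marks per subject).
ROWS = [
    ("FAIL", [7, 12, 12, 24, 4, 30]),
    ("PASS", [49, 36, 46, 51, 31, 56]),
    ("FAIL", [59, 35, 42, 18, 18, 45]),
    ("PASS", [61, 30, 51, 33, 30, 52]),
    ("PASS", [68, 30, 35, 53, 45, 54]),
    ("2", [82, 77, 75, 32, 36, 56]),
    ("FAIL", [18, 35, 35, 32, 21, 35]),
    ("2", [86, 56, 46, 44, 37, 60]),
    ("1", [94, 45, 62, 70, 50, 59]),
]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_subjects(rows):
    """Return (subject, correlation) pairs sorted by |correlation|, highest first."""
    grades = [GRADE_ORDER[label] for label, _ in rows]
    scores = []
    for index, name in enumerate(SUBJECTS):
        marks = [subject_marks[index] for _, subject_marks in rows]
        scores.append((name, pearson(marks, grades)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    for name, r in rank_subjects(ROWS):
        print(f"{name}: {r:+.3f}")
```

Keep in mind that correlation only captures linear, one-attribute-at-a-time relationships, so it is worth comparing the ranking against the attributes J48 chooses near the root of its tree.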
Hope this helps!