Feature selection and classification model construction on type 2 diabetic patients’ data.

Yue Huang, PJ McCullagh, Norman Black, Roy Harper

Research output: Contribution to journal › Article › peer-review

102 Citations (Scopus)


Summary

Objective: Diabetes affects between 2% and 4% of the global population (up to 10% in the over 65 age group), and its avoidance and effective treatment are undoubtedly crucial public health and health economics issues in the 21st century. The aim of this research was to identify significant factors influencing diabetes control, by applying feature selection to a working patient management system to assist with ranking, classification and knowledge discovery. The classification models can be used to determine individuals in the population with poor diabetes control status based on physiological and examination factors.

Methods: The diabetic patients’ information was collected by Ulster Community and Hospitals Trust (UCHT) from year 2000 to 2004 as part of clinical management. In order to discover key predictors and latent knowledge, data mining techniques were applied. To improve computational efficiency, a feature selection technique, feature selection via supervised model construction (FSSMC), an optimisation of ReliefF, was used to rank the important attributes affecting diabetic control. After selecting suitable features, three complementary classification techniques (Naïve Bayes, IB1 and C4.5) were applied to the data to predict how well the patients’ condition was controlled.

Results: FSSMC identified patients’ ‘age’, ‘diagnosis duration’, the need for ‘insulin treatment’, ‘random blood glucose’ measurement and ‘diet treatment’ as the most important factors influencing blood glucose control. Using the reduced features, a best predictive accuracy of 95% and sensitivity of 98% was achieved. The influence of factors, such as ‘type of care’ delivered, the use of ‘home monitoring’, and the importance of ‘smoking’ on outcome can contribute to domain knowledge in diabetes control.

Conclusion: In the care of patients with diabetes, the more important factors identified: patients’ ‘age’, ‘diagnosis duration’ and ‘family history’, are beyond the control of physicians. Treatment methods such as ‘insulin’, ‘diet’ and ‘tablets’ (a variety of oral medicines) may be controlled. However, lifestyle indicators such as ‘body mass index’ and ‘smoking status’ are also important and may be controlled by the patient. This further underlines the need for public health education to aid awareness and prevention. More subtle data interactions need to be better understood and data mining can contribute to the clinical evidence base. The research confirms and to a lesser extent challenges current thinking. Whilst fully appreciating the requirement for clinical verification and interpretation, this work supports the use of data mining as an exploratory tool, particularly as the domain is suffering from a data explosion due to enhanced monitoring and the (potential) storage of this data in the electronic health record. FSSMC has proved a useful feature estimator for large data sets, where processing efficiency is an important factor.
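
The rank-then-classify workflow described in the summary can be illustrated with a small sketch. This is not the authors' FSSMC implementation, and it is not full ReliefF either: it uses the basic two-class Relief weighting (one nearest hit and one nearest miss per instance) as a simpler stand-in, and a one-nearest-neighbour classifier in the spirit of IB1, evaluated by leave-one-out. Naïve Bayes and C4.5 are omitted. The toy rows, the function names and the choice of keeping the top two features are all hypothetical; none of the values come from the UCHT data.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]


def manhattan(a: Row, b: Row, features: Sequence[int]) -> float:
    """Manhattan distance restricted to the given feature indices."""
    return sum(abs(a[f] - b[f]) for f in features)


def relief_weights(X: List[Row], y: List[int]) -> List[float]:
    """Basic two-class Relief (an assumption; not FSSMC or ReliefF).

    A feature gains weight when it differs on the nearest miss (other class)
    and loses weight when it differs on the nearest hit (same class).
    Feature values are assumed to be pre-scaled to [0, 1].
    """
    features = range(len(X[0]))
    weights = [0.0] * len(X[0])
    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [x for j, (x, c) in enumerate(zip(X, y)) if j != i and c == yi]
        misses = [x for x, c in zip(X, y) if c != yi]
        hit = min(hits, key=lambda x: manhattan(xi, x, features))
        miss = min(misses, key=lambda x: manhattan(xi, x, features))
        for f in features:
            weights[f] += (abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])) / len(X)
    return weights


def one_nn_leave_one_out(X: List[Row], y: List[int],
                         features: Sequence[int]) -> List[int]:
    """Predict each row's class from its nearest other row (1-NN)."""
    predictions = []
    for i, xi in enumerate(X):
        others = [(manhattan(xi, x, features), c)
                  for j, (x, c) in enumerate(zip(X, y)) if j != i]
        predictions.append(min(others, key=lambda pair: pair[0])[1])
    return predictions


def accuracy_and_sensitivity(y_true: List[int], y_pred: List[int],
                             positive: int = 1) -> Tuple[float, float]:
    """Accuracy = correct / total; sensitivity = TP / (TP + FN)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return correct / len(y_true), tp / (tp + fn)


# Hypothetical toy data: three scaled features per patient,
# class 1 = poor control, class 0 = good control.
X = [(0.90, 0.8, 0.1), (0.80, 0.6, 0.9), (0.85, 0.7, 0.5), (0.95, 0.9, 0.3),
     (0.10, 0.3, 0.2), (0.20, 0.4, 0.8), (0.15, 0.2, 0.6), (0.05, 0.1, 0.4)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights = relief_weights(X, y)
ranking = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
top_features = ranking[:2]  # keep the two highest-ranked features
predictions = one_nn_leave_one_out(X, y, top_features)
accuracy, sensitivity = accuracy_and_sensitivity(y, predictions)
print("ranking:", ranking, "accuracy:", accuracy, "sensitivity:", sensitivity)
```

Sensitivity is reported alongside accuracy because, in a screening setting like the one described, missing a poorly controlled patient (a false negative) is the costlier error.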
Original language: English
Pages (from-to): 251-262
Journal: Artif. Intell. Med.
Issue number: 3
Publication status: Published - 2007



  • Type 2 diabetes
  • Blood glucose
  • Data mining
  • Classification
  • Feature selection

