The use of knowledge discovery databases in the identification of patients with colorectal cancer
Cowley, Jonathan Bowes
Thesis or dissertation
- © 2012 Jonathan Bowes Cowley. All rights reserved. No part of this publication may be reproduced without the written permission of the copyright holder.
Colorectal cancer is one of the most common forms of malignancy, with 35,000 new patients diagnosed annually within the UK. Survival figures show that outcomes are less favourable in the UK than in the USA and Europe, with data from 2000 showing that 1 in 4 patients had incurable disease at presentation.
Epidemiologists have demonstrated that the incidence of colorectal cancer is highest in the industrialised western world, with numerous contributory factors. These range from a genetic component to concurrent medical conditions and personal lifestyle. In addition, data demonstrate that environmental changes play a significant role, with immigrants rapidly reaching the incidence rates of the host country.
Detection of colorectal cancer remains an important and evolving aspect of healthcare, with the aim of improving outcomes by earlier diagnosis. This process was initially revolutionised within the UK in 2002 by the ACPGBI 2 week wait guidelines, which facilitate referrals from primary care, and other schemes such as bowel cancer screening have subsequently been introduced to increase earlier detection rates. Whereas the national screening programme is dependent on faecal occult blood testing (FOBT), the standard referral practice is dependent upon a number of trigger symptoms that qualify for an urgent referral to a specialist for further investigations. This process only identifies 25-30% of those with colorectal cancer and remains labour intensive, with only 10% of those seen in the 2 week wait clinics having colorectal cancer.
This thesis tests the hypothesis that a patient symptom questionnaire, used in conjunction with knowledge discovery techniques such as data mining and artificial neural networks, could identify patients at risk of colorectal cancer who therefore warrant urgent further assessment. Artificial neural networks and data mining methods are used widely in industry to detect consumer patterns, through an inbuilt ability to learn from previous examples within a dataset and to model often complex, non-linear patterns. Within medicine these methods have been applied to a host of diagnostic tasks, from the detection of myocardial infarction to the Papnet cervical smear programme for cervical cancer detection.
A Likert-based questionnaire completed by those attending the 2 week wait fast track colorectal clinic was used to produce a ‘symptoms’ database. This was then correlated with individual patient diagnoses upon completion of their clinical assessment. A total of 777 patients were included in the study and their diagnoses were categorised into a dichotomous variable to create a selection of datasets for analysis. These datasets were then used by the author to create a total of four primary databases based on all questions, 2 week wait trigger symptoms, best knowledge questions and symptoms identified as significant in univariate analysis. Each of these databases was entered into an artificial neural network programme, altering the number of hidden units and layers to obtain a selection of outcome models that could be further tested against a selection of set dichotomous outcomes. Outcome models were compared for sensitivity, specificity and risk. Further experiments were carried out with data mining techniques and the WEKA package to identify the most accurate model. Both were then compared with the accuracy of a colorectal specialist and a GP.
Analysis of the data identified that 24% of those referred on the 2 week wait referral pathway failed to meet the referral criteria set out by the ACPGBI. The incidence of colorectal cancer was 9.5% (74 patients), which is in keeping with other studies, and the main symptoms were rectal bleeding, change in bowel habit and abdominal pain. The optimal knowledge discovery database (KDD) model was a back propagation ANN using all variables for the outcome cancer/not cancer, with sensitivity of 0.9, specificity of 0.97 and LR 35.8. Artificial neural networks remained the more accurate modelling method for all the dichotomous outcomes.
The comparison of GPs and colorectal specialists at predicting outcome demonstrated that the colorectal specialists were the more accurate predictors of cancer/not cancer, with sensitivity 0.27 and specificity 0.97 (95% CI 0.6-0.97, PPV 0.75, NPV 0.83) and LR 10.6. When compared with the KDD models for predicting the same outcome, the ANN models were once again more accurate, with the optimal model having sensitivity 0.63, specificity 0.98 (95% CI 0.58-1, PPV 0.71, NPV 0.96) and LR 28.7.
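The accuracy measures quoted above (sensitivity, specificity, PPV, NPV and LR) can all be derived from a 2x2 table of predicted against confirmed diagnosis. The following is a minimal sketch rather than the method used in the thesis: it assumes that LR denotes the positive likelihood ratio, the function name `diagnostic_metrics` is hypothetical, and the counts in the usage example are invented for illustration and are not study data.

```python
import math


def diagnostic_metrics(tp, fn, fp, tn):
    """Summarise a 2x2 table of predicted versus confirmed diagnosis.

    tp/fn: patients with cancer predicted as cancer / as not cancer.
    fp/tn: patients without cancer predicted as cancer / as not cancer.
    """
    if min(tp, fn, fp, tn) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one patient with and one without cancer")

    sensitivity = tp / (tp + fn)          # proportion of cancers detected
    specificity = tn / (tn + fp)          # proportion of non-cancers cleared
    false_positive_rate = fp / (fp + tn)  # equals 1 - specificity

    # Predictive values are undefined if no patient received that prediction.
    ppv = tp / (tp + fp) if (tp + fp) > 0 else math.nan
    npv = tn / (tn + fn) if (tn + fn) > 0 else math.nan

    # Positive likelihood ratio (assumed meaning of "LR" in the text):
    # sensitivity / (1 - specificity); infinite when there are no false positives.
    if false_positive_rate > 0:
        lr_positive = sensitivity / false_positive_rate
    else:
        lr_positive = math.inf

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "lr_positive": lr_positive,
    }


if __name__ == "__main__":
    # Invented counts for illustration only (not taken from the thesis).
    example = diagnostic_metrics(tp=8, fn=2, fp=10, tn=90)
    for name, value in example.items():
        print(f"{name}: {value:.3f}")
```

Because the likelihood ratio divides by one minus the specificity, it is very sensitive to rounding when specificity is close to 1, so a ratio recomputed from figures rounded to two decimal places may differ from one calculated on the raw counts.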
The results demonstrate that diagnosing colorectal cancer remains a challenging process, both for clinicians and for computational models. KDD models have been shown to be consistently more accurate than clinicians alone in predicting which patients have colorectal cancer when used solely in conjunction with a questionnaire. It would be ill-conceived to suggest that KDD models could be used as a replacement for clinician-patient interaction, but they may help to accelerate some patients towards further investigations or ‘straight to test’ if used on those referred as routine patients.
- Postgraduate Medical Institute, The University of Hull