|Home | About | Journals | Submit | Contact Us | Français|
The amount of data in electronic and real world is constantly on the rise. Therefore, extracting useful knowledge from the total available data is very important and time consuming task. Data mining has various techniques for extracting valuable information or knowledge from data. These techniques are applicable for all data that are collected inall fields of science. Several research investigations are published about applications of data mining in various fields of sciences such as defense, banking, insurances, education, telecommunications, medicine and etc. This investigation attempts to provide a comprehensive survey about applications of data mining techniques in breast cancer diagnosis, treatment & prognosis till now. Further, the main challenges in these area is presented in this investigation. Since several research studies currently are going on in this issues, therefore, it is necessary to have a complete survey about all researches which are completed up to now, along with the results of those studies and important challenges which are currently exist in this area for helping young researchers and presenting to them the main problems that are still exist in this area.
Nowadays in all fields of sciences including genetics, education, earth science, agriculture and medicine the amount of data is increasing dramatically. Analyzing these huge amount of data to extract the novel and usable information or knowledge is very complicated and time consuming task. Data mining techniques are useful for this matter.
Generally, in the medical world, there are two phases for making the decisions. These two phases are :
• Differential Diagnosis (DD): in this phase, all information of patients including their medical history, symptoms of disease, results of various testing such as blood testing and etc. are perceivedby doctors as the input data. These data are processed by doctors based on their medical knowledge for disease diagnosis. Sometimes several diseases havesome similar symptoms, therefore, medical doctors must be assign arbitrary weights to each one of inputsand make patterns, match these patterns with the patterns of various diseases and finally select the closest match and diagnosis the exact disease.
• Final or Provisional Diagnosis (FD): in this phase, the preliminary recommendations and treatments would be start according to the identified disease. In this step, a physician with medical knowledge and his/her logic, continues checkups and records the results of continually perceives or tests, and decides the final treatments and prognosis.
Data mining has various techniques (such as: Classification, Clustering, Regression, Association Rules and etc.,) and algorithms (such as: Decision Trees, Genetic Algorithm, Nearest Neighbor method etc.,) for analyzing the huge amount of raw or multi-dimensional data. In the other words, data mining has capabilities for intelligent data analysis to extract hidden knowledge from large databases of medical or clinical data that are collected from medical centers or hospitals. These knowledge provide useful information to improve decision support, prevention, diagnosis and treatment in medical world. Further, data mining has ability to identify association rules or establish relationships between various features such as: patient’s personal data, disease symptoms and etc. .
This investigation attempts to represent the results of several researchworks which are published related to data mining applications in prediction, diagnosis or treatment of breast cancers.
This paper is organized in four sections. Section 2nd includes some basic concepts related to this paper. Section 3rd presents data mining applications or usages in early diagnosis, treatment and prognosis of various cancers. Section 4th concludes this paper and presents our future works.
This section presents the concepts of data mining and its different techniques.
Data mining and knowledge discovery in databases (KDD) are extracting novel, understandable and useful information, knowledge or patterns from huge amount of available data . In the other words, data mining has capabilities for analyzingthe large datasets, finding unexpected or hidden relationships between various attributes and summarizing the extractedinformation more understandable and useful to data users or owners. In the traditional model for transforming data to knowledge, some manual analysis and interpretation are executed. For example, in medical centers, generally doctors or specialists manually analyze current trends, disease and health-care data, then make a report and use this report for decision making or planning for medical diagnosis, treatments and etc. The problem of this type of data analysis is that, this form of manual data analysis is slow, expensive, time consuming, and highly subjective. However, KDD has various data processing steps including :
• Selection: selecting target or relevant data based on the goal or data mining task.
• Pre-processing: removing missing, incorrect, noisy, and inconsistent or no quality data.
• Transformation: includes smoothing, aggregation, generalization or normalization and attribute/feature selection.
• Data mining: applying data mining methods or techniques for extracting interesting patterns.
• Interpretation/Evaluation: includes statistical validation, qualitative review and etc.
Data mining has two main tasks:
• Predictive tasks: with applying various techniques or algorithms, it can make decisions or predict the unknown or future values of other variables. These techniques includes classification, association rule and etc.
• Descriptive tasks: describe the data or find human understandable patterns and present the results in tables, diagrams and etc., which can be understand easily by data owners or data users.
The remaining of this section, provides a short review of common data mining techniques.
Classification is called as supervise learning. It take some of data (named as training set) which has collection of records and each record contain set of attributes and define one attribute named as class. The main goal of classification is producing a model with capability of predicting the value of class attribute in previously unseen records as accurately as possible. A test set is used for predicting the accuracy of the created model . Some applications of classification in medical diagnosis are: classifying tumor cells, analyzing the effectiveness of treatmentand etc.
Several classification algorithms and techniques are proposed such as [6-10]: Decision TreeInduction (ID3 & C4.5, Hunt’s Algorithm and etc.,), Rule-Based Methods, Memory-Based Methods (such as: k-Nearest-Neighbor), Genetic Programming [9,10], Naïve Bayes  and Bayesian Classification , Artificial Neural Networks , Support Vector Machines (SVMs) , Ensemble Methods  and etc.
Association Rule is one the most important techniques of data mining. It attempts to extract frequent patterns and interesting relationships between different sets of items , and etc.
Association rule mining has various applications  in biology (exp. for detecting relationships between gens and environment), healthcare settings and Medical Diagnosis [18-20], Critical Medical Applications , Maximal-Profit Item Selection (MPIS) , Privacy preservation , astrophysics , crime prevention , counterterrorism , Business [27,28], GIS  and etc.
There are several association rule mining techniques that are proposed such as: Apriori algorithm and its extension which is called as AprioriTID [30,31], DIC , STEM , ICAP (Incremental Constrained APriori , CARMA (Continuous Association Rule Mining Algorithm) , RARM (Rapid Association Rule Mining) , FP-Tree algorithm , Goethals FP-Growth , Broglet’s FP-Growth, Eclat and SaM algorithm , COFI-Tree and CT-PRO , Recursive Elimination  and etc.
Regression same as classification attempts to predict the value of an attribute (variable) based on the value/values of other attributes/variables. The main difference between classification and regression is related to the type of target attribute (variable) that must be predict based on the value of other attributes/variables. The target variable in classification is categorical in nature. Whereas in regression, the target variable is numeric or continuous. Further, in classification, classes are created whereas in regression there is no classes, and all data is divided in various split points and for each split point the amount of “error” is equal to square of differences between amount of actual value and predicted value. The amount of split points error across different variables are compared and minimum split point error is selected as the split point/root node. This process recursively continued . In the other words, the main objectives of regression are:
• Dividing the set of data into two continuous variables then describe the associations or relationships between them.
• Find the value of attributes/variables.
• Predict the value of one attribute/variable based on the value of other attribute/variable.
• Control the accuracy of prediction.
Regression has several applications including: estimation of agricultural data , Geography , marketing , business , Financial Forecasting , medical diagnosis and cancer diagnosis or prognosis , Predicting Laptop Retail Price  and etc.
Clustering is unsupervised learning and divides the data into groups (call as clusters) based on their similar attributes . All objects into one cluster are similar with each other and dissimilar with objects in other clusters. Clustering is widely used in: science and statistics [50,51], pattern recognition [52,53], image processing and segmentation , Web applications [55-58], DNA analysis in biology , GIS [60,61] and etc.
Different clustering algorithms are available. Some of these algorithms are: Hierarchical Methods [62,63] (Divisive Algorithms & Agglomerative Algorithms), Partitioning Methods  (Relocation Algorithms, K-medoids Methods, K-means Methods, Probabilistic Clustering, Density-Based Algorithms), Grid-Based Methods , Constraint-Based Clustering , Clustering Algorithms Used in Machine Learning , Scalable Clustering Algorithms [68,69], Algorithms For High Dimensional Data  (Subspace Clustering, Co-Clustering Techniques, Projection Techniques) and Methods Based on Co-Occurrence of Categorical Data .
Cancer is one type of disease. It is happening when cells growth in part of human body becomes out-of-control. In the other words, whenevercells in part of the body divide uncontrollably and damage the other cells, cancer is occurred. Nowadays, more than 100 types of cancers based on the part of body where it’s appeared, or cells that are affected, have been classified. Currently, cancer becameone of the main causes of death in allover of the world. Several factors affect the creation or spreading cancers including: gender, age, genetics, marital status, quality of life, living location and etc.
In some cases, cancer makes a masses of tissue in part of the body which is called as tumors. These tumors can grow and effect various organs of body such as nervous, digestive or circulatory systems. However, when a tumor spreads to other parts of the body, invading or destroying other tissues, it is called as metastasized and this process is called metastasis. When tumor becomes in this stage, it is very difficult to treat that. Therefore, one of the important issues in treatment process of tumors and cancers is related to the time of diagnosis of that cancer or tumor. The early diagnosis of cancers increases the chance of their treatment. For this reason, several researchers attempts to create intelligent systems to assist the doctors for early diagnosis of cancers. For example, P. Ramachandran et al.  analyzed the effect of four attributes including age, sex, marital status and educational qualification on cancer creation. Further, these researchers made useful patterns for cancer diagnosis.
Two types of tumors are identified:
• Benign: this type of tumors is not dangerous for human body and rarely causes for human death. In this type, tumor grows in one part (spot) of body and has limited growth.
• Malignant: this type of tumors is more dangerous andhas two types of effects on human body:
○ A cancerous cell with uncontrolled growth spread with invasion lymph system destroys healthy tissues. In the other words, it metastasis to other tissues and make problem in their general actions or duties.
○ A cancerous cell continually growth and with angiogenesis process makes new blood vessels to feed itself. Therefore, it uses body’s blood and can cause Anemia.
Currently, one of the main challenges in the area of cancer treatment is, identifying the most common symptoms which can help for earlier diagnosis of cancers. Several research studies have been conducted to extract the patterns of cancers and create intelligent or fast method for diagnosis of tumors in the early stages and suggest the best treatments. Based on the results of several studies, some of the symptoms that are shown in various cancers have been listed below .
• Losing or gaining weight abnormally (in a short time)
• Blood in stool
• Persistent cough
• Problem or Difficulty in swallowing
• Abnormal Hoarseness
• Persistent joint pain or unexplained muscles
• Persistent night sweats or unexpected fever
• Any changes that are happening in skin color.
• Lump or area of thickness that may be felt in each part of the body or under the skin
• Any changes that may be happen in bladder habits or bowel
Breast Cancer is one of the most common type of the cancers in women which is affecting approximately 12.5% of all women in all around of the world. Moreover, developing countries have growing breast cancer epidemic with an increasing number of younger women which are susceptible to the cancer.
There are two types of breast cancers which must be a doctor distinguish in the time of diagnosis and this identification is very important for starting treatment process and Prognosis the time period of recur patients. These two types are:
• Benign breast lump or non-cancerous: the size of tumour and its texture is understandable during the roughly examination.
• Malignant breast lump or cancerous: clinical diagnosis requires for predicting this type of cancer. Two types of Malignant cancers are exist:
○ Non-invasive: The malignant cells have not spread in other tissues.
○ Invasive: The malignant cells or cancer have spread into the surrounding tissue.
Since early detection of this cancer can be help for effective treatment, therefore, several efforts are done to achieve early detection of this disease. The main reason of this disease is not known to scientists till now, but some risk factors are recognized that increase the likelihood of breast cancer creation in female, which are broadly classified into non-modifiable and modifiable factors. Some of the non-modifiable factors are :
• Gender (Being a woman)
• A family background of having breast or ovarian cancer
• Starting menopause after age 55
• Having high bone or breast density
• A history of ovarian or breast cancer
Further, some of the modifiable factors are:
• Women with no children
• Number of abortions
• Age at first child birth (if age > 35 risk is high)
• Duration of breast feeding to child
• Having Frequently X-Rays
• Using Estrogen or Estrogen-Plus Progestin hormone
• Alcohol and alcoholic beverages
• Diet and food habits
There are six stages for breast cancer. Among these stages, stage 0 is the most primary stage of this disease and stage IV is the most dangerous and advanced stage . Currently for the aim of diagnosis of breast cancers and identifying the stage of cancer, some common methods are used such as: Mammography, Biopsy, Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) . Further, some data mining techniques are useful for cancers’ pattern recognition purpose. In the continue of this sub-section we attempt to cover some of the efforts which are done related to data mining applications in breast cancer diagnosis, treatment or prognosis respectively.
Breast Cancer Diagnosis is recognizing benign from malignant breast cancers and lumps and Breast Cancer Prognosis predicts the high risk people in aspect of breast cancer or predicting recurrence of cancers after treatment or removing their cancers.
The interesting point which we saw in almost all of these research papers was related to the databases which are used. Almost all of them used the following databases in their researches:
• Wisconsin breast cancer dataset (WBCD) has 11 attributes [74, 77, 79, 80, 92, 97 115, 121 and 125]
• Wisconsin Diagnosis Breast Cancer (WDBC) has 32 attributes 
• Wisconsin Prognosis Breast Cancer (WPBC) data set from the UCI machine learning repository has 34 attributes [82, 87, 88, 92, 93, 97 and 98]
• SEER dataset [81, 82, 107, 110, 111, 113, 114, 117, 122 and 124]
• The Digital Database for Screening Mammography (DDSM) [82, 107 and 109]
• Breast Cancer Surveillance Consortium (BCSC) database which contains 2.4 million screenings mammograms and associated self-administered questionnaires 
• PCE data and PCE colorectal dataset 
• National Cancer Institute’s Surveillance, Epidemiology, and End Results breast carcinoma data set 
This section, covers the results of several research works that have been done related to data mining applications for early diagnosis, treatment and prognosis of breast cancers.
Currently most of the physicians for identifying type of cancers (benign breast tumours from malignant) prefer to make surgical biopsy. But most of them believed that, biopsy is very critical task and must be prevented as much as possible. Therefore, proposing an intelligent system which can help tophysiciansto identify the type of cancer and avoid unnecessary surgical biopsywould be helpful for both patients and physicians.
In this sub-section we attempt to cover most of the research works that have been done related to diagnosis of breast cancers with applying various techniques of data mining along with their results. For our presentation we will classify all research works based on the interesting points and discuss them.
Based on the main goal of papers, we classified them in the following categories.
• Some of the research works, are compared the accuracy of applying various classification techniques for diagnosis of breast cancers, such as:
○ Vikas Chaurasia et al.  applied Simple Logistic, RBF and RepTree for diagnosis of breast cancer. The accuracy of their classification was 74.5%.
Wei-pin Chang et al.  made a comparative study for predicting breast cancers by decision tree, neural network, genetic algorithm and logistic regression. They concerned on 10 variable/attribute for creating breast cancer classification model. These variables were included: Clump thickness, Bland chromatin, Uniformity of cell size, Uniformity of cell shape, Bare nuclei, Normal nucleoli, Marginal adhesion, Mitoses, Single epithelial cell size and class variable with two value (benign/malignant). Their experimental results revealed that, decision tree has lowest prediction accuracy and logistic regression model had higher accuracy rate among these applied techniques for predicting breast cancers. Further, genetic algorithm had highest accuracy in the classification of breast cancers and created acceptable classification rules.
○ Based on the results of research which is done by Chaurasi and et al. , Simple logistic classifier among the other machine learning algorithms with having accuracy of 74.4% and total time taken for building model in 0.62 seconds, is the best algorithm for diagnosis of breast cancers. Further, in this study, researchers used three tests (including Gain Ratio test, Info Gain test and Chi-square test) for recognizing the variables which are important in diagnosis or treatment of breast cancers such as: Tumour size, patients’ Age, Degree of malignancy, Menopause, Breast-quad and etc.
○ Shweta Kharya  made a complete survey about applying different classification techniques for diagnosis of breast cancers. She studied different the performance or accuracy rate of various techniques (including Decision Tree, Bayesian Network, Logistic Regression, Support Vector Machines, Naïve Bayes Classifier, Association Rule Mining and ANN) for diagnosis of cancers by analysing factors (genes and etc.) or Digital Mammography images classification, Her study was based on the data which are collected from WBCD and SEER datasets. She claimed that, Decision tree with 93.62% accuracy rate of predicting cancers is the best predictor among the concerned techniques and the Bayesian network is the popular technique which is used in medical world for Brest cancer prognosis and diagnosis.
○ Senturk et al.  applied seven algorithms including KNN, Decision Tree, Naïve bayes, logistic regression, multi-layer perceptron, discriminant analysis and Support Vector Machine for diagnosis of breast cancers. Their experimental results declared that, accuracy of classification made by Support Vector Machine was high than others.
○ Ghassem Pour and colleagues  made a comparison between a Neural Network classification techniques with Model-based data mining techniques for accuracy of detecting breast cancers. Their experimental results showed that, adding an ensemble oriented approach can improve the results of both techniques. Furthermore, Neural Network approach with ensemble oriented approach had highest accuracy rate of classification in compare with model based data mining techniques.
○ Rajesh et al.  for classifying patients into either “Carcinoma in situ” (beginning or pre-cancer stage) or “Malignant potential” group, used C4.5 algorithm. They showed that, C4.5 had accuracy ~93% for diagnosis of breast cancers.
○ Hota  several intelligent techniques such as, ANN (Artificial Neural Network, Unsupervised ANN, Statistical and decision tree based techniques used for classifying data related to breast cancer. In this research work, different models are combined and made ensemble model. Experimental results in this study revealed that, the accuracy rate of ensemble model is better than single individual model.
○ Gupta and et al.  made a survey with study the several techniques which are used by many researchers for diagnosis and prognosis of breast cancers. Finally they mentioned that, in both cases, for selecting the best technique or algorithm with high degree of accuracy, can be decided after creating several types of models, trying different techniques or algorithms.
○ Burke HB et al.  compared the prediction accuracy of the TNM staging system1 with that of artificial neural network statistical models. They studied the accuracy of breast cancer prediction based on 5 years and 10 years surveillance data and revealed that, in both case, Artificial Neural Network’s prediction was accurate than TNM staging system.
○ Ronak Sumbaly et al.  used general (types, risk factors, symptoms and treatment) of breast cancers and applied various data mining techniques for diagnosis of breast cancers in the early step. Their results showed that, decision tree have capability to diagnosis breast cancers in the first stages.
○ Shrivastava et al.  made a review of different classification techniques which have been done for diagnosis of breast cancers. Finally they showed that, Neural Network and decision tree are the most popular techniques which are used by various researchers to create decision rules or predictive models from the breast cancer data.
○ Jahanvi Joshi et al.  applied various classification and clustering techniques to create pattern of breast cancer patients. For finding the healthy patients, several classifier rules are used. Further, authors claimed that, they used 47 classification algorithms for recognizing healthy people from sick patients. Their experimental results showed that, the results of approximately 13 techniques within those 47 applied techniques were same (24% sick patients and 76% healthy people). These 13 techniques are: Multilayer Perceptron, LMT classifier, Logistic, Classification via Regression, Multi-Class Classifier, GD, SMO, J48, Simple Logistic, AdaBoostM1, Bayes Net and Attribute Selected technique.
○ Padmavati et al.  for predicting breast cancers used RBF (Radial Basis Function), MLP (Multilayer Perceptron) and Logistic Regression techniques. Their experimental results showed that, RBF has prediction capability of RBF was better that two other techniques. Further, the time taken for prediction by RBF was lesser than other techniques.
○ Aboul  applied rough set data and ID3 decision tree classiﬁer algorithm for creating classification rules. Their experimental results showed that, the accuracy of classification rules created by rough set was better than ID3 algorithm. Further, the number of classification rules made by rough set algorithm is reduced in compare with ID3 algorithm. In the other words, rough set algorithm had compact number of produced rules.
○ Gouda I. Salama et al.  compared the accuracy and confusion matrix based on 10-fold cross validation method of different classification techniques including Multi-Layer Perception (MLP), decision tree (J48), Instance Based for K-Nearest neighbour (IBK) and Sequential Minimal Optimization (SMO) for diagnosis of cancers in three different databases of breast cancers (WPBC, WDBC, WBC). Their experimental results showed that, the combination of SMO, MLP, IBK and J48 hast the highest accuracy rate in compare with other techniques (in all of three datasets) for diagnosis of benign breast tumours from malignant.
○ One comparison between different neural network techniques (such as MLP, RBF, SOM (Self-Organizing Map) and PNN (Probabilistic Neural Network) is made by Sarvestan et al.  for diagnosis of breast cancers and detecting the type of breast tumours. Their experimental results, showed that, the accuracy of identifying breast cancers by PNN technique was high that other techniques. Finally they made a system with applying statistical neural network techniques and predefined accuracy rate for detecting breast cancers.
Challenges: There are several research works related to comparison of various data mining classification techniques, Association rule mining algorithms and etc. for diagnosis of breast cancers in the first stages. But the main challenge has remained and that is, for detecting breast cancers how many attributes are necessary? Which algorithm is applicable for all of databases? How can improve the accuracy of diagnosis and decrease the number of Biopsy and error in detecting malignant cancers? Is it possible we develop a tool which can be automatically without human Interference diagnosis the breast cancer with analysing automatically results of mammography and etc.?
• Several research works have attempted to propose a method or approach to recognize benign from malignant breast tumours.
○ Hassanien and colleagues  studied the applications of rough set theory to analysis the medical data and proposed an approach for creating compact classification rules with applying their proposed simplification algorithm. They claimed that proposed approach had classification accuracy of 98% whereas, accuracy of classification made by decision trees was 85.25. Further, the number of classification rules with applying decision trees and their proposed approach was respectively 428 and 30.
○ A simple data mining approach for finding people with high-risk breast cancer is proposed by Orlando . First they made different association rules by default and then made one questionnaire based on that rules and important defined factors which can be related with cancer disease occurrence, and asked from patients to fill that. In that questionnaire several questions were included such as: people habit for drinking alcohol. They made two option for this question including drink along food or no? After analysing the results of questionnaires, they made one decision tree and present important factors that can be help for recognizing high-risk groups of women. They claimed that, their experimental results have been shown that, decision tree has capabilities for finding significant association rules for predicting and diagnosis of breast cancers. However, this methodology needs to be evaluated in a larger set of examples in order to find associations with a higher degree of statistical confidence. Using a larger data set will also enable us to find correlations between a bigger set of genes and SNPs.
○ Einipour  combined two methodologies including ACO (Ant Colony Optimization) and Fuzzy System and made an automatically breast cancer diagnosis system named as FUZZY-ACO. The main advantage of the proposed system was high reliability and adequate interpretability in compared with other algorithms. Further the results of comparing the proposed approach with some algorithms such as C4.5, SVM, NN, Naïve Bayes and MLP revealed that, it had accuracy rate higher than other algorithms.
○ Raad and colleagues  made an approach for classification of breast cancers based on neural network techniques. Further, they developed a tool for automatic detection of breast cancers based on RBF neural network. They proved that accuracy, reliability and efficiency of RBF in compare with MLP technique was better.
○ Wen-Jia Kuo at el.  proposed a new computer aided diagnosis (CAD) system for classification of breast cancers by using decision tree technique. The main goal was reducing the number of unnecessary biopsies and increasing the diagnosis confidence. They used 24 co-variance texture features for creating decision tree with ability of identifying benign and malignant breast cancers. Accuracy, Positive Predictive value, Negative Predictive value, Sensitivity and Specificity are concerned as objective indices for estimating performance of proposed system in diagnosis of cancers. Authors claimed that, their system which had been made by decision tree had 96% accuracy rate, 93.33% Positive Predictive Value, 96.69% Negative Predictive Value, 93.33% sensitivity and 96.67% Specificity.
○ Jaimini Majali  compared the results of applying FP algorithm for diagnosis of breast cancers with the results of applying various classification techniques (such as: Neural networks, Bayesian classifier and Decision tree algorithm). Finally proposed a system for diagnosis and prognosis of breast cancers. Author used FP (frequent pattern mining) algorithm for recognizing the type of breast cancer (malignant or benign tumour) and Decision Tree algorithm to predict the possibility of breast cancer in context of age.
○ The other approach for early diagnosis of breast cancers with reducing the number of unnecessary biopsies is proposed by Jamarani et al.  by applying composition of different techniques including: ANN (Artificial Neural Network) and multi-wavelet based sub-band image decomposition. They claimed that, their proposed approach can be used as part of Computer Aided Diagnosis (CAD).
○ Sudhir D. Sawarkar et al.  have used neural networks and SVM (Support Vector Machine) method for diagnosis of breast cancers and proposed a new algorithm with implementing SVM by using kernel Adatron algorithm. This algorithm has capability for mapping inputs into a high-dimensional space. Further, it can be isolate inputs and separating data into their respective classes. Their experimental results revealed that, the proposed algorithm has high accuracy on diagnosis and detection of breast cancers. Based on their results, the accuracy of cancer diagnosis by surgeons, radiologists or physicians was nearly 85%, whereas, the accuracy of detections made by their proposed algorithm was 97%.
○ Xiao-Hui Wang et al.  attempted to combine physical examinations’ results, patients’ clinical background, histories and features of Mammography images and made a hybrid combination for diagnosis of breast cancers. Their experimental results showed that, applying single classification method (such as logistic regression) for both image and non-image information had higher performance and high accuracy rate in compare with applying hybrid combination which is tested by them.
○ Dina A. Sharaf-elDeen  used hybrid case-based approach for proposing a breast cancer diagnosis system. This system extract adaptation rules by integrating case-based and rule-based reasoning. In the proposed system, case based reasoning can be automatically generate the reasoning and the adaptation rules. In this system, both reasoning and adaptation rules are updated automatically with each new cases that are added for solving into system. Therefore, there is no need to create these rules from beginning. Experimental results with mammography images information showed that, the developed approach have reliable accuracy and can assist physicians to make diagnosis decisions with high accuracy rate.
○ Pendharkar PC et al.  studied breast cancers and proposed an approach with combining several Association rule mining and classification techniques. Their experimental results revealed that, this approach is capable to predict the occurrence of breast cancers or diagnosis cancers in the first stages.
○ Ta-Cheng Chen  proposed an approach based on genetic algorithms (GAs) for extracting breast cancers’ pattern, decision rules, threshold values and finally decision-making model with high degree of prediction accuracy. Their experimental results showed that, in the proposed approach, accuracy of prediction was improved. Further, modelling becomes simple. Moreover, this proposed approach had capability for extracting rules and acted on a computer model for prediction or classification purposes in breast cancer data.
Challenges: Regarding breast cancer diagnosis as we have seen in the above, severalalgorithms are proposed for improving the accuracy of breast cancer diagnosis, but the main challenges is: which one is having the highest accuracy rate for all databases? There are several diagnosis ways for doctors to help them for identifying cancers such as: Mammography, Physical test and etc. Now the main question is, is it possible to make a method which can combine the results of all tests or images and can detect the malignant in the first stage? Or propose an algorithm for predicting the possibility of occurring breast cancer for each individual person based on various tests or other personal information?
• Several researchers attempted to apply Regression Data Mining Techniques in breast cancer databases for diagnosis of breast cancers in the first stages.
○ Tayal et al.  compared seven regression techniques (Linear Regression, Pace Regression Model, Multiple Linear Regression, Non Linear Regression, Logistic Regression, Regression by Discretization and Isotonic regression) for diagnosis of cancers in first stages. Their experimental results showed that, among these seven regression techniques Logistic Regression with 10.59% of least Relative Absolute and 45.84% of Least Root Relative Squared Error was the best performance for diagnosis of breast cancers.
• Some tools are developed for breast cancer prediction or early diagnosis purpose. Such as:
○ Gauthier and colleagues  developed reliable assessment tool for breast cancer prediction and early diagnosis. They collect different profiles from public database and defined four parameters (including: breast density, age, prone to breast biopsy and number of affected first degree relatives) for calculating the risk score. In this research work, authors used k-nearest-neighbor algorithm to compute risk score. Their experimental results showed that, the accuracy of developed tool for identifying high risk people was 63%.
Challenges: Currently there is no tool which can diagnose breast cancer with high accuracy rate and minimum degree of errors or minimum number of required medical tests. Further, there are no tools which can analyze various examination tests (images, physical test’s results, MRI and etc.) without need to Biopsy and with high degree of accuracy diagnosis cancers.
• Breast-conserving surgery (BCS) also called as lumpectomy: if breast cancer is small and in the earlier stage of invasive.
• Mastectomy: When the size of cancer is too big.
• Chemotherapy (chemo): is useful for shrinking of cancer and reducing its size enough and preparing that for BCS. It is recommended usually for big size of cancers.
• Radiation: is done for all women who have Mastectomy or BCS. It is reducing the chance of coming back in the cancer.
• Adjuvant systemic therapy or Adjuvant systemic therapy is recommend after Mastectomy or BCS.
• Neo-adjuvant chemo: for treatment of patients before BCS or Mastectomy.
• Hormone therapy: all women with cancer size of larger than 0.5 cm (or t ¼ inch) will be recommend to continue Hormone therapy for 5 years.
• Targeted therapy: Sometimes for treatment of cancers before/after surgery some drugs are recommended by physicians.
We looked in several papers which are published related to data mining applications on cancers’ diagnosis, prognosis and the best way of treatment. But unfortunately there are not enough papers which discuss the application of data mining techniques for selecting the best option among the available options for treatment of each individual based on her/his special characteristics. The paper published by Yadav et al.  proposed a procedure for identifying patients whose need urgent Chemotherapy starting without wasting time. It used decision tree and SVM to classify patients into two classes as, Benign and Malignant. The accuracy rate of this two classification techniques in this study was 96% and 98% respectively for decision tree and SVM. Therefore, the classed made by SVM are concerned in the next step. In the second step, a clustering method such as k-means is applied for partitioning the two classes into 3 clusters including intermediate, poor and good for identifying the patients who require chemotherapy as soon as possible. Poor group is crucial group and chemotherapy should start immediately to enhance their survival.
Challenges: Generally, in medical world, all patients based on the stage of cancer lead to follow similar treatments. The main question is, can we apply behavior mining techniques to extract unique follow of treatments for each patient? Can we apply classification or clustering techniques for grouping patients based on their characteristic, stage of cancer and etc. and use these groups’ characteristics for identifying high risk people whose treatments follow must be different from other patients? Since we saw whenever similar treatments are started for a group of patients with having same stage of disease sometimes number of patients not become well and their health and size of cancer becomes larger, whereas for other patients size of cancer becomes smaller. Why? There is no clear answer in medical world for this question. Even sometimes a few of patients with stage 2 even stage 3 of breast cancer without any treatment live along time whereas, some other patients whose having same stage cancer and continually treated by physicians, unfortunately no longer live. The question: What is the reason for this event? There is no answer of this question up to now.
Prognosis problem is also called as “analysis of survival or lifetime data”. It is predicting the occurrence or recurrence of the breast cancer in each individual person. We divide prognosis in two parts :
• Prognosis of occurrence of breast cancer in a person without cancer background (in the other words, women without any breast cancer symptom but they are identified as a high-risk or at- risk people and the possibilities of occurrence of breast cancer in them is high). This type of prognosis is called as, breast cancer prognosis at treatment stage.
• Prognosis the possibility of recurrence of breast cancer after complete treatment. This type of prognosis is called as, breast cancer prognosis at recurrence stage.
Based on several research results, several attributes are affected on the survivability of breast cancer, some of these attributes are presented in Table 1.
There are different types of breast cancer recurrence:
• Local recurrence: it means, breast cancer after sometimes (may be 6 months or more) complete treatment, will be back in the same place which it had started before.
• Regional recurrence: it means, when the breast cancer happens for the second time, it will appear in the lymph nodes near the place that it happened first time.
• Distant recurrence: it means, after treatment, for a second time when breast cancer appears, it will start in some other part of the body not in breast itself, such as: liver, bone, brain or lungs.
Several experimental results of physicians are shown that, approximately most of the women 5 years after diagnosis of breast cancer are alive. However, some people live more than 5 years but generally 5-years survival period is used as a standard rate for discussion about prognosis.
There are several paper which are published for recognizing high risk people those who are prone to cancer. We categorised these papers in two parts:
• For predicting survivability rate of breast cancers, several data mining techniques are compared and performance or accuracy rate of that algorithms are compared.
○ Abdelghani Bellaachia et al.  applied three data mining techniques including the back-propagated neural network, Naïve Bayes and the C4.5 decision tree algorithms for predicting the survivability rate of breast cancer patients. Their experimental result showed that, C4.5 algorithm had better performance in comparisonto other techniques for predicting survivability rate of breast cancer patients.
○ Jaree Thongkam et al.  applied Modest AdaBoost, k-mean, Real, Gentle, C4.5, Bagging, random forest, C-SVC Algorithms for extracting knowledge from breast cancer survival database in Thailand. Performance of these algorithms is examined by comparing classification accuracy, confusion matrix, sensitivity, specificity and stratified 10-fold cross-validation method. Their experimental results revealed that, the accuracy of prediction by Real, Adaboost and Gentle was better than others.
○ Alshammari et al.  applied four classification approaches (including: C4.5, Naïve Bayes and two types of ANN (Artificial Neural Network): Multilayer Perceptron and RBF (Radial Basic Function)) for predicting breast cancer survivability. Their experimental results showed that, among these approaches, the accuracy (0.893), specificity (0.985) and sensitivity (0.891) of predicting breast cancer survivability by decision tree (C4.5) was higher than others.
○ Kung-Min Wang et al.  applied to classification techniques (decision tree and logistic regression model) for predicting 5 year survivability of breast cancer patients. They selected some variables such as: cancer stage, extension of cancer, grade, race, site-specific surgery code for applying their classification. Their experimental results showed that, logistic regression model had better accuracy rate in compare with decision tree model.
○ G. Ravi Kumar  compared the accuracy of breast cancer prognosis and prediction by applying six classification techniques including: KNN, Naïve Bayes, Decision Tree, Logistic Regression, MLP and SVM. Their experimental results proved that, SVM is the more suitable for breast cancer prediction since it has the highest accuracy rate among other techniques.
○ YJ Lee et al.  used data mining technique (nonlinear smooth support vector machines (SSVMs)) for analysing the effects of chemotherapy on survival time of breast cancer patients. They analysed the effects of several features (feature space, cytological features, pathology features) in tumour size. They classified all breast cancer patients into three classes: Good, Intermediate and Poor. Then showed that, only for patients in good group chemotherapy is not required.
○ D. Delen et al.  applied three classification techniques including: decision trees, artificial neural networks and logistic regression for prediction breast cancer survivability. Their experimental results with using 10 fold cross-validation for each model showed that, decision tree had higher accuracy rate (0.9362) among other techniques.
○ J. Gadewadikar et al.  applied Bayesian Belief Network for predicting breast cancer automatically. They developed an interface for radiologists to detect cancers with analysing mammography.
○ A. A. M. Medhat et al.  for extracting the mammographic mass features and using these features for predicting survival time of breast cancer patients, applied Tree Boost, SVM and Tree Forest techniques. Their experimental results showed that, SVM technique had highest prediction accuracy in comparison with the other techniques.
○ Gandhi Rajiv K et al.  created classification rules by applying Particle Swarm Optimization Algorithm. For selecting feature subset, they used fuzzy and Genetic algorithms. They created smaller fuzzy rule bases system with higher accuracy. These rules are used for classification by applying Particle Swarm Optimization Algorithm and showed the high rate of accuracy.
○ Sudhir D et al.  for predicting the type of cancer and avoiding un-necessary biopsy, used SVM and ANN techniques. Their experimental results showed that, both of these techniques had 97% accuracy rate and can be used as assistant for physicians to avoid un-necessary biopsy.
• Several methods are proposed by researchers for prognosis of breast cancers.
○ Saleema et al.  proposed a model for identifying prominent response variables (such as: patient age at diagnosis, stage of cancer and patient survival) by applying the standard classifiers. This model had three phases namely: basic level pre-processing, problem specific processing (deals with sampling, feature selection and response variable selection) and classifiers for modelling (such as Decision Tree, KNN and Naïve Bayes) for prediction analysis. Their experimental results showed that, decision tree algorithm had better performance in compare with other classifiers. Further, balanced stratified sampling technique maintained consistency in the performance.
○ Dengju Yao et al.  proposed a combination method of multivariate adaptive regression splines (MARS) and random forest algorithms for cancer prediction. They used random forest algorithm for screening variables and give rank for those variables. Then used MARS procedure for creating a model for predicting cancer survivability. The performance of this method by calculating accuracy rate, specificity, sensitivity and 10-fold cross-validation is evaluated and the results showed that, this method had higher accuracy in compare with the models.
○ Saleema et al.  proposed a sampling method by examining the effect of sampling techniques (such as stratified sampling, random sampling and balanced stratified sampling) in classifying the prognosis variables. Three classification techniques including K-Nearest Neighbour, Naïve Bayes and Decision Tree are used as a classifiers. Three prognosis factors including stage of cancer, metastasis and survival are used as class labels. The results of experimental results showed that, by applying their proposed method, accuracy of prediction by balanced stratified model is increased continually by increasing the sample size, whereas, it is not true in traditional approaches.
○ Reeti Yadav et al.  proposed a procedure for identifying patients whose are require urgent chemotherapy without wasting time. In this procedure, in the first step, they used support vector machines (SVMs) and Decision Tree for classifying all patients into two classes including Malignant and Benign. In the second step, clusters (including: Poor, Intermediate and Good) are made by the help of k-means algorithms for identifying patients whose are need urgently for chemotherapy. In this clusters, whose are belong to Poor cluster, are need for urgent chemotherapy. Based on the experimental results of these researchers, SVM had highest accuracy rate (98%) for classifying the patients.
○ Chul-Heui Lee et al.  proposed a classification method based on hierarchical granulation structure (for finding classification rules) and applied the rough set theory. Their experimental results showed that, their proposed method, created minimal classification rules and made analysing of information system easier.
Challenges: The recurrence of cancer is not similar in all of the patients. In one group of patients, never breast cancer recur after treatment, in the other group, it comes back approximately 6 months after treatment and in the third group of patients, within 5 years that will be recur. Therefore, developing a tool or proposing a method for predicting the recurrence of cancer and exact time of recurrence is required. Currently, there is no popular tool with acceptable accuracy rate for this matter.
This paper reviewed several research works which are done for diagnosis, treatment or prognosis breast cancers. Based on the results of this study, most of the research works are concerned on comparing the accuracy rate of data mining various algorithms or techniques. Unfortunately, there is no tool that automatically diagnose or prognoses breast cancer. Further, there is no research work which apply personalized features for proposing the best treatment for patients.
In the future work, we will attempt to develop a tool with the help of intelligent agents and applying data mining tools with the capability of automatically breast cancer diagnosis and proposing the best treatment.