Anale. Seria Informatică. Vol. XVII fasc. 1 – 2019 Annals. Computer Science Series. 17 th Tome 1 st Fasc. – 2019 229 DDEETTEECCTTIIOONN OOFF KK--UUPPPPEERR OOUUTTLLIIEERRSS IINN EEXXPPOONNEENNTTIIAALL SSAAMMPPLLEESS UUSSIINNGG MMUULLTTIIPPLLEE UUPPPPEERR OOUUTTLLIIEERR TTEESSTTSS RReemmii JJuulliiuuss DDaarree 11 ,, OOlluummiiddee SSuunnddaayy AAddeessiinnaa 22 ,, OOlluuddoollaappoo KKeehhiinnddee FFaammuurreewwaa 11 ,,OOwwoosseennii TTiimmootthhyy 11 1 Department of Mathematical Sciences, Kings University, Odeomu, Nigeria 2 Department of Mathematical Sciences, Olabisi Onabanjo University, Nigeria Corresponding Author: Remi Julius Dare, jr.dare@kingsuniversity.edu.ng ABSTRACT: Outlying values have been an issue of concern to researchers and data analysts. The study of multiple outliers is imperative because outlying values may lead to model misspecification, wrong estimation of parameters and incorrect statistical results. This paper proposed Tietjien-Moore test statistics for upper outliers in exponential samples. A simulated study was carried to investigate the strength of the test statistics. KEYWORDS: Outlier, Exponential distribution, Tietjen- Moore test, Gap test, Simulation 1. INTRODUCTION Talking about study on outliers [W+02] and [LK12] made distinct contributions stating that, in data analysis, variables are being recorded or sampled without prior investigation into possible outlying values. One of the ways of obtaining a coherent statistical analysis is to carry out statistical investigations on the detection of outlying observations. Although outliers are often considered as an error or noise, they may carry important information. When outliers are not detected, it poses a challenge which results in aberrant data that ultimately lead to model misspecification, biased parameter estimation and wrong results. It is therefore important to identify them prior to modeling to avoid statistical errors. [BL94] provided a detailed introduction to different ways of describing outliers. An exact definition of an outlier often depends on hidden assumptions regarding the data structure and the applied detection method. Yet, some definitions are regarded general enough to cope with various types of data and methods. [H+92] defines an outlier as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. [BL94] point out that an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs, similarly, [DG93] defines an outlier as an observation in a dataset which appears to be inconsistent with the remainder of that set of data. [BL94] submitted that Outlier detection methods have been suggested for numerous applications, such as credit card fraud detection, clinical trials, voting irregularity analysis, data cleansing, network intrusion, severe weather prediction, geographic information systems, athlete performance analysis, and other data-mining tasks. This submission was further established by [H+92], [BL94], [RW95], [FP97], [AR04] and [LK12] proposed Gap family test for testing k-upper outlier, and little works have been done in recent times on k-upper outlier detection. This study proposed Tietjien-Moore statistics for checking k-upper outlier in exponential samples and access the performance of the proposed test with that of the one proposed by [LK12]. Next section is Material and Methods, followed by Simulation Study, then section 4 is the discussion, and paper was concluded in section 5. 2. MATERIALS AND METHODS 2.1 Dixson test Statistic for Upper-Outlier Detection Dickson type test statistic for k-upper outlier is stated below: (1) A large value of the test statistic signifies the presence of k-upper outliers in the sample. 2.2 Zerbet and Nikulin Test for Outlier Zerbet and Nikulin also proposed a test statistic for identifying outliers as follows: (2) A small value of kT establishes the presence of outliers in the sample. mailto:jr.dare@kingsuniversity.edu.ng Anale. Seria Informatică. Vol. XVII fasc. 1 – 2019 Annals. Computer Science Series. 17 th Tome 1 st Fasc. – 2019 230 2.3 Maximum likelihood ratio test One of the popular test statistics is also used for testing upper outliers is the maximum likelihood ratio test given by (3) If kL is greater than specified value, the test indicates the presence of outliers. 2.4 Gap-test statistic The test statistic for k-upper outliers may be defined as: (4) A very high value of kZ indicates the presence of k upper outliers in the sample. Therefore, the null hypothesis is rejected for , where is the critical value at α level of significance. The exact null distribution kZ for is rather complex. However, the critical values of test for are found to be very close to: (5) For k-upper outliers, the test statistic may be defined as: (6) We use 5% level of significance for while 1% level of significant for respectively. 2.5 Tietjen-Moore test statistics The following steps are considered for Tietjen-Moore test statistic: (i) Data sorting from smallest to the largest (7) (7) is test statistics for the largest point. Where test statistics for k-smallest point is (8) To test outlier in both tails, we compute the absolute residuals and then Ek denote the sorted absolute residuals. The test statistics for the case is: (9) 2.6 Critical values It is interesting to see that the simulated critical values for Tietjen-Moore test statistic in this paper are obtained using r-software at two levels of significance. Simulated Critical values of Gap-test statistic are obtained from the formula (6) given above without tables. 2.7 Test hypothesis A discordance test needs to be performed in order to identify outliers. Let be a random sample from an exponential distribution and its corresponding order statistics To perform the discordance test for k-upper outliers we assume a null hypothesis (i) 0H : all the observations are coming from the exponential distribution against an alternative hypothesis (ii) 1H ; that observations are from this population but k values are from , population. If we reject the null hypothesis and say that there is k-upper outlier in the given exponential sample (where Ek represents the simulated value from the given test statistic and E*k(ᾳ) is the critical value of the test statistic). 3. SIMULATION STUDY The random sample from exponential distribution was simulated, and upper outliers were tested. The exponential distribution is the distribution with probability density function: , (10) A discordance test has to be performed in order to investigate the presence of outliers. Let be a random sample in an exponential distribution and its corresponding order statistics One of the steps in obtaining a coherent and structured analysis is the investigation on detection of outlaying observations. Outliers mostly refer to errors or noise, they are very crucial and give vital information. [W+02] establish the fact that, detected outliers are candidates for aberrant data that may otherwise adversely lead to model Anale. Seria Informatică. Vol. XVII fasc. 1 – 2019 Annals. Computer Science Series. 17 th Tome 1 st Fasc. – 2019 231 misspecification, biased parameter estimation and incorrect results. It is therefore important to identify them prior to modeling and analysis. Table 1: Descriptive Statistics of exponential simulated sample, at different value of n n MEAN STDEV SKEWNESS KURTOSIS 15 1.02949 0.89031 0.81961 -0.81798 16 0.69586 0.52797 0.57929 -1.01206 17 1.10798 1.20354 1.21317 0.51089 18 1.05834 0.80689 0.60443 -0.75001 19 1.33233 1.37831 1.37831 1.378305 20 1.10598 0.93664 1.80601 0.78441 50 0.96628 0.96562 1.80601 3.140291 100 0.77457 0.77480 1.765782 3.19201 200 1.08986 1.134765 2.074985 5.60357 Figure 1: Normal Quantile Quantile plot for 15n  Figure 2: Normal Quantile Quantile plot for 20n  Figure 3: Normal Quantile Quantile plot for 200n  Figure 4: Normal Quantile Quantile plot for 100n  Investigation into outlier detection methods have been suggested for a series applications, such as credit card fraud detection, clinical trials, voting irregularity analysis, data cleansing, network intrusion, severe weather prediction, geographic information systems, athlete performance analysis, and other data-mining tasks. 3.1 Detection of k upper outliers using Tietjen- Moore and Gap test statistics Table 2 shows the approximate critical Values for 5% and 1% test for upper outliers in exponential Sample Using Gap-test and Tietjen- Moore as test Statistics. Table 2:Tietjen-Moore and Gap-test critical values n 15 0.6885 0.5587 0.2290 0.3146 0.6792 0.5593 0.2496 0.3339 0.6872 0.5532 0.2640 0.3473 16 0.6508 0.5322 0.2155 0.2971 0.6454 0.5265 0.2351 0.3156 0.6483 0.5404 0.2488 0.3285 17 0.6161 0.5147 0.2035 0.2815 0.6187 0.5088 0.2222 0.2992 0.6139 0.5103 0.2353 0.3116 18 0.5875 0.4910 0.1928 0.2673 0.5848 0.4859 0.2106 0.2844 0.5856 0.4855 0.2231 0.2963 19 0.5663 0.4706 0.1831 0.2546 0.5588 0.4709 0.2002 0.2710 0.5608 0.4633 0.2121 0.2824 20 0.5392 0.4490 0.1743 0.2430 0.5375 0.4443 0.1907 0.2587 0.5385 0.4574 0.2022 0.2698 50 0.2449 0.2160 0.0726 0.1025 0.2439 0.2158 0.0788 0.1096 0.2427 0.2148 0.0839 0.1148 100 0.1319 0.1191 0.0364 0.0521 0.1322 0.1206 0.0398 0.0558 0.1313 0.1208 0.0424 0.0586 200 0.0696 0.0655 0.0305 0.0435 0.0697 0.0655 0.0332 0.0467 0.0796 0.0766 0.0354 0.0490 and , for k=2, 3, 4 are Tietjen-Moore and Gap test simulated and approximate critical values respectively. In each two-line entry, the first line is the critical values for α=0.05 and the second line is the critical values for α=0.01.The table represents 10,000 replications generated from the standard exponential distribution for different sample sizes n. Anale. Seria Informatică. Vol. XVII fasc. 1 – 2019 Annals. Computer Science Series. 17 th Tome 1 st Fasc. – 2019 232 Table 3 :Tietjen-Moore Statistic for k-upper outlier for exponential samples n E2, E3, E4, 15 2.8204** 2.0842** 2.4002* 16 0.6154* 0.7964** 15.4177** 17 4.57123** 0.47932 2.0188** 18 1.25438** 0.9856** 1.16307** 19 0.7326** 1.1628** 1.2280** 20 2.3116** 0.8224** 1.1616** 50 0.5378** 0.5015** 0.6694** 100 0.1964** 0.1997** 0.2324** 200 0.1196** 0.0703** 0.09521** ** indicate that the k-upper outlier is found in both 5% and 1% while * indicate that k-upper outlier is found at 5% only. We compare Table 3 with 4.2 if we reject the null hypothesis and say that there is k-upper outlier in a given exponential sample. Table 4: Estimated Gap test Statistics for k-upper outliers for exponential samples n Z1, Z2, Z3, 15 0.1509 1.8121** 0.1477 16 0.8497** 0.0698 0.1276 17 0.0733 0.0771 0.1118 18 0.0232 0.1081 0.1131 19 0.0169 0.6958** 1.6059** 20 0.5708** 0.8025** 0.9708** 50 0.0884** 0.0643** 0.1312** 100 0.0943** 0.05642** 0.0598** 200 0.0643** 0.0465** 0.0534** ** indicate that the k-upper outlier is found in both 5% and 1% while * indicate that k-upper outlier is found at 5% only. We compare table 4 with 6 if we reject the null hypothesis and say that there is k-upper outlier in a given exponential sample. Table 5: Table for Test Statistics of Tietjen-moore for k-Upper Outlier Criterion n 15 2.8204** 0.6885 2.0842** 0.6792 0.5593 2.4002* 0.6872 0.5532 0.5587 16 0.6154* 0.6508 0.7964** 0.6454 0.5265 15.4177** 0.6483 0.5404 0.5322 17 4.57123** 0.6161 0.47932 0.6187 0.5088 2.0188** 0.6139 0.5103 0.5147 18 1.25438** 0.5875 0.9856** 0.5848 0.4859 1.16307** 0.5856 0.4855 0.491 19 0.7326** 0.5663 1.1627** 0.5588 0.4709 1.2280** 0.5608 0.4633 0.4706 20 2.3116** 0.5392 0.8224 0.5375 0.4443 1.1616** 0.5385 0.4574 0.449 50 0.5378** 0.2449 0.5015** 0.2439 0.2158 0.6694** 0.2427 0.2148 0.216 100 0.1964** 0.1319 0.1997** 0.1322 0.1206 0.2324** 0.1313 0.1208 0.1191 200 0.1196** 0.0696 0.0703** 0.0697 0.0655 0.0952** 0.0796 0.0766 0.0655 * First line is 5% and second line is 1%. From Table 5, it is observed that k-upper outliers were seen in all the sample for and respectively except for =17, and =3 for 5% and 1%, and 5% for =16 and =2 In fairness the test was able to identify outliers as expected; same is not for the test proposed by Gap Family as shown table Table 6: Table for Test Statistics of Gap test for k- Upper Outliers Z2 Z3 Z4 15 0.1509 0.2290 0.3146 1.8121** 0.2496 0.3339 0.1477 0.2640 0.3473 16 0.8497** 0.2155 0.2971 0.0698 0.2351 0.3156 0.1276 0.2488 0.3285 17 0.0733 0.2035 0.2815 0.0771 0.2222 0.2992 0.1118 0.2353 0.3116 18 0.0232 0.1928 0.2673 0.1081 0.2106 0.2844 0.1131 0.2231 0.2963 19 0.0169 0.1831 0.2546 0.6958** 0.2002 0.2710 1.6059** 0.2121 0.2824 20 0.5708** 0.1743 0.2430 0.8025** 0.1907 0.2587 0.9708** 0.2022 0.2698 50 0.0884** 0.0726 0.1025 0.0643** 0.0788 0.1096 0.1312** 0.0839 0.1148 100 0.0943** 0.0364 0.0521 0.05642** 0.0398 0.0558 0.0598** 0.0424 0.0586 200 0.0643** 0.0305 0.0435 0.0465** 0.0332 0.0467 0.0534** 0.0354 0.0490 * First line is 5% and second line is 1%. From Table 6 k-upper outlier were only observed in 14 cells out of 36 cells, it may be an indication of the ineffectiveness of the test statistics. Figure 5: Comparative plots between Maximum likelihood ratio, Tietjen-Moore, Gap family test, and Dixon-type at 5% level of significance 0 0,5 1 1,5 2 2,5 1 2 3 4 5 6 7 8 9 10 11 12 EK ZK DK LK Anale. Seria Informatică. Vol. XVII fasc. 1 – 2019 Annals. Computer Science Series. 17 th Tome 1 st Fasc. – 2019 233 Figure 6: Plots of the difference between Tietjen- Moore and three other and Dixon-type at 5% level of significance Comparison of the three test based on power factor The difference of and appear to be the same, hence the line of covers that of . 4. DISCUSSION Simulation was done replicating 10,000 exponential samples and from which samples of 15-20, 50, 100 and 200 were drawn respectively. Various test statistics were carried out for discordancy tests on the samples. Table 1 shows the descriptive statistics for the exponential sample; the mean, standard deviation, skewness and the kurtosis respectively. From the descriptive, it is observed that the data is not normally distributed hence, contains outlying values which presence was tested for in the data analysis. Figure 5 shows the line plots comparing maximum likelihood ratio test statistics with Dixon-type and Tietjen-Moore test statistics for values of and where represents suspected number of upper outlier in the sample. Table 2 displays critical values for the calculated with a given Gap family formula and that of Tiejen-Moore critical value for 5% and 1% for =15 to 20, 50, 100 and 200 respectively, these critical values are reliable and there will not be a need for table. In all the test statistics null hypothesis that the samples at each value of and would be rejected, comparing the calculated to critical values.The table for ‘gap family test’ and Tietjen-Moore Statistics was presented in chapter four and it was found out that ‘Gap family test’ detected 16 outliers in 9 by 3 matrix while Tietjen-Moore detects the presence of 25 for both 5% and 1% and 1 for only 5% making twenty (26) altogether. However, Tietjen-Moore is found to be more effective in detecting outlier than the test suggested by Gap family. The Power test also confirms that Tietjen-Moore test is better to detect k- Upper Outlier more so that critical value and Test Statistics can be obtained with statistical Software. Table 2 displays critical values for the given Gap and Tietjen-Moore tests critical values at 5% and 1% for =15 to 20, 50, 100 and 200 respectively. This was generated using Software by R Core team (2018). In the entire test statistics null hypothesis would be rejected at each value of and comparing the simulated values to critical values. The table for‘gap family test’ and Tietjen-Moore Statistics were presented, it was found out that Gap-test detected 16 occurrences in 9 by 3 matrix while Tietjen-Moore detected 26 occurrences for both 5% and 1%. However, Tietjen-Moore is found to be more effective in detecting outliers than the test suggested by Gap family. 5. CONCLUSION This study explored the use of gap family test and Tietjen-Moore Statistics to detect the presence of k- upper outliers in exponential samples. On comparing the two tests, Tietjen-Moore Statistics proved to be better test for k-upper outlier relative to the Gap family test proposed by [LK12]. The results obtained in this study agree with that of [AAD16]. Therefore, Tietjen-Moore test statistics is recommended to test k-upper outliers. REFERENCES [AR04] Acuna E., Rodriguez C. A. – Meta analysis study of outlier detection methods in classification, Technical paper, Department of Mathematics, University of Puerto Rico at Mayaguez, Retrieved from academic.uprm.edu/ eacuna/paperout.pdf. In proceedings IPSI 2004, Venice, 2004. [AAD16] Adesina O. S., Ayoola F. J., Dare R. J. – Testing For Multiple Upper Outliers in Distribution Samples: A Study of Foreign Exchange Data. Assumption University-eJournal of Interdisciplinary Research (AU-eJIR), 1 (2), pp. 80-92, 2016. [BL94] Barnett V., Lewis T. – Outliers in Statistical Data. John Wiley, 1994. [BS03] Bay S. D., Schwabacher M. – Mining distance-based outliers in near linear time with randomization and a simple pruning rule, In Proc. of the ninth 0 0,05 0,1 0,15 0,2 0,25 0,3 0,35 1 2 3 4 5 6 7 8 9 10 11 12 EK-ZK EK-DK EK-LK Anale. Seria Informatică. Vol. XVII fasc. 1 – 2019 Annals. Computer Science Series. 17 th Tome 1 st Fasc. – 2019 234 ACMSIGKDD Conference on Knowledge Discovery and Data Mining,Washington, DC, USA, 2003. [B+00] Breunig M. M., Kriegel H. P., Ng R. T., Sander J. – Lof: Identifying density based local outliers, In Proc. ACMSIGMOD Conf. 2000, 93–104, 2000. [Dav79] David H. A. – Robust estimation in the presence of outliers, In Robustness in Statistics, eds. 2000 R. L. Launer and G. N. Wilkinson, Academic Press, New York, 61-74, 1979. [DG93] Davies L., Gather U. – The identification of multiple outliers, Journal of the American Statistical Association, 88(423), 782-792, 1993. [Fer61] Ferguson T. S. – On the Rejection of outliers, In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 253- 287, 1961. [FP97] Fawcett T., Provost F. – Adaptive fraud detection, Data-mining and Knowledge Discovery, 1(3), 291–316, 1997. [Gru69] Grubbs F. E. - Procedures for detecting outlying observations in Samples, Technometrics, 11,121, 1969. [Had92] Hadi A. S. – Identifying multiple outliers in multivariate data, Journal of the Royal Statistikical Society. Series B, 54, 761-771, 1992. [H+92] Hawkins S., He H. X., Williams G. J., Baxter R. A. – Outlier detection using replicator neural networks, In Proceedings of the Fifth International Conference and Data Warehousing and Knowledge Discovery (DaWaK02), Aixen Provence, France, 2002. [JW92] Johnson R. A., Wichern D. W. – Applied Multivariate Statistical Analysis. Prentice Hall, 1992. [LK12] Lalitha S., Kumar N. – Multiple outlier test for upper outliers in an exponential sample. Journal of applied statistics, 39:6 1323-1330, 2012. [R18] R Core Team – R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. https://www.R- project.org [RW95] Runger G., Willemain T. – Model- based and Model-free Control of Autocorrelated Processes, Journal of Quality Technology, 27 (4), 283-292, 1995. [W+02] Williams G. J., Baxter R. A., He H. X., Hawkins S., Gu L. – A Comparative Study of RNN for Outlier Detection in Data Mining, IEEE International Conference on Data-mining (ICDM’02), Maebashi City, Japan, CSIRO Technical Report CMIS-02/102, 2002. https://www.r-project.org/ https://www.r-project.org/