Binary Data Mining

Professor Vaclav Snasel

Dean of FIT
Department of Computer Science,
Faculty of Electrical Engineering and Computer Science,
VSB-Technical University of Ostrava, Czech Republic

Biography: http://www.cs.vsb.cz/snasel
Email: vaclav.snasel@vsb.cz


Binary data have been occupying a special place in the domain of data analysis. Analysis of binary data sets, however, generally leads to NP-complete/hard problems. Consequently, the focus here is on effective heuristics for reducing the problem size. Matrix factorization or factor analysis is an important task helpful in the analysis of high dimensional real world data. There are several well known methods and algorithms for factorization of real data but many application areas including information retrieval, pattern recognition and data mining require processing of binary rather than real data see [4],[5],[7],[8],[11],[14]. Unfortunately, the methods used for real matrix factorization fail in the latter case. In this paper we introduce background for binary matrix factorization.In order to perform object recognition (no matter which one) it is necessary to learn representations of the underlying characteristic components. Such components correspond to object-parts, or features [10]. These data sets may comprise discrete attributes, such as those from market basket analysis, information retrieval, and bioinformatics, as well as continuous attributes such as those in scientific simulations, astrophysical measurements, and sensor networks. The feature extraction if applied on binary datasets, addresses many research and application fields, such as association rule mining [1], market basket analysis [2], discovery of regulation patterns in DNA microarray experiments [12], etc. Many of these problem areas have been described in tests of PROXIMUS framework (e.g. [7]). So called bars problem [13] is used as the benchmark. Set of artificial signals generated as a Boolean sum of given number of bars is analyzed by these methods. Here we will concentrate on the case of black and white pictures of bars combinations represented as binary vectors, so the complex feature extraction methods are unnecessary [6]. Many applications in computer and system science involve analysis of large scale and often high dimensional data. When dealing with such extensive information collections, it is usually very computationally expensive to perform some operations on the raw form of the data. Therefore, suitable methods approximating the data in lower dimensions or with lower rank are needed. In the following, we focus on the factorization of hight-dimensional binary data or high order binary tensors [3].

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases. In: VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (1994) Pages 487-499

[2] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data. In: SIGMOD ’97: Proceedings of the 1997 ACM SIGMOD international conference on Management of data, New York, NY, USA, ACM Press (1997) Pages 255-264

[3] L. Elden. Matrix Methods in Data Mining and Pattern Recognition. SIAM 2007.

[4] A.A. Frolov, D. Husek, P. Muravjev, P. Polyakov, Boolean Factor Analysis by Attractor Neural Network. Neural Networks, IEEE Transactions 18(3) (2007) Pages 698-670

[5] H. Lu, J. Vaidya and V. Atluri, Optimal Boolean Matrix Decomposition: Application to Role Engineering, ICDE 2008, in print.

[6] D. Husek, P. Moravec, V. Snasel, A.A. Frolov, H. Rezankova, P. Polyakov: Comparison of Neural Network Boolean Factor Analysis Method with Some Other Dimension Reduction Methods on Bars Problem. Springer, LNCS 4815, PReMI 2007: 235-243

[7] M. Koyuturk, A. Grama, N. Ramakrishnan, Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Trans. Math. Softw. 32(1) (2006) Pages 33-69

[8] P. Miettinen, T. Mielikainen, A. Gionis, G. Das, H. Mannila: The Discrete Basis Problem. PKDD 2006: 335-346

[9] P. Moravec and V. Snasel. Dimension Reduction Methods for Image Retrieval. In Proceedings of the Conference on Intelligent Systems Design and Applications (ISDA2006), 6 pages, Jinan, Shandong, China, October 2006. IEEE Press.

[10] V. Snasel, P. Moravec, and J. Pokorny. Using BFA with WordNet Based Model for Web Retrieval. Journal of Digital Information Management, 4(2):107-111, 2006.

[11] V. Snasel, D. Husek, Alexander A. Frolov, H. Rezankova, P. Moravec, P. Polyakov: Bars Problem Solving - New Neural Network Method and Comparison. Lecture Notes in Computer Science 4827, MICAI 2007: 671-682

[12] Spellman, P.T., Sherlock, G., Zhang, M.Q., Anders, V.I.K., Eisen, M.B., Brown, P., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. In: Molecular Biology of the Cell. (1998) Pages 3273-3297

[13] M. W. Spratling: Learning Image Components for Object Recognition. Journal of Machine Learning Research 7 (2006) 793-815.

[14] Z. Zhang, T. Li, Ch. Ding, X. Zhang, Binary Matrix Factorization with Applications, ICDM 2007