「Compressed Vector Set: A Fast and Space-Efficient Data Mining Framework」

[Journal of Information Processing Vol.26, pp.416-426]
[IPSJ Transactions on Databases Vol.11 No.1, published as a preprint]


 In this paper, we present CVS (Compressed Vector Set), a fast and space-efficient data mining framework that efficiently handles both sparse and dense datasets. CVS holds a set of vectors in a compressed format and conducts primitive vector operations, such as lp-norm and dot product, without decompression. By combining these primitive operations, CVS accelerates prominent data mining or machine learning algorithms including k-nearest neighbor algorithm, stochastic gradient descent algorithm on logistic regression, and kernel methods. In contrast to the commonly used sparse matrix/vector representation, which is not effective for dense datasets, CVS efficiently handles sparse datasets and dense datasets in a unified manner. Our experimental results demonstrate that CVS can process both dense datasets and sparse datasets faster than conventional sparse vector representation with smaller memory usage.
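The abstract does not describe the compression format used by CVS, so the sketch below is only an illustration of the general idea of computing a norm and a dot product without decompressing. It assumes plain run-length encoding as a stand-in format, and the names `rle_encode`, `lp_norm`, and `dot` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Assumption: a vector is stored as (value, run_length) pairs.
# This run-length format is a stand-in, not the format used by CVS.
Run = Tuple[float, int]


def rle_encode(xs: List[float]) -> List[Run]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Run] = []
    for x in xs:
        if runs and runs[-1][0] == x:
            runs[-1] = (x, runs[-1][1] + 1)
        else:
            runs.append((x, 1))
    return runs


def lp_norm(runs: List[Run], p: float = 2.0) -> float:
    """lp-norm computed per run: each run contributes |v|^p * run_length."""
    return sum(abs(v) ** p * n for v, n in runs) ** (1.0 / p)


def dot(a: List[Run], b: List[Run]) -> float:
    """Dot product of two equal-length vectors, walking both run lists in step."""
    i = j = 0
    rem_a = a[0][1] if a else 0  # elements left in the current run of a
    rem_b = b[0][1] if b else 0  # elements left in the current run of b
    total = 0.0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)  # overlap of the two current runs
        total += a[i][0] * b[j][0] * step
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return total


x = rle_encode([0, 0, 0, 2, 2, 1])
y = rle_encode([1, 1, 3, 3, 3, 3])
print(lp_norm(x, 2), dot(x, y))
```

In this stand-in format, the cost of both operations depends on the number of runs rather than the number of elements, which is one way a compressed representation can help with dense data that contains repeated values.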

[Reasons for the award]

 This paper proposes a data mining framework that supports basic vector operations, such as norm and dot product, while keeping the data compressed. The proposed method can be applied to both sparse and dense datasets and is expected to be used in a variety of applications. It is considered important because large volumes of vector data are being produced today.

Masafumi Oyamada

He received the M.E. and Ph.D. degrees from the University of Tsukuba, Japan. He is currently a Principal Researcher of Data Science Research Laboratories, NEC Corporation. His research interests include data integration, information extraction, and query optimization. He is a member of IPSJ, AAAI, and the Database Society of Japan (DBSJ).

Jianquan Liu

He received the M.E. and Ph.D. degrees from the University of Tsukuba, Japan, in 2009 and 2012, respectively. He joined NEC Corporation in 2012 and is currently a Principal Researcher at Biometrics Research Laboratories. He is also an Adjunct Assistant Professor at Hosei University, Japan. His research interests include multimedia databases, data mining, and information retrieval. He is serving or has served as an Associate Editor of IEEE MultiMedia and Journal of Information Processing, the General Co-chair of IEEE MIPR’21, and the PC Co-chair for a series of IEEE conferences including ICME’20, BigMM’19, ISM’18, and ICSC’17. He is a member of IEEE, ACM, IPSJ, APSIPA, and DBSJ.

Shinji Ito

He is a researcher at Data Science Research Laboratories, NEC Corporation. He received his BSc, MA and PhD from the University of Tokyo in 2013, 2015 and 2020, respectively. His research interests include mathematical optimization, machine learning, and numerical analysis.

Kazuyo Narita

She is a senior research engineer at dotData, Inc. She works on the development of data analytics products, from design to system development. Her research interests include feature engineering, machine learning, and distributed systems. She received an MEng degree in Science and Engineering from the University of Tsukuba in 2006.

Takuya Araki

He received the B.E., M.E., and Ph.D. degrees from the University of Tokyo, Japan, in 1994, 1996, and 1999, respectively. He was a visiting researcher at Argonne National Laboratory from 2003 to 2004. He is currently a Senior Principal Researcher at Data Science Research Laboratories, NEC Corporation. His research interests include programming languages, parallel and distributed computing, big data analytics, databases, and the application of HPC to AI/machine learning. He was a director of the Information Processing Society of Japan from 2017 to 2018.

Hiroyuki Kitagawa

He received the B.Sc. degree in physics and the M.Sc. and Dr. Sc. degrees in computer science, all from the University of Tokyo, in 1978, 1980, and 1987, respectively. He is currently a full professor at the International Institute for Integrative Sleep Medicine, University of Tsukuba. After working as a researcher at NEC Corporation, he joined the University of Tsukuba in 1988 and has been a full professor since 1998. His areas of research include databases, big data, data mining, data integration, information retrieval, and data analysis in medical and scientific domains. He is an IPSJ Fellow and an IEICE Fellow.