Graph sketching-based Space-efficient Data Clustering
hal.structure.identifier | ||
dc.contributor.author | Morvan, Anne | |
hal.structure.identifier | ||
dc.contributor.author | Choromanski, Krzysztof | |
hal.structure.identifier | ||
dc.contributor.author | Gouy-Pailler, Cedric
HAL ID: 6827 ORCID: 0000-0003-1298-7845 | |
hal.structure.identifier | Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision [LAMSADE] | |
dc.contributor.author | Atif, Jamal
HAL ID: 15689 | |
dc.date.accessioned | 2020-06-09T13:53:39Z | |
dc.date.available | 2020-06-09T13:53:39Z | |
dc.date.issued | 2018 | |
dc.identifier.uri | https://basepub.dauphine.fr/handle/123456789/20861 | |
dc.language.iso | en | en |
dc.subject | space constraints | en |
dc.subject | resources-limited mobile devices | en |
dc.subject | DBMSTClu | en |
dc.subject | clustering partition | en |
dc.subject | Spectral Clustering method | en |
dc.subject | data cluster | en |
dc.subject.ddc | 005 | en |
dc.title | Graph sketching-based Space-efficient Data Clustering | en |
dc.type | Conference paper | |
dc.description.abstracten | In this paper, we address the problem of recovering arbitrarily shaped data clusters from datasets under tight space constraints, as is the case in many real-world applications where analysis algorithms are deployed directly on the resource-limited mobile devices that collect the data. We present DBMSTClu, a new space-efficient, density-based, non-parametric method that works on a Minimum Spanning Tree (MST) recovered from a limited number of linear measurements, i.e., a sketched version of the dissimilarity graph between the N objects to cluster. Unlike the k-means, k-medians or k-medoids algorithms, it does not fail to distinguish clusters with particular shapes, thanks to the ability of the MST to express the underlying structure of a graph. No input parameter is needed, in contrast to DBSCAN or the Spectral Clustering method. An approximate MST is retrieved by following the dynamic semi-streaming model: the dissimilarity graph is handled as a stream of edge weight updates, which is sketched in one pass over the data into a compact structure requiring O(N polylog(N)) space, far better than the theoretical O(N^2) memory cost of the full dissimilarity graph. Taking the recovered approximate MST as input, DBMSTClu then successfully detects the right number of nonconvex clusters by performing relevant cuts on the tree in a time linear in N. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state of the art on several datasets. | en |
dc.identifier.citationpages | 10-18 | en |
dc.relation.ispartoftitle | Proceedings of the 2018 SIAM International Conference on Data Mining | en |
dc.relation.ispartofeditor | Ester, Martin | |
dc.relation.ispartofeditor | Pedreschi, Dino | |
dc.relation.ispartofpublname | SIAM - Society for Industrial and Applied Mathematics | en |
dc.relation.ispartofpublcity | Philadelphia | en |
dc.relation.ispartofpages | 764 | en |
dc.relation.ispartofurl | 10.1137/1.9781611975321 | en |
dc.subject.ddclabel | Programming, software, data organization | en |
dc.relation.ispartofisbn | 978-1-61197-532-1 | en |
dc.relation.conftitle | 2018 SIAM International Conference on Data Mining | en |
dc.relation.confdate | 2018-05 | |
dc.relation.confcity | San Diego | en |
dc.relation.confcountry | United States | en |
dc.relation.forthcoming | no | en |
dc.identifier.doi | 10.1137/1.9781611975321.2 | en |
dc.description.ssrncandidate | no | en |
dc.description.halcandidate | no | en |
dc.description.readership | research | en |
dc.description.audience | International | en |
dc.relation.Isversionofjnlpeerreviewed | no | en |
dc.date.updated | 2020-06-09T13:49:35Z | |
hal.author.function | aut | |
hal.author.function | aut | |
hal.author.function | aut | |
hal.author.function | aut |
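
The abstract's central idea, that cutting edges of a minimum spanning tree can separate nonconvex clusters, can be illustrated with a minimal sketch. This is not DBMSTClu itself: the record does not give its density-based cut criterion, so the sketch below substitutes a plain weight threshold (a hypothetical `threshold` parameter, which the real method does not need), and it builds an exact MST with Prim's algorithm rather than an approximate one from a graph sketch. All function names are illustrative assumptions.

```python
import math


def ring(radius, k):
    """k points evenly spaced on a circle of the given radius (toy data)."""
    return [(radius * math.cos(2 * math.pi * i / k),
             radius * math.sin(2 * math.pi * i / k)) for i in range(k)]


def prim_mst(points):
    """Exact MST of the complete Euclidean graph, as (weight, u, v) edges.

    Distances are computed on the fly, so no N x N matrix is stored.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connecting edge.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def cut_clusters(n, edges, threshold):
    """Drop MST edges heavier than `threshold` (an assumed stand-in for the
    paper's cut criterion) and label the remaining connected components."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for w, u, v in edges:
        if w <= threshold:
            parent[find(u)] = find(v)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Two concentric rings: a nonconvex layout that centroid-based methods split badly.
points = ring(1.0, 12) + ring(5.0, 12)
labels = cut_clusters(len(points), prim_mst(points), threshold=3.0)
```

On this toy input the only MST edge longer than the threshold is the single bridge between the rings, so removing it leaves one component per ring.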