hal.structure.identifier: Laboratoire de Recherche en Informatique [LRI]
dc.contributor.author: Baazizi, Mohamed-Amine; HAL ID: 13062 (*)
hal.structure.identifier: Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision [LAMSADE]
dc.contributor.author: Colazzo, Dario (*)
hal.structure.identifier: Dipartimento di Informatica [Pisa]
dc.contributor.author: Ghelli, Giorgio (*)
hal.structure.identifier: Dipartimento di Matematica Informatica ed Economia [DiMIE]
dc.contributor.author: Sartiani, Carlo (*)
dc.date.accessioned: 2019-09-30T14:00:37Z
dc.date.available: 2019-09-30T14:00:37Z
dc.date.issued: 2019
dc.identifier.issn: 1066-8888
dc.identifier.uri: https://basepub.dauphine.fr/handle/123456789/19935
dc.language.iso: en (en)
dc.subject: JSON (en)
dc.subject: Schema inference (en)
dc.subject: Map-reduce (en)
dc.subject: Spark (en)
dc.subject: Big data collections (en)
dc.subject.ddc: 005 (en)
dc.title: Parametric schema inference for massive JSON datasets (en)
dc.type: Article accepted for publication or published
dc.description.abstract: In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this offers several advantages, the absence of schema information has important negative consequences as well: data analysts and programmers cannot rely on a schema for a dependable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability. (en)
dc.relation.isversionofjnlname: The VLDB Journal
dc.relation.isversionofjnlvol: 28 (en)
dc.relation.isversionofjnlissue: 4 (en)
dc.relation.isversionofjnldate: 2019-08
dc.relation.isversionofjnlpages: 497-521 (en)
dc.relation.isversionofdoi: 10.1007/s00778-018-0532-7 (en)
dc.relation.isversionofjnlpublisher: Springer (en)
dc.subject.ddclabel: Programming, software, data organization (en)
dc.relation.forthcoming: no (en)
dc.relation.forthcomingprint: no (en)
dc.description.ssrncandidate: no (en)
dc.description.halcandidate: yes (en)
dc.description.readership: research (en)
dc.description.audience: International (en)
dc.relation.Isversionofjnlpeerreviewed: yes (en)
dc.date.updated: 2019-09-27T12:15:42Z
hal.identifier: hal-02301677 (*)
hal.version: 1 (*)
hal.update.action: updateMetadata (*)
hal.update.action: updateFiles (*)
hal.author.function: aut
hal.author.function: aut
hal.author.function: aut
hal.author.function: aut

