Tesi di Dottorato (Doctoral Theses)
Search Results
Item Getting knowledge from presentation-oriented documents (2014-03-10) Oro, Ermelinda; Saccà, Domenico; Ruffolo, Massimo; Palopoli, Luigi

Item Service-Oriented workflows for distributed data mining applications (2014-03-07) Lackovic, Marco; Talia, Domenico; Palopoli, Luigi

Item Tools and methods for engine control systems development and validation (2014-03-07) De Cristofaro, Ferdinando; Casavola, Alessandro; Palopoli, Luigi

Item Metodi di rilevazione ed isolamento guasti per sistemi LPV ed ibridi [Fault detection and isolation methods for LPV and hybrid systems] (2014-03-07) Gagliardi, Gianfranco; Casavola, Alessandro; Palopoli, Luigi; Famularo, Domenico

Item Data warehousing and mining on open ended data (2014-03-06) Russo, Vincenzo; Saccà, Domenico; Masciari, Elio; Palopoli, Luigi

Item Mining imprecise data using domain knowledge (2014-03-06) Ritacco, Ettore; Saccà, Domenico; Manco, Giuseppe; Palopoli, Luigi

Item Delivery the common sense of the time to synchronize measurement instruments co-operating into the DMS (2014-03-06) Lamonaca, Francesco; Grimaldi, Domenico; Palopoli, Luigi

This research investigates the coordination in the time domain of the operations executed by the Measurement Instruments (MIs) connected to the nodes of a Distributed Measurement System (DMS) through a Hardware Interface (HI). Thanks to the node-clock synchronization procedures presented in the literature, the HIs can operate in a synchronized modality. Nevertheless, the hardware and software architecture of the HI-MI communication path can delay commands by a random amount of time. Often a PC is used as HI in the DMS. In this case two different aspects are considered: (i) the characterization of the PC-MI hardware connection offering the minimum time delay, and (ii) the criteria for modifying and configuring the software to reduce the random delay introduced by the processing steps performed in the PC. The results show that the main cause of random delay is the concurrency of the processes running on the PC; in particular, the delay depends on (i) the number of concurrent processes, (ii) their priorities, and (iii) the behavior of the kernel managing the concurrency. To overcome this delay, an HI based on a Programmable Logic Device (PLD) is proposed in place of the PC. The analysis of the operations executed on the PLD identifies the operating conditions that cause random variation of the synchronization time delay between the MI and the node clock. To trace the causes of this variation, the polling cycle is analyzed, and a model of the uncertainty affecting the synchronization time delay is derived in order to evaluate the contribution of each cause affecting the trigger check. This evaluation provides (i) information for choosing an adequate strategy to reduce the random variation in the detection of the trigger condition, and (ii) the requirements for avoiding the polling cycle altogether. Experimental tests performed with the implemented embedded HI confirm that it achieves sub-microsecond synchronization accuracy, in accordance with the IEEE 1588 standard. The research also addresses the problem of stand-alone MIs: in some cases synchronized measurement procedures must be performed in places where the nodes of the DMS cannot be reached, or cannot conveniently be reached, by wired or wireless connections.
Standard synchronization protocols therefore cannot be used to synchronize these MIs. In such cases, the use of a Personal Digital Assistant (PDA) is proposed to physically carry the common sense of time to the stand-alone MI. To achieve synchronization accuracy on the order of sub-microseconds, the joint use of the PDA and the proposed embedded HI is presented. This solution combines the advantages of both devices: the PDA offers the benefits of high-level programming languages for interfacing with the MI and collecting data, while the embedded HI guarantees high synchronization accuracy thanks to the deterministic behavior of the proposed hardware architecture. Experimental tests confirm that the proposed solution meets the requirements of the IEEE 1588 standard.
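The abstract above attributes the residual random variation of the synchronization delay largely to the polling cycle used to check the trigger condition. As a purely illustrative sketch (not the uncertainty model or the embedded HI developed in the thesis), the following Python snippet simulates how the polling period alone bounds the trigger-detection delay: with periodic polling the delay spreads roughly uniformly over one polling period, which is why sub-microsecond accuracy in the IEEE 1588 sense calls for either a very short polling period or, as the thesis proposes, avoiding the polling cycle altogether. All function names and numeric values below are hypothetical.

```python
import random

def simulate_trigger_jitter(poll_period_s, n_events=100_000, seed=0):
    """Monte Carlo sketch: an asynchronous trigger event is only noticed at the
    next polling instant, so its detection delay spreads over one polling period."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n_events):
        event_time = rng.uniform(0.0, 1.0)           # event somewhere in a 1 s window
        k = int(event_time // poll_period_s) + 1     # index of the next polling instant
        delays.append(k * poll_period_s - event_time)
    return sum(delays) / len(delays), max(delays)

if __name__ == "__main__":
    for period in (10e-6, 1e-6, 100e-9):             # hypothetical 10 us, 1 us, 100 ns polling
        mean_d, worst_d = simulate_trigger_jitter(period)
        print(f"poll period {period * 1e6:8.3f} us -> "
              f"mean delay {mean_d * 1e6:6.3f} us, worst {worst_d * 1e6:6.3f} us")
```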
Item Overcoming uncertainty and the curse of dimensionality in data clustering (2014-03-05) Gullo, Francesco; Greco, Sergio; Palopoli, Luigi

Uncertainty and the curse of dimensionality are two crucial problems that usually affect data clustering. Uncertainty in data clustering may typically be considered at the data level or at the clustering level. Data-level uncertainty is inherently present in the representation of several kinds of data objects from various application contexts (e.g., sensor networks, moving-object databases, biomedicine). This kind of uncertainty should be carefully taken into account in a clustering task in order to achieve adequate accuracy; unfortunately, traditional clustering methods are designed to work only on deterministic vectorial representations of data objects. Clustering uncertainty is related to the output of any clustering algorithm. Indeed, the ill-posed nature of clustering leads to clustering algorithms that cannot be generally valid for any input dataset, i.e., their output results are necessarily uncertain. Clustering ensembles has been recognized as a powerful solution to overcome clustering uncertainty. It aims to derive a single clustering solution (i.e., the consensus partition) from a set of clusterings representing different partitions of the same input dataset (i.e., the ensemble). A major weakness of existing clustering ensembles methods is that they compute the consensus partition by considering all the solutions in the ensemble equally. The curse of dimensionality in data clustering concerns all the issues that naturally arise from data objects represented by a large set of features and are responsible for the poor accuracy and efficiency achieved by traditional clustering methods on high-dimensional data. Classic approaches to the curse of dimensionality include global and local dimensionality reduction. Global techniques aim at reducing the dimensionality of the input dataset by applying the same algorithm(s) to all the input data objects. Local dimensionality reduction considers subsets of the input dataset and performs dimensionality reduction specific to each of such subsets. Projective clustering is an effective class of methods falling into the category of local dimensionality reduction. It aims to discover clusters of objects along with the corresponding subspaces, following the principle that objects in the same cluster are close to each other if and only if they are projected onto the subspace associated with that cluster. The focus of this thesis is on the development of proper techniques for overcoming the crucial problems of uncertainty and the curse of dimensionality arising in data clustering.

This thesis provides the following main contributions.

Uncertainty. Uncertainty at the representation level is addressed by proposing:
- UK-medoids, a new partitional algorithm for clustering uncertain objects, designed to overcome the efficiency and accuracy issues of some existing state-of-the-art methods;
- U-AHC, the first (agglomerative) hierarchical algorithm for clustering uncertain objects;
- a methodology that exploits U-AHC for clustering microarray biomedical data with probe-level uncertainty.

Clustering uncertainty is addressed by focusing on the problem of weighted consensus clustering, which aims to automatically determine weighting schemes to discriminate among the clustering solutions in a given ensemble. In particular:
- three novel diversity-based, general schemes for weighting the individual clusterings in a given ensemble are proposed, namely Single-Weighting (SW), Group-Weighting (GW), and Dendrogram-Weighting (DW);
- three algorithms, called WICE, WCCE, and WHCE, are defined to easily incorporate clustering weighting schemes into any clustering ensembles algorithm falling into one of the main classes of clustering ensembles approaches, i.e., instance-based, cluster-based, and hybrid.

The curse of dimensionality. Global dimensionality reduction is addressed by focusing on the time series application context:
- the Derivative time series Segment Approximation (DSA) model is proposed as a new time series dimensionality reduction method designed for accurate and fast similarity detection and clustering;
- the Mass Spectrometry Data Analysis (MaSDA) system is presented; it mainly aims at analyzing mass spectrometry (MS) biomedical data by exploiting DSA to model such data according to a time-series-based representation;
- DSA is exploited for profiling low-voltage electricity customers.

Regarding local dimensionality reduction, a unified view of projective clustering and clustering ensembles is provided. In particular:
- the novel Projective Clustering Ensembles (PCE) problem is addressed and formally defined according to two specific optimization formulations, i.e., two-objective PCE and single-objective PCE;
- the MOEA-PCE and EM-PCE algorithms are proposed as novel heuristics to solve two-objective PCE and single-objective PCE, respectively.

The absolute accuracy and efficiency achieved by the proposed techniques, as well as their performance with respect to prominent state-of-the-art methods, are evaluated through extensive sets of experiments on benchmark, synthetically generated, and real-world datasets.
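UK-medoids and U-AHC both rest on comparing uncertain objects through their probability distributions rather than through single deterministic vectors, and (as a later abstract in this listing notes) such pairwise distances can be computed offline, once per dataset. The Python sketch below is only a generic illustration of that idea, assuming sample-based uncertain objects, Monte Carlo expected Euclidean distances cached in a matrix, and a plain k-medoids pass over the cache; it is not the UK-medoids algorithm from the thesis, and every function and constant in it is hypothetical.

```python
import random
from itertools import product
from math import dist

def expected_distance(samples_a, samples_b):
    """Monte Carlo estimate of the expected Euclidean distance between two
    uncertain objects, each represented by samples drawn from its own pdf."""
    pairs = list(product(samples_a, samples_b))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def distance_matrix(objects):
    """Precompute every pairwise expected distance once (the 'offline' step)."""
    n = len(objects)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = expected_distance(objects[i], objects[j])
    return d

def k_medoids(d, k, iters=20, seed=0):
    """Small PAM-style loop that only reads the cached distance matrix."""
    rng = random.Random(seed)
    n = len(d)
    medoids = rng.sample(range(n), k)
    clusters = {}
    for _ in range(iters):
        # assign every object to its closest medoid
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: d[i][m])].append(i)
        # the new medoid of a cluster minimizes the total distance to its members
        new_medoids = []
        for m, members in clusters.items():
            if members:
                new_medoids.append(min(members, key=lambda c: sum(d[c][o] for o in members)))
            else:
                new_medoids.append(m)   # keep the old medoid of an empty cluster
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, clusters

if __name__ == "__main__":
    rng = random.Random(1)
    # toy uncertain objects: 30 two-dimensional samples around one of two centres
    make = lambda cx, cy: [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for _ in range(30)]
    objects = [make(0.0, 0.0) for _ in range(5)] + [make(4.0, 4.0) for _ in range(5)]
    d = distance_matrix(objects)   # computed once, reused by every clustering pass
    medoids, clusters = k_medoids(d, k=2)
    print("medoids:", medoids)
    print("cluster sizes:", [len(v) for v in clusters.values()])
```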
Item Advances in mining complex data: modeling and clustering (2009) Ponti, Giovanni; Greco, Sergio; Palopoli, Luigi

In the last years there has been a great production of data coming from different application contexts. However, although technological progress provides several facilities to digitally encode any type of event, it is important to define a suitable representation model which captures the main characteristics of the data. This aspect is particularly relevant in fields and contexts where the data to be archived cannot be represented by a fixed structured scheme, or cannot be described by simple numerical values. We hereinafter refer to these data with the term complex data. Although it is important to define ad-hoc representation models for complex data, it is also crucial to have analysis systems and data exploration techniques. Analysts and system users need new instruments that support them in the extraction of patterns and relations hidden in the data.

The entire process that aims to extract useful information and knowledge starting from raw data is known as Knowledge Discovery in Databases (KDD). It starts from raw data and consists of a set of specific phases that transform and manage data to produce models and knowledge. Many knowledge extraction techniques exist for traditional structured data, but they are not suitable for handling complex data. Investigating and solving representation problems for complex data, and defining proper algorithms and techniques to extract models, patterns, and new information from such data in an effective and efficient way, are the main challenges this thesis aims to face. In particular, two main aspects related to complex data management have been investigated: the way in which complex data can be modeled (i.e., data modeling), and the way in which homogeneous groups within complex data can be identified (i.e., data clustering). The application contexts that have been the object of these studies are time series data, uncertain data, text data, and biomedical data. The research contributions of this thesis can be divided into four main parts, each of which concerns one specific area and data type:

Time Series — A time series representation model has been developed, conceived to support accurate and fast similarity detection. This model is called Derivative time series Segment Approximation (DSA), as it achieves a concise yet feature-rich time series representation by combining the notions of derivative estimation, segmentation, and segment approximation (a toy sketch of this pipeline appears at the end of this listing).

Uncertain Data — Research in uncertain data mining went in two directions. In a first phase, a new proposal for partitional clustering has been defined by introducing the Uncertain K-medoids (UK-medoids) algorithm. This approach provides a more accurate way to handle uncertain objects in a clustering task, since a cluster representative is an uncertain object itself (and not a deterministic one). In addition, the efficiency issue has been addressed by defining a distance function between uncertain objects that can be calculated offline, once per dataset. In a second phase, research activities investigated issues related to hierarchical clustering of uncertain data. An agglomerative centroid-based linkage hierarchical clustering framework for uncertain data (U-AHC) has therefore been proposed. The key point lies in equipping such a scheme with a more accurate distance measure for uncertain objects. Indeed, the information-theory field has been drawn upon to find a measure able to compare the probability distributions used to model the uncertainty of uncertain objects.

Text Data — Research results on text data can be summarized in two main contributions. The first one regards clustering of multi-topic documents: a framework for hard clustering of documents according to their mixtures of topics has been proposed. Documents are assumed to be modeled by a generative process, which provides a mixture of probability mass functions (pmfs) to model the topics discussed within any specific document. The framework combines the expressiveness of generative models for document representation with a properly chosen information-theoretic distance measure to group the documents. The second proposal concerns distributional clustering of XML documents, focusing on the development of a distributed framework for efficiently clustering XML documents.
The distributed environment consists of a peer-to-peer network where each node has access to a portion of the whole document collection and communicates with all the other nodes to perform a clustering task in a collaborative fashion. The proposed framework is based on modeling and clustering XML documents by structure and content. Indeed, XML documents are transformed into transactional data based on the notion of tree tuple. The framework relies on the well-known paradigm of centroid-based partitional clustering to conceive the distributed, transactional clustering algorithm.

Biomedical Data — Research results on time series and uncertain data have been applied to support effective and efficient biomedical data management. The focus regarded both proteomics and genomics, investigating Mass Spectrometry (MS) data and microarray data. In particular, a Mass Spectrometry Data Analysis (MaSDA) system has been defined. The key idea consists in exploiting the temporal information implicitly contained in MS data and modeling such data as time series. The major advantages of this solution are dimensionality and noise reduction. As regards microarray data, U-AHC has been employed to perform clustering of microarray data with probe-level uncertainty. A strategy to model probe-level uncertainty has been defined, together with a hierarchical clustering scheme for analyzing such data. This approach performs a gene-based clustering to discover clustering solutions that are well suited to capture the underlying gene-based patterns of microarray data.

The effectiveness and the efficiency of the proposed techniques in clustering complex data are demonstrated by performing intense and exhaustive experiments, in which these proposals are extensively compared with the main state-of-the-art competitors.

Item On the Computational Complexity of solution concepts in compact coalitional games (2014-03-05) Malizia, Enrico; Palopoli, Luigi; Scarcello, Francesco
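Both the Gullo and the Ponti abstracts above introduce the Derivative time series Segment Approximation (DSA) model as a combination of derivative estimation, segmentation, and segment approximation. The toy Python sketch below (referenced from the Time Series contribution above) only mirrors that general pipeline under simple assumptions of its own: first differences as the derivative estimate, fixed-width segments, and per-segment mean slopes as the approximation. The actual estimation, segmentation, and approximation rules of DSA are defined in the theses and are not reproduced here.

```python
def derivative(series):
    """First-difference estimate of the derivative of a numeric sequence."""
    return [b - a for a, b in zip(series, series[1:])]

def segment(values, width):
    """Split a sequence into consecutive chunks of (at most) `width` points."""
    return [values[i:i + width] for i in range(0, len(values), width)]

def dsa_like(series, width=4):
    """Toy derivative/segmentation/approximation pipeline: each segment of the
    derivative sequence is summarized by a (mean slope, segment length) pair."""
    return [(sum(seg) / len(seg), len(seg)) for seg in segment(derivative(series), width)]

if __name__ == "__main__":
    ts = [0, 1, 2, 4, 7, 11, 11, 11, 10, 8, 5, 1]
    print(dsa_like(ts))   # a short sequence of (slope, length) pairs instead of 12 raw points
```

The point of the sketch is only the shape of the representation: a raw series is replaced by a much shorter sequence of derivative-based features, which is the general sense in which DSA trades raw points for a concise yet feature-rich description.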