IEEE 2014 / 13 - Data mining Projects


IEEE 2014: A Generic Framework for Top-k Pairs and Top-k Objects Queries over Sliding Windows


IEEE 2014 Transactions on Knowledge and Data Engineering

Top-k pairs and top-k objects queries have received significant attention from the research community. In this paper, we present the first approach to answer a broad class of top-k pairs and top-k objects queries over sliding windows. Our framework handles multiple top-k queries, and each query is allowed to use a different scoring function, a different value of k, and a different size of the sliding window. Furthermore, the framework allows users to define arbitrarily complex scoring functions and supports out-of-order data streams. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower-bound cost. For top-k pairs queries, we demonstrate the efficiency of our approach by comparing it with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower-bound cost. For top-k objects queries, our experimental results demonstrate the superiority of our algorithm over the state-of-the-art algorithm.
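For readers who want a concrete starting point, the following Python sketch answers multiple top-k objects queries, each with its own k, window size, and scoring function, by keeping every object alive for the largest window and ranking on demand. This is only a naive baseline illustrating the query semantics, not the paper's K-skyband technique; all names and numbers are illustrative.

import heapq
from collections import deque

class SlidingTopK:
    def __init__(self, max_window):
        self.max_window = max_window      # largest window among all registered queries
        self.buffer = deque()             # (timestamp, object_id, attributes)

    def insert(self, ts, obj_id, attrs):
        self.buffer.append((ts, obj_id, attrs))
        # expire objects that have left even the largest window
        while self.buffer and self.buffer[0][0] <= ts - self.max_window:
            self.buffer.popleft()

    def top_k(self, now, k, window, score):
        # each query may use its own k, window size and scoring function
        candidates = [(score(attrs), obj_id)
                      for ts, obj_id, attrs in self.buffer
                      if ts > now - window]
        return heapq.nlargest(k, candidates)

stream = SlidingTopK(max_window=100)
for t in range(120):
    stream.insert(t, "o%d" % t, {"price": t % 7, "rating": (t * 3) % 11})
print(stream.top_k(now=119, k=3, window=50, score=lambda a: a["rating"]))
print(stream.top_k(now=119, k=2, window=100, score=lambda a: -a["price"]))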


IEEE 2014: Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme

IEEE 2014 Transactions on Knowledge and Data Engineering
Abstract—A shortest distance query between two nodes is a fundamental operation in large-scale networks. Most existing methods in the literature take a landmark embedding approach, which selects a set of graph nodes as landmarks and computes the shortest distances from each landmark to all nodes as an embedding. To handle a shortest distance query between two nodes, the precomputed distances from the landmarks to the query nodes are used to compute an approximate shortest distance based on the triangle inequality.
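The landmark-embedding idea the abstract summarizes can be sketched in a few lines: precompute distances from a handful of landmark nodes and answer a query with the triangle-inequality bound dist(u, v) <= d(l, u) + d(l, v), minimized over landmarks. The graph, landmark choice, and BFS distances below are illustrative assumptions, not the paper's query-dependent local landmark scheme.

from collections import deque

def bfs_distances(graph, source):
    # unweighted shortest distances from one landmark to all reachable nodes
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def approx_distance(embedding, u, v):
    # upper bound from the triangle inequality, minimized over all landmarks
    return min(d[u] + d[v] for d in embedding.values() if u in d and v in d)

graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
landmarks = [0, 4]
embedding = {l: bfs_distances(graph, l) for l in landmarks}
print(approx_distance(embedding, 1, 3))   # exact distance is 2; the landmark bound here is 4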



IEEE 2014: CoRE: A Context-Aware Relation Extraction Method for Relation Completion


IEEE 2014 Transactions on Knowledge and Data Engineering

Abstract—We identify Relation Completion (RC) as one recurring problem that is central to the success of novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation R, RC attempts to link entity pairs between two entity lists under the relation R. To accomplish the RC goals, we propose to formulate search queries for each query entity _ based on some auxiliary information, so as to detect its target entity _ from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns may decrease the probability of finding suitable target entities. As an alternative, we propose the CoRE method, which uses context terms learned from the surroundings of a relation's expression as the auxiliary information in formulating queries. The experimental results based on several real-world web data collections demonstrate that CoRE reaches a much higher accuracy than PaRE for the purpose of RC.


IEEE 2014: Efficient Ranking on Entity Graphs with Personalized Relationships

IEEE 2014 Transactions on Knowledge and Data Engineering

Abstract—Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-relationship graphs. There are two main ways to personalize authority flow ranking: node-based personalization, where authority originates from a set of user-specific nodes, and edge-based personalization, where the importance of different edge types is user-specific. We propose the first approach to achieve efficient edge-based personalization using a combination of precomputation and runtime algorithms. In particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV) assigns a different weight to each edge type or relationship type. Our approach includes a repository of rankings for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox is formulated as a distance minimization problem at the schema level; (b) DataApprox is a distance minimization problem at the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial edge types based on the edge distribution in the data graph. In contrast, DataApprox has a provable error bound. Both SchemaApprox and DataApprox are expensive, so we develop efficient heuristic implementations, ScaleRank and PickOne respectively. Extensive experiments on the DBLP data graph show that ScaleRank provides a fast and accurate personalized authority flow ranking.
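A minimal power-iteration sketch of edge-based personalization, assuming an ObjectRank-style setting where each edge type carries a user-chosen weight (the WAV); the edge types, weights, and graph below are made up, and the paper's ScaleRank and PickOne heuristics are not reproduced here.

import numpy as np

def object_rank(n, edges, wav, base_nodes, damping=0.85, iters=100):
    # edges: list of (src, dst, edge_type); wav: {edge_type: weight}
    W = np.zeros((n, n))
    for src, dst, etype in edges:
        W[dst, src] += wav[etype]                  # authority flows src -> dst
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0                  # avoid division by zero
    W = W / col_sums                               # column-stochastic transfer matrix
    base = np.zeros(n)
    base[base_nodes] = 1.0 / len(base_nodes)       # node-based personalization vector
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * (W @ r) + (1 - damping) * base
    return r

edges = [(0, 1, "cites"), (1, 2, "cites"), (2, 0, "authored_by"), (0, 2, "authored_by")]
print(object_rank(4, edges, wav={"cites": 1.0, "authored_by": 0.3}, base_nodes=[0]))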

IEEE 2014: Secure Mining of Association Rules in Horizontally Distributed Databases
IEEE 2014 Transactions on Knowledge and Data Engineering
Abstract—We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms — one that computes the union of private subsets that each of the interacting players holds, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol in [18]. In addition, it is simpler and is significantly more efficient in terms of communication rounds, communication cost and computational cost.

IEEE 2014: Facilitating Document Annotation using Content and Querying Value

IEEE 2014 Transactions on Knowledge and Data Engineering

Abstract—A large number of organizations today generate and share textual descriptions of their products, services, and actions. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction algorithms facilitate the extraction of structured relations, they are often expensive and inaccurate, especially when operating on top of text that does not contain any instances of the targeted structured information. We present a novel alternative approach that facilitates the generation of structured metadata by identifying documents that are likely to contain information of interest, information that will subsequently be useful for querying the database. Our approach relies on the idea that humans are more likely to add the necessary metadata during creation time, if prompted by the interface; or that it is much easier for humans (and/or algorithms) to identify the metadata when such information actually exists in the document, instead of naively prompting users to fill in forms with information that is not available in the document. As a major contribution of this paper, we present algorithms that identify structured attributes that are likely to appear within the document, by jointly utilizing the content of the text and the query workload. Our experimental evaluation shows that our approach generates superior results compared to approaches that rely only on the textual content or only on the query workload to identify attributes of interest.
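One way to picture the idea is to score each candidate attribute by combining a content signal (does the document's text look like it mentions the attribute?) with a querying-value signal (how often does the workload ask for it?). The term lists, workload format, and multiplicative combination below are hypothetical illustrations, not the paper's estimator.

import re
from collections import Counter

def attribute_scores(doc_text, attribute_terms, query_workload):
    doc_tokens = set(re.findall(r"[a-z0-9]+", doc_text.lower()))
    query_freq = Counter(attr for query in query_workload for attr in query)
    total_queries = max(len(query_workload), 1)
    scores = {}
    for attr, terms in attribute_terms.items():
        content_value = sum(t in doc_tokens for t in terms) / len(terms)   # content evidence
        querying_value = query_freq[attr] / total_queries                  # workload evidence
        scores[attr] = content_value * querying_value
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

doc = "Dual-core CPU with 4 GB RAM, 13 inch display, ships in black"
attribute_terms = {"cpu": ["cpu", "ghz", "core"],
                   "ram": ["ram", "gb", "memory"],
                   "warranty": ["warranty", "guarantee"]}
workload = [["cpu", "ram"], ["ram"], ["warranty", "cpu"]]
print(attribute_scores(doc, attribute_terms, workload))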

IEEE 2014: An Empirical Performance Evaluation of Relational Keyword Search Systems

IEEE 2014 Transactions on Knowledge and Data Engineering

Abstract—In the past decade, extending the keyword search paradigm to relational data has been an active area of research within the database and information retrieval (IR) community. A large number of approaches have been proposed and implemented, but despite numerous publications, there remains a severe lack of standardization for system evaluations. This lack of standardization has resulted in contradictory results from different evaluations, and the numerous discrepancies muddle what advantages are proffered by different approaches. In this paper, we present a thorough empirical performance evaluation of relational keyword search systems. Our results indicate that many existing search techniques do not provide acceptable performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques from scaling beyond small datasets with tens of thousands of vertices. We also explore the relationship between execution time and factors varied in previous evaluations; our analysis indicates that these factors have relatively little impact on performance. In summary, our work confirms previous claims regarding the unacceptable performance of these systems and underscores the need for standardization—as exemplified by the IR community—when evaluating these retrieval systems.


IEEE 2013: SUSIE: Search Using Services and Information Extraction

IEEE 2013 Transactions on Knowledge and Data Engineering 

Abstract—The API of a Web service restricts the types of queries that the service can answer. For example, a Web service might provide a method that returns the songs of a given singer, but it might not provide a method that returns the singers of a given song. If the user asks for the singer of some specific song, then the Web service cannot be called – even though the underlying database might have the desired piece of information. This asymmetry is particularly problematic if the service is used in a Web service orchestration system. In this paper, we propose to use on-the-fly information extraction to collect values that can be used as parameter bindings for the Web service. We show how this idea can be integrated into a Web service orchestration system. Our approach is fully implemented in a prototype called SUSIE. We present experiments with real-life data and services to demonstrate the practical viability and good performance of our approach.


IEEE 2013: A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data

IEEE 2013 Transactions on Knowledge and Data Engineering

Abstract—Feature selection involves identifying a subset of the most useful features that produces results comparable to those of the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning-tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST and several representative feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text datasets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
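A simplified sketch of the two-step FAST idea, assuming symmetric uncertainty as the correlation measure and SciPy/scikit-learn for the minimum spanning tree and mutual information; the edge-cutting rule is a compressed version of the paper's criteria, and the toy data is invented.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from sklearn.metrics import mutual_info_score

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def symmetric_uncertainty(x, y):
    hx, hy = entropy(x), entropy(y)
    return 0.0 if hx + hy == 0 else 2.0 * mutual_info_score(x, y) / (hx + hy)

def fast_like_selection(X, y):
    d = X.shape[1]
    relevance = np.array([symmetric_uncertainty(X[:, i], y) for i in range(d)])
    su = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            su[i, j] = su[j, i] = symmetric_uncertainty(X[:, i], X[:, j])
    weights = 1.0 - su                              # low weight = strongly correlated
    np.fill_diagonal(weights, 0.0)
    mst = minimum_spanning_tree(weights).toarray()
    keep = np.zeros_like(mst)
    for i, j in zip(*np.nonzero(mst)):
        # keep an MST edge only if the two features are at least as correlated
        # with each other as the weaker of them is with the class
        if su[i, j] >= min(relevance[i], relevance[j]):
            keep[i, j] = 1
    _, labels = connected_components(keep, directed=False)
    # from each cluster, keep the feature most relevant to the class
    return sorted(int(np.argmax(np.where(labels == c, relevance, -1.0)))
                  for c in np.unique(labels))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
noisy_copy = np.where(rng.random(200) < 0.1, 1 - y, y)   # redundant with the first feature
X = np.column_stack([y, noisy_copy, rng.integers(0, 2, 200), rng.integers(0, 3, 200)])
print(fast_like_selection(X, y))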


IEEE 2013: Facilitating Document Annotation using Content and Querying Value

IEEE 2013 Transactions on Knowledge and Data Engineering

Abstract—A large number of organizations today generate and share textual descriptions of their products, services, and actions. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction algorithms facilitate the extraction of structured relations, they are often expensive and inaccurate, especially when operating on top of text that does not contain any instances of the targeted structured information. We present a novel alternative approach that facilitates the generation of structured metadata by identifying documents that are likely to contain information of interest, information that will subsequently be useful for querying the database. Our approach relies on the idea that humans are more likely to add the necessary metadata during creation time, if prompted by the interface; or that it is much easier for humans (and/or algorithms) to identify the metadata when such information actually exists in the document, instead of naively prompting users to fill in forms with information that is not available in the document. As a major contribution of this paper, we present algorithms that identify structured attributes that are likely to appear within the document, by jointly utilizing the content of the text and the query workload. Our experimental evaluation shows that our approach generates superior results compared to approaches that rely only on the textual content or only on the query workload to identify attributes of interest.


IEEE 2013: A Web Usage Mining Approach Based On New Technique In Web Path Recommendation Systems

IEEE 2013 Transactions on Engineering Research & Technology  

The Internet is one of the fastest growing areas of intelligence gathering, and ranking web pages for search engines is a significant problem that has attracted considerable attention from the research community. Web prefetching is used to reduce the access latency of the Internet. However, if most prefetched web pages are not visited by the users in their subsequent accesses, the limited network bandwidth and server resources will not be used efficiently and may worsen the access delay problem. Therefore, it is critical to have an accurate prediction method during prefetching. To provide predictions efficiently, we advance an architecture for prediction in a Web usage mining system and propose a novel approach for classifying user navigation patterns to predict users' requests, based on clustering knowledge of users' browsing behavior. The experimental results show that the approach can improve the accuracy, precision, recall and F-measure of classification in the architecture.



IEEE 2013: PMSE: A Personalized Mobile Search Engine
IEEE 2013 Transactions on Knowledge and Data Engineering  
We propose a personalized mobile search engine (PMSE) that captures users' preferences in the form of concepts by mining their clickthrough data. Due to the importance of location information in mobile search, PMSE classifies these concepts into content concepts and location concepts. In addition, users' locations (positioned by GPS) are used to supplement the location concepts in PMSE. The user preferences are organized in an ontology-based, multifacet user profile, which is used to adapt a personalized ranking function for rank adaptation of future search results. To characterize the diversity of the concepts associated with a query and their relevance to the user's need, four entropies are introduced to balance the weights between the content and location facets. Based on the client-server model, we also present a detailed architecture and design for implementation of PMSE. In our design, the client collects and stores locally the clickthrough data to protect privacy, whereas heavy tasks such as concept extraction, training, and reranking are performed at the PMSE server. Moreover, we address the privacy issue by restricting the information in the user profile exposed to the PMSE server with two privacy parameters. We prototype PMSE on the Google Android platform. Experimental results show that PMSE significantly improves precision compared with the baseline.


IEEE 2013: Generation of Personalized Ontology Based on Consumer Emotion and Behavior Analysis

IEEE 2013 Transactions on Affective Computing

The relationships between consumer emotions and their buying behaviors have been well documented. Technology-savvy consumers often use the web to find information on products and services before they commit to buying. We propose a semantic web usage mining approach for discovering periodic web access patterns from annotated web usage logs, which incorporates information on consumer emotions and behaviors through self-reporting and behavioral tracking. We use fuzzy logic to represent real-life temporal concepts (e.g., morning) and requested resource attributes (ontological domain concepts for the requested URLs) of periodic pattern-based web access activities. These fuzzy temporal and resource representations, which contain both behavioral and emotional cues, are incorporated into a Personal Web Usage Lattice that models the user's web access activities. From this, we generate a Personal Web Usage Ontology written in OWL, which enables semantic web applications such as personalized web resources recommendation. Finally, we demonstrate the effectiveness of our approach by presenting experimental results in the context of personalized web resources recommendation with varying degrees of emotional influence. Emotional influence has been found to contribute positively to adaptation in personalized recommendation.


IEEE 2013: Identity-Based Secure Distributed Data Storage Schemes

IEEE 2013 Transactions on Computers 

Secure distributed data storage can shift the burden of maintaining a large number of files from the owner to proxy servers. Proxy servers can convert encrypted files for the owner to encrypted files for the receiver without the necessity of knowing the content of the original files. In practice, the original files will be removed by the owner for the sake of space efficiency. Hence, the issues of confidentiality and integrity of the outsourced data must be addressed carefully. In this paper, we propose two identity-based secure distributed data storage (IBSDDS) schemes. Our schemes capture the following properties: the file owner can decide the access permission independently, without the help of the private key generator (PKG); for one query, a receiver can only access one file, instead of all files of the owner; and our schemes are secure against collusion attacks, namely, even if the receiver can compromise the proxy servers, he cannot obtain the owner's secret key. Although the first scheme is only secure against chosen-plaintext attacks (CPA), the second scheme is secure against chosen-ciphertext attacks (CCA). To the best of our knowledge, these are the first IBSDDS schemes where an access permission is granted by the owner for an exact file and collusion attacks can be resisted in the standard model.
 



IEEE 2013: Ginix: Generalized Inverted Index for Keyword Search

IEEE 2013 Transactions on Knowledge and Data Mining

Keyword search has become a ubiquitous method for users to access text data in the face of information explosion. Inverted lists are usually used to index underlying documents so that documents can be retrieved efficiently according to a set of keywords. Since inverted lists are usually large, many compression techniques have been proposed to reduce the storage space and disk I/O time. However, these techniques usually perform decompression operations on the fly, which increases the CPU time. This paper presents a more efficient index structure, the Generalized INverted IndeX (Ginix), which merges consecutive IDs in inverted lists into intervals to save storage space. With this index structure, more efficient algorithms can be devised to perform basic keyword search operations, i.e., the union and the intersection operations, by taking advantage of intervals. Specifically, these algorithms do not require conversions from interval lists back to ID lists. As a result, keyword search using Ginix can be more efficient than search using traditional inverted indices. The performance of Ginix is further improved by reordering the documents in datasets using two scalable algorithms. Experiments on the performance and scalability of Ginix on real datasets show that Ginix not only requires less storage space but also improves keyword search performance compared with traditional inverted indexes.
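The core trick is easy to illustrate: store each posting list as intervals of consecutive document IDs and intersect the interval lists directly, without expanding them back to ID lists. The sketch below shows only that operation, not Ginix's full index structure or document reordering.

def to_intervals(ids):
    # convert a sorted ID list such as [1, 2, 3, 7, 8] into intervals [(1, 3), (7, 8)]
    intervals = []
    for i in ids:
        if intervals and i == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], i)
        else:
            intervals.append((i, i))
    return intervals

def intersect(a, b):
    # intersect two interval lists with a two-pointer merge
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

docs_kw1 = to_intervals([1, 2, 3, 4, 9, 10, 11])
docs_kw2 = to_intervals([3, 4, 5, 10, 11, 12])
print(intersect(docs_kw1, docs_kw2))   # [(3, 4), (10, 11)]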


IEEE 2013: An Ontology-based Framework for Context-aware Adaptive E-learning System

IEEE 2013 Transactions on Computer Communication and Informatics
In a web-based e-learning environment, every learner has a distinct background, a distinct learning style, and a specific goal when searching for learning material on the web. The goal of personalization is to tailor search results to a particular user based on that user's contextual information. The effectiveness of accessing learning material involves two important challenges: identifying the user context and modeling the user context as ontological profiles. This work describes an ontology-based framework for a context-aware adaptive learning system, with detailed discussions on the categorization and modeling of contextual information, along with the use of ontology to explicitly specify learner context in an e-learning environment. Finally, we conclude by showing the applicability of the proposed ontology with an appropriate architectural overview of the e-learning system.

IEEE 2013: ELCA Evaluation for Keyword Search on Probabilistic XML Data

IEEE 2013 Transactions on Knowledge and Data Engineering 

As probabilistic data management is becoming one of the main research focuses and keyword search is turning into a more popular query means, it is natural to think about how to support keyword queries on probabilistic XML data. With regard to keyword queries on deterministic XML documents, ELCA (Exclusive Lowest Common Ancestor) semantics allows more relevant fragments rooted at the ELCAs to appear as results and is more popular compared with other keyword query result semantics (such as SLCAs). In this paper, we investigate how to evaluate ELCA results for keyword queries on probabilistic XML documents. After defining probabilistic ELCA semantics in terms of possible world semantics, we propose an approach to compute ELCA probabilities without generating possible worlds. Then we develop an efficient stack-based algorithm that can find all probabilistic ELCA results and their ELCA probabilities for a given keyword query on a probabilistic XML document. Finally, we experimentally evaluate the proposed ELCA algorithm and compare it with its SLCA counterpart in terms of result effectiveness, time and space efficiency, and scalability.


IEEE 2013: Crowdsourcing Predictors of Behavioral Outcomes
IEEE 2013 Transactions on Knowledge and Data Engineering
Generating models from large data sets—and determining which subsets of data to mine—is becoming increasingly automated. However, choosing what data to collect in the first place requires human intuition or experience, usually supplied by a domain expert. This paper describes a new approach to machine science which demonstrates for the first time that non-domain experts can collectively formulate features and provide values for those features such that they are predictive of some behavioral outcome of interest. This was accomplished by building a web platform in which human groups interact to both respond to questions likely to help predict a behavioral outcome and pose new questions to their peers. This results in a dynamically growing online survey, but the result of this cooperative behavior also leads to models that can predict users' outcomes based on their responses to the user-generated survey questions. Here we describe two web-based experiments that instantiate this approach: the first site led to models that can predict users' monthly electric energy consumption; the other led to models that can predict users' body mass index. As exponential increases in content are often observed in successful online collaborative communities, the proposed methodology may, in the future, lead to similar exponential rises in discovery and insight into the causal factors of behavioral outcomes.





IEEE 2012: One Size Does Not Fit All: Toward User- and Query-Dependent Ranking for Web Databases

IEEE 2012 TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Abstract— With the emergence of the deep web, searching web databases in domains such as vehicles, real estate, etc., has become a routine task. One of the problems in this context is ranking the results of a user query. Earlier approaches for addressing this problem have used frequencies of database values, query logs, and user profiles. A common thread in most of these approaches is that ranking is done in a user- and/or query-independent manner. This paper proposes a novel query- and user-dependent approach for ranking query results in web databases. We present a ranking model, based on two complementary notions of user and query similarity, to derive a ranking function for a given user query. This function is acquired from a sparse workload comprising several such ranking functions derived for various user-query pairs. The model is based on the intuition that similar users display comparable ranking preferences over the results of similar queries. We define these similarities formally in alternative ways and discuss their effectiveness analytically and experimentally over two distinct web databases.

Application of Data Mining in Educational Databases for Predicting Academic Trends and Patterns


Abstract— Data mining is a process of identifying and extracting hidden patterns and information from databases and data warehouses. There are various algorithms and tools available for this purpose. Data mining has a vast range of applications, ranging from business to medicine to engineering. In this paper, we discuss the application of data mining in education for student profiling and grouping. We make use of the Apriori algorithm for student profiling, which is one of the popular approaches for mining associations, i.e., discovering correlations among sets of items. The other algorithm, used for grouping students, is K-means clustering, which assigns a set of observations to subsets. In the field of academics, data mining can be very useful in discovering valuable information which can be used for profiling students based on their academic records. We apply the Apriori algorithm to a database containing the academic records of various students and extract association rules in order to profile students based on various parameters like exam scores, term work grades, attendance and practical exams. We also apply K-means clustering to the same data in order to group the students. The implemented algorithms offer an effective way of profiling students which can be used in educational systems.
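A small sketch of the described workflow, assuming mlxtend's Apriori implementation and scikit-learn's KMeans are available; the attribute encodings, thresholds, and scores are invented for illustration, and association rules can then be derived from the frequent itemsets it prints.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from sklearn.cluster import KMeans

# each "transaction" is the set of discretized attributes of one student
records = [["exam:high", "attendance:high", "termwork:A", "practical:pass"],
           ["exam:high", "attendance:high", "termwork:A", "practical:pass"],
           ["exam:low",  "attendance:low",  "termwork:C", "practical:fail"],
           ["exam:low",  "attendance:high", "termwork:B", "practical:pass"],
           ["exam:high", "attendance:high", "termwork:B", "practical:pass"]]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(records), columns=encoder.columns_)
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)   # frequent student profiles
print(itemsets.sort_values("support", ascending=False).head(10))

# group the same students by numeric scores with K-means
scores = [[82, 95], [78, 90], [35, 40], [55, 80], [74, 92]]      # [exam %, attendance %]
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores))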


IEEE 2011:  Ranking Spatial Data by Quality Preferences

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, March 2011

Abstract— A spatial preference query ranks objects based on the qualities of features in their spatial neighborhood. For example, using a real estate agency database of flats for lease, a customer may want to rank the flats with respect to the appropriateness of their location, defined after aggregating the qualities of other features (e.g., restaurants, cafes, hospital, market, etc.) within their spatial neighborhood. Such a neighborhood concept can be specified by the user via different functions. It can be an explicit circular region within a given distance from the flat. Another intuitive definition is to consider the whole spatial domain and assign higher weights to the features based on their proximity to the flat. In this paper, we formally define spatial preference queries and propose appropriate indexing techniques and search algorithms for them. Extensive evaluation of our methods on both real and synthetic data reveals that an optimized branch-and-bound solution is efficient and robust with respect to different parameters.
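The query semantics can be illustrated with a brute-force ranking: score each flat by the best quality of each feature type found within a given radius and sort by the aggregate. The paper's contribution is answering such queries efficiently with indexing and branch-and-bound search; the coordinates and qualities below are invented.

from math import hypot

def rank_flats(flats, features, radius):
    # features: {type: [(x, y, quality), ...]}; a flat's score is the sum, over
    # feature types, of the best quality found within `radius` (0 if none nearby)
    def score(fx, fy):
        total = 0.0
        for points in features.values():
            nearby = [q for x, y, q in points if hypot(x - fx, y - fy) <= radius]
            total += max(nearby, default=0.0)
        return total
    return sorted(((score(x, y), name) for name, x, y in flats), reverse=True)

flats = [("flat_a", 0.0, 0.0), ("flat_b", 5.0, 5.0)]
features = {"restaurant": [(1.0, 0.0, 0.9), (5.0, 4.0, 0.4)],
            "market":     [(0.0, 2.0, 0.7), (6.0, 5.0, 0.8)]}
print(rank_flats(flats, features, radius=2.5))   # flat_a scores 1.6, flat_b scores 1.2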



IEEE 2012: A Novel Profit Maximizing Metric for Measuring Classification Performance of Customer Churn Prediction Models

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Abstract— The interest in data mining techniques has increased tremendously during the past decades, and numerous classification techniques have been applied in a wide range of business applications. Hence, the need for adequate performance measures has become more important than ever. In this paper, a cost-benefit analysis framework is formalized in order to define performance measures which are aligned with the main objective of the end users, i.e., profit maximization. A new performance measure is defined, the expected maximum profit criterion. This general framework is then applied to the customer churn problem with its particular cost-benefit structure. The advantage of this approach is that it assists companies with selecting the classifier which maximizes the profit. Moreover, it aids with the practical implementation in the sense that it provides guidance about the fraction of the customer base to be included in the retention campaign.
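A toy illustration of profit-driven evaluation: sort customers by predicted churn risk, sweep the fraction contacted, and pick the cutoff with the highest campaign profit. The cost and benefit parameters are invented; the paper's expected maximum profit criterion formalizes and generalizes this kind of calculation.

import numpy as np

def best_campaign_cutoff(churn_prob, is_churner, clv=200.0, accept_rate=0.3,
                         incentive=10.0, contact_cost=1.0):
    order = np.argsort(-np.asarray(churn_prob))            # riskiest customers first
    churners = np.asarray(is_churner, dtype=float)[order]
    n = len(churners)
    best_frac, best_profit = 0.0, float("-inf")
    retained = 0.0
    for cutoff in range(1, n + 1):
        retained += churners[cutoff - 1]
        # churners who accept the offer are retained and keep generating value
        profit = retained * accept_rate * (clv - incentive) - cutoff * contact_cost
        if profit > best_profit:
            best_frac, best_profit = cutoff / n, profit
    return best_frac, best_profit

rng = np.random.default_rng(1)
truth = rng.random(1000) < 0.1                                # 10% base churn rate
scores = np.clip(truth * 0.6 + rng.random(1000) * 0.5, 0, 1)  # imperfect classifier
print(best_campaign_cutoff(scores, truth))                    # (fraction to target, profit)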

IEEE 2012: Prediction of User’s Web-Browsing Behavior: Application of Markov Model

IEEE TRANSACTIONS ON SYSTEMS, AUGUST 2012

Abstract— Web prediction is a classification problem in which we attempt to predict the next set of Web pages that a user may visit based on knowledge of the previously visited pages. Predicting users' behavior while surfing the Internet can be applied effectively in various critical applications. Such applications involve the traditional tradeoff between modeling complexity and prediction accuracy. In this paper, we analyze and study the Markov model and the all-Kth Markov model in Web prediction. We propose a new modified Markov model to alleviate the issue of scalability in the number of paths. In addition, we present a new two-tier prediction framework that creates an example classifier EC, based on the training examples and the generated classifiers. We show that such a framework can improve the prediction time without compromising prediction accuracy. We have used standard benchmark data sets to analyze, compare, and demonstrate the effectiveness of our techniques using variations of Markov models and association rule mining. Our experiments show the effectiveness of our modified Markov model in reducing the number of paths without compromising accuracy. Additionally, the results support our analysis conclusion that accuracy improves with higher orders of the all-Kth model.
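As a concrete reference point, a first-order Markov model for next-page prediction can be built from transition counts over user sessions, as in the sketch below; the paper's modified and all-Kth-order models build on the same statistics. The session data is invented.

from collections import Counter, defaultdict

def train(sessions):
    # count page-to-page transitions observed in the sessions
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict_next(transitions, page, top_n=2):
    # return the most likely next pages after `page`
    return [p for p, _ in transitions[page].most_common(top_n)]

sessions = [["home", "products", "cart", "checkout"],
            ["home", "products", "reviews"],
            ["home", "support"],
            ["products", "cart", "checkout"]]
model = train(sessions)
print(predict_next(model, "products"))   # ['cart', 'reviews']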


IEEE 2012: Learn to Personalized Image Search from the Photo Sharing Websites

2012 IEEE TRANSACTIONS ON MULTIMEDIA

Abstract— Increasingly developed social sharing websites, like Flickr and YouTube, allow users to create, share, annotate, and comment on media. The large-scale user-generated metadata not only facilitate users in sharing and organizing multimedia content but also provide useful information to improve media retrieval and management. Personalized search serves as one such example, where the web search experience is improved by generating the returned list according to the modified user search intents. In this paper, we exploit the social annotations and propose a novel framework that simultaneously considers the user and query relevance to learn personalized image search. The basic premise is to embed the user preference and query-related search intent into user-specific topic spaces. Since the users' original annotations are too sparse for topic modeling, we need to enrich users' annotation pools before constructing user-specific topic spaces. The proposed framework contains two components: 1) a Ranking-based Multi-correlation Tensor Factorization model is proposed to perform annotation prediction, which is considered as users' potential annotations for the images; 2) user-specific topic modeling is introduced to map the query relevance and user preference into the same user-specific topic space. For performance evaluation, two resources involving users' social activities are employed. Experiments on a large-scale Flickr dataset demonstrate the effectiveness of the proposed method.


IEEE 2012: Road: A New Spatial Object Search Framework for Road Networks

IEEE 2012 TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Abstract— In this paper, we present a new system framework called ROAD for spatial object search on road networks. ROAD is extensible to diverse object types and efficient for processing various location-dependent spatial queries (LDSQs), as it maintains objects separately from a given network and adopts an effective search space pruning technique. Based on our analysis of the two essential operations for LDSQ processing, namely, network traversal and object lookup, ROAD organizes a large road network as a hierarchy of interconnected regional subnetworks (called Rnets). Each Rnet is augmented with 1) shortcuts and 2) object abstracts to accelerate network traversals and provide quick object lookups, respectively. To manage those shortcuts and object abstracts, two cooperating indices, namely, Route Overlay and Association Directory are devised. In detail, we present 1) the Rnet hierarchy and several properties useful in constructing and maintaining the Rnet hierarchy, 2) the design and implementation of the ROAD framework, and 3) a suite of efficient search algorithms for single-source LDSQs and multisource LDSQs. We conduct a theoretical performance analysis and carry out a comprehensive empirical study to evaluate ROAD. The analysis and experiment results show the superiority of ROAD over the state-of-the-art approaches.

IEEE 2012:  Measuring The Sky: On Computing Data Cubes Via Skylining The Measures

IEEE 2012 TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Abstract— The data cube is a key element in supporting fast OLAP. Traditionally, an aggregate function is used to compute the values in data cubes. In this paper, we extend the notion of data cubes with a new perspective. Instead of using an aggregate function, we propose to build data cubes using the skyline operation as the "aggregate function". Data cubes built in this way are called "group-by skyline cubes" and can support a variety of analytical tasks. Nevertheless, there are several challenges in implementing group-by skyline cubes in data warehouses: (i) the skyline operation is computationally intensive, (ii) the skyline operation is holistic, and (iii) a group-by skyline cube contains both grouping and skyline dimensions, rendering it infeasible to pre-compute all cuboids in advance. This paper gives details on how to store, materialize, and query such cubes.
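The "aggregate" in a group-by skyline cube is the skyline operator itself. The sketch below computes the skyline of one group with a straightforward O(n^2) dominance check, assuming smaller is better in every measure dimension; the cube construction and materialization machinery the paper develops is not shown.

def dominates(a, b):
    # a dominates b if it is no worse in every dimension and strictly better in one
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# e.g. (price, distance-to-beach) of hotels falling into one group-by cell
group = [(100, 5.0), (80, 7.0), (120, 4.0), (110, 6.0), (90, 8.0)]
print(skyline(group))   # [(100, 5.0), (80, 7.0), (120, 4.0)]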

IEEE 2012:  SLICING: A New Approach For Privacy Preserving Data Publishing

IEEE 2012 TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Abstract— Several anonymization techniques, such as generalization and bucketization, have been designed for privacy-preserving microdata publishing. Recent work has shown that generalization loses a considerable amount of information, especially for high-dimensional data. Bucketization, on the other hand, does not prevent membership disclosure and does not apply to data that do not have a clear separation between quasi-identifying attributes and sensitive attributes. In this paper, we present a novel technique called slicing, which partitions the data both horizontally and vertically. We show that slicing preserves better data utility than generalization and can be used for membership disclosure protection. Another important advantage of slicing is that it can handle high-dimensional data. We show how slicing can be used for attribute disclosure protection and develop an efficient algorithm for computing the sliced data that obey the ℓ-diversity requirement. Our workload experiments confirm that slicing preserves better utility than generalization and is more effective than bucketization in workloads involving the sensitive attribute. Our experiments also demonstrate that slicing can be used to prevent membership disclosure.
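A toy version of the slicing transformation: group attributes into columns, group tuples into buckets, and independently permute each column's values within every bucket so that cross-column linkage is broken. The column grouping and bucket size below are arbitrary choices for illustration, not the paper's ℓ-diversity-aware algorithm.

import random

def slice_table(rows, columns, bucket_size, seed=0):
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]
        pieces = []
        for col in columns:                       # one independent permutation per column
            values = [tuple(row[a] for a in col) for row in bucket]
            rng.shuffle(values)
            pieces.append(values)
        for i in range(len(bucket)):              # reassemble rows from the permuted columns
            sliced.append(tuple(v for piece in pieces for v in piece[i]))
    return sliced

rows = [{"age": 22, "sex": "M", "zip": "47906", "disease": "flu"},
        {"age": 22, "sex": "F", "zip": "47906", "disease": "asthma"},
        {"age": 33, "sex": "F", "zip": "47905", "disease": "flu"},
        {"age": 52, "sex": "F", "zip": "47905", "disease": "cancer"}]
columns = [("age", "sex"), ("zip", "disease")]    # highly correlated attributes stay together
for published_row in slice_table(rows, columns, bucket_size=2):
    print(published_row)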

IEEE 2012:  DDD: A New Ensemble Approach For Dealing With Concept Drift

IEEE 2012 TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Abstract— Online learning algorithms often have to operate in the presence of concept drifts. A recent study revealed that different diversity levels in an ensemble of learning machines are required in order to maintain high generalization on both old and new concepts. Inspired by this study and based on a further study of diversity with different strategies to deal with drifts, we propose a new online ensemble learning approach called Diversity for Dealing with Drifts (DDD). DDD maintains ensembles with different diversity levels and is able to attain better accuracy than other approaches. Furthermore, it is very robust, outperforming other drift handling approaches in terms of accuracy when there are false positive drift detections. In all the experimental comparisons we have carried out, DDD always performed at least as well as other drift handling approaches under various conditions, with very few exceptions.

IEEE 2012:  ANÓNIMOS: An Lp-Based Approach For Anonymizing Weighted Social Network Graphs

IEEE 2012 TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Abstract— The increasing popularity of social networks has initiated a fertile research area in information extraction and data mining. Anonymization of these social graphs is important to facilitate publishing these data sets for analysis by external entities. Prior work has concentrated mostly on node identity anonymization and structural anonymization. But with the growing interest in analyzing social networks as a weighted network, edge weight anonymization is also gaining importance. We present Anónimos, a Linear Programming-based technique for anonymization of edge weights that preserves linear properties of graphs. Such properties form the foundation of many important graph-theoretic algorithms such as shortest paths problem, k-nearest neighbors, minimum cost spanning tree, and maximizing information spread. As a proof of concept, we apply Anónimos to the shortest paths problem and its extensions, prove the correctness, analyze complexity, and experimentally evaluate it using real social network data sets. Our experiments demonstrate that Anónimos anonymizes the weights, improves k-anonymity of the weights, and also scrambles the relative ordering of the edges sorted by weights, thereby providing robust and effective anonymization of the sensitive edge-weights. We also demonstrate the composability of different models generated using Anónimos, a property that allows a single anonymized graph to preserve multiple linear properties.
