The Internet has grown so popular that people now use it every day. The number of web pages continues to grow rapidly, turning the Internet into a massive database with rich resources. With this in mind, we aim to build second-generation search engines for better web querying, and to analyze and mine web data and user behavior. To this end, we develop advanced algorithms for the following topics:
♦ Constructing a Hierarchical Ontology for Information Integration
As the number of web pages increases dramatically, the problem of information overload becomes more severe when browsing and searching the WWW. To alleviate this problem, search engines filter out unwanted information based on user queries. However, because the retrieved data is not properly organized and integrated, the information overload problem remains. To improve search efficiency, many techniques have been proposed to integrate information according to user needs and preferences.
The common critical barrier for these information integration techniques is the quality of the relationship sets, which describe the correlations among different concepts (or vocabularies) and their hierarchical relationships. In traditional databases and data warehouses, such relationship sets are designed manually (via the Entity-Relationship model or UML). However, these models are domain-specific and do not scale to the volume of web data on the Internet. Web information integration therefore requires the automatic construction of relationship data for all web pages.
The objective of this proposal is to construct the relationship data automatically, using the ontology model. Our construction mechanism relies on query logs and a variety of external resources such as WordNet, Wikipedia, and DMOZ. During construction, the query terms are first clustered into semantic groups (termed concepts); the relationships and hierarchies between concepts are then formed by leveraging the external resources. In this fashion, a hierarchical ontology is constructed.
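The two construction steps can be illustrated with a minimal sketch. It is not the project's actual algorithm: the clustering signal (terms that lead to the same clicked URL belong to one concept), the toy `QUERY_LOG`, and the `HYPERNYMS` table standing in for an external resource such as WordNet are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical query log: (query term, clicked URL) pairs.
QUERY_LOG = [
    ("jaguar", "cars.example/jaguar"),
    ("jag", "cars.example/jaguar"),
    ("bmw", "cars.example/bmw"),
    ("beemer", "cars.example/bmw"),
    ("puma", "zoo.example/puma"),
    ("cougar", "zoo.example/puma"),
]

# Hypothetical stand-in for an external resource (e.g. WordNet hypernyms).
HYPERNYMS = {"jaguar": "car", "bmw": "car", "puma": "animal"}


def cluster_terms(log):
    """Step 1: group query terms into concepts.

    Assumption: two terms belong to the same concept if they share
    a clicked URL. Groups are merged with a small union-find.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_term_for_url = {}
    for term, url in log:
        root = find(term)
        if url in first_term_for_url:
            other = find(first_term_for_url[url])
            if other != root:
                parent[root] = other
        else:
            first_term_for_url[url] = term

    clusters = defaultdict(set)
    for term in list(parent):
        clusters[find(term)].add(term)
    return [frozenset(terms) for terms in clusters.values()]


def build_hierarchy(concepts, hypernyms):
    """Step 2: attach each concept to a parent label.

    The parent is chosen by majority vote over the hypernyms of the
    concept's terms; concepts with no known hypernym go under "ROOT".
    """
    hierarchy = defaultdict(list)
    for concept in concepts:
        votes = Counter(hypernyms[t] for t in concept if t in hypernyms)
        label = votes.most_common(1)[0][0] if votes else "ROOT"
        hierarchy[label].append(concept)
    return dict(hierarchy)


concepts = cluster_terms(QUERY_LOG)
ontology = build_hierarchy(concepts, HYPERNYMS)
```

In this toy run, `ontology` maps the label `"car"` to the two car concepts and `"animal"` to the puma/cougar concept. A real system would replace both the co-click heuristic and the lookup table with richer similarity measures and actual resource lookups.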
The constructed ontology can be employed not only in information integration but also in many other applications, such as question answering, language translation, and classification. The outcome of this project may help consolidate database research as a leading field in the global research community.
♦ Automatic Opinion Analysis on Microblogs
Opinion analysis has recently become one of the most active research topics. With the advent of Web 2.0, many users share their opinions about brands, products, and people. It is important to mine valuable information from this user-generated content (UGC).
Compared to other platforms on the Internet, microblog services are simple, fast to update, and easy to use. These properties attract users to share sentiment-rich opinions on them, which makes analyzing those opinions indispensable.
Our research focuses on opinion analysis of microblog posts. Instead of classifying the sentiment of opinions into only positive and negative categories, we apply clustering and learning mechanisms to obtain fine-grained results.
This work spans several important research fields, including databases, natural language processing, and machine learning.
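As a rough illustration of grouping opinions more finely than positive versus negative, the sketch below clusters posts by their emotion profile. It is only a sketch under stated assumptions: the tiny `LEXICON`, the three emotion `DIMENSIONS`, the sample `POSTS`, and the choice of single-pass (leader) clustering with a cosine threshold are all hypothetical and are not taken from the project itself.

```python
import math
from collections import Counter

# Hypothetical emotion lexicon; a real system would learn or import one.
LEXICON = {
    "love": "joy", "great": "joy", "happy": "joy",
    "hate": "anger", "awful": "anger", "furious": "anger",
    "worried": "fear", "scared": "fear",
}
DIMENSIONS = ["joy", "anger", "fear"]

# Hypothetical sample posts.
POSTS = [
    "I love this phone, great camera",
    "So happy with the new update",
    "I hate the battery, awful",
    "worried it will break",
    "furious about the price",
]


def featurize(post):
    """Count lexicon hits per emotion dimension for one post."""
    words = (w.strip(".,!?") for w in post.lower().split())
    counts = Counter(LEXICON[w] for w in words if w in LEXICON)
    return [counts[d] for d in DIMENSIONS]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_posts(posts, threshold=0.8):
    """Single-pass (leader) clustering.

    Each post joins the first cluster whose leader vector is similar
    enough; otherwise it starts a new cluster and becomes its leader.
    Posts with no lexicon hits always start their own cluster.
    """
    leaders, clusters = [], []
    for post in posts:
        vec = featurize(post)
        for i, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                clusters[i].append(post)
                break
        else:
            leaders.append(vec)
            clusters.append([post])
    return clusters


clusters = cluster_posts(POSTS)
```

Here the posts fall into three groups (joy, anger, fear) rather than two polarity classes; the two negative emotions are kept apart, which a positive/negative classifier would merge.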
CUE: Concept-level Understandable Explorer