TDD Projects
Design and Development
Team project: a group of 2 or 3 people.
 Project 1. Evaluating Boolean XPath queries on large XML data.
This project is to develop algorithms for evaluating Boolean XPath queries on large XML documents in a centralized system. You are required to develop two algorithms:
 an algorithm that partitions a large XML document into fragments and stores the fragments in the secondary storage of the system, and
 an algorithm for evaluating XPath queries on the fragmented document such that each fragment needs to be loaded into the main memory at most once.
You may only focus on XPath queries of the following form:
p ::= A  *  p/p  p//p  p[q]
q ::= @a = `c'  q and q  q or q  not q
where
A is an element tag,
@a indicates an attribute and 'c' is a constant. Such a query returns
true on an XML document
T if and only if the set of XML elements reachable from the root of
T via
p is nonempty.
The final report should consist of

 a problem statement;
 your algorithms;
 complexity analyses of both algorithms;
 justification of the design of your partitioning algorithm, ie, you need to show that your partitioning strategy guarantee the "minimum" I/O cost;
 justification of your evaluation algorithm, ie, you need to prove that your algorithm is correct and it indeed has the performance guarantee described above; and
 an experimental study to verify the efficiency of your algorithms.
Investigators (at most three people):
 Project 2. A MapReduce algorithm for evaluating Boolean XPath queries.
This project is to develop a MapReduce algorithms for evaluating Boolean XPath queries on large XML data. You may only consider XPath queries of the form given in the last project.
The final report should consist of
 a problem statement,
 a MapReduce algorithm,
 complexity analysis of your MapReduce algorithm,
 justification of your evaluation algorithm, ie, you need to prove that your algorithm is correct and it has certain performance guarantees; and
 an experimental study to verify the effectiveness of your algorithm (possibly via simulation).
Investigators (at most two people):
 Project 3. Evaluating distance queries on distributed graphs.
This project is to develop an algorithm for evaluating distance queries on graphs that are fragmented and distributed . A distance query consists of a pair of nodes ( s, t). Given a graph G, it is to find the distance from s to t in G.
You are required to develop an evaluation algorithm with the following performance guarantees: (a) each fragment (site) is visited only once; (b) the response time depends on the largest fragment of the graph G, but is independent of the size of the entire graph G; and (c) the amount of data shipped is independent of the size of the entire graph G.
The final report should consist of
 a problem statement,
 the evaluation algorithm,
 complexity analyses of your algorithm,
 justification of your evaluation algorithm, ie, you need to prove that your algorithm is correct and it indeed has the performance guarantee described above, and
 an experimental study to verify the scalability and efficiency of your algorithm.
Investigators (at most two people):
 Project 4. A MapReduce algorithm for evaluating distance queries.
This project is to develop a MapReduce algorithm for evaluating distance queries on large graphs. See the last project for the definition of distance queries.
The final report should consist of
 a problem statement,
 a MapReduce algorithm,
 complexity analyses of your algorithm,
 the correctness of your evaluation algorithm, and
 an experimental study to verify the scalability and efficiency of your algorithm.
Investigators (at most two people):
 Project 5. Evaluating Boolean XPath queries on DAGs that are fragmented and distributed.
This project is to develop an algorithm for evaluating Boolean XPath queries on nodelabeled directed acyclic graphs (DAGs) that are fragmented and distributed. You may consider only XPath queries specified in Project 1. You are required to develop an evaluation algorithm that (a) visits each fragment at most once, and (b) its communication cost is independent of the size of the DAG.
The final report should consist of
 a problem statement,
 your evaluation algorithm,
 complexity analyses of your algorithm,
 justification of your evaluation algorithm, ie, you need to prove that your algorithm is correct and it indeed has the performance guarantee described above, and
 an experimental study to verify the scalability and efficiency of your algorithm.
Investigators (at most 2 people):
 Project 6. Graph embedding.
Consider two nodelabeled graphs G1 and G2 . We say that G1 is embedded in G2 if there exists a binary relation R that "maps" the nodes of G1 to the nodes of G2 such that (A) for each node u inG1 , there exists a v in G2 such that (u, v) is in R , and (B) for any pair (u, v) of nodes in R , (1) the label of u is identical to the label of v , and (2) for every edge (u, u') in G1 , there exists a nonempty path of G2 from v to v' , such that (u', v') is also in R . In other words, each edge in G1 is mapped to a nonempty path in G2 . We refer to R as an embedding relation from G1 to G2 .
An embedding relation R is said to be maximum if for any embedding relation R' from G1 to G2 , R' is contained in R .
This project consists of two parts. (1) Show that if G1 can be embedded in G2 , then there exists a unique maximum embedding relation R from G1 to G2 . (2) Develop an algorithm, in O((V1 + V2) (E1 + V2^2)) time, for computing the maximum embedding relation if G1 can be embedded in G2 , where V1 and E1 are the numbers of nodes and edges in G1 , respectively; similarly forV2 .
The final report should consist of
 a problem statement for graph embedding,
 a proof showing that there exists a unique maximum embedding relation from G1 to G2 if G1 can be embedded in G2 ;
 an algorithm for computing the maximum embedding relation from G1 to G2 if it exists, and it returns an empty set otherwise;
 a proof showing that your algorithm is correct;
 a complexity analysis showing that your algorithm is in O((V1 + V2) (E1 + V2^2)) time; and
 an experimental study to verify the efficiency of your algorithm.
Readings: (a) "Computing simulations on finite and infinite graphs", M. R. Henzinger and T. A. Henzinger and P. W. Kopke, FOCS 1995. (b) "Graph pattern matching: From intractable to polynomial time"
Investigators (at most three people):
 Project 7. Graph embedding on distributed graphs.
This project is to develop an algorithm for deciding whether a distributed graph G matches a graph pattern P , based on graph embedding defined in the last project. Here both G and P are nodelabeled acyclic directed graphs (DAGs). Your algorithm should guarantee that its communication cost is independent of the size of the DAG G , i.e., it should explore maximum parallelism when conducting graph pattern matching.
The final report should consist of
 a problem statement,
 your evaluation algorithm,
 complexity analyses of your algorithm,
 justification of your evaluation algorithm, ie, you need to prove that your algorithm is correct and it indeed has the performance guarantee described above, and
 an experimental study to verify the scalability and efficiency of your algorithm.
Investigators (at most three people):
 Project 8. Schemadirected XML integration
The objective of this project is to develop a program that integrates information about LNCS books published by Springer from two data sources, and generates an XML view that conforms to a predefined recursive DTD.
One source of information about books is the DBLP server , which collects more than 830000 entries, in XML records. Another source of book information is Amazon.
 Write a program that extracts LNCS entries from DBLP and stores the data in a relation of the following schema:
book(editor, title, year)
 Write another program that extracts LNCS books from Amazon and stores the data in a relation of the following schema:
book(editor, title, year, price, ISBN)
 Populate the two relational databases with data from Amazon and DBLP, respectively, such that each database consists of at least 500 entries.
 Find a set of matching rules to identify records from the two different sources that refer to the same book or the same author.
 Write an AIG that builds an XML view of the two relational sources, such that the view conforms to the DTD given below (only productions are listed):
db > book*,
book > editor*, title, year, price, ISBN
author > name, book*
where for each book B published in year Y, under each author A of B, all books coauthored by A published before Y are listed. Note that this is a recursive DTD.
 Implement the AIG in whatever programming language you like.
 Generate an XML view using the AIG, based on the two relational databases you created.
 Write an XSLT program to display your XML view as a userfriendly HTML page.
The final report should consist of (a) a problem statement for data integration in this context, (b) your programs for generating relational data, and their justification, (c) your AIG for integrating the relational data, (d) how you implement your AIG, and (e) matching rules. Justify your solution.
Readings: (a) "Capturing both Types and Constraints in Data Integration". (b) "Reasoning about Record Matching Rules".
Investigators (at most three people):
 Project 9. Matching rules.
This project is to extend relative candidate keys (RCKs) by supporting negative rules, e.g., a person record for a male may not match another for female. We refer to the extended RCKs as eRCKs .
A set of eRCKs is said to be consistent if there exist no tuples t1 and t2 , such that the eRCKs conclude that t1 and t2 both match to each other, and at the same time, that they may not match.
The objective of the project is to define eRCKs and investigate its consistency problem. The final report should consist of the following:
 a problem statement,
 the syntax of eRCKs (hint: use inequality in the rules),
 the semantics of eRCKs,
 an algorithm that, given a set of eRCKs, determines whether these eRCKs are consistent; and
 a proof showing that your algorithm is correct and moreover, is in polynomial time.
Readings "Reasoning about Record Matching Rules", and "Interaction between Record Matching and Data Repairing".
Investigators (at most two people):
 Project 10. Discovering conditional functional dependencies
This project aims to develop an algorithm that, given a relation D , finds all conditional functional dependencies (CFDs) of the following form that hold on D , i.e., the set of CFDs that D satisfies. The CFDs are of the form (X > A, tp) , where X is a set of at most 5 attributes, A is a single attribute, and tp is a pattern tuple.
The final report should consist of the following:
 a problem statement;
 your algorithm;
 a proof showing that your algorithm is correct, i.e., it finds all CFDs of the form above that hold on D ;
 a proof showing that your algorithm is in polynomial time, and
 an experimental study verifying the scalability of your algorithm with the size of D and the arity of D .
Reading: "Discovering Conditional Functional Dependencies".
Investigators (at most 2 people):
 Project 11. Detection of violations of conditional inclusion dependencies.
This project is to develop an algorithm that detects inconsistencies of relational data that emerge as violations of conditional inclusion dependencies (CINDs).
Develop an algorithm that, given a set S of CINDs as input, returns as output a fixed number of SQL queries. When evaluated against a database DB , the SQL queries return all those tuples in DB that violate at least one CIND in S .
The final report should consist of
 a problem statement for violation detection,
 an algorithm for generating violation checking SQL queries,
 a proof showing that your algorithm is correct, ie, it generates SQL queries that can catch all and only those violating tuples,
 a proof showing that the the number and size of the SQL queries are dependent of neither the cardinality of S nor the size of the pattern tableaux of the CINDs in S , and
 an experimental study verifying the efficiency of your algorithm.
Readings: "Extending Dependencies with Conditions" and "Conditional Functional Dependencies for Capturing Data Inconsistencies".
Investigators (at most two people):
 Project 12. Incremental validation of conditional functional dependencies.
As opposed to traditional functional dependencies, a set S of conditional functional dependencies (CFDs) may be inconsistent. That is, there may not exist any nonempty database that satisfies S . Worse still, the problem for determining whether a set of CFDs is consistent is NPcomplete.
The incremental validation problem for CFDs is stated as follows. Given a consistent set S of CFDs and a single CFD c , it is to determine whether S and c put together are consistent, ie, whether there exists a nonempty database that satisfies both S and c .
Show that the incremental validation problem for CFDs is also NPcomplete.
The final report should consist of
 a proof showing that the incremental validation problem for CFDs is in NP, and
 a proof showing that the problem is NPhard.
Reading: Conditional Functional Dependencies for Capturing Data Inconsistencies.
Investigators (a team of at most 2 people):
 Project 13. Detecting errors in vertically partitioned data.
Conditional functional dependencies (CFDs) have recently been studied for capturing inconsistencies in relational data, ie, errors and inconsistencies in the data are detected as violations of CFDs. Theinconsistency detection problem for CFDs is to find, given a set S of CFDs and a database DB , all those tuples in DB that violate at least one CFD in S .
It is known that when DB is a centralized database, it suffices to use two SQL queries to find all violations of S in DB . In contrast, when DB is fragmented and distributed, it is often necessary to ship data from one site to another in order to detect inconsistencies. In this setting, the inconsistency detection problem for CFDs becomes NPcomplete when we require the data shipment to be minimum.
This project is to develop an efficient heuristic algorithm for detecting violations of CFDs in a database that is vertically partitioned and distributed.
The final report should consist of
 an algorithm for detecting violations of CFDs in a vertically partitioned database,
 justification for the correctness of the algorithm,
 complexity analysis of the algorithm,
 performance guarantees, e.g., no data item is shipped more than once; and
 an experimental study to verify the efficiency of your algorithm.
Readings: (a) Conditional Functional Dependencies for Capturing Data Inconsistencies. (b) Detecting Inconsistencies in Distributed Data.
Investigators (a team of at most three people):
 Project 14. Incremental detection of violations of conditional functional dependencies.
The objective of this project is to develop algorithms that incrementally detect inconsistencies of relational data emerged as violations of at least one dependency in a given set of conditional functional dependencies (CFDs).
The algorithm takes as input a set S of CFDs, and a database DB in which tuples that violate dependencies in S have already been marked. It returns as output a fixed number of SQL queries. The SQL queries, when evaluated against the database DB and an update relation U to DB , incrementally catch tuples that violate at least one dependency in S in response to U . The SQL queries should minimize unnecessary checking on the tuples in DB that remain violating (or not violating) CFDs in S after the update U is incurred to DB . In other words, the SQL queries should not start the checking from scratch.
The final report should consist of
 a problem statement for incremental violation detection,
 an algorithm for generating SQL queries for incremental violation checking,
 a proof showing that your algorithm is correct, ie, it catches all and only those violating tuples,
 your incremental algorithm minimizes unnecessary checking, and
 an experimental study verifying the efficiency of your algorithm, compared with the algorithm that checks inconsistencies starting from scratch, when updates are small.
Readings: "Conditional Functional Dependencies for Capturing Data Inconsistencies", and "Incremental Detection of Inconsistencies in Distributed Data"
Investigators (a team of at most 3 people):
 Project 15. Propagation of functional dependencies.
The objective of this project is to develop an algorithm that derives conditional functional dependencies propagated from functional dependencies via SP views (selection and projection).
Develop a polynomialtime algorithm that, given as input a set S of functional dependencies defined on schema R and a view V defined in terms of selection and projection operators, returns as output the set of all conditional functional dependencies propagated from S via V. A conditional functional dependency ( CFD) is said to be propagated from S via V if for all instances DB of R, if DB satisfiesS , then V(DB) must satisfy the CFD. The query V is defined on R in terms of a selection operator with a conjunction of equality atoms, and a projection on a set of attributes.
The final report should consist of
 a problem statement for dependency propagation,
 an algorithm for deriving all conditional functional dependencies propagated from functional dependencies via SP views,
 a proof showing that your algorithm is correct, ie, it finds all and only those propagated conditional functional dependencies,
 a complexity analysis of your algorithm, showing that your algorithm is in polynomialtime, and
 an experimental study verifying that your algorithm is efficient.
Readings: "Conditional Functional Dependencies for Capturing Data Inconsistencies", and "Propagating Functional Dependencies with Conditions".
Investigators (a team of at most 3 people):
Survey
These are individual projects rather than team efforts. The investigator of each project needs to select five representative research papers on the topic chosen, and carefully evaluate the papers. The final report should consist of (a) a clear problem statement; (b) criteria for the evaluation, (c) justification for the criteria; and (d) evaluation and assessment of the research papers selected, based on the criteria you determined. Topics include, but are not limited to the following:
 Project 16. Evaluating XML queries on distributed XML data.
Investigator:
 Project 17. Evaluating queries on distributed graphs.
Investigator:
 Project 18. Distributed graph database systems.
Investigator:
 Project 19. MapReduce algorithms for relational query evaluation.
Investigator:
 Project 20. Distributed data storage (data centers).
Investigator:
 Project 21. Querying distributed semistructured data
Investigator:
 Project 22. Schema matching.
Investigator:
 Project 23. Schema mapping.
Investigator:
 Project 24. XML data integration from relational sources.
Investigator:
 Project 25. Record matching.
Investigator:
 Project 26. Integrity constraints used in data cleaning.
Investigator:
 Project 27. Conflict resolution methods.
Investigator:
 Project 28. Data repairing.
Investigator:
 Project 29. Consistent query answering.
Investigator:
 Project 30. XML data cleaning.
Investigator:
 WenyuanYu  21 Jan 2012