## Design and Development

#### Team project: a group of 2 or 3 people.

• Project 1: DTD Subsumption
Responsibility: develop an algorithm to decide whether a DTD is subsumed by another, as defined below.
Background. A DTD D1 = (E1, r1, P1) subsumes another DTD D2 = (E2, r2, P2) if there exists a tag renaming function f: E2 -> E1 such that for any XML tree T conforming to D2 , f(T) conforms to D1.
Objective. This project is to develop an algorithm that, given any two DTDs D1 and D2 , determines whether or not D1 subsumes D2 .
The final report should consist of (a) a problem statement, (b) the complexity of the problem, (c) the algorithm, (d) a proof for the correctness of the algorithm, ie, you need to prove that your algorithm returns true if D1 subsumes D2, (e) a complexity analysis of your algorithm, and (f) an experimental study showing that your algorithm is efficient.
You may focus on DTDs with productions of the form A -> W, where
W ::= A1, ..., An | A1 + ... + An | A* | PCDATA
and Ai is an element tag for 0 <= n .
Investigators (a team of at most three people): Mihaila

• Project 2: Validating XML keys.
A relative XML key is defined as (Q, (P, {@A1, ..., @An})), where the context and target paths Q, P are XPath expressions defined by:
p::= A | * | p/p | p//p
If the context path Q is empty, the key is called an absolute key. Find out more about XML keys from "Keys for XML".
• Write a query in XQuery such that given an XML absolute key k defined as above and an XML document T , it returns true if and only if T satisfies k.
• Write a query in XQuery such that given an XML relative key k defined as above and an XML document T , it returns true if and only if T satisfies k.
• Develop an algorithm that, given a set K of XML relative keys, generates a query Q in XQuery such that given an XML document T , Q(T) is true if and only if T satisfies K.
• Provide a complexity analysis of your algorithm and the query generated.
• Experimentally evaluate your validating queries with respect to various XML documents and XML keys.
• What is the complexity of the following problems?
• Given an XML document T , a set K of XML relative keys, and a DTD D , it is to determine whether T satisfies K and conforms to D .
• Given an XML document T , a set K of XML absolute keys, and a DTD D , it is to determine whether T satisfies K and conforms to D .

If these problems are intractable, develop heuristic algorithms for them.

Investigators (a team of at most three people):

• Project 3: Tree pattern queries
Responsibility: develop effective algorithms for evaluating tree pattern queries on directed acyclic graphs (DAGs).
Background. A number of algorithms have been developed for tree-pattern (twig) queries defined in terms of:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant. A tree-pattern query can be depicted as a tree, and is evaluated on XML data modeled as a tree.
In practice XML data is often modeled as a graph rather than a tree, when ID and IDREF attributes are taken into account.
Objective. This project aims to develop efficient algorithms for evaluating tree pattern queries on DAGs with performance guarantee, i.e., traversing each node in the XML "DAG" no more than twice.
The final report should consist of (a) a problem statement, (b) yours algorithms, (c) a proof for the correctness of the algorithms, (d) a complexity analysis of your algorithms, and (e) an experimental study verifying the efficiency of your algorithms.
Investigators (a team of at most two people):

• Project 4: Querying XML stream
The objective is to extend an XML SAX parser to evaluate a small fragment of XPath.
The XPath fragment supports downward traversal only and does not allow predicates. More specifically, it is defined as follows:
p::= A | * | p/p | p//p
Objective.
• Develop an algorithm that, given an XPath query q defined as above, generates SAX events to evaluate the query.
• Implement the algorithm on top of any SAX parser of your choice, to evaluate XPath queries in a single traversal of any XML document.
• Provide a complexity analysis of your algorithm.
• Experimentally evaluate the algorithm and implementation.
Investigators (a team of at most two people): Na Fan and Renjun Zou

• Project 5: Invertible Shredding of XML data
The objective is to store the data of an XML document in relations, without loss of information.
• Find a real-life recursive DTD D from the Web.
• Design a relational schema R, in 3NF, for storing XML documents of the DTD.
• Write an XML2DB mapping f from documents of D to database instances of R.
• Write an ATG g from databases of R to XML instances of DTD D.
• Prove that for any XML instances T of D, T = g(f(T)), ie, g is the inverse of f; in other words, the document can be restored from its relational representation.
• Discuss the complexity of your XML2DB mapping f and ATG g in terms of the size of XML instances of D .
• Implement your XML2DB mapping and ATG.
• Generate XML documents of various sizes based on the DTD by using, eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/)
• Select data from these documents via your XML2DB mappings and store it in a relational database (using whatever relational DBMS you like).
• Experimentally verify that your ATG can restore the XML data stored in relations via your XML2DB mapping.
Investigators (a team of at most three people): Lydia Markoulaki, Oluwasoji Sanyaolu, Sifau Oniyangi

• Project 6: Lossless Shredding of XML data
The objective is to store the data of an XML document in relations, without loss of information when XPath queries are concerned.
• Find a real-life recursive DTD D from the Web.
• Design a relational schema R, in 3NF, for storing XML documents of the DTD.
• Write an XML2DB mapping f from documents of D to database instances of R.
• Develop an algorithm that, given any XPath query posed on documents of DTD D, translates the query into an equivalent query in SQL'99 on the corresponding relations of R. In other words, XPath queries on the documents can be answered in SQL.
• Generate XML documents of various sizes based on the DTD by using, eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/)
• Select data from these documents via your XML2DB mappings and store it in a relational database (using whatever relational DBMS you like).
• Experimentally verify that your XML2DB mapping and query translation algorithm are correct, by comparing the results of XPath queries posed on XML documents with the results of the SQL queries that are generated by your translation algorithms and posed on the relations produced by your XML2DB mapping.
You may focus on the following XPath fragment:
p::= A | * | p/p | p//p | p[q]
q ::= @a = `c'
where @a denotes an attribute.
Investigators (a team of at most three people): Ivo Dimitrov, Stelios Stylianou, Niki Manolitsi

• Project 7: Selective Shredding of XML Data
The objective is to select data from an XML document and store it in an existing relational database, by using XML2DB proposed by "Selectively Storing XML Data in Relations".
• Find a real-life recursive DTD from the Web.
• Design a relational schema, in 3NF, for storing XML documents of the DTD.
• Write 3 sufficiently diverse XML2DB mappings from documents of the DTD to its relational storage.
• Incorporate the mappings into an SAX parser for XML DTD.
• Generate XML documents of various sizes based on the DTD by using, eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/)
• Select data from these documents via your XML2DB mappings and SAX-implementation, and store it in a relational database (using whatever relational DBMS you like).
• Experimentally verify the scalability of your shredding mappings with respect to the size of the XML documents generated.
Investigators (a team of at most two people): Ivan Nikolov

• Project 8: Storing DBLP XML records in relations.
The DBLP server provides bibliographic information on major computer science journals and proceedings. It now collects more than 830000 article entries, in XML records, and is being widely used by computer scientists and students worldwide.
The objective of this project is to store DBLP XML data in relations and query DBLP data using relational DBMS.
• Design a relational schema in 3NF (with appropriate functional dependencies). Your schema should be able to store the title, authors, conference/journal, year, page numbers of each entry.
• Write an XML2DB mapping for each set of the following data. An XML2DB mapping traverses DBLP XML data, extracts the relevant entries and extends the relations of your schema using the data:
• Select entries of the top 5 database conferences: SIGMOD, PODS, VLDB, ICDT, ICDE (ignoring the pdf files).
• Select entries of major computer science journals: JACM, SICOMP, JCSS, TCS, TODS.
• Design a DTD to specify an XML document that consists of all publications in the conferences or journals above. Based upon the DTD, write an ATG to generate an XML view T from your relational database.
• Write an XSLT program to generate a HTML page from T , such that the page is of the form http://www.lfcs.inf.ed.ac.uk/research/database/Publications.html.
Investigators (a team of at most two people): Diana Pastor, Daniel Stanoescu

• Project 9: Schema-Directed XML Transformations
Objective. The objective of this project is to develop a systematic method to transform XML data of a DTD D1 to instances of another DTD D2. More specifically, it is to design the following:
• A specification language that allows users to specify what data they want from an XML document of DTD D1, extract the data, and generate an XML document that is guaranteed to conform to DTD D2 based on the extracted information;
• An algorithm for supporting the language such that given a specification with respect to a source DTD D1 and a target DTD D2, and any instance T of D1, it generates an XML document that conforms to D2; and
• Optimization techniques.
The final report should consist of (a) a problem statement, (b) the specification language, (c) the algorithm, (d) a proof for the correctness of the algorithm, (e) optimization techniques, and (f) an experimental study verifying the effectiveness of the optimization techniques.
Investigators (a team of at most three people): Roberto Bezoari, Maciej Czubak, Oleg Filippov

• Project 10: Answering XPath queries in SQL.
The objective of this project is to develop a program to translate simple XPath queries posed on DBLP data into equivalent SQL queries.
Read "Query Translation from XPath to SQL in the Presence of Recursive DTDs".
• Extract the DBLP XML records described in Project 8 above, and store the records in relations of the following schema:
conference(id, name),
journal(id, name),
publish(author, title, id, year)
where id is the acronym of a conference or journal, e.g., SIGMOD, JACM; name indicates the full name of a conference or journal, e.g., Proceedings of the ACM SIGMOD International Conference on Management of Data, Journal of the ACM; a tuple (a, t, id, year) in the publish table means that author a published a paper with title t , at conference (or in journal) id , in the particular year .
Use XML2DB mappings to specify XML shredding.
• Write a program that, given an XPath query posed on the DBLP XML records, translates it to an equivalent SQL query on the relational database storing DBLP data. You may only consider XPath queries of the form:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant.
• Populate the relations using your XML2DB mapping and DBLP data.
• Experimentally evaluate your translation program by answering at least 12 various XPath queries on DBLP XML records. Compare the performance with its counterpart by evaluating your XPath queries on the DBLP XML file directly.
Investigators (a team of at most two people): Mohammed, Sicong Liu

• Project 11: Updating XML data using a relational DBMS.
This project is to develop a program to translate simple XML updates posed on DBLP XML records into relational updates.
Read "Query Translation from XPath to SQL in the Presence of Recursive DTDs".
• Extract the DBLP XML records described in Project 8 above, and store the records in relations of the following schema:
conference(id, name)
journal(id, name),
publish(author, title, id, year)
see described in project 10 above. You may use XML2DB mappings to specify XML shredding.
• Write a program that, given an XML update posed on the DBLP XML records, translates the update into an equivalent relational update on the tables you created.
An XML update is either insert e into p, or delete p, where e is a constant XML element, and p is an XPath query in the fragment below:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant. If an input update violates the DBLP DTD or an equivalent SQL translation does not exist, raise exception.
• Populate the relations as described in Project 10.
• Generate at least 12 XML updates.
• Experimentally evaluate your view-update program by issuing the XML updates generated on DBLP XML records. Report the running time.
• Compare the performance with its counterpart by updating the DBLP XML file directly.
Investigators (a team of at most two people): Qing Wang, Yumeng Qin

• Project 12: Incremental maintenance of XML views of relational data.
The objective of this project is to develop a program to propagate changes from relational course data to its XML views.
Read "Incremental evaluation of schema-directed XML publishing".
• Collect data from the teaching page of the School of Informatics, and store the data in a relational database of the following schema
• course(title, cno, type), where type indicates informatics 1, informatics 2, 3rd year, 4th year, MSc, and MEng.
• prereq(cno1, cno2), which indicates that course cno2 is a prerequisite of cno1.
• Write an ATG that builds an XML view of the relational database, such that the view conforms to the DTD given below (only productions are listed):
db -> course*,
course -> cno, title, type, prerequisite*,
prerequisite -> course,
where prerequisite under each course C should list only direct prerequisites of C. Note that this is a recursive DTD.
• Write a program that incrementally modify your published XML view in response to one of the following relational update:
• Insertion of a set of tuples into the course table.
• Insertion of a set of tuples into the prereq table.
• Deletion of a set of tuples from the course table.
• Deletion of a set of tuples from the prereq table.
• Experimentally evaluate the response time of your incremental maintenance program by testing various insertions and deletions posed on the relational database, vs. re-computing the views starting from scratch.
Investigators (a team of at most two people):

• Project 13: XML View Updates.
This project is to develop a program to process simple XML updates posed on XML views published from relational data.
Read "Updating XML Views of Relations".
• Collect data from the teaching page of the School of Informatics, and store the data in a relational database of the following schema
• course(title, cno, type), where type indicates informatics 1, informatics 2, 3rd year, 4th year, MSc, and MEng.
• prereq(cno1, cno2), which indicates that course cno2 is a prerequisite of cno1.
• Write an ATG that builds an XML view of the relational database, such that the view conforms to the DTD given below (only productions are listed):
db -> course*,
course -> cno, title, type, prerequisite*,
prerequisite -> course,
where prerequisite under each course C should list only direct prerequisites of C. Note that this is a recursive DTD.
• Write a program that, given an XML update posed on the ATG view, translates the update into an equivalent relational update on the tables you created. That is, when the relational updates are executed, the new view published by the ATG from the updated relations is precisely the one obtained from the old view modified by the input XML updates. If a translation does not exist, raise exception.
An XML update is either insert e into p, or delete p, where e is a constant XML element, and p is an XPath query in the fragment below:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant.
• Experimentally evaluate your view-update program by issuing various XML updates on the ATG view.
Investigators (a team of at most two people): Liyu Cai, Cheng Feng

• Project 14: Providing XQuery Engines with Update Functionality
The objective is to support XML updates on top of XQuery engines.
An XML update is specified as either insert e into q, or delete q, where e is a constant XML element, and q is a query in XQuery. You may focus on a class of queries q for which you can find an available parser (eg, XPath or even a fragment of XPath) from the Web.
• Develop an algorithm that, given an XML update defined as above, rewrites the update into an equivalent query in XQuery.
• Implement the algorithm on top of any XQuery engine of your choice, to provide the update functionality. You may use the XQuery engine to evaluate the rewritten XML query.
• Generate XML documents of various sizes by using eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/), and design a generator to produce a set of XML updates.
• Experimentally evaluate your algorithm using the XML data and XML updates generated.
Investigators (a team of at most two people):

• Project 15: Constraint Propagation from Relations to XML
Responsibility: Investigate constraint propagation from relations to its XML views.
Background. Consider XML publishing via an ATG g defined in terms of SPJ queries from relations of schema R to XML trees of a DTD D. Suppose that there are keys and foreign keys defined onR. The propagation problem is to determine whether an XML key defined on D is propagated from the keys and foreign keys on R via g. That is, for any relational instance I of R, if I satisfies the relational keys and foreign keys, then g(I) must satisfy the XML key.
Objective. (a) Establish the complexity of the propagation problem. (b) Develop a heuristic algorithm to derive XML keys from relational keys and foreign keys via ATG mappings. (c) Verify the correctness of your algorithm. (d) Experimentally verify the effectiveness and efficiency of your algorithm.
Investigators (a team of at most two people):

• Project 16: Distance Queries on Graphs
Background. A distance query consists of a pair of nodes ( s, t). Given a graph G, it is to find the distance from s to t in G.
Objective.
• Develop two algorithms for evaluating distance queries, one with index and the other without.
• Collect a set of real-life data from the Web (e.g., YouTube ).
• Experimentally evaluate the scalability of your algorithms.
Investigators (a team of at most two people):

• Project 17: Regular Reachability Queries on Graphs
Background. A regular reachability query is of the form ( s, t, R), where ( s, t) is a pair of nodes, and R is a regular expression defined on node labels. Given a graph G, the regular reachability query is to find whether there exists a path from s to t in G, and moreover, the node labels on the path satisfy the regular expression R.
Objective.
• Develop an algorithms for evaluating regular reachability queries on large graphs.
• Give a detailed complexity analysis of your algorithm.
• Collect a set of real-life data from the Web (e.g., YouTube ).
• Experimentally evaluate the scalability of your algorithm.
Investigators (a team of at most three people): Samuel Giffard, Miha Strumbelj, Poh Camee

• Project 18: Graph pattern matching
Read "Graph pattern matching: From intractable to polynomial time"
The objective is to experimentally evaluate graph pattern matching defined in terms of (a) sub-graph isomorphism, and (b) bounded graph simulation.
• Collect a set of real-life data from the Web (e.g., YouTube ).
• Implement an algorithm for graph pattern matching based on each of the following: (a) sub-graph isomorphism, and (b) bounded graph simulation.
• Experimentally evaluate the scalability of the algorithms with respect to the size of data graphs you collected.
• Experimentally evaluate the "accuracy" of the algorithms in terms of sensible matches found, using the real-life data you collected.
Investigators (a team of at most two people):

• Project 19: Incremental graph pattern matching
This project is to develop and experimentally evaluate algorithms for incremental graph pattern matching, based on (a) sub-graph isomorphism, and (b) bounded graph simulation.
• Develop an incremental (heuristic) algorithm for graph pattern matching based on sub-graph isomorphism.
• Implement an algorithm for graph pattern matching based on bounded graph simulation, which handles multiple graph updates.
• Collect a set of real-life data from the Web (e.g., YouTube ).
• Design a generator to produce a set of graph updates.
• Experimentally evaluate the scalability of the incremental algorithms with respect to (a) the size of graph updates, and (b) the size of data graphs you collected.
• Verify that your incremental algorithm for sub-graph isomorphism is correct and is efficient when the changes are small, compared with its batch counterpart.
• Experimentally evaluate the "accuracy" of the algorithms in terms of sensible matches found, using the real-life data you collected.

Investigators (a team of at most two people):

• Project 20: The similarity of graphs
Read "Graph homomorphism revisited for graph matching".
This project is to experimentally evaluate algorithms for measuring the similarity between graphs, based on (a) maximum common subgraphs, and (b) p-homomorphism with maximum cardinality.
• Find and implement a (heuristic) algorithm for finding the maximum subgraph between two data graphs.
• Implement an algorithm for p-homomorphism with maximum cardinality.
• Collect a set of real-life data from the Web (e.g., YouTube ).
• Experimentally evaluate the scalability of the two algorithms with respect to the size of data graphs you collected, and the quality of similarity matches found.
Investigators (a team of at most three people): Rustam Aliyev, Mindaugas Tvaronavicius, Zhisen Li

## Survey

These are individual projects rather than team efforts. The investigator of each project needs to select five recent research papers ( not on the reading list of this course) or commercial systems (or fully implemented research prototype systems) on the topic chosen, and carefully evaluate the papers or systems. The final report should consist of (a) a clear problem statement; (b) a set of criteria for the evaluation, (c) justification for the criteria; and (d) evaluation and assessment of the papers or systems selected, based on the criteria you determined. Topics include, but are not limited to the following:

• Project 21: XML data generators and benchmarks
Investigators: Alexandru Seran

• Project 22: Native XML stores
Investigators:

• Project 23: XQuery engines
Investigators:

• Project 24: Integrity constraints for XML
Investigators: Yuge Kong

• Project 25: Specifications of XML data: Schema
Investigators:

• Project 26: XML data compression
Investigators: Maxim Cramer

• Project 27: XML shredding in commercial systems
Investigators: Tassybayeva Assel

• Project 28: XML publishing in commercial systems
Investigators:

• Project 29: Querying XML data stored in relations
Investigators: Yi Ran

Investigators: Mihaela Dragomir

• Project 31: Incremental maintenance of XML views
Investigators:

• Project 32: XML view updates
Investigators: Olesya Razuvayevskaya

• Project 33: XML publish-subscribe systems
Investigators:

• Project 34: Access control for XML data
Investigators: Anukanti Hegde

• Project 35: Graph pattern matching in social networks
Investigators:

• Project 36: Incremental graph pattern matching
Investigators: Ying Zeng

• Project 37: Algorithms for evaluating reachability queries and distance queries
Investigators:

• Project 38: Detecting mirrors of Web sites
Investigators: Fatima Asger

• Project 39: Query languages for graph data
Investigators: Euan Gillespie

• Project 40: Graph query engines
Investigators:

Topic revision: r20 - 27 Mar 2012 - 14:59:30 - Main.s0944873

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback