Tuesday, December 4, 2012

Database Management System Chapter 10: XML

Chapter 10: XML

Introduction

n XML: Extensible Markup Language

n Defined by the WWW Consortium (W3C)

n Originally intended as a document markup language not a database language

H Documents have tags giving extra information about sections of the document

4 E.g. <title> XML </title> <slide> Introduction …</slide>

H Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML

H Extensible, unlike HTML

4 Users can add new tags, and separately specify how the tag should be handled for display

H Goal was (is?) to replace HTML as the language for publishing documents on the Web

n The ability to specify new tags, and to create nested tag structures made XML a great way to exchange data, not just documents.

H Much of the use of XML has been in data exchange applications, not as a replacement for HTML

n Tags make data (relatively) self-documenting

H E.g.
<bank>

<account>

<account-number> A-101 </account-number>

<branch-name> Downtown </branch-name>

<balance> 500 </balance>

</account>

<depositor>

<account-number> A-101 </account-number>

<customer-name> Johnson </customer-name>

</depositor>

</bank>
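The self-documenting nature of the tags can be seen by loading the example into a generic parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

BANK_XML = """
<bank>
  <account>
    <account-number> A-101 </account-number>
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <depositor>
    <account-number> A-101 </account-number>
    <customer-name> Johnson </customer-name>
  </depositor>
</bank>
"""

root = ET.fromstring(BANK_XML)

# Each child of <account> carries its own tag name, so the reader needs
# no external column list to interpret the values.
account = root.find("account")
fields = {child.tag: child.text.strip() for child in account}
print(fields)
```

The parser knows nothing about banks; the field names come entirely from the document.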

XML: Motivation

n Data interchange is critical in today’s networked world

H Examples:

4 Banking: funds transfer

4 Order processing (especially inter-company orders)

4 Scientific data

– Chemistry: ChemML, …

– Genetics: BSML (Bio-Sequence Markup Language), …

H Paper flow of information between organizations is being replaced by electronic flow of information

n Each application area has its own set of standards for representing information

n XML has become the basis for all new generation data interchange formats

n Earlier generation formats were based on plain text with line headers indicating the meaning of fields

H Similar in concept to email headers

H Does not allow for nested structures, no standard “type” language

H Tied too closely to low level document structure (lines, spaces, etc)

n Each XML based standard defines what are valid elements, using

H XML type specification languages to specify the syntax

4 DTD (Document Type Definition)

4 XML Schema

H Plus textual descriptions of the semantics

n XML allows new tags to be defined as required

H However, this may be constrained by DTDs

n A wide variety of tools is available for parsing, browsing and querying XML documents/data

Structure of XML Data

n Tag: label for a section of data

n Element: section of data beginning with <tagname> and ending with matching </tagname>

n Elements must be properly nested

H Proper nesting

4 <account> … <balance> …. </balance> </account>

H Improper nesting

4 <account> … <balance> …. </account> </balance>

H Formally: every start tag must have a unique matching end tag, that is in the context of the same parent element.

n Every document must have a single top-level element
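The nesting and single-root rules are enforced by any conforming parser. The sketch below, assuming Python's standard xml.etree.ElementTree, wraps the parser in a hypothetical helper is_well_formed and tries it on the proper and improper nesting examples above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

proper = "<account><balance>400</balance></account>"
improper = "<account><balance>400</account></balance>"   # end tags cross
two_roots = "<account/><account/>"                       # no single top-level element

for doc in (proper, improper, two_roots):
    print(is_well_formed(doc), doc)
```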

Example of Nested Elements

<bank-1>
<customer>

<customer-name> Hayes </customer-name>

<customer-street> Main </customer-street>

<customer-city> Harrison </customer-city>

<account>

<account-number> A-102 </account-number>

<branch-name> Perryridge </branch-name>

<balance> 400 </balance>

</account>

<account>

…

</account>

           </customer>
         .
         .

</bank-1>

Motivation for Nesting

n Nesting of data is useful in data transfer

H Example: elements representing customer-id, customer name, and address nested within an order element

n Nesting is not supported, or discouraged, in relational databases

H With multiple orders, customer name and address are stored redundantly

H Normalization replaces the nested structure in each order with a foreign key into a table storing customer name and address information

H Nesting is supported in object-relational databases

n But nesting is appropriate when transferring data

H External application does not have direct access to data referenced by a foreign key

n Mixture of text with sub-elements is legal in XML.

H Example:

<account>

This account is seldom used any more.

<account-number> A-102</account-number>

<branch-name> Perryridge</branch-name>

<balance>400 </balance>
</account>

H Useful for document markup, but discouraged for data representation

Attributes

n Elements can have attributes

H <account acct-type = "checking">

<account-number> A-102 </account-number>

<branch-name> Perryridge </branch-name>

<balance> 400 </balance>

</account>

n Attributes are specified by name=value pairs inside the starting tag of an element

n An element may have several attributes, but each attribute name can only occur once

4 <account acct-type = "checking" monthly-fee="5">
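A short sketch of how attributes appear to a program, assuming Python's standard xml.etree.ElementTree. The helper name parses is hypothetical; the duplicate-attribute document is made up to illustrate the "each attribute name can only occur once" rule.

```python
import xml.etree.ElementTree as ET

account = ET.fromstring(
    '<account acct-type="checking" monthly-fee="5">'
    "<balance> 400 </balance>"
    "</account>"
)

# Attributes arrive as name/value pairs, separate from the subelements.
attrs = dict(account.attrib)
print(attrs)

def parses(text: str) -> bool:
    """Return True if text is accepted by the XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The same attribute name twice on one element is rejected.
duplicate_accepted = parses('<account acct-type="checking" acct-type="savings"/>')
print(duplicate_accepted)
```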

Attributes Vs. Subelements

n Distinction between subelement and attribute

H In the context of documents, attributes are part of markup, while subelement contents are part of the basic document contents

H In the context of data representation, the difference is unclear and may be confusing

4 Same information can be represented in two ways

– <account account-number = "A-101"> …. </account>

– <account>
<account-number>A-101</account-number> …
</account>

H Suggestion: use attributes for identifiers of elements, and use subelements for contents

More on XML Syntax

n Elements without subelements or text content can be abbreviated by ending the start tag with a /> and deleting the end tag

H <account number="A-101" branch="Perryridge" balance="200" />

n To store string data that may contain tags, without the tags being interpreted as subelements, use CDATA as below

H <![CDATA[<account> … </account>]]>

4 Here, <account> and </account> are treated as just strings
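Both points can be checked with a parser. This is a minimal sketch assuming Python's standard xml.etree.ElementTree; the wrapping <note> element is a hypothetical container added only so that the CDATA section has a parent.

```python
import xml.etree.ElementTree as ET

# The abbreviated form and the explicit start/end pair describe the same element.
short = ET.fromstring('<account number="A-101" branch="Perryridge" balance="200" />')
long_form = ET.fromstring('<account number="A-101" branch="Perryridge" balance="200"></account>')
print(short.attrib == long_form.attrib)

# Inside CDATA the angle brackets are plain characters, not markup.
note = ET.fromstring("<note><![CDATA[<account> ... </account>]]></note>")
print(repr(note.text), len(note))
```

The <note> element ends up with text content and zero subelements, which is the behaviour described above.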

Namespaces

n XML data has to be exchanged between organizations

n Same tag name may have different meaning in different organizations, causing confusion on exchanged documents

n Specifying a unique string as an element name avoids confusion

n Better solution: use unique-name:element-name

n Avoid using long unique names all over document by using XML Namespaces

<bank xmlns:FB="http://www.FirstBank.com">
…

<FB:branch>

<FB:branchname>Downtown</FB:branchname>

<FB:branchcity> Brooklyn </FB:branchcity>

</FB:branch>
…

</bank>
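A sketch of how a namespace-aware parser sees the FB prefix, assuming Python's standard xml.etree.ElementTree. The local prefix "fb" in the lookup table is an arbitrary choice; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

doc = """
<bank xmlns:FB="http://www.FirstBank.com">
  <FB:branch>
    <FB:branchname>Downtown</FB:branchname>
    <FB:branchcity> Brooklyn </FB:branchcity>
  </FB:branch>
</bank>
"""
root = ET.fromstring(doc)

# Prefix-to-URI table used only on the querying side.
ns = {"fb": "http://www.FirstBank.com"}
branch = root.find("fb:branch", ns)
name = branch.find("fb:branchname", ns).text

# Internally the short prefix is expanded to the full unique name.
expanded_tag = branch.tag
print(name, expanded_tag)
```

An unqualified lookup such as root.find("branch") finds nothing, because the element's real name includes the namespace URI.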

XML Document Schema

n Database schemas constrain what information can be stored, and the data types of stored values

n XML documents are not required to have an associated schema

n However, schemas are very important for XML data exchange

H Otherwise, a site cannot automatically interpret data received from another site

n Two mechanisms for specifying XML schema

H Document Type Definition (DTD)

4 Widely used

H XML Schema

4 Newer, increasing use

Document Type Definition (DTD)

n The type of an XML document can be specified using a DTD

n A DTD constrains the structure of XML data

H What elements can occur

H What attributes can/must an element have

H What subelements can/must occur inside each element, and how many times.

n DTD does not constrain data types

H All values represented as strings in XML

n DTD syntax

H <!ELEMENT element (subelements-specification) >

H <!ATTLIST element (attributes) >

Element Specification in DTD

n Subelements can be specified as

H names of elements, or

H #PCDATA (parsed character data), i.e., character strings

H EMPTY (no subelements) or ANY (anything can be a subelement)

n Example

<!ELEMENT depositor (customer-name, account-number)>

<!ELEMENT customer-name (#PCDATA)>

<!ELEMENT account-number (#PCDATA)>

n Subelement specification may have regular expressions

<!ELEMENT bank ( ( account | customer | depositor)+)>

4 Notation:

– “|” - alternatives

– “+” - 1 or more occurrences

– “*” - 0 or more occurrences

Bank DTD

<!DOCTYPE bank [

<!ELEMENT bank ( ( account | customer | depositor)+)>

<!ELEMENT account (account-number, branch-name, balance)>

<!ELEMENT customer (customer-name, customer-street, customer-city)>

<!ELEMENT depositor (customer-name, account-number)>

<!ELEMENT account-number (#PCDATA)>

<!ELEMENT branch-name (#PCDATA)>

<!ELEMENT balance (#PCDATA)>

<!ELEMENT customer-name (#PCDATA)>

<!ELEMENT customer-street (#PCDATA)>

<!ELEMENT customer-city (#PCDATA)>

]>
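Python's standard library does not validate against DTDs, so the following is only a sketch of the idea that a content model is a regular expression over child element names. The table CONTENT_MODELS and the function children_match are hypothetical; the models are transcribed by hand from the bank DTD, with each child name followed by one space.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the bank DTD, rewritten as regular expressions over
# a space-terminated sequence of child tag names.
CONTENT_MODELS = {
    "bank": r"((account|customer|depositor) )+",
    "account": r"account-number branch-name balance ",
    "depositor": r"customer-name account-number ",
}

def children_match(element) -> bool:
    """Check the element's direct children against its content model."""
    child_tags = "".join(child.tag + " " for child in element)
    return re.fullmatch(CONTENT_MODELS[element.tag], child_tags) is not None

good = ET.fromstring("<depositor><customer-name/><account-number/></depositor>")
swapped = ET.fromstring("<depositor><account-number/><customer-name/></depositor>")
print(children_match(good), children_match(swapped))
```

A real validating parser also checks attributes and recurses into subelements; this sketch only checks one level.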

Attribute Specification in DTD

n Attribute specification : for each attribute

H Name

H Type of attribute

4 CDATA

4 ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)

– more on this later

H Whether

4 mandatory (#REQUIRED)

4 has a default value (value),

4 or neither (#IMPLIED)

n Examples

H <!ATTLIST account acct-type CDATA "checking">

H <!ATTLIST customer

customer-id ID #REQUIRED

accounts IDREFS #REQUIRED >

IDs and IDREFs

n An element can have at most one attribute of type ID

n The ID attribute value of each element in an XML document must be distinct

H Thus the ID attribute value is an object identifier

n An attribute of type IDREF must contain the ID value of an element in the same document

n An attribute of type IDREFS contains a set of (0 or more) ID values. Each of these values must be the ID value of an element in the same document

Bank DTD with Attributes

n Bank DTD with ID and IDREF attribute types.

<!DOCTYPE bank-2 [

<!ELEMENT account (branch-name, balance)>

<!ATTLIST account

account-number ID #REQUIRED

owners IDREFS #REQUIRED>

<!ELEMENT customer (customer-name, customer-street, customer-city)>

<!ATTLIST customer

customer-id ID #REQUIRED

accounts IDREFS #REQUIRED>

… declarations for branch-name, balance, customer-name,
customer-street and customer-city
]>

XML data with ID and IDREF attributes

<bank-2>

<account account-number="A-401" owners="C100 C102">

<branch-name> Downtown </branch-name>

<balance> 500 </balance>

</account>

<customer customer-id="C100" accounts="A-401">

<customer-name>Joe </customer-name>

<customer-street> Monroe </customer-street>

<customer-city> Madison</customer-city>

</customer>

<customer customer-id="C102" accounts="A-401 A-402">

<customer-name> Mary </customer-name>

<customer-street> Erin </customer-street>

<customer-city> Newark </customer-city>

</customer>

</bank-2>
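The ID and IDREFS rules amount to a uniqueness check plus a referential-integrity check. The sketch below assumes Python's standard xml.etree.ElementTree, which does not read DTDs, so the tables ID_ATTR and IDREFS_ATTR are hand-copied from the bank-2 DTD and the function check_references is hypothetical. Street and city subelements are left out for brevity.

```python
import xml.etree.ElementTree as ET

BANK2 = """
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe </customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name> Mary </customer-name>
  </customer>
</bank-2>
"""

# Which attribute plays the ID / IDREFS role for each element type.
ID_ATTR = {"account": "account-number", "customer": "customer-id"}
IDREFS_ATTR = {"account": "owners", "customer": "accounts"}

def check_references(root):
    """Return (duplicate IDs, referenced IDs that no element declares)."""
    ids = [el.get(ID_ATTR[el.tag]) for el in root if el.tag in ID_ATTR]
    duplicates = {i for i in ids if ids.count(i) > 1}
    dangling = set()
    for el in root:
        # IDREFS values are blank-separated lists of IDs.
        for ref in el.get(IDREFS_ATTR.get(el.tag, ""), "").split():
            if ref not in ids:
                dangling.add(ref)
    return duplicates, dangling

duplicates, dangling = check_references(ET.fromstring(BANK2))
print(duplicates, dangling)
```

On the data exactly as printed above, the check flags A-402: customer C102 refers to it, but no account element with that ID appears in the excerpt.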

Limitations of DTDs

n No typing of text elements and attributes

H All values are strings, no integers, reals, etc.

n Difficult to specify unordered sets of subelements

H Order is usually irrelevant in databases

H (A | B)* allows specification of an unordered set, but

4 Cannot ensure that each of A and B occurs only once

n IDs and IDREFs are untyped

H The owners attribute of an account may contain a reference to another account, which is meaningless

4 owners attribute should ideally be constrained to refer to customer elements

XML Schema

n XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. Supports

H Typing of values

4 E.g. integer, string, etc

4 Also, constraints on min/max values

H User defined types

H Is itself specified in XML syntax, unlike DTDs

4 More standard representation, but verbose

H Is integrated with namespaces

H Many more features

4 List types, uniqueness and foreign key constraints, inheritance ..

n BUT: significantly more complicated than DTDs, not yet widely used.

XML Schema Version of Bank DTD

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name="bank" type="BankType"/>

<xsd:element name="account">
<xsd:complexType>
      <xsd:sequence>
            <xsd:element name="account-number" type="xsd:string"/>
            <xsd:element name="branch-name" type="xsd:string"/>
            <xsd:element name="balance" type="xsd:decimal"/>
      </xsd:sequence>
</xsd:complexType>

</xsd:element>

….. definitions of customer and depositor ….

<xsd:complexType name="BankType">
<xsd:sequence>

<xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/>

<xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>

<xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/>

</xsd:sequence>

</xsd:complexType>

</xsd:schema>

Querying and Transforming XML Data

n Translation of information from one XML schema to another

n Querying on XML data

n Above two are closely related, and handled by the same tools

n Standard XML querying/translation languages

H XPath

4 Simple language consisting of path expressions

H XSLT

4 Simple language designed for translation from XML to XML and XML to HTML

H XQuery

4 An XML query language with a rich set of features

n A wide variety of other languages have been proposed, and some served as the basis for the XQuery standard

H XML-QL, Quilt, XQL, …

Tree Model of XML Data

n Query and transformation languages are based on a tree model of XML data

n An XML document is modeled as a tree, with nodes corresponding to elements and attributes

H Element nodes have children nodes, which can be attributes or subelements

H Text in an element is modeled as a text node child of the element

H Children of a node are ordered according to their order in the XML document

H Element and attribute nodes (except for the root node) have a single parent, which is an element node

H The root node has a single child, which is the root element of the document

n We use the terminology of nodes, children, parent, siblings, ancestor, descendant, etc., which should be interpreted in the above tree model of XML data.

XPath

n XPath is used to address (select) parts of documents using path expressions

n A path expression is a sequence of steps separated by “/”

H Think of file names in a directory hierarchy

n Result of path expression: set of values that along with their containing elements/attributes match the specified path

n E.g. /bank-2/customer/customer-name evaluated on the bank-2 data we saw earlier returns

<customer-name>Joe</customer-name>

<customer-name>Mary</customer-name>

n E.g. /bank-2/customer/customer-name/text( )

returns the same names, but without the enclosing tags

n The initial “/” denotes root of the document (above the top-level tag)

n Path expressions are evaluated left to right

H Each step operates on the set of instances produced by the previous step

n Selection predicates may follow any step in a path, in [ ]

H E.g. /bank-2/account[balance > 400]

4 returns account elements with a balance value greater than 400

4 /bank-2/account[balance] returns account elements containing a balance subelement

n Attributes are accessed using “@”

H E.g. /bank-2/account[balance > 400]/@account-number

4 returns the account numbers of those accounts with balance > 400

H IDREF attributes are not dereferenced automatically (more on this later)
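The path expressions above can be tried out with Python's standard xml.etree.ElementTree, which implements only a subset of XPath. In this sketch the paths are written relative to the bank-2 root element, and the numeric comparison balance > 400 is done in Python because that subset has no ">" operator. Street and city subelements are omitted from the sample data.

```python
import xml.etree.ElementTree as ET

BANK2 = """
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name> Downtown </branch-name>
    <balance> 500 </balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe </customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name> Mary </customer-name>
  </customer>
</bank-2>
"""
root = ET.fromstring(BANK2)

# /bank-2/customer/customer-name/text()
names = [e.text.strip() for e in root.findall("customer/customer-name")]

# /bank-2/account[balance]: accounts that contain a balance subelement
with_balance = root.findall("account[balance]")

# /bank-2/account[balance > 400]/@account-number, with the comparison in Python
high = [a.get("account-number")
        for a in root.findall("account")
        if float(a.findtext("balance")) > 400]

print(names, len(with_balance), high)
```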

Functions in XPath

n XPath provides several functions

H The function count() takes a path and counts the number of elements in the set generated by that path

4 E.g. /bank-2/account[count(./customer) > 2]

– Returns accounts with > 2 customers

H Also function for testing position (1, 2, ..) of node w.r.t. siblings

n The Boolean connectives “and” and “or” and the function not() can be used in predicates

n IDREFs can be referenced using function id()

H id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks

H E.g. /bank-2/account/id(@owners)

4 returns all customers referred to from the owners attribute of account elements.
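What id() does can be sketched as a dictionary lookup. The following assumes Python's standard xml.etree.ElementTree, which has no id() function of its own; the index by_id and the helper deref are hypothetical names, and the ID attributes are listed by hand from the bank-2 DTD.

```python
import xml.etree.ElementTree as ET

BANK2 = """
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <balance> 500 </balance>
  </account>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe </customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name> Mary </customer-name>
  </customer>
</bank-2>
"""
root = ET.fromstring(BANK2)

# Index every element by the value of its ID attribute.
by_id = {}
for el in root:
    for id_attr in ("account-number", "customer-id"):
        if id_attr in el.attrib:
            by_id[el.get(id_attr)] = el

def deref(idrefs: str):
    """Map a blank-separated IDREFS string to the elements it points at."""
    return [by_id[ref] for ref in idrefs.split() if ref in by_id]

# Counterpart of /bank-2/account/id(@owners)
owners = [c.findtext("customer-name").strip()
          for a in root.findall("account")
          for c in deref(a.get("owners"))]
print(owners)
```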

More XPath Features

n Operator “|” used to implement union

H E.g. /bank-2/account/id(@owners) | /bank-2/loan/id(@borrower)

4 gives customers with either accounts or loans

4 However, “|” cannot be nested inside other operators.

n “//” can be used to skip multiple levels of nodes

H E.g. /bank-2//customer-name

4 finds any customer-name element anywhere under the /bank-2 element, regardless of the element in which it is contained.

n A step in the path can go to:

parents, siblings, ancestors and descendants

of the nodes generated by the previous step, not just to the children

H “//”, described above, is a short form for specifying “all descendants”

H “..” specifies the parent.

H We omit further details.

XSLT

n A stylesheet stores formatting options for a document, usually separately from document

H E.g. HTML style sheet may specify font colors and sizes for headings, etc.

n The XML Stylesheet Language (XSL) was originally designed for generating HTML from XML

n XSLT is a general-purpose transformation language

H Can translate XML to XML, and XML to HTML

n XSLT transformations are expressed using rules called templates

H Templates combine selection using XPath with construction of results

XSLT Templates

n Example of XSLT template with match and select part

<xsl:template match="/bank-2/customer">

<xsl:value-of select="customer-name"/>

</xsl:template>

<xsl:template match="*"/>

n The match attribute of xsl:template specifies a pattern in XPath

n Elements in the XML document matching the pattern are processed by the actions within the xsl:template element

H xsl:value-of selects (outputs) specified values (here, customer-name)

n For elements that do not match any template

H Attributes and text contents are output as is

H Templates are recursively applied on subelements

n The <xsl:template match="*"/> template matches all elements that do not match any other template

H Used to ensure that their contents do not get output.

n If an element matches several templates, only one is used

H Which one depends on a complex priority scheme/user-defined priorities

H We assume only one template matches any element

Creating XML Output

n Any text or tag in the XSL stylesheet that is not in the xsl namespace is output as is

n E.g. to wrap results in new XML elements.

<xsl:template match="/bank-2/customer">

<customer>

<xsl:value-of select="customer-name"/>

</customer>

</xsl:template>

<xsl:template match="*"/>

H Example output:
<customer> Joe </customer>
<customer> Mary </customer>

n Note: Cannot directly insert a xsl:value-of tag inside another tag

H E.g. cannot create an attribute for <customer> in the previous example by directly using xsl:value-of

H XSLT provides a construct xsl:attribute to handle this situation

4 xsl:attribute adds an attribute to the enclosing element

4 E.g. <customer>

<xsl:attribute name="customer-id">

<xsl:value-of select = "@customer-id"/>

</xsl:attribute>

</customer>

results in output of the form

<customer customer-id="…."> ….

n xsl:element is used to create output elements with computed names
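Python's standard library has no XSLT processor, so the sketch below only mimics the effect of the customer template: select the matching elements, then construct a new element whose attribute and text are computed from the input. It assumes xml.etree.ElementTree, and the comments name the XSLT construct each line stands in for.

```python
import xml.etree.ElementTree as ET

BANK2 = """
<bank-2>
  <customer customer-id="C100" accounts="A-401">
    <customer-name>Joe </customer-name>
  </customer>
  <customer customer-id="C102" accounts="A-401 A-402">
    <customer-name> Mary </customer-name>
  </customer>
</bank-2>
"""
root = ET.fromstring(BANK2)

outputs = []
for cust in root.findall("customer"):                   # match="/bank-2/customer"
    out = ET.Element("customer")                        # literal result element
    out.set("customer-id", cust.get("customer-id"))     # role of xsl:attribute
    out.text = cust.findtext("customer-name").strip()   # role of xsl:value-of
    outputs.append(ET.tostring(out, encoding="unicode"))

print("\n".join(outputs))
```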

Structural Recursion

Joins in XSLT

Sorting in XSLT

n Using an xsl:sort directive inside a template causes all elements matching the template to be sorted

H Sorting is done before applying other templates

n E.g.
<xsl:template match="/bank">
      <xsl:apply-templates select="customer">
            <xsl:sort select="customer-name"/>
      </xsl:apply-templates>
</xsl:template>
<xsl:template match="customer">
      <customer>
            <xsl:value-of select="customer-name"/>
            <xsl:value-of select="customer-street"/>
            <xsl:value-of select="customer-city"/>
      </customer>
</xsl:template>
<xsl:template match="*"/>

XQuery

n XQuery is a general purpose query language for XML data

n Currently being standardized by the World Wide Web Consortium (W3C)

H The textbook description is based on a March 2001 draft of the standard. The final version may differ, but the major features are likely to stay unchanged.

n Alpha version of XQuery engine available free from Microsoft

n XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL

n XQuery uses a
      for … let … where … result …
syntax
     for ⇔ SQL from
     where ⇔ SQL where
     result ⇔ SQL select
     let allows temporary variables, and has no equivalent in SQL

FLWR Syntax in XQuery

n For clause uses XPath expressions, and variable in for clause ranges over values in the set returned by XPath

n Simple FLWR expression in XQuery

H find all accounts with balance > 400, with each result enclosed in an <account-number> .. </account-number> tag
     for $x in /bank-2/account
     let $acctno := $x/@account-number
     where $x/balance > 400
     return <account-number> $acctno </account-number>

n The let clause is not really needed in this query, and the selection can be done in XPath. The query can be written as:

     for $x in /bank-2/account[balance>400]
     return <account-number> $x/@account-number </account-number>
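The clauses of the FLWR expression map closely onto an ordinary loop. The following is a sketch of that correspondence, assuming Python's standard xml.etree.ElementTree; it is an analogy, not an XQuery engine. The second account (A-402 with balance 300) is made-up data added so that the where clause filters something out.

```python
import xml.etree.ElementTree as ET

BANK2 = """
<bank-2>
  <account account-number="A-401"><balance> 500 </balance></account>
  <account account-number="A-402"><balance> 300 </balance></account>
</bank-2>
"""
root = ET.fromstring(BANK2)

result = []
for x in root.findall("account"):                 # for $x in /bank-2/account
    acctno = x.get("account-number")              # let $acctno := $x/@account-number
    if float(x.findtext("balance")) > 400:        # where $x/balance > 400
        result.append(                            # return <account-number> ... 
            "<account-number>%s</account-number>" % acctno)

print(result)
```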

Path Expressions and Functions

n Path expressions are used to bind variables in the for clause, but can also be used in other places

H E.g. path expressions can be used in let clause, to bind variables to results of path expressions

n The function distinct( ) can be used to remove duplicates in path expression results

n The function document(name) returns root of named document

H E.g. document(“bank-2.xml”)/bank-2/account

n Aggregate functions such as sum( ) and count( ) can be applied to path expression results

n XQuery does not support group by, but the same effect can be obtained with nested queries, using nested FLWR expressions within a result clause

H More on nested queries later

Joins

n Joins are specified in a manner very similar to SQL

for $a in /bank/account,

$c in /bank/customer,

$d in /bank/depositor

where $a/account-number = $d/account-number
and $c/customer-name = $d/customer-name

return <cust-acct> $c $a </cust-acct>

n The same query can be expressed with the selections specified as XPath selections:

       for $a in /bank/account,
           $c in /bank/customer,
           $d in /bank/depositor[
                      account-number = $a/account-number and
                      customer-name = $c/customer-name]

       return <cust-acct> $c $a</cust-acct>
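The first form of the join corresponds to nested loops with a filter. The sketch below, assuming Python's standard xml.etree.ElementTree, spells that out on a small flat bank document; the customer elements are trimmed to their names to keep the sample short.

```python
import xml.etree.ElementTree as ET

BANK = """
<bank>
  <account><account-number>A-101</account-number><balance>500</balance></account>
  <account><account-number>A-102</account-number><balance>400</balance></account>
  <customer><customer-name>Johnson</customer-name></customer>
  <customer><customer-name>Hayes</customer-name></customer>
  <depositor><account-number>A-101</account-number><customer-name>Johnson</customer-name></depositor>
  <depositor><account-number>A-102</account-number><customer-name>Hayes</customer-name></depositor>
</bank>
"""
root = ET.fromstring(BANK)

# for $a ..., $c ..., $d ... where <join conditions> return ...
pairs = [
    (c.findtext("customer-name"), a.findtext("account-number"))
    for a in root.findall("account")
    for c in root.findall("customer")
    for d in root.findall("depositor")
    if a.findtext("account-number") == d.findtext("account-number")
    and c.findtext("customer-name") == d.findtext("customer-name")
]
print(pairs)
```

A real query processor would normally pick a better join strategy than three nested loops; the loops only show what the query means.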

Changing Nesting Structure

n The following query converts data from the flat structure for bank information into the nested structure used in bank-1

<bank-1>

for $c in /bank/customer

return

<customer>

$c/*

for $d in /bank/depositor[customer-name = $c/customer-name],

$a in /bank/account[account-number=$d/account-number]

return $a

</customer>

</bank-1>

n $c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag

n Exercise for reader: write a nested query to find sum of account
balances, grouped by branch.
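The restructuring query (flat bank to nested bank-1) can be mirrored step by step in a general-purpose language. This is a sketch assuming Python's standard xml.etree.ElementTree and copy modules; the sample data is a trimmed flat bank document, and elements are deep-copied so the input tree is left unchanged.

```python
import copy
import xml.etree.ElementTree as ET

BANK = """
<bank>
  <account><account-number>A-101</account-number><balance>500</balance></account>
  <account><account-number>A-102</account-number><balance>400</balance></account>
  <customer><customer-name>Johnson</customer-name></customer>
  <customer><customer-name>Hayes</customer-name></customer>
  <depositor><account-number>A-101</account-number><customer-name>Johnson</customer-name></depositor>
  <depositor><account-number>A-102</account-number><customer-name>Hayes</customer-name></depositor>
</bank>
"""
flat = ET.fromstring(BANK)

bank1 = ET.Element("bank-1")
for c in flat.findall("customer"):                       # for $c in /bank/customer
    cust = ET.SubElement(bank1, "customer")
    cust.extend([copy.deepcopy(ch) for ch in c])         # $c/*
    for d in flat.findall("depositor"):
        if d.findtext("customer-name") != c.findtext("customer-name"):
            continue
        for a in flat.findall("account"):
            if a.findtext("account-number") == d.findtext("account-number"):
                cust.append(copy.deepcopy(a))            # return $a

print(ET.tostring(bank1, encoding="unicode"))
```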

XQuery Path Expressions

n $c/text() gives text content of an element without any
subelements/tags

n XQuery path expressions support the “->” operator for dereferencing IDREFs

H Equivalent to the id( ) function of XPath, but simpler to use

H Can be applied to a set of IDREFs to get a set of results

H June 2001 version of standard has changed “->” to “=>”

Sorting in XQuery

n Sortby clause can be used at the end of any expression. E.g. to return customers sorted by name
for $c in /bank/customer
return <customer> $c/* </customer> sortby(customer-name)

n Can sort at multiple levels of nesting (sort by customer-name, and by account-number within each customer)

<bank-1>
   for $c in /bank/customer
   return
      <customer>
          $c/*
          for $d in /bank/depositor[customer-name=$c/customer-name],
              $a in /bank/account[account-number=$d/account-number]
          return <account> $a/* </account> sortby(account-number)
      </customer> sortby(customer-name)
</bank-1>

Functions and Other XQuery Features

n User defined functions with the type system of XMLSchema
function balances(xsd:string $c) returns list(xsd:numeric) {
     for $d in /bank/depositor[customer-name = $c],
           $a in /bank/account[account-number=$d/account-number]
     return $a/balance

}

n Types are optional for function parameters and return values

n Universal and existential quantification in where clause predicates

H some $e in path satisfies P

H every $e in path satisfies P

n XQuery also supports if-then-else clauses

Application Program Interface

n There are two standard application program interfaces to XML data:

H SAX (Simple API for XML)

4 Based on parser model, user provides event handlers for parsing events

– E.g. start of element, end of element

– Not suitable for database applications

H DOM (Document Object Model)

4 XML data is parsed into a tree representation

4 Variety of functions provided for traversing the DOM tree

4 E.g.: Java DOM API provides Node class with methods
          getParentNode( ), getFirstChild( ), getNextSibling( )
          getAttribute( ), getData( ) (for text node)
          getElementsByTagName( ), …

4 Also provides functions for updating DOM tree
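The two API styles can be compared side by side. The sketch below uses Python's standard xml.sax and xml.dom.minidom modules rather than the Java API named above; the Python DOM exposes the same navigation as properties (parentNode, firstChild, nextSibling). The handler class TagLogger is a hypothetical name.

```python
import xml.sax
from xml.dom import minidom

DOC = "<account><account-number>A-102</account-number><balance>400</balance></account>"

class TagLogger(xml.sax.ContentHandler):
    """SAX style: the parser calls back on each event; no tree is built."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

handler = TagLogger()
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(handler.events)

# DOM style: the whole document is parsed into a tree that can be traversed.
dom = minidom.parseString(DOC)
root = dom.documentElement
first = root.firstChild                  # the <account-number> element
first_text = first.firstChild.data       # its text node
sibling = first.nextSibling.tagName      # next element at the same level
parent = first.parentNode.tagName
print(first_text, sibling, parent)
```

The SAX handler sees the document as a stream and keeps only what it chooses to record, while the DOM tree stays in memory and can be navigated in any direction.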

Storage of XML Data

n XML data can be stored in

H Non-relational data stores

4 Flat files

– Natural for storing XML

– But has all problems discussed in Chapter 1 (no concurrency, no recovery, …)

4 XML database

– Database built specifically for storing XML data, supporting DOM model and declarative querying

– Currently no commercial-grade systems

H Relational databases

4 Data must be translated into relational form

4 Advantage: mature database systems

4 Disadvantages: overhead of translating data and queries

Storage of XML in Relational Databases

n Alternatives:

H String Representation

H Tree Representation

H Map to relations

String Representation

n Store each top level element as a string field of a tuple in a relational database

H Use a single relation to store all elements, or

H Use a separate relation for each top-level element type

4 E.g. account, customer, depositor relations

– Each with a string-valued attribute to store the element

n Indexing:

H Store values of subelements/attributes to be indexed as extra fields of the relation, and build indices on these fields

4 E.g. customer-name or account-number

H Oracle 9 supports function indices which use the result of a function as the key value.

4 The function should return the value of the required subelement/attribute

n Benefits:

H Can store any XML data even without DTD

H As long as there are many top-level elements in a document, strings are small compared to full document

4 Allows fast access to individual elements.

n Drawback: Need to parse strings to access values inside the elements

H Parsing is slow.

Tree Representation

n Tree representation: model XML data as tree and store using relations
nodes(id, type, label, value)
child (child-id, parent-id)

n Each element/attribute is given a unique identifier

n Type indicates element/attribute

n Label specifies the tag name of the element/name of attribute

n Value is the text value of the element/attribute

n The relation child notes the parent-child relationships in the tree

H Can add an extra attribute to child to record ordering of children

n Benefit: Can store any XML data, even without DTD

n Drawbacks:

H Data is broken up into too many pieces, increasing space overheads

H Even simple queries require a large number of joins, which can be slow
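A small sketch of the tree representation, assuming Python's standard sqlite3 and xml.etree.ElementTree modules. Column names use underscores (child_id, parent_id) because hyphens are awkward in SQL identifiers, and the position column is the optional ordering attribute mentioned above; the function name store is hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes(id INTEGER PRIMARY KEY, type TEXT, label TEXT, value TEXT);
CREATE TABLE child(child_id INTEGER, parent_id INTEGER, position INTEGER);
""")

def store(element, parent_id=None, position=0):
    """Insert one element, its attributes and its subtree; return its node id."""
    text = (element.text or "").strip() or None
    node_id = conn.execute(
        "INSERT INTO nodes(type, label, value) VALUES ('element', ?, ?)",
        (element.tag, text)).lastrowid
    if parent_id is not None:
        conn.execute("INSERT INTO child VALUES (?, ?, ?)", (node_id, parent_id, position))
    for name, value in element.attrib.items():
        attr_id = conn.execute(
            "INSERT INTO nodes(type, label, value) VALUES ('attribute', ?, ?)",
            (name, value)).lastrowid
        conn.execute("INSERT INTO child VALUES (?, ?, NULL)", (attr_id, node_id))
    for pos, sub in enumerate(element):
        store(sub, node_id, pos)
    return node_id

store(ET.fromstring(
    '<account acct-type="checking">'
    "<account-number>A-102</account-number><balance>400</balance></account>"))

# Even "balance of account A-102" needs four joins over the two relations.
row = conn.execute("""
SELECT bal.value
FROM nodes acct
JOIN child c1 ON c1.parent_id = acct.id
JOIN nodes num ON num.id = c1.child_id
             AND num.label = 'account-number' AND num.value = 'A-102'
JOIN child c2 ON c2.parent_id = acct.id
JOIN nodes bal ON bal.id = c2.child_id AND bal.label = 'balance'
WHERE acct.label = 'account'
""").fetchone()
print(row)
```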

Mapping XML Data to Relations

n Map to relations

H If DTD of document is known, can map data to relations

H A relation is created for each element type

4 Elements (of type #PCDATA), and attributes are mapped to attributes of relations

4 More details on next slide …

n Benefits:

H Efficient storage

H Can translate XML queries into SQL, execute efficiently, and then translate SQL results back to XML

n Drawbacks: need to know DTD, translation overheads still present

n Relation created for each element type contains

H An id attribute to store a unique id for each element

H A relation attribute corresponding to each element attribute

H A parent-id attribute to keep track of parent element

4 As in the tree representation

4 Position information (i^th child) can be stored too

n All subelements that occur only once can become relation attributes

H For text-valued subelements, store the text as attribute value

H For complex subelements, can store the id of the subelement

n Subelements that can occur multiple times represented in a separate table

H Similar to handling of multivalued attributes when converting ER diagrams to tables

n E.g. For bank-1 DTD with account elements nested within customer elements, create relations

H customer(id, parent-id, customer-name, customer-street, customer-city)

4 parent-id can be dropped here since parent is the sole root element

4 All other attributes were subelements of type #PCDATA, and occur only once

H account (id, parent-id, account-number, branch-name, balance)

4 parent-id keeps track of which customer an account occurs under

4 Same account may be represented many times with different parents
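The customer and account relations above can be populated directly from a bank-1 document. This is a sketch assuming Python's standard sqlite3 and xml.etree.ElementTree modules; column names use underscores in place of hyphens, parent-id is dropped from customer as noted above, and the column types are assumptions since a DTD carries no type information.

```python
import sqlite3
import xml.etree.ElementTree as ET

BANK1 = """
<bank-1>
  <customer>
    <customer-name>Hayes</customer-name>
    <customer-street>Main</customer-street>
    <customer-city>Harrison</customer-city>
    <account>
      <account-number>A-102</account-number>
      <branch-name>Perryridge</branch-name>
      <balance>400</balance>
    </account>
  </customer>
</bank-1>
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer(id INTEGER PRIMARY KEY,
                      customer_name TEXT, customer_street TEXT, customer_city TEXT);
CREATE TABLE account(id INTEGER PRIMARY KEY,
                     parent_id INTEGER REFERENCES customer(id),
                     account_number TEXT, branch_name TEXT, balance REAL);
""")

for cust in ET.fromstring(BANK1).findall("customer"):
    # Subelements that occur once become columns of the customer row.
    cust_id = conn.execute(
        "INSERT INTO customer(customer_name, customer_street, customer_city) VALUES (?, ?, ?)",
        (cust.findtext("customer-name"), cust.findtext("customer-street"),
         cust.findtext("customer-city"))).lastrowid
    # Repeating subelements go to their own table, linked through parent_id.
    for acct in cust.findall("account"):
        conn.execute(
            "INSERT INTO account(parent_id, account_number, branch_name, balance) VALUES (?, ?, ?, ?)",
            (cust_id, acct.findtext("account-number"), acct.findtext("branch-name"),
             float(acct.findtext("balance"))))

rows = conn.execute("""
SELECT c.customer_name, a.account_number, a.balance
FROM customer c JOIN account a ON a.parent_id = c.id
""").fetchall()
print(rows)
```

Because an account nested under two customers is inserted once per parent, the same account number can appear in several account rows, which is the redundancy noted in the last bullet.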
