FUNDAMENTALS OF DATABASE SYSTEMS
Fourth Edition

Ramez Elmasri
Department of Computer Science Engineering
University of Texas at Arlington

Shamkant B. Navathe
College of Computing
Georgia Institute of Technology

Boston San Francisco New York London Toronto Sydney Tokyo Singapore Madrid Mexico City Munich Paris Cape Town Hong Kong Montreal

Sponsoring Editor: Maite Suarez-Rivas
Project Editor: Katherine Harutunian
Senior Production Supervisor: Juliet Silveri
Production Services: Argosy Publishing
Cover Designer: Beth Anderson
Marketing Manager: Nathan Schultz
Senior Marketing Coordinator: Lesly Hershman
Print Buyer: Caroline Fell
Cover image © 2003 Digital Vision

Access the latest information about Addison-Wesley titles from our World Wide Web site: http://www.aw.com/cs

Figure 12.14 is a logical data model diagram definition in Rational Rose®. Figure 12.15 is a graphical data model diagram in Rational Rose®. Figure 12.17 is the company database class diagram drawn in Rational Rose®. IBM® has acquired Rational Rose®.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.

The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

Library of Congress Cataloging-in-Publication Data
Elmasri, Ramez.
Fundamentals of database systems / Ramez Elmasri, Shamkant B. Navathe.--4th ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-321-12226-7
1. Database management. I. Navathe, Sham. II. Title.
QA76.9.D3E57 2003
005.74--dc21 2003057734

For information on obtaining permission for the use of material from this work, please submit a written request to Pearson Education, Inc., Rights and Contracts Department, 75 Arlington St., Suite 300, Boston, MA 02116 or fax your request to 617-848-7047.

Copyright © 2004 by Pearson Education, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.

1 2 3 4 5 6 7 8 9 10-HT-06050403

To Amalia with love
R. E.

To my mother Vijaya and wife Aruna for their love and support
S. B. N.

Preface

This book introduces the fundamental concepts necessary for designing, using, and implementing database systems and applications. Our presentation stresses the fundamentals of database modeling and design, the languages and facilities provided by the database management systems, and system implementation techniques. The book is meant to be used as a textbook for a one- or two-semester course in database systems at the junior, senior, or graduate level, and as a reference book. We assume that readers are familiar with elementary programming and data-structuring concepts and that they have had some exposure to basic computer organization.

We start in Part 1 with an introduction and a presentation of the basic concepts and terminology, and database conceptual modeling principles. We conclude the book in Parts 7 and 8 with an introduction to emerging technologies, such as data mining, XML, security, and Web databases. Along the way, in Parts 2 through 6, we provide an in-depth treatment of the most important aspects of database fundamentals.
The following key features are included in the fourth edition:
• The entire book follows a self-contained, flexible organization that can be tailored to individual needs.
• Coverage of data modeling now includes both the ER model and UML.
• A new advanced SQL chapter with material on SQL programming techniques, such as JDBC and SQL/CLI.
• Two examples running throughout the book, called COMPANY and UNIVERSITY, allow the reader to compare different approaches that use the same application.
• Coverage has been updated on security, mobile databases, GIS, and Genome data management.
• A new chapter on XML and Internet databases.
• A new chapter on data mining.
• A significant revision of the supplements to include a robust set of materials for instructors and students, and an online case study.

Main Differences from the Third Edition

There are several organizational changes in the fourth edition, as well as some important new chapters. The main changes are as follows:
• The chapters on file organizations and indexing (Chapters 5 and 6 in the third edition) have been moved to Part 4, and are now Chapters 13 and 14. Part 4 also includes Chapters 15 and 16 on query processing and optimization, and physical database design and tuning (this corresponds to Chapter 18 and sections 16.3-16.4 of the third edition).
• The relational model coverage has been reorganized and updated in Part 2. Chapter 5 covers relational model concepts and constraints. The material on relational algebra and calculus is now together in Chapter 6. Relational database design using ER-to-relational and EER-to-relational mapping is in Chapter 7. SQL is covered in Chapters 8 and 9, with the new material on SQL programming techniques in sections 9.3 through 9.6.
• Part 3 covers database design theory and methodology. Chapters 10 and 11 on normalization theory correspond to Chapters 14 and 15 of the third edition.
Chapter 12 on practical database design has been updated to include more UML coverage.
• The chapters on transactions, concurrency control, and recovery (19, 20, and 21 in the third edition) are now Chapters 17, 18, and 19 in Part 5.
• The chapters on object-oriented concepts, the ODMG object model, and object-relational systems (11, 12, and 13 in the third edition) are now 20, 21, and 22 in Part 6. Chapter 22 has been reorganized and updated.
• Chapters 10 and 17 of the third edition have been dropped. The material on client-server architectures has been merged into Chapters 2 and 25.
• The chapters on security, enhanced models (active, temporal, spatial, multimedia), and distributed databases (Chapters 22, 23, and 24 in the third edition) are now 23, 24, and 25 in Part 7. The security chapter has been updated. Chapter 25 of the third edition on deductive databases has been merged into Chapter 24, and is now section 24.4.
• Chapter 26 is a new chapter on XML (eXtensible Markup Language) and how it is related to accessing relational databases over the Internet.
• The material on data mining and data warehousing (Chapter 26 of the third edition) has been separated into two chapters. Chapter 27 on data mining has been expanded and updated.

Contents of This Edition

Part 1 describes the basic concepts necessary for a good understanding of database design and implementation, as well as the conceptual modeling techniques used in database systems. Chapters 1 and 2 introduce databases, their typical users, and DBMS concepts, terminology, and architecture. In Chapter 3, the concepts of the Entity-Relationship (ER) model and ER diagrams are presented and used to illustrate conceptual database design. Chapter 4 focuses on data abstraction and semantic data modeling concepts and extends the ER model to incorporate these ideas, leading to the enhanced-ER (EER) data model and EER diagrams. The concepts presented include subclasses, specialization, generalization, and union types (categories). The notation for the class diagrams of UML is also introduced in Chapters 3 and 4.

Part 2 describes the relational data model and relational DBMSs. Chapter 5 describes the basic relational model, its integrity constraints, and update operations. Chapter 6 describes the operations of the relational algebra and introduces the relational calculus. Chapter 7 discusses relational database design using ER- and EER-to-relational mapping. Chapter 8 gives a detailed overview of the SQL language, covering the SQL standard, which is implemented in most relational systems. Chapter 9 covers SQL programming topics such as SQLJ, JDBC, and SQL/CLI.

Part 3 covers several topics related to database design. Chapters 10 and 11 cover the formalisms, theories, and algorithms developed for relational database design by normalization. This material includes functional and other types of dependencies and normal forms of relations. Step-by-step intuitive normalization is presented in Chapter 10, and relational design algorithms are given in Chapter 11, which also defines other types of dependencies, such as multivalued and join dependencies. Chapter 12 presents an overview of the different phases of the database design process for medium-sized and large applications, using UML.

Part 4 starts with a description of the physical file structures and access methods used in database systems. Chapter 13 describes primary methods of organizing files of records on disk, including static and dynamic hashing. Chapter 14 describes indexing techniques for files, including B-tree and B+-tree data structures and grid files. Chapter 15 introduces the basics of query processing and optimization, and Chapter 16 discusses physical database design and tuning.

Part 5 discusses transaction processing, concurrency control, and recovery techniques, including discussions of how these concepts are realized in SQL.
Part 6 gives a comprehensive introduction to object databases and object-relational systems. Chapter 20 introduces object-oriented concepts. Chapter 21 gives a detailed overview of the ODMG object model and its associated ODL and OQL languages. Chapter 22 describes how relational databases are being extended to include object-oriented concepts and presents the features of object-relational systems, as well as giving an overview of some of the features of the SQL3 standard and the nested relational data model.

Parts 7 and 8 cover a number of advanced topics. Chapter 23 gives an overview of database security and authorization, including the SQL commands to GRANT and REVOKE privileges, and expanded coverage of security concepts such as encryption, roles, and flow control. Chapter 24 introduces several enhanced database models for advanced applications. These include active databases and triggers, and temporal, spatial, multimedia, and deductive databases. Chapter 25 gives an introduction to distributed databases and the three-tier client-server architecture. Chapter 26 is a new chapter on XML (eXtensible Markup Language). It first discusses the differences between structured, semistructured, and unstructured models, then presents XML concepts, and finally compares the XML model to traditional database models. Chapter 27 on data mining has been expanded and updated. Chapter 28 introduces data warehousing concepts. Finally, Chapter 29 gives introductions to the topics of mobile databases, multimedia databases, GIS (Geographic Information Systems), and Genome data management in bioinformatics.

Appendix A gives a number of alternative diagrammatic notations for displaying a conceptual ER or EER schema. These may be substituted for the notation we use, if the instructor so wishes. Appendix C gives some important physical parameters of disks. Appendixes B, E, and F are on the web site. Appendix B is a new case study that follows the design and implementation of a bookstore's database. Appendixes E and F cover legacy database systems, based on the network and hierarchical database models. These have been used for over thirty years as a basis for many existing commercial database applications and transaction-processing systems and will take decades to replace completely. We consider it important to expose students of database management to these long-standing approaches. Full chapters from the third edition can be found on the web site for this edition.

Guidelines for Using This Book

There are many different ways to teach a database course. The chapters in Parts 1 through 5 can be used in an introductory course on database systems in the order that they are given or in the preferred order of each individual instructor. Selected chapters and sections may be left out, and the instructor can add other chapters from the rest of the book, depending on the emphasis of the course. At the end of each chapter's opening section, we list sections that are candidates for being left out whenever a less detailed discussion of the topic in a particular chapter is desired. We suggest covering up to Chapter 14 in an introductory database course and including selected parts of other chapters, depending on the background of the students and the desired coverage. For an emphasis on system implementation techniques, chapters from Parts 4 and 5 can be included.

Chapters 3 and 4, which cover conceptual modeling using the ER and EER models, are important for a good conceptual understanding of databases. However, they may be partially covered, covered later in a course, or even left out if the emphasis is on DBMS implementation. Chapters 13 and 14 on file organizations and indexing may also be covered early on, later, or even left out if the emphasis is on database models and languages.
For students who have already taken a course on file organization, parts of these chapters could be assigned as reading material or some exercises may be assigned to review the concepts.

A total life-cycle database design and implementation project covers conceptual design (Chapters 3 and 4), data model mapping (Chapter 7), normalization (Chapter 10), and implementation in SQL (Chapter 9). Additional documentation on the specific RDBMS would be required.

The book has been written so that it is possible to cover topics in a variety of orders. The chart included here shows the major dependencies between chapters. As the diagram illustrates, it is possible to start with several different topics following the first two introductory chapters. Although the chart may seem complex, it is important to note that if the chapters are covered in order, the dependencies are not lost. The chart can be consulted by instructors wishing to use an alternative order of presentation.

For a single-semester course based on this book, some chapters can be assigned as reading material. Parts 4, 7, and 8 can be considered for such an assignment. The book can also be used for a two-semester sequence. The first course, "Introduction to Database Design/Systems," at the sophomore, junior, or senior level, could cover most of Chapters 1 to 14. The second course, "Database Design and Implementation Techniques," at the senior or first-year graduate level, can cover Chapters 15 to 28. Chapters from Parts 7 and 8 can be used selectively in either semester, and material describing the DBMS available to the students at the local institution can be covered in addition to the material in the book.

Supplemental Materials

The supplements to this book have been significantly revised. With Addison-Wesley's Database Place there is a robust set of interactive reference materials to help students with their study of modeling, normalization, and SQL.
Each tutorial asks students to solve problems (such as writing an SQL query, drawing an ER diagram, or normalizing a relation), and then provides useful feedback based on the student's solution. Addison-Wesley's Database Place helps students master the key concepts of all database courses. For more information visit aw.com/databaseplace.

In addition, the following supplements are available to all readers of this book at www.aw.com/cssupport.
• Additional content: This includes a new Case Study on the design and implementation of a bookstore's database as well as chapters from previous editions that are not included in the fourth edition.
• A set of PowerPoint lecture notes

A solutions manual is also available to qualified instructors. Please contact your local Addison-Wesley sales representative, or send e-mail to aw.cse@aw.com, for information on how to access it.

Acknowledgements

It is a great pleasure for us to acknowledge the assistance and contributions of a large number of individuals to this effort. First, we would like to thank our editors, Maite Suarez-Rivas, Katherine Harutunian, Daniel Rausch, and Juliet Silveri. In particular we would like to acknowledge the efforts and help of Katherine Harutunian, our primary contact for the fourth edition.

We would like to acknowledge also those persons who have contributed to the fourth edition. We appreciated the contributions of the following reviewers: Phil Bernhard, Florida Tech; Zhengxin Chen, University of Nebraska at Omaha; Jan Chomicki, University of Buffalo; Hakan Ferhatosmanoglu, Ohio State University; Len Fisk, California State University, Chico; William Hankley, Kansas State University; Ali R. Hurson, Penn State University; Vijay Kumar, University of Missouri-Kansas City; Peretz Shoval, Ben-Gurion University, Israel; Jason T. L. Wang, New Jersey Institute of Technology; and Ed Omiecinski of Georgia Tech, who contributed to Chapter 27.
Ramez Elmasri would like to thank his students Hyoil Han, Babak Hojabri, Jack Fu, Charley Li, Ande Swathi, and Steven Wu, who contributed to the material in Chapter 26. He would also like to acknowledge the support provided by the University of Texas at Arlington.

Sham Navathe would like to acknowledge Dan Forsythe and the following students at Georgia Tech: Weimin Feng, Angshuman Guin, Abrar Ul-Haque, Bin Liu, Ying Liu, Wanxia Xie, and Waigen Yee.

We would like to repeat our thanks to those who have reviewed and contributed to previous editions of Fundamentals of Database Systems. For the first edition these individuals include Alan Apt (editor), Don Batory, Scott Downing, Dennis Heimbinger, Julia Hodges, Yannis Ioannidis, Jim Larson, Dennis McLeod, Per-Ake Larson, Rahul Patel, Nicholas Roussopoulos, David Stemple, Michael Stonebraker, Frank Tompa, and Kyu-Young Whang; for the second edition they include Dan Joraanstad (editor), Rafi Ahmed, Antonio Albano, David Beech, Jose Blakeley, Panos Chrysanthis, Suzanne Dietrich, Vic Ghorpadey, Goetz Graefe, Eric Hanson, Junguk L. Kim, Roger King, Vram Kouramajian, Vijay Kumar, John Lowther, Sanjay Manchanda, Toshimi Minoura, Inderpal Mumick, Ed Omiecinski, Girish Pathak, Raghu Ramakrishnan, Ed Robertson, Eugene Sheng, David Stotts, Marianne Winslett, and Stan Zdonick. For the third edition they include Suzanne Dietrich, Ed Omiecinski, Rafi Ahmed, Francois Bancilhon, Jose Blakeley, Rick Cattell, Ann Chervenak, David W. Embley, Henry A. Edinger, Leonidas Fegaras, Dan Forsyth, Farshad Fotouhi, Michael Franklin, Sreejith Gopinath, Goetz Graefe, Richard Hull, Sushil Jajodia, Ramesh K. Karne, Harish Kotbagi, Vijay Kumar, Tarcisio Lima, Ramon A. Mata-Toledo, Jack McCaw, Dennis McLeod, Rokia Missaoui, Magdi Morsi, M. Narayanaswamy, Carlos Ordonez, Joan Peckham, Betty Salzberg, Ming-Chien Shan, Junping Sun, Rajshekhar Sunderraman, Aravindan Veerasamy, and Emilia E. Villareal.
Last but not least, we gratefully acknowledge the support, encouragement, and patience of our families.

R.E.
S.B.N.

Contents

PART 1 INTRODUCTION AND CONCEPTUAL MODELING

CHAPTER 1 Databases and Database Users
1.1 Introduction
1.2 An Example
1.3 Characteristics of the Database Approach
1.4 Actors on the Scene
1.5 Workers behind the Scene
1.6 Advantages of Using the DBMS Approach
1.7 A Brief History of Database Applications
1.8 When Not to Use a DBMS
1.9 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 2 Database System Concepts and Architecture
2.1 Data Models, Schemas, and Instances
2.2 Three-Schema Architecture and Data Independence
2.3 Database Languages and Interfaces
2.4 The Database System Environment
2.5 Centralized and Client/Server Architectures for DBMSs
2.6 Classification of Database Management Systems
2.7 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 3 Data Modeling Using the Entity-Relationship Model
3.1 Using High-Level Conceptual Data Models for Database Design
3.2 An Example Database Application
3.3 Entity Types, Entity Sets, Attributes, and Keys
3.4 Relationship Types, Relationship Sets, Roles, and Structural Constraints
3.5 Weak Entity Types
3.6 Refining the ER Design for the COMPANY Database
3.7 ER Diagrams, Naming Conventions, and Design Issues
3.8 Notation for UML Class Diagrams
3.9 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 4 Enhanced Entity-Relationship and UML Modeling
4.1 Subclasses, Superclasses, and Inheritance
4.2 Specialization and Generalization
4.3 Constraints and Characteristics of Specialization and Generalization
4.4 Modeling of UNION Types Using Categories
4.5 An Example UNIVERSITY EER Schema and Formal Definitions for the EER Model
4.6 Representing Specialization/Generalization and Inheritance in UML Class Diagrams
4.7 Relationship Types of Degree Higher Than Two
4.8 Data Abstraction, Knowledge Representation, and Ontology Concepts
4.9 Summary
Review Questions
Exercises
Selected Bibliography

PART 2 RELATIONAL MODEL: CONCEPTS, CONSTRAINTS, LANGUAGES, DESIGN, AND PROGRAMMING

CHAPTER 5 The Relational Data Model and Relational Database Constraints
5.1 Relational Model Concepts
5.2 Relational Model Constraints and Relational Database Schemas
5.3 Update Operations and Dealing with Constraint Violations
5.4 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 6 The Relational Algebra and Relational Calculus
6.1 Unary Relational Operations: SELECT and PROJECT
6.2 Relational Algebra Operations from Set Theory
6.3 Binary Relational Operations: JOIN and DIVISION
6.4 Additional Relational Operations
6.5 Examples of Queries in Relational Algebra
6.6 The Tuple Relational Calculus
6.7 The Domain Relational Calculus
6.8 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 7 Relational Database Design by ER- and EER-to-Relational Mapping
7.1 Relational Database Design Using ER-to-Relational Mapping
7.2 Mapping EER Model Constructs to Relations
7.3 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 8 SQL-99: Schema Definition, Basic Constraints, and Queries
8.1 SQL Data Definition and Data Types
8.2 Specifying Basic Constraints in SQL
8.3 Schema Change Statements in SQL
8.4 Basic Queries in SQL
8.5 More Complex SQL Queries
8.6 Insert, Delete, and Update Statements in SQL
8.7 Additional Features of SQL
8.8 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 9 More SQL: Assertions, Views, and Programming Techniques
9.1 Specifying General Constraints as Assertions
9.2 Views (Virtual Tables) in SQL
9.3 Database Programming: Issues and Techniques
9.4 Embedded SQL, Dynamic SQL, and SQLJ
9.5 Database Programming with Function Calls: SQL/CLI and JDBC
9.6 Database Stored Procedures and SQL/PSM
9.7 Summary
Review Questions
Exercises
Selected Bibliography

PART 3 DATABASE DESIGN THEORY AND METHODOLOGY

CHAPTER 10 Functional Dependencies and Normalization for Relational Databases
10.1 Informal Design Guidelines for Relation Schemas
10.2 Functional Dependencies
10.3 Normal Forms Based on Primary Keys
10.4 General Definitions of Second and Third Normal Forms
10.5 Boyce-Codd Normal Form
10.6 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 11 Relational Database Design Algorithms and Further Dependencies
11.1 Properties of Relational Decompositions
11.2 Algorithms for Relational Database Schema Design
11.3 Multivalued Dependencies and Fourth Normal Form
11.4 Join Dependencies and Fifth Normal Form
11.5 Inclusion Dependencies
11.6 Other Dependencies and Normal Forms
11.7 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 12 Practical Database Design Methodology and Use of UML Diagrams
12.1 The Role of Information Systems in Organizations
12.2 The Database Design and Implementation Process
12.3 Use of UML Diagrams as an Aid to Database Design Specification
12.4 Rational Rose, A UML Based Design Tool
12.5 Automated Database Design Tools
12.6 Summary
Review Questions
Selected Bibliography

PART 4 DATA STORAGE, INDEXING, QUERY PROCESSING, AND PHYSICAL DESIGN

CHAPTER 13 Disk Storage, Basic File Structures, and Hashing
13.1 Introduction
13.2 Secondary Storage Devices
13.3 Buffering of Blocks
13.4 Placing File Records on Disk
13.5 Operations on Files
13.6 Files of Unordered Records (Heap Files)
13.7 Files of Ordered Records (Sorted Files)
13.8 Hashing Techniques
13.9 Other Primary File Organizations
13.10 Parallelizing Disk Access Using RAID Technology
13.11 Storage Area Networks
13.12 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 14 Indexing Structures for Files
14.1 Types of Single-Level Ordered Indexes
14.2 Multilevel Indexes
14.3 Dynamic Multilevel Indexes Using B-Trees and B+-Trees
14.4 Indexes on Multiple Keys
14.5 Other Types of Indexes
14.6 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 15 Algorithms for Query Processing and Optimization
15.1 Translating SQL Queries into Relational Algebra
15.2 Algorithms for External Sorting
15.3 Algorithms for SELECT and JOIN Operations
15.4 Algorithms for PROJECT and SET Operations
15.5 Implementing Aggregate Operations and Outer Joins
15.6 Combining Operations Using Pipelining
15.7 Using Heuristics in Query Optimization
15.8 Using Selectivity and Cost Estimates in Query Optimization
15.9 Overview of Query Optimization in ORACLE
15.10 Semantic Query Optimization
15.11 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 16 Practical Database Design and Tuning
16.1 Physical Database Design in Relational Databases
16.2 An Overview of Database Tuning in Relational Systems
16.3 Summary
Review Questions
Selected Bibliography

PART 5 TRANSACTION PROCESSING CONCEPTS

CHAPTER 17 Introduction to Transaction Processing Concepts and Theory
17.1 Introduction to Transaction Processing
17.2 Transaction and System Concepts
17.3 Desirable Properties of Transactions
17.4 Characterizing Schedules Based on Recoverability
17.5 Characterizing Schedules Based on Serializability
17.6 Transaction Support in SQL
17.7 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 18 Concurrency Control Techniques
18.1 Two-Phase Locking Techniques for Concurrency Control
18.2 Concurrency Control Based on Timestamp Ordering
18.3 Multiversion Concurrency Control Techniques
18.4 Validation (Optimistic) Concurrency Control Techniques
18.5 Granularity of Data Items and Multiple Granularity Locking
18.6 Using Locks for Concurrency Control in Indexes
18.7 Other Concurrency Control Issues
18.8 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 19 Database Recovery Techniques
19.1 Recovery Concepts
19.2 Recovery Techniques Based on Deferred Update
19.3 Recovery Techniques Based on Immediate Update
19.4 Shadow Paging
19.5 The ARIES Recovery Algorithm
19.6 Recovery in Multidatabase Systems
19.7 Database Backup and Recovery from Catastrophic Failures
19.8 Summary
Review Questions
Exercises
Selected Bibliography

PART 6 OBJECT AND OBJECT-RELATIONAL DATABASES

CHAPTER 20 Concepts for Object Databases
20.1 Overview of Object-Oriented Concepts
20.2 Object Identity, Object Structure, and Type Constructors
20.3 Encapsulation of Operations, Methods, and Persistence
20.4 Type and Class Hierarchies and Inheritance
20.5 Complex Objects
20.6 Other Object-Oriented Concepts
20.7 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 21 Object Database Standards, Languages, and Design
21.1 Overview of the Object Model of ODMG
21.2 The Object Definition Language ODL
21.3 The Object Query Language OQL
21.4 Overview of the C++ Language Binding
21.5 Object Database Conceptual Design
21.6 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 22 Object-Relational and Extended-Relational Systems
22.1 Overview of SQL and Its Object-Relational Features
22.2 Evolution and Current Trends of Database Technology
22.3 The Informix Universal Server
22.4 Object-Relational Features of Oracle 8
22.5 Implementation and Related Issues for Extended Type Systems
22.6 The Nested Relational Model
22.7 Summary
Selected Bibliography

PART 7 FURTHER TOPICS

CHAPTER 23 Database Security and Authorization
23.1 Introduction to Database Security Issues
23.2 Discretionary Access Control Based on Granting and Revoking Privileges
23.3 Mandatory Access Control and Role-Based Access Control for Multilevel Security
23.4 Introduction to Statistical Database Security
23.5 Introduction to Flow Control
23.6 Encryption and Public Key Infrastructures
23.7 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 24 Enhanced Data Models for Advanced Applications
24.1 Active Database Concepts and Triggers
24.2 Temporal Database Concepts
24.3 Multimedia Databases
24.4 Introduction to Deductive Databases
24.5 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 25 Distributed Databases and Client-Server Architectures
25.1 Distributed Database Concepts
25.2 Data Fragmentation, Replication, and Allocation Techniques for Distributed Database Design
25.3 Types of Distributed Database Systems
25.4 Query Processing in Distributed Databases
25.5 Overview of Concurrency Control and Recovery in Distributed Databases
25.6 An Overview of 3-Tier Client-Server Architecture
25.7 Distributed Databases in Oracle
25.8 Summary
Review Questions
Exercises
Selected Bibliography

PART 8 EMERGING TECHNOLOGIES

CHAPTER 26 XML and Internet Databases
26.1 Structured, Semistructured, and Unstructured Data
26.2 XML Hierarchical (Tree) Data Model
26.3 XML Documents, DTD, and XML Schema
26.4 XML Documents and Databases
26.5 XML Querying
26.6 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 27 Data Mining Concepts
27.1 Overview of Data Mining Technology
27.2 Association Rules
27.3 Classification
27.4 Clustering
27.5 Approaches to Other Data Mining Problems
27.6 Applications of Data Mining
27.7 Commercial Data Mining Tools
27.8 Summary
Review Questions
Exercises
Selected Bibliography

CHAPTER 28 Overview of Data Warehousing and OLAP
28.1 Introduction, Definitions, and Terminology
28.2 Characteristics of Data Warehouses
28.3 Data Modeling for Data Warehouses
28.4 Building a Data Warehouse
28.5 Typical Functionality of a Data Warehouse
28.6 Data Warehouse Versus Views
28.7 Problems and Open Issues in Data Warehouses
28.8 Summary
Review Questions
Selected Bibliography

CHAPTER 29 Emerging Database Technologies and Applications
29.1 Mobile Databases
29.2 Multimedia Databases
29.3 Geographic Information Systems
29.4 Genome Data Management

APPENDIX A Alternative Diagrammatic Notations
APPENDIX B Database Design and Application Implementation Case Study (located on the web)
APPENDIX C Parameters of Disks
APPENDIX D Overview of the QBE Language
APPENDIX E Hierarchical Data Model (located on the web)
APPENDIX F Network Data Model (located on the web)
Selected Bibliography
Index

INTRODUCTION AND CONCEPTUAL MODELING

Databases and Database Users

Databases and database systems have become an essential component of everyday life in modern society. In the course of a day, most of us encounter several activities that involve some interaction with a database. For example, if we go to the bank to deposit or withdraw funds, if we make a hotel or airline reservation, if we access a computerized library catalog to search for a bibliographic item, or if we buy some item, such as a book, toy, or computer, from an Internet vendor through its Web page, chances are that our activities will involve someone or some computer program accessing a database. Even purchasing items from a supermarket nowadays in many cases involves an automatic update of the database that keeps the inventory of supermarket items.

These interactions are examples of what we may call traditional database applications, in which most of the information that is stored and accessed is either textual or numeric. In the past few years, advances in technology have been leading to exciting new applications of database systems. Multimedia databases can now store pictures, video clips, and sound messages. Geographic information systems (GIS) can store and analyze maps, weather data, and satellite images. Data warehouses and online analytical processing (OLAP) systems are used in many companies to extract and analyze useful information from very large databases for decision making. Real-time and active database technology is used in controlling industrial and manufacturing processes. And database search techniques are being applied to the World Wide Web to improve the search for information that is needed by users browsing the Internet.
To understand the fundamentals of database technology, however, we must start from the basics of traditional database applications. So, in Section 1.1 of this chapter we define what a database is, and then we give definitions of other basic terms. In Section 1.2, we provide a simple UNIVERSITY database example to illustrate our discussion. Section 1.3 describes some of the main characteristics of database systems, and Sections 1.4 and 1.5 categorize the types of personnel whose jobs involve using and interacting with database systems. Sections 1.6, 1.7, and 1.8 offer a more thorough discussion of the various capabilities provided by database systems and discuss some typical database applications. Section 1.9 summarizes the chapter.

The reader who desires only a quick introduction to database systems can study Sections 1.1 through 1.5, then skip or browse through Sections 1.6 through 1.8 and go on to Chapter 2.

1.1 INTRODUCTION

Databases and database technology are having a major impact on the growing use of computers. It is fair to say that databases play a critical role in almost all areas where computers are used, including business, electronic commerce, engineering, medicine, law, education, and library science, to name a few. The word database is in such common use that we must begin by defining what a database is. Our initial definition is quite general.

A database is a collection of related data.¹ By data, we mean known facts that can be recorded and that have implicit meaning. For example, consider the names, telephone numbers, and addresses of the people you know. You may have recorded this data in an indexed address book, or you may have stored it on a hard drive, using a personal computer and software such as Microsoft Access or Excel. This is a collection of related data with an implicit meaning and hence is a database.
The preceding definition of database is quite general; for example, we may consider the collection of words that make up this page of text to be related data and hence to constitute a database. However, the common use of the term database is usually more restricted. A database has the following implicit properties:

• A database represents some aspect of the real world, sometimes called the miniworld or the universe of discourse (UoD). Changes to the miniworld are reflected in the database.
• A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot correctly be referred to as a database.
• A database is designed, built, and populated with data for a specific purpose. It has an intended group of users and some preconceived applications in which these users are interested.

1. We will use the word data as both singular and plural, as is common in database literature; context will determine whether it is singular or plural. In standard English, data is used only for plural; datum is used for singular.

In other words, a database has some source from which data is derived, some degree of interaction with events in the real world, and an audience that is actively interested in the contents of the database.

A database can be of any size and of varying complexity. For example, the list of names and addresses referred to earlier may consist of only a few hundred records, each with a simple structure. On the other hand, the computerized catalog of a large library may contain half a million entries organized under different categories (by primary author's last name, by subject, by book title), with each category organized in alphabetic order. A database of even greater size and complexity is maintained by the Internal Revenue Service to keep track of the tax forms filed by U.S. taxpayers.
If we assume that there are 100 million taxpayers and if each taxpayer files an average of five forms with approximately 400 characters of information per form, we would get a database of 100 × 10^6 × 400 × 5 = 2 × 10^11 characters (bytes) of information. If the IRS keeps the past three returns for each taxpayer in addition to the current return, that is, four returns in all, we would get a database of 4 × 2 × 10^11 = 8 × 10^11 bytes (800 gigabytes). This huge amount of information must be organized and managed so that users can search for, retrieve, and update the data as needed.

A database may be generated and maintained manually or it may be computerized. For example, a library card catalog is a database that may be created and maintained manually. A computerized database may be created and maintained either by a group of application programs written specifically for that task or by a database management system. In this book we are concerned only with computerized databases.

A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing, manipulating, and sharing databases among various users and applications. Defining a database involves specifying the data types, structures, and constraints for the data to be stored in the database. Constructing the database is the process of storing the data itself on some storage medium that is controlled by the DBMS. Manipulating a database includes such functions as querying the database to retrieve specific data, updating the database to reflect changes in the miniworld, and generating reports from the data. Sharing a database allows multiple users and programs to access the database concurrently.

Other important functions provided by the DBMS include protecting the database and maintaining it over a long period of time.
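The defining, constructing, and manipulating steps described above can be sketched with a small embedded engine. The following is a minimal illustration only, assuming Python's standard sqlite3 module as the DBMS and reusing the address-book example of this section; the table name, column names, and sample entries are hypothetical, and sharing among concurrent users is not shown.

```python
import sqlite3

# An in-memory database stands in for the stored database.
conn = sqlite3.connect(":memory:")

# Defining: specify data types, structure, and constraints.
# (Table and column names are hypothetical.)
conn.execute(
    """
    CREATE TABLE contact (
        name    TEXT NOT NULL PRIMARY KEY,
        phone   TEXT,
        address TEXT
    )
    """
)

# Constructing: store the data itself under control of the DBMS.
conn.executemany(
    "INSERT INTO contact (name, phone, address) VALUES (?, ?, ?)",
    [
        ("Ada", "555-0100", "12 Elm St"),
        ("Grace", "555-0199", "7 Oak Ave"),
    ],
)

# Manipulating: update to reflect a change in the miniworld, then query.
conn.execute("UPDATE contact SET phone = ? WHERE name = ?", ("555-0123", "Ada"))
conn.commit()

rows = conn.execute("SELECT name, phone FROM contact ORDER BY name").fetchall()
print(rows)
```

The point of the sketch is only that the three activities are distinct requests to the same general-purpose software, none of which requires writing storage-level code.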
Protection includes both system protection against hardware or software malfunction (or crashes), and security protection against unauthorized or malicious access. A typical large database may have a life cycle of many years, so the DBMS must be able to maintain the database system by allowing the system to evolve as requirements change over time.

It is not necessary to use general-purpose DBMS software to implement a computerized database. We could write our own set of programs to create and maintain the database, in effect creating our own special-purpose DBMS software. In either case, whether we use a general-purpose DBMS or not, we usually have to deploy a considerable amount of complex software. In fact, most DBMSs are very complex software systems.

To complete our initial definitions, we will call the database and DBMS software together a database system. Figure 1.1 illustrates some of the concepts we discussed so far.

[FIGURE 1.1 A simplified database system environment. Users/programmers submit application programs/queries to the database system. Within it, the DBMS software consists of software to process queries/programs and software to access stored data, which operate on the stored database definition (meta-data) and the stored database.]

1.2 AN EXAMPLE

Let us consider a simple example that most readers may be familiar with: a UNIVERSITY database for maintaining information concerning students, courses, and grades in a university environment. Figure 1.2 shows the database structure and a few sample data for such a database. The database is organized as five files, each of which stores data records of the same type.² The STUDENT file stores data on each student, the COURSE file stores data on each course, the SECTION file stores data on each section of a course, the GRADE_REPORT file stores the grades that students receive in the various sections they have completed, and the PREREQUISITE file stores the prerequisites of each course.
To define this database, we must specify the structure of the records of each file by specifying the different types of data elements to be stored in each record. In Figure 1.2, each STUDENT record includes data to represent the student's Name, StudentNumber, Class (freshman or 1, sophomore or 2, ...), and Major (mathematics or MATH, computer science or CS, ...); each COURSE record includes data to represent the CourseName, CourseNumber, CreditHours, and Department (the department that offers the course); and so on. We must also specify a data type for each data element within a record. For example, we can specify that Name of STUDENT is a string of alphabetic characters, StudentNumber of STUDENT is an integer, and Grade of GRADE_REPORT is a single character from the set {A, B, C, D, F, I}. We may also use a coding scheme to represent the values of a data item. For example, in Figure 1.2 we represent the Class of a STUDENT as 1 for freshman, 2 for sophomore, 3 for junior, 4 for senior, and 5 for graduate student.

2. We use the term file informally here. At a conceptual level, a file is a collection of records that may or may not be ordered.

[FIGURE 1.2 A database that stores student and course information.]

To construct the UNIVERSITY database, we store data to represent each student, course, section, grade report, and prerequisite as a record in the appropriate file. Notice that records in the various files may be related. For example, the record for "Smith" in the STUDENT file is related to two records in the GRADE_REPORT file that specify Smith's grades in two sections. Similarly, each record in the PREREQUISITE file relates two course records: one representing the course and the other representing the prerequisite.
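The record definitions, the grade set, the Class coding scheme, and the relationship between a STUDENT record and its GRADE_REPORT records can be sketched in a few lines. This is an illustrative sketch, not the book's notation: the Python class and function names are hypothetical, and the sample values only loosely follow the figures (the Class and Major values in particular are assumed).

```python
from dataclasses import dataclass

# Coding scheme for Class and the allowed Grade values, as given in the text.
CLASS_CODES = {1: "freshman", 2: "sophomore", 3: "junior", 4: "senior", 5: "graduate"}
VALID_GRADES = {"A", "B", "C", "D", "F", "I"}


@dataclass(frozen=True)
class Student:
    name: str            # string of alphabetic characters
    student_number: int  # integer
    class_code: int      # coded value, see CLASS_CODES
    major: str


@dataclass(frozen=True)
class GradeReport:
    student_number: int
    section_identifier: int
    grade: str           # single character from VALID_GRADES

    def __post_init__(self):
        # Enforce the data type constraint on Grade.
        if self.grade not in VALID_GRADES:
            raise ValueError(f"invalid grade: {self.grade!r}")


# Sample records (Class and Major values are assumptions for illustration).
students = [Student("Smith", 17, 1, "CS"), Student("Brown", 8, 2, "CS")]
grade_reports = [
    GradeReport(17, 112, "B"),
    GradeReport(17, 119, "C"),
    GradeReport(8, 85, "A"),
]


def reports_for(name):
    """Return the GRADE_REPORT records related to the named student."""
    numbers = {s.student_number for s in students if s.name == name}
    return [g for g in grade_reports if g.student_number in numbers]


smith_reports = reports_for("Smith")
print(len(smith_reports), CLASS_CODES[students[0].class_code])
```

Here the relationship is carried by the shared StudentNumber value, which is how the STUDENT and GRADE_REPORT records of the example refer to one another.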
Most medium-size and large databases include many types of records and have many relationships among the records.

Database manipulation involves querying and updating. Examples of queries are "retrieve the transcript (a list of all courses and grades) of Smith," "list the names of students who took the section of the Database course offered in fall 1999 and their grades in that section," and "what are the prerequisites of the Database course?" Examples of updates are "change the class of Smith to Sophomore," "create a new section for the Database course for this semester," and "enter a grade of A for Smith in the Database section of last semester." These informal queries and updates must be specified precisely in the query language of the DBMS before they can be processed.

1.3 CHARACTERISTICS OF THE DATABASE APPROACH

A number of characteristics distinguish the database approach from the traditional approach of programming with files. In traditional file processing, each user defines and implements the files needed for a specific software application as part of programming the application. For example, one user, the grade reporting office, may keep a file on students and their grades. Programs to print a student's transcript and to enter new grades into the file are implemented as part of the application. A second user, the accounting office, may keep track of students' fees and their payments. Although both users are interested in data about students, each user maintains separate files (and programs to manipulate these files) because each requires some data not available from the other user's files. This redundancy in defining and storing data results in wasted storage space and in redundant effort to keep common data up to date.

In the database approach, a single repository of data is maintained that is defined once and then is accessed by various users.
The main characteristics of the database approach versus the file-processing approach are the following:

• Self-describing nature of a database system
• Insulation between programs and data, and data abstraction
• Support of multiple views of the data
• Sharing of data and multiuser transaction processing

We next describe each of these characteristics in a separate section. Additional characteristics of database systems are discussed in Sections 1.6 through 1.8.

1.3.1 Self-Describing Nature of a Database System

A fundamental characteristic of the database approach is that the database system contains not only the database itself but also a complete definition or description of the database structure and constraints. This definition is stored in the DBMS catalog, which contains information such as the structure of each file, the type and storage format of each data item, and various constraints on the data. The information stored in the catalog is called meta-data, and it describes the structure of the primary database (Figure 1.1).

The catalog is used by the DBMS software and also by database users who need information about the database structure. A general-purpose DBMS software package is not written for a specific database application, and hence it must refer to the catalog to know the structure of the files in a specific database, such as the type and format of data it will access. The DBMS software must work equally well with any number of database applications (for example, a university database, a banking database, or a company database) as long as the database definition is stored in the catalog.

In traditional file processing, data definition is typically part of the application programs themselves. Hence, these programs are constrained to work with only one specific database, whose structure is declared in the application programs.
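As an aside on the catalog described above, the idea that a database stores its own description can be observed in a small embedded engine. The sketch below assumes Python's standard sqlite3 module; the STUDENT columns follow Section 1.2, while the chosen column types are assumptions. SQLite keeps each table definition in its sqlite_master table and reports column meta-data through PRAGMA table_info.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define a STUDENT file; the column types are illustrative assumptions.
conn.execute(
    "CREATE TABLE STUDENT ("
    "Name TEXT, StudentNumber INTEGER PRIMARY KEY, Class INTEGER, Major TEXT)"
)

# The definition is itself stored as data in SQLite's catalog table.
catalog_rows = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
).fetchall()

# PRAGMA table_info yields one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(STUDENT)")]

print(catalog_rows[0][0])
print(columns)
```

A program written this way learns the structure of STUDENT at run time from the meta-data, rather than having the structure declared in its own source text.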
For example, an application program written in C++ may have struct or class declarations, and a COBOL program has Data Division statements to define its files. Whereas file-processing software can access only specific databases, DBMS software can access diverse databases by extracting the database definitions from the catalog and then using these definitions.

In the example shown in Figure 1.2, the DBMS catalog will store the definitions of all the files shown. These definitions are specified by the database designer prior to creating the actual database and are stored in the catalog. Whenever a request is made to access, say, the Name of a STUDENT record, the DBMS software refers to the catalog to determine the structure of the STUDENT file and the position and size of the Name data item within a STUDENT record. By contrast, in a typical file-processing application, the file structure and, in the extreme case, the exact location of Name within a STUDENT record are already coded within each program that accesses this data item.

1.3.2 Insulation between Programs and Data, and Data Abstraction

In traditional file processing, the structure of data files is embedded in the application programs, so any changes to the structure of a file may require changing all programs that access this file. By contrast, DBMS access programs do not require such changes in most cases. The structure of data files is stored in the DBMS catalog separately from the access programs. We call this property program-data independence. For example, a file access program may be written in such a way that it can access only STUDENT records of the structure shown in Figure 1.3. If we want to add another piece of data to each STUDENT record, say the BirthDate, such a program will no longer work and must be changed.
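The dependence of a file access program on the record layout can be made concrete. The sketch below is an illustration under stated assumptions: the starting positions and lengths are those of Figure 1.3, but the function name, the sample record, the 10-character BirthDate, and its placement directly after Name are all hypothetical.

```python
# (starting position, length) per data item, as in Figure 1.3; positions are 1-based.
STUDENT_LAYOUT = {
    "Name": (1, 30),
    "StudentNumber": (31, 4),
    "Class": (35, 4),
    "Major": (39, 4),
}

# Assumed revised layout: a 10-character BirthDate placed after Name.
NEW_LAYOUT = {
    "Name": (1, 30),
    "BirthDate": (31, 10),
    "StudentNumber": (41, 4),
    "Class": (45, 4),
    "Major": (49, 4),
}


def parse_record(record: str, layout: dict) -> dict:
    """Slice a fixed-format record into data items using the given layout."""
    fields = {}
    for item, (start, length) in layout.items():
        fields[item] = record[start - 1 : start - 1 + length].strip()
    return fields


old_record = "Smith".ljust(30) + "17".ljust(4) + "1".ljust(4) + "CS".ljust(4)
parsed = parse_record(old_record, STUDENT_LAYOUT)

new_record = (
    "Smith".ljust(30) + "1984-01-29" + "17".ljust(4) + "1".ljust(4) + "CS".ljust(4)
)
# A program that still assumes the old positions reads the wrong bytes...
stale = parse_record(new_record, STUDENT_LAYOUT)
# ...whereas the same code works once it is given the updated description.
fresh = parse_record(new_record, NEW_LAYOUT)

print(parsed["StudentNumber"], stale["StudentNumber"], fresh["StudentNumber"])
```

Passing the layout in as data, instead of fixing it in the program text, is a small-scale analogue of keeping the record description in a catalog.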
By contrast, in a DBMS environment, we just need to change the description of STUDENT records in the catalog to reflect the inclusion of the new data item BirthDate; no programs are changed. The next time a DBMS program refers to the catalog, the new structure of STUDENT records will be accessed and used.

Data Item Name    Starting Position in Record    Length in Characters (bytes)
Name              1                              30
StudentNumber     31                             4
Class             35                             4
Major             39                             4

FIGURE 1.3 Internal storage format for a STUDENT record.

In some types of database systems, such as object-oriented and object-relational systems (see Chapters 20 to 22), users can define operations on data as part of the database definitions. An operation (also called a function or method) is specified in two parts. The interface (or signature) of an operation includes the operation name and the data types of its arguments (or parameters). The implementation (or method) of the operation is specified separately and can be changed without affecting the interface. User application programs can operate on the data by invoking these operations through their names and arguments, regardless of how the operations are implemented. This may be termed program-operation independence.

The characteristic that allows program-data independence and program-operation independence is called data abstraction. A DBMS provides users with a conceptual representation of data that does not include many of the details of how the data is stored or how the operations are implemented. Informally, a data model is a type of data abstraction that is used to provide this conceptual representation. The data model uses logical concepts, such as objects, their properties, and their interrelationships, that may be easier for most users to understand than computer storage concepts. Hence, the data model hides storage and implementation details that are not of interest to most database users.

For example, consider again Figure 1.2.
The internal implementation of a file may be defined by its record length, that is, the number of characters (bytes) in each record, and each data item may be specified by its starting byte within a record and its length in bytes. The STUDENT record would thus be represented as shown in Figure 1.3. But a typical database user is not concerned with the location of each data item within a record or its length; rather, the concern is that when a reference is made to Name of STUDENT, the correct value is returned. A conceptual representation of the STUDENT records is shown in Figure 1.2. Many other details of file storage organization, such as the access paths specified on a file, can be hidden from database users by the DBMS; we discuss storage details in Chapters 13 and 14.

In the database approach, the detailed structure and organization of each file are stored in the catalog. Database users and application programs refer to the conceptual representation of the files, and the DBMS extracts the details of file storage from the catalog when these are needed by the DBMS file access modules. Many data models can be used to provide this data abstraction to database users. A major part of this book is devoted to presenting various data models and the concepts they use to abstract the representation of data.

In object-oriented and object-relational databases, the abstraction process includes not only the data structure but also the operations on the data. These operations provide an abstraction of miniworld activities commonly understood by the users. For example, an operation CALCULATE_GPA can be applied to a STUDENT object to calculate the grade point average. Such operations can be invoked by the user queries or application programs without having to know the details of how the operations are implemented. In that sense, an abstraction of the miniworld activity is made available to the user as an abstract operation.

[FIGURE 1.4 Two views derived from the database in Figure 1.2. (a) The STUDENT TRANSCRIPT view, with columns StudentName and, under Student Transcript, CourseNumber, Grade, Semester, Year, and SectionId. (b) The COURSE PREREQUISITES view, with columns CourseName, CourseNumber, and Prerequisites.]

1.3.3 Support of Multiple Views of the Data

A database typically has many users, each of whom may require a different perspective or view of the database. A view may be a subset of the database or it may contain virtual data that is derived from the database files but is not explicitly stored. Some users may not need to be aware of whether the data they refer to is stored or derived. A multiuser DBMS whose users have a variety of distinct applications must provide facilities for defining multiple views. For example, one user of the database of Figure 1.2 may be interested only in accessing and printing the transcript of each student; the view for this user is shown in Figure 1.4a. A second user, who is interested only in checking that students have taken all the prerequisites of each course for which they register, may require the view shown in Figure 1.4b.

1.3.4 Sharing of Data and Multiuser Transaction Processing

A multiuser DBMS, as its name implies, must allow multiple users to access the database at the same time. This is essential if data for multiple applications is to be integrated and maintained in a single database. The DBMS must include concurrency control software to ensure that several users trying to update the same data do so in a controlled manner so that the result of the updates is correct. For example, when several reservation clerks try to assign a seat on an airline flight, the DBMS should ensure that each seat can be accessed by only one clerk at a time for assignment to a passenger.
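The seat-assignment requirement can be imitated in a few lines. This is only a toy sketch, assuming Python threads as the competing clerks and a single threading.Lock standing in for DBMS concurrency control; the seat label, passenger names, and function name are hypothetical, and a real DBMS uses considerably more elaborate techniques.

```python
import threading

seats = {"12A": None}         # seat -> passenger (None means unassigned)
seat_lock = threading.Lock()  # stands in for DBMS concurrency control


def assign_seat(seat: str, passenger: str, results: dict) -> None:
    """Assign the seat only if it is still free; record success or failure."""
    # The check and the assignment run under the lock, so two clerks
    # can never both observe the seat as free.
    with seat_lock:
        if seats[seat] is None:
            seats[seat] = passenger
            results[passenger] = True
        else:
            results[passenger] = False


results = {}
clerks = [
    threading.Thread(target=assign_seat, args=("12A", name, results))
    for name in ("Lee", "Kim", "Roy")
]
for clerk in clerks:
    clerk.start()
for clerk in clerks:
    clerk.join()

winners = [passenger for passenger, ok in results.items() if ok]
print(len(winners), seats["12A"] == winners[0])
```

Which passenger obtains the seat depends on thread scheduling, but exactly one of the three requests succeeds, which is the controlled outcome the text asks for.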
These types of applications are generally called online transaction processing (OLTP) applications. A fundamental role of multiuser DBMS software is to ensure that concurrent transactions operate correctly.

The concept of a transaction has become central to many database applications. A transaction is an executing program or process that includes one or more database accesses, such as reading or updating of database records. Each transaction is supposed to execute a logically correct database access if executed in its entirety without interference from other transactions. The DBMS must enforce several transaction properties. The isolation property ensures that each transaction appears to execute in isolation from other transactions, even though hundreds of transactions may be executing concurrently. The atomicity property ensures that either all the database operations in a transaction are executed or none are. We discuss transactions in detail in Part V of the textbook.

The preceding characteristics are most important in distinguishing a DBMS from traditional file-processing software. In Section 1.6 we discuss additional features that characterize a DBMS. First, however, we categorize the different types of persons who work in a database system environment.

1.4 ACTORS ON THE SCENE

For a small personal database, such as the list of addresses discussed in Section 1.1, one person typically defines, constructs, and manipulates the database, and there is no sharing. However, many persons are involved in the design, use, and maintenance of a large database with hundreds of users. In this section we identify the people whose jobs involve the day-to-day use of a large database; we call them the "actors on the scene." In Section 1.5 we consider people who may be called "workers behind the scene": those who work to maintain the database system environment but who are not actively interested in the database itself.
1.4.1 Database Administrators

In any organization where many persons use the same resources, there is a need for a chief administrator to oversee and manage these resources. In a database environment, the primary resource is the database itself, and the secondary resource is the DBMS and related software. Administering these resources is the responsibility of the database administrator (DBA). The DBA is responsible for authorizing access to the database, for coordinating and monitoring its use, and for acquiring software and hardware resources as needed. The DBA is accountable for problems such as breach of security or poor system response time. In large organizations, the DBA is assisted by a staff that helps carry out these functions.

1.4.2 Database Designers

Database designers are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store this data. These tasks are mostly undertaken before the database is actually implemented and populated with data. It is the responsibility of database designers to communicate with all prospective database users in order to understand their requirements, and to come up with a design that meets these requirements. In many cases, the designers are on the staff of the DBA and may be assigned other staff responsibilities after the database design is completed. Database designers typically interact with each potential group of users and develop views of the database that meet the data and processing requirements of these groups. Each view is then analyzed and integrated with the views of other user groups. The final database design must be capable of supporting the requirements of all user groups.

1.4.3 End Users

End users are the people whose jobs require access to the database for querying, updating, and generating reports; the database primarily exists for their use. There are several categories of end users:

• Casual end users occasionally access the database, but they may need different information each time. They use a sophisticated database query language to specify their requests and are typically middle- or high-level managers or other occasional browsers.
• Naive or parametric end users make up a sizable portion of database end users. Their main job function revolves around constantly querying and updating the database, using standard types of queries and updates, called canned transactions, that have been carefully programmed and tested. The tasks that such users perform are varied: Bank tellers check account balances and post withdrawals and deposits. Reservation clerks for airlines, hotels, and car rental companies check availability for a given request and make reservations. Clerks at receiving stations for courier mail enter package identifications via bar codes and descriptive information through buttons to update a central database of received and in-transit packages.
• Sophisticated end users include engineers, scientists, business analysts, and others who thoroughly familiarize themselves with the facilities of the DBMS so as to implement their applications to meet their complex requirements.
• Stand-alone users maintain personal databases by using ready-made program packages that provide easy-to-use menu-based or graphics-based interfaces. An example is the user of a tax package that stores a variety of personal financial data for tax purposes.

A typical DBMS provides multiple facilities to access a database. Naive end users need to learn very little about the facilities provided by the DBMS; they have to understand only the user interfaces of the standard transactions designed and implemented for their use. Casual users learn only a few facilities that they may use repeatedly. Sophisticated users try to learn most of the DBMS facilities in order to achieve their complex requirements.
Stand-alone users typically become very proficient in using a specific software package.

1.4.4 System Analysts and Application Programmers (Software Engineers)

System analysts determine the requirements of end users, especially naive and parametric end users, and develop specifications for canned transactions that meet these requirements. Application programmers implement these specifications as programs; then they test, debug, document, and maintain these canned transactions. Such analysts and programmers, commonly referred to as software engineers, should be familiar with the full range of capabilities provided by the DBMS to accomplish their tasks.

1.5 WORKERS BEHIND THE SCENE

In addition to those who design, use, and administer a database, others are associated with the design, development, and operation of the DBMS software and system environment. These persons are typically not interested in the database itself. We call them the "workers behind the scene," and they include the following categories.

• DBMS system designers and implementers are persons who design and implement the DBMS modules and interfaces as a software package. A DBMS is a very complex software system that consists of many components, or modules, including modules for implementing the catalog, processing query language, processing the interface, accessing and buffering data, controlling concurrency, and handling data recovery and security. The DBMS must interface with other system software, such as the operating system and compilers for various programming languages.
• Tool developers include persons who design and implement tools, that is, the software packages that facilitate database system design and use and that help improve performance. Tools are optional packages that are often purchased separately. They include packages for database design, performance monitoring, natural language or graphical interfaces, prototyping, simulation, and test data generation. In many cases, independent software vendors develop and market these tools.
• Operators and maintenance personnel are the system administration personnel who are responsible for the actual running and maintenance of the hardware and software environment for the database system.

Although these categories of workers behind the scene are instrumental in making the database system available to end users, they typically do not use the database for their own purposes.

1.6 ADVANTAGES OF USING THE DBMS APPROACH

In this section we discuss some of the advantages of using a DBMS and the capabilities that a good DBMS should possess. These capabilities are in addition to the four main characteristics discussed in Section 1.3. The DBA must utilize these capabilities to accomplish a variety of objectives related to the design, administration, and use of a large multiuser database.

1.6.1 Controlling Redundancy

In traditional software development utilizing file processing, every user group maintains its own files for handling its data-processing applications. For example, consider the UNIVERSITY database example of Section 1.2; here, two groups of users might be the course registration personnel and the accounting office. In the traditional approach, each group independently keeps files on students. The accounting office also keeps data on registration and related billing information, whereas the registration office keeps track of student courses and grades. Much of the data is stored twice: once in the files of each user group. Additional user groups may further duplicate some or all of the same data in their own files.

This redundancy in storing the same data multiple times leads to several problems. First, there is the need to perform a single logical update, such as entering data on a new student, multiple times: once for each file where student data is recorded. This leads to duplication of effort.
Second, storage space is wasted when the same data is stored repeatedly, and this problem may be serious for large databases. Third, files that represent the same data may become inconsistent. This may happen because an update is applied to some of the files but not to others. Even if an update (such as adding a new student) is applied to all the appropriate files, the data concerning the student may still be inconsistent because the updates are applied independently by each user group. For example, one user group may enter a student's birth date erroneously as JAN-19-1984, whereas the other user groups may enter the correct value of JAN-29-1984.

In the database approach, the views of different user groups are integrated during database design. Ideally, we should have a database design that stores each logical data item (such as a student's name or birth date) in only one place in the database. This ensures consistency, and it saves storage space. However, in practice, it is sometimes necessary to use controlled redundancy for improving the performance of queries. For example, we may store StudentName and CourseNumber redundantly in a GRADE_REPORT file (Figure 1.5a) because whenever we retrieve a GRADE_REPORT record, we want to retrieve the student name and course number along with the grade, student number, and section identifier. By placing all the data together, we do not have to search multiple files to collect this data. In such cases, the DBMS should have the capability to control this redundancy so as to prohibit inconsistencies among the files. This may be done by automatically checking that the StudentName-StudentNumber values in any GRADE_REPORT record in Figure 1.5a match one of the Name-StudentNumber values of a STUDENT record (Figure 1.2).
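The consistency check just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary STUDENT and the function check_grade_report are hypothetical stand-ins for the STUDENT file and for a check that a real DBMS would perform internally, and the sample record follows the one shown in Figure 1.5b.

```python
# Hypothetical in-memory stand-in for the STUDENT file: StudentNumber -> Name.
STUDENT = {17: "Smith", 8: "Brown"}


def check_grade_report(record, students=STUDENT):
    """Return True if the redundant StudentName agrees with the STUDENT file."""
    return students.get(record["StudentNumber"]) == record["StudentName"]


consistent = {
    "StudentNumber": 17,
    "StudentName": "Smith",
    "SectionIdentifier": 112,
    "CourseNumber": "MATH2410",
    "Grade": "B",
}
# Same record with the wrong redundant name, as in Figure 1.5b.
inconsistent = dict(consistent, StudentName="Brown")

print(check_grade_report(consistent))    # True
print(check_grade_report(inconsistent))  # False
```

A DBMS that controls redundancy would run a check of this kind on every update of GRADE_REPORT and reject the second record.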
Similarly, the SectionIdentifier-CourseNumber values in GRADE_REPORT can be checked against SECTION records. Such checks can be specified to the DBMS during database design and automatically enforced by the DBMS whenever the GRADE_REPORT file is updated. Figure 1.5b shows a GRADE_REPORT record that is inconsistent with the STUDENT file of Figure 1.2; such a record may be entered erroneously if the redundancy is not controlled.

FIGURE 1.5 Redundant storage of StudentName and CourseNumber in GRADE_REPORT. (a) Consistent data. (b) Inconsistent record. [Figure not reproduced. GRADE_REPORT columns: StudentNumber, StudentName, SectionIdentifier, CourseNumber, Grade. The record in part (b) is StudentNumber 17, StudentName Brown, SectionIdentifier 112, CourseNumber MATH2410, Grade B.]

1.6.2 Restricting Unauthorized Access

When multiple users share a large database, it is likely that most users will not be authorized to access all information in the database. For example, financial data is often considered confidential, and hence only authorized persons are allowed to access such data. In addition, some users may be permitted only to retrieve data, whereas others are allowed both to retrieve and to update. Hence, the type of access operation (retrieval or update) must also be controlled. Typically, users or user groups are given account numbers protected by passwords, which they can use to gain access to the database. A DBMS should provide a security and authorization subsystem, which the DBA uses to create accounts and to specify account restrictions. The DBMS should then enforce these restrictions automatically. Notice that we can apply similar controls to the DBMS software.
For example, only the DBA's staff may be allowed to use certain privileged software, such as the software for creating new accounts. Similarly, parametric users may be allowed to access the database only through the canned transactions developed for their use.

1.6.3 Providing Persistent Storage for Program Objects

Databases can be used to provide persistent storage for program objects and data structures. This is one of the main reasons for object-oriented database systems. Programming languages typically have complex data structures, such as record types in Pascal or class definitions in C++ or Java. The values of program variables are discarded once a program terminates, unless the programmer explicitly stores them in permanent files, which often involves converting these complex structures into a format suitable for file storage. When the need arises to read this data once more, the programmer must convert from the file format to the program variable structure. Object-oriented database systems are compatible with programming languages such as C++ and Java, and the DBMS software automatically performs any necessary conversions. Hence, a complex object in C++ can be stored permanently in an object-oriented DBMS. Such an object is said to be persistent, since it survives the termination of program execution and can later be directly retrieved by another C++ program.

The persistent storage of program objects and data structures is an important function of database systems. Traditional database systems often suffered from the so-called impedance mismatch problem, since the data structures provided by the DBMS were incompatible with the programming language's data structures. Object-oriented database systems typically offer data structure compatibility with one or more object-oriented programming languages.
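To make the conversion burden concrete, the following minimal Python sketch shows what a programmer must write by hand when no persistent object storage is available. The Student class, the two helper functions, and the JSON file format are all assumptions chosen for illustration; an object-oriented DBMS would perform the equivalent conversions automatically.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class Student:
    """Hypothetical program object; the field names are illustrative."""
    name: str
    student_number: int
    grades: dict = field(default_factory=dict)


def save_student(student, path):
    # Program structure -> file format (explicit conversion by the programmer).
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(student), f)


def load_student(path):
    # File format -> program structure (the reverse conversion).
    with open(path, encoding="utf-8") as f:
        return Student(**json.load(f))


path = os.path.join(tempfile.mkdtemp(), "student.json")
original = Student("Smith", 17, {"MATH2410": "B"})
save_student(original, path)
restored = load_student(path)   # could equally run in a later program execution
print(restored == original)     # True
```

Both conversion functions must be kept in step with the class definition whenever it changes, which is one practical face of the impedance mismatch discussed above.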
1.6.4 Providing Storage Structures for Efficient Query Processing

Database systems must provide capabilities for efficiently executing queries and updates. Because the database is typically stored on disk, the DBMS must provide specialized data structures to speed up disk search for the desired records. Auxiliary files called indexes are used for this purpose. Indexes are typically based on tree data structures or hash data structures, suitably modified for disk search. In order to process the database records needed by a particular query, those records must be copied from disk to memory. Hence, the DBMS often has a buffering module that maintains parts of the database in main memory buffers. In other cases, the DBMS may use the operating system to do the buffering of disk data.

The query processing and optimization module of the DBMS is responsible for choosing an efficient query execution plan for each query based on the existing storage structures. The choice of which indexes to create and maintain is part of physical database design and tuning, which is one of the responsibilities of the DBA staff.

1.6.5 Providing Backup and Recovery

A DBMS must provide facilities for recovering from hardware or software failures. The backup and recovery subsystem of the DBMS is responsible for recovery. For example, if the computer system fails in the middle of a complex update transaction, the recovery subsystem is responsible for making sure that the database is restored to the state it was in before the transaction started executing. Alternatively, the recovery subsystem could ensure that the transaction is resumed from the point at which it was interrupted so that its full effect is recorded in the database.

1.6.6 Providing Multiple User Interfaces

Because many types of users with varying levels of technical knowledge use a database, a DBMS should provide a variety of user interfaces.
These include query languages for casual users, programming language interfaces for application programmers, forms and command codes for parametric users, and menu-driven interfaces and natural language interfaces for stand-alone users. Both forms-style interfaces and menu-driven interfaces are commonly known as graphical user interfaces (GUIs). Many specialized languages and environments exist for specifying GUIs. Capabilities for providing Web GUI interfaces to a database (or Web-enabling a database) are also quite common.

1.6.7 Representing Complex Relationships among Data

A database may include numerous varieties of data that are interrelated in many ways. Consider the example shown in Figure 1.2. The record for Brown in the STUDENT file is related to four records in the GRADE_REPORT file. Similarly, each section record is related to one course record as well as to a number of GRADE_REPORT records, one for each student who completed that section. A DBMS must have the capability to represent a variety of complex relationships among the data as well as to retrieve and update related data easily and efficiently.

1.6.8 Enforcing Integrity Constraints

Most database applications have certain integrity constraints that must hold for the data. A DBMS should provide capabilities for defining and enforcing these constraints. The simplest type of integrity constraint involves specifying a data type for each data item. For example, in Figure 1.2, we may specify that the value of the Class data item within each STUDENT record must be an integer between 1 and 5 and that the value of Name must be a string of no more than 30 alphabetic characters. A more complex type of constraint that frequently occurs involves specifying that a record in one file must be related to records in other files. For example, in Figure 1.2, we can specify that "every section record must be related to a course record."
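The two kinds of constraints described so far can be sketched with Python's standard sqlite3 module. This is a hedged illustration, not a prescribed design: the table layouts are simplified from Figure 1.2, the helper try_insert is hypothetical, and the CHECK on Name limits only the length (it does not test that the characters are alphabetic).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE COURSE (
    CourseNumber TEXT PRIMARY KEY,
    CourseName   TEXT NOT NULL
);
CREATE TABLE STUDENT (
    StudentNumber INTEGER PRIMARY KEY,
    Name  TEXT    NOT NULL CHECK (length(Name) <= 30),
    Class INTEGER NOT NULL CHECK (Class BETWEEN 1 AND 5)
);
CREATE TABLE SECTION (
    SectionIdentifier INTEGER PRIMARY KEY,
    CourseNumber TEXT NOT NULL REFERENCES COURSE (CourseNumber)
);
""")


def try_insert(sql, params):
    """Attempt an insert; return False if a declared constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("INSERT INTO STUDENT VALUES (?, ?, ?)", (17, "Smith", 1)))  # True
print(try_insert("INSERT INTO STUDENT VALUES (?, ?, ?)", (8, "Brown", 9)))   # False
print(try_insert("INSERT INTO SECTION VALUES (?, ?)", (85, "MATH2410")))     # False
```

The third insert fails because no COURSE record with number MATH2410 exists yet; once that course is inserted, the same section record is accepted.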
Another type of constraint specifies uniqueness on data item values, such as "every course record must have a unique value for CourseNumber." These constraints are derived from the meaning or semantics of the data and of the miniworld it represents. It is the database designers' responsibility to identify integrity constraints during database design. Some constraints can be specified to the DBMS and automatically enforced. Other constraints may have to be checked by update programs or at the time of data entry.

A data item may be entered erroneously and still satisfy the specified integrity constraints. For example, if a student receives a grade of A but a grade of C is entered in the database, the DBMS cannot discover this error automatically, because C is a valid value for the Grade data type. Such data entry errors can only be discovered manually (when the student receives the grade and complains) and corrected later by updating the database. However, a grade of Z can be rejected automatically by the DBMS, because Z is not a valid value for the Grade data type.

1.6.9 Permitting Inferencing and Actions Using Rules

Some database systems provide capabilities for defining deduction rules for inferencing new information from the stored database facts. Such systems are called deductive database systems. For example, there may be complex rules in the miniworld application for determining when a student is on probation. These can be specified declaratively as rules, which when compiled and maintained by the DBMS can determine all students on probation. In a traditional DBMS, explicit procedural program code would have to be written to support such applications. But if the miniworld rules change, it is generally more convenient to change the declared deduction rules than to recode procedural programs.
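The difference between a declared rule and procedural code can be sketched as follows. The example uses a SQL view in Python's sqlite3 module as a rough stand-in for a deduction rule; deductive database systems proper use dedicated rule languages such as Datalog. The probation rule itself (any grade below B) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GRADE_REPORT (
    StudentNumber     INTEGER,
    SectionIdentifier INTEGER,
    Grade             TEXT
);
-- Illustrative sample rows, not taken from the text.
INSERT INTO GRADE_REPORT VALUES
    (17, 112, 'B'), (17, 119, 'C'), (8, 85, 'A'), (8, 92, 'A');

-- Hypothetical rule: a student is on probation if any grade is below B.
CREATE VIEW ON_PROBATION AS
    SELECT DISTINCT StudentNumber
    FROM GRADE_REPORT
    WHERE Grade IN ('C', 'D', 'F');
""")


def students_on_probation():
    """Ask the DBMS which students satisfy the declared rule."""
    rows = conn.execute(
        "SELECT StudentNumber FROM ON_PROBATION ORDER BY StudentNumber"
    )
    return [row[0] for row in rows]


print(students_on_probation())  # [17]
```

If the probation policy changes, only the view definition is redeclared; students_on_probation and any other program that queries the view stay as they are.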
More powerful functionality is provided by active database systems, which provide active rules that can automatically initiate actions when certain events and conditions occur.

1.6.10 Additional Implications of Using the Database Approach

This section discusses some additional implications of using the database approach that can benefit most organizations.

Potential for Enforcing Standards. The database approach permits the DBA to define and enforce standards among database users in a large organization. This facilitates communication and cooperation among various departments, projects, and users within the organization. Standards can be defined for names and formats of data elements, display formats, report structures, terminology, and so on. The DBA can enforce standards in a centralized database environment more easily than in an environment where each user group has control of its own files and software.

Reduced Application Development Time. A prime selling feature of the database approach is that developing a new application (such as the retrieval of certain data from the database for printing a new report) takes very little time. Designing and implementing a new database from scratch may take more time than writing a single specialized file application. However, once a database is up and running, substantially less time is generally required to create new applications using DBMS facilities. Development time using a DBMS is estimated to be one-sixth to one-fourth of that for a traditional file system.

Flexibility. It may be necessary to change the structure of a database as requirements change. For example, a new user group may emerge that needs information not currently in the database. In response, it may be necessary to add a file to the database or to extend the data elements in an existing file. Modern DBMSs allow certain types of evolutionary changes to the structure of the database without affecting the stored data and the existing application programs.
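One such evolutionary change, extending the data elements of an existing file, can be sketched with Python's sqlite3 module. The Email column, the sample row, and the function student_names (standing in for an existing application program) are hypothetical and serve only to illustrate the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT ("
    "StudentNumber INTEGER PRIMARY KEY, Name TEXT, Class INTEGER, Major TEXT)"
)
conn.execute("INSERT INTO STUDENT VALUES (17, 'Smith', 1, 'CS')")


def student_names():
    """An 'existing application' that names only the columns it needs."""
    rows = conn.execute("SELECT Name FROM STUDENT ORDER BY Name")
    return [row[0] for row in rows]


before = student_names()
# A new user group needs an additional data element (hypothetical column).
conn.execute("ALTER TABLE STUDENT ADD COLUMN Email TEXT")
after = student_names()

print(before == after)  # True: stored data and the old program are unaffected
```

The existing program keeps working because it refers to data items by name rather than by their physical position in the record.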
Availability of Up-to-Date Information. A DBMS makes the database available to all users. As soon as one user's update is applied to the database, all other users can immediately see this update. This availability of up-to-date information is essential for many transaction-processing applications, such as reservation systems or banking databases, and it is made possible by the concurrency control and recovery subsystems of a DBMS.

Economies of Scale. The DBMS approach permits consolidation of data and applications, thus reducing the amount of wasteful overlap between activities of data-processing personnel in different projects or departments. This enables the whole organization to invest in more powerful processors, storage devices, or communication gear, rather than having each department purchase its own (weaker) equipment. This reduces overall costs of operation and management.

1.7 A BRIEF HISTORY OF DATABASE APPLICATIONS

We now give a brief historical overview of the applications that use DBMSs, and how these applications provided the impetus for new types of database systems.

1.7.1 Early Database Applications Using Hierarchical and Network Systems

Many early database applications maintained records in large organizations, such as corporations, universities, hospitals, and banks. In many of these applications, there were large numbers of records of similar structure. For example, in a university application, similar information would be kept for each student, each course, each grade record, and so on. There were also many types of records and many interrelationships among them.

One of the main problems with early database systems was the intermixing of conceptual relationships with the physical storage and placement of records on disk. For example, the grade records of a particular student could be physically stored next to the student record.
Although this provided very efficient access for the original queries and transactions that the database was designed to handle, it did not provide enough flexibility to access records efficiently when new queries and transactions were identified. In particular, new queries that required a different storage organization for efficient processing were quite difficult to implement efficiently. It was also quite difficult to reorganize the database when changes were made to the requirements of the application.

Another shortcoming of early systems was that they provided only programming language interfaces. This made it time-consuming and expensive to implement new queries and transactions, since new programs had to be written, tested, and debugged. Most of these database systems were implemented on large and expensive mainframe computers starting in the mid-1960s and through the 1970s and 1980s. The main types of early systems were based on three main paradigms: hierarchical systems, network model based systems, and inverted file systems.

1.7.2 Providing Application Flexibility with Relational Databases

Relational databases were originally proposed to separate the physical storage of data from its conceptual representation and to provide a mathematical foundation for databases. The relational data model also introduced high-level query languages that provided an alternative to programming language interfaces; hence, it was a lot quicker to write new queries. Relational representation of data somewhat resembles the example we presented in Figure 1.2. Relational systems were initially targeted to the same applications as earlier systems, but were meant to provide flexibility to quickly develop new queries and to reorganize the database as requirements changed.
Early experimental relational systems developed in the late 1970s and the commercial RDBMSs (relational database management systems) introduced in the early 1980s were quite slow, since they did not use physical storage pointers or record placement to access related data records. With the development of new storage and indexing techniques and better query processing and optimization, their performance improved. Eventually, relational databases became the dominant type of database systems for traditional database applications. Relational databases now exist on almost all types of computers, from small personal computers to large servers.

1.7.3 Object-Oriented Applications and the Need for More Complex Databases

The emergence of object-oriented programming languages in the 1980s and the need to store and share complex-structured objects led to the development of object-oriented databases. Initially, they were considered a competitor to relational databases, since they provided more general data structures. They also incorporated many of the useful object-oriented paradigms, such as abstract data types, encapsulation of operations, inheritance, and object identity. However, the complexity of the model and the lack of an early standard contributed to their limited use. They are now mainly used in specialized applications, such as engineering design, multimedia publishing, and manufacturing systems.

1.7.4 Interchanging Data on the Web for E-Commerce

The World Wide Web provided a large network of interconnected computers. Users can create documents using a Web publishing language, such as HTML (HyperText Markup Language), and store these documents on Web servers where other users (clients) can access them. Documents can be linked together through hyperlinks, which are pointers to other documents. In the 1990s, electronic commerce (e-commerce) emerged as a major application on the Web.
It quickly became apparent that parts of the information on e-commerce Web pages were often dynamically extracted data from DBMSs. A variety of techniques were developed to allow the interchange of data on the Web. Currently, XML (Extensible Markup Language) is considered to be the primary standard for interchanging data among various types of databases and Web pages. XML combines concepts from the models used in document systems with database modeling concepts.

1.7.5 Extending Database Capabilities for New Applications

The success of database systems in traditional applications encouraged developers of other types of applications to attempt to use them. Such applications traditionally used their own specialized file and data structures. The following are examples of these applications:

• Scientific applications that store large amounts of data resulting from scientific experiments in areas such as high-energy physics or the mapping of the human genome.
• Storage and retrieval of images, from scanned news or personal photographs to satellite photograph images and images from medical procedures such as X-rays or MRI (magnetic resonance imaging).
• Storage and retrieval of videos, such as movies, or video clips from news or personal digital cameras.
• Data mining applications that analyze large amounts of data searching for the occurrences of specific patterns or relationships.
• Spatial applications that store spatial locations of data such as weather information or maps used in geographical information systems.
• Time series applications that store information such as economic data at regular points in time, for example, daily sales or monthly gross national product figures.
It was quickly apparent that basic relational systems were not very suitable for many of these applications, usually for one or more of the following reasons:

• More complex data structures were needed for modeling the application than the simple relational representation.
• New data types were needed in addition to the basic numeric and character string types.
• New operations and query language constructs were necessary to manipulate the new data types.
• New storage and indexing structures were needed.

This led DBMS developers to add functionality to their systems. Some functionality was general purpose, such as incorporating concepts from object-oriented databases into relational systems. Other functionality was special purpose, in the form of optional modules that could be used for specific applications. For example, users could buy a time series module to use with their relational DBMS for their time series application.

1.8 WHEN NOT TO USE A DBMS

In spite of the advantages of using a DBMS, there are a few situations in which such a system may involve unnecessary overhead costs that would not be incurred in traditional file processing. The overhead costs of using a DBMS are due to the following:

• High initial investment in hardware, software, and training
• The generality that a DBMS provides for defining and processing data
• Overhead for providing security, concurrency control, recovery, and integrity functions

Additional problems may arise if the database designers and DBA do not properly design the database or if the database systems applications are not implemented properly. Hence, it may be more desirable to use regular files under the following circumstances:

• The database and applications are simple, well defined, and not expected to change.
• There are stringent real-time requirements for some programs that may not be met because of DBMS overhead.
• Multiple-user access to data is not required.
1.9 SUMMARY

In this chapter we defined a database as a collection of related data, where data means recorded facts. A typical database represents some aspect of the real world and is used for specific purposes by one or more groups of users. A DBMS is a generalized software package for implementing and maintaining a computerized database. The database and software together form a database system. We identified several characteristics that distinguish the database approach from traditional file-processing applications. We then discussed the main categories of database users, or the "actors on the scene." We noted that, in addition to database users, there are several categories of support personnel, or "workers behind the scene," in a database environment.

We then presented a list of capabilities that should be provided by the DBMS software to the DBA, database designers, and users to help them design, administer, and use a database. Following this, we gave a brief historical perspective on the evolution of database applications. Finally, we discussed the overhead costs of using a DBMS and discussed some situations in which it may not be advantageous to use a DBMS.

Review Questions

1.1. Define the following terms: data, database, DBMS, database system, database catalog, program-data independence, user view, DBA, end user, canned transaction, deductive database system, persistent object, meta-data, transaction-processing application.
1.2. What three main types of actions involve databases? Briefly discuss each.
1.3. Discuss the main characteristics of the database approach and how it differs from traditional file systems.
1.4. What are the responsibilities of the DBA and the database designers?
1.5. What are the different types of database end users? Discuss the main activities of each.
1.6. Discuss the capabilities that should be provided by a DBMS.

Exercises

1.7.
Identify some informal queries and update operations that you would expect to apply to the database shown in Figure 1.2.
1.8. What is the difference between controlled and uncontrolled redundancy? Illustrate with examples.
1.9. Name all the relationships among the records of the database shown in Figure 1.2.
1.10. Give some additional views that may be needed by other user groups for the database shown in Figure 1.2.
1.11. Cite some examples of integrity constraints that you think should hold on the database shown in Figure 1.2.

Selected Bibliography

The October 1991 issue of Communications of the ACM and Kim (1995) include several articles describing next-generation DBMSs; many of the database features discussed in the former are now commercially available. The March 1976 issue of ACM Computing Surveys offers an early introduction to database systems and may provide a historical perspective for the interested reader.

Chapter 2 Database System Concepts and Architecture

The architecture of DBMS packages has evolved from the early monolithic systems, where the whole DBMS software package was one tightly integrated system, to the modern DBMS packages that are modular in design, with a client/server system architecture. This evolution mirrors the trends in computing, where large centralized mainframe computers are being replaced by hundreds of distributed workstations and personal computers connected via communications networks to various types of server machines: Web servers, database servers, file servers, application servers, and so on.

In a basic client/server DBMS architecture, the system functionality is distributed between two types of modules.[1] A client module is typically designed so that it will run on a user workstation or personal computer. Typically, application programs and user interfaces that access the database run in the client module. Hence, the client module handles user interaction and provides the user-friendly interfaces such as forms- or menu-based GUIs (Graphical User Interfaces). The other kind of module, called a server module, typically handles data storage, access, search, and other functions. We discuss client/server architectures in more detail in Section 2.5. First, we must study more basic concepts that will give us a better understanding of modern database architectures.

[1] As we shall see in Section 2.5, there are variations on this simple two-tier client/server architecture.

In this chapter we present the terminology and basic concepts that will be used throughout the book. We start, in Section 2.1, by discussing data models and defining the concepts of schemas and instances, which are fundamental to the study of database systems. We then discuss the three-schema DBMS architecture and data independence in Section 2.2; this provides a user's perspective on what a DBMS is supposed to do. In Section 2.3, we describe the types of interfaces and languages that are typically provided by a DBMS. Section 2.4 discusses the database system software environment. Section 2.5 gives an overview of various types of client/server architectures. Finally, Section 2.6 presents a classification of the types of DBMS packages. Section 2.7 summarizes the chapter.

The material in Sections 2.4 through 2.6 provides more detailed concepts that may be looked upon as a supplement to the basic introductory material.

2.1 DATA MODELS, SCHEMAS, AND INSTANCES

One fundamental characteristic of the database approach is that it provides some level of data abstraction by hiding details of data storage that are not needed by most database users. A data model (a collection of concepts that can be used to describe the structure of a database) provides the necessary means to achieve this abstraction.[2] By structure of a database, we mean the data types, relationships, and constraints that should hold for the data.
Most data models also include a set of basic operations for specifying retrievals and updates on the database.

In addition to the basic operations provided by the data model, it is becoming more common to include concepts in the data model to specify the dynamic aspect or behavior of a database application. This allows the database designer to specify a set of valid user-defined operations that are allowed on the database objects.[3] An example of a user-defined operation could be COMPUTE_GPA, which can be applied to a STUDENT object. On the other hand, generic operations to insert, delete, modify, or retrieve any kind of object are often included in the basic data model operations. Concepts to specify behavior are fundamental to object-oriented data models (see Chapters 20 and 21) but are also being incorporated in more traditional data models. For example, object-relational models (see Chapter 22) extend the traditional relational model to include such concepts, among others.

[2] Sometimes the word model is used to denote a specific database description, or schema, for example, "the marketing data model." We will not use this interpretation.

[3] The inclusion of concepts to describe behavior reflects a trend whereby database design and software design activities are increasingly being combined into a single activity. Traditionally, specifying behavior is associated with software design.

2.1.1 Categories of Data Models

Many data models have been proposed, which we can categorize according to the types of concepts they use to describe the database structure. High-level or conceptual data models provide concepts that are close to the way many users perceive data, whereas low-level or physical data models provide concepts that describe the details of how data is stored in the computer. Concepts provided by low-level data models are generally meant for computer specialists, not for typical end users.
Between these two extremes is a class of representational (or implementation) data models, which provide concepts that may be understood by end users but that are not too far removed from the way data is organized within the computer. Representational data models hide some details of data storage but can be implemented on a computer system in a direct way.

Conceptual data models use concepts such as entities, attributes, and relationships. An entity represents a real-world object or concept, such as an employee or a project, that is described in the database. An attribute represents some property of interest that further describes an entity, such as the employee's name or salary. A relationship among two or more entities represents an association among them, for example, a works-on relationship between an employee and a project. Chapter 3 presents the entity-relationship model, a popular high-level conceptual data model. Chapter 4 describes additional conceptual data modeling concepts, such as generalization, specialization, and categories.

Representational or implementation data models are the models used most frequently in traditional commercial DBMSs. These include the widely used relational data model, as well as the so-called legacy data models (the network and hierarchical models) that have been widely used in the past. Part II of this book is devoted to the relational data model, its operations and languages, and some of the techniques for programming relational database applications.[4] The SQL standard for relational databases is described in Chapters 8 and 9. Representational data models represent data by using record structures and hence are sometimes called record-based data models.

We can regard object data models as a new family of higher-level implementation data models that are closer to conceptual data models. We describe the general characteristics of object databases and the ODMG proposed standard in Chapters 20 and 21.
Object data models are also frequently utilized as high-level conceptual models, particularly in the software engineering domain.

Physical data models describe how data is stored as files in the computer by representing information such as record formats, record orderings, and access paths. An access path is a structure that makes the search for particular database records efficient. We discuss physical storage techniques and access structures in Chapters 13 and 14.

4. A summary of the network and hierarchical data models is included in Appendices E and F. The full chapters from the second edition of this book are accessible from the Web site.

2.1.2 Schemas, Instances, and Database State

In any data model, it is important to distinguish between the description of the database and the database itself. The description of a database is called the database schema, which is specified during database design and is not expected to change frequently.5 Most data models have certain conventions for displaying schemas as diagrams.6 A displayed schema is called a schema diagram. Figure 2.1 shows a schema diagram for the database shown in Figure 1.2; the diagram displays the structure of each record type but not the actual instances of records. We call each object in the schema, such as STUDENT or COURSE, a schema construct.

5. Schema changes are usually needed as the requirements of the database applications change. Newer database systems include operations for allowing schema changes, although the schema change process is more involved than simple database updates.

A schema diagram displays only some aspects of a schema, such as the names of record types and data items, and some types of constraints. Other aspects are not specified in the schema diagram; for example, Figure 2.1 shows neither the data type of each data item nor the relationships among the various files.
Many types of constraints are not represented in schema diagrams. A constraint such as "students majoring in computer science must take CS1310 before the end of their sophomore year" is quite difficult to represent.

The actual data in a database may change quite frequently. For example, the database shown in Figure 1.2 changes every time we add a student or enter a new grade for a student. The data in the database at a particular moment in time is called a database state or snapshot. It is also called the current set of occurrences or instances in the database. In a given database state, each schema construct has its own current set of instances; for example, the STUDENT construct will contain the set of individual student entities (records) as its instances. Many database states can be constructed to correspond to a particular database schema. Every time we insert or delete a record or change the value of a data item in a record, we change one state of the database into another state.

STUDENT: Name | StudentNumber | Class | Major
COURSE: CourseName | CourseNumber | CreditHours | Department
PREREQUISITE: CourseNumber | PrerequisiteNumber
SECTION: SectionIdentifier | CourseNumber | Semester | Year | Instructor
GRADE_REPORT: StudentNumber | SectionIdentifier | Grade
FIGURE 2.1 Schema diagram for the database in Figure 1.2.

6. It is customary in database parlance to use schemas as the plural for schema, even though schemata is the proper plural form. The word scheme is sometimes used for a schema.

The distinction between database schema and database state is very important. When we define a new database, we specify its database schema only to the DBMS. At this point, the corresponding database state is the empty state with no data. We get the initial state of the database when the database is first populated or loaded with the initial data.
From then on, every time an update operation is applied to the database, we get another database state. At any point in time, the database has a current state.7 The DBMS is partly responsible for ensuring that every state of the database is a valid state, that is, a state that satisfies the structure and constraints specified in the schema. Hence, specifying a correct schema to the DBMS is extremely important, and the schema must be designed with the utmost care. The DBMS stores the descriptions of the schema constructs and constraints (also called the meta-data) in the DBMS catalog so that DBMS software can refer to the schema whenever it needs to. The schema is sometimes called the intension, and a database state an extension of the schema.

Although, as mentioned earlier, the schema is not supposed to change frequently, it is not uncommon that changes need to be occasionally applied to the schema as the application requirements change. For example, we may decide that another data item needs to be stored for each record in a file, such as adding the DateOfBirth to the STUDENT schema in Figure 2.1. This is known as schema evolution. Most modern DBMSs include some operations for schema evolution that can be applied while the database is operational.

2.2 THREE-SCHEMA ARCHITECTURE AND DATA INDEPENDENCE

Three of the four important characteristics of the database approach, listed in Section 1.3, are (1) insulation of programs and data (program-data and program-operation independence), (2) support of multiple user views, and (3) use of a catalog to store the database description (schema). In this section we specify an architecture for database systems, called the three-schema architecture,8 that was proposed to help achieve and visualize these characteristics. We then further discuss the concept of data independence.
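The catalog named in characteristic (3), together with the schema/state distinction and the schema evolution example from Section 2.1.2, can be made concrete with a small sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in DBMS; the column types, the sample rows, and the helper names `columns` and `state` are illustrative assumptions, not taken from Figure 1.2 or Figure 2.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Defining the database: only the schema (the description) is given to the DBMS.
conn.execute(
    "CREATE TABLE STUDENT (Name TEXT, StudentNumber INTEGER PRIMARY KEY, "
    "Class INTEGER, Major TEXT)"
)

def columns(table):
    """Read the column names back from SQLite's catalog (the stored meta-data)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

def state(table):
    """Return the current database state of one schema construct."""
    return conn.execute(f"SELECT * FROM {table} ORDER BY StudentNumber").fetchall()

schema_before = columns("STUDENT")
empty_state = state("STUDENT")  # no data yet: the empty state

# Populating the database changes the state but not the schema.
conn.execute("INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS')")  # illustrative row
conn.execute("INSERT INTO STUDENT VALUES ('Brown', 8, 2, 'CS')")   # illustrative row
populated_state = state("STUDENT")

# Schema evolution: add DateOfBirth while the database already holds data.
conn.execute("ALTER TABLE STUDENT ADD COLUMN DateOfBirth TEXT")
schema_after = columns("STUDENT")
evolved_state = state("STUDENT")  # existing records gain an empty DateOfBirth
```

In this sketch the schema is read from the catalog and the state from the table itself, so the two can be inspected independently. Existing records carry a null value for the new data item after the schema change.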
2.2.1 The Three-Schema Architecture

The goal of the three-schema architecture, illustrated in Figure 2.2, is to separate the user applications and the physical database. In this architecture, schemas can be defined at the following three levels:

1. The internal level has an internal schema, which describes the physical storage structure of the database. The internal schema uses a physical data model and describes the complete details of data storage and access paths for the database.
2. The conceptual level has a conceptual schema, which describes the structure of the whole database for a community of users. The conceptual schema hides the details of physical storage structures and concentrates on describing entities, data types, relationships, user operations, and constraints. Usually, a representational data model is used to describe the conceptual schema when a database system is implemented. This implementation conceptual schema is often based on a conceptual schema design in a high-level data model.
3. The external or view level includes a number of external schemas or user views. Each external schema describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group. As in the previous case, each external schema is typically implemented using a representational data model, possibly based on an external schema design in a high-level data model.

FIGURE 2.2 The three-schema architecture. [Diagram: end users access external views at the external level; an external/conceptual mapping connects these to the conceptual level; a conceptual/internal mapping connects the conceptual level to the internal schema at the internal level, which describes the stored database.]

7. The current state is also called the current snapshot of the database.
8. This is also known as the ANSI/SPARC architecture, after the committee that proposed it (Tsichritzis and Klug 1978).
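As a rough illustration of the three levels, the sketch below again assumes SQLite via Python's sqlite3 module as a stand-in DBMS. Base tables play the role of the conceptual schema, a view plays the role of one external schema, and an index stands in for a single internal-level access path. The view name TRANSCRIPT, the index name, the column types, and the sample rows are all hypothetical, and a real internal schema would describe far more than one index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: the structure of the whole database.
    CREATE TABLE STUDENT (
        Name TEXT, StudentNumber INTEGER PRIMARY KEY, Class INTEGER, Major TEXT);
    CREATE TABLE GRADE_REPORT (
        StudentNumber INTEGER, SectionIdentifier INTEGER, Grade TEXT);

    -- External level: one user group sees only transcript information;
    -- Class and Major are hidden from it.
    CREATE VIEW TRANSCRIPT AS
        SELECT s.Name AS StudentName,
               g.SectionIdentifier AS SectionIdentifier,
               g.Grade AS Grade
        FROM STUDENT s
        JOIN GRADE_REPORT g ON g.StudentNumber = s.StudentNumber;

    -- Internal level (one small piece of it): an access path on stored records.
    CREATE INDEX idx_grade_student ON GRADE_REPORT (StudentNumber);
""")

conn.execute("INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS')")  # illustrative row
conn.execute("INSERT INTO GRADE_REPORT VALUES (17, 112, 'B')")     # illustrative row

# A request posed against the external schema; the DBMS maps it to the base tables.
cursor = conn.execute("SELECT * FROM TRANSCRIPT")
view_columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```

The user group working with TRANSCRIPT never names STUDENT or GRADE_REPORT. Only the view definition, which acts as the external/conceptual mapping here, refers to them.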
The three-schema architecture is a convenient tool with which the user can visualize the schema levels in a database system. Most DBMSs do not separate the three levels completely, but support the three-schema architecture to some extent. Some DBMSs may include physical-level details in the conceptual schema. In most DBMSs that support user views, external schemas are specified in the same data model that describes the conceptual-level information. Some DBMSs allow different data models to be used at the conceptual and external levels.

Notice that the three schemas are only descriptions of data; the only data that actually exists is at the physical level. In a DBMS based on the three-schema architecture, each user group refers only to its own external schema. Hence, the DBMS must transform a request specified on an external schema into a request against the conceptual schema, and then into a request on the internal schema for processing over the stored database. If the request is a database retrieval, the data extracted from the stored database must be reformatted to match the user's external view. The processes of transforming requests and results between levels are called mappings. These mappings may be time-consuming, so some DBMSs (especially those that are meant to support small databases) do not support external views. Even in such systems, however, a certain amount of mapping is necessary to transform requests between the conceptual and internal levels.

2.2.2 Data Independence

The three-schema architecture can be used to further explain the concept of data independence, which can be defined as the capacity to change the schema at one level of a database system without having to change the schema at the next higher level. We can define two types of data independence:

1. Logical data independence is the capacity to change the conceptual schema without having to change external schemas or application programs. We may change the conceptual schema to expand the database (by adding a record type or data item), to change constraints, or to reduce the database (by removing a record type or data item). In the last case, external schemas that refer only to the remaining data should not be affected. For example, the external schema of Figure 1.4a should not be affected by changing the GRADE_REPORT file shown in Figure 1.2 into the one shown in Figure 1.5a. Only the view definition and the mappings need be changed in a DBMS that supports logical data independence. After the conceptual schema undergoes a logical reorganization, application programs that reference the external schema constructs must work as before. Changes to constraints can be applied to the conceptual schema without affecting the external schemas or application programs.
2. Physical data independence is the capacity to change the internal schema without having to change the conceptual schema. Hence, the external schemas need not be changed as well. Changes to the internal schema may be needed because some physical files had to be reorganized (for example, by creating additional access structures) to improve the performance of retrieval or update. If the same data as before remains in the database, we should not have to change the conceptual schema. For example, providing an access path to improve retrieval speed of SECTION records (Figure 1.2) by Semester and Year should not require a query such as "list all sections offered in fall 1998" to be changed, although the query would be executed more efficiently by the DBMS by utilizing the new access path.

Whenever we have a multiple-level DBMS, its catalog must be expanded to include information on how to map requests and data among the various levels.
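The SECTION example above can be tried out in a few lines. This is a minimal sketch that assumes SQLite via Python's sqlite3 module as the DBMS. The sample rows, the index name `idx_section_term`, and the added `Room` column are hypothetical. The same query text is run before and after an internal-level change (a new access path) and again after an expansion of the conceptual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SECTION (SectionIdentifier INTEGER PRIMARY KEY, "
    "CourseNumber TEXT, Semester TEXT, Year INTEGER, Instructor TEXT)"
)
conn.executemany(
    "INSERT INTO SECTION VALUES (?, ?, ?, ?, ?)",
    [  # illustrative rows
        (85, "MATH2410", "Fall", 1998, "King"),
        (92, "CS1310", "Fall", 1998, "Anderson"),
        (102, "CS3320", "Spring", 1999, "Knuth"),
    ],
)

# "List all sections offered in fall 1998": this text never changes below.
QUERY = (
    "SELECT SectionIdentifier FROM SECTION "
    "WHERE Semester = 'Fall' AND Year = 1998 ORDER BY SectionIdentifier"
)

def run(sql):
    return [row[0] for row in conn.execute(sql)]

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

before = run(QUERY)
plan_before = plan(QUERY)

# Internal-level change: provide an access path by Semester and Year.
conn.execute("CREATE INDEX idx_section_term ON SECTION (Semester, Year)")
after_index = run(QUERY)
plan_after = plan(QUERY)  # may now mention the index; the wording varies by version

# Conceptual-level expansion: add a data item that the query does not refer to.
conn.execute("ALTER TABLE SECTION ADD COLUMN Room TEXT")
after_expansion = run(QUERY)
```

Comparing `plan_before` with `plan_after` shows how the DBMS chose to execute the query in each case, while `before`, `after_index`, and `after_expansion` hold the same answer. The query text itself never had to be edited.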
The DBMS uses additional software to accomplish these mappings by referring to the mapping information in the catalog. Data independence occurs because when the schema is changed at some level, the schema at the next higher level remains unchanged; only the mapping between the two levels is changed. Hence, application programs referring to the higher-level schema need not be changed.

The three-schema architecture can make it easier to achieve true data independence, both physical and logical. However, the two levels of mappings create an overhead during compilation or execution of a query or program, leading to inefficiencies in the DBMS. Because of this, few DBMSs have implemented the full three-schema architecture.

2.3 DATABASE LANGUAGES AND INTERFACES

In Section 1.4 we discussed the variety of users supported by a DBMS. The DBMS must provide appropriate languages and interfaces for each category of users. In this section we discuss the types of languages and interfaces provided by a DBMS and the user categories targeted by each interface.

2.3.1 DBMS Languages

Once the design of a database is completed and a DBMS is chosen to implement the database, the first order of the day is to specify conceptual and internal schemas for the database and any mappings between the two. In many DBMSs where no strict separation of levels is maintained, one language, called the data definition language (DDL), is used by the DBA and by database designers to define both schemas. The DBMS will have a DDL compiler whose function is to process DDL statements in order to identify descriptions of the schema constructs and to store the schema description in the DBMS catalog.

In DBMSs where a clear separation is maintained between the conceptual and internal levels, the DDL is used to specify the conceptual schema only. Another language, the storage definition language (SDL), is used to specify the internal schema.
The mappings between the two schemas may be specified in either one of these languages. For a true three-schema architecture, we would need a third language, the view definition language (VDL), to specify user views and their mappings to the conceptual schema, but in most DBMSs the DDL is used to define both conceptual and external schemas.

Once the database schemas are compiled and the database is populated with data, users must have some means to manipulate the database. Typical manipulations include retrieval, insertion, deletion, and modification of the data. The DBMS provides a set of operations or a language called the data manipulation language (DML) for these purposes.

In current DBMSs, the preceding types of languages are usually not considered distinct languages; rather, a comprehensive integrated language is used that includes constructs for conceptual schema definition, view definition, and data manipulation. Storage definition is typically kept separate, since it is used for defining physical storage structures to fine-tune the performance of the database system, which is usually done by the DBA staff. A typical example of a comprehensive database language is the SQL relational database language (see Chapters 8 and 9), which represents a combination of DDL, VDL, and DML, as well as statements for constraint specification, schema evolution, and other features. The SDL was a component in early versions of SQL but has been removed from the language to keep it at the conceptual and external levels only.

There are two main types of DMLs. A high-level or nonprocedural DML can be used on its own to specify complex database operations in a concise manner. Many DBMSs allow high-level DML statements either to be entered interactively from a display monitor or terminal or to be embedded in a general-purpose programming language.
In the latter case, DML statements must be identified within the program so that they can be extracted by a precompiler and processed by the DBMS. A low-level or procedural DML must be embedded in a general-purpose programming language. This type of DML typically retrieves individual records or objects from the database and processes each separately. Hence, it needs to use programming language constructs, such as looping, to retrieve and process each record from a set of records. Low-level DMLs are also called record-at-a-time DMLs because of this property. High-level DMLs, such as SQL, can specify and retrieve many records in a single DML statement and are hence called set-at-a-time or set-oriented DMLs. A query in a high-level DML often specifies which data to retrieve rather than how to retrieve it; hence, such languages are also called declarative.

Whenever DML commands, whether high level or low level, are embedded in a general-purpose programming language, that language is called the host language and the DML is called the data sublanguage.9 On the other hand, a high-level DML used in a stand-alone interactive manner is called a query language. In general, both retrieval and update commands of a high-level DML may be used interactively and are hence considered part of the query language.10

Casual end users typically use a high-level query language to specify their requests, whereas programmers use the DML in its embedded form. For naive and parametric users, there usually are user-friendly interfaces for interacting with the database; these can also be used by casual users or others who do not want to learn the details of a high-level query language. We discuss these types of interfaces next.

9. In object databases, the host and data sublanguages typically form one integrated language, for example, C++ with some extensions to support database functionality. Some relational systems also provide integrated languages, for example, Oracle's PL/SQL.
10. According to the meaning of the word query in English, it should really be used to describe only retrievals, not updates.

2.3.2 DBMS Interfaces

User-friendly interfaces provided by a DBMS may include the following.

Menu-Based Interfaces for Web Clients or Browsing. These interfaces present the user with lists of options, called menus, that lead the user through the formulation of a request. Menus do away with the need to memorize the specific commands and syntax of a query language; rather, the query is composed step by step by picking options from a menu that is displayed by the system. Pull-down menus are a very popular technique in Web-based user interfaces. They are also often used in browsing interfaces, which allow a user to look through the contents of a database in an exploratory and unstructured manner.

Forms-Based Interfaces. A forms-based interface displays a form to each user. Users can fill out all of the form entries to insert new data, or they fill out only certain entries, in which case the DBMS will retrieve matching data for the remaining entries. Forms are usually designed and programmed for naive users as interfaces to canned transactions. Many DBMSs have forms specification languages, which are special languages that help programmers specify such forms. Some systems have utilities that define a form by letting the end user interactively construct a sample form on the screen.

Graphical User Interfaces. A graphical interface (GUI) typically displays a schema to the user in diagrammatic form. The user can then specify a query by manipulating the diagram. In many cases, GUIs utilize both menus and forms. Most GUIs use a pointing device, such as a mouse, to pick certain parts of the displayed schema diagram.

Natural Language Interfaces. These interfaces accept requests written in English or some other language and attempt to "understand" them.
A natural language interface usually has its own "schema," which is similar to the database conceptual schema, as well as a dictionary of important words. The natural language interface refers to the words in its schema, as well as to the set of standard words in its dictionary, to interpret the request. If the interpretation is successful, the interface generates a high-level query corresponding to the natural language request and submits it to the DBMS for processing; otherwise, a dialogue is started with the user to clarify the request.

Interfaces for Parametric Users. Parametric users, such as bank tellers, often have a small set of operations that they must perform repeatedly. Systems analysts and programmers design and implement a special interface for each known class of naive users. Usually, a small set of abbreviated commands is included, with the goal of minimizing the number of keystrokes required for each request. For example, function keys in a terminal can be programmed to initiate the various commands. This allows the parametric user to proceed with a minimal number of keystrokes.

Interfaces for the DBA. Most database systems contain privileged commands that can be used only by the DBA's staff. These include commands for creating accounts, setting system parameters, granting account authorization, changing a schema, and reorganizing the storage structures of a database.

2.4 THE DATABASE SYSTEM ENVIRONMENT

A DBMS is a complex software system. In this section we discuss the types of software components that constitute a DBMS and the types of computer system software with which the DBMS interacts.

2.4.1 DBMS Component Modules

Figure 2.3 illustrates, in a simplified form, the typical DBMS components. The database and the DBMS catalog are usually stored on disk. Access to the disk is controlled primarily by the operating system (OS), which schedules disk input/output.
A higher-level stored data manager module of the DBMS controls access to DBMS information that is stored on disk, whether it is part of the database or the catalog. The dotted lines and circles marked A, B, C, D, and E in Figure 2.3 illustrate accesses that are under the control of this stored data manager. The stored data manager may use basic OS services for carrying out low-level data transfer between the disk and computer main storage, but it controls other aspects of data transfer, such as handling buffers in main memory. Once the data is in main memory buffers, it can be processed by other DBMS modules, as well as by application programs. Some DBMSs have their own buffer manager module, while others use the OS for handling the buffering of disk pages.

FIGURE 2.3 Component modules of a DBMS and their interactions. [Diagram; recoverable labels: DBA staff, casual users, parametric users, DDL statements, privileged commands, interactive query, DDL compiler, compiled (canned) transactions, execution, stored data manager, concurrency control/backup/recovery subsystems.]

The DDL compiler processes schema definitions, specified in the DDL, and stores descriptions of the schemas (meta-data) in the DBMS catalog. The catalog includes information such as the names and sizes of files, names and data types of data items, storage details of each file, mapping information among schemas, and constraints, in addition to many other types of information that are needed by the DBMS modules. DBMS software modules then look up the catalog information as needed.

The runtime database processor handles database accesses at runtime; it receives retrieval or update operations and carries them out on the database. Access to disk goes through the stored data manager, and the buffer manager keeps track of the database pages in memory.
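To make the buffer manager's role more tangible, here is a toy sketch in Python. It is not how any particular DBMS implements buffering. The class name `BufferManager`, the `read_page` callback standing in for the stored data manager, and the least-recently-used replacement policy are all assumptions chosen for illustration; the text above does not prescribe a policy.

```python
from collections import OrderedDict

class BufferManager:
    """Toy buffer manager: keeps a bounded set of disk pages in main memory."""

    def __init__(self, read_page, capacity):
        self._read_page = read_page    # stands in for the stored data manager
        self._capacity = capacity      # number of page buffers available
        self._pages = OrderedDict()    # page_id -> contents, least recently used first
        self.disk_reads = 0

    def get_page(self, page_id):
        if page_id in self._pages:
            # Page is already buffered: no disk access is needed.
            self._pages.move_to_end(page_id)
            return self._pages[page_id]
        # Page is not buffered: fetch it through the stored data manager.
        self.disk_reads += 1
        page = self._read_page(page_id)
        self._pages[page_id] = page
        if len(self._pages) > self._capacity:
            # Assumed policy: evict the least recently used page.
            self._pages.popitem(last=False)
        return page

    def buffered_pages(self):
        return list(self._pages)

# A dictionary plays the part of the disk in this sketch.
disk = {page_id: f"page-{page_id}" for page_id in range(5)}
buffers = BufferManager(disk.__getitem__, capacity=2)
for page_id in (0, 1, 0, 2, 1):
    buffers.get_page(page_id)
```

In this run the second request for page 0 is served from memory, so five page requests cost only four disk reads. A runtime database processor would call something like `get_page` rather than touching the disk directly.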
The query compiler handles high-level queries that are entered interactively. It parses, analyzes, and compiles or interprets a query by creating database access code, and then generates calls to the runtime processor for executing the code.

The precompiler extracts DML commands from an application program written in a host programming language. These commands are sent to the DML compiler for compilation into object code for database access. The rest of the program is sent to the host language compiler. The object codes for the DML commands and the rest of the program are linked, forming a canned transaction whose executable code includes calls to the runtime database processor.

It is now common to have the client program that accesses the DBMS running on a separate computer from the computer on which the database resides. The former is called the client computer, and the latter is called the database server. In some cases, the client accesses a middle computer, called the application server, which in turn accesses the database server. We elaborate on this topic in Section 2.5.

Figure 2.3 is not meant to describe a specific DBMS; rather, it illustrates typical DBMS modules. The DBMS interacts with the operating system when disk accesses (to the database or to the catalog) are needed. If the computer system is shared by many users, the OS will schedule DBMS disk access requests and DBMS processing along with other processes. On the other hand, if the computer system is mainly dedicated to running the database server, the DBMS will control main memory buffering of disk pages. The DBMS also interfaces with compilers for general-purpose host programming languages, and with application servers and client programs running on separate machines through the system network interface.

2.4.2 Database System Utilities

In addition to possessing the software modules just described, most DBMSs have database utilities that help the DBA in managing the database system.
Common utilities have the following types of functions:

• Loading: A loading utility is used to load existing data files (such as text files or sequential files) into the database. Usually, the current (source) format of the data file and the desired (target) database file structure are specified to the utility, which then automatically reformats the data and stores it in the database. With the proliferation of DBMSs, transferring data from one DBMS to another is becoming common in many organizations. Some vendors are offering products that generate the appropriate loading programs, given the existing source and target database storage descriptions (internal schemas). Such tools are also called conversion tools.
• Backup: A backup utility creates a backup copy of the database, usually by dumping the entire database onto tape. The backup copy can be used to restore the database in case of catastrophic failure. Incremental backups are also often used, where only changes since the previous backup are recorded. Incremental backup is more complex but saves space.
• File reorganization: This utility can be used to reorganize a database file into a different file organization to improve performance.
• Performance monitoring: Such a utility monitors database usage and provides statistics to the DBA. The DBA uses the statistics in making decisions such as whether or not to reorganize files to improve performance.

Other utilities may be available for sorting files, handling data compression, monitoring access by users, interfacing with the network, and performing other functions.

2.4.3 Tools, Application Environments, and Communications Facilities

Other tools are often available to database designers, users, and DBAs. CASE tools11 are used in the design phase of database systems. Another tool that can be quite useful in large organizations is an expanded data dictionary (or data repository) system. In addition to storing catalog information about schemas and constraints, the data dictionary stores other information, such as design decisions, usage standards, application program descriptions, and user information. Such a system is also called an information repository. This information can be accessed directly by users or the DBA when needed. A data dictionary utility is similar to the DBMS catalog, but it includes a wider variety of information and is accessed mainly by users rather than by the DBMS software.

11. Although CASE stands for computer-aided software engineering, many CASE tools are used primarily for database design.

Application development environments, such as the PowerBuilder (Sybase) or JBuilder (Borland) system, are becoming quite popular. These systems provide an environment for developing database applications and include facilities that help in many facets of database systems, including database design, GUI development, querying and updating, and application program development.

The DBMS also needs to interface with communications software, whose function is to allow users at locations remote from the database system site to access the database through computer terminals, workstations, or their local personal computers. These are connected to the database site through data communications hardware such as phone lines, long-haul networks, local area networks, or satellite communication devices. Many commercial database systems have communication packages that work with the DBMS. The integrated DBMS and data communications system is called a DB/DC system. In addition, some distributed DBMSs are physically distributed over multiple machines. In this case, communications networks are needed to connect the machines. These are often local area networks (LANs), but they can also be other types of networks.
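Returning to the loading utility of Section 2.4.2, the idea of specifying a source format and a target structure can be sketched as follows. The example assumes a comma-separated text file as the source and SQLite (via Python's sqlite3 module) as the target DBMS. The function name `load`, the `column_map` parameter, the source field names, and the sample records are all hypothetical.

```python
import csv
import io
import sqlite3

def load(conn, source, table, column_map):
    """Load delimited text records into a table.

    column_map describes both formats: it maps each source field name to a
    (target column, converter) pair, so the data is reformatted on the way in.
    """
    targets = [target for target, _ in column_map.values()]
    placeholders = ", ".join("?" for _ in targets)
    insert_sql = f"INSERT INTO {table} ({', '.join(targets)}) VALUES ({placeholders})"
    rows = [
        tuple(convert(record[field]) for field, (_, convert) in column_map.items())
        for record in csv.DictReader(source)
    ]
    conn.executemany(insert_sql, rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Name TEXT, StudentNumber INTEGER PRIMARY KEY, "
    "Class INTEGER, Major TEXT)"
)

# An in-memory text file stands in for an existing data file (illustrative records).
source_file = io.StringIO("name,number,class,major\nSmith,17,1,CS\nBrown,8,2,CS\n")

loaded = load(
    conn,
    source_file,
    "STUDENT",
    {
        "name": ("Name", str),
        "number": ("StudentNumber", int),
        "class": ("Class", int),
        "major": ("Major", str),
    },
)
students = conn.execute(
    "SELECT Name, StudentNumber FROM STUDENT ORDER BY StudentNumber"
).fetchall()
```

A production loading utility would also need to handle malformed records, large files, and transactions, which this sketch leaves out.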
2.5 CENTRALIZED AND CLIENT/SERVER ARCHITECTURES FOR DBMSs

2.5.1 Centralized DBMSs Architecture

Architectures for DBMSs have followed trends similar to those for general computer system architectures. Earlier architectures used mainframe computers to provide the main processing for all functions of the system, including user application programs and user interface programs, as well as all the DBMS functionality. The reason was that most users accessed such systems via computer terminals that did not have processing power and only provided display capabilities. So, all processing was performed remotely on the computer system, and only display information and controls were sent from the computer to the display terminals, which were connected to the central computer via various types of communications networks.

As prices of hardware declined, most users replaced their terminals with personal computers (PCs) and workstations. At first, database systems used these computers in the same way as they had used display terminals, so that the DBMS itself was still a centralized DBMS in which all the DBMS functionality, application program execution, and user interface processing were carried out on one machine. Figure 2.4 illustrates the physical components in a centralized architecture. Gradually, DBMS systems started to exploit the available processing power at the user side, which led to client/server DBMS architectures.

2.5.2 Basic Client/Server Architectures

We first discuss client/server architecture in general, then see how it is applied to DBMSs. The client/server architecture was developed to deal with computing environments in which a large number of PCs, workstations, file servers, printers, database servers, Web servers, and other equipment are connected via a network. The idea is to define specialized servers with specific functionalities.
For example, it is possible to connect a number of PCs or small workstations as clients to a file server that maintains the files of the client machines. Another machine could be designated as a printer server by being connected to various printers; thereafter, all print requests by the clients are forwarded to this machine. Web servers or e-mail servers also fall into the specialized server category. In this way, the resources provided by specialized servers can be accessed by many client machines. The client machines provide the user with the appropriate interfaces to utilize these servers, as well as with local processing power to run local applications. This concept can be carried over to software, with specialized software (such as a DBMS or a CAD (computer-aided design) package) being stored on specific server machines and being made accessible to multiple clients. Figure 2.5 illustrates client/server architecture at the logical level, and Figure 2.6 is a simplified diagram that shows how the physical architecture would look. Some machines would be only client sites (for example, diskless workstations or workstations/PCs with disks that have only client software installed). Other machines would be dedicated servers. Still other machines would have both client and server functionality.

FIGURE 2.4 A physical centralized architecture. [Diagram; recoverable labels: terminals with display monitors connected through a network; software layer with application programs, terminal display control, text editors, compilers, DBMS, and operating system; hardware/firmware layer with system bus, controllers, memory, CPU, and I/O devices (printers, tape drives, ...).]

FIGURE 2.5 Logical two-tier client/server architecture. [Diagram: clients and servers connected by a network.]
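The idea of a specialized server whose software is made accessible to multiple clients can be sketched with Python's standard library. In this toy example everything runs on one machine over the loopback address, a line of text carries each request, and replies are encoded as JSON. None of these choices come from the text; the function names `make_server` and `client_query` and the sample COURSE rows are hypothetical, and a real database server would add authentication, concurrency control, and a proper protocol.

```python
import json
import socket
import socketserver
import sqlite3
import threading

def make_server():
    """Build a toy database server that owns the data and answers queries."""
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE COURSE (CourseNumber TEXT, CreditHours INTEGER)")
    db.executemany(
        "INSERT INTO COURSE VALUES (?, ?)",
        [("CS1310", 4), ("MATH2410", 3)],  # illustrative rows
    )

    class QueryHandler(socketserver.StreamRequestHandler):
        def handle(self):
            # One request per connection: a single line of query text.
            sql = self.rfile.readline().decode("utf-8").strip()
            rows = db.execute(sql).fetchall()
            self.wfile.write((json.dumps(rows) + "\n").encode("utf-8"))

    # Port 0 asks the operating system for any free port.
    return socketserver.TCPServer(("127.0.0.1", 0), QueryHandler)

def client_query(address, sql):
    """Client side: connect to the server, send a request, return the reply."""
    with socket.create_connection(address) as sock:
        sock.sendall((sql + "\n").encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reply:
            return json.loads(reply.readline())

server = make_server()
threading.Thread(target=server.serve_forever, daemon=True).start()

# Two client requests are served by the same server.
QUERY = "SELECT CourseNumber FROM COURSE ORDER BY CourseNumber"
results = [client_query(server.server_address, QUERY) for _ in range(2)]

server.shutdown()
server.server_close()
```

The clients hold no data of their own. They only format a request and present the reply, while the query processing stays on the server side.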
The concept of client/server architecture assumes an underlying framework that consists of many PCs and workstations as well as a smaller number of mainframe machines, connected via local area networks and other types of computer networks. A client in this framework is typically a user machine that provides user interface capabilities and local processing. When a client requires access to additional functionality (such as database access) that does not exist at that machine, it connects to a server that provides the needed functionality. A server is a machine that can provide services to the client machines, such as file access, printing, archiving, or database access. In the general case, some machines install only client software, others only server software, and still others may include both client and server software, as illustrated in Figure 2.6. However, client and server software more commonly run on separate machines. Two main types of basic DBMS architectures were created on this underlying client/server framework: two-tier and three-tier.¹² We discuss those next.

FIGURE 2.6 Physical two-tier client/server architecture.

¹² There are many other variations of client/server architectures. We only discuss the two most basic ones here. In Chapter 25, we discuss additional client/server and distributed architectures.

2.5.3 Two-Tier Client/Server Architectures for DBMSs

The client/server architecture is increasingly being incorporated into commercial DBMS packages. In relational DBMSs (RDBMSs), many of which started as centralized systems, the system components that were first moved to the client side were the user interface and application programs.
Because SQL (see Chapters 8 and 9) provided a standard language for RDBMSs, this created a logical dividing point between client and server. Hence, the query and transaction functionality remained on the server side. In such an architecture, the server is often called a query server or transaction server, because it provides these two functionalities. In RDBMSs, the server is also often called an SQL server, since most RDBMS servers are based on the SQL language and standard. In such a client/server architecture, the user interface programs and application programs can run on the client side. When DBMS access is required, the program establishes a connection to the DBMS (which is on the server side); once the connection is created, the client program can communicate with the DBMS. A standard called Open Database Connectivity (ODBC) provides an application programming interface (API), which allows client-side programs to call the DBMS, as long as both client and server machines have the necessary software installed. Most DBMS vendors provide ODBC drivers for their systems. Hence, a client program can actually connect to several RDBMSs and send query and transaction requests using the ODBC API, which are then processed at the server sites. Any query results are sent back to the client program, which can process or display the results as needed. A related standard for the Java programming language, called JDBC, has also been defined. This allows Java client programs to access the DBMS through a standard interface. The second approach to client/server architecture was taken by some object-oriented DBMSs. Because many of these systems were developed in the era of client/server architecture, the approach taken was to divide the software modules of the DBMS between client and server in a more integrated way. 
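Returning for a moment to the first approach: the cycle that ODBC and JDBC standardize (establish a connection, send a query, receive the results) can be sketched in a few lines. This is only an illustration under stated assumptions. It uses Python's DB-API with the built-in sqlite3 module as a stand-in for a driver (SQLite runs in-process, so no server is actually contacted), and the EMPLOYEE table, its columns, the sample rows, and the function name fetch_high_earners are all hypothetical.

```python
import sqlite3


def fetch_high_earners(conn, min_salary):
    """Send one parameterized query over an open connection and return the rows."""
    cur = conn.cursor()
    cur.execute(
        "SELECT name, salary FROM EMPLOYEE WHERE salary >= ? ORDER BY name",
        (min_salary,),
    )
    rows = cur.fetchall()  # the query results come back to the client program
    cur.close()
    return rows


# Step 1: establish a connection. ":memory:" is a stand-in; with a real
# client/server DBMS this call would identify the server instead.
conn = sqlite3.connect(":memory:")

# Hypothetical sample data so that the sketch is self-contained.
conn.execute("CREATE TABLE EMPLOYEE (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [("Smith", 30000), ("Wong", 40000), ("Zelaya", 25000)],
)
conn.commit()

# Step 2: send a query request. Step 3: process or display the results.
rows = fetch_high_earners(conn, 30000)
for name, salary in rows:
    print(name, salary)

conn.close()
```

The client code deals only with the connection and the SQL text; how the query is evaluated stays on the DBMS side, which is the dividing point the text describes.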
For example, the server level may include the part of the DBMS software responsible for handling data storage on disk pages, local concurrency control and recovery, buffering and caching of disk pages, and other such functions. Meanwhile, the client level may handle the user interface; data dictionary functions; DBMS interactions with programming language compilers; global query optimization, concurrency control, and recovery across multiple servers; structuring of complex objects from the data in the buffers; and other such functions. In this approach, the client/server interaction is more tightly coupled and is done internally by the DBMS modules (some of which reside on the client and some on the server) rather than by the users. The exact division of functionality varies from system to system. In such a client/server architecture, the server has been called a data server, because it provides data in disk pages to the client. This data can then be structured into objects for the client programs by the client-side DBMS software itself.

The architectures described here are called two-tier architectures because the software components are distributed over two systems: client and server. The advantages of this architecture are its simplicity and seamless compatibility with existing systems. The emergence of the World Wide Web changed the roles of clients and servers, leading to the three-tier architecture.

2.5.4 Three-Tier Client/Server Architectures for Web Applications

Many Web applications use an architecture called the three-tier architecture, which adds an intermediate layer between the client and the database server, as illustrated in Figure 2.7. This intermediate layer or middle tier is sometimes called the application server and sometimes the Web server, depending on the application.
This server plays an intermediary role by storing business rules (procedures or constraints) that are used to access data from the database server. It can also improve database security by checking a client's credentials before forwarding a request to the database server. Clients contain GUI interfaces and some additional application-specific business rules. The intermediate server accepts requests from the client, processes the request and sends database commands to the database server, and then acts as a conduit for passing (partially) processed data from the database server to the clients, where it may be processed further and filtered to be presented to users in GUI format. Thus, the user interface, application rules, and data access act as the three tiers.

Advances in encryption and decryption technology make it safer to transfer sensitive data from server to client in encrypted form, where it will be decrypted. The latter can be done by the hardware or by advanced software. This technology gives higher levels of data security, but the network security issues remain a major concern. Various technologies for data compression are also helping in transferring large amounts of data from servers to clients over wired and wireless networks.

FIGURE 2.7 Logical three-tier client/server architecture. (Tiers shown: Client, with GUI and Web interface; Application Server or Web Server, with application programs and Web pages; Database Server, with the database management system.)

2.6 CLASSIFICATION OF DATABASE MANAGEMENT SYSTEMS

Several criteria are normally used to classify DBMSs. The first is the data model on which the DBMS is based. The main data model used in many current commercial DBMSs is the relational data model. The object data model was implemented in some commercial systems but has not had widespread use. Many legacy (older) applications still run on database systems based on the hierarchical and network data models.
The relational DBMSs are evolving continuously, and, in particular, have been incorporating many of the concepts that were developed in object databases. This has led to a new class of DBMSs called object-relational DBMSs. We can hence categorize DBMSs based on the data model: relational, object, object-relational, hierarchical, network, and other.

The second criterion used to classify DBMSs is the number of users supported by the system. Single-user systems support only one user at a time and are mostly used with personal computers. Multiuser systems, which include the majority of DBMSs, support multiple users concurrently.

A third criterion is the number of sites over which the database is distributed. A DBMS is centralized if the data is stored at a single computer site. A centralized DBMS can support multiple users, but the DBMS and the database themselves reside totally at a single computer site. A distributed DBMS (DDBMS) can have the actual database and DBMS software distributed over many sites, connected by a computer network. Homogeneous DDBMSs use the same DBMS software at multiple sites. A recent trend is to develop software to access several autonomous preexisting databases stored under heterogeneous DBMSs. This leads to a federated DBMS (or multidatabase system), in which the participating DBMSs are loosely coupled and have a degree of local autonomy. Many DDBMSs use a client-server architecture.

A fourth criterion is the cost of the DBMS. The majority of DBMS packages cost between $10,000 and $100,000. Single-user low-end systems that work with microcomputers cost between $100 and $3000. At the other end of the scale, a few elaborate packages cost more than $100,000.

We can also classify a DBMS on the basis of the types of access path options for storing files. One well-known family of DBMSs is based on inverted file structures. Finally, a DBMS can be general purpose or special purpose.
When performance is a primary consideration, a special-purpose DBMS can be designed and built for a specific application; such a system cannot be used for other applications without major changes. Many airline reservations and telephone directory systems developed in the past are special purpose DBMSs. These fall into the category of online transaction processing (OLTP) systems, which must support a large number of concurrent transactions without imposing excessive delays.

Let us briefly elaborate on the main criterion for classifying DBMSs: the data model. The basic relational data model represents a database as a collection of tables, where each table can be stored as a separate file. The database in Figure 1.2 is shown in a manner very similar to a relational representation. Most relational databases use the high-level query language called SQL and support a limited form of user views. We discuss the relational model, its languages and operations, and techniques for programming relational applications in Chapters 5 through 9.

The object data model defines a database in terms of objects, their properties, and their operations. Objects with the same structure and behavior belong to a class, and classes are organized into hierarchies (or acyclic graphs). The operations of each class are specified in terms of predefined procedures called methods. Relational DBMSs have been extending their models to incorporate object database concepts and other capabilities; these systems are referred to as object-relational or extended relational systems. We discuss object databases and object-relational systems in Chapters 20 to 22.

Two older, historically important data models, now known as legacy data models, are the network and hierarchical models. The network model represents data as record types and also represents a limited type of 1:N relationship, called a set type.
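As a rough illustration of a set type, the sketch below models one set occurrence as a single record on the "1" side together with the records on the "N" side. It is not the CODASYL language itself. The terms owner and member are the usual network-model names for the two sides and are not defined in this passage; the class names, the SECTION record type, and all field names and values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A record of some record type, holding named field values."""
    record_type: str
    fields: dict


@dataclass
class SetOccurrence:
    """One occurrence of a set type: one owner record and its member records (1:N)."""
    set_type: str
    owner: Record
    members: list = field(default_factory=list)

    def connect(self, member: Record) -> None:
        # Add a member record to this owner's set occurrence.
        self.members.append(member)


# Hypothetical data: one COURSE record owning two SECTION records.
course = Record("COURSE", {"CourseNumber": "CS1310"})
offerings = SetOccurrence("COURSE_OFFERINGS", owner=course)
offerings.connect(Record("SECTION", {"SectionIdentifier": 85}))
offerings.connect(Record("SECTION", {"SectionIdentifier": 92}))

# Record-at-a-time processing: visit the members of the set one by one.
for member in offerings.members:
    print(offerings.owner.fields["CourseNumber"], member.fields["SectionIdentifier"])
```

The loop at the end mirrors the record-at-a-time style of access: the program steps through one record after another instead of asking for a whole set of rows in a single statement.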
Figure 2.8 shows a network schema diagram for the database of Figure 1.2, where record types are shown as rectangles and set types are shown as labeled directed arrows. The network model, also known as the CODASYL DBTG model,¹³ has an associated record-at-a-time language that must be embedded in a host programming language. The hierarchical model represents data as hierarchical tree structures. Each hierarchy represents a number of related records. There is no standard language for the hierarchical model, although most hierarchical DBMSs have record-at-a-time languages. We give a brief overview of the network and hierarchical models in Appendices E and F.¹⁴

The XML (eXtensible Markup Language) model, now considered the standard for data interchange over the Internet, also uses hierarchical tree structures. It combines database concepts with concepts from document representation models. Data is represented as elements, which can be nested to create complex hierarchical structures. This model conceptually resembles the object model, but uses different terminology. We discuss XML and how it is related to databases in Chapter 26.

FIGURE 2.8 The schema of Figure 2.1 in network model notation.

¹³ CODASYL DBTG stands for Conference on Data Systems Languages Data Base Task Group, which is the committee that specified the network model and its language.

¹⁴ The full chapters on the network and hierarchical models from the second edition of this book are available over the Internet from the Web site.

2.7 SUMMARY

In this chapter we introduced the main concepts used in database systems.
We defined a data model, and we distinguished three main categories of data models:

• High-level or conceptual data models (based on entities and relationships)
• Low-level or physical data models
• Representational or implementation data models (record-based, object-oriented)

We distinguished the schema, or description of a database, from the database itself. The schema does not change very often, whereas the database state changes every time data is inserted, deleted, or modified. We then described the three-schema DBMS architecture, which allows three schema levels:

• An internal schema describes the physical storage structure of the database.
• A conceptual schema is a high-level description of the whole database.
• External schemas describe the views of different user groups.

A DBMS that cleanly separates the three levels must have mappings between the schemas to transform requests and results from one level to the next. Most DBMSs do not separate the three levels completely. We used the three-schema architecture to define the concepts of logical and physical data independence.

We then discussed the main types of languages and interfaces that DBMSs support. A data definition language (DDL) is used to define the database conceptual schema. In most DBMSs, the DDL also defines user views and, sometimes, storage structures; in other DBMSs, separate languages (VDL, SDL) may exist for specifying views and storage structures. The DBMS compiles all schema definitions and stores their descriptions in the DBMS catalog. A data manipulation language (DML) is used for specifying database retrievals and updates. DMLs can be high level (set-oriented, nonprocedural) or low level (record-oriented, procedural). A high-level DML can be embedded in a host programming language, or it can be used as a stand-alone language; in the latter case it is often called a query language.
We discussed different types of interfaces provided by DBMSs, and the types of DBMS users with which each interface is associated. We then discussed the database system environment, typical DBMS software modules, and DBMS utilities for helping users and the DBA perform their tasks. We then gave an overview of the two-tier and three-tier architectures for database applications, which are now very common in most modern applications, particularly Web database applications.

In the final section, we classified DBMSs according to several criteria: data model, number of users, number of sites, cost, types of access paths, and generality. The main classification of DBMSs is based on the data model. We briefly discussed the main data models used in current commercial DBMSs.

Review Questions

2.1. Define the following terms: data model, database schema, database state, internal schema, conceptual schema, external schema, data independence, DDL, DML, SDL, VDL, query language, host language, data sublanguage, database utility, catalog, client/server architecture.
2.2. Discuss the main categories of data models.
2.3. What is the difference between a database schema and a database state?
2.4. Describe the three-schema architecture. Why do we need mappings between schema levels? How do different schema definition languages support this architecture?
2.5. What is the difference between logical data independence and physical data independence?
2.6. What is the difference between procedural and nonprocedural DMLs?
2.7. Discuss the different types of user-friendly interfaces and the types of users who typically use each.
2.8. With what other computer system software does a DBMS interact?
2.9. What is the difference between the two-tier and three-tier client/server architectures?
2.10. Discuss some types of database utilities and tools and their functions.

Exercises

2.11.
Think of different users for the database of Figure 1.2. What types of applications would each user need? To which user category would each belong, and what type of interface would each need?
2.12. Choose a database application with which you are familiar. Design a schema and show a sample database for that application, using the notation of Figures 2.1 and 1.2. What types of additional information and constraints would you like to represent in the schema? Think of several users for your database, and design a view for each.

Selected Bibliography

Many database textbooks, including Date (2001), Silberschatz et al. (2001), Ramakrishnan and Gehrke (2002), Garcia-Molina et al. (1999, 2001), and Abiteboul et al. (1995), provide a discussion of the various database concepts presented here. Tsichritzis and Lochovsky (1982) is an early textbook on data models. Tsichritzis and Klug (1978) and Jardine (1977) present the three-schema architecture, which was first suggested in the DBTG CODASYL report (1971) and later in an American National Standards Institute (ANSI) report (1975). An in-depth analysis of the relational data model and some of its possible extensions is given in Codd (1992). The proposed standard for object-oriented databases is described in Cattell (1997). Many documents describing XML are available on the Web, such as XML (2003). Examples of database utilities are the ETI Extract Toolkit (www.eti.com) and the database administration tool DB Artisan from Embarcadero Technologies (www.embarcadero.com).

Data Modeling Using the Entity-Relationship Model

Conceptual modeling is a very important phase in designing a successful database application. Generally, the term database application refers to a particular database and the associated programs that implement the database queries and updates.
For example, a BANK database application that keeps track of customer accounts would include programs that implement database updates corresponding to customers making deposits and withdrawals. These programs provide user-friendly graphical user interfaces (GUIs) utilizing forms and menus for the end users of the application (the bank tellers, in this example). Hence, part of the database application will require the design, implementation, and testing of these application programs. Traditionally, the design and testing of application programs has been considered to be more in the realm of the software engineering domain than in the database domain. As database design methodologies include more of the concepts for specifying operations on database objects, and as software engineering methodologies specify in more detail the structure of the databases that software programs will use and access, it is clear that these activities are strongly related. We briefly discuss some of the concepts for specifying database operations in Chapter 4, and again when we discuss database design methodology with example applications in Chapter 12 of this book.

In this chapter, we follow the traditional approach of concentrating on the database structures and constraints during database design. We present the modeling concepts of the Entity-Relationship (ER) model, which is a popular high-level conceptual data model. This model and its variations are frequently used for the conceptual design of database applications, and many database design tools employ its concepts. We describe the basic data-structuring concepts and constraints of the ER model and discuss their use in the design of conceptual schemas for database applications. We also present the diagrammatic notation associated with the ER model, known as ER diagrams.
Object modeling methodologies such as UML (Unified Modeling Language) are becoming increasingly popular in software design and engineering. These methodologies go beyond database design to specify detailed design of software modules and their interactions using various types of diagrams. An important part of these methodologies, namely class diagrams,¹ is similar in many ways to ER diagrams. In class diagrams, operations on objects are specified, in addition to specifying the database schema structure. Operations can be used to specify the functional requirements during database design, as discussed in Section 3.1. We present some of the UML notation and concepts for class diagrams that are particularly relevant to database design in Section 3.8, and briefly compare these to ER notation and concepts. Additional UML notation and concepts are presented in Section 4.6 and in Chapter 12.

This chapter is organized as follows. Section 3.1 discusses the role of high-level conceptual data models in database design. We introduce the requirements for an example database application in Section 3.2 to illustrate the use of concepts from the ER model. This example database is also used in subsequent chapters. In Section 3.3 we present the concepts of entities and attributes, and we gradually introduce the diagrammatic technique for displaying an ER schema. In Section 3.4 we introduce the concepts of binary relationships and their roles and structural constraints. Section 3.5 introduces weak entity types. Section 3.6 shows how a schema design is refined to include relationships. Section 3.7 reviews the notation for ER diagrams, summarizes the issues that arise in schema design, and discusses how to choose the names for database schema constructs. Section 3.8 introduces some UML class diagram concepts, compares them to ER model concepts, and applies them to the same database example. Section 3.9 summarizes the chapter.
The material in Section 3.8 may be left out of an introductory course if desired. On the other hand, if more thorough coverage of data modeling concepts and conceptual database design is desired, the reader should continue on to the material in Chapter 4 after concluding Chapter 3. Chapter 4 describes extensions to the ER model that lead to the Enhanced-ER (EER) model, which includes concepts such as specialization, generalization, inheritance, and union types (categories). We also introduce some additional UML concepts and notation in Chapter 4.

¹ A class is similar to an entity type in many ways.

3.1 USING HIGH-LEVEL CONCEPTUAL DATA MODELS FOR DATABASE DESIGN

FIGURE 3.1 A simplified diagram to illustrate the main phases of database design.

Figure 3.1 shows a simplified description of the database design process. The first step shown is requirements collection and analysis. During this step, the database designers interview prospective database users to understand and document their data requirements. The result of this step is a concisely written set of users' requirements. These requirements should be specified in as detailed and complete a form as possible. In parallel with specifying the data requirements, it is useful to specify the known functional requirements of the application. These consist of the user-defined operations (or transactions) that will be applied to the database, including both retrievals and updates.
In software design, it is common to use data flow diagrams, sequence diagrams, scenarios, and other techniques for specifying functional requirements. We will not discuss any of these techniques here because they are usually described in detail in software engineering texts. We give an overview of some of these techniques in Chapter 12.

Once all the requirements have been collected and analyzed, the next step is to create a conceptual schema for the database, using a high-level conceptual data model. This step is called conceptual design. The conceptual schema is a concise description of the data requirements of the users and includes detailed descriptions of the entity types, relationships, and constraints; these are expressed using the concepts provided by the high-level data model. Because these concepts do not include implementation details, they are usually easier to understand and can be used to communicate with nontechnical users. The high-level conceptual schema can also be used as a reference to ensure that all users' data requirements are met and that the requirements do not conflict. This approach enables the database designers to concentrate on specifying the properties of the data, without being concerned with storage details. Consequently, it is easier for them to come up with a good conceptual database design.

During or after the conceptual schema design, the basic data model operations can be used to specify the high-level user operations identified during functional analysis. This also serves to confirm that the conceptual schema meets all the identified functional requirements. Modifications to the conceptual schema can be introduced if some functional requirements cannot be specified using the initial schema.

The next step in database design is the actual implementation of the database, using a commercial DBMS. Most current commercial DBMSs use an implementation data model,
such as the relational or the object-relational database model, so the conceptual schema is transformed from the high-level data model into the implementation data model. This step is called logical design or data model mapping, and its result is a database schema in the implementation data model of the DBMS.

The last step is the physical design phase, during which the internal storage structures, indexes, access paths, and file organizations for the database files are specified. In parallel with these activities, application programs are designed and implemented as database transactions corresponding to the high-level transaction specifications. We discuss the database design process in more detail in Chapter 12.

We present only the basic ER model concepts for conceptual schema design in this chapter. Additional modeling concepts are discussed in Chapter 4, when we introduce the EER model.

3.2 AN EXAMPLE DATABASE APPLICATION

In this section we describe an example database application, called COMPANY, that serves to illustrate the basic ER model concepts and their use in schema design. We list the data requirements for the database here, and then create its conceptual schema step by step as we introduce the modeling concepts of the ER model. The COMPANY database keeps track of a company's employees, departments, and projects. Suppose that after the requirements collection and analysis phase, the database designers provided the following description of the "miniworld" (the part of the company to be represented in the database):

1. The company is organized into departments. Each department has a unique name, a unique number, and a particular employee who manages the department. We keep track of the start date when that employee began managing the department. A department may have several locations.
2.
A department controls a number of projects, each of which has a unique name, a unique number, and a single location.
3. We store each employee's name, social security number,² address, salary, sex, and birth date. An employee is assigned to one department but may work on several projects, which are not necessarily controlled by the same department. We keep track of the number of hours per week that an employee works on each project. We also keep track of the direct supervisor of each employee.
4. We want to keep track of the dependents of each employee for insurance purposes. We keep each dependent's first name, sex, birth date, and relationship to the employee.

² The social security number, or SSN, is a unique nine-digit identifier assigned to each individual in the United States to keep track of his or her employment, benefits, and taxes. Other countries may have similar identification schemes, such as personal identification card numbers.

Figure 3.2 shows how the schema for this database application can be displayed by means of the graphical notation known as ER diagrams. We describe the step-by-step process of deriving this schema from the stated requirements, and explain the ER diagrammatic notation, as we introduce the ER model concepts in the following section.

3.3 ENTITY TYPES, ENTITY SETS, ATTRIBUTES, AND KEYS

The ER model describes data as entities, relationships, and attributes. In Section 3.3.1 we introduce the concepts of entities and their attributes. We discuss entity types and key attributes in Section 3.3.2. Then, in Section 3.3.3, we specify the initial conceptual design of the entity types for the COMPANY database. Relationships are described in Section 3.4.

3.3.1 Entities and Attributes

Entities and Their Attributes. The basic object that the ER model represents is an entity, which is a "thing" in the real world with an independent existence. An entity may be an object with a physical existence (for example, a particular person, car, house, or
employee) or it may be an object with a conceptual existence (for example, a company, a job, or a university course). Each entity has attributes, the particular properties that describe it. For example, an employee entity may be described by the employee's name, age, address, salary, and job. A particular entity will have a value for each of its attributes. The attribute values that describe each entity become a major part of the data stored in the database.

FIGURE 3.2 An ER schema diagram for the COMPANY database.

Figure 3.3 shows two entities and the values of their attributes. The employee entity e1 has four attributes: Name, Address, Age, and HomePhone; their values are "John Smith," "2311 Kirby, Houston, Texas 77001," "55," and "713-749-2630," respectively. The company entity c1 has three attributes: Name, Headquarters, and President; their values are "Sunco Oil," "Houston," and "John Smith," respectively.

FIGURE 3.3 Two entities, employee e1 and company c1, and their attributes.

Several types of attributes occur in the ER model: simple versus composite, single-valued versus multivalued, and stored versus derived. We first define these attribute types and illustrate their use via examples. We then introduce the concept of a null value for an attribute.

Composite versus Simple (Atomic) Attributes. Composite attributes can be divided into smaller subparts, which represent more basic attributes with independent meanings. For example, the Address attribute of the employee entity shown in Figure 3.3 can be subdivided into StreetAddress, City, State, and Zip,³ with the values "2311 Kirby," "Houston," "Texas," and "77001."
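The two entities of Figure 3.3, and the subdivision of Address just described, can be written down directly as attribute-value pairs. The following is a minimal sketch, not a prescribed representation: plain Python dictionaries stand in for entities, and the helper split_address is hypothetical and assumes the exact "street, city, state zip" layout used in the example.

```python
# Each entity is represented as a mapping from attribute names to values.
e1 = {
    "Name": "John Smith",
    "Address": "2311 Kirby, Houston, Texas 77001",
    "Age": 55,
    "HomePhone": "713-749-2630",
}
c1 = {
    "Name": "Sunco Oil",
    "Headquarters": "Houston",
    "President": "John Smith",
}


def split_address(address):
    """Split a 'street, city, state zip' string into its component attributes."""
    street, city, state_zip = [part.strip() for part in address.split(",")]
    state, zip_code = state_zip.rsplit(" ", 1)
    return {"StreetAddress": street, "City": city, "State": state, "Zip": zip_code}


components = split_address(e1["Address"])
for attribute, value in components.items():
    print(attribute, "=", value)
```

Treating Address as one string is convenient when it is only ever used as a unit; splitting it pays off once individual components such as Zip need to be referenced on their own.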
Attributes that are not divisible are called simple or atomic attributes. Composite attributes can form a hierarchy; for example, StreetAddress can be further subdivided into three simple attributes: Number, Street, and ApartmentNumber, as shown in Figure 3.4. The value of a composite attribute is the concatenation of the values of its constituent simple attributes.

FIGURE 3.4 A hierarchy of composite attributes: Address(StreetAddress(Number, Street, ApartmentNumber), City, State, Zip).

3. The zip code is the name used in the United States for a 5-digit postal code.

Composite attributes are useful to model situations in which a user sometimes refers to the composite attribute as a unit but at other times refers specifically to its components. If the composite attribute is referenced only as a whole, there is no need to subdivide it into component attributes. For example, if there is no need to refer to the individual components of an address (zip code, street, and so on), then the whole address can be designated as a simple attribute.

Single-Valued versus Multivalued Attributes. Most attributes have a single value for a particular entity; such attributes are called single-valued. For example, Age is a single-valued attribute of a person. In some cases an attribute can have a set of values for the same entity, for example, a Colors attribute for a car, or a CollegeDegrees attribute for a person. Cars with one color have a single value, whereas two-tone cars have two values for Colors. Similarly, one person may have no college degree, another person may have one, and a third person may have two or more degrees; therefore, different persons can have different numbers of values for the CollegeDegrees attribute. Such attributes are called multivalued. A multivalued attribute may have lower and upper bounds to constrain the number of values allowed for each individual entity.
For example, the Colors attribute of a car may have between one and three values, if we assume that a car can have at most three colors.

Stored versus Derived Attributes. In some cases, two (or more) attribute values are related, for example, the Age and BirthDate attributes of a person. For a particular person entity, the value of Age can be determined from the current (today's) date and the value of that person's BirthDate. The Age attribute is hence called a derived attribute and is said to be derivable from the BirthDate attribute, which is called a stored attribute. Some attribute values can be derived from related entities; for example, an attribute NumberOfEmployees of a department entity can be derived by counting the number of employees related to (working for) that department.

Null Values. In some cases a particular entity may not have an applicable value for an attribute. For example, the ApartmentNumber attribute of an address applies only to addresses that are in apartment buildings and not to other types of residences, such as single-family homes. Similarly, a CollegeDegrees attribute applies only to persons with college degrees. For such situations, a special value called null is created. An address of a single-family home would have null for its ApartmentNumber attribute, and a person with no college degree would have null for CollegeDegrees. Null can also be used if we do not know the value of an attribute for a particular entity, for example, if we do not know the home phone of "John Smith" in Figure 3.3. The meaning of the former type of null is not applicable, whereas the meaning of the latter is unknown. The "unknown" category of null can be further classified into two cases. The first case arises when it is known that the attribute value exists but is missing, for example, if the Height attribute of a person is listed as null.
The second case arises when it is not known whether the attribute value exists, for example, if the HomePhone attribute of a person is null.

Complex Attributes. Notice that composite and multivalued attributes can be nested in an arbitrary way. We can represent arbitrary nesting by grouping components of a composite attribute between parentheses () and separating the components with commas, and by displaying multivalued attributes between braces {}. Such attributes are called complex attributes. For example, if a person can have more than one residence and each residence can have multiple phones, an attribute AddressPhone for a person can be specified as shown in Figure 3.5.⁴

  {AddressPhone( {Phone(AreaCode, PhoneNumber)},
                 Address(StreetAddress(Number, Street, ApartmentNumber), City, State, Zip) )}

FIGURE 3.5 A complex attribute: AddressPhone.

3.3.2 Entity Types, Entity Sets, Keys, and Value Sets

Entity Types and Entity Sets. A database usually contains groups of entities that are similar. For example, a company employing hundreds of employees may want to store similar information concerning each of the employees. These employee entities share the same attributes, but each entity has its own value(s) for each attribute. An entity type defines a collection (or set) of entities that have the same attributes. Each entity type in the database is described by its name and attributes. Figure 3.6 shows two entity types, named EMPLOYEE and COMPANY, and a list of attributes for each. A few individual entities of each type are also illustrated, along with the values of their attributes. The collection of all entities of a particular entity type in the database at any point in time is called an entity set; the entity set is usually referred to using the same name as the entity type. For example, EMPLOYEE refers to both a type of entity and the current set of all employee entities in the database.
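The attribute categories and the distinction between an entity type and its entity set can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the class and field names (`Address`, `Person`, `person_set`) are hypothetical, and the birth date is invented so that the derived age matches the value 55 shown in Figure 3.3 for a chosen reference date. A missing multivalued attribute is modeled as an empty set and an unknown single value as `None`.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Address:
    """Composite attribute: each component has an independent meaning."""
    street_address: str
    city: str
    state: str
    zip_code: str
    apartment_number: Optional[str] = None  # null meaning "not applicable"


@dataclass
class Person:
    """Entity type: the attributes shared by every entity of this type."""
    name: str
    birth_date: date                                    # stored attribute
    address: Address                                    # composite attribute
    college_degrees: set = field(default_factory=set)   # multivalued attribute
    home_phone: Optional[str] = None                    # null meaning "unknown"

    def age(self, today: date) -> int:
        """Derived attribute: computed from birth_date, never stored."""
        had_birthday = (today.month, today.day) >= (
            self.birth_date.month, self.birth_date.day)
        return today.year - self.birth_date.year - (0 if had_birthday else 1)


# Entity set (extension): the entities of the type present at this moment.
# The birth date is a hypothetical value, not taken from the text.
person_set = [
    Person("John Smith", date(1948, 5, 1),
           Address("2311 Kirby", "Houston", "Texas", "77001")),
]
```

Keeping `age` as a method rather than a field mirrors the stored versus derived distinction: only BirthDate is kept, and Age is recomputed for whatever date is supplied.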
An entity type is represented in ER diagrams⁵ (see Figure 3.2) as a rectangular box enclosing the entity type name. Attribute names are enclosed in ovals and are attached to their entity type by straight lines. Composite attributes are attached to their component attributes by straight lines. Multivalued attributes are displayed in double ovals.

An entity type describes the schema or intension for a set of entities that share the same structure. The collection of entities of a particular entity type is grouped into an entity set, which is also called the extension of the entity type.

  ENTITY TYPE NAME:   EMPLOYEE                 COMPANY
                      Name, Age, Salary        Name, Headquarters, President
  ENTITY SET:         (John Smith, 55, 80K)    (Sunco Oil, Houston, John Smith)
  (EXTENSION)         (Fred Brown, 40, 30K)    (Fast Computer, Dallas, Bob King)
                      (Judy Clark, 25, 20K)

FIGURE 3.6 Two entity types, EMPLOYEE and COMPANY, and some member entities of each.

4. For those familiar with XML, we should note here that complex attributes are similar to complex elements in XML (see Chapter 26).

5. We are using a notation for ER diagrams that is close to the original proposed notation (Chen 1976). Unfortunately, many other notations are in use. We illustrate some of the other notations in Appendix A and later in this chapter when we present UML class diagrams.

Key Attributes of an Entity Type. An important constraint on the entities of an entity type is the key or uniqueness constraint on attributes. An entity type usually has an attribute whose values are distinct for each individual entity in the entity set. Such an attribute is called a key attribute, and its values can be used to identify each entity uniquely. For example, the Name attribute is a key of the COMPANY entity type in Figure 3.6, because no two companies are allowed to have the same name. For the PERSON entity type, a typical key attribute is SocialSecurityNumber.
Sometimes, several attributes together form a key, meaning that the combination of the attribute values must be distinct for each entity. If a set of attributes possesses this property, the proper way to represent this in the ER model that we describe here is to define a composite attribute and designate it as a key attribute of the entity type. Notice that such a composite key must be minimal; that is, all component attributes must be included in the composite attribute to have the uniqueness property.⁶ In ER diagrammatic notation, each key attribute has its name underlined inside the oval, as illustrated in Figure 3.2.

Specifying that an attribute is a key of an entity type means that the preceding uniqueness property must hold for every entity set of the entity type. Hence, it is a constraint that prohibits any two entities from having the same value for the key attribute at the same time. It is not the property of a particular extension; rather, it is a constraint on all extensions of the entity type. This key constraint (and other constraints we discuss later) is derived from the constraints of the miniworld that the database represents.

Some entity types have more than one key attribute. For example, each of the VehicleID and Registration attributes of the entity type CAR (Figure 3.7) is a key in its own right. The Registration attribute is an example of a composite key formed from two simple component attributes, RegistrationNumber and State, neither of which is a key on its own. An entity type may also have no key, in which case it is called a weak entity type (see Section 3.5).

6. Superfluous attributes must not be included in a key; however, a superkey may include superfluous attributes, as explained in Chapter 5.
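The uniqueness and minimality requirements can be illustrated with a short check against one extension of CAR. This is a sketch, not a definition of keys: the function names `is_unique` and `is_minimal_key` are hypothetical, the three entities are taken from Figure 3.7 with only the key-related attributes kept, and a test on a single extension can only refute a proposed key, since the key constraint must hold for every extension.

```python
from itertools import combinations

# Three CAR entities from Figure 3.7, reduced to the attributes involved in keys.
cars = [
    {"RegistrationNumber": "ABC 123", "State": "TEXAS", "VehicleID": "TK629"},
    {"RegistrationNumber": "ABC 123", "State": "NEW YORK", "VehicleID": "WP9872"},
    {"RegistrationNumber": "VSY 720", "State": "TEXAS", "VehicleID": "TD729"},
]


def is_unique(entities, attrs):
    """True if no two entities agree on all of the given attributes."""
    values = [tuple(e[a] for a in attrs) for e in entities]
    return len(values) == len(set(values))


def is_minimal_key(entities, attrs):
    """True if attrs is unique and no proper subset of attrs is unique."""
    if not is_unique(entities, attrs):
        return False
    return not any(
        is_unique(entities, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )
```

On this extension, `["VehicleID"]` and `["RegistrationNumber", "State"]` both pass `is_minimal_key`, while adding VehicleID to the registration pair fails the minimality test because VehicleID is already unique on its own.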
  CAR
  Registration(RegistrationNumber, State), VehicleID, Make, Model, Year, {Color}

  ((ABC 123, TEXAS), TK629, Ford Mustang, convertible, 1998, {red, black})
  ((ABC 123, NEW YORK), WP9872, Nissan Maxima, 4-door, 1999, {blue})
  ((VSY 720, TEXAS), TD729, Chrysler LeBaron, 4-door, 1995, {white, blue})

FIGURE 3.7 The CAR entity type with two key attributes, Registration and VehicleID.

Value Sets (Domains) of Attributes. Each simple attribute of an entity type is associated with a value set (or domain of values), which specifies the set of values that may be assigned to that attribute for each individual entity. In Figure 3.6, if the range of ages allowed for employees is between 16 and 70, we can specify the value set of the Age attribute of EMPLOYEE to be the set of integer numbers between 16 and 70. Similarly, we can specify the value set for the Name attribute as being the set of strings of alphabetic characters separated by blank characters, and so on. Value sets are not displayed in ER diagrams. Value sets are typically specified using the basic data types available in most programming languages, such as integer, string, boolean, float, enumerated type, subrange, and so on. Additional data types to represent date, time, and other concepts are also employed.

Mathematically, an attribute A of entity type E whose value set is V can be defined as a function from E to the power set⁷ P(V) of V:

$$A : E \rightarrow P(V)$$

We refer to the value of attribute A for entity e as A(e). The previous definition covers both single-valued and multivalued attributes, as well as nulls. A null value is represented by the empty set. For single-valued attributes, A(e) is restricted to being a singleton set for each entity e in E, whereas there is no restriction on multivalued attributes.⁸ For a composite attribute A, the value set V is the Cartesian product of $P(V_1), P(V_2), \ldots, P(V_n)$, where $V_1, V_2, \ldots, V_n$ are the value sets of the simple component attributes that form A:

$$V = P(V_1) \times P(V_2) \times \cdots \times P(V_n)$$

7. The power set P(V) of a set V is the set of all subsets of V.

8. A singleton set is a set with only one element (value).

3.3.3 Initial Conceptual Design of the COMPANY Database

We can now define the entity types for the COMPANY database, based on the requirements described in Section 3.2. After defining several entity types and their attributes here, we refine our design in Section 3.4 after we introduce the concept of a relationship. According to the requirements listed in Section 3.2, we can identify four entity types, one corresponding to each of the four items in the specification (see Figure 3.8):

1. An entity type DEPARTMENT with attributes Name, Number, Locations, Manager, and ManagerStartDate. Locations is the only multivalued attribute. We can specify that both Name and Number are (separate) key attributes, because each was specified to be unique.

2. An entity type PROJECT with attributes Name, Number, Location, and ControllingDepartment. Both Name and Number are (separate) key attributes.

3. An entity type EMPLOYEE with attributes Name, SSN (for social security number), Sex, Address, Salary, BirthDate, Department, and Supervisor. Both Name and Address may be composite attributes; however, this was not specified in the requirements. We must go back to the users to see if any of them will refer to the individual components of Name (FirstName, MiddleInitial, LastName) or of Address.

4. An entity type DEPENDENT with attributes Employee, DependentName, Sex, BirthDate, and Relationship (to the employee).
  DEPARTMENT
    Name, Number, {Locations}, Manager, ManagerStartDate
  PROJECT
    Name, Number, Location, ControllingDepartment
  EMPLOYEE
    Name(FName, MInit, LName), SSN, Sex, Address, Salary, BirthDate, Department, Supervisor, {WorksOn(Project, Hours)}
  DEPENDENT
    Employee, DependentName, Sex, BirthDate, Relationship

FIGURE 3.8 Preliminary design of entity types for the COMPANY database.

So far, we have not represented the fact that an employee can work on several projects, nor have we represented the number of hours per week an employee works on each project. This characteristic is listed as part of requirement 3 in Section 3.2, and it can be represented by a multivalued composite attribute of EMPLOYEE called WorksOn with the simple components (Project, Hours). Alternatively, it can be represented as a multivalued composite attribute of PROJECT called Workers with the simple components (Employee, Hours). We choose the first alternative in Figure 3.8, which shows each of the entity types just described. The Name attribute of EMPLOYEE is shown as a composite attribute, presumably after consultation with the users.

3.4 RELATIONSHIP TYPES, RELATIONSHIP SETS, ROLES, AND STRUCTURAL CONSTRAINTS

In Figure 3.8 there are several implicit relationships among the various entity types. In fact, whenever an attribute of one entity type refers to another entity type, some relationship exists. For example, the attribute Manager of DEPARTMENT refers to an employee who manages the department; the attribute ControllingDepartment of PROJECT refers to the department that controls the project; the attribute Supervisor of EMPLOYEE refers to another employee (the one who supervises this employee); the attribute Department of EMPLOYEE refers to the department for which the employee works; and so on. In the ER model, these references should not be represented as attributes but as relationships, which are discussed in this section. The COMPANY database schema will be refined in Section 3.6 to represent relationships explicitly. In the initial design of entity types, relationships are typically captured in the form of attributes. As the design is refined, these attributes get converted into relationships between entity types.

This section is organized as follows. Section 3.4.1 introduces the concepts of relationship types, relationship sets, and relationship instances. We then define the concepts of relationship degree, role names, and recursive relationships in Section 3.4.2, and discuss structural constraints on relationships, such as cardinality ratios and existence dependencies, in Section 3.4.3. Section 3.4.4 shows how relationship types can also have attributes.

3.4.1 Relationship Types, Sets, and Instances

A relationship type R among n entity types $E_1, E_2, \ldots, E_n$ defines a set of associations, or a relationship set, among entities from these entity types. As for the case of entity types and entity sets, a relationship type and its corresponding relationship set are customarily referred to by the same name, R. Mathematically, the relationship set R is a set of relationship instances $r_i$, where each $r_i$ associates n individual entities $(e_1, e_2, \ldots, e_n)$, and each entity $e_j$ in $r_i$ is a member of entity type $E_j$, $1 \le j \le n$. Hence, a relationship type is a mathematical relation on $E_1, E_2, \ldots, E_n$; alternatively, it can be defined as a subset of the Cartesian product $E_1 \times E_2 \times \cdots \times E_n$. Each of the entity types $E_1, E_2, \ldots, E_n$ is said to participate in the relationship type R; similarly, each of the individual entities $e_1, e_2, \ldots, e_n$ is said to participate in the relationship instance $r_i = (e_1, e_2, \ldots, e_n)$.

Informally, each relationship instance $r_i$ in R is an association of entities, where the association includes exactly one entity from each participating entity type.
Each such relationship instance $r_i$ represents the fact that the entities participating in $r_i$ are related in some way in the corresponding miniworld situation. For example, consider a relationship type WORKS_FOR between the two entity types EMPLOYEE and DEPARTMENT, which associates each employee with the department for which the employee works. Each relationship instance in the relationship set WORKS_FOR associates one employee entity and one department entity. Figure 3.9 illustrates this example, where each relationship instance $r_i$ is shown connected to the employee and department entities that participate in $r_i$. In the miniworld represented by Figure 3.9, employees e1, e3, and e6 work for department d1; e2 and e4 work for d2; and e5 and e7 work for d3.

In ER diagrams, relationship types are displayed as diamond-shaped boxes, which are connected by straight lines to the rectangular boxes representing the participating entity types. The relationship name is displayed in the diamond-shaped box (see Figure 3.2).

FIGURE 3.9 Some instances in the WORKS_FOR relationship set, which represents a relationship type WORKS_FOR between EMPLOYEE and DEPARTMENT.

3.4.2 Relationship Degree, Role Names, and Recursive Relationships

Degree of a Relationship Type. The degree of a relationship type is the number of participating entity types. Hence, the WORKS_FOR relationship is of degree two. A relationship type of degree two is called binary, and one of degree three is called ternary. An example of a ternary relationship is SUPPLY, shown in Figure 3.10, where each relationship instance $r_i$ associates three entities (a supplier s, a part p, and a project j) whenever s supplies part p to project j. Relationships can generally be of any degree, but the ones most common are binary relationships.
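The view of a relationship set as a subset of a Cartesian product, and of degree as the number of participating entity types, can be sketched directly with Python sets of tuples. The entity labels and the WORKS_FOR instances follow the miniworld of Figure 3.9; the function names `is_relationship_set` and `degree` are hypothetical, and reading the degree off the tuple length is an assumption of this sketch, since degree is properly a property of the relationship type rather than of its instances.

```python
from itertools import product

employees = {"e1", "e2", "e3", "e4", "e5", "e6", "e7"}
departments = {"d1", "d2", "d3"}

# WORKS_FOR relationship set: one (employee, department) tuple per instance.
works_for = {
    ("e1", "d1"), ("e3", "d1"), ("e6", "d1"),
    ("e2", "d2"), ("e4", "d2"),
    ("e5", "d3"), ("e7", "d3"),
}


def is_relationship_set(instances, *entity_sets):
    """True if every instance lies in the Cartesian product of the entity sets."""
    return set(instances) <= set(product(*entity_sets))


def degree(instances):
    """Number of participating entity types, taken from the tuple length."""
    lengths = {len(r) for r in instances}
    if len(lengths) != 1:
        raise ValueError("instances must be non-empty and share one arity")
    return lengths.pop()
```

A ternary relationship such as SUPPLY would use the same representation with three-element tuples (supplier, part, project), for which `degree` returns 3.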
Higher-degree relationships are generally more complex than binary relationships; we characterize them further in Section 4.7.

FIGURE 3.10 Some relationship instances in the SUPPLY ternary relationship set.

Relationships as Attributes. It is sometimes convenient to think of a relationship type in terms of attributes, as we discussed in Section 3.3.3. Consider the WORKS_FOR relationship type of Figure 3.9. One can think of an attribute called Department of the EMPLOYEE entity type whose value for each employee entity is (a reference to) the department entity that the employee works for. Hence, the value set for this Department attribute is the set of all DEPARTMENT entities, which is the DEPARTMENT entity set. This is what we did in Figure 3.8 when we specified the initial design of the entity type EMPLOYEE for the COMPANY database. However, when we think of a binary relationship as an attribute, we always have two options. In this example, the alternative is to think of a multivalued attribute Employees of the entity type DEPARTMENT whose value for each department entity is the set of employee entities who work for that department. The value set of this Employees attribute is the power set of the EMPLOYEE entity set. Either of these two attributes (Department of EMPLOYEE or Employees of DEPARTMENT) can represent the WORKS_FOR relationship type. If both are represented, they are constrained to be inverses of each other.⁹

Role Names and Recursive Relationships. Each entity type that participates in a relationship type plays a particular role in the relationship. The role name signifies the role that a participating entity from the entity type plays in each relationship instance, and helps to explain what the relationship means. For example, in the WORKS_FOR relationship type, EMPLOYEE plays the role of employee or worker and DEPARTMENT plays the role of department or employer.
Role names are not technically necessary in relationship types where all the participating entity types are distinct, since each participating entity type name can be used as the role name. However, in some cases the same entity type participates more than once in a relationship type in different roles. In such cases the role name becomes essential for distinguishing the meaning of each participation. Such relationship types are called recursive relationships. Figure 3.11 shows an example. The SUPERVISION relationship type relates an employee to a supervisor, where both employee and supervisor entities are members of the same EMPLOYEE entity type. Hence, the EMPLOYEE entity type participates twice in SUPERVISION: once in the role of supervisor (or boss), and once in the role of supervisee (or subordinate). Each relationship instance $r_i$ in SUPERVISION associates two employee entities $e_j$ and $e_k$, one of which plays the role of supervisor and the other the role of supervisee. In Figure 3.11, the lines marked "1" represent the supervisor role, and those marked "2" represent the supervisee role; hence, e1 supervises e2 and e3, e4 supervises e6 and e7, and e5 supervises e1 and e4.

FIGURE 3.11 A recursive relationship SUPERVISION between EMPLOYEE in the supervisor role (1) and EMPLOYEE in the subordinate role (2).

9. This concept of representing relationship types as attributes is used in a class of data models called functional data models. In object databases (see Chapter 20), relationships can be represented by reference attributes, either in one direction or in both directions as inverses. In relational databases (see Chapter 5), foreign keys are a type of reference attribute used to represent relationships.

3.4.3 Constraints on Relationship Types

Relationship types usually have certain constraints that limit the possible combinations of entities that may participate in the corresponding relationship set. These constraints are determined from the miniworld situation that the relationships represent. For example, in Figure 3.9, if the company has a rule that each employee must work for exactly one department, then we would like to describe this constraint in the schema. We can distinguish two main types of relationship constraints: cardinality ratio and participation.

Cardinality Ratios for Binary Relationships. The cardinality ratio for a binary relationship specifies the maximum number of relationship instances that an entity can participate in. For example, in the WORKS_FOR binary relationship type, DEPARTMENT:EMPLOYEE is of cardinality ratio 1:N, meaning that each department can be related to (that is, employs) any number of employees,¹⁰ but an employee can be related to (work for) only one department. The possible cardinality ratios for binary relationship types are 1:1, 1:N, N:1, and M:N.

An example of a 1:1 binary relationship is MANAGES (Figure 3.12), which relates a department entity to the employee who manages that department. This represents the miniworld constraints that, at any point in time, an employee can manage only one department and a department has only one manager. The relationship type WORKS_ON (Figure 3.13) is of cardinality ratio M:N, because the miniworld rule is that an employee can work on several projects and a project can have several employees.

Cardinality ratios for binary relationships are represented on ER diagrams by displaying 1, M, and N on the diamonds as shown in Figure 3.2.

10. N stands for any number of related entities (zero or more).

FIGURE 3.12 A 1:1 relationship, MANAGES.
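A declared cardinality ratio can be checked against a given relationship set by counting how many instances each entity participates in. The sketch below is one way to do this; the function names `max_participation` and `satisfies_ratio` and the string encoding of the ratio are assumptions made for illustration. The instances are the WORKS_FOR pairs of Figure 3.9, written in DEPARTMENT:EMPLOYEE order so that they line up with the 1:N ratio stated above. As with keys, a ratio constrains every extension, so passing the check on one extension does not establish the constraint.

```python
from collections import Counter

# WORKS_FOR instances from Figure 3.9 as (department, employee) pairs.
works_for = [
    ("d1", "e1"), ("d1", "e3"), ("d1", "e6"),
    ("d2", "e2"), ("d2", "e4"),
    ("d3", "e5"), ("d3", "e7"),
]


def max_participation(instances, position):
    """Largest number of instances any one entity at this position appears in."""
    counts = Counter(r[position] for r in instances)
    return max(counts.values(), default=0)


def satisfies_ratio(instances, ratio):
    """Check (left, right) pairs against a ratio such as "1:N" or "M:N".

    A "1" on the left means each right-hand entity is related to at most
    one left-hand entity, and symmetrically for a "1" on the right.
    """
    left, right = ratio.split(":")
    left_ok = left != "1" or max_participation(instances, 1) <= 1
    right_ok = right != "1" or max_participation(instances, 0) <= 1
    return left_ok and right_ok
```

With this data, `satisfies_ratio(works_for, "1:N")` holds because every employee appears in exactly one pair, while `"1:1"` fails because d1 appears in three pairs.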
FIGURE 3.13 An M:N relationship, WORKS_ON.

Participation Constraints and Existence Dependencies. The participation constraint specifies whether the existence of an entity depends on its being related to another entity via the relationship type. This constraint specifies the minimum number of relationship instances that each entity can participate in, and is sometimes called the minimum cardinality constraint. There are two types of participation constraints, total and partial, which we illustrate by example. If a company policy states that every employee must work for a department, then an employee entity can exist only if it participates in at least one WORKS_FOR relationship instance (Figure 3.9). Thus, the participation of EMPLOYEE in WORKS_FOR is called total participation, meaning that every entity in "the total set" of employee entities must be related to a department entity via WORKS_FOR. Total participation is also called existence dependency. In Figure 3.12 we do not expect every employee to manage a department, so the participation of EMPLOYEE in the MANAGES relationship type is partial, meaning that some or "part of the set of" employee entities are related to some department entity via MANAGES, but not necessarily all. We will refer to the cardinality ratio and participation constraints, taken together, as the structural constraints of a relationship type.

In ER diagrams, total participation (or existence dependency) is displayed as a double line connecting the participating entity type to the relationship, whereas partial participation is represented by a single line (see Figure 3.2).

3.4.4 Attributes of Relationship Types

Relationship types can also have attributes, similar to those of entity types.
For example, to record the number of hours per week that an employee works on a particular project, we can include an attribute Hours for the WORKS_ON relationship type of Figure 3.13. Another example is to include the date on which a manager started managing a department via an attribute StartDate for the MANAGES relationship type of Figure 3.12.

Notice that attributes of 1:1 or 1:N relationship types can be migrated to one of the participating entity types. For example, the StartDate attribute for the MANAGES relationship can be an attribute of either EMPLOYEE or DEPARTMENT, although conceptually it belongs to MANAGES. This is because MANAGES is a 1:1 relationship, so every department or employee entity participates in at most one relationship instance. Hence, the value of the StartDate attribute can be determined separately, either by the participating department entity or by the participating employee (manager) entity.

For a 1:N relationship type, a relationship attribute can be migrated only to the entity type on the N-side of the relationship. For example, in Figure 3.9, if the WORKS_FOR relationship also has an attribute StartDate that indicates when an employee started working for a department, this attribute can be included as an attribute of EMPLOYEE. This is because each employee works for only one department, and hence participates in at most one relationship instance in WORKS_FOR. In both 1:1 and 1:N relationship types, the decision as to where a relationship attribute should be placed (as a relationship type attribute or as an attribute of a participating entity type) is determined subjectively by the schema designer.

For M:N relationship types, some attributes may be determined by the combination of participating entities in a relationship instance, not by any single entity. Such attributes must be specified as relationship attributes. An example is the Hours attribute of the M:N relationship WORKS_ON (Figure 3.13); the number of hours an employee works on a project is determined by an employee-project combination and not separately by either entity.

3.5 WEAK ENTITY TYPES

Entity types that do not have key attributes of their own are called weak entity types. In contrast, regular entity types that do have a key attribute, which include all the examples we discussed so far, are called strong entity types. Entities belonging to a weak entity type are identified by being related to specific entities from another entity type in combination with one of their attribute values. We call this other entity type the identifying or owner entity type,¹¹ and we call the relationship type that relates a weak entity type to its owner the identifying relationship of the weak entity type.¹² A weak entity type always has a total participation constraint (existence dependency) with respect to its identifying relationship, because a weak entity cannot be identified without an owner entity. However, not every existence dependency results in a weak entity type. For example, a DRIVER_LICENSE entity cannot exist unless it is related to a PERSON entity, even though it has its own key (LicenseNumber) and hence is not a weak entity.

Consider the entity type DEPENDENT, related to EMPLOYEE, which is used to keep track of the dependents of each employee via a 1:N relationship (Figure 3.2). The attributes of DEPENDENT are Name (the first name of the dependent), BirthDate, Sex, and Relationship (to the employee). Two dependents of two distinct employees may, by chance, have the same values for Name, BirthDate, Sex, and Relationship, but they are still distinct entities. They are identified as distinct entities only after determining the particular employee entity to which each dependent is related. Each employee entity is said to own the dependent entities that are related to it.
A weak entity type normally has a partial key, which is the set of attributes that can uniquely identify weak entities that are related to the same owner entity.¹³ In our example, if we assume that no two dependents of the same employee ever have the same first name, the attribute Name of DEPENDENT is the partial key. In the worst case, a composite attribute of all the weak entity's attributes will be the partial key.

In ER diagrams, both a weak entity type and its identifying relationship are distinguished by surrounding their boxes and diamonds with double lines (see Figure 3.2). The partial key attribute is underlined with a dashed or dotted line.

Weak entity types can sometimes be represented as complex (composite, multivalued) attributes. In the preceding example, we could specify a multivalued attribute Dependents for EMPLOYEE, which is a composite attribute with component attributes Name, BirthDate, Sex, and Relationship. The choice of which representation to use is made by the database designer. One criterion that may be used is to choose the weak entity type representation if there are many attributes. If the weak entity participates independently in relationship types other than its identifying relationship type, then it should not be modeled as a complex attribute.

In general, any number of levels of weak entity types can be defined; an owner entity type may itself be a weak entity type. In addition, a weak entity type may have more than one identifying entity type and an identifying relationship type of degree higher than two, as we illustrate in Section 4.7.

11. The identifying entity type is also sometimes called the parent entity type or the dominant entity type.

12. The weak entity type is also sometimes called the child entity type or the subordinate entity type.

13. The partial key is sometimes called the discriminator.
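The way a weak entity is identified, through its owner's key combined with its own partial key, can be sketched as follows. Everything here is illustrative: the class `Dependent`, its field names, the SSN strings, and the sample values are hypothetical and do not come from the text. The sketch assumes, as the text does, that Name is the partial key of DEPENDENT and that SSN identifies the owner EMPLOYEE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependent:
    owner_ssn: str      # key of the owner EMPLOYEE (hypothetical values below)
    name: str           # partial key: unique only among one owner's dependents
    sex: str
    birth_date: str
    relationship: str

    def identity(self):
        """Full identification: owner key combined with the partial key."""
        return (self.owner_ssn, self.name)


def partial_key_holds(dependents):
    """True if no two dependents of the same owner share a name."""
    identities = [d.identity() for d in dependents]
    return len(identities) == len(set(identities))


# Two dependents of different employees with identical attribute values.
dependents = [
    Dependent("111111111", "Alice", "F", "1990-01-01", "Daughter"),
    Dependent("222222222", "Alice", "F", "1990-01-01", "Daughter"),
]
```

The two sample dependents agree on every attribute of their own, yet their `identity()` values differ because they belong to different owners, which is exactly why DEPENDENT needs its identifying relationship.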
3.6 REFINING THE ER DESIGN FOR THE COMPANY DATABASE

We can now refine the database design of Figure 3.8 by changing the attributes that represent relationships into relationship types. The cardinality ratio and participation constraint of each relationship type are determined from the requirements listed in Section 3.2. If some cardinality ratio or dependency cannot be determined from the requirements, the users must be questioned further to determine these structural constraints.

In our example, we specify the following relationship types:

1. MANAGES, a 1:1 relationship type between EMPLOYEE and DEPARTMENT. EMPLOYEE participation is partial. DEPARTMENT participation is not clear from the requirements. We question the users, who say that a department must have a manager at all times, which implies total participation.¹⁴ The attribute StartDate is assigned to this relationship type.
2. WORKS_FOR, a 1:N relationship type between DEPARTMENT and EMPLOYEE. Both participations are total.
3. CONTROLS, a 1:N relationship type between DEPARTMENT and PROJECT. The participation of PROJECT is total, whereas that of DEPARTMENT is determined to be partial, after consultation with the users indicates that some departments may control no projects.
4. SUPERVISION, a 1:N relationship type between EMPLOYEE (in the supervisor role) and EMPLOYEE (in the supervisee role). Both participations are determined to be partial, after the users indicate that not every employee is a supervisor and not every employee has a supervisor.
5. WORKS_ON, determined to be an M:N relationship type with attribute Hours, after the users indicate that a project can have several employees working on it. Both participations are determined to be total.
6. DEPENDENTS_OF, a 1:N relationship type between EMPLOYEE and DEPENDENT, which is also the identifying relationship for the weak entity type DEPENDENT. The participation of EMPLOYEE is partial, whereas that of DEPENDENT is total.

After specifying the above six relationship types, we remove from the entity types in Figure 3.8 all attributes that have been refined into relationships. These include Manager and ManagerStartDate from DEPARTMENT; ControllingDepartment from PROJECT; Department, Supervisor, and WorksOn from EMPLOYEE; and Employee from DEPENDENT. It is important to have the least possible redundancy when we design the conceptual schema of a database. If some redundancy is desired at the storage level or at the user view level, it can be introduced later, as discussed in Section 1.6.1.

14. The rules in the miniworld that determine the constraints are sometimes called the business rules, since they are determined by the "business" or organization that will utilize the database.

3.7 ER DIAGRAMS, NAMING CONVENTIONS, AND DESIGN ISSUES

3.7.1 Summary of Notation for ER Diagrams

Figures 3.9 through 3.13 illustrate examples of the participation of entity types in relationship types by displaying their extensions, that is, the individual entity instances and relationship instances in the entity sets and relationship sets. In ER diagrams the emphasis is on representing the schemas rather than the instances. This is more useful in database design because a database schema changes rarely, whereas the contents of the entity sets change frequently. In addition, the schema is usually easier to display than the extension of a database, because it is much smaller.

Figure 3.2 displays the COMPANY ER database schema as an ER diagram. We now review the full ER diagram notation. Entity types such as EMPLOYEE, DEPARTMENT, and PROJECT are shown in rectangular boxes. Relationship types such as WORKS_FOR, MANAGES, CONTROLS, and WORKS_ON are shown in diamond-shaped boxes attached to the participating entity types with straight lines.
Attributes are shown in ovals, and each attribute is attached by a straight line to its entity type or relationship type. Component attributes of a composite attribute are attached to the oval representing the composite attribute, as illustrated by the Name attribute of EMPLOYEE. Multivalued attributes are shown in double ovals, as illustrated by the Locations attribute of DEPARTMENT. Key attributes have their names underlined. Derived attributes are shown in dotted ovals, as illustrated by the NumberOfEmployees attribute of DEPARTMENT.

Weak entity types are distinguished by being placed in double rectangles and by having their identifying relationship placed in double diamonds, as illustrated by the DEPENDENT entity type and the DEPENDENTS_OF identifying relationship type. The partial key of the weak entity type is underlined with a dotted line.

In Figure 3.2 the cardinality ratio of each binary relationship type is specified by attaching a 1, M, or N on each participating edge. The cardinality ratio of DEPARTMENT:EMPLOYEE in MANAGES is 1:1, whereas it is 1:N for DEPARTMENT:EMPLOYEE in WORKS_FOR, and M:N for WORKS_ON. The participation constraint is specified by a single line for partial participation and by double lines for total participation (existence dependency).

In Figure 3.2 we show the role names for the SUPERVISION relationship type because the EMPLOYEE entity type plays both roles in that relationship. Notice that the cardinality is 1:N from supervisor to supervisee because each employee in the role of supervisee has at most one direct supervisor, whereas an employee in the role of supervisor can supervise zero or more employees.

Figure 3.14 summarizes the conventions for ER diagrams.

3.7.2 Proper Naming of Schema Constructs

When designing a database schema, the choice of names for entity types, attributes, relationship types, and (particularly) roles is not always straightforward.
One should choose names that convey, as much as possible, the meanings attached to the different constructs in the schema. We choose to use singular names for entity types, rather than plural ones, because the entity type name applies to each individual entity belonging to that entity type. In our ER diagrams, we will use the convention that entity type and relationship type names are in uppercase letters, attribute names are capitalized, and role names are in lowercase letters. We have already used this convention in Figure 3.2.

As a general practice, given a narrative description of the database requirements, the nouns appearing in the narrative tend to give rise to entity type names, and the verbs tend to indicate names of relationship types. Attribute names generally arise from additional nouns that describe the nouns corresponding to entity types.

Another naming consideration involves choosing binary relationship names to make the ER diagram of the schema readable from left to right and from top to bottom. We have generally followed this guideline in Figure 3.2. To explain this naming convention further, consider the one exception to it in Figure 3.2: the DEPENDENTS_OF relationship type, which reads from bottom to top. When we describe this relationship, we can say that the DEPENDENT entities (bottom entity type) are DEPENDENTS_OF (relationship name) an EMPLOYEE (top entity type). To change this to read from top to bottom, we could rename the relationship type to HAS_DEPENDENTS, which would then read as follows: An EMPLOYEE entity (top entity type) HAS_DEPENDENTS (relationship name) of type DEPENDENT (bottom entity type). Notice that this issue arises because each binary relationship can be described starting from either of the two participating entity types, as discussed in the beginning of Section 3.4.
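The letter-case convention stated in this section (entity type and relationship type names in uppercase, attribute names capitalized, role names in lowercase) is simple enough to check mechanically. The following Python sketch is illustrative only: the function name follows_convention and the regular expressions are our own assumptions about how one might encode the convention, and nothing in the ER model prescribes them.

```python
import re

# Assumed encodings of the letter-case convention used in Figure 3.2.
PATTERNS = {
    "entity_type": re.compile(r"[A-Z][A-Z0-9_]*"),        # e.g. EMPLOYEE
    "relationship_type": re.compile(r"[A-Z][A-Z0-9_]*"),  # e.g. WORKS_ON
    "attribute": re.compile(r"[A-Z][A-Za-z0-9]*"),        # e.g. BirthDate
    "role": re.compile(r"[a-z][a-z0-9_]*"),               # e.g. supervisor
}


def follows_convention(kind: str, name: str) -> bool:
    """Return True if the whole name matches the pattern assumed for its kind."""
    return PATTERNS[kind].fullmatch(name) is not None


print(follows_convention("entity_type", "EMPLOYEE"))
print(follows_convention("role", "Supervisor"))
```

A check of this kind covers only letter case. Whether a name conveys the intended meaning, or reads naturally from left to right and top to bottom, remains a judgment for the designer.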
3.7.3 Design Choices for ER Conceptual Design

It is occasionally difficult to decide whether a particular concept in the miniworld should be modeled as an entity type, an attribute, or a relationship type. In this section, we give some brief guidelines as to which construct should be chosen in particular situations.

FIGURE 3.14 Summary of the notation for ER diagrams. Constructs shown: entity; weak entity; relationship; identifying relationship; attribute; key attribute; multivalued attribute; composite attribute; derived attribute; total participation of E2 in R; cardinality ratio 1:N for E1:E2 in R; structural constraint (min, max) on participation of E in R.

In general, the schema design process should be considered an iterative refinement process, where an initial design is created and then iteratively refined until the most suitable design is reached. Some of the refinements that are often used include the following:

• A concept may be first modeled as an attribute and then refined into a relationship because it is determined that the attribute is a reference to another entity type. It is often the case that a pair of such attributes that are inverses of one another are refined into a binary relationship. We discussed this type of refinement in detail in Section 3.6.
• Similarly, an attribute that exists in several entity types may be elevated or promoted to an independent entity type. For example, suppose that several entity types in a UNIVERSITY database, such as STUDENT, INSTRUCTOR, and COURSE, each has an attribute Department in the initial design; the designer may then choose to create an entity type DEPARTMENT with a single attribute DeptName and relate it to the three entity types (STUDENT, INSTRUCTOR, and COURSE) via appropriate relationships.
Other attributes/relationships of DEPARTMENT may be discovered later.
• An inverse refinement to the previous case may be applied; for example, if an entity type DEPARTMENT exists in the initial design with a single attribute DeptName and is related to only one other entity type, STUDENT, then DEPARTMENT may be reduced or demoted to an attribute of STUDENT.
• In Chapter 4, we discuss other refinements concerning specialization/generalization and relationships of higher degree. Chapter 12 discusses additional top-down and bottom-up refinements that are common in large-scale conceptual schema design.

3.7.4 Alternative Notations for ER Diagrams

There are many alternative diagrammatic notations for displaying ER diagrams. Appendix A gives some of the more popular notations. In Section 3.8, we introduce the Unified Modeling Language (UML) notation for class diagrams, which has been proposed as a standard for conceptual object modeling.

In this section, we describe one alternative ER notation for specifying structural constraints on relationships. This notation involves associating a pair of integer numbers (min, max) with each participation of an entity type E in a relationship type R, where 0 ≤ min ≤ max and max ≥ 1. The numbers mean that for each entity e in E, e must participate in at least min and at most max relationship instances in R at any point in time. In this method, min = 0 implies partial participation, whereas min > 0 implies total participation.

Figure 3.15 displays the COMPANY database schema using the (min, max) notation.¹⁵ Usually, one uses either the cardinality ratio/single-line/double-line notation or the (min, max) notation. The (min, max) notation is more precise, and we can use it easily to specify structural constraints for relationship types of any degree. However, it is not sufficient for specifying some key constraints on higher-degree relationships, as discussed in Section 4.7.

Figure 3.15 also displays all the role names for the COMPANY database schema.

FIGURE 3.15 ER diagrams for the COMPANY schema, with structural constraints specified using (min, max) notation.

15. In some notations, particularly those used in object modeling methodologies such as UML, the (min, max) is placed on the opposite sides to the ones we have shown. For example, for the WORKS_FOR relationship in Figure 3.15, the (1,1) would be on the DEPARTMENT side, and the (4,N) would be on the EMPLOYEE side. Here we used the original notation from Abrial (1974).

3.8 NOTATION FOR UML CLASS DIAGRAMS

The UML methodology is being used extensively in software design and has many types of diagrams for various software design purposes. We only briefly present the basics of UML class diagrams here, and compare them with ER diagrams. In some ways, class diagrams can be considered as an alternative notation to ER diagrams. Additional UML notation and concepts are presented in Section 4.6, and in Chapter 12.

FIGURE 3.16 The COMPANY conceptual schema in UML class diagram notation.
Figure 3.16 shows how the COMPANY ER database schema of Figure 3.15 can be displayed using UML class diagram notation. The entity types in Figure 3.15 are modeled as classes in Figure 3.16. An entity in ER corresponds to an object in UML. In UML class diagrams, a class is displayed as a box (see Figure 3.16) that includes three sections: the top section gives the class name, the middle section includes the attributes for individual objects of the class, and the last section includes operations that can be applied to these objects. Operations are not specified in ER diagrams. Consider the EMPLOYEE class in Figure 3.16. Its attributes are Name, Ssn, Bdate, Sex, Address, and Salary. The designer can optionally specify the domain of an attribute if desired, by placing a colon (:) followed by the domain name or description, as illustrated by the Name, Sex, and Bdate attributes of EMPLOYEE in Figure 3.16. A composite attribute is modeled as a structured domain, as illustrated by the Name attribute of EMPLOYEE. A multivalued attribute will generally be modeled as a separate class, as illustrated by the LOCATION class in Figure 3.16.

Relationship types are called associations in UML terminology, and relationship instances are called links. A binary association (binary relationship type) is represented as a line connecting the participating classes (entity types), and may optionally have a name. A relationship attribute, called a link attribute, is placed in a box that is connected to the association's line by a dashed line. The (min, max) notation described in Section 3.7.4 is used to specify relationship constraints, which are called multiplicities in UML terminology. Multiplicities are specified in the form min..max, and an asterisk (*) indicates no maximum limit on participation.
However, the multiplicities are placed on the opposite ends of the relationship when compared with the notation discussed in Section 3.7.4 (compare Figures 3.16 and 3.15). In UML, a single asterisk indicates a multiplicity of 0..*, and a single 1 indicates a multiplicity of 1..1. A recursive relationship (see Section 3.4.2) is called a reflexive association in UML, and the role names, like the multiplicities, are placed at the opposite ends of an association when compared with the placing of role names in Figure 3.15.

In UML, there are two types of relationships: association and aggregation. Aggregation is meant to represent a relationship between a whole object and its component parts, and it has a distinct diagrammatic notation. In Figure 3.16, we modeled the locations of a department and the single location of a project as aggregations. However, aggregation and association do not have different structural properties, and the choice as to which type of relationship to use is somewhat subjective. In the ER model, both are represented as relationships.

UML also distinguishes between unidirectional and bidirectional associations (or aggregations). In the unidirectional case, the line connecting the classes is displayed with an arrow to indicate that only one direction for accessing related objects is needed. If no arrow is displayed, the bidirectional case is assumed, which is the default. For example, if we always expect to access the manager of a department starting from a DEPARTMENT object, we would draw the association line representing the MANAGES association with an arrow from DEPARTMENT to EMPLOYEE. In addition, relationship instances may be specified to be ordered. For example, we could specify that the employee objects related to each department through the WORKS_FOR association (relationship) should be ordered by their Bdate attribute value.
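The min..max multiplicity notation, together with the (min, max) semantics of Section 3.7.4, can be made concrete with a short sketch. The Python fragment below is a hedged illustration rather than any UML tool's behavior: parse_multiplicity, satisfies, and is_total are hypothetical helper names. Treating a bare number n as n..n generalizes the rule that a single 1 means 1..1, and that generalization is an assumption of this sketch.

```python
from typing import Optional, Tuple


def parse_multiplicity(text: str) -> Tuple[int, Optional[int]]:
    """Parse 'min..max' into (min, max); max is None when unbounded ('*')."""
    text = text.strip()
    if text == "*":                      # shorthand for 0..*
        low, high = 0, None
    elif ".." not in text:               # assumption: a bare n stands for n..n
        low = high = int(text)
    else:
        low_text, high_text = text.split("..", 1)
        low = int(low_text)
        high = None if high_text.strip() == "*" else int(high_text)
    # Enforce 0 <= min <= max and max >= 1, as required in Section 3.7.4.
    if low < 0 or (high is not None and (high < low or high < 1)):
        raise ValueError(f"invalid multiplicity: {text!r}")
    return (low, high)


def satisfies(link_count: int, multiplicity: str) -> bool:
    """Check whether an object with link_count links respects the multiplicity."""
    low, high = parse_multiplicity(multiplicity)
    return link_count >= low and (high is None or link_count <= high)


def is_total(multiplicity: str) -> bool:
    """min > 0 corresponds to total participation; min = 0 to partial."""
    return parse_multiplicity(multiplicity)[0] > 0


print(parse_multiplicity("4..*"))
print(satisfies(0, "0..1"), is_total("1..1"))
```

Note that the sketch says nothing about which end of the association a multiplicity is drawn on; as discussed above, UML and the (min, max) notation of Figure 3.15 differ on that point.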
Association (relationship) names are optional in UML, and relationship attributes are displayed in a box attached with a dashed line to the line representing the association/aggregation (see StartDate and Hours in Figure 3.16).

The operations given in each class are derived from the functional requirements of the application, as we discussed in Section 3.1. It is generally sufficient to specify the operation names initially for the logical operations that are expected to be applied to individual objects of a class, as shown in Figure 3.16. As the design is refined, more details are added, such as the exact argument types (parameters) for each operation, plus a functional description of each operation. UML has function descriptions and sequence diagrams to specify some of the operation details, but these are beyond the scope of our discussion. Chapter 12 will introduce some of these diagrams.

Weak entities can be modeled using the construct called qualified association (or qualified aggregation) in UML; this can represent both the identifying relationship and the partial key, which is placed in a box attached to the owner class. This is illustrated by the DEPENDENT class and its qualified aggregation to EMPLOYEE in Figure 3.16. The partial key DependentName is called the discriminator in UML terminology, since its value distinguishes the objects associated with (related to) the same EMPLOYEE. Qualified associations are not restricted to modeling weak entities, and they can be used to model other situations in UML.

3.9 SUMMARY

In this chapter we presented the modeling concepts of a high-level conceptual data model, the Entity-Relationship (ER) model. We started by discussing the role that a high-level data model plays in the database design process, and then we presented an example set of database requirements for the COMPANY database, which is one of the examples that is used throughout this book.
We then defined the basic ER model concepts of entities and their attributes. We discussed null values and presented the various types of attributes, which can be nested arbitrarily to produce complex attributes:

• Simple or atomic
• Composite
• Multivalued

We also briefly discussed stored versus derived attributes. We then discussed the ER model concepts at the schema or "intension" level:

• Entity types and their corresponding entity sets
• Key attributes of entity types
• Value sets (domains) of attributes
• Relationship types and their corresponding relationship sets
• Participation roles of entity types in relationship types

We presented two methods for specifying the structural constraints on relationship types. The first method distinguished two types of structural constraints:

• Cardinality ratios (1:1, 1:N, M:N for binary relationships)
• Participation constraints (total, partial)

We noted that, alternatively, another method of specifying structural constraints is to specify minimum and maximum numbers (min, max) on the participation of each entity type in a relationship type. We discussed weak entity types and the related concepts of owner entity types, identifying relationship types, and partial key attributes.

Entity-Relationship schemas can be represented diagrammatically as ER diagrams. We showed how to design an ER schema for the COMPANY database by first defining the entity types and their attributes and then refining the design to include relationship types. We displayed the ER diagram for the COMPANY database schema. Finally, we discussed some of the basic concepts of UML class diagrams and how they relate to ER model concepts.

The ER modeling concepts we have presented thus far (entity types, relationship types, attributes, keys, and structural constraints) can model traditional business data-processing database applications. However, many newer, more complex applications, such as engineering design, medical information systems, or telecommunications, require additional concepts if we want to model them with greater accuracy. We discuss these advanced modeling concepts in Chapter 4. We also describe ternary and higher-degree relationship types in more detail in Chapter 4, and discuss the circumstances under which they are distinguished from binary relationships.

Review Questions

3.1. Discuss the role of a high-level data model in the database design process.
3.2. List the various cases where use of a null value would be appropriate.
3.3. Define the following terms: entity, attribute, attribute value, relationship instance, composite attribute, multivalued attribute, derived attribute, complex attribute, key attribute, value set (domain).
3.4. What is an entity type? What is an entity set? Explain the differences among an entity, an entity type, and an entity set.
3.5. Explain the difference between an attribute and a value set.
3.6. What is a relationship type? Explain the differences among a relationship instance, a relationship type, and a relationship set.
3.7. What is a participation role? When is it necessary to use role names in the description of relationship types?
3.8. Describe the two alternatives for specifying structural constraints on relationship types. What are the advantages and disadvantages of each?
3.9. Under what conditions can an attribute of a binary relationship type be migrated to become an attribute of one of the participating entity types?
3.10. When we think of relationships as attributes, what are the value sets of these attributes? What class of data models is based on this concept?
3.11. What is meant by a recursive relationship type? Give some examples of recursive relationship types.
3.12. When is the concept of a weak entity used in data modeling?
Define the terms owner entity type, weak entity type, identifying relationship type, and partial key.
3.13. Can an identifying relationship of a weak entity type be of a degree greater than two? Give examples to illustrate your answer.
3.14. Discuss the conventions for displaying an ER schema as an ER diagram.
3.15. Discuss the naming conventions used for ER schema diagrams.

Exercises

3.16. Consider the following set of requirements for a university database that is used to keep track of students' transcripts. This is similar but not identical to the database shown in Figure 1.2:
a. The university keeps track of each student's name, student number, social security number, current address and phone, permanent address and phone, birthdate, sex, class (freshman, sophomore, ..., graduate), major department, minor department (if any), and degree program (B.A., B.S., ..., Ph.D.). Some user applications need to refer to the city, state, and zip code of the student's permanent address and to the student's last name. Both social security number and student number have unique values for each student.
b. Each department is described by a name, department code, office number, office phone, and college. Both name and code have unique values for each department.
c. Each course has a course name, description, course number, number of semester hours, level, and offering department. The value of the course number is unique for each course.
d. Each section has an instructor, semester, year, course, and section number. The section number distinguishes sections of the same course that are taught during the same semester/year; its values are 1, 2, 3, ..., up to the number of sections taught during each semester.
e. A grade report has a student, section, letter grade, and numeric grade (0, 1, 2, 3, or 4).
Design an ER schema for this application, and draw an ER diagram for that schema.
Specify key attributes of each entity type, and structural constraints on each relationship type. Note any unspecified requirements, and make appropriate assumptions to make the specification complete.
3.17. Composite and multivalued attributes can be nested to any number of levels. Suppose we want to design an attribute for a STUDENT entity type to keep track of previous college education. Such an attribute will have one entry for each college previously attended, and each such entry will be composed of college name, start and end dates, degree entries (degrees awarded at that college, if any), and transcript entries (courses completed at that college, if any). Each degree entry contains the degree name and the month and year the degree was awarded, and each transcript entry contains a course name, semester, year, and grade. Design an attribute to hold this information. Use the conventions of Figure 3.5.
3.18. Show an alternative design for the attribute described in Exercise 3.17 that uses only entity types (including weak entity types, if needed) and relationship types.
3.19. Consider the ER diagram of Figure 3.17, which shows a simplified schema for an airline reservations system. Extract from the ER diagram the requirements and constraints that produced this schema. Try to be as precise as possible in your requirements and constraints specification.
3.20. In Chapters 1 and 2, we discussed the database environment and database users. We can consider many entity types to describe such an environment, such as DBMS, stored database, DBA, and catalog/data dictionary. Try to specify all the entity types that can fully describe a database system and its environment; then specify the relationship types among them, and draw an ER diagram to describe such a general database environment.
3.21. Design an ER schema for keeping track of information about votes taken in the U.S. House of Representatives during the current two-year congressional session.
The database needs to keep track of each U.S. STATE's Name (e.g., Texas, New York, California) and include the Region of the state (whose domain is {Northeast, Midwest, Southeast, Southwest, West}). Each CONGRESS_PERSON in the House of Representatives is described by his or her Name, plus the District represented, the StartDate when the congressperson was first elected, and the political Party to which he or she belongs (whose domain is {Republican, Democrat, Independent, Other}). The database keeps track of each BILL (i.e., proposed law), including the BillName, the DateOfVote on the bill, whether the bill PassedOrFailed (whose domain is {Yes, No}), and the Sponsor (the congressperson(s) who sponsored, that is, proposed, the bill). The database keeps track of how each congressperson voted on each bill (domain of vote attribute is {Yes, No, Abstain, Absent}). Draw an ER schema diagram for this application. State clearly any assumptions you make.

FIGURE 3.17 An ER diagram for an AIRLINE database schema. Notes: (1) A leg (segment) is a nonstop portion of a flight. (2) A leg instance is a particular occurrence of a leg on a particular date.

3.22. A database is being constructed to keep track of the teams and games of a sports league. A team has a number of players, not all of whom participate in each game. It is desired to keep track of the players participating in each game for each team, the positions they played in that game, and the result of the game. Design an ER schema diagram for this application, stating any assumptions you make. Choose your favorite sport (e.g., soccer, baseball, football).
3.23. Consider the ER diagram shown in Figure 3.18 for part of a BANK database. Each bank can have multiple branches, and each branch can have multiple accounts and loans.
a. List the (nonweak) entity types in the ER diagram.
b. Is there a weak entity type? If so, give its name, partial key, and identifying relationship.
c. What constraints do the partial key and the identifying relationship of the weak entity type specify in this diagram?
d. List the names of all relationship types, and specify the (min, max) constraint on each participation of an entity type in a relationship type. Justify your choices.
e. List concisely the user requirements that led to this ER schema design.
f. Suppose that every customer must have at least one account but is restricted to at most two loans at a time, and that a bank branch cannot have more than 1000 loans. How does this show up on the (min, max) constraints?

FIGURE 3.18 An ER diagram for a BANK database schema.

3.24. Consider the ER diagram in Figure 3.19. Assume that an employee may work in up to two departments or may not be assigned to any department. Assume that each department must have one and may have up to three phone numbers. Supply (min, max) constraints on this diagram. State clearly any additional assumptions you make. Under what conditions would the relationship HAS_PHONE be redundant in this example?
3.25. Consider the ER diagram in Figure 3.20. Assume that a course may or may not use a textbook, but that a text by definition is a book that is used in some course. A course may not use more than five books. Instructors teach from two to four courses. Supply (min, max) constraints on this diagram. State clearly any additional assumptions you make. If we add the relationship ADOPTS between INSTRUCTOR and TEXT, what (min, max) constraints would you put on it? Why?
3.26. Consider an entity type SECTION in a UNIVERSITY database, which describes the section offerings of courses.
The attributes of SECTION are SectionNumber, Semester, Year, CourseNumber, Instructor, RoomNo (where section is taught), Building (where section is taught), Weekdays (domain is the possible combinations of weekdays in which a section can be offered {MWF, MW, TT, etc.}), and Hours (domain is all possible time periods during which sections are offered {9-9:50 A.M., 10-10:50 A.M., ..., 3:30-4:50 P.M., 5:30-6:20 P.M., etc.}). Assume that SectionNumber is unique for each course within a particular semester/year combination (that is, if a course is offered multiple times during a particular semester, its section offerings are numbered 1, 2, 3, etc.). There are several composite keys for SECTION, and some attributes are components of more than one key. Identify three composite keys, and show how they can be represented in an ER schema diagram.

FIGURE 3.19 Part of an ER diagram for a COMPANY database.

FIGURE 3.20 Part of an ER diagram for a COURSES database.

Selected Bibliography

The Entity-Relationship model was introduced by Chen (1976), and related work appears in Schmidt and Swenson (1975), Wiederhold and Elmasri (1979), and Senko (1975). Since then, numerous modifications to the ER model have been suggested. We have incorporated some of these in our presentation. Structural constraints on relationships are discussed in Abrial (1974), Elmasri and Wiederhold (1980), and Lenzerini and Santucci (1983). Multivalued and composite attributes are incorporated in the ER model in Elmasri et al. (1985). Although we did not discuss languages for the entity-relationship model and its extensions, there have been several proposals for such languages. Elmasri and Wiederhold (1981) proposed the GORDAS query language for the ER model. Another ER query language was proposed by Markowitz and Raz (1983). Senko (1980) presented a query language for Senko's DIAM model.
A formal set of operations called the ER algebra was presented by Parent and Spaccapietra (1985). Gogolla and Hohenstein (1991) presented another formal language for the ER model. Campbell et al. (1985) presented a set of ER operations and showed that they are relationally complete. A conference for the dissemination of research results related to the ER model has been held regularly since 1979. The conference, now known as the International Conference on Conceptual Modeling, has been held in Los Angeles (ER 1979, ER 1983, ER 1997), Washington, D.C. (ER 1981), Chicago (ER 1985), Dijon, France (ER 1986), New York City (ER 1987), Rome (ER 1988), Toronto (ER 1989), Lausanne, Switzerland (ER 1990), San Mateo, California (ER 1991), Karlsruhe, Germany (ER 1992), Arlington, Texas (ER 1993), Manchester, England (ER 1994), Brisbane, Australia (ER 1995), Cottbus, Germany (ER 1996), Singapore (ER 1998), Salt Lake City, Utah (ER 1999), Yokohama, Japan (ER 2001), and Tampere, Finland (ER 2002). The next conference is scheduled for Chicago in October 2003.

Enhanced Entity-Relationship and UML Modeling

The ER modeling concepts discussed in Chapter 3 are sufficient for representing many database schemas for "traditional" database applications, which mainly include data-processing applications in business and industry. Since the late 1970s, however, designers of database applications have tried to design more accurate database schemas that reflect the data properties and constraints more precisely. This was particularly important for newer applications of database technology, such as databases for engineering design and manufacturing (CAD/CAM¹), telecommunications, complex software systems, and Geographic Information Systems (GIS), among many other applications. These types of databases have more complex requirements than do the more traditional applications. This led to the development of additional semantic data modeling concepts that were incorporated into conceptual data models such as the ER model. Various semantic data models have been proposed in the literature. Many of these concepts were also developed independently in related areas of computer science, such as the knowledge representation area of artificial intelligence and the object modeling area in software engineering.

In this chapter, we describe features that have been proposed for semantic data models, and show how the ER model can be enhanced to include these concepts, leading to the enhanced ER, or EER, model.² We start in Section 4.1 by incorporating the concepts of class/subclass relationships and type inheritance into the ER model. Then, in Section 4.2, we add the concepts of specialization and generalization. Section 4.3 discusses the various types of constraints on specialization/generalization, and Section 4.4 shows how the UNION construct can be modeled by including the concept of category in the EER model. Section 4.5 gives an example UNIVERSITY database schema in the EER model and summarizes the EER model concepts by giving formal definitions. We then present the UML class diagram notation and concepts for representing specialization and generalization in Section 4.6, and briefly compare these with EER notation and concepts. This is a continuation of Section 3.8, which presented basic UML class diagram notation. Section 4.7 discusses some of the more complex issues involved in modeling of ternary and higher-degree relationships. In Section 4.8, we discuss the fundamental abstractions that are used as the basis of many semantic data models. Section 4.9 summarizes the chapter.

1. CAD/CAM stands for computer-aided design/computer-aided manufacturing.
2. EER has also been used to stand for Extended ER model.

For a detailed introduction to conceptual modeling, Chapter 4 should be considered a continuation of Chapter 3.
However, if only a basic introduction to ER modeling is desired, this chapter may be omitted. Alternatively, the reader may choose to skip some or all of the later sections of this chapter (Sections 4.4 through 4.8).

4.1 SUBCLASSES, SUPERCLASSES, AND INHERITANCE

The EER (Enhanced ER) model includes all the modeling concepts of the ER model that were presented in Chapter 3. In addition, it includes the concepts of subclass and superclass and the related concepts of specialization and generalization (see Sections 4.2 and 4.3). Another concept included in the EER model is that of a category or union type (see Section 4.4), which is used to represent a collection of objects that is the union of objects of different entity types. Associated with these concepts is the important mechanism of attribute and relationship inheritance. Unfortunately, no standard terminology exists for these concepts, so we use the most common terminology. Alternative terminology is given in footnotes. We also describe a diagrammatic technique for displaying these concepts when they arise in an EER schema. We call the resulting schema diagrams enhanced ER or EER diagrams.

The first EER model concept we take up is that of a subclass of an entity type. As we discussed in Chapter 3, an entity type is used to represent both a type of entity and the entity set or collection of entities of that type that exist in the database. For example, the entity type EMPLOYEE describes the type (that is, the attributes and relationships) of each employee entity, and also refers to the current set of EMPLOYEE entities in the COMPANY database. In many cases an entity type has numerous subgroupings of its entities that are meaningful and need to be represented explicitly because of their significance to the database application.
For example, the entities that are members of the EMPLOYEE entity type may be grouped further into SECRETARY, ENGINEER, MANAGER, TECHNICIAN, SALARIED_EMPLOYEE, HOURLY_EMPLOYEE, and so on. The set of entities in each of the latter groupings is a subset of the entities that belong to the EMPLOYEE entity set, meaning that every entity that is a member of one of these subgroupings is also an employee. We call each of these subgroupings a subclass of the EMPLOYEE entity type, and the EMPLOYEE entity type is called the superclass for each of these subclasses. Figure 4.1 shows how to diagrammatically represent these concepts in EER diagrams.

FIGURE 4.1 EER diagram notation to represent subclasses and specialization. Three specializations of EMPLOYEE: {SECRETARY, TECHNICIAN, ENGINEER}, {MANAGER}, and {HOURLY_EMPLOYEE, SALARIED_EMPLOYEE}.

We call the relationship between a superclass and any one of its subclasses a superclass/subclass or simply class/subclass relationship.³ In our previous example, EMPLOYEE/SECRETARY and EMPLOYEE/TECHNICIAN are two class/subclass relationships. Notice that a member entity of the subclass represents the same real-world entity as some member of the superclass; for example, a SECRETARY entity 'Joan Logano' is also the EMPLOYEE 'Joan Logano'. Hence, the subclass member is the same as the entity in the superclass, but in a distinct specific role. When we implement a superclass/subclass relationship in the database system, however, we may represent a member of the subclass as a distinct database object (say, a distinct record that is related via the key attribute to its superclass entity).

3. A class/subclass relationship is often called an IS-A (or IS-AN) relationship because of the way we refer to the concept. We say "a SECRETARY is an EMPLOYEE," "a TECHNICIAN is an EMPLOYEE," and so on.
In Section 7.2, we discuss various options for representing superclass/subclass relationships in relational databases.

An entity cannot exist in the database merely by being a member of a subclass; it must also be a member of the superclass. Such an entity can be included optionally as a member of any number of subclasses. For example, a salaried employee who is also an engineer belongs to the two subclasses ENGINEER and SALARIED_EMPLOYEE of the EMPLOYEE entity type. However, it is not necessary that every entity in a superclass be a member of some subclass.

An important concept associated with subclasses is that of type inheritance. Recall that the type of an entity is defined by the attributes it possesses and the relationship types in which it participates. Because an entity in the subclass represents the same real-world entity from the superclass, it should possess values for its specific attributes as well as values of its attributes as a member of the superclass. We say that an entity that is a member of a subclass inherits all the attributes of the entity as a member of the superclass. The entity also inherits all the relationships in which the superclass participates. Notice that a subclass, with its own specific (or local) attributes and relationships together with all the attributes and relationships it inherits from the superclass, can be considered an entity type in its own right.⁴

4.2 SPECIALIZATION AND GENERALIZATION

4.2.1 Specialization

Specialization is the process of defining a set of subclasses of an entity type; this entity type is called the superclass of the specialization. The set of subclasses that form a specialization is defined on the basis of some distinguishing characteristic of the entities in the superclass. For example, the set of subclasses {SECRETARY, ENGINEER, TECHNICIAN} is a specialization of the superclass EMPLOYEE that distinguishes among employee entities based on the job type of each employee entity.
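As a rough illustration of subclass membership and type inheritance, the following Python sketch models entity sets as dictionaries. It is only a sketch under our own assumptions: the key values, the Name and Salary attributes, the employees other than 'Joan Logano', and the function names are hypothetical and are not taken from Figure 4.1.

```python
# Hypothetical in-memory model: each entity set is a dict keyed by an
# assumed key attribute value (for example, an Ssn).
EMPLOYEE = {
    "111": {"Name": "Joan Logano", "Salary": 40000},
    "222": {"Name": "Ali Khan", "Salary": 55000},
    "333": {"Name": "Mei Chen", "Salary": 60000},
}

# Each subclass stores only its specific (local) attributes and is
# related to its superclass entity through the same key value.
SECRETARY = {"111": {"TypingSpeed": 70}}
ENGINEER = {"222": {"EngType": "Civil"}}


def is_valid_subclass(subclass, superclass):
    """Every member of the subclass must also be a member of the superclass."""
    return set(subclass) <= set(superclass)


def full_entity(key, subclass, superclass):
    """Return the inherited superclass attributes merged with the local ones."""
    return {**superclass[key], **subclass[key]}
```

Under these assumptions, full_entity("111", SECRETARY, EMPLOYEE) yields the inherited Name and Salary values together with the specific attribute TypingSpeed, and employee "333" illustrates that a superclass entity need not belong to any subclass.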
We may have several specializations of the same entity type based on different distinguishing characteristics. For example, another specialization of the EMPLOYEE entity type may yield the set of subclasses {SALARIED_EMPLOYEE, HOURLY_EMPLOYEE}; this specialization distinguishes among employees based on the method of pay.

Figure 4.1 shows how we represent a specialization diagrammatically in an EER diagram. The subclasses that define a specialization are attached by lines to a circle that represents the specialization, which is connected to the superclass. The subset symbol on each line connecting a subclass to the circle indicates the direction of the superclass/subclass relationship.⁵ Attributes that apply only to entities of a particular subclass, such as TypingSpeed of SECRETARY, are attached to the rectangle representing that subclass. These are called specific attributes (or local attributes) of the subclass. Similarly, a subclass can participate in specific relationship types, such as the HOURLY_EMPLOYEE subclass participating in the BELONGS_TO relationship in Figure 4.1. We will explain the d symbol in the circles of Figure 4.1 and additional EER diagram notation shortly.

4. In some object-oriented programming languages, a common restriction is that an entity (or object) has only one type. This is generally too restrictive for conceptual database modeling.
5. There are many alternative notations for specialization; we present the UML notation in Section 4.6 and other proposed notations in Appendix A.

Figure 4.2 shows a few entity instances that belong to subclasses of the {SECRETARY, ENGINEER, TECHNICIAN} specialization.
Again, notice that an entity that belongs to a subclass represents the same real-world entity as the entity connected to it in the EMPLOYEE superclass, even though the same entity is shown twice; for example, e1 is shown in both EMPLOYEE and SECRETARY in Figure 4.2. As this figure suggests, a superclass/subclass relationship such as EMPLOYEE/SECRETARY somewhat resembles a 1:1 relationship at the instance level (see Figure 3.12). The main difference is that in a 1:1 relationship two distinct entities are related, whereas in a superclass/subclass relationship the entity in the subclass is the same real-world entity as the entity in the superclass but is playing a specialized role; for example, an EMPLOYEE specialized in the role of SECRETARY, or an EMPLOYEE specialized in the role of TECHNICIAN.

FIGURE 4.2 Instances of a specialization.

There are two main reasons for including class/subclass relationships and specializations in a data model. The first is that certain attributes may apply to some but not all entities of the superclass. A subclass is defined in order to group the entities to which these attributes apply. The members of the subclass may still share the majority of their attributes with the other members of the superclass. For example, in Figure 4.1 the SECRETARY subclass has the specific attribute TypingSpeed, whereas the ENGINEER subclass has the specific attribute EngType, but SECRETARY and ENGINEER share their other inherited attributes from the EMPLOYEE entity type.

The second reason for using subclasses is that some relationship types may be participated in only by entities that are members of the subclass.
For example, if only HOURLY_EMPLOYEES can belong to a trade union, we can represent that fact by creating the subclass HOURLY_EMPLOYEE of EMPLOYEE and relating the subclass to an entity type TRADE_UNION via the BELONGS_TO relationship type, as illustrated in Figure 4.1.

In summary, the specialization process allows us to do the following:
• Define a set of subclasses of an entity type
• Establish additional specific attributes with each subclass
• Establish additional specific relationship types between each subclass and other entity types or other subclasses

4.2.2 Generalization

We can think of a reverse process of abstraction in which we suppress the differences among several entity types, identify their common features, and generalize them into a single superclass of which the original entity types are special subclasses. For example, consider the entity types CAR and TRUCK shown in Figure 4.3a. Because they have several common attributes, they can be generalized into the entity type VEHICLE, as shown in Figure 4.3b. Both CAR and TRUCK are now subclasses of the generalized superclass VEHICLE. We use the term generalization to refer to the process of defining a generalized entity type from the given entity types.

Notice that the generalization process can be viewed as being functionally the inverse of the specialization process. Hence, in Figure 4.3 we can view {CAR, TRUCK} as a specialization of VEHICLE, rather than viewing VEHICLE as a generalization of CAR and TRUCK. Similarly, in Figure 4.1 we can view EMPLOYEE as a generalization of SECRETARY, TECHNICIAN, and ENGINEER. A diagrammatic notation to distinguish between generalization and specialization is used in some design methodologies. An arrow pointing to the generalized superclass represents a generalization, whereas arrows pointing to the specialized subclasses represent a specialization.
We will not use this notation, because the decision as to which process is more appropriate in a particular situation is often subjective. Appendix A gives some of the suggested alternative diagrammatic notations for schema diagrams and class diagrams.

FIGURE 4.3 Generalization. (a) Two entity types, CAR and TRUCK. (b) Generalizing CAR and TRUCK into the superclass VEHICLE.

So far we have introduced the concepts of subclasses and superclass/subclass relationships, as well as the specialization and generalization processes. In general, a superclass or subclass represents a collection of entities of the same type and hence also describes an entity type; that is why superclasses and subclasses are shown in rectangles in EER diagrams, like entity types. We next discuss in more detail the properties of specializations and generalizations.

4.3 CONSTRAINTS AND CHARACTERISTICS OF SPECIALIZATION AND GENERALIZATION

We first discuss constraints that apply to a single specialization or a single generalization. For brevity, our discussion refers only to specialization even though it applies to both specialization and generalization. We then discuss differences between specialization/generalization lattices (multiple inheritance) and hierarchies (single inheritance), and elaborate on the differences between the specialization and generalization processes during conceptual database schema design.

4.3.1 Constraints on Specialization and Generalization

In general, we may have several specializations defined on the same entity type (or superclass), as shown in Figure 4.1. In such a case, entities may belong to subclasses in each of the specializations.
However, a specialization may also consist of a single subclass only, such as the {MANAGER} specialization in Figure 4.1; in such a case, we do not use the circle notation.

In some specializations we can determine exactly the entities that will become members of each subclass by placing a condition on the value of some attribute of the superclass. Such subclasses are called predicate-defined (or condition-defined) subclasses. For example, if the EMPLOYEE entity type has an attribute JobType, as shown in Figure 4.4, we can specify the condition of membership in the SECRETARY subclass by the condition (JobType = 'Secretary'), which we call the defining predicate of the subclass. This condition is a constraint specifying that exactly those entities of the EMPLOYEE entity type whose attribute value for JobType is 'Secretary' belong to the subclass. We display a predicate-defined subclass by writing the predicate condition next to the line that connects the subclass to the specialization circle.

If all subclasses in a specialization have their membership condition on the same attribute of the superclass, the specialization itself is called an attribute-defined specialization, and the attribute is called the defining attribute of the specialization.⁶ We display an attribute-defined specialization by placing the defining attribute name next to the arc from the circle to the superclass, as shown in Figure 4.4.

FIGURE 4.4 EER diagram notation for an attribute-defined specialization on JobType.

6. Such an attribute is called a discriminator in UML terminology.

When we do not have a condition for determining membership in a subclass, the subclass is called user-defined.
Membership in such a subclass is determined by the database users when they apply the operation to add an entity to the subclass; hence, membership is specified individually for each entity by the user, not by any condition that may be evaluated automatically.

Two other constraints may apply to a specialization. The first is the disjointness constraint, which specifies that the subclasses of the specialization must be disjoint. This means that an entity can be a member of at most one of the subclasses of the specialization. A specialization that is attribute-defined implies the disjointness constraint if the attribute used to define the membership predicate is single-valued. Figure 4.4 illustrates this case, where the d in the circle stands for disjoint. We also use the d notation to specify the constraint that user-defined subclasses of a specialization must be disjoint, as illustrated by the specialization {HOURLY_EMPLOYEE, SALARIED_EMPLOYEE} in Figure 4.1. If the subclasses are not constrained to be disjoint, their sets of entities may overlap; that is, the same (real-world) entity may be a member of more than one subclass of the specialization. This case, which is the default, is displayed by placing an o in the circle, as shown in Figure 4.5.

The second constraint on specialization is called the completeness constraint, which may be total or partial. A total specialization constraint specifies that every entity in the superclass must be a member of at least one subclass in the specialization. For example, if every EMPLOYEE must be either an HOURLY_EMPLOYEE or a SALARIED_EMPLOYEE, then the specialization {HOURLY_EMPLOYEE, SALARIED_EMPLOYEE} of Figure 4.1 is a total specialization of EMPLOYEE. This is shown in EER diagrams by using a double line to connect the superclass to the circle. A single line is used to display a partial specialization, which allows an entity not to belong to any of the subclasses.
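The disjointness and completeness constraints can be checked mechanically on small examples. The following Python sketch is only an illustration under our own assumptions: the entity identifiers e1 through e4, the sample JobType values, and the function names are hypothetical and do not come from the figures.

```python
# Hypothetical data: entity identifier -> value of the single-valued
# defining attribute JobType.
job_type = {"e1": "Secretary", "e2": "Engineer", "e3": "Technician", "e4": "Manager"}
employee = set(job_type)


def attribute_defined(values, defining_values):
    """Build one predicate-defined subclass per value of the defining attribute."""
    return {
        value: {entity for entity, actual in values.items() if actual == value}
        for value in defining_values
    }


def is_disjoint(subclasses):
    """True if no entity is a member of more than one subclass."""
    seen = set()
    for members in subclasses.values():
        if seen & members:
            return False
        seen |= members
    return True


def is_total(superclass, subclasses):
    """True if every superclass entity is a member of at least one subclass."""
    covered = set().union(*subclasses.values())
    return superclass <= covered


by_job = attribute_defined(job_type, ["Secretary", "Engineer", "Technician"])
```

With this sample data, is_disjoint(by_job) holds because JobType is single-valued, whereas is_total(employee, by_job) does not hold because e4 belongs to none of the three subclasses; the specialization is therefore disjoint and partial.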
For example, if some EMPLOYEE entities do not belong to any of the subclasses {SECRETARY, ENGINEER, TECHNICIAN} of Figures 4.1 and 4.4, then that specialization is partial.⁷

FIGURE 4.5 EER diagram notation for an overlapping (nondisjoint) specialization.

Notice that the disjointness and completeness constraints are independent. Hence, we have the following four possible constraints on specialization:
• Disjoint, total
• Disjoint, partial
• Overlapping, total
• Overlapping, partial

Of course, the correct constraint is determined from the real-world meaning that applies to each specialization. In general, a superclass that was identified through the generalization process usually is total, because the superclass is derived from the subclasses and hence contains only the entities that are in the subclasses.

Certain insertion and deletion rules apply to specialization (and generalization) as a consequence of the constraints specified earlier. Some of these rules are as follows:
• Deleting an entity from a superclass implies that it is automatically deleted from all the subclasses to which it belongs.
• Inserting an entity in a superclass implies that the entity is mandatorily inserted in all predicate-defined (or attribute-defined) subclasses for which the entity satisfies the defining predicate.
• Inserting an entity in a superclass of a total specialization implies that the entity is mandatorily inserted in at least one of the subclasses of the specialization.

The reader is encouraged to make a complete list of rules for insertions and deletions for the various types of specializations.

4.3.2 Specialization and Generalization Hierarchies and Lattices

A subclass itself may have further subclasses specified on it, forming a hierarchy or a lattice of specializations.
For example, in Figure 4.6 ENGINEER is a subclass of EMPLOYEE and is also a superclass of ENGINEERING_MANAGER; this represents the real-world constraint that every engineering manager is required to be an engineer. A specialization hierarchy has the constraint that every subclass participates as a subclass in only one class/subclass relationship; that is, each subclass has only one parent, which results in a tree structure. In contrast, for a specialization lattice, a subclass can be a subclass in more than one class/subclass relationship. Hence, Figure 4.6 is a lattice.

FIGURE 4.6 A specialization lattice with shared subclass ENGINEERING_MANAGER.

7. The notation of using single or double lines is similar to that for partial or total participation of an entity type in a relationship type, as described in Chapter 3.

Figure 4.7 shows another specialization lattice of more than one level. This may be part of a conceptual schema for a UNIVERSITY database. Notice that this arrangement would have been a hierarchy except for the STUDENT_ASSISTANT subclass, which is a subclass in two distinct class/subclass relationships. In Figure 4.7, all person entities represented in the database are members of the PERSON entity type, which is specialized into the subclasses {EMPLOYEE, ALUMNUS, STUDENT}. This specialization is overlapping; for example, an alumnus may also be an employee and may also be a student pursuing an advanced degree. The subclass STUDENT is the superclass for the specialization {GRADUATE_STUDENT, UNDERGRADUATE_STUDENT}, while EMPLOYEE is the superclass for the specialization {STUDENT_ASSISTANT, FACULTY, STAFF}. Notice that STUDENT_ASSISTANT is also a subclass of STUDENT. Finally, STUDENT_ASSISTANT is the superclass for the specialization into {RESEARCH_ASSISTANT, TEACHING_ASSISTANT}.
In such a specialization lattice or hierarchy, a subclass inherits the attributes not only of its direct superclass but also of all its predecessor superclasses all the way to the root of the hierarchy or lattice. For example, an entity in GRADUATE_STUDENT inherits all the attributes of that entity as a STUDENT and as a PERSON. Notice that an entity may exist in several leaf nodes of the hierarchy, where a leaf node is a class that has no subclasses of its own. For example, a member of GRADUATE_STUDENT may also be a member of RESEARCH_ASSISTANT.

FIGURE 4.7 A specialization lattice with multiple inheritance for a UNIVERSITY database.

A subclass with more than one superclass is called a shared subclass, such as ENGINEERING_MANAGER in Figure 4.6. This leads to the concept known as multiple inheritance, where the shared subclass ENGINEERING_MANAGER directly inherits attributes and relationships from multiple classes. Notice that the existence of at least one shared subclass leads to a lattice (and hence to multiple inheritance); if no shared subclasses existed, we would have a hierarchy rather than a lattice. An important rule related to multiple inheritance can be illustrated by the example of the shared subclass STUDENT_ASSISTANT in Figure 4.7, which inherits attributes from both EMPLOYEE and STUDENT. Here, both EMPLOYEE and STUDENT inherit the same attributes from PERSON. The rule states that if an attribute (or relationship) originating in the same superclass (PERSON) is inherited more than once via different paths (EMPLOYEE and STUDENT) in the lattice, then it should be included only once in the shared subclass (STUDENT_ASSISTANT). Hence, the attributes of PERSON are inherited only once in the STUDENT_ASSISTANT subclass of Figure 4.7.
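The rule that an attribute reaching a shared subclass along several paths is included only once can be sketched as a traversal of the lattice. The Python fragment below is a minimal sketch under our own assumptions: it encodes only part of the lattice described for Figure 4.7, and the attribute names and function names are hypothetical.

```python
# Hypothetical encoding of part of the lattice: class -> direct superclasses.
SUPERCLASSES = {
    "PERSON": [],
    "EMPLOYEE": ["PERSON"],
    "STUDENT": ["PERSON"],
    "STUDENT_ASSISTANT": ["EMPLOYEE", "STUDENT"],
}

# Specific (local) attributes of each class; the names are assumptions.
LOCAL_ATTRIBUTES = {
    "PERSON": ["Ssn", "Name"],
    "EMPLOYEE": ["Salary"],
    "STUDENT": ["MajorDept"],
    "STUDENT_ASSISTANT": ["PercentTime"],
}


def is_shared_subclass(cls):
    """A shared subclass has more than one direct superclass."""
    return len(SUPERCLASSES[cls]) > 1


def all_attributes(cls):
    """Collect local and inherited attributes, visiting each class only once."""
    visited = set()
    attributes = []

    def visit(name):
        if name in visited:
            return  # already reached via another path; do not inherit twice
        visited.add(name)
        for parent in SUPERCLASSES[name]:
            visit(parent)
        attributes.extend(LOCAL_ATTRIBUTES[name])

    visit(cls)
    return attributes
```

For STUDENT_ASSISTANT, the traversal reaches PERSON through both EMPLOYEE and STUDENT, but the visited set ensures that the PERSON attributes appear only once in the result.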
It is important to note here that some models and languages do not allow multiple inheritance (shared subclasses). In such a model, it is necessary to create additional subclasses to cover all possible combinations of classes that may have some entity belong to all these classes simultaneously. Hence, any overlapping specialization would require multiple additional subclasses. For example, in the overlapping specialization of PERSON into {EMPLOYEE, ALUMNUS, STUDENT} (or {E, A, S} for short), it would be necessary to create seven subclasses of PERSON in order to cover all possible types of entities: E, A, S, E_A, E_S, A_S, and E_A_S. Obviously, this can lead to extra complexity.

It is also important to note that some inheritance mechanisms that allow multiple inheritance do not allow an entity to have multiple types, and hence an entity can be a member of only one class.⁸ In such a model, it is also necessary to create additional shared subclasses as leaf nodes to cover all possible combinations of classes that may have some entity belong to all these classes simultaneously. Hence, we would require the same seven subclasses of PERSON.

Although we have used specialization to illustrate our discussion, similar concepts apply equally to generalization, as we mentioned at the beginning of this section. Hence, we can also speak of generalization hierarchies and generalization lattices.

4.3.3 Utilizing Specialization and Generalization in Refining Conceptual Schemas

We now elaborate on the differences between the specialization and generalization processes, and how they are used to refine conceptual schemas during conceptual database design. In the specialization process, we typically start with an entity type and then define subclasses of the entity type by successive specialization; that is, we repeatedly define more specific groupings of the entity type.
For example, when designing the specialization lattice in Figure 4.7, we may first specify an entity type PERSON for a university database. Then we discover that three types of persons will be represented in the database: university employees, alumni, and students. We create the specialization {EMPLOYEE, ALUMNUS, STUDENT} for this purpose and choose the overlapping constraint because a person may belong to more than one of the subclasses. We then specialize EMPLOYEE further into {STAFF, FACULTY, STUDENT_ASSISTANT}, and specialize STUDENT into {GRADUATE_STUDENT, UNDERGRADUATE_STUDENT}. Finally, we specialize STUDENT_ASSISTANT into {RESEARCH_ASSISTANT, TEACHING_ASSISTANT}. This successive specialization corresponds to a top-down conceptual refinement process during conceptual schema design. So far, we have a hierarchy; we then realize that STUDENT_ASSISTANT is a shared subclass, since it is also a subclass of STUDENT, leading to the lattice.

8. In some models, the class is further restricted to be a leaf node in the hierarchy or lattice.

It is possible to arrive at the same hierarchy or lattice from the other direction. In such a case, the process involves generalization rather than specialization and corresponds to a bottom-up conceptual synthesis. In this case, designers may first discover entity types such as STAFF, RESEARCH_ASSISTANT, UNDERGRADUATE_STUDENT, FACULTY, ALUMNUS, GRADUATE_STUDENT, TEACHING_ASSISTANT, and so on; then they generalize {GRADUATE_STUDENT, UNDERGRADUATE_STUDENT} into STUDENT; then they generalize {RESEARCH_ASSISTANT, TEACHING_ASSISTANT} into STUDENT_ASSISTANT; then they generalize {STAFF, FACULTY, STUDENT_ASSISTANT} into EMPLOYEE; and finally they generalize {EMPLOYEE, ALUMNUS, STUDENT} into PERSON.
In structural terms, hierarchies or lattices resulting from either process may be identical; the only difference relates to the manner or order in which the schema superclasses and subclasses were specified. In practice, it is likely that neither the generalization process nor the specialization process is followed strictly, but that a combination of the two processes is employed. In this case, new classes are continually incorporated into a hierarchy or lattice as they become apparent to users and designers. Notice that the notion of representing data and knowledge by using superclass/subclass hierarchies and lattices is quite common in knowledge-based systems and expert systems, which combine database technology with artificial intelligence techniques. For example, frame-based knowledge representation schemes closely resemble class hierarchies. Specialization is also common in software engineering design methodologies that are based on the object-oriented paradigm.

4.4 MODELING OF UNION TYPES USING CATEGORIES

All of the superclass/subclass relationships we have seen thus far have a single superclass. A shared subclass such as ENGINEERING_MANAGER in the lattice of Figure 4.6 is the subclass in three distinct superclass/subclass relationships, where each of the three relationships has a single superclass. It is not uncommon, however, that the need arises for modeling a single superclass/subclass relationship with more than one superclass, where the superclasses represent different entity types. In this case, the subclass will represent a collection of objects that is a subset of the UNION of distinct entity types; we call such a subclass a union type or a category.⁹

For example, suppose that we have three entity types: PERSON, BANK, and COMPANY. In a database for vehicle registration, an owner of a vehicle can be a person, a bank (holding a lien on a vehicle), or a company.
We need to create a class (collection of entities) that includes entities of all three types to play the role of vehicle owner. A category OWNER that is a subclass of the UNION of the three entity sets of COMPANY, BANK, and PERSON is created for this purpose. We display categories in an EER diagram as shown in Figure 4.8. The superclasses COMPANY, BANK, and PERSON are connected to the circle with the ∪ symbol, which stands for the set union operation. An arc with the subset symbol connects the circle to the (subclass) OWNER category. If a defining predicate is needed, it is displayed next to the line from the superclass to which the predicate applies. In Figure 4.8 we have two categories: OWNER, which is a subclass of the union of PERSON, BANK, and COMPANY; and REGISTERED_VEHICLE, which is a subclass of the union of CAR and TRUCK.

FIGURE 4.8 Two categories (union types): OWNER and REGISTERED_VEHICLE.

9. Our use of the term category is based on the ECR (Entity-Category-Relationship) model (Elmasri et al. 1985).

A category has two or more superclasses that may represent distinct entity types, whereas other superclass/subclass relationships always have a single superclass. We can compare a category, such as OWNER in Figure 4.8, with the ENGINEERING_MANAGER shared subclass of Figure 4.6. The latter is a subclass of each of the three superclasses ENGINEER, MANAGER, and SALARIED_EMPLOYEE, so an entity that is a member of ENGINEERING_MANAGER must exist in all three. This represents the constraint that an engineering manager must be an ENGINEER, a MANAGER, and a SALARIED_EMPLOYEE; that is, ENGINEERING_MANAGER is a subset of the intersection of the three superclasses (sets of entities). On the other hand, a category is a subset of the union of its superclasses.
Hence, an entity that is a member of OWNER must exist in only one of the superclasses. This represents the constraint that an OWNER may be a COMPANY, a BANK, or a PERSON in Figure 4.8.

Attribute inheritance works more selectively in the case of categories. For example, in Figure 4.8 each OWNER entity inherits the attributes of a COMPANY, a PERSON, or a BANK, depending on the superclass to which the entity belongs. On the other hand, a shared subclass such as ENGINEERING_MANAGER (Figure 4.6) inherits all the attributes of its superclasses SALARIED_EMPLOYEE, ENGINEER, and MANAGER.

It is interesting to note the difference between the category REGISTERED_VEHICLE (Figure 4.8) and the generalized superclass VEHICLE (Figure 4.3b). In Figure 4.3b, every car and every truck is a VEHICLE; but in Figure 4.8, the REGISTERED_VEHICLE category includes some cars and some trucks but not necessarily all of them (for example, some cars or trucks may not be registered). In general, a specialization or generalization such as that in Figure 4.3b, if it were partial, would not preclude VEHICLE from containing other types of entities, such as motorcycles. However, a category such as REGISTERED_VEHICLE in Figure 4.8 implies that only cars and trucks, but not other types of entities, can be members of REGISTERED_VEHICLE.

A category can be total or partial. A total category holds the union of all entities in its superclasses, whereas a partial category can hold a subset of the union. A total category is represented by a double line connecting the category and the circle, whereas partial categories are indicated by a single line.

The superclasses of a category may have different key attributes, as demonstrated by the OWNER category of Figure 4.8, or they may have the same key attribute, as demonstrated by the REGISTERED_VEHICLE category. Notice that if a category is total (not partial), it may be represented alternatively as a total specialization (or a total generalization).
In this case the choice of which representation to use is subjective. If the two classes represent the same type of entities and share numerous attributes, including the same key attributes, specialization/generalization is preferred; otherwise, categorization (union type) is more appropriate.

4.5 AN EXAMPLE UNIVERSITY EER SCHEMA AND FORMAL DEFINITIONS FOR THE EER MODEL

In this section, we first give an example of a database schema in the EER model to illustrate the use of the various concepts discussed here and in Chapter 3. Then, we summarize the EER model concepts and define them formally in the same manner in which we formally defined the concepts of the basic ER model in Chapter 3.

4.5.1 The UNIVERSITY Database Example

For our example database application, consider a UNIVERSITY database that keeps track of students and their majors, transcripts, and registration as well as of the university's course offerings. The database also keeps track of the sponsored research projects of faculty and graduate students. This schema is shown in Figure 4.9. A discussion of the requirements that led to this schema follows.

For each person, the database maintains information on the person's name [Name], social security number [Ssn], address [Address], sex [Sex], and birth date [BDate]. Two subclasses of the PERSON entity type were identified: FACULTY and STUDENT. Specific attributes of FACULTY are rank [Rank] (assistant, associate, adjunct, research, visiting, etc.), office [FOffice], office phone [FPhone], and salary [Salary]. All faculty members are related to the academic department(s) with which they are affiliated [BELONGS] (a faculty member can be associated with several departments, so the relationship is M:N). A specific attribute of STUDENT is [Class] (freshman = 1, sophomore = 2, ... , graduate student = 5).
Each student is also related to his or her major and minor departments, if known ([MAJOR] and [MINOR]), to the course sections he or she is currently attending [REGISTERED], and to the courses completed [TRANSCRIPT]. Each transcript instance includes the grade the student received [Grade] in the course section.

GRAD_STUDENT is a subclass of STUDENT, with the defining predicate Class = 5. For each graduate student, we keep a list of previous degrees in a composite, multivalued attribute [Degrees]. We also relate the graduate student to a faculty advisor [ADVISOR] and to a thesis committee [COMMITTEE], if one exists.

An academic department has the attributes name [DName], telephone [DPhone], and office number [Office] and is related to the faculty member who is its chairperson [CHAIRS] and to the college to which it belongs [CD]. Each college has attributes college name [CName], office number [COffice], and the name of its dean [Dean].

A course has attributes course number [C#], course name [Cname], and course description [CDesc]. Several sections of each course are offered, with each section having the attributes section number [Sec#] and the year and quarter in which the section was offered ([Year] and [Qtr]).¹⁰ Section numbers uniquely identify each section. The sections being offered during the current quarter are in a subclass CURRENT_SECTION of SECTION, with the defining predicate Qtr = CurrentQtr and Year = CurrentYear. Each section is related to the instructor who taught or is teaching it ([TEACH]), if that instructor is in the database.

FIGURE 4.9 An EER conceptual schema for a UNIVERSITY database.

10. We assume that the quarter system rather than the semester system is used in this university.
The category INSTRUCTOR_RESEARCHER is a subset of the union of FACULTY and GRAD_STUDENT and includes all faculty, as well as graduate students who are supported by teaching or research. Finally, the entity type GRANT keeps track of research grants and contracts awarded to the university. Each grant has attributes grant title [Title], grant number [No], the awarding agency [Agency], and the starting date [StDate]. A grant is related to one principal investigator [PI] and to all researchers it supports [SUPPORT]. Each instance of support has as attributes the starting date of support [Start], the ending date of the support (if known) [End], and the percentage of time being spent on the project [Time] by the researcher being supported.

4.5.2 Formal Definitions for the EER Model Concepts

We now summarize the EER model concepts and give formal definitions. A class¹¹ is a set or collection of entities; this includes any of the EER schema constructs that group entities, such as entity types, subclasses, superclasses, and categories. A subclass $S$ is a class whose entities must always be a subset of the entities in another class, called the superclass $C$ of the superclass/subclass (or IS-A) relationship. We denote such a relationship by $C/S$. For such a superclass/subclass relationship, we must always have

$$S \subseteq C$$

A specialization $Z = \{S_1, S_2, \ldots, S_n\}$ is a set of subclasses that have the same superclass $G$; that is, $G/S_i$ is a superclass/subclass relationship for $i = 1, 2, \ldots, n$. $G$ is called a generalized entity type (or the superclass of the specialization, or a generalization of the subclasses $\{S_1, S_2, \ldots, S_n\}$). $Z$ is said to be total if we always (at any point in time) have

$$\bigcup_{i=1}^{n} S_i = G$$

Otherwise, $Z$ is said to be partial. $Z$ is said to be disjoint if we always have

$$S_i \cap S_j = \emptyset \ \text{(empty set) for } i \neq j$$

Otherwise, $Z$ is said to be overlapping.
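The total and disjoint conditions above translate directly into set operations. The following is a minimal sketch that checks them on one snapshot of the data; the definitions require the conditions to hold at every point in time, which a single check cannot establish. The function names and the sample entity identifiers are hypothetical.

```python
from itertools import combinations


def is_specialization(G, subclasses):
    """Check that every subclass S_i is a subset of the superclass G."""
    return all(S <= G for S in subclasses)


def is_total(G, subclasses):
    """Check that the union of all S_i equals G (otherwise Z is partial)."""
    return set().union(*subclasses) == G


def is_disjoint(subclasses):
    """Check that S_i and S_j share no entity for i != j (otherwise overlapping)."""
    return all(a.isdisjoint(b) for a, b in combinations(subclasses, 2))


# Hypothetical snapshot: p4 is a person who is in none of the subclasses,
# and p2 and p3 each belong to two subclasses.
PERSON = {"p1", "p2", "p3", "p4"}
EMPLOYEE = {"p1", "p2"}
ALUMNUS = {"p2", "p3"}
STUDENT = {"p3"}
Z = [EMPLOYEE, ALUMNUS, STUDENT]

print(is_specialization(PERSON, Z))  # every subclass lies inside PERSON
print(is_total(PERSON, Z))           # p4 is uncovered, so Z is partial
print(is_disjoint(Z))                # p2 and p3 overlap, so Z is overlapping
```

In this snapshot the specialization of PERSON is partial and overlapping, which matches the overlapping constraint chosen for {EMPLOYEE, ALUMNUS, STUDENT} in the Figure 4.7 discussion.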
A subclass $S$ of $C$ is said to be predicate-defined if a predicate $p$ on the attributes of $C$ is used to specify which entities in $C$ are members of $S$; that is, $S = C[p]$, where $C[p]$ is the set of entities in $C$ that satisfy $p$. A subclass that is not defined by a predicate is called user-defined.

11. The use of the word class here differs from its more common use in object-oriented programming languages such as C++. In C++, a class is a structured type definition along with its applicable functions (operations).

A specialization $Z$ (or generalization $G$) is said to be attribute-defined if a predicate $(A = c_i)$, where $A$ is an attribute of $G$ and $c_i$ is a constant value from the domain of $A$, is used to specify membership in each subclass $S_i$ in $Z$. Notice that if $c_i \neq c_j$ for $i \neq j$, and $A$ is a single-valued attribute, then the specialization will be disjoint.

A category $T$ is a class that is a subset of the union of $n$ defining superclasses $D_1, D_2, \ldots, D_n$, $n > 1$, and is formally specified as follows:

$$T \subseteq (D_1 \cup D_2 \cup \cdots \cup D_n)$$

A predicate $p_i$ on the attributes of $D_i$ can be used to specify the members of each $D_i$ that are members of $T$. If a predicate is specified on every $D_i$, we get

$$T = (D_1[p_1] \cup D_2[p_2] \cup \cdots \cup D_n[p_n])$$

We should now extend the definition of relationship type given in Chapter 3 by allowing any class (not only any entity type) to participate in a relationship. Hence, we should replace the words entity type with class in that definition. The graphical notation of EER is consistent with ER because all classes are represented by rectangles.

4.6 REPRESENTING SPECIALIZATION/GENERALIZATION AND INHERITANCE IN UML CLASS DIAGRAMS

We now discuss the UML notation for generalization/specialization and inheritance. We already presented basic UML class diagram notation and terminology in Section 3.8. Figure 4.10 illustrates a possible UML class diagram corresponding to the EER diagram in Figure 4.7.
The basic notation for generalization is to connect the subclasses by vertical lines to a horizontal line, which has a triangle connecting the horizontal line through another vertical line to the superclass (see Figure 4.10). A blank triangle indicates a specialization/generalization with the disjoint constraint, and a filled triangle indicates an overlapping constraint. The root superclass is called the base class, and leaf nodes are called leaf classes. Both single and multiple inheritance are permitted.

The above discussion and example (and Section 3.8) give a brief overview of UML class diagrams and terminology. There are many details that we have not discussed because they are outside the scope of this book and are mainly relevant to software engineering. For example, classes can be of various types:

• Abstract classes define attributes and operations but do not have objects corresponding to those classes. These are mainly used to specify a set of attributes and operations that can be inherited.
• Concrete classes can have objects (entities) instantiated to belong to the class.
• Template classes specify a template that can be further used to define other classes.

FIGURE 4.10 A UML class diagram corresponding to the EER diagram in Figure 4.7, illustrating UML notation for specialization/generalization.
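The distinction between abstract and concrete classes, together with multiple inheritance, can be illustrated with a short sketch based on a few of the classes in Figure 4.10. This is only one possible rendering: treating PERSON as abstract is an assumption made for the example (the figure does not say so), and the Python class names, keyword arguments, and the `describe` operation are hypothetical.

```python
from abc import ABC, abstractmethod


class Person(ABC):
    """Abstract class (assumed): holds shared attributes, has no direct objects."""

    def __init__(self, *, name, ssn):
        self.name = name
        self.ssn = ssn

    @abstractmethod
    def describe(self):
        """Each concrete subclass must supply its own description."""


class Employee(Person):
    """Concrete class: adds Salary to the inherited attributes."""

    def __init__(self, *, salary, **kwargs):
        super().__init__(**kwargs)
        self.salary = salary

    def describe(self):
        return f"{self.name} (employee)"


class Student(Person):
    """Concrete class: adds MajorDept to the inherited attributes."""

    def __init__(self, *, major_dept, **kwargs):
        super().__init__(**kwargs)
        self.major_dept = major_dept

    def describe(self):
        return f"{self.name} (student)"


class StudentAssistant(Employee, Student):
    """Shared subclass: inherits from both Employee and Student."""

    def __init__(self, *, percent_time, **kwargs):
        super().__init__(**kwargs)
        self.percent_time = percent_time

    def describe(self):
        return f"{self.name} (student assistant)"


sa = StudentAssistant(name="Ann", ssn="123", salary=1000,
                      major_dept="CS", percent_time=50)
print(sa.describe(), sa.salary, sa.major_dept, sa.percent_time)
```

The keyword-only constructors let each class consume its own attribute and pass the rest along the inheritance chain, so the shared subclass receives the attributes of both superclasses. Calling `Person(name="x", ssn="1")` directly raises `TypeError`, which mirrors the idea that an abstract class has no objects of its own.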
In database design, we are mainly concerned with specifying concrete classes whose collections of objects are permanently (or persistently) stored in the database. The bibliographic notes at the end of this chapter give some references to books that describe complete details of UML. Additional material related to UML is covered in Chapter 12, and object modeling in general is further discussed in Chapter 20.

4.7 RELATIONSHIP TYPES OF DEGREE HIGHER THAN TWO

In Section 3.4.2 we defined the degree of a relationship type as the number of participating entity types and called a relationship type of degree two binary and a relationship type of degree three ternary. In this section, we elaborate on the differences between binary and higher-degree relationships, when to choose higher-degree or binary relationships, and constraints on higher-degree relationships.

4.7.1 Choosing between Binary and Ternary (or Higher-Degree) Relationships

The ER diagram notation for a ternary relationship type is shown in Figure 4.11a, which displays the schema for the SUPPLY relationship type that was displayed at the instance level in Figure 3.10. Recall that the relationship set of SUPPLY is a set of relationship instances (s, j, p), where s is a SUPPLIER who is currently supplying a PART p to a PROJECT j. In general, a relationship type R of degree n will have n edges in an ER diagram, one connecting R to each participating entity type.

Figure 4.11b shows an ER diagram for the three binary relationship types CAN_SUPPLY, USES, and SUPPLIES. In general, a ternary relationship type represents different information than do three binary relationship types. Consider the three binary relationship types CAN_SUPPLY, USES, and SUPPLIES.
Suppose that CAN_SUPPLY, between SUPPLIER and PART, includes an instance (s, p) whenever supplier s can supply part p (to any project); USES, between PROJECT and PART, includes an instance (j, p) whenever project j uses part p; and SUPPLIES, between SUPPLIER and PROJECT, includes an instance (s, j) whenever supplier s supplies some part to project j. The existence of three relationship instances (s, p), (j, p), and (s, j) in CAN_SUPPLY, USES, and SUPPLIES, respectively, does not necessarily imply that an instance (s, j, p) exists in the ternary relationship SUPPLY, because the meaning is different. It is often tricky to decide whether a particular relationship should be represented as a relationship type of degree n or should be broken down into several relationship types of smaller degrees. The designer must base this decision on the semantics or meaning of the particular situation being represented. The typical solution is to include the ternary relationship plus one or more of the binary relationships, if they represent different meanings and if all are needed by the application.

Some database design tools are based on variations of the ER model that permit only binary relationships. In this case, a ternary relationship such as SUPPLY must be represented as a weak entity type, with no partial key and with three identifying relationships. The three participating entity types SUPPLIER, PART, and PROJECT are together the owner entity types (see Figure 4.11c). Hence, an entity in the weak entity type SUPPLY of Figure 4.11c is identified by the combination of its three owner entities from SUPPLIER, PART, and PROJECT.

Another example is shown in Figure 4.12.
The ternary relationship type OFFERS represents information on instructors offering courses during particular semesters; hence it includes a relationship instance (i, s, c) whenever INSTRUCTOR i offers COURSE c during SEMESTER s. The three binary relationship types shown in Figure 4.12 have the following meanings: CAN_TEACH relates a course to the instructors who can teach that course, TAUGHT_DURING relates a semester to the instructors who taught some course during that semester, and OFFERED_DURING relates a semester to the courses offered during that semester by any instructor. These ternary and binary relationships represent different information, but certain constraints should hold among the relationships. For example, a relationship instance (i, s, c) should not exist in OFFERS unless an instance (i, s) exists in TAUGHT_DURING, an instance (s, c) exists in OFFERED_DURING, and an instance (i, c) exists in CAN_TEACH. However, the reverse is not always true; we may have instances (i, s), (s, c), and (i, c) in the three binary relationship types with no corresponding instance (i, s, c) in OFFERS. Note that in this example, based on the meanings of the relationships, we can infer the instances of TAUGHT_DURING and OFFERED_DURING from the instances in OFFERS, but we cannot infer the instances of CAN_TEACH; therefore, TAUGHT_DURING and OFFERED_DURING are redundant and can be left out.

FIGURE 4.11 Ternary relationship types. (a) The SUPPLY relationship. (b) Three binary relationships not equivalent to SUPPLY. (c) SUPPLY represented as a weak entity type.

FIGURE 4.12 Another example of ternary versus binary relationship types.
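The difference between OFFERS and the three binary relationship types can be made concrete with relationship sets held as Python sets of tuples. The sketch below derives TAUGHT_DURING and OFFERED_DURING from OFFERS, then recombines the three binary sets and shows that extra triples appear. The instructor, semester, and course values are invented sample data, and `join_binaries` is a hypothetical helper name.

```python
# Hypothetical OFFERS instances (instructor, semester, course).
OFFERS = {
    ("Smith", "Fall", "DB"),
    ("Smith", "Spring", "OS"),
    ("Jones", "Fall", "OS"),
}

# These two binary sets can be inferred from OFFERS, so they are redundant.
TAUGHT_DURING = {(i, s) for i, s, c in OFFERS}
OFFERED_DURING = {(s, c) for i, s, c in OFFERS}

# CAN_TEACH cannot be inferred: Jones can teach DB but never offered it.
CAN_TEACH = {(i, c) for i, s, c in OFFERS} | {("Jones", "DB")}


def join_binaries(taught_during, offered_during, can_teach):
    """Return every (i, s, c) that is consistent with all three binary sets."""
    return {
        (i, s, c)
        for (i, s) in taught_during
        for (s2, c) in offered_during
        if s == s2 and (i, c) in can_teach
    }


candidates = join_binaries(TAUGHT_DURING, OFFERED_DURING, CAN_TEACH)
spurious = candidates - OFFERS

print(OFFERS <= candidates)  # every real offering satisfies the constraints
print(sorted(spurious))      # triples implied by the binaries but not in OFFERS
```

With this sample data the join yields two triples, (Smith, Fall, OS) and (Jones, Fall, DB), that are not in OFFERS. This is the sense in which the reverse implication does not hold: the three binary instances can all exist without the corresponding ternary instance.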
Although in general three binary relationships cannot replace a ternary relationship, they may do so under certain additional constraints. In our example, if the CAN_TEACH relationship is 1:1 (an instructor can teach one course, and a course can be taught by only one instructor), then the ternary relationship OFFERS can be left out because it can be inferred from the three binary relationships CAN_TEACH, TAUGHT_DURING, and OFFERED_DURING. The schema designer must analyze the meaning of each specific situation to decide which of the binary and ternary relationship types are needed.

Notice that it is possible to have a weak entity type with a ternary (or n-ary) identifying relationship type. In this case, the weak entity type can have several owner entity types. An example is shown in Figure 4.13.

FIGURE 4.13 A weak entity type INTERVIEW with a ternary identifying relationship type.

4.7.2 Constraints on Ternary (or Higher-Degree) Relationships

There are two notations for specifying structural constraints on n-ary relationships, and they specify different constraints. They should thus both be used if it is important to fully specify the structural constraints on a ternary or higher-degree relationship. The first notation is based on the cardinality ratio notation of binary relationships displayed in Figure 3.2. Here, a 1, M, or N is specified on each participation arc (both M and N symbols stand for many or any number).¹² Let us illustrate this constraint using the SUPPLY relationship in Figure 4.11. Recall that the relationship set of SUPPLY is a set of relationship instances (s, j, p), where s is a SUPPLIER, j is a PROJECT, and p is a PART. Suppose that the constraint exists that for a particular project-part combination, only one supplier will be used (only one supplier supplies a particular part to a particular project).
In this case, we place 1 on the SUPPLIER participation, and M, N on the PROJECT, PART participations in Figure 4.11. This specifies the constraint that a particular (j, p) combination can appear at most once in the relationship set because each such (project, part) combination uniquely determines a single supplier. Hence, any relationship instance (s, j, p) is uniquely identified in the relationship set by its (j, p) combination, which makes (j, p) a key for the relationship set. In this notation, the participations that have a one specified on them are not required to be part of the identifying key for the relationship set.¹³

The second notation is based on the (min, max) notation displayed in Figure 3.15 for binary relationships. A (min, max) on a participation here specifies that each entity is related to at least min and at most max relationship instances in the relationship set. These constraints have no bearing on determining the key of an n-ary relationship, where n > 2,¹⁴ but specify a different type of constraint that places restrictions on how many relationship instances each entity can participate in.

12. This notation allows us to determine the key of the relationship relation, as we discuss in Chapter 7.
13. This is also true for cardinality ratios of binary relationships.
14. The (min, max) constraints can determine the keys for binary relationships, though.

4.8 DATA ABSTRACTION, KNOWLEDGE REPRESENTATION, AND ONTOLOGY CONCEPTS

In this section we discuss in abstract terms some of the modeling concepts that we described quite specifically in our presentation of the ER and EER models in Chapter 3 and earlier in this chapter. This terminology is used both in conceptual data modeling and in artificial intelligence literature when discussing knowledge representation (abbreviated as KR).
The goal of KR techniques is to develop concepts for accurately modeling some domain of knowledge by creating an ontology¹⁵ that describes the concepts of the domain. This is then used to store and manipulate knowledge for drawing inferences, making decisions, or just answering questions. The goals of KR are similar to those of semantic data models, but there are some important similarities and differences between the two disciplines:

• Both disciplines use an abstraction process to identify common properties and important aspects of objects in the miniworld (domain of discourse) while suppressing insignificant differences and unimportant details.
• Both disciplines provide concepts, constraints, operations, and languages for defining data and representing knowledge.
• KR is generally broader in scope than semantic data models. Different forms of knowledge, such as rules (used in inference, deduction, and search), incomplete and default knowledge, and temporal and spatial knowledge, are represented in KR schemes. Database models are being expanded to include some of these concepts (see Chapter 24).
• KR schemes include reasoning mechanisms that deduce additional facts from the facts stored in a database. Hence, whereas most current database systems are limited to answering direct queries, knowledge-based systems using KR schemes can answer queries that involve inferences over the stored data. Database technology is being extended with inference mechanisms (see Section 24.4).
• Whereas most data models concentrate on the representation of database schemas, or meta-knowledge, KR schemes often mix up the schemas with the instances themselves in order to provide flexibility in representing exceptions. This often results in inefficiencies when these KR schemes are implemented, especially when compared with databases and when a large amount of data (or facts) needs to be stored.
In this section we discuss four abstraction concepts that are used in both semantic data models, such as the EER model, and KR schemes: (1) classification and instantiation, (2) identification, (3) specialization and generalization, and (4) aggregation and association. The paired concepts of classification and instantiation are inverses of one another, as are generalization and specialization. The concepts of aggregation and association are also related. We discuss these abstract concepts and their relation to the concrete representations used in the EER model to clarify the data abstraction process and to improve our understanding of the related process of conceptual schema design. We close the section with a brief discussion of the term ontology, which is being used widely in recent knowledge representation research.

15. An ontology is somewhat similar to a conceptual schema, but with more knowledge, rules, and exceptions.

4.8.1 Classification and Instantiation

The process of classification involves systematically assigning similar objects/entities to object classes/entity types. We can now describe (in DB) or reason about (in KR) the classes rather than the individual objects. Collections of objects share the same types of attributes, relationships, and constraints, and by classifying objects we simplify the process of discovering their properties. Instantiation is the inverse of classification and refers to the generation and specific examination of distinct objects of a class. Hence, an object instance is related to its object class by the IS-AN-INSTANCE-OF or IS-A-MEMBER-OF relationship. Although EER diagrams do not display instances, the UML diagrams allow a form of instantiation by permitting the display of individual objects. We did not describe this feature in our introduction to UML.

In general, the objects of a class should have a similar type structure.
However, some objects may display properties that differ in some respects from the other objects of the class; these exception objects also need to be modeled, and KR schemes allow more varied exceptions than do database models. In addition, certain properties apply to the class as a whole and not to the individual objects; KR schemes allow such class properties. UML diagrams also allow specification of class properties.

In the EER model, entities are classified into entity types according to their basic attributes and relationships. Entities are further classified into subclasses and categories based on additional similarities and differences (exceptions) among them. Relationship instances are classified into relationship types. Hence, entity types, subclasses, categories, and relationship types are the different types of classes in the EER model. The EER model does not provide explicitly for class properties, but it may be extended to do so. In UML, objects are classified into classes, and it is possible to display both class properties and individual objects.

Knowledge representation models allow multiple classification schemes in which one class is an instance of another class (called a meta-class). Notice that this cannot be represented directly in the EER model, because we have only two levels: classes and instances. The only relationship among classes in the EER model is a superclass/subclass relationship, whereas in some KR schemes an additional class/instance relationship can be represented directly in a class hierarchy. An instance may itself be another class, allowing multiple-level classification schemes.

4.8.2 Identification

Identification is the abstraction process whereby classes and objects are made uniquely identifiable by means of some identifier. For example, a class name uniquely identifies a whole class.
An additional mechanism is necessary for telling distinct object instances apart by means of object identifiers. Moreover, it is necessary to identify multiple manifestations in the database of the same real-world object. For example, we may have a tuple in a PERSON relation and another tuple <301-54-0836, CS, 3.8> in a STUDENT relation that happen to represent the same real-world entity. There is no way to identify the fact that these two database objects (tuples) represent the same real-world entity unless we make a provision at design time for appropriate cross-referencing to supply this identification. Hence, identification is needed at two levels:

• To distinguish among database objects and classes
• To identify database objects and to relate them to their real-world counterparts

In the EER model, identification of schema constructs is based on a system of unique names for the constructs. For example, every class in an EER schema (whether it is an entity type, a subclass, a category, or a relationship type) must have a distinct name. The names of attributes of a given class must also be distinct. Rules for unambiguously identifying attribute name references in a specialization or generalization lattice or hierarchy are needed as well.

At the object level, the values of key attributes are used to distinguish among entities of a particular entity type. For weak entity types, entities are identified by a combination of their own partial key values and the entities they are related to in the owner entity type(s). Relationship instances are identified by some combination of the entities that they relate, depending on the cardinality ratio specified.

4.8.3 Specialization and Generalization

Specialization is the process of classifying a class of objects into more specialized subclasses. Generalization is the inverse process of generalizing several classes into a higher-level abstract class that includes the objects in all these classes. Specialization is conceptual refinement, whereas generalization is conceptual synthesis. Subclasses are used in the EER model to represent specialization and generalization. We call the relationship between a subclass and its superclass an IS-A-SUBCLASS-OF relationship, or simply an IS-A relationship.

4.8.4 Aggregation and Association

Aggregation is an abstraction concept for building composite objects from their component objects. There are three cases where this concept can be related to the EER model. The first case is the situation in which we aggregate attribute values of an object to form the whole object. The second case is when we represent an aggregation relationship as an ordinary relationship. The third case, which the EER model does not provide for explicitly, involves the possibility of combining objects that are related by a particular relationship instance into a higher-level aggregate object. This is sometimes useful when the higher-level aggregate object is itself to be related to another object. We call the relationship between the primitive objects and their aggregate object IS-A-PART-OF; the inverse is called IS-A-COMPONENT-OF. UML provides for all three types of aggregation.

The abstraction of association is used to associate objects from several independent classes. Hence, it is somewhat similar to the second use of aggregation. It is represented in the EER model by relationship types, and in UML by associations. This abstract relationship is called IS-ASSOCIATED-WITH.

In order to understand the different uses of aggregation better, consider the ER schema shown in Figure 4.14a, which stores information about interviews by job applicants to various companies.
The class COMPANY is an aggregation of the attributes (or component objects) CName (company name) and CAddress (company address), whereas JOB_APPLICANT is an aggregate of Ssn, Name, Address, and Phone. The relationship attributes ContactName and ContactPhone represent the name and phone number of the person in the company who is responsible for the interview. Suppose that some interviews result in job offers, whereas others do not. We would like to treat INTERVIEW as a class to associate it with JOB_OFFER. The schema shown in Figure 4.14b is incorrect because it requires each interview relationship instance to have a job offer. The schema shown in Figure 4.14c is not allowed, because the ER model does not allow relationships among relationships (although UML does).

One way to represent this situation is to create a higher-level aggregate class composed of COMPANY, JOB_APPLICANT, and INTERVIEW and to relate this class to JOB_OFFER, as shown in Figure 4.14d. Although the EER model as described in this book does not have this facility, some semantic data models do allow it and call the resulting object a composite or molecular object. Other models treat entity types and relationship types uniformly and hence permit relationships among relationships, as illustrated in Figure 4.14c.

To represent this situation correctly in the ER model as described here, we need to create a new weak entity type INTERVIEW, as shown in Figure 4.14e, and relate it to JOB_OFFER. Hence, we can always represent these situations correctly in the ER model by creating additional entity types, although it may be conceptually more desirable to allow direct representation of aggregation, as in Figure 4.14d, or to allow relationships among relationships, as in Figure 4.14c.

The main structural distinction between aggregation and association is that when an association instance is deleted, the participating objects may continue to exist.
However, if we support the notion of an aggregate object (for example, a CAR that is made up of objects ENGINE, CHASSIS, and TIRES), then deleting the aggregate CAR object amounts to deleting all its component objects.

FIGURE 4.14 Aggregation. (a) The relationship type INTERVIEW. (b) Including JOB_OFFER in a ternary relationship type (incorrect). (c) Having the RESULTS_IN relationship participate in other relationships (generally not allowed in ER). (d) Using aggregation and a composite (molecular) object (generally not allowed in ER). (e) Correct representation in ER.

4.8.5 Ontologies and the Semantic Web

In recent years, the amount of computerized data and information available on the Web has spiraled out of control. Many different models and formats are used. In addition to the database models that we present in this book, much information is stored in the form of documents, which have considerably less structure than database information does. One research project that is attempting to allow information exchange among computers on the Web is called the Semantic Web, which attempts to create knowledge representation models that are quite general in order to allow meaningful information exchange and search among machines. The concept of ontology is considered to be the most promising basis for achieving the goals of the Semantic Web, and is closely related to knowledge representation. In this section, we give a brief introduction to what an ontology is and how it can be used as a basis to automate information understanding, search, and exchange.

The study of ontologies attempts to describe the structures and relationships that are possible in reality through some common vocabulary, and so it can be considered as a way to describe the knowledge of a certain community about reality.
Ontology originated in the fields of philosophy and metaphysics. One commonly used definition of ontology is "a specification of a conceptualization."¹⁶

In this definition, a conceptualization is the set of concepts that are used to represent the part of reality or knowledge that is of interest to a community of users. Specification refers to the language and vocabulary terms that are used to specify the conceptualization. The ontology includes both specification and conceptualization. For example, the same conceptualization may be specified in two different languages, giving two separate ontologies. Based on this quite general definition, there is no consensus on what exactly an ontology is. Some possible techniques to describe ontologies that have been mentioned are as follows:

• A thesaurus (or even a dictionary or a glossary of terms) describes the relationships between words (vocabulary) that represent various concepts.
• A taxonomy describes how concepts of a particular area of knowledge are related using structures similar to those used in a specialization or generalization.
• A detailed database schema is considered by some to be an ontology that describes the concepts (entities and attributes) and relationships of a miniworld from reality.
• A logical theory uses concepts from mathematical logic to try to define concepts and their interrelationships.

Usually the concepts used to describe ontologies are quite similar to the concepts we discussed in conceptual modeling, such as entities, attributes, relationships, specializations, and so on. The main difference between an ontology and, say, a database schema is that the schema is usually limited to describing a small subset of a miniworld from reality in order to store and manage data. An ontology is usually considered to be more general in that it should attempt to describe a part of reality as completely as possible.
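Of the techniques listed above, the taxonomy is the easiest to make concrete, because it reuses the IS-A structure of Section 4.8.3. The following Python fragment is only a minimal sketch under stated assumptions: the concept names and the helper functions `ancestors` and `is_a_kind_of` are hypothetical and are not part of any ontology language or of the EER model itself.

```python
# Hypothetical concepts. Each one maps to its direct superconcepts (IS-A links).
IS_A_LINKS = {
    "PERSON": [],
    "EMPLOYEE": ["PERSON"],
    "STUDENT": ["PERSON"],
    # A concept with two superconcepts makes the structure a lattice, not a tree.
    "STUDENT_ASSISTANT": ["EMPLOYEE", "STUDENT"],
}


def ancestors(concept, links=IS_A_LINKS):
    """Return every concept reachable from `concept` by following IS-A links."""
    seen = set()
    stack = list(links[concept])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(links[parent])
    return seen


def is_a_kind_of(sub, sup, links=IS_A_LINKS):
    """True if `sub` IS-A `sup`, directly or through intermediate concepts."""
    return sub == sup or sup in ancestors(sub, links)


print(sorted(ancestors("STUDENT_ASSISTANT")))
print(is_a_kind_of("EMPLOYEE", "PERSON"), is_a_kind_of("PERSON", "EMPLOYEE"))
```

A structure like this captures only the specialization part of an ontology; attributes, other relationships, and logical constraints would need additional machinery.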
16. This definition is given in Gruber (1995).

4.9 SUMMARY

In this chapter we first discussed extensions to the ER model that improve its representational capabilities. We called the resulting model the enhanced ER or EER model. The concept of a subclass and its superclass and the related mechanism of attribute/relationship inheritance were presented. We saw how it is sometimes necessary to create additional classes of entities, either because of additional specific attributes or because of specific relationship types. We discussed two main processes for defining superclass/subclass hierarchies and lattices: specialization and generalization.

We then showed how to display these new constructs in an EER diagram. We also discussed the various types of constraints that may apply to specialization or generalization. The two main constraints are total/partial and disjoint/overlapping. In addition, a defining predicate for a subclass or a defining attribute for a specialization may be specified. We discussed the differences between user-defined and predicate-defined subclasses and between user-defined and attribute-defined specializations. Finally, we discussed the concept of a category or union type, which is a subset of the union of two or more classes, and we gave formal definitions of all the concepts presented.

We then introduced some of the notation and terminology of UML for representing specialization and generalization. We also discussed some of the issues concerning the difference between binary and higher-degree relationships, under which circumstances each should be used when designing a conceptual schema, and how different types of constraints on n-ary relationships may be specified. In Section 4.8 we discussed briefly the discipline of knowledge representation and how it is related to semantic data modeling.
We also gave an overview and summary of the types of abstract data representation concepts: classification and instantiation, identification, specialization and generalization, and aggregation and association. We saw how EER and UML concepts are related to each of these.

Review Questions

4.1. What is a subclass? When is a subclass needed in data modeling?
4.2. Define the following terms: superclass of a subclass, superclass/subclass relationship, is-a relationship, specialization, generalization, category, specific (local) attributes, specific relationships.
4.3. Discuss the mechanism of attribute/relationship inheritance. Why is it useful?
4.4. Discuss user-defined and predicate-defined subclasses, and identify the differences between the two.
4.5. Discuss user-defined and attribute-defined specializations, and identify the differences between the two.
4.6. Discuss the two main types of constraints on specializations and generalizations.
4.7. What is the difference between a specialization hierarchy and a specialization lattice?
4.8. What is the difference between specialization and generalization? Why do we not display this difference in schema diagrams?
4.9. How does a category differ from a regular shared subclass? What is a category used for? Illustrate your answer with examples.
4.10. For each of the following UML terms (see Sections 3.8 and 4.6), discuss the corresponding term in the EER model, if any: object, class, association, aggregation, generalization, multiplicity, attributes, discriminator, link, link attribute, reflexive association, qualified association.
4.11. Discuss the main differences between the notation for EER schema diagrams and UML class diagrams by comparing how common concepts are represented in each.
4.12. Discuss the two notations for specifying constraints on n-ary relationships, and what each can be used for.
4.13. List the various data abstraction concepts and the corresponding modeling concepts in the EER model.
4.14. What aggregation feature is missing from the EER model? How can the EER model be further enhanced to support it?
4.15. What are the main similarities and differences between conceptual database modeling techniques and knowledge representation techniques?
4.16. Discuss the similarities and differences between an ontology and a database schema.

Exercises

4.17. Design an EER schema for a database application that you are interested in. Specify all constraints that should hold on the database. Make sure that the schema has at least five entity types, four relationship types, a weak entity type, a superclass/subclass relationship, a category, and an n-ary (n > 2) relationship type.

4.18. Consider the BANK ER schema of Figure 3.18, and suppose that it is necessary to keep track of different types of ACCOUNTS (SAVINGS_ACCTS, CHECKING_ACCTS, ...) and LOANS (CAR_LOANS, HOME_LOANS, ...). Suppose that it is also desirable to keep track of each account's TRANSACTIONS (deposits, withdrawals, checks, ...) and each loan's PAYMENTS; both of these include the amount, date, and time. Modify the BANK schema, using ER and EER concepts of specialization and generalization. State any assumptions you make about the additional requirements.

4.19. The following narrative describes a simplified version of the organization of Olympic facilities planned for the summer Olympics. Draw an EER diagram that shows the entity types, attributes, relationships, and specializations for this application. State any assumptions you make. The Olympic facilities are divided into sports complexes. Sports complexes are divided into one-sport and multisport types. Multisport complexes have areas of the complex designated for each sport with a location indicator (e.g., center, NE corner, etc.). A complex has a location, chief organizing individual, total occupied area, and so on. Each complex holds a series of events (e.g., the track stadium may hold many different races).
For each event there is a planned date, duration, number of participants, number of officials, and so on. A roster of all officials will be maintained together with the list of events each official will be involved in. Different equipment is needed for the events (e.g., goal posts, poles, parallel bars) as well as for maintenance. The two types of facilities (one-sport and multisport) will have different types of information. For each type, the number of facilities needed is kept, together with an approximate budget.

4.20. Identify all the important concepts represented in the library database case study described here. In particular, identify the abstractions of classification (entity types and relationship types), aggregation, identification, and specialization/generalization. Specify (min, max) cardinality constraints whenever possible. List details that will affect the eventual design but have no bearing on the conceptual design. List the semantic constraints separately. Draw an EER diagram of the library database.

Case Study: The Georgia Tech Library (GTL) has approximately 16,000 members, 100,000 titles, and 250,000 volumes (or an average of 2.5 copies per book). About 10 percent of the volumes are out on loan at any one time. The librarians ensure that the books that members want to borrow are available when the members want to borrow them. Also, the librarians must know how many copies of each book are in the library or out on loan at any given time. A catalog of books is available online that lists books by author, title, and subject area. For each title in the library, a book description is kept in the catalog that ranges from one sentence to several pages. The reference librarians want to be able to access this description when members request information about a book.
Library staff is divided into chief librarian, departmental associate librarians, reference librarians, check-out staff, and library assistants.

Books can be checked out for 21 days. Members are allowed to have only five books out at a time. Members usually return books within three to four weeks. Most members know that they have one week of grace before a notice is sent to them, so they try to get the book returned before the grace period ends. About 5 percent of the members have to be sent reminders to return a book. Most overdue books are returned within a month of the due date. Approximately 5 percent of the overdue books are either kept or never returned. The most active members of the library are defined as those who borrow at least ten times during the year. The top 1 percent of membership does 15 percent of the borrowing, and the top 10 percent of the membership does 40 percent of the borrowing. About 20 percent of the members are totally inactive in that they are members but never borrow.

To become a member of the library, applicants fill out a form including their SSN, campus and home mailing addresses, and phone numbers. The librarians then issue a numbered, machine-readable card with the member's photo on it. This card is good for four years. A month before a card expires, a notice is sent to a member for renewal. Professors at the institute are considered automatic members. When a new faculty member joins the institute, his or her information is pulled from the employee records and a library card is mailed to his or her campus address. Professors are allowed to check out books for three-month intervals and have a two-week grace period. Renewal notices to professors are sent to the campus address.

The library does not lend some books, such as reference books, rare books, and maps. The librarians must differentiate between books that can be lent and those that cannot be lent.
In addition, the librarians have a list of some books they are interested in acquiring but cannot obtain, such as rare or out-of-print books and books that were lost or destroyed but have not been replaced. The librarians must have a system that keeps track of books that cannot be lent as well as books that they are interested in acquiring. Some books may have the same title; therefore, the title cannot be used as a means of identification. Every book is identified by its International Standard Book Number (ISBN), a unique international code assigned to all books. Two books with the same title can have different ISBNs if they are in different languages or have different bindings (hard cover or soft cover). Editions of the same book have different ISBNs.

The proposed database system must be designed to keep track of the members, the books, the catalog, and the borrowing activity.

4.21. Design a database to keep track of information for an art museum. Assume that the following requirements were collected:

• The museum has a collection of ART_OBJECTS. Each ART_OBJECT has a unique IdNo, an Artist (if known), a Year (when it was created, if known), a Title, and a Description. The art objects are categorized in several ways, as discussed below.
• ART_OBJECTS are categorized based on their type. There are three main types: PAINTING, SCULPTURE, and STATUE, plus another type called OTHER to accommodate objects that do not fall into one of the three main types.
• A PAINTING has a PaintType (oil, watercolor, etc.), material on which it is DrawnOn (paper, canvas, wood, etc.), and Style (modern, abstract, etc.).
• A SCULPTURE or a STATUE has a Material from which it was created (wood, stone, etc.), Height, Weight, and Style.
• An art object in the OTHER category has a Type (print, photo, etc.) and Style.
• ART_OBJECTS are also categorized as PERMANENT_COLLECTION, which are owned by the museum (these have information on the DateAcquired, whether it is OnDisplay or stored, and Cost) or BORROWED, which has information on the Collection (from which it was borrowed), DateBorrowed, and DateReturned.
• ART_OBJECTS also have information describing their country/culture using information on country/culture of Origin (Italian, Egyptian, American, Indian, etc.) and Epoch (Renaissance, Modern, Ancient, etc.).
• The museum keeps track of ARTIST information, if known: Name, DateBorn (if known), DateDied (if not living), CountryOfOrigin, Epoch, MainStyle, and Description. The Name is assumed to be unique.
• Different EXHIBITIONS occur, each having a Name, StartDate, and EndDate. EXHIBITIONS are related to all the art objects that were on display during the exhibition.
• Information is kept on other COLLECTIONS with which the museum interacts, including Name (unique), Type (museum, personal, etc.), Description, Address, Phone, and current ContactPerson.

Draw an EER schema diagram for this application. Discuss any assumptions you made, and justify your EER design choices.

FIGURE 4.15 EER schema for a SMALL AIRPORT database.

4.22. Figure 4.15 shows an example of an EER diagram for a small private airport database that is used to keep track of airplanes, their owners, airport employees, and pilots. From the requirements for this database, the following information was collected: Each AIRPLANE has a registration number [Reg#], is of a particular plane type [OF_TYPE], and is stored in a particular hangar [STORED_IN]. Each PLANE_TYPE has a model number [Model], a capacity [Capacity], and a weight [Weight]. Each HANGAR has a number [Number], a capacity [Capacity], and a location [Location]. The database also keeps track of the OWNERS of each plane [OWNS] and the EMPLOYEES who have maintained the plane [MAINTAIN]. Each relationship instance in OWNS relates an airplane to an owner and includes the purchase date [Pdate].
Each relationship instance in MAINTAIN relates an employee to a service record [SERVICE]. Each plane undergoes service many times; hence, it is related by [PLANE_SERVICE] to a number of service records. A service record includes as attributes the date of maintenance [Date], the number of hours spent on the work [Hours], and the type of work done [Workcode]. We use a weak entity type [SERVICE] to represent airplane service, because the airplane registration number is used to identify a service record. An owner is either a person or a corporation. Hence, we use a union type (category) [OWNER] that is a subset of the union of corporation [CORPORATION] and person [PERSON] entity types. Both pilots [PILOT] and employees [EMPLOYEE] are subclasses of PERSON. Each pilot has specific attributes license number [Lic_Num] and restrictions [Restr]; each employee has specific attributes salary [Salary] and shift worked [Shift]. All PERSON entities in the database have data kept on their social security number [Ssn], name [Name], address [Address], and telephone number [Phone]. For CORPORATION entities, the data kept includes name [Name], address [Address], and telephone number [Phone]. The database also keeps track of the types of planes each pilot is authorized to fly [FLIES] and the types of planes each employee can do maintenance work on [WORKS_ON]. Show how the SMALL AIRPORT EER schema of Figure 4.15 may be represented in UML notation. (Note: We have not discussed how to represent categories (union types) in UML, so you do not have to map the categories in this and the following question.)

4.23. Show how the UNIVERSITY EER schema of Figure 4.9 may be represented in UML notation.

Selected Bibliography

Many papers have proposed conceptual or semantic data models. We give a representative list here. One group of papers, including Abrial (1974), Senko's DIAM model (1975), the NIAM method (Verheijen and VanBekkum 1982), and Bracchi et al.
(1976), presents semantic models that are based on the concept of binary relationships. Another group of early papers discusses methods for extending the relational model to enhance its modeling capabilities. This includes the papers by Schmid and Swenson (1975), Navathe and Schkolnick (1978), Codd's RM/T model (1979), Furtado (1978), and the structural model of Wiederhold and Elmasri (1979).

The ER model was proposed originally by Chen (1976) and is formalized in Ng (1981). Since then, numerous extensions of its modeling capabilities have been proposed, as in Scheuermann et al. (1979), Dos Santos et al. (1979), Teorey et al. (1986), Gogolla and Hohenstein (1991), and the entity-category-relationship (ECR) model of Elmasri et al. (1985). Smith and Smith (1977) present the concepts of generalization and aggregation. The semantic data model of Hammer and McLeod (1981) introduced the concepts of class/subclass lattices, as well as other advanced modeling concepts.

A survey of semantic data modeling appears in Hull and King (1987). Eick (1991) discusses design and transformations of conceptual schemas. Analysis of constraints for n-ary relationships is given in Soutou (1998). UML is described in detail in Booch, Rumbaugh, and Jacobson (1999). Fowler and Scott (2000) and Stevens and Pooley (2000) give concise introductions to UML concepts.

Fensel (2000) is a good reference on Semantic Web. Uschold and Gruninger (1996) and Gruber (1995) discuss ontologies. A recent entire issue of Communications of the ACM is devoted to ontology concepts and applications.

RELATIONAL MODEL: CONCEPTS, CONSTRAINTS, LANGUAGES, DESIGN, AND PROGRAMMING

The Relational Data Model and Relational Database Constraints

This chapter opens Part II of the book on relational databases. The relational model was first introduced by Ted Codd of IBM Research in 1970 in a classic paper (Codd 1970), and attracted immediate attention due to its simplicity and mathematical foundation.
The model uses the concept of a mathematical relation (which looks somewhat like a table of values) as its basic building block, and has its theoretical basis in set theory and first-order predicate logic. In this chapter we discuss the basic characteristics of the model and its constraints.

The first commercial implementations of the relational model became available in the early 1980s, such as the Oracle DBMS and the SQL/DS system on the MVS operating system by IBM. Since then, the model has been implemented in a large number of commercial systems. Current popular relational DBMSs (RDBMSs) include DB2 and Informix Dynamic Server (from IBM), Oracle and Rdb (from Oracle), and SQL Server and Access (from Microsoft).

Because of the importance of the relational model, we have devoted all of Part II of this textbook to this model and the languages associated with it. Chapter 6 covers the operations of the relational algebra and introduces the relational calculus notation for two types of calculi: tuple calculus and domain calculus. Chapter 7 relates the relational model data structures to the constructs of the ER and EER models, and presents algorithms for designing a relational database schema by mapping a conceptual schema in the ER or EER model (see Chapters 3 and 4) into a relational representation. These mappings are incorporated into many database design and CASE¹ tools. In Chapter 8, we describe the SQL query language, which is the standard for commercial relational DBMSs. Chapter 9 discusses the programming techniques used to access database systems, and presents additional topics concerning the SQL language: constraints, views, and the notion of connecting to relational databases via ODBC and JDBC standard protocols.

1. CASE stands for computer-aided software engineering.
Chapters 10 and 11 in Part III of the book present another aspect of the relational model, namely the formal constraints of functional and multivalued dependencies; these dependencies are used to develop a relational database design theory based on the concept known as normalization.

Data models that preceded the relational model include the hierarchical and network models. They were proposed in the 1960s and were implemented in early DBMSs during the 1970s and 1980s. Because of their historical importance and the large existing user base for these DBMSs, we have included a summary of the highlights of these models in appendices, which are available on the Web site for the book. These models and systems will be with us for many years and are now referred to as legacy database systems.

In this chapter, we concentrate on describing the basic principles of the relational model of data. We begin by defining the modeling concepts and notation of the relational model in Section 5.1. Section 5.2 is devoted to a discussion of relational constraints that are now considered an important part of the relational model and are automatically enforced in most relational DBMSs. Section 5.3 defines the update operations of the relational model and discusses how violations of integrity constraints are handled.

5.1 RELATIONAL MODEL CONCEPTS

The relational model represents the database as a collection of relations. Informally, each relation resembles a table of values or, to some extent, a "flat" file of records. For example, the database of files that was shown in Figure 1.2 is similar to the relational model representation. However, there are important differences between relations and files, as we shall soon see.

When a relation is thought of as a table of values, each row in the table represents a collection of related data values. We introduced entity types and relationship types as concepts for modeling real-world data in Chapter 3.
In the relational model, each row in the table represents a fact that typically corresponds to a real-world entity or relationship. The table name and column names are used to help in interpreting the meaning of the values in each row. For example, the first table of Figure 1.2 is called STUDENT because each row represents facts about a particular student entity. The column names (Name, StudentNumber, Class, and Major) specify how to interpret the data values in each row, based on the column each value is in. All values in a column are of the same data type.

In the formal relational model terminology, a row is called a tuple, a column header is called an attribute, and the table is called a relation. The data type describing the types of values that can appear in each column is represented by a domain of possible values. We now define these terms (domain, tuple, attribute, and relation) more precisely.

5.1.1 Domains, Attributes, Tuples, and Relations

A domain $D$ is a set of atomic values. By atomic we mean that each value in the domain is indivisible as far as the relational model is concerned. A common method of specifying a domain is to specify a data type from which the data values forming the domain are drawn. It is also useful to specify a name for the domain, to help in interpreting its values. Some examples of domains follow:

• USA_phone_numbers: The set of ten-digit phone numbers valid in the United States.
• Local_phone_numbers: The set of seven-digit phone numbers valid within a particular area code in the United States.
• Social_security_numbers: The set of valid nine-digit social security numbers.
• Names: The set of character strings that represent names of persons.
• Grade_point_averages: Possible values of computed grade point averages; each must be a real (floating-point) number between 0 and 4.
• Employee_ages: Possible ages of employees of a company; each must be a value between 15 and 80 years old.
• Academic_department_names: The set of academic department names in a university, such as Computer Science, Economics, and Physics.
• Academic_department_codes: The set of academic department codes, such as CS, ECON, and PHYS.

The preceding are called logical definitions of domains. A data type or format is also specified for each domain. For example, the data type for the domain USA_phone_numbers can be declared as a character string of the form (ddd)ddd-dddd, where each d is a numeric (decimal) digit and the first three digits form a valid telephone area code. The data type for Employee_ages is an integer number between 15 and 80. For Academic_department_names, the data type is the set of all character strings that represent valid department names. A domain is thus given a name, data type, and format. Additional information for interpreting the values of a domain can also be given; for example, a numeric domain such as Person_weights should have the units of measurement, such as pounds or kilograms.

A relation schema² $R$, denoted by $R(A_1, A_2, \ldots, A_n)$, is made up of a relation name $R$ and a list of attributes $A_1, A_2, \ldots, A_n$. Each attribute $A_i$ is the name of a role played by some domain $D$ in the relation schema $R$. $D$ is called the domain of $A_i$ and is denoted by $\mathrm{dom}(A_i)$. A relation schema is used to describe a relation; $R$ is called the name of this relation. The degree (or arity) of a relation is the number of attributes $n$ of its relation schema.

2. A relation schema is sometimes called a relation scheme.
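The idea that a domain is given a name, a data type, and a format, and that a relation schema is a name plus a list of attributes that each play a role for some domain, can be sketched in Python. This is only an illustration under stated assumptions: the `RelationSchema` class, the `conforms` helper, and the EMP_CONTACT schema are hypothetical, and the phone check tests only the (ddd)ddd-dddd form, not whether the area code is actually valid.

```python
import re
from dataclasses import dataclass


def is_usa_phone_number(value):
    # Format check only: a string of the form (ddd)ddd-dddd.
    return isinstance(value, str) and re.fullmatch(r"\(\d{3}\)\d{3}-\d{4}", value) is not None


def is_employee_age(value):
    # An integer between 15 and 80 (bool is excluded, although it subclasses int).
    return isinstance(value, int) and not isinstance(value, bool) and 15 <= value <= 80


# Domain name -> membership test (the logical definition plus its data type/format).
DOMAINS = {
    "USA_phone_numbers": is_usa_phone_number,
    "Employee_ages": is_employee_age,
}


@dataclass(frozen=True)
class RelationSchema:
    name: str
    attributes: tuple  # ordered (attribute name, domain name) pairs

    @property
    def degree(self):
        """Number of attributes n of the schema."""
        return len(self.attributes)

    def dom(self, attribute):
        """Name of the domain whose role the given attribute plays."""
        for attr, domain_name in self.attributes:
            if attr == attribute:
                return domain_name
        raise KeyError(attribute)


# Hypothetical schema: two attributes play different roles for the same domain.
EMP_CONTACT = RelationSchema(
    "EMP_CONTACT",
    (
        ("HomePhone", "USA_phone_numbers"),
        ("OfficePhone", "USA_phone_numbers"),
        ("Age", "Employee_ages"),
    ),
)


def conforms(schema, values):
    """True if `values` has one value per attribute, each drawn from dom(Ai)."""
    if len(values) != schema.degree:
        return False
    return all(DOMAINS[d](v) for v, (_, d) in zip(values, schema.attributes))


print(EMP_CONTACT.degree, EMP_CONTACT.dom("OfficePhone"))
print(conforms(EMP_CONTACT, ("(817)373-1616", "(817)749-1253", 25)))
```

Keeping the attributes as an ordered tuple of pairs preserves the positional view of a schema while still allowing lookup by attribute name.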
An example of a relation schema for a relation of degree seven, which describes university students, is the following:

STUDENT(Name, SSN, HomePhone, Address, OfficePhone, Age, GPA)

Using the data type of each attribute, the definition is sometimes written as:

STUDENT(Name: string, SSN: string, HomePhone: string, Address: string, OfficePhone: string, Age: integer, GPA: real)

For this relation schema, STUDENT is the name of the relation, which has seven attributes. In the above definition, we showed assignment of generic types such as string or integer to the attributes. More precisely, we can specify the following previously defined domains for some of the attributes of the STUDENT relation: dom(Name) = Names; dom(SSN) = Social_security_numbers; dom(HomePhone) = Local_phone_numbers,³ dom(OfficePhone) = Local_phone_numbers, and dom(GPA) = Grade_point_averages. It is also possible to refer to attributes of a relation schema by their position within the relation; thus, the second attribute of the STUDENT relation is SSN, whereas the fourth attribute is Address.

A relation (or relation state)⁴ $r$ of the relation schema $R(A_1, A_2, \ldots, A_n)$, also denoted by $r(R)$, is a set of n-tuples $r = \{t_1, t_2, \ldots, t_m\}$. Each n-tuple $t$ is an ordered list of $n$ values $t = \langle v_1, v_2, \ldots, v_n \rangle$, where each value $v_i$, $1 \le i \le n$, is an element of $\mathrm{dom}(A_i)$ or is a special null value. The $i$th value in tuple $t$, which corresponds to the attribute $A_i$, is referred to as $t[A_i]$ (or $t[i]$ if we use the positional notation). The terms relation intension for the schema $R$ and relation extension for a relation state $r(R)$ are also commonly used.

Figure 5.1 shows an example of a STUDENT relation, which corresponds to the STUDENT schema just specified. Each tuple in the relation represents a particular student entity.
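The two ways of referring to a value in a tuple, by attribute name and by position, can be sketched with a Python named tuple. This is an illustration under stated assumptions rather than part of the formal model: `None` stands in for the null value, the sample values follow the style of Figure 5.1, and Python positions are 0-based whereas the positional notation in the text is 1-based.

```python
from collections import namedtuple

# Attribute list of the STUDENT schema given in the text (degree seven).
Student = namedtuple(
    "Student",
    ["Name", "SSN", "HomePhone", "Address", "OfficePhone", "Age", "GPA"],
)

# A relation state r(STUDENT): a set of 7-tuples. None plays the role of null.
r = {
    Student("Benjamin Bayer", "305-61-2435", "373-1616", "2918 Bluebonnet Lane", None, 19, 3.21),
    Student("Dick Davidson", "422-11-2320", None, "3452 Elgin Road", "749-1253", 25, 3.53),
}

degree = len(Student._fields)

# Pick the tuple t whose SSN value is 305-61-2435.
t = next(s for s in r if s.SSN == "305-61-2435")

gpa_by_name = t.GPA    # t[GPA] in the text's notation
gpa_by_position = t[6]  # t[7] in the text's 1-based positional notation

# Adding a tuple that is already present leaves the set unchanged.
r_again = r | {Student("Benjamin Bayer", "305-61-2435", "373-1616", "2918 Bluebonnet Lane", None, 19, 3.21)}

print(degree, gpa_by_name, gpa_by_position, len(r_again))
```

Using a set for the relation state mirrors the definition directly: duplicate tuples collapse, and no tuple has a position within the relation.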
We display the relation as a table, where each tuple is shown as a row and each attribute corresponds to a column header indicating a role or interpretation of the values in that column. Null values represent attributes whose values are unknown or do not exist for some individual STUDENT tuple.

STUDENT
Name             SSN          HomePhone  Address               OfficePhone  Age  GPA
Benjamin Bayer   305-61-2435  373-1616   2918 Bluebonnet Lane  null         19   3.21
Katherine Ashly  381-62-1245  375-4409   125 Kirby Road        null         18   2.89
Dick Davidson    422-11-2320  null       3452 Elgin Road       749-1253     25   3.53
Charles Cooper   489-22-1100  376-9821   265 Lark Lane         749-6492     28   3.93
Barbara Benson   533-69-1238  839-8461   7384 Fontana Lane     null         19   3.25

FIGURE 5.1 The attributes and tuples of a relation STUDENT.

3. With the large increase in phone numbers caused by the proliferation of mobile phones, some metropolitan areas now have multiple area codes, so that seven-digit local dialing has been discontinued. In this case, we would use USA_phone_numbers as the domain.
4. This has also been called a relation instance. We will not use this term because instance is also used to refer to a single tuple or row.

The earlier definition of a relation can be restated more formally as follows. A relation (or relation state) r(R) is a mathematical relation of degree n on the domains dom(A1), dom(A2), ..., dom(An), which is a subset of the Cartesian product of the domains that define R:

r(R) ⊆ (dom(A1) × dom(A2) × ... × dom(An))

The Cartesian product specifies all possible combinations of values from the underlying domains. Hence, if we denote the total number of values, or cardinality, in a domain D by |D| (assuming that all domains are finite), the total number of tuples in the Cartesian product is

|dom(A1)| × |dom(A2)| × ... × |dom(An)|

Of all these possible combinations, a relation state at a given time (the current relation state) reflects only the valid tuples that represent a particular state of the real world. In general, as the state of the real world changes, so does the relation, by being transformed into another relation state. However, the schema R is relatively static and changes only very infrequently, for example as a result of adding an attribute to represent new information that was not originally stored in the relation.

It is possible for several attributes to have the same domain. The attributes indicate different roles, or interpretations, for the domain. For example, in the STUDENT relation, the same domain Local_phone_numbers plays the role of HomePhone, referring to the "home phone of a student," and the role of OfficePhone, referring to the "office phone of the student."

5.1.2 Characteristics of Relations

The earlier definition of relations implies certain characteristics that make a relation different from a file or a table. We now discuss some of these characteristics.

Ordering of Tuples in a Relation. A relation is defined as a set of tuples. Mathematically, elements of a set have no order among them; hence, tuples in a relation do not have any particular order. However, in a file, records are physically stored on disk (or in memory), so there always is an order among the records. This ordering indicates first, second, ith, and last records in the file. Similarly, when we display a relation as a table, the rows are displayed in a certain order.

Tuple ordering is not part of a relation definition, because a relation attempts to represent facts at a logical or abstract level. Many logical orders can be specified on a relation. For example, tuples in the STUDENT relation in Figure 5.1 could be logically ordered by values of Name, or by SSN, or by Age, or by some other attribute.
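The two points just made, that a relation state is a subset of the Cartesian product of its domains and that a set of tuples carries no order, can be checked on a toy case. The sketch below uses two tiny domains invented for illustration; the variable names are hypothetical.

```python
from itertools import product

# Toy domains, invented for illustration (real domains are far larger).
dom_name = {"Bayer", "Benson"}
dom_age = {18, 19, 25}

# The Cartesian product lists every possible combination of values.
all_combinations = set(product(dom_name, dom_age))

# Its size is the product of the domain cardinalities: 2 x 3 = 6.
assert len(all_combinations) == len(dom_name) * len(dom_age)

# A relation state is a subset of that product: a set of tuples.
state_a = {("Bayer", 19), ("Benson", 19)}
state_b = {("Benson", 19), ("Bayer", 19)}  # same tuples, listed in another order

assert state_a <= all_combinations

# Sets carry no order, so both listings denote the same relation state.
assert state_a == state_b
```

A Python list would be the wrong model here, because two lists holding the same tuples in different orders compare as unequal.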
The definition of a relation does not specify any order: There is no preference for one logical ordering over another. Hence, the relation displayed in Figure 5.2 is considered identical to the one shown in Figure 5.1. When a relation is implemented as a file or displayed as a table, a particular ordering may be specified on the records of the file or the rows of the table.

Ordering of Values within a Tuple, and an Alternative Definition of a Relation. According to the preceding definition of a relation, an n-tuple is an ordered list of n values, so the ordering of values in a tuple, and hence of attributes in a relation schema, is important. However, at a logical level, the order of attributes and their values is not that important as long as the correspondence between attributes and values is maintained.

An alternative definition of a relation can be given, making the ordering of values in a tuple unnecessary. In this definition, a relation schema R = {A1, A2, ..., An} is a set of attributes, and a relation state r(R) is a finite set of mappings r = {t1, t2, ..., tm}, where each tuple ti is a mapping from R to D, and D is the union of the attribute domains; that is, D = dom(A1) ∪ dom(A2) ∪ ... ∪ dom(An). In this definition, t[Ai] must be in dom(Ai) for 1 ≤ i ≤ n for each mapping t in r. Each mapping ti is called a tuple.

According to this definition of tuple as a mapping, a tuple can be considered as a set of (<attribute>, <value>) pairs.

• An n-tuple t in a relation r(R) is denoted by t = <v1, v2, ..., vn>, where vi is the value corresponding to attribute Ai.
• The notation t[Au, Aw, ..., Az], where Au, Aw, ..., Az is a list of attributes from R, refers to the subtuple of values <vu, vw, ..., vz> from t corresponding to the attributes specified in the list.

As an example, consider the tuple t for Barbara Benson from the STUDENT relation in Figure 5.1; we have t[Name] = <'Barbara Benson'>, and t[SSN, GPA, Age] = <'533-69-1238', 3.25, 19>.

5.2 RELATIONAL MODEL CONSTRAINTS AND RELATIONAL DATABASE SCHEMAS

So far, we have discussed the characteristics of single relations.
In a relational database, there will typically be many relations, and the tuples in those relations are usually related in various ways. The state of the whole database will correspond to the states of all its relations at a particular point in time. There are generally many restrictions or constraints on the actual values in a database state. These constraints are derived from the rules in the miniworld that the database represents, as we discussed in Section 1.6.8.

In this section, we discuss the various restrictions on data that can be specified on a relational database in the form of constraints. Constraints on databases can generally be divided into three main categories:

1. Constraints that are inherent in the data model. We call these inherent model-based constraints.
2. Constraints that can be directly expressed in the schemas of the data model, typically by specifying them in the DDL (data definition language, see Section 2.3.1). We call these schema-based constraints.
3. Constraints that cannot be directly expressed in the schemas of the data model, and hence must be expressed and enforced by the application programs. We call these application-based constraints.

The characteristics of relations that we discussed in Section 5.1.2 are the inherent constraints of the relational model and belong to the first category; for example, the constraint that a relation cannot have duplicate tuples is an inherent constraint. The constraints we discuss in this section are of the second category, namely, constraints that can be expressed in the schema of the relational model via the DDL. Constraints in the third category are more general and are difficult to express and enforce within the data model, so they are usually checked within application programs.

Another important category of constraints is data dependencies, which include functional dependencies and multivalued dependencies.
They are used mainly for testing the "goodness" of the design of a relational database and are utilized in a process called normalization, which is discussed in Chapters 10 and 11.

We now discuss the main types of constraints that can be expressed in the relational model, namely the schema-based constraints from the second category. These include domain constraints, key constraints, constraints on nulls, entity integrity constraints, and referential integrity constraints.

5.2.1 Domain Constraints

Domain constraints specify that within each tuple, the value of each attribute A must be an atomic value from the domain dom(A). We have already discussed the ways in which domains can be specified in Section 5.1.1. The data types associated with domains typically include standard numeric data types for integers (such as short integer, integer, and long integer) and real numbers (float and double-precision float). Characters, booleans, fixed-length strings, and variable-length strings are also available, as are date, time, timestamp, and, in some cases, money data types. Other possible domains may be described by a subrange of values from a data type or as an enumerated data type in which all possible values are explicitly listed. Rather than describe these in detail here, we discuss the data types offered by the SQL-99 relational standard in Section 8.1.

5.2.2 Key Constraints and Constraints on Null Values

A relation is defined as a set of tuples. By definition, all elements of a set are distinct; hence, all tuples in a relation must also be distinct. This means that no two tuples can have the same combination of values for all their attributes. Usually, there are other subsets of attributes of a relation schema R with the property that no two tuples in any relation state r of R should have the same combination of values for these attributes.
Suppose that we denote one such subset of attributes by SK; then for any two distinct tuples t1 and t2 in a relation state r of R, we have the constraint that

t1[SK] ≠ t2[SK]

Any such set of attributes SK is called a superkey of the relation schema R. A superkey SK specifies a uniqueness constraint that no two distinct tuples in any state r of R can have the same value for SK. Every relation has at least one default superkey: the set of all its attributes. A superkey can have redundant attributes, however, so a more useful concept is that of a key, which has no redundancy. A key K of a relation schema R is a superkey of R with the additional property that removing any attribute A from K leaves a set of attributes K' that is not a superkey of R any more. Hence, a key satisfies two constraints:

1. Two distinct tuples in any state of the relation cannot have identical values for (all) the attributes in the key.
2. It is a minimal superkey; that is, a superkey from which we cannot remove any attributes and still have the uniqueness constraint in condition 1 hold.

The first condition applies to both keys and superkeys. The second condition is required only for keys. For example, consider the STUDENT relation of Figure 5.1. The attribute set {SSN} is a key of STUDENT because no two student tuples can have the same value for SSN.⁸ Any set of attributes that includes SSN (for example, {SSN, Name, Age}) is a superkey. However, the superkey {SSN, Name, Age} is not a key of STUDENT, because removing Name or Age or both from the set still leaves us with a superkey. In general, any superkey formed from a single attribute is also a key. A key with multiple attributes requires all of its attributes together for the uniqueness property to hold.

The value of a key attribute can be used to identify uniquely each tuple in the relation. For example, the SSN value 305-61-2435 identifies uniquely the tuple corresponding to Benjamin Bayer in the STUDENT relation.
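The two conditions can be tested mechanically against a given relation state. The sketch below is an illustration, not a DBMS facility: the function names are hypothetical, and the state holds only three STUDENT tuples with three attributes each. Because a key is a property of the schema, a test on one state can refute a proposed key but can never establish it.

```python
# Three abbreviated STUDENT tuples (Name, SSN, Age only).
STUDENT = [
    {"Name": "Benjamin Bayer", "SSN": "305-61-2435", "Age": 19},
    {"Name": "Katherine Ashly", "SSN": "381-62-1245", "Age": 18},
    {"Name": "Barbara Benson", "SSN": "533-69-1238", "Age": 19},
]


def is_superkey(state, attrs):
    """True if no two tuples of this state agree on all of attrs."""
    seen = set()
    for t in state:
        value = tuple(t[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_key(state, attrs):
    """True if attrs is a superkey of this state and is minimal.

    Testing the removal of one attribute at a time is enough: if a
    smaller subset were unique, every superset of it would be too.
    """
    if not is_superkey(state, attrs):
        return False
    return all(
        not is_superkey(state, [a for a in attrs if a != dropped])
        for dropped in attrs
    )


print(is_key(STUDENT, ["SSN"]))                      # True
print(is_superkey(STUDENT, ["SSN", "Name", "Age"]))  # True
print(is_key(STUDENT, ["SSN", "Name", "Age"]))       # False: not minimal
print(is_superkey(STUDENT, ["Age"]))                 # False: two students are 19
```

On this small state, is_key(STUDENT, ["Name"]) also returns True, which shows the limitation: the names happen to be distinct here, but nothing about the meaning of Name guarantees that they always will be.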
Notice that a set of attributes constituting a key is a property of the relation schema; it is a constraint that should hold on every valid relation state of the schema. A key is determined from the meaning of the attributes, and the property is time-invariant: It must continue to hold when we insert new tuples in the relation. For example, we cannot and should not designate the Name attribute of the STUDENT relation in Figure 5.1 as a key, because it is possible that two students with identical names will exist at some point in a valid state.⁹

8. Note that SSN is also a superkey.
9. Names are sometimes used as keys, but then some artifact, such as appending an ordinal number, must be used to distinguish between identical names.

CAR
LicenseNumber       EngineSerialNumber  Make        Model    Year
-------------
Texas ABC-739       A69352              Ford        Mustang  96
Florida TVP-347     B43696              Oldsmobile  Cutlass  99
New York MPO-22     X83554              Oldsmobile  Delta    95
California 432-TFY  C43742              Mercedes    190-D    93
California RSK-629  Y82935              Toyota      Camry    98
Texas RSK-629       U028365             Jaguar      XJS      98

FIGURE 5.4 The CAR relation, with two candidate keys: LicenseNumber and EngineSerialNumber.

In general, a relation schema may have more than one key. In this case, each of the keys is called a candidate key. For example, the CAR relation in Figure 5.4 has two candidate keys: LicenseNumber and EngineSerialNumber. It is common to designate one of the candidate keys as the primary key of the relation. This is the candidate key whose values are used to identify tuples in the relation. We use the convention that the attributes that form the primary key of a relation schema are underlined, as shown in Figure 5.4. Notice that when a relation schema has several candidate keys, the choice of one to become the primary key is arbitrary; however, it is usually better to choose a primary key with a single attribute or a small number of attributes.
Another constraint on attributes specifies whether null values are or are not permitted. For example, if every STUDENT tuple must have a valid, nonnull value for the Name attribute, then Name of STUDENT is constrained to be NOT NULL.

5.2.3 Relational Databases and Relational Database Schemas

The definitions and constraints we have discussed so far apply to single relations and their attributes. A relational database usually contains many relations, with tuples in relations that are related in various ways. In this section we define a relational database and a relational database schema. A relational database schema S is a set of relation schemas S = {R1, R2, ..., Rm} and a set of integrity constraints IC. A relational database state¹⁰ DB of S is a set of relation states DB = {r1, r2, ..., rm} such that each ri is a state of Ri and such that the ri relation states satisfy the integrity constraints specified in IC.

Figure 5.5 shows a relational database schema that we call COMPANY = {EMPLOYEE, DEPARTMENT, DEPT_LOCATIONS, PROJECT, WORKS_ON, DEPENDENT}. The underlined attributes represent primary keys. Figure 5.6 shows a relational database state corresponding to the COMPANY schema. We will use this schema and database state in this chapter and in Chapters 6 through 9 for developing example queries in different relational languages. When we refer to a relational database, we implicitly include both its schema and its current state. A database state that does not obey all the integrity constraints is called an invalid state, and a state that satisfies all the constraints in IC is called a valid state.

EMPLOYEE
FNAME  MINIT  LNAME  SSN  BDATE  ADDRESS  SEX  SALARY  SUPERSSN  DNO
                     ---

DEPARTMENT
DNAME  DNUMBER  MGRSSN  MGRSTARTDATE
       -------

DEPT_LOCATIONS
DNUMBER  DLOCATION
-------  ---------

PROJECT
PNAME  PNUMBER  PLOCATION  DNUM
       -------

WORKS_ON
ESSN  PNO  HOURS
----  ---

DEPENDENT
ESSN  DEPENDENT_NAME  SEX  BDATE  RELATIONSHIP
----  --------------

FIGURE 5.5 Schema diagram for the COMPANY relational database schema.

10. A relational database state is sometimes called a relational database instance. However, as we mentioned earlier, we will not use the term instance since it also applies to single tuples.

In Figure 5.5, the DNUMBER attribute in both DEPARTMENT and DEPT_LOCATIONS stands for the same real-world concept: the number given to a department. That same concept is called DNO in EMPLOYEE and DNUM in PROJECT. Attributes that represent the same real-world concept may or may not have identical names in different relations. Alternatively, attributes that represent different concepts may have the same name in different relations. For example, we could have used the attribute name NAME for both PNAME of PROJECT and DNAME of DEPARTMENT; in this case, we would have two attributes that share the same name but represent different real-world concepts: project names and department names.

In some early versions of the relational model, an assumption was made that the same real-world concept, when represented by an attribute, would have identical attribute names in all relations. This creates problems when the same real-world concept is used in different roles (meanings) in the same relation. For example, the concept of social security number appears twice in the EMPLOYEE relation of Figure 5.5: once in the role of the employee's social security number, and once in the role of the supervisor's social security number. We gave them distinct attribute names, SSN and SUPERSSN respectively, in order to distinguish their meaning.
EMPLOYEE
FNAME     MINIT  LNAME    SSN        BDATE       ADDRESS                   SEX  SALARY  SUPERSSN   DNO
John      B      Smith    123456789  1965-01-09  731 Fondren, Houston, TX  M    30000   333445555  5
Franklin  T      Wong     333445555  1955-12-08  638 Voss, Houston, TX     M    40000   888665555  5
Alicia    J      Zelaya   999887777  1968-01-19  3321 Castle, Spring, TX   F    25000   987654321  4
Jennifer  S      Wallace  987654321  1941-06-20  291 Berry, Bellaire, TX   F    43000   888665555  4
Ramesh    K      Narayan  666884444  1962-09-15  975 Fire Oak, Humble, TX  M    38000   333445555  5
Joyce     A      English  453453453  1972-07-31  5631 Rice, Houston, TX    F    25000   333445555  5
Ahmad     V      Jabbar   987987987  1969-03-29  980 Dallas, Houston, TX   M    25000   987654321  4
James     E      Borg     888665555  1937-11-10  450 Stone, Houston, TX    M    55000   null       1

DEPARTMENT
DNAME           DNUMBER  MGRSSN     MGRSTARTDATE
Research        5        333445555  1988-05-22
Administration  4        987654321  1995-01-01
Headquarters    1        888665555  1981-06-19

DEPT_LOCATIONS
DNUMBER  DLOCATION
1        Houston
4        Stafford
5        Bellaire
5        Sugarland
5        Houston

WORKS_ON
ESSN       PNO  HOURS
123456789  1    32.5
123456789  2    7.5
666884444  3    40.0
453453453  1    20.0
453453453  2    20.0
333445555  2    10.0
333445555  3    10.0
333445555  10   10.0
333445555  20   10.0
999887777  30   30.0
999887777  10   10.0
987987987  10   35.0
987987987  30   5.0
987654321  30   20.0
987654321  20   15.0
888665555  20   null

PROJECT
PNAME            PNUMBER  PLOCATION  DNUM
ProductX         1        Bellaire   5
ProductY         2        Sugarland  5
ProductZ         3        Houston    5
Computerization  10       Stafford   4
Reorganization   20       Houston    1
Newbenefits      30       Stafford   4

DEPENDENT
ESSN       DEPENDENT_NAME  SEX  BDATE       RELATIONSHIP
333445555  Alice           F    1986-04-05  DAUGHTER
333445555  Theodore        M    1983-10-25  SON
333445555  Joy             F    1958-05-03  SPOUSE
987654321  Abner           M    1942-02-28  SPOUSE
123456789  Michael         M    1988-01-04  SON
123456789  Alice           F    1988-12-30  DAUGHTER
123456789  Elizabeth       F    1967-05-05  SPOUSE

FIGURE 5.6 One possible database state for the COMPANY relational database schema.
Each relational DBMS must have a data definition language (DDL) for defining a relational database schema. Current relational DBMSs are mostly using SQL for this purpose. We present the SQL DDL in Sections 8.1 through 8.3.

Integrity constraints are specified on a database schema and are expected to hold on every valid database state of that schema. In addition to domain, key, and NOT NULL constraints, two other types of constraints are considered part of the relational model: entity integrity and referential integrity.

5.2.4 Entity Integrity, Referential Integrity, and Foreign Keys

The entity integrity constraint states that no primary key value can be null. This is because the primary key value is used to identify individual tuples in a relation. Having null values for the primary key implies that we cannot identify some tuples. For example, if two or more tuples had null for their primary keys, we might not be able to distinguish them if we tried to reference them from other relations.

Key constraints and entity integrity constraints are specified on individual relations. The referential integrity constraint is specified between two relations and is used to maintain the consistency among tuples in the two relations. Informally, the referential integrity constraint states that a tuple in one relation that refers to another relation must refer to an existing tuple in that relation. For example, in Figure 5.6, the attribute DNO of EMPLOYEE gives the department number for which each employee works; hence, its value in every EMPLOYEE tuple must match the DNUMBER value of some tuple in the DEPARTMENT relation.

To define referential integrity more formally, we first define the concept of a foreign key. The conditions for a foreign key, given below, specify a referential integrity constraint between the two relation schemas R1 and R2.
A set of attributes FK in relation schema R1 is a foreign key of R1 that references relation R2 if it satisfies the following two rules:

1. The attributes in FK have the same domain(s) as the primary key attributes PK of R2; the attributes FK are said to reference or refer to the relation R2.
2. A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value of PK for some tuple t2 in the current state r2(R2) or is null. In the former case, we have t1[FK] = t2[PK], and we say that the tuple t1 references or refers to the tuple t2.

In this definition, R1 is called the referencing relation and R2 is the referenced relation. If these two conditions hold, a referential integrity constraint from R1 to R2 is said to hold. In a database of many relations, there are usually many referential integrity constraints.

To specify these constraints, we must first have a clear understanding of the meaning or role that each set of attributes plays in the various relation schemas of the database. Referential integrity constraints typically arise from the relationships among the entities represented by the relation schemas. For example, consider the database shown in Figure 5.6. In the EMPLOYEE relation, the attribute DNO refers to the department for which an employee works; hence, we designate DNO to be a foreign key of EMPLOYEE referring to the DEPARTMENT relation. This means that a value of DNO in any tuple t1 of the EMPLOYEE relation must match a value of the primary key of DEPARTMENT (the DNUMBER attribute) in some tuple t2 of the DEPARTMENT relation, or the value of DNO can be null if the employee does not belong to a department. In Figure 5.6 the tuple for employee 'John Smith' references the tuple for the 'Research' department, indicating that 'John Smith' works for this department.

Notice that a foreign key can refer to its own relation.
For example, the attribute SUPERSSN in EMPLOYEE refers to the supervisor of an employee; this is another employee, represented by a tuple in the EMPLOYEE relation. Hence, SUPERSSN is a foreign key that references the EMPLOYEE relation itself. In Figure 5.6 the tuple for employee 'John Smith' references the tuple for employee 'Franklin Wong,' indicating that 'Franklin Wong' is the supervisor of 'John Smith.'

We can diagrammatically display referential integrity constraints by drawing a directed arc from each foreign key to the relation it references. For clarity, the arrowhead may point to the primary key of the referenced relation. Figure 5.7 shows the schema in Figure 5.5 with the referential integrity constraints displayed in this manner.

All integrity constraints should be specified on the relational database schema if we want to enforce these constraints on the database states. Hence, the DDL includes provisions for specifying the various types of constraints so that the DBMS can automatically enforce them. Most relational DBMSs support key and entity integrity constraints, and make provisions to support referential integrity. These constraints are specified as a part of data definition.

FIGURE 5.7 Referential integrity constraints displayed on the COMPANY relational database schema.

5.2.5 Other Types of Constraints

The preceding integrity constraints do not include a large class of general constraints, sometimes called semantic integrity constraints, that may have to be specified and enforced on a relational database. Examples of such constraints are "the salary of an employee should not exceed the salary of the employee's supervisor" and "the maximum number of hours an employee can work on all projects per week is 56." Such constraints can be specified and enforced within the application programs that update the database, or by using a general-purpose constraint specification language. Mechanisms called triggers and assertions can be used. In SQL-99, a CREATE ASSERTION statement is used for this purpose (see Chapters 8 and 9). It is more common to check for these types of constraints within the application programs than to use constraint specification languages, because the latter are difficult and complex to use correctly, as we discuss in Section 24.1.

Another type of constraint is the functional dependency constraint, which establishes a functional relationship among two sets of attributes X and Y. This constraint specifies that the value of X determines the value of Y in all states of a relation; it is denoted as a functional dependency X → Y. We use functional dependencies and other types of dependencies in Chapters 10 and 11 as tools to analyze the quality of relational designs and to "normalize" relations to improve their quality.

The types of constraints we discussed so far may be called state constraints, because they define the constraints that a valid state of the database must satisfy. Another type of constraint, called transition constraints, can be defined to deal with state changes in the database.¹¹ An example of a transition constraint is: "the salary of an employee can only increase." Such constraints are typically enforced by the application programs or specified using active rules and triggers, as we discuss in Section 24.1.

5.3 UPDATE OPERATIONS AND DEALING WITH CONSTRAINT VIOLATIONS

The operations of the relational model can be categorized into retrievals and updates. The relational algebra operations, which can be used to specify retrievals, are discussed in detail in Chapter 6. A relational algebra expression forms a new relation after applying a number of algebraic operators to an existing set of relations; its main use is for querying a database.
The user formulates a query that specifies the data of interest, and a new relation is formed by applying relational operators to retrieve this data. That relation becomes the answer to the user's query. Chapter 6 also introduces the language called relational calculus, which is used to declaratively define the new relation without giving a specific order of operations.

11. State constraints are sometimes called static constraints, and transition constraints are sometimes called dynamic constraints.

In this section, we concentrate on the database modification or update operations. There are three basic update operations on relations: insert, delete, and modify. Insert is used to insert a new tuple or tuples in a relation, Delete is used to delete tuples, and Update (or Modify) is used to change the values of some attributes in existing tuples. Whenever these operations are applied, the integrity constraints specified on the relational database schema should not be violated. In this section we discuss the types of constraints that may be violated by each update operation and the types of actions that may be taken if an update does cause a violation. We use the database shown in Figure 5.6 for examples and discuss only key constraints, entity integrity constraints, and the referential integrity constraints shown in Figure 5.7. For each type of update, we give some example operations and discuss any constraints that each operation may violate.

5.3.1 The Insert Operation

The Insert operation provides a list of attribute values for a new tuple t that is to be inserted into a relation R. Insert can violate any of the four types of constraints discussed in the previous section. Domain constraints can be violated if an attribute value is given that does not appear in the corresponding domain.
Key constraints can be violated if a key value in the new tuple t already exists in another tuple in the relation r(R). Entity integrity can be violated if the primary key of the new tuple t is null. Referential integrity can be violated if the value of any foreign key in t refers to a tuple that does not exist in the referenced relation. Here are some examples to illustrate this discussion.

1. Insert <'Cecilia', 'F', 'Kolonsky', null, '1960-04-05', '6357 Windy Lane, Katy, TX', F, 28000, null, 4> into EMPLOYEE.
   • This insertion violates the entity integrity constraint (null for the primary key SSN), so it is rejected.
2. Insert <'Alicia', 'J', 'Zelaya', '999887777', '1960-04-05', '6357 Windy Lane, Katy, TX', F, 28000, '987654321', 4> into EMPLOYEE.
   • This insertion violates the key constraint because another tuple with the same SSN value already exists in the EMPLOYEE relation, and so it is rejected.
3. Insert <...> into EMPLOYEE.
   • This insertion violates the referential integrity constraint specified on DNO because no DEPARTMENT tuple exists with DNUMBER = 7.
4. Insert <...> into EMPLOYEE.
   • This insertion satisfies all constraints, so it is acceptable.

If an insertion violates one or more constraints, the default option is to reject the insertion. In this case, it would be useful if the DBMS could explain to the user why the insertion was rejected. Another option is to attempt to correct the reason for rejecting the insertion, but this is typically not used for violations caused by Insert; rather, it is used more often in correcting violations for Delete and Update. In operation 1 above, the DBMS could ask the user to provide a value for SSN and could accept the insertion if a valid SSN value were provided.
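The checks behind the four insert examples can be sketched as follows. This is a simplified illustration and not how any particular DBMS is implemented: the function name is hypothetical, only the SSN, SUPERSSN, and DNO values of the new tuple are examined, domain constraints are left out, and the SSN '111223333' used for operations 3 and 4 is a made-up value that does not occur in Figure 5.6.

```python
# Only the columns needed for the checks (values from Figure 5.6).
employee_ssns = {
    "123456789", "333445555", "999887777", "987654321",
    "666884444", "453453453", "987987987", "888665555",
}
department_numbers = {1, 4, 5}


def check_employee_insert(ssn, superssn, dno):
    """Return the constraint a new EMPLOYEE tuple would violate, or None."""
    if ssn is None:
        return "entity integrity"       # primary key must not be null
    if ssn in employee_ssns:
        return "key"                    # SSN value already present
    if superssn is not None and superssn not in employee_ssns:
        return "referential integrity"  # SUPERSSN must match an EMPLOYEE
    if dno is not None and dno not in department_numbers:
        return "referential integrity"  # DNO must match a DEPARTMENT
    return None


print(check_employee_insert(None, None, 4))               # operation 1
print(check_employee_insert("999887777", "987654321", 4)) # operation 2
print(check_employee_insert("111223333", "987654321", 7)) # like operation 3
print(check_employee_insert("111223333", None, 4))        # like operation 4
```

The function stops at the first violation it finds. Since the text asks for a rejected insertion to be explained to the user, collecting all violations into a list would be a reasonable alternative.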
In operation 3, the DBMS could either ask the user to change the value of DNO to some valid value (or set it to null), or it could ask the user to insert a DEPARTMENT tuple with DNUMBER = 7 and could accept the original insertion only after such an operation was accepted. Notice that in the latter case the insertion violation can cascade back to the EMPLOYEE relation if the user attempts to insert a tuple for department 7 with a value for MGRSSN that does not exist in the EMPLOYEE relation.

5.3.2 The Delete Operation

The Delete operation can violate only referential integrity, if the tuple being deleted is referenced by the foreign keys from other tuples in the database. To specify deletion, a condition on the attributes of the relation selects the tuple (or tuples) to be deleted. Here are some examples.

1. Delete the WORKS_ON tuple with ESSN = '999887777' and PNO = 10.
   • This deletion is acceptable.
2. Delete the EMPLOYEE tuple with SSN = '999887777'.
   • This deletion is not acceptable, because tuples in WORKS_ON refer to this tuple. Hence, if the tuple is deleted, referential integrity violations will result.
3. Delete the EMPLOYEE tuple with SSN = '333445555'.
   • This deletion will result in even worse referential integrity violations, because the tuple involved is referenced by tuples from the EMPLOYEE, DEPARTMENT, WORKS_ON, and DEPENDENT relations.

Several options are available if a deletion operation causes a violation. The first option is to reject the deletion. The second option is to attempt to cascade (or propagate) the deletion by deleting tuples that reference the tuple that is being deleted. For example, in operation 2, the DBMS could automatically delete the offending tuples from WORKS_ON with ESSN = '999887777'. A third option is to modify the referencing attribute values that cause the violation; each such value is either set to null or changed to reference another valid tuple.
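The three options can be sketched on a small fragment of the COMPANY state. This is a hedged illustration only: the function and variable names are hypothetical, the fragment keeps just a few WORKS_ON tuples and SUPERSSN values, and the choice of which option is allowed for which foreign key is an assumption of the sketch (cascade for WORKS_ON, set to null for SUPERSSN).

```python
# Fragment of a COMPANY state, loosely following Figure 5.6.
works_on = {("999887777", 30), ("999887777", 10), ("987987987", 10)}  # (ESSN, PNO)
supervisor_of = {  # SSN -> SUPERSSN
    "999887777": "987654321",
    "987987987": "987654321",
    "987654321": "888665555",
    "888665555": None,
}


def delete_employee(ssn, works_on_option="reject", superssn_option="reject"):
    """Try to delete the EMPLOYEE tuple with the given SSN.

    works_on_option: 'reject' or 'cascade' (ESSN is part of the primary
    key of WORKS_ON, so setting it to null is not offered).
    superssn_option: 'reject' or 'set null'.
    Returns True if the tuple was deleted, False if the deletion was rejected.
    """
    if works_on_option not in ("reject", "cascade"):
        raise ValueError("works_on_option must be 'reject' or 'cascade'")
    if superssn_option not in ("reject", "set null"):
        raise ValueError("superssn_option must be 'reject' or 'set null'")

    referencing_rows = {row for row in works_on if row[0] == ssn}
    supervised = [e for e, boss in supervisor_of.items() if boss == ssn]

    # Option 1: reject the deletion if anything still refers to the tuple.
    if referencing_rows and works_on_option == "reject":
        return False
    if supervised and superssn_option == "reject":
        return False

    works_on.difference_update(referencing_rows)  # option 2: cascade
    for e in supervised:
        supervisor_of[e] = None                   # option 3: set to null
    del supervisor_of[ssn]
    return True


print(delete_employee("999887777"))                             # rejected
print(delete_employee("999887777", works_on_option="cascade"))  # deleted
```

Both reject tests run before any data is changed, so a rejected deletion leaves the state untouched.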
Notice that if a referencing attribute that causes a violation is part of the primary key, it cannot be set to null; otherwise, it would violate entity integrity.

Combinations of these three options are also possible. For example, to avoid having operation 3 cause a violation, the DBMS may automatically delete all tuples from WORKS_ON and DEPENDENT with ESSN = '333445555'. Tuples in EMPLOYEE with SUPERSSN = '333445555' and the tuple in DEPARTMENT with MGRSSN = '333445555' can have their SUPERSSN and MGRSSN values changed to other valid values or to null. Although it may make sense to delete automatically the WORKS_ON and DEPENDENT tuples that refer to an EMPLOYEE tuple, it may not make sense to delete other EMPLOYEE tuples or a DEPARTMENT tuple.

In general, when a referential integrity constraint is specified in the DDL, the DBMS will allow the user to specify which of the options applies in case of a violation of the constraint. We discuss how to specify these options in the SQL-99 DDL in Chapter 8.

5.3.3 The Update Operation

The Update (or Modify) operation is used to change the values of one or more attributes in a tuple (or tuples) of some relation R. It is necessary to specify a condition on the attributes of the relation to select the tuple (or tuples) to be modified. Here are some examples.

1. Update the SALARY of the EMPLOYEE tuple with SSN = '999887777' to 28000.
   • Acceptable.
2. Update the DNO of the EMPLOYEE tuple with SSN = '999887777' to 1.
   • Acceptable.
3. Update the DNO of the EMPLOYEE tuple with SSN = '999887777' to 7.
   • Unacceptable, because it violates referential integrity.
4. Update the SSN of the EMPLOYEE tuple with SSN = '999887777' to '987654321'.
   • Unacceptable, because it violates primary key and referential integrity constraints.

Updating an attribute that is neither a primary key nor a foreign key usually causes no problems; the DBMS need only check to confirm that the new value is of the correct data type and domain.
Modifying a primary key value is similar to deleting one tuple and inserting another in its place, because we use the primary key to identify tuples. Hence, the issues discussed earlier in both Sections 5.3.1 (Insert) and 5.3.2 (Delete) come into play. If a foreign key attribute is modified, the DBMS must make sure that the new value refers to an existing tuple in the referenced relation (or is null). The options for dealing with referential integrity violations caused by Update are similar to those discussed for the Delete operation. In fact, when a referential integrity constraint is specified in the DDL, the DBMS will allow the user to choose separate options to deal with a violation caused by Delete and a violation caused by Update (see Section 8.2).

5.4 SUMMARY

In this chapter we presented the modeling concepts, data structures, and constraints provided by the relational model of data. We started by introducing the concepts of domains, attributes, and tuples. We then defined a relation schema as a list of attributes that describe the structure of a relation. A relation, or relation state, is a set of tuples that conforms to the schema.

Several characteristics differentiate relations from ordinary tables or files. The first is that tuples in a relation are not ordered. The second involves the ordering of attributes in a relation schema and the corresponding ordering of values within a tuple. We gave an alternative definition of relation that does not require these two orderings, but we continued to use the first definition, which requires attributes and tuple values to be ordered, for convenience. We then discussed values in tuples and introduced null values to represent missing or unknown information.

We then classified database constraints into inherent model-based constraints, schema-based constraints and application-based constraints.
We then discussed the schema constraints pertaining to the relational model, starting with domain constraints, then key constraints, including the concepts of superkey, candidate key, and primary key, and the NOT NULL constraint on attributes. We then defined relational databases and relational database schemas. Additional relational constraints include the entity integrity constraint, which prohibits primary key attributes from being null. The interrelation referential integrity constraint was then described, which is used to maintain consistency of references among tuples from different relations.

The modification operations on the relational model are Insert, Delete, and Update. Each operation may violate certain types of constraints. These operations were discussed in Section 5.3. Whenever an operation is applied, the database state after the operation is executed must be checked to ensure that no constraints have been violated.

Review Questions

5.1. Define the following terms: domain, attribute, n-tuple, relation schema, relation state, degree of a relation, relational database schema, relational database state.
5.2. Why are tuples in a relation not ordered?
5.3. Why are duplicate tuples not allowed in a relation?
5.4. What is the difference between a key and a superkey?
5.5. Why do we designate one of the candidate keys of a relation to be the primary key?
5.6. Discuss the characteristics of relations that make them different from ordinary tables and files.
5.7. Discuss the various reasons that lead to the occurrence of null values in relations.
5.8. Discuss the entity integrity and referential integrity constraints. Why is each considered important?
5.9. Define foreign key. What is this concept used for?

Exercises

5.10. Suppose that each of the following update operations is applied directly to the database state shown in Figure 5.6. Discuss all integrity constraints violated by each operation, if any, and the different ways of enforcing these constraints.
a. Insert into EMPLOYEE.
b. Insert <'ProductA', 4, 'Bellaire', 2> into PROJECT.
c. Insert <'Production', 4, '943775543', '1998-10-01'> into DEPARTMENT.
d. Insert <'677678989', null, '40.0'> into WORKS_ON.
e. Insert <'453453453', 'John', M, '1970-12-12', 'SPOUSE'> into DEPENDENT.
f. Delete the WORKS_ON tuples with ESSN = '333445555'.
g. Delete the EMPLOYEE tuple with SSN = '987654321'.
h. Delete the PROJECT tuple with PNAME = 'ProductX'.
i. Modify the MGRSSN and MGRSTARTDATE of the DEPARTMENT tuple with DNUMBER = 5 to '123456789' and '1999-10-01', respectively.
j. Modify the SUPERSSN attribute of the EMPLOYEE tuple with SSN = '999887777' to '943775543'.
k. Modify the HOURS attribute of the WORKS_ON tuple with ESSN = '999887777' and PNO = 10 to '5.0'.

5.11. Consider the AIRLINE relational database schema shown in Figure 5.8, which describes a database for airline flight information. Each FLIGHT is identified by a flight NUMBER, and consists of one or more FLIGHT_LEGS with LEG_NUMBERS 1, 2, 3, and so on. Each leg has scheduled arrival and departure times and airports and has many LEG_INSTANCES, one for each DATE on which the flight travels. FARES are kept for each flight. For each leg instance, SEAT_RESERVATIONS are kept, as are the AIRPLANE used on the leg and the actual arrival and departure times and airports. An AIRPLANE is identified by an AIRPLANE_ID and is of a particular AIRPLANE_TYPE. CAN_LAND relates AIRPLANE_TYPES to the AIRPORTS in which they can land. An AIRPORT is identified by an AIRPORT_CODE. Consider an update for the AIRLINE database to enter a reservation on a particular flight or flight leg on a given date.
a. Give the operations for this update.
b. What types of constraints would you expect to check?
c. Which of these constraints are key, entity integrity, and referential integrity constraints, and which are not?
d. Specify all the referential integrity constraints that hold on the schema shown in Figure 5.8.

5.12.
Consider the relation CLASS(Course#, Univ_Section#, InstructorName, Semester, BuildingCode, Room#, TimePeriod, Weekdays, CreditHours). This represents classes taught in a university, with unique Univ_Section#. Identify what you think should be various candidate keys, and write in your own words the constraints under which each candidate key would be valid.

5.13. Consider the following six relations for an order-processing database application in a company:

CUSTOMER(Cust#, Cname, City)
ORDER(Order#, Odate, Cust#, Ord_Amt)
ORDER_ITEM(Order#, Item#, Qty)
ITEM(Item#, Unit_price)
SHIPMENT(Order#, Warehouse#, Ship_date)
WAREHOUSE(Warehouse#, City)

Here, Ord_Amt refers to total dollar amount of an order; Odate is the date the order was placed; Ship_date is the date an order is shipped from the warehouse. Assume that an order can be shipped from several warehouses. Specify the foreign keys for this schema, stating any assumptions you make.

FIGURE 5.8 The AIRLINE relational database schema.

AIRPORT(AIRPORT_CODE, NAME, CITY, STATE)
FLIGHT(NUMBER, AIRLINE, WEEKDAYS)
FLIGHT_LEG(FLIGHT_NUMBER, LEG_NUMBER, DEPARTURE_AIRPORT_CODE, SCHEDULED_DEPARTURE_TIME, ARRIVAL_AIRPORT_CODE, SCHEDULED_ARRIVAL_TIME)
LEG_INSTANCE(FLIGHT_NUMBER, LEG_NUMBER, DATE, NUMBER_OF_AVAILABLE_SEATS, AIRPLANE_ID, DEPARTURE_AIRPORT_CODE, DEPARTURE_TIME, ARRIVAL_AIRPORT_CODE, ARRIVAL_TIME)
FARES(FLIGHT_NUMBER, FARE_CODE, AMOUNT, RESTRICTIONS)
AIRPLANE_TYPE(TYPE_NAME, MAX_SEATS, COMPANY)
CAN_LAND(AIRPLANE_TYPE_NAME, AIRPORT_CODE)
AIRPLANE(AIRPLANE_ID, TOTAL_NUMBER_OF_SEATS, AIRPLANE_TYPE)
SEAT_RESERVATION(FLIGHT_NUMBER, LEG_NUMBER, DATE, SEAT_NUMBER, CUSTOMER_NAME, CUSTOMER_PHONE)

5.14.
Consider the following relations for a database that keeps track of business trips of salespersons in a sales office:

SALESPERSON(SSN, Name, Start_Year, Dept_No)
TRIP(SSN, From_City, To_City, Departure_Date, Return_Date, Trip_ID)
EXPENSE(Trip_ID, Account#, Amount)

Specify the foreign keys for this schema, stating any assumptions you make.

5.15. Consider the following relations for a database that keeps track of student enrollment in courses and the books adopted for each course:

STUDENT(SSN, Name, Major, Bdate)
COURSE(Course#, Cname, Dept)
ENROLL(SSN, Course#, Quarter, Grade)
BOOK_ADOPTION(Course#, Quarter, Book_ISBN)
TEXT(Book_ISBN, Book_Title, Publisher, Author)

Specify the foreign keys for this schema, stating any assumptions you make.

5.16. Consider the following relations for a database that keeps track of auto sales in a car dealership (Option refers to some optional equipment installed on an auto):

CAR(Serial-No, Model, Manufacturer, Price)
OPTIONS(Serial-No, Option-Name, Price)
SALES(Salesperson-id, Serial-No, Date, Sale-price)
SALESPERSON(Salesperson-id, Name, Phone)

First, specify the foreign keys for this schema, stating any assumptions you make. Next, populate the relations with a few example tuples, and then give an example of an insertion in the SALES and SALESPERSON relations that violates the referential integrity constraints and of another insertion that does not.

Selected Bibliography

The relational model was introduced by Codd (1970) in a classic paper. Codd also introduced relational algebra and laid the theoretical foundations for the relational model in a series of papers (Codd 1971, 1972, 1972a, 1974); he was later given the Turing award, the highest honor of the ACM, for his work on the relational model. In a later paper, Codd (1979) discussed extending the relational model to incorporate more meta-data and semantics about the relations; he also proposed a three-valued logic to deal with uncertainty in relations and incorporating NULLs in the relational algebra. The resulting model is known as RM/T. Childs (1968) had earlier used set theory to model databases. Later, Codd (1990) published a book examining over 300 features of the relational data model and database systems.

Since Codd's pioneering work, much research has been conducted on various aspects of the relational model. Todd (1976) describes an experimental DBMS called PRTV that directly implements the relational algebra operations. Schmidt and Swenson (1975) introduce additional semantics into the relational model by classifying different types of relations. Chen's (1976) entity-relationship model, which is discussed in Chapter 3, is a means to communicate the real-world semantics of a relational database at the conceptual level. Wiederhold and Elmasri (1979) introduce various types of connections between relations to enhance its constraints. Extensions of the relational model are discussed in Chapter 24. Additional bibliographic notes for other aspects of the relational model and its languages, systems, extensions, and theory are given in Chapters 6 to 11, 15, 16, 17, and 22 to 25.

The Relational Algebra and Relational Calculus

In this chapter we discuss the two formal languages for the relational model: the relational algebra and the relational calculus. As we discussed in Chapter 2, a data model must include a set of operations to manipulate the database, in addition to the data model's concepts for defining database structure and constraints. The basic set of operations for the relational model is the relational algebra. These operations enable a user to specify basic retrieval requests. The result of a retrieval is a new relation, which may have been formed from one or more relations. The algebra operations thus produce new relations, which can be further manipulated using operations of the same algebra.
A sequence of relational algebra operations forms a relational algebra expression, whose result will also be a relation that represents the result of a database query (or retrieval request).

The relational algebra is very important for several reasons. First, it provides a formal foundation for relational model operations. Second, and perhaps more important, it is used as a basis for implementing and optimizing queries in relational database management systems (RDBMSs), as we discuss in Part IV of the book. Third, some of its concepts are incorporated into the SQL standard query language for RDBMSs.

Whereas the algebra defines a set of operations for the relational model, the relational calculus provides a higher-level declarative notation for specifying relational queries. A relational calculus expression creates a new relation, which is specified in terms of variables that range over rows of the stored database relations (in tuple calculus) or over columns of the stored relations (in domain calculus). In a calculus expression, there is no order of operations to specify how to retrieve the query result; a calculus expression specifies only what information the result should contain. This is the main distinguishing feature between relational algebra and relational calculus. The relational calculus is important because it has a firm basis in mathematical logic and because the SQL (standard query language) for RDBMSs has some of its foundations in the tuple relational calculus.¹

The relational algebra is often considered to be an integral part of the relational data model, and its operations can be divided into two groups. One group includes set operations from mathematical set theory; these are applicable because each relation is defined to be a set of tuples in the formal relational model. Set operations include UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT.
The other group consists of operations developed specifically for relational databases; these include SELECT, PROJECT, and JOIN, among others. We first describe the SELECT and PROJECT operations in Section 6.1, because they are unary operations that operate on single relations. Then we discuss set operations in Section 6.2. In Section 6.3, we discuss JOIN and other complex binary operations, which operate on two tables. The COMPANY relational database shown in Figure 5.6 is used for our examples.

Some common database requests cannot be performed with the original relational algebra operations, so additional operations were created to express these requests. These include aggregate functions, which are operations that can summarize data from the tables, as well as additional types of JOIN and UNION operations. These operations were added to the original relational algebra because of their importance to many database applications, and are described in Section 6.4. We give examples of specifying queries that use relational operations in Section 6.5. Some of these queries are used in subsequent chapters to illustrate various languages.

In Sections 6.6 and 6.7 we describe the other main formal language for relational databases, the relational calculus. There are two variations of relational calculus. The tuple relational calculus is described in Section 6.6, and the domain relational calculus is described in Section 6.7. Some of the SQL constructs discussed in Chapter 8 are based on the tuple relational calculus. The relational calculus is a formal language, based on the branch of mathematical logic called predicate calculus.² In tuple relational calculus, variables range over tuples, whereas in domain relational calculus, variables range over the domains (values) of attributes. In Appendix D we give an overview of the QBE (Query-By-Example) language, which is a graphical user-friendly relational language based on domain relational calculus.
Section 6.8 summarizes the chapter. For the reader who is interested in a less detailed introduction to formal relational languages, Sections 6.4, 6.6, and 6.7 may be skipped.

1. SQL is based on tuple relational calculus, but also incorporates some of the operations from the relational algebra and its extensions, as we shall see in Chapters 8 and 9.
2. In this chapter no familiarity with first-order predicate calculus (which deals with quantified variables and values) is assumed.

6.1 UNARY RELATIONAL OPERATIONS: SELECT AND PROJECT

6.1.1 The SELECT Operation

The SELECT operation is used to select a subset of the tuples from a relation that satisfy a selection condition. One can consider the SELECT operation to be a filter that keeps only those tuples that satisfy a qualifying condition. The SELECT operation can also be visualized as a horizontal partition of the relation into two sets of tuples: those tuples that satisfy the condition and are selected, and those tuples that do not satisfy the condition and are discarded. For example, to select the EMPLOYEE tuples whose department is 4, or those whose salary is greater than $30,000, we can individually specify each of these two conditions with a SELECT operation as follows:

$\sigma_{\text{DNO}=4}(\text{EMPLOYEE})$
$\sigma_{\text{SALARY}>30000}(\text{EMPLOYEE})$

In general, the SELECT operation is denoted by

$\sigma_{\langle\text{selection condition}\rangle}(R)$

where the symbol $\sigma$ (sigma) is used to denote the SELECT operator, and the selection condition is a Boolean expression specified on the attributes of relation R. Notice that R is generally a relational algebra expression whose result is a relation; the simplest such expression is just the name of a database relation. The relation resulting from the SELECT operation has the same attributes as R.
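The filter view of SELECT translates directly into code. The sketch below is an illustration only: it assumes a relation is represented as a list of Python dicts, the function name `select` is invented for the example, and the four EMPLOYEE tuples are a hand-typed, reduced excerpt whose values should be treated as illustrative.

```python
# Illustrative excerpt of EMPLOYEE, reduced to three attributes.
EMPLOYEE = [
    {"LNAME": "Smith", "DNO": 5, "SALARY": 30000},
    {"LNAME": "Wong", "DNO": 5, "SALARY": 40000},
    {"LNAME": "Zelaya", "DNO": 4, "SALARY": 25000},
    {"LNAME": "Wallace", "DNO": 4, "SALARY": 43000},
]


def select(condition, relation):
    """Keep the tuples for which the Boolean condition is true."""
    return [t for t in relation if condition(t)]


dept4 = select(lambda t: t["DNO"] == 4, EMPLOYEE)
high_paid = select(lambda t: t["SALARY"] > 30000, EMPLOYEE)

print([t["LNAME"] for t in dept4])
print([t["LNAME"] for t in high_paid])
```

Each selected tuple is passed through unchanged, so the result has exactly the attributes of the input relation; only the number of tuples can shrink.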
The Boolean expression specified in <selection condition> is made up of a number of clauses of the form

<attribute name> <comparison op> <constant value>, or
<attribute name> <comparison op> <attribute name>

where <attribute name> is the name of an attribute of R, <comparison op> is normally one of the operators {=, <, ≤, >, ≥, ≠}, and <constant value> is a constant value from the attribute domain. Clauses can be arbitrarily connected by the Boolean operators AND, OR, and NOT to form a general selection condition. For example, to select the tuples for all employees who either work in department 4 and make over $25,000 per year, or work in department 5 and make over $30,000, we can specify the following SELECT operation:

$\sigma_{(\text{DNO}=4 \text{ AND } \text{SALARY}>25000) \text{ OR } (\text{DNO}=5 \text{ AND } \text{SALARY}>30000)}(\text{EMPLOYEE})$

The result is shown in Figure 6.1a.

FIGURE 6.1 Results of SELECT and PROJECT operations. (a) $\sigma_{(\text{DNO}=4 \text{ AND } \text{SALARY}>25000) \text{ OR } (\text{DNO}=5 \text{ AND } \text{SALARY}>30000)}(\text{EMPLOYEE})$. (b) $\pi_{\text{LNAME, FNAME, SALARY}}(\text{EMPLOYEE})$. (c) $\pi_{\text{SEX, SALARY}}(\text{EMPLOYEE})$.

Notice that the comparison operators in the set {=, <, ≤, >, ≥, ≠} apply to attributes whose domains are ordered values, such as numeric or date domains. Domains of strings of characters are considered ordered based on the collating sequence of the characters. If the domain of an attribute is a set of unordered values, then only the comparison operators in the set {=, ≠} can be used. An example of an unordered domain is the domain Color = {red,
blue, green, white, yellow, ...} where no order is specified among the various colors. Some domains allow additional types of comparison operators; for example, a domain of character strings may allow the comparison operator SUBSTRING_OF.

In general, the result of a SELECT operation can be determined as follows. The <selection condition> is applied independently to each tuple t in R. This is done by substituting each occurrence of an attribute $A_i$ in the selection condition with its value in the tuple $t[A_i]$. If the condition evaluates to TRUE, then tuple t is selected. All the selected tuples appear in the result of the SELECT operation. The Boolean conditions AND, OR, and NOT have their normal interpretation, as follows:

• (cond1 AND cond2) is TRUE if both (cond1) and (cond2) are TRUE; otherwise, it is FALSE.
• (cond1 OR cond2) is TRUE if either (cond1) or (cond2) or both are TRUE; otherwise, it is FALSE.
• (NOT cond) is TRUE if cond is FALSE; otherwise, it is FALSE.

The SELECT operator is unary; that is, it is applied to a single relation. Moreover, the selection operation is applied to each tuple individually; hence, selection conditions cannot involve more than one tuple. The degree of the relation resulting from a SELECT operation (its number of attributes) is the same as the degree of R. The number of tuples in the resulting relation is always less than or equal to the number of tuples in R. That is, $|\sigma_c(R)| \le |R|$ for any condition c. The fraction of tuples selected by a selection condition is referred to as the selectivity of the condition.

Notice that the SELECT operation is commutative; that is,

$\sigma_{\langle\text{cond1}\rangle}(\sigma_{\langle\text{cond2}\rangle}(R)) = \sigma_{\langle\text{cond2}\rangle}(\sigma_{\langle\text{cond1}\rangle}(R))$

Hence, a sequence of SELECTs can be applied in any order. In addition, we can always combine a cascade of SELECT operations into a single SELECT operation with a conjunctive (AND) condition; that is:

$\sigma_{\langle\text{cond1}\rangle}(\sigma_{\langle\text{cond2}\rangle}(\ldots(\sigma_{\langle\text{cond}n\rangle}(R))\ldots)) = \sigma_{\langle\text{cond1}\rangle \text{ AND } \langle\text{cond2}\rangle \text{ AND } \ldots \text{ AND } \langle\text{cond}n\rangle}(R)$

6.1.2 The PROJECT Operation

If we think of a relation as a table, the SELECT operation selects some of the rows from the table while discarding other rows. The PROJECT operation, on the other hand, selects certain columns from the table and discards the other columns. If we are interested in only certain attributes of a relation, we use the PROJECT operation to project the relation over these attributes only. The result of the PROJECT operation can hence be visualized as a vertical partition of the relation into two relations: one has the needed columns (attributes) and contains the result of the operation, and the other contains the discarded columns. For example, to list each employee's first and last name and salary, we can use the PROJECT operation as follows:

$\pi_{\text{LNAME, FNAME, SALARY}}(\text{EMPLOYEE})$

The resulting relation is shown in Figure 6.1(b). The general form of the PROJECT operation is

$\pi_{\langle\text{attribute list}\rangle}(R)$

where $\pi$ (pi) is the symbol used to represent the PROJECT operation, and <attribute list> is the desired list of attributes from the attributes of relation R. Again, notice that R is, in general, a relational algebra expression whose result is a relation, which in the simplest case is just the name of a database relation. The result of the PROJECT operation has only the attributes specified in <attribute list> in the same order as they appear in the list. Hence, its degree is equal to the number of attributes in <attribute list>.

If the attribute list includes only nonkey attributes of R, duplicate tuples are likely to occur. The PROJECT operation removes any duplicate tuples, so the result of the PROJECT operation is a set of tuples, and hence a valid relation.³ This is known as duplicate elimination. For example, consider the following PROJECT operation:

$\pi_{\text{SEX, SALARY}}(\text{EMPLOYEE})$

The result is shown in Figure 6.1c. Notice that the tuple <F, 25000> appears only once in Figure 6.1c, even though this combination of values appears twice in the EMPLOYEE relation.
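Duplicate elimination is the one step of PROJECT that is easy to overlook, so a short sketch may help. As before, this is an illustration under assumptions: relations are lists of dicts, the function name `project` is invented, and the eight reduced EMPLOYEE tuples are typed in by hand for the example.

```python
# Illustrative EMPLOYEE tuples, reduced to three attributes.
EMPLOYEE = [
    {"LNAME": "Smith", "SEX": "M", "SALARY": 30000},
    {"LNAME": "Wong", "SEX": "M", "SALARY": 40000},
    {"LNAME": "Zelaya", "SEX": "F", "SALARY": 25000},
    {"LNAME": "Wallace", "SEX": "F", "SALARY": 43000},
    {"LNAME": "Narayan", "SEX": "M", "SALARY": 38000},
    {"LNAME": "English", "SEX": "F", "SALARY": 25000},
    {"LNAME": "Jabbar", "SEX": "M", "SALARY": 25000},
    {"LNAME": "Borg", "SEX": "M", "SALARY": 55000},
]


def project(attribute_list, relation):
    """Keep only the listed attributes, in list order, and drop duplicates."""
    seen = set()
    result = []
    for t in relation:
        row = tuple(t[a] for a in attribute_list)
        if row not in seen:  # duplicate elimination
            seen.add(row)
            result.append(dict(zip(attribute_list, row)))
    return result


sex_salary = project(["SEX", "SALARY"], EMPLOYEE)
name_salary = project(["LNAME", "SALARY"], EMPLOYEE)
print(len(EMPLOYEE), len(sex_salary), len(name_salary))
```

Projecting on SEX and SALARY yields fewer tuples than EMPLOYEE because two employees share the combination F, 25000. Projecting on LNAME and SALARY loses nothing here, since in this sample LNAME happens to be unique.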
The number of tuples in a relation resulting from a PROJECT operation is always less than or equal to the number of tuples in R. If the projection list is a superkey of R (that is, it includes some key of R), the resulting relation has the same number of tuples as R. Moreover,

$\pi_{\langle\text{list1}\rangle}(\pi_{\langle\text{list2}\rangle}(R)) = \pi_{\langle\text{list1}\rangle}(R)$

as long as <list2> contains the attributes in <list1>; otherwise, the left-hand side is an incorrect expression. It is also noteworthy that commutativity does not hold on PROJECT.

3. If duplicates are not eliminated, the result would be a multiset or bag of tuples rather than a set. Although this is not allowed in the formal relation model, it is permitted in practice. We shall see in Chapter 8 that SQL allows the user to specify whether duplicates should be eliminated or not.

6.1.3 Sequences of Operations and the RENAME Operation

The relations shown in Figure 6.1 do not have any names. In general, we may want to apply several relational algebra operations one after the other. Either we can write the operations as a single relational algebra expression by nesting the operations, or we can apply one operation at a time and create intermediate result relations. In the latter case, we must give names to the relations that hold the intermediate results. For example, to retrieve the first name, last name, and salary of all employees who work in department number 5, we must apply a SELECT and a PROJECT operation. We can write a single relational algebra expression as follows:

$\pi_{\text{FNAME, LNAME, SALARY}}($ [...] $\bowtie_{\text{SSN}=\text{ESSN}}$ DEPENDENT

The general form of a JOIN operation on two relations⁴ $R(A_1, A_2, \ldots, A_n)$ and $S(B_1, B_2, \ldots, B_m)$ is

$R \bowtie_{\langle\text{join condition}\rangle} S$

The result of the JOIN is a relation Q with n + m attributes $Q(A_1, A_2, \ldots, A_n, B_1, B_2, \ldots, B_m)$ in that order; Q has one tuple for each combination of tuples, one from R and one from S, whenever the combination satisfies the join condition. This is the main difference between CARTESIAN PRODUCT and JOIN. In JOIN, only combinations of tuples satisfying the join condition appear in the result, whereas in the CARTESIAN PRODUCT all combinations of tuples are included in the result. The join condition is specified on attributes from the two relations R and S and is evaluated for each combination of tuples. Each tuple combination for which the join condition evaluates to TRUE is included in the resulting relation Q as a single combined tuple.

A general join condition is of the form

<condition> AND <condition> AND ... AND <condition>

where each <condition> compares an attribute of R with an attribute of S using one of the comparison operators {=, <, ≤, >, ≥, ≠}. A JOIN operation with such a general join condition is called a THETA JOIN. Tuples whose join attributes are null do not appear in the result. In that sense, the JOIN operation does not necessarily preserve all of the information in the participating relations.

FIGURE 6.6 Result of the JOIN operation $\text{DEPT\_MGR} \leftarrow \text{DEPARTMENT} \bowtie_{\text{MGRSSN}=\text{SSN}} \text{EMPLOYEE}$.

6.3.2 The EQUIJOIN and NATURAL JOIN Variations of JOIN

The most common use of JOIN involves join conditions with equality comparisons only. Such a JOIN, where the only comparison operator used is =, is called an EQUIJOIN. Both examples we have considered were EQUIJOINs. Notice that in the result of an EQUIJOIN we always have one or more pairs of attributes that have identical values in every tuple. For example, in Figure 6.6, the values of the attributes MGRSSN and SSN are identical in every tuple of DEPT_MGR because of the equality join condition specified on these two attributes.
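The relationship between JOIN and CARTESIAN PRODUCT described above can be sketched in a few lines. This is a minimal illustration, not an efficient join algorithm: it assumes relations are lists of dicts with distinct attribute names, the function name `theta_join` is invented, and the DEPARTMENT and EMPLOYEE tuples are reduced, hand-typed samples.

```python
from itertools import product

# Reduced, illustrative relations.
DEPARTMENT = [
    {"DNAME": "Research", "DNUMBER": 5, "MGRSSN": "333445555"},
    {"DNAME": "Administration", "DNUMBER": 4, "MGRSSN": "987654321"},
    {"DNAME": "Headquarters", "DNUMBER": 1, "MGRSSN": "888665555"},
]
EMPLOYEE = [
    {"FNAME": "Franklin", "SSN": "333445555"},
    {"FNAME": "Jennifer", "SSN": "987654321"},
    {"FNAME": "James", "SSN": "888665555"},
    {"FNAME": "John", "SSN": "123456789"},
]


def theta_join(r, s, condition):
    """Combine every pair of tuples from r and s that satisfies the condition.

    Assumes r and s have no attribute names in common, so merging the two
    dicts yields the n + m attributes of the result.
    """
    return [{**tr, **ts} for tr, ts in product(r, s) if condition(tr, ts)]


# EQUIJOIN on MGRSSN = SSN. The explicit None test keeps tuples with a null
# join attribute out of the result (in Python, None == None would be True).
dept_mgr = theta_join(
    DEPARTMENT,
    EMPLOYEE,
    lambda d, e: d["MGRSSN"] is not None and d["MGRSSN"] == e["SSN"],
)
for t in dept_mgr:
    print(t["DNAME"], t["MGRSSN"], t["SSN"], t["FNAME"])
```

Of the 12 tuple combinations in the Cartesian product, only the 3 that satisfy the condition survive, and in each of them MGRSSN and SSN carry the same value, which is the redundancy the text points out for EQUIJOIN results.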
Because one of each pair of attributes with identical values is superfluous, a new operation called NATURAL JOIN, denoted by *, was created to get rid of the second (superfluous) attribute in an EQUIJOIN condition.⁵ The standard definition of NATURAL JOIN requires that the two join attributes (or each pair of join attributes) have the same name in both relations. If this is not the case, a renaming operation is applied first. In the following example, we first rename the DNUMBER attribute of DEPARTMENT to DNUM (so that it has the same name as the DNUM attribute in PROJECT) and then apply NATURAL JOIN:

$\text{PROJ\_DEPT} \leftarrow \text{PROJECT} * \rho_{(\text{DNAME, DNUM, MGRSSN, MGRSTARTDATE})}(\text{DEPARTMENT})$

The same query can be done in two steps by creating an intermediate table DEPT as follows:

$\text{DEPT} \leftarrow \rho_{(\text{DNAME, DNUM, MGRSSN, MGRSTARTDATE})}(\text{DEPARTMENT})$
$\text{PROJ\_DEPT} \leftarrow \text{PROJECT} * \text{DEPT}$

The attribute DNUM is called the join attribute. The resulting relation is illustrated in Figure 6.7a. In the PROJ_DEPT relation, each tuple combines a PROJECT tuple with the DEPARTMENT tuple for the department that controls the project, but only one join attribute is kept.

If the attributes on which the natural join is specified already have the same names in both relations, renaming is unnecessary. For example, to apply a natural join on the DNUMBER attributes of DEPARTMENT and DEPT_LOCATIONS, it is sufficient to write

$\text{DEPT\_LOCS} \leftarrow \text{DEPARTMENT} * \text{DEPT\_LOCATIONS}$

The resulting relation is shown in Figure 6.7b, which combines each department with its locations and has one tuple for each location. In general, NATURAL JOIN is performed by equating all attribute pairs that have the same name in the two relations. There can be a list of join attributes from each relation, and each corresponding pair must have the same name.

5. NATURAL JOIN is basically an EQUIJOIN followed by removal of the superfluous attributes.
(a) PROJ_DEPT

PNAME            PNUMBER  PLOCATION  DNUM  DNAME           MGRSSN     MGRSTARTDATE
ProductX         1        Bellaire   5     Research        333445555  1988-05-22
ProductY         2        Sugarland  5     Research        333445555  1988-05-22
ProductZ         3        Houston    5     Research        333445555  1988-05-22
Computerization  10       Stafford   4     Administration  987654321  1995-01-01
Reorganization   20       Houston    1     Headquarters    888665555  1981-06-19
Newbenefits      30       Stafford   4     Administration  987654321  1995-01-01

(b) DEPT_LOCS

DNAME           DNUMBER  MGRSSN     MGRSTARTDATE  LOCATION
Headquarters    1        888665555  1981-06-19    Houston
Administration  4        987654321  1995-01-01    Stafford
Research        5        333445555  1988-05-22    Bellaire
Research        5        333445555  1988-05-22    Sugarland
Research        5        333445555  1988-05-22    Houston

FIGURE 6.7 Results of two NATURAL JOIN operations. (a) $\text{PROJ\_DEPT} \leftarrow \text{PROJECT} * \text{DEPT}$. (b) $\text{DEPT\_LOCS} \leftarrow \text{DEPARTMENT} * \text{DEPT\_LOCATIONS}$.

A more general but nonstandard definition for NATURAL JOIN is

$Q \leftarrow R *_{(\langle\text{list1}\rangle),(\langle\text{list2}\rangle)} S$

In this case, <list1> specifies a list of i attributes from R, and <list2> specifies a list of i attributes from S. The lists are used to form equality comparison conditions between pairs of corresponding attributes, and the conditions are then ANDed together. Only the list corresponding to attributes of the first relation R, <list1>, is kept in the result Q.

Notice that if no combination of tuples satisfies the join condition, the result of a JOIN is an empty relation with zero tuples. In general, if R has $n_R$ tuples and S has $n_S$ tuples, the result of a JOIN operation $R \bowtie_{\langle\text{join condition}\rangle} S$ will have between zero and $n_R * n_S$ tuples. The expected size of the join result divided by the maximum size $n_R * n_S$ leads to a ratio called join selectivity, which is a property of each join condition. If there is no join condition, all combinations of tuples qualify and the JOIN degenerates into a CARTESIAN PRODUCT, also called CROSS PRODUCT or CROSS JOIN.
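NATURAL JOIN and the size bounds just stated can be illustrated with a short sketch. The code assumes relations are non-empty lists of dicts that all share the same attribute names within a relation; the function name `natural_join` and the reduced DEPARTMENT tuples are assumptions made for the example, and the ratio computed at the end is the observed result size over the maximum size, offered only as a stand-in for the expected-size ratio the text calls join selectivity.

```python
def natural_join(r, s):
    """Equate all attributes with the same name in r and s; keep one copy."""
    if not r or not s:
        return []
    common = [a for a in r[0] if a in s[0]]
    return [
        {**tr, **ts}  # shared attributes hold equal values, so one copy survives
        for tr in r
        for ts in s
        if all(tr[a] == ts[a] for a in common)
    ]


# Reduced, illustrative relations sharing the attribute DNUMBER.
DEPARTMENT = [
    {"DNAME": "Headquarters", "DNUMBER": 1},
    {"DNAME": "Administration", "DNUMBER": 4},
    {"DNAME": "Research", "DNUMBER": 5},
]
DEPT_LOCATIONS = [
    {"DNUMBER": 1, "LOCATION": "Houston"},
    {"DNUMBER": 4, "LOCATION": "Stafford"},
    {"DNUMBER": 5, "LOCATION": "Bellaire"},
    {"DNUMBER": 5, "LOCATION": "Sugarland"},
    {"DNUMBER": 5, "LOCATION": "Houston"},
]

dept_locs = natural_join(DEPARTMENT, DEPT_LOCATIONS)
max_size = len(DEPARTMENT) * len(DEPT_LOCATIONS)
observed_ratio = len(dept_locs) / max_size
print(len(dept_locs), max_size, observed_ratio)
```

The result has one tuple per department location, well below the maximum of 15 combinations. When the two relations share no attribute name, the `all(...)` test over an empty list is always true, so the function returns every combination, which matches the remark that a join without a condition degenerates into a CARTESIAN PRODUCT.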
As we can see, the JOIN operation is used to combine data from multiple relations so that related information can be presented in a single table. These operations are also known as inner joins, to distinguish them from a different variation of join called outer joins (see Section 6.4.3). Note that sometimes a join may be specified between a relation and itself, as we shall illustrate in Section 6.4.2. The NATURAL JOIN or EQUIJOIN operation can also be specified among multiple tables, leading to an n-way join. For example, consider the following three-way join:

    ((PROJECT ⋈_{DNUM=DNUMBER} DEPARTMENT) ⋈_{MGRSSN=SSN} EMPLOYEE)

This links each project to its controlling department, and then relates the department to its manager employee. The net result is a consolidated relation in which each tuple contains this project-department-manager information.

6.3.3 A Complete Set of Relational Algebra Operations

It has been shown that the set of relational algebra operations {σ, π, ∪, −, ×} is a complete set; that is, any of the other original relational algebra operations can be expressed as a sequence of operations from this set. For example, the INTERSECTION operation can be expressed by using UNION and MINUS as follows:

    R ∩ S ≡ (R ∪ S) − ((R − S) ∪ (S − R))

Although, strictly speaking, INTERSECTION is not required, it is inconvenient to specify this complex expression every time we wish to specify an intersection. As another example, a JOIN operation can be specified as a CARTESIAN PRODUCT followed by a SELECT operation, as we discussed:

    R ⋈_{<condition>} S ≡ σ_{<condition>}(R × S)

Similarly, a NATURAL JOIN can be specified as a CARTESIAN PRODUCT preceded by RENAME and followed by SELECT and PROJECT operations. Hence, the various JOIN operations are also not strictly necessary for the expressive power of the relational algebra.
However, they are important to consider as separate operations because they are convenient to use and are very commonly applied in database applications. Other operations have been included in the relational algebra for convenience rather than necessity. We discuss one of these, the DIVISION operation, in the next section.

6.3.4 The DIVISION Operation

The DIVISION operation, denoted by ÷, is useful for a special kind of query that sometimes occurs in database applications. An example is "Retrieve the names of employees who work on all the projects that 'John Smith' works on." To express this query using the DIVISION operation, proceed as follows. First, retrieve the list of project numbers that 'John Smith' works on in the intermediate relation SMITH_PNOS:

    SMITH ← σ_{FNAME='John' AND LNAME='Smith'}(EMPLOYEE)
    SMITH_PNOS ← π_{PNO}(WORKS_ON ⋈_{ESSN=SSN} SMITH)

Next, create a relation that includes a tuple <PNO, ESSN> whenever the employee whose social security number is ESSN works on the project whose number is PNO in the intermediate relation SSN_PNOS:

    SSN_PNOS ← π_{ESSN, PNO}(WORKS_ON)

Finally, apply the DIVISION operation to the two relations, which gives the desired employees' social security numbers:

    SSNS(SSN) ← SSN_PNOS ÷ SMITH_PNOS
    RESULT ← π_{FNAME, LNAME}(SSNS * EMPLOYEE)

The previous operations are shown in Figure 6.8a.

(a) SSN_PNOS

    ESSN       PNO
    123456789  1
    123456789  2
    666884444  3
    453453453  1
    453453453  2
    333445555  2
    333445555  3
    333445555  10
    333445555  20
    999887777  30
    999887777  10
    987987987  10
    987987987  30
    987654321  30
    987654321  20
    888665555  20

    SMITH_PNOS

    PNO
    1
    2

    SSNS

    SSN
    123456789
    453453453

(b) R

    A   B
    a1  b1
    a2  b1
    a3  b1
    a4  b1
    a1  b2
    a3  b2
    a2  b3
    a3  b3
    a4  b3
    a1  b4
    a2  b4
    a3  b4

    S

    A
    a1
    a2
    a3

    T

    B
    b1
    b4

FIGURE 6.8 The DIVISION operation. (a) Dividing SSN_PNOS by SMITH_PNOS. (b) T ← R ÷ S.
In general, the DIVISION operation is applied to two relations R(Z) ÷ S(X), where X ⊆ Z. Let Y = Z − X (and hence Z = X ∪ Y); that is, let Y be the set of attributes of R that are not attributes of S. The result of DIVISION is a relation T(Y) that includes a tuple t if tuples t_R appear in R with t_R[Y] = t, and with t_R[X] = t_S for every tuple t_S in S. This means that, for a tuple t to appear in the result T of the DIVISION, the values in t must appear in R in combination with every tuple in S. Note that in the formulation of the DIVISION operation, the tuples in the denominator relation restrict the numerator relation by selecting those tuples in the result that match all values present in the denominator. It is not necessary to know what those values are.

Figure 6.8b illustrates a DIVISION operation where X = {A}, Y = {B}, and Z = {A, B}. Notice that the tuples (values) b1 and b4 appear in R in combination with all three tuples in S; that is why they appear in the resulting relation T. All other values of B in R do not appear with all the tuples in S and are not selected: b2 does not appear with a2, and b3 does not appear with a1.

The DIVISION operation can be expressed as a sequence of π, ×, and − operations as follows:

    T1 ← π_Y(R)
    T2 ← π_Y((S × T1) − R)
    T ← T1 − T2

The DIVISION operation is defined for convenience for dealing with queries that involve "universal quantification" (see Section 6.6.6) or the all condition. Most RDBMS implementations with SQL as the primary query language do not directly implement division. SQL has a roundabout way of dealing with the type of query illustrated above (see Section 8.5.4). Table 6.1 lists the various basic relational algebra operations we have discussed.
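The three-step expression for DIVISION translates almost line by line into Python sets. The following is a minimal sketch under simplifying assumptions: R is modeled as a set of (y, x) pairs with a single Y attribute and a single X attribute, S as a set of x values, and the function name `divide` is hypothetical. The data are the SSN_PNOS and SMITH_PNOS relations of Figure 6.8a.

```python
def divide(r, s):
    """Compute R(Y, X) ÷ S(X) for a relation r of (y, x) pairs and a set s of x values."""
    t1 = {y for (y, _x) in r}                           # T1 <- projection of R on Y
    t2 = {y for y in t1 for x in s if (y, x) not in r}  # T2 <- projection on Y of (S x T1) - R
    return t1 - t2                                      # T  <- T1 - T2


# SSN_PNOS(ESSN, PNO) and SMITH_PNOS(PNO), as in Figure 6.8a.
SSN_PNOS = {
    ("123456789", 1), ("123456789", 2), ("666884444", 3),
    ("453453453", 1), ("453453453", 2),
    ("333445555", 2), ("333445555", 3), ("333445555", 10), ("333445555", 20),
    ("999887777", 30), ("999887777", 10),
    ("987987987", 10), ("987987987", 30),
    ("987654321", 30), ("987654321", 20),
    ("888665555", 20),
}
SMITH_PNOS = {1, 2}

SSNS = divide(SSN_PNOS, SMITH_PNOS)
```

T2 collects every Y value that is missing at least one required combination, so subtracting it from T1 leaves exactly the values paired with every tuple of S. With the data above, SSNS contains the two social security numbers shown in Figure 6.8a.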
6.4 ADDITIONAL RELATIONAL OPERATIONS

Some common database requests, which are needed in commercial query languages for RDBMSs, cannot be performed with the original relational algebra operations described in Sections 6.1 through 6.3. In this section we define additional operations to express these requests. These operations enhance the expressive power of the original relational algebra.

6.4.1 Aggregate Functions and Grouping

The first type of request that cannot be expressed in the basic relational algebra is to specify mathematical aggregate functions on collections of values from the database. Examples of such functions include retrieving the average or total salary of all employees or the total number of employee tuples. These functions are used in simple statistical queries that summarize information from the database tuples. Common functions applied to collections of numeric values include SUM, AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or values.

TABLE 6.1 OPERATIONS OF RELATIONAL ALGEBRA

SELECT
    Notation: σ_{<selection condition>}(R)
    Purpose: Selects all tuples that satisfy the selection condition from a relation R.

PROJECT
    Notation: π_{<attribute list>}(R)
    Purpose: Produces a new relation with only some of the attributes of R, and removes duplicate tuples.

THETA JOIN
    Notation: R1 ⋈_{<join condition>} R2
    Purpose: Produces all combinations of tuples from R1 and R2 that satisfy the join condition.

EQUIJOIN
    Notation: R1 ⋈_{<join condition>} R2, OR R1 ⋈_{(<join attributes 1>),(<join attributes 2>)} R2
    Purpose: Produces all the combinations of tuples from R1 and R2 that satisfy a join condition with only equality comparisons.

NATURAL JOIN
    Notation: R1 *_{<join condition>} R2, OR R1 *_{(<join attributes 1>),(<join attributes 2>)} R2, OR R1 * R2
    Purpose: Same as EQUIJOIN except that the join attributes of R2 are not included in the resulting relation; if the join attributes have the same names, they do not have to be specified at all.

UNION
    Notation: R1 ∪ R2
    Purpose: Produces a relation that includes all the tuples in R1 or R2 or both R1 and R2; R1 and R2 must be union compatible.

INTERSECTION
    Notation: R1 ∩ R2
    Purpose: Produces a relation that includes all the tuples in both R1 and R2; R1 and R2 must be union compatible.

DIFFERENCE
    Notation: R1 − R2
    Purpose: Produces a relation that includes all the tuples in R1 that are not in R2; R1 and R2 must be union compatible.

CARTESIAN PRODUCT
    Notation: R1 × R2
    Purpose: Produces a relation that has the attributes of R1 and R2 and includes as tuples all possible combinations of tuples from R1 and R2.

DIVISION
    Notation: R1(Z) ÷ R2(Y)
    Purpose: Produces a relation R(X) that includes all tuples t[X] in R1(Z) that appear in R1 in combination with every tuple from R2(Y), where Z = X ∪ Y.

Another common type of request involves grouping the tuples in a relation by the value of some of their attributes and then applying an aggregate function independently to each group. An example would be to group employee tuples by DNO, so that each group includes the tuples for employees working in the same department. We can then list each DNO value along with, say, the average salary of employees within the department, or the number of employees who work in the department.

We can define an AGGREGATE FUNCTION operation, using the symbol ℱ (pronounced "script F"),⁶ to specify these types of requests as follows:

    <grouping attributes> ℱ <function list> (R)

where <grouping attributes> is a list of attributes of the relation specified in R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed functions, such as SUM, AVERAGE, MAXIMUM, MINIMUM, COUNT, and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list.

6. There is no single agreed-upon notation for specifying aggregate functions. In some cases a "script A" is used.
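Grouping followed by per-group aggregation can be sketched in Python as follows. The function `group_aggregate`, the `AGGREGATES` table, and the way result attributes are named (function name, underscore, attribute name) are assumptions of this sketch, and the EMPLOYEE tuples are illustrative sample values reduced to three attributes.

```python
from collections import defaultdict

# The allowed aggregate functions, keyed by the names used in the text.
AGGREGATES = {
    "SUM": sum,
    "AVERAGE": lambda values: sum(values) / len(values),
    "MAXIMUM": max,
    "MINIMUM": min,
    "COUNT": len,  # duplicates are not eliminated
}


def group_aggregate(relation, grouping_attributes, function_list):
    """Group tuples by `grouping_attributes`, then apply each (function, attribute) pair."""
    groups = defaultdict(list)
    for t in relation:
        groups[tuple(t[a] for a in grouping_attributes)].append(t)

    result = []
    for key, tuples in groups.items():
        row = dict(zip(grouping_attributes, key))
        for function, attribute in function_list:
            values = [t[attribute] for t in tuples]
            row[f"{function}_{attribute}"] = AGGREGATES[function](values)
        result.append(row)
    return result


# Illustrative EMPLOYEE tuples (sample values assumed for this sketch).
EMPLOYEE = [
    {"SSN": "123456789", "SALARY": 30000, "DNO": 5},
    {"SSN": "333445555", "SALARY": 40000, "DNO": 5},
    {"SSN": "999887777", "SALARY": 25000, "DNO": 4},
    {"SSN": "987654321", "SALARY": 43000, "DNO": 4},
    {"SSN": "666884444", "SALARY": 38000, "DNO": 5},
    {"SSN": "453453453", "SALARY": 25000, "DNO": 5},
    {"SSN": "987987987", "SALARY": 25000, "DNO": 4},
    {"SSN": "888665555", "SALARY": 55000, "DNO": 1},
]

# Group by DNO, then count SSN values and average SALARY within each group.
per_department = group_aggregate(EMPLOYEE, ["DNO"], [("COUNT", "SSN"), ("AVERAGE", "SALARY")])
by_dno = {row["DNO"]: row for row in per_department}

# With no grouping attributes, the functions apply to all tuples at once.
overall = group_aggregate(EMPLOYEE, [], [("COUNT", "SSN")])
```

The result is again a list of tuples (a relation), with the grouping attribute plus one attribute per function, even when only a single row is produced.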
For example, to retrieve each department number, the number of employees in the department, and their average salary, while renaming the resulting attributes as indicated below, we write:

    ρ_{R(DNO, NO_OF_EMPLOYEES, AVERAGE_SAL)}(DNO ℱ COUNT SSN, AVERAGE SALARY (EMPLOYEE))

The result of this operation is shown in Figure 6.9a.

In the above example, we specified a list of attribute names (between parentheses in the RENAME operation) for the resulting relation R. If no renaming is applied, then the attributes of the resulting relation that correspond to the function list will each be the concatenation of the function name with the attribute name in the form <function>_<attribute>.⁷

[Figure 6.9: results of AGGREGATE FUNCTION operations on EMPLOYEE, panels (a) through (c); panel contents not recoverable.]

7. Note that this is an arbitrary notation we are suggesting. There is no standard notation.

It is important to note that, in general, duplicates are not eliminated when an aggregate function is applied; this way, the normal interpretation of functions such as SUM and AVERAGE is computed. It is worth emphasizing that the result of applying an aggregate function is a relation, not a scalar number, even if it has a single value. This makes the relational algebra a closed system.

6.4.2 Recursive Closure Operations

Another type of operation that, in general, cannot be specified in the basic original relational algebra is recursive closure. This operation is applied to a recursive relationship between tuples of the same type, such as the relationship between an employee and a supervisor. This relationship is described by the foreign key SUPERSSN of the EMPLOYEE relation in Figures 5.5 and 5.6, and it relates each employee tuple (in the role of supervisee) to another employee tuple (in the role of supervisor). An example of a recursive operation is to retrieve all supervisees of an employee e at all levels; that is, all employees e′ directly supervised by e, all employees e″ directly supervised by each employee e′, all employees e‴ directly supervised by each employee e″, and so on. Although it is straightforward in the relational algebra to specify all employees supervised by e at a specific level, it is difficult to specify all supervisees at all levels. For example, to specify the SSNs of all employees e′ directly supervised (at level one) by the employee e whose name is 'James Borg' (see Figure 5.6), we can apply the following operation:

    BORG_SSN ← π_{SSN}( [...]

[...] keeps all tuples in both the left and the right relations when no matching tuples are found, padding them with null values as needed. The three outer join operations are part of the SQL2 standard (see Chapter 8).

6.4.4 The OUTER UNION Operation

The OUTER UNION operation was developed to take the union of tuples from two relations if the relations are not union compatible. This operation will take the UNION of tuples in two relations R(X, Y) and S(X, Z) that are partially compatible, meaning that only some of their attributes, say X, are union compatible. The attributes that are union compatible are represented only once in the result, and those attributes that are not union compatible from either relation are also kept in the result relation T(X, Y, Z).

Two tuples t1 in R and t2 in S are said to match if t1[X] = t2[X], and are considered to represent the same entity or relationship instance. These will be combined (unioned) into a single tuple in T. Tuples in either relation that have no matching tuple in the other relation are padded with null values. For example, an OUTER UNION can be applied to two relations whose schemas are STUDENT(Name, SSN, Department, Advisor) and INSTRUCTOR(Name, SSN, Department, Rank).
Tuples from the two relations are matched based on having the same combination of values of the shared attributes: Name, SSN, Department. The result relation, STUDENT_OR_INSTRUCTOR, will have the following attributes:

    STUDENT_OR_INSTRUCTOR(Name, SSN, Department, Advisor, Rank)

All the tuples from both relations are included in the result, but tuples with the same (Name, SSN, Department) combination will appear only once in the result. Tuples appearing only in STUDENT will have a null for the Rank attribute, whereas tuples appearing only in INSTRUCTOR will have a null for the Advisor attribute. A tuple that exists in both relations, such as a student who is also an instructor, will have values for all its attributes.¹⁰

RESULT

    FNAME     MINIT  LNAME    DNAME
    John      B      Smith    null
    Franklin  T      Wong     Research
    Alicia    J      Zelaya   null
    Jennifer  S      Wallace  Administration
    Ramesh    K      Narayan  null
    Joyce     A      English  null
    Ahmad     V      Jabbar   null
    James     E      Borg     Headquarters

FIGURE 6.11 The result of a LEFT OUTER JOIN operation.

Notice that the same person may still appear twice in the result. For example, we could have a graduate student in the Mathematics department who is an instructor in the Computer Science department. Although the two tuples representing that person in STUDENT and INSTRUCTOR will have the same (Name, SSN) values, they will not agree on the Department value, and so will not be matched. This is because Department has two separate meanings in STUDENT (the department where the person studies) and INSTRUCTOR (the department where the person is employed as an instructor). If we wanted to union persons based on the same (Name, SSN) combination only, we should rename the Department attribute in each table to reflect that they have different meanings, and designate them as not being part of the union-compatible attributes.
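The matching and null-padding behavior of OUTER UNION can be sketched in Python. The function name `outer_union` is hypothetical, null is represented by `None`, and the sketch assumes that each combination of shared-attribute values occurs at most once within each input relation. The people and values below are invented purely for illustration.

```python
def outer_union(r, s, r_attrs, s_attrs):
    """Union two partially compatible relations, matching on their shared attributes."""
    shared = [a for a in r_attrs if a in s_attrs]
    all_attrs = list(r_attrs) + [a for a in s_attrs if a not in r_attrs]

    merged = {}
    for t in list(r) + list(s):
        key = tuple(t[a] for a in shared)
        # Start from an all-null tuple, then fill in whatever this tuple provides.
        row = merged.setdefault(key, dict.fromkeys(all_attrs))
        row.update(t)
    return list(merged.values())


STUDENT_ATTRS = ("Name", "SSN", "Department", "Advisor")
INSTRUCTOR_ATTRS = ("Name", "SSN", "Department", "Rank")

# Hypothetical sample tuples.
STUDENT = [
    {"Name": "Ann", "SSN": "111", "Department": "Mathematics", "Advisor": "Lee"},
    {"Name": "Bob", "SSN": "222", "Department": "Computer Science", "Advisor": "Kim"},
]
INSTRUCTOR = [
    {"Name": "Bob", "SSN": "222", "Department": "Computer Science", "Rank": "Lecturer"},
    {"Name": "Ann", "SSN": "111", "Department": "Computer Science", "Rank": "Instructor"},
    {"Name": "Cy", "SSN": "333", "Department": "Computer Science", "Rank": "Professor"},
]

STUDENT_OR_INSTRUCTOR = outer_union(STUDENT, INSTRUCTOR, STUDENT_ATTRS, INSTRUCTOR_ATTRS)
by_key = {(t["Name"], t["Department"]): t for t in STUDENT_OR_INSTRUCTOR}
```

Here Bob matches on (Name, SSN, Department) and ends up with both an Advisor and a Rank, whereas Ann appears twice because her two tuples disagree on Department.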
Another capability that exists in most commercial languages (but not in the basic relational algebra) is that of specifying operations on values after they are extracted from the database. For example, arithmetic operations such as +, -, and * can be applied to numeric values that appear in the result of a query.

10. Notice that OUTER UNION is equivalent to a FULL OUTER JOIN if the join attributes are all the common attributes of the two relations.

6.5 EXAMPLES OF QUERIES IN RELATIONAL ALGEBRA

We now give additional examples to illustrate the use of the relational algebra operations. All examples refer to the database of Figure 5.6. In general, the same query can be stated in numerous ways using the various operations. We will state each query in one way and leave it to the reader to come up with equivalent formulations.

QUERY 1
Retrieve the name and address of all employees who work for the 'Research' department.

    RESEARCH_DEPT ← σ_{DNAME='Research'}(DEPARTMENT)
    RESEARCH_EMPS ← (RESEARCH_DEPT ⋈_{DNUMBER=DNO} EMPLOYEE)
    RESULT ← π_{FNAME, LNAME, ADDRESS}(RESEARCH_EMPS)

This query could be specified in other ways; for example, the order of the JOIN and SELECT operations could be reversed, or the JOIN could be replaced by a NATURAL JOIN after renaming one of the join attributes.

QUERY 2
For every project located in 'Stafford', list the project number, the controlling department number, and the department manager's last name, address, and birth date.

    STAFFORD_PROJS ← σ_{PLOCATION='Stafford'}(PROJECT)
    CONTR_DEPT ← (STAFFORD_PROJS ⋈_{DNUM=DNUMBER} DEPARTMENT)
    PROJ_DEPT_MGR ← (CONTR_DEPT ⋈_{MGRSSN=SSN} EMPLOYEE)
    RESULT ← π_{PNUMBER, DNUM, LNAME, ADDRESS, BDATE}(PROJ_DEPT_MGR)

QUERY 3
Find the names of employees who work on all the projects controlled by department number 5.
    DEPT5_PROJS(PNO) ← π_{PNUMBER}(σ_{DNUM=5}(PROJECT))
    EMP_PROJ(SSN, PNO) ← π_{ESSN, PNO}(WORKS_ON)
    RESULT_EMP_SSNS ← EMP_PROJ ÷ DEPT5_PROJS
    RESULT ← π_{LNAME, FNAME}(RESULT_EMP_SSNS * EMPLOYEE)

QUERY 4
Make a list of project numbers for projects that involve an employee whose last name is 'Smith', either as a worker or as a manager of the department that controls the project.

    SMITHS(ESSN) ← π_{SSN}(σ_{LNAME='Smith'}(EMPLOYEE))
    SMITH_WORKER_PROJS ← π_{PNO}(WORKS_ON * SMITHS)
    MGRS ← π_{LNAME, DNUMBER}(EMPLOYEE ⋈_{SSN=MGRSSN} DEPARTMENT)
    SMITH_MANAGED_DEPTS(DNUM) ← π_{DNUMBER}(σ_{LNAME='Smith'}(MGRS))
    SMITH_MGR_PROJS(PNO) ← π_{PNUMBER}(SMITH_MANAGED_DEPTS * PROJECT)
    RESULT ← (SMITH_WORKER_PROJS ∪ SMITH_MGR_PROJS)

QUERY 5
List the names of all employees with two or more dependents.

Strictly speaking, this query cannot be done in the basic (original) relational algebra. We have to use the AGGREGATE FUNCTION operation with the COUNT aggregate function. We assume that dependents of the same employee have distinct DEPENDENT_NAME values.

    T1(SSN, NO_OF_DEPTS) ← ESSN ℱ COUNT DEPENDENT_NAME (DEPENDENT)
    T2 ← σ_{NO_OF_DEPTS≥2}(T1)
    RESULT ← π_{LNAME, FNAME}(T2 * EMPLOYEE)

QUERY 6
Retrieve the names of employees who have no dependents.

This is an example of the type of query that uses the MINUS (SET DIFFERENCE) operation.

    ALL_EMPS ← π_{SSN}(EMPLOYEE)
    EMPS_WITH_DEPS(SSN) ← π_{ESSN}(DEPENDENT)
    EMPS_WITHOUT_DEPS ← (ALL_EMPS − EMPS_WITH_DEPS)
    RESULT ← π_{LNAME, FNAME}(EMPS_WITHOUT_DEPS * EMPLOYEE)

QUERY 7
List the names of managers who have at least one dependent.

    MGRS(SSN) ← π_{MGRSSN}(DEPARTMENT)
    EMPS_WITH_DEPS(SSN) ← π_{ESSN}(DEPENDENT)
    MGRS_WITH_DEPS ← (MGRS ∩ EMPS_WITH_DEPS)
    RESULT ← π_{LNAME, FNAME}(MGRS_WITH_DEPS * EMPLOYEE)

As we mentioned earlier, the same query can in general be specified in many different ways. For example, the operations can often be applied in various orders.
In addition, some operations can be used to replace others; for example, the INTERSECTION operation in Query 7 can be replaced by a NATURAL JOIN. As an exercise, try to do each of the above example queries using different operations.¹¹ In Chapter 8 and in Sections 6.6 and 6.7, we show how these queries are written in other relational languages.

6.6 THE TUPLE RELATIONAL CALCULUS

In this and the next section, we introduce another formal query language for the relational model called relational calculus. In relational calculus, we write one declarative expression to specify a retrieval request, and hence there is no description of how to evaluate a query. A calculus expression specifies what is to be retrieved rather than how to retrieve it. Therefore, the relational calculus is considered to be a nonprocedural language. This differs from relational algebra, where we must write a sequence of operations to specify a retrieval request; hence, it can be considered as a procedural way of stating a query. It is possible to nest algebra operations to form a single expression; however, a certain order among the operations is always explicitly specified in a relational algebra expression. This order also influences the strategy for evaluating the query. A calculus expression may be written in different ways, but the way it is written has no bearing on how a query should be evaluated.

11. When queries are optimized (see Chapter 15), the system will choose a particular sequence of operations that corresponds to an execution strategy that can be executed efficiently.

It has been shown that any retrieval that can be specified in the basic relational algebra can also be specified in relational calculus, and vice versa; in other words, the expressive power of the two languages is identical. This led to the definition of the concept of a relationally complete language.
A relational query language L is considered relationally complete if we can express in L any query that can be expressed in relational calculus. Relational completeness has become an important basis for comparing the expressive power of high-level query languages. However, as we saw in Section 6.4, certain frequently required queries in database applications cannot be expressed in basic relational algebra or calculus. Most relational query languages are relationally complete but have more expressive power than relational algebra or relational calculus because of additional operations such as aggregate functions, grouping, and ordering.

In this section and the next, all our examples again refer to the database shown in Figures 5.6 and 5.7. We will use the same queries that were used in Section 6.5. Sections 6.6.5 and 6.6.6 discuss dealing with universal quantifiers and may be skipped by students interested in a general introduction to tuple calculus.

6.6.1 Tuple Variables and Range Relations

The tuple relational calculus is based on specifying a number of tuple variables. Each tuple variable usually ranges over a particular database relation, meaning that the variable may take as its value any individual tuple from that relation. A simple tuple relational calculus query is of the form

    {t | COND(t)}

where t is a tuple variable and COND(t) is a conditional expression involving t. The result of such a query is the set of all tuples t that satisfy COND(t). For example, to find all employees whose salary is above $50,000, we can write the following tuple calculus expression:

    {t | EMPLOYEE(t) AND t.SALARY>50000}

The condition EMPLOYEE(t) specifies that the range relation of tuple variable t is EMPLOYEE. Each EMPLOYEE tuple t that satisfies the condition t.SALARY>50000 will be retrieved.
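A tuple calculus expression of this form reads much like a set comprehension, which a short Python sketch can make concrete. The `Employee` named tuple and the three sample tuples are assumptions made for illustration, with the relation reduced to three attributes.

```python
from collections import namedtuple

# A reduced EMPLOYEE schema with illustrative sample tuples.
Employee = namedtuple("Employee", ["FNAME", "LNAME", "SALARY"])

EMPLOYEE = {
    Employee("John", "Smith", 30000),
    Employee("Franklin", "Wong", 40000),
    Employee("James", "Borg", 55000),
}

# {t | EMPLOYEE(t) AND t.SALARY > 50000}
# "for t in EMPLOYEE" plays the role of the range condition EMPLOYEE(t);
# the "if" clause is the remaining part of COND(t).
high_paid = {t for t in EMPLOYEE if t.SALARY > 50000}
```

Like the calculus expression, the comprehension states which tuples qualify without prescribing an evaluation order, although Python will of course evaluate it by iterating over the set.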
Notice that t.SALARY references attribute SALARY of tuple variable t; this notation resembles how attribute names are qualified with relation names or aliases in SQL, as we shall see in Chapter 8. In the notation of Chapter 5, t.SALARY is the same as writing t[SALARY].

The above query retrieves all attribute values for each selected EMPLOYEE tuple t. To retrieve only some of the attributes (say, the first and last names) we write

    {t.FNAME, t.LNAME | EMPLOYEE(t) AND t.SALARY>50000}

Informally, we need to specify the following information in a tuple calculus expression:

• For each tuple variable t, the range relation R of t. This value is specified by a condition of the form R(t).
• A condition to select particular combinations of tuples. As tuple variables range over their respective range relations, the condition is evaluated for every possible combination of tuples to identify the selected combinations for which the condition evaluates to TRUE.
• A set of attributes to be retrieved, the requested attributes. The values of these attributes are retrieved for each selected combination of tuples.

Before we discuss the formal syntax of tuple relational calculus, consider another query.

QUERY 0
Retrieve the birth date and address of the employee (or employees) whose name is 'John B. Smith'.

    Q0: {t.BDATE, t.ADDRESS | EMPLOYEE(t) AND t.FNAME='John' AND t.MINIT='B' AND t.LNAME='Smith'}

In tuple relational calculus, we first specify the requested attributes t.BDATE and t.ADDRESS for each selected tuple t. Then we specify the condition for selecting a tuple following the bar ( | ), namely, that t be a tuple of the EMPLOYEE relation whose FNAME, MINIT, and LNAME attribute values are 'John', 'B', and 'Smith', respectively.

6.6.2 Expressions and Formulas in Tuple Relational Calculus

A general expression of the tuple relational calculus is of the form

    {t_1.A_j, t_2.A_k, ..., t_n.A_m | COND(t_1, t_2, ..., t_n, t_{n+1}, t_{n+2}, ..., t_{n+m})}

where t_1, t_2, ..., t_n, t_{n+1}, ..., t_{n+m} are tuple variables, each A_i is an attribute of the relation on which t_i ranges, and COND is a condition or formula¹² of the tuple relational calculus. A formula is made up of predicate calculus atoms, which can be one of the following:

1. An atom of the form R(t_i), where R is a relation name and t_i is a tuple variable. This atom identifies the range of the tuple variable t_i as the relation whose name is R.
2. An atom of the form t_i.A op t_j.B, where op is one of the comparison operators in the set {=, <, ≤, >, ≥, ≠}, t_i and t_j are tuple variables, A is an attribute of the relation on which t_i ranges, and B is an attribute of the relation on which t_j ranges.
3. An atom of the form t_i.A op c or c op t_j.B, where op is one of the comparison operators in the set {=, <, ≤, >, ≥, ≠}, t_i and t_j are tuple variables, A is an attribute of the relation on which t_i ranges, B is an attribute of the relation on which t_j ranges, and c is a constant value.

12. Also called a well-formed formula, or wff, in mathematical logic.

Each of the preceding atoms evaluates to either TRUE or FALSE for a specific combination of tuples; this is called the truth value of an atom. In general, a tuple variable t ranges over all possible tuples "in the universe." For atoms of the form R(t), if t is assigned to a tuple that is a member of the specified relation R, the atom is TRUE; otherwise, it is FALSE. In atoms of types 2 and 3, if the tuple variables are assigned to tuples such that the values of the specified attributes of the tuples satisfy the condition, then the atom is TRUE.

A formula (condition) is made up of one or more atoms connected via the logical operators AND, OR, and NOT and is defined recursively as follows:

1. Every atom is a formula.
2. If F1 and F2 are formulas, then so are (F1 AND F2), (F1 OR F2), NOT(F1), and NOT(F2). The truth values of these formulas are derived from their component formulas F1 and F2 as follows:
   a. (F1 AND F2) is TRUE if both F1 and F2 are TRUE; otherwise, it is FALSE.
   b. (F1 OR F2) is FALSE if both F1 and F2 are FALSE; otherwise, it is TRUE.
   c. NOT(F1) is TRUE if F1 is FALSE; it is FALSE if F1 is TRUE.
   d. NOT(F2) is TRUE if F2 is FALSE; it is FALSE if F2 is TRUE.

6.6.3 The Existential and Universal Quantifiers

In addition, two special symbols called quantifiers can appear in formulas; these are the universal quantifier (∀) and the existential quantifier (∃). Truth values for formulas with quantifiers are described in rules 3 and 4 below; first, however, we need to define the concepts of free and bound tuple variables in a formula. Informally, a tuple variable t is bound if it is quantified, meaning that it appears in an (∃t) or (∀t) clause; otherwise, it is free. Formally, we define a tuple variable in a formula as free or bound according to the following rules:

• An occurrence of a tuple variable in a formula F that is an atom is free in F.
• An occurrence of a tuple variable t is free or bound in a formula made up of logical connectives, that is, (F1 AND F2), (F1 OR F2), NOT(F1), and NOT(F2), depending on whether it is free or bound in F1 or F2 (if it occurs in either). Notice that in a formula of the form F = (F1 AND F2) or F = (F1 OR F2), a tuple variable may be free in F1 and bound in F2, or vice versa; in this case, one occurrence of the tuple variable is bound and the other is free in F.
• All free occurrences of a tuple variable t in F are bound in a formula F' of the form F' = (∃t)(F) or F' = (∀t)(F). The tuple variable is bound to the quantifier specified in F'. For example, consider the following formulas:

    F1: d.DNAME='Research'
    F2: (∃t)(d.DNUMBER=t.DNO)
    F3: (∀d)(d.MGRSSN='333445555')

The tuple variable d is free in both F1 and F2, whereas it is bound to the (∀) quantifier in F3. Variable t is bound to the (∃) quantifier in F2.

We can now give rules 3 and 4 for the definition of a formula we started earlier:

3. If F is a formula, then so is (∃t)(F), where t is a tuple variable. The formula (∃t)(F) is TRUE if the formula F evaluates to TRUE for some (at least one) tuple assigned to free occurrences of t in F; otherwise, (∃t)(F) is FALSE.
4. If F is a formula, then so is (∀t)(F), where t is a tuple variable. The formula (∀t)(F) is TRUE if the formula F evaluates to TRUE for every tuple (in the universe) assigned to free occurrences of t in F; otherwise, (∀t)(F) is FALSE.

The (∃) quantifier is called an existential quantifier because a formula (∃t)(F) is TRUE if "there exists" some tuple that makes F TRUE. For the universal quantifier, (∀t)(F) is TRUE if every possible tuple that can be assigned to free occurrences of t in F is substituted for t, and F is TRUE for every such substitution. It is called the universal or "for all" quantifier because every tuple in "the universe of" tuples must make F TRUE to make the quantified formula TRUE.

6.6.4 Example Queries Using the Existential Quantifier

We will use some of the same queries from Section 6.5 to give a flavor of how the same queries are specified in relational algebra and in relational calculus. Notice that some queries are easier to specify in the relational algebra than in the relational calculus, and vice versa.

QUERY 1
Retrieve the name and address of all employees who work for the 'Research' department.

    Q1: {t.FNAME, t.LNAME, t.ADDRESS | EMPLOYEE(t) AND (∃d)(DEPARTMENT(d) AND d.DNAME='Research' AND d.DNUMBER=t.DNO)}

The only free tuple variables in a relational calculus expression should be those that appear to the left of the bar ( | ). In Q1, t is the only free variable; it is then bound successively to each tuple.
If a tuple satisfies the conditions specified in Q1, the attributes FNAME, LNAME, and ADDRESS are retrieved for each such tuple. The conditions EMPLOYEE(t) and DEPARTMENT(d) specify the range relations for t and d. The condition d.DNAME = 'Research' is a selection condition and corresponds to a SELECT operation in the relational algebra, whereas the condition d.DNUMBER = t.DNO is a join condition and serves a similar purpose to the JOIN operation (see Section 6.3).

QUERY 2
For every project located in 'Stafford', list the project number, the controlling department number, and the department manager's last name, birth date, and address.

Q2: {p.PNUMBER, p.DNUM, m.LNAME, m.BDATE, m.ADDRESS | PROJECT(p) AND EMPLOYEE(m) AND p.PLOCATION='Stafford' AND ((∃d)(DEPARTMENT(d) AND p.DNUM=d.DNUMBER AND d.MGRSSN=m.SSN))}

In Q2 there are two free tuple variables, p and m. Tuple variable d is bound to the existential quantifier. The query condition is evaluated for every combination of tuples assigned to p and m; and out of all possible combinations of tuples to which p and m are bound, only the combinations that satisfy the condition are selected.

Several tuple variables in a query can range over the same relation. For example, to specify the query Q8 (for each employee, retrieve the employee's first and last name and the first and last name of his or her immediate supervisor), we specify two tuple variables e and s that both range over the EMPLOYEE relation:

Q8: {e.FNAME, e.LNAME, s.FNAME, s.LNAME | EMPLOYEE(e) AND EMPLOYEE(s) AND e.SUPERSSN=s.SSN}

QUERY 3'
Find the name of each employee who works on some project controlled by department number 5. This is a variation of query 3 in which "all" is changed to "some." In this case we need two join conditions and two existential quantifiers.

Q3': {e.LNAME, e.FNAME | EMPLOYEE(e) AND ((∃x)(∃w)(PROJECT(x) AND WORKS_ON(w) AND x.DNUM=5 AND w.ESSN=e.SSN AND x.PNUMBER=w.PNO))}

QUERY 4
Make a list of project numbers for projects that involve an employee whose last name is 'Smith', either as a worker or as manager of the controlling department for the project.

Q4: {p.PNUMBER | PROJECT(p) AND (((∃e)(∃w)(EMPLOYEE(e) AND WORKS_ON(w) AND w.PNO=p.PNUMBER AND e.LNAME='Smith' AND e.SSN=w.ESSN)) OR ((∃m)(∃d)(EMPLOYEE(m) AND DEPARTMENT(d) AND p.DNUM=d.DNUMBER AND d.MGRSSN=m.SSN AND m.LNAME='Smith')))}

Compare this with the relational algebra version of this query in Section 6.5. The UNION operation in relational algebra can usually be substituted with an OR connective in relational calculus. In the next section we discuss the relationship between the universal and existential quantifiers and show how one can be transformed into the other.

6.6.5 Transforming the Universal and Existential Quantifiers

We now introduce some well-known transformations from mathematical logic that relate the universal and existential quantifiers. It is possible to transform a universal quantifier into an existential quantifier, and vice versa, to get an equivalent expression. One general transformation can be described informally as follows: Transform one type of quantifier into the other with negation (preceded by NOT); AND and OR replace one another; a negated formula becomes unnegated; and an unnegated formula becomes negated.
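The informal rule can be checked mechanically on a small finite universe. The sketch below, in Python, enumerates every pair of predicates P and Q over a three-element universe and confirms that swapping the quantifier, negating the inner formula, exchanging AND with OR, and negating the result preserves the truth value. The universe and the helper name all_predicates are assumptions made for the illustration; a finite check of this kind is a sanity test, not a proof.

```python
from itertools import product

UNIVERSE = [0, 1, 2]  # small, arbitrary finite universe (an assumption)

def all_predicates(universe):
    """Yield every predicate on the universe as a dict: value -> bool."""
    for bits in product([False, True], repeat=len(universe)):
        yield dict(zip(universe, bits))

checked = 0
for P, Q in product(list(all_predicates(UNIVERSE)), repeat=2):
    # (forall x) P(x)  is equivalent to  NOT (exists x) NOT P(x)
    assert all(P[x] for x in UNIVERSE) == (not any(not P[x] for x in UNIVERSE))
    # (exists x)(P(x) AND Q(x))  is equivalent to
    # NOT (forall x)(NOT P(x) OR NOT Q(x))
    assert any(P[x] and Q[x] for x in UNIVERSE) == (
        not all((not P[x]) or (not Q[x]) for x in UNIVERSE)
    )
    # (forall x)(P(x) OR Q(x))  is equivalent to
    # NOT (exists x)(NOT P(x) AND NOT Q(x))
    assert all(P[x] or Q[x] for x in UNIVERSE) == (
        not any((not P[x]) and (not Q[x]) for x in UNIVERSE)
    )
    checked += 1

print(checked)  # number of (P, Q) pairs examined
```

Here all() and any() stand in for the universal and existential quantifiers, so each assertion is a direct transcription of one instance of the rule.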
Some special cases of this transformation can be stated as follows, where the ≡ symbol stands for equivalent to:

(∀x)(P(x)) ≡ NOT (∃x)(NOT (P(x)))
(∃x)(P(x)) ≡ NOT (∀x)(NOT (P(x)))
(∀x)(P(x) AND Q(x)) ≡ NOT (∃x)(NOT (P(x)) OR NOT (Q(x)))
(∀x)(P(x) OR Q(x)) ≡ NOT (∃x)(NOT (P(x)) AND NOT (Q(x)))
(∃x)(P(x) OR Q(x)) ≡ NOT (∀x)(NOT (P(x)) AND NOT (Q(x)))
(∃x)(P(x) AND Q(x)) ≡ NOT (∀x)(NOT (P(x)) OR NOT (Q(x)))

Notice also that the following is TRUE, where the ⇒ symbol stands for implies:

(∀x)(P(x)) ⇒ (∃x)(P(x))
NOT (∃x)(P(x)) ⇒ NOT (∀x)(P(x))

6.6.6 Using the Universal Quantifier

Whenever we use a universal quantifier, it is quite judicious to follow a few rules to ensure that our expression makes sense. We discuss these rules with respect to Query 3.

QUERY 3
Find the names of employees who work on all the projects controlled by department number 5. One way of specifying this query is by using the universal quantifier as shown.

Q3: {e.LNAME, e.FNAME | EMPLOYEE(e) AND ((∀x)(NOT(PROJECT(x)) OR NOT(x.DNUM=5) OR ((∃w)(WORKS_ON(w) AND w.ESSN=e.SSN AND x.PNUMBER=w.PNO))))}

We can break up Q3 into its basic components as follows:

Q3: {e.LNAME, e.FNAME | EMPLOYEE(e) AND F'}
F' = ((∀x)(NOT(PROJECT(x)) OR F₁))
F₁ = NOT(x.DNUM=5) OR F₂
F₂ = ((∃w)(WORKS_ON(w) AND w.ESSN=e.SSN AND x.PNUMBER=w.PNO))

We want to make sure that a selected employee e works on all the projects controlled by department 5, but the definition of universal quantifier says that to make the quantified formula TRUE, the inner formula must be TRUE for all tuples in the universe. The trick is to exclude from the universal quantification all tuples that we are not interested in by making the condition TRUE for all such tuples. This is necessary because a universally quantified tuple variable, such as x in Q3, must evaluate to TRUE for every possible tuple assigned to it to make the quantified formula TRUE.
The first tuples to exclude (by making them evaluate automatically to TRUE) are those that are not in the relation R of interest. In Q3, using the expression NOT(PROJECT(x)) inside the universally quantified formula evaluates to TRUE all tuples x that are not in the PROJECT relation. Then we exclude the tuples we are not interested in from R itself. In Q3, using the expression NOT(x.DNUM=5) evaluates to TRUE all tuples x that are in the PROJECT relation but are not controlled by department 5. Finally, we specify a condition F₂ that must hold on all the remaining tuples in R. Hence, we can explain Q3 as follows:

1. For the formula F' = (∀x)(F) to be TRUE, we must have the formula F be TRUE for all tuples in the universe that can be assigned to x. However, in Q3 we are only interested in F being TRUE for all tuples of the PROJECT relation that are controlled by department 5. Hence, the formula F is of the form (NOT(PROJECT(x)) OR F₁). The 'NOT(PROJECT(x)) OR ...' condition is TRUE for all tuples not in the PROJECT relation and has the effect of eliminating these tuples from consideration in the truth value of F₁. For every tuple in the PROJECT relation, F₁ must be TRUE if F' is to be TRUE.

2. Using the same line of reasoning, we do not want to consider tuples in the PROJECT relation that are not controlled by department number 5, since we are only interested in PROJECT tuples whose DNUM = 5. We can therefore write:
   IF (x.DNUM=5) THEN F₂
   which is equivalent to
   (NOT (x.DNUM=5) OR F₂)

3. Formula F₁, hence, is of the form NOT(x.DNUM=5) OR F₂. In the context of Q3, this means that, for a tuple x in the PROJECT relation, either its DNUM≠5 or it must satisfy F₂.

4. Finally, F₂ gives the condition that we want to hold for a selected EMPLOYEE tuple: that the employee works on every PROJECT tuple that has not been excluded yet. Such employee tuples are selected by the query.
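The four points above can be sketched in Python, with all() standing in for the universal quantifier and any() for the existential one. Iterating only over PROJECT corresponds to the NOT(PROJECT(x)) OR ... guard, since tuples outside the relation never need to be examined. The second comprehension applies the transformation of Section 6.6.5, replacing the universal quantifier with a negated existential one, and the sketch checks that both forms select the same employees. The miniature relations and their rows are assumptions made for the illustration.

```python
from collections import namedtuple

# Hypothetical miniature relations with only the attributes Q3 uses.
Employee = namedtuple("Employee", "fname lname ssn")
Project = namedtuple("Project", "pnumber dnum")
WorksOn = namedtuple("WorksOn", "essn pno")

# Illustrative sample rows (assumed for this sketch).
EMPLOYEE = {Employee("Ann", "Lee", "111"), Employee("Bob", "Ray", "222")}
PROJECT = {Project(1, 5), Project(2, 5), Project(3, 4)}
WORKS_ON = {
    WorksOn("111", 1), WorksOn("111", 2),   # Ann: both department-5 projects
    WorksOn("222", 1), WorksOn("222", 3),   # Bob: misses project 2
}

def works_on(e, x):
    """F2: some WORKS_ON tuple links employee e to project x."""
    return any(w.essn == e.ssn and w.pno == x.pnumber for w in WORKS_ON)

# Universal form: for every project x, NOT(x.DNUM = 5) OR F2.
q3 = {
    (e.lname, e.fname)
    for e in EMPLOYEE
    if all(x.dnum != 5 or works_on(e, x) for x in PROJECT)
}

# Transformed form: no department-5 project exists that e does not work on.
q3_existential = {
    (e.lname, e.fname)
    for e in EMPLOYEE
    if not any(x.dnum == 5 and not works_on(e, x) for x in PROJECT)
}

print(q3, q3 == q3_existential)
```

Restricting the loop to PROJECT is also what keeps the evaluation finite: the quantifier never ranges over the whole universe of tuples.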
In English, Q3 gives the following condition for selecting an EMPLOYEE tuple e: For every tuple x in the PROJECT relation with x.DNUM = 5, there must exist a tuple w in WORKS_ON such that w.ESSN = e.SSN and w.PNO = x.PNUMBER. This is equivalent to saying that EMPLOYEE e works on every PROJECT x in DEPARTMENT number 5. (Whew!)

Using the general transformation from universal to existential quantifiers given in Section 6.6.5, we can rephrase the query in Q3 as shown in Q3A:

Q3A: {e.LNAME, e.FNAME | EMPLOYEE(e) AND (NOT (∃x)(PROJECT(x) AND (x.DNUM=5) AND (NOT (∃w)(WORKS_ON(w) AND w.ESSN=e.SSN AND x.PNUMBER=w.PNO))))}

We now give some additional examples of queries that use quantifiers.

QUERY 6
Find the names of employees who have no dependents.

Q6: {e.FNAME, e.LNAME | EMPLOYEE(e) AND (NOT (∃d)(DEPENDENT(d) AND e.SSN=d.ESSN))}

Using the general transformation rule, we can rephrase Q6 as follows:

Q6A: {e.FNAME, e.LNAME | EMPLOYEE(e) AND ((∀d)(NOT(DEPENDENT(d)) OR NOT(e.SSN=d.ESSN)))}

QUERY 7
List the names of managers who have at least one dependent.

Q7: {e.FNAME, e.LNAME | EMPLOYEE(e) AND ((∃d)(∃p)(DEPARTMENT(d) AND DEPENDENT(p) AND e.SSN=d.MGRSSN AND p.ESSN=e.SSN))}

This query is handled by interpreting "managers who have at least one dependent" as "managers for whom there exists some dependent."

6.6.7 Safe Expressions

Whenever we use universal quantifiers, existential quantifiers, or negation of predicates in a calculus expression, we must make sure that the resulting expression makes sense. A safe expression in relational calculus is one that is guaranteed to yield a finite number of tuples as its result; otherwise, the expression is called unsafe. For example, the expression {t | NOT (EMPLOYEE(t))} is unsafe because it yields all tuples in the universe that are not EMPLOYEE tuples, which are infinitely numerous. If we follow the rules for Q3 discussed earlier, we will get a safe expression when using universal quantifiers.

We can define safe expressions more precisely by introducing the concept of the domain of a tuple relational calculus expression: This is the set of all values that either appear as constant values in the expression or exist in any tuple in the relations referenced in the expression. The domain of {t | NOT(EMPLOYEE(t))} is the set of all attribute values appearing in some tuple of the EMPLOYEE relation (for any attribute). The domain of the expression Q3A would include all values appearing in EMPLOYEE, PROJECT, and WORKS_ON (unioned with the value 5 appearing in the query itself).

An expression is said to be safe if all values in its result are from the domain of the expression. Notice that the result of {t | NOT(EMPLOYEE(t))} is unsafe, since it will, in general, include tuples (and hence values) from outside the EMPLOYEE relation; such values are not in the domain of the expression. All of our other examples are safe expressions.

6.7 THE DOMAIN RELATIONAL CALCULUS

There is another type of relational calculus called the domain relational calculus, or simply, domain calculus. While SQL (see Chapter 8), a language based on tuple relational calculus, was being developed by IBM Research at San Jose, California, another language called QBE (Query-By-Example) that is related to domain calculus was being developed almost concurrently at IBM Research at Yorktown Heights, New York. The formal specification of the domain calculus was proposed after the development of the QBE system.

Domain calculus differs from tuple calculus in the type of variables used in formulas: Rather than having variables range over tuples, the variables range over single values from domains of attributes. To form a relation of degree n for a query result, we must have n of these domain variables, one for each attribute. An expression of the domain calculus is of the form

{x₁, x₂, ..., xₙ | COND(x₁, x₂, ..., xₙ, xₙ₊₁, xₙ₊₂, ..., xₙ₊ₘ)}

where x₁, x₂, ..., xₙ, xₙ₊₁, xₙ₊₂, ..., xₙ₊ₘ are domain variables that range over domains (of attributes), and COND is a condition or formula of the domain relational calculus.

A formula is made up of atoms. The atoms of a formula are slightly different from those for the tuple calculus and can be one of the following:

1. An atom of the form R(x₁, x₂, ..., xⱼ), where R is the name of a relation of degree j and each xᵢ, 1 ≤ i ≤ j, is a domain variable. This atom states that a list of values of <x₁, x₂, ..., xⱼ> must be a tuple in the relation whose name is R.

2. An atom of the form xᵢ op xⱼ, where op is one of the comparison operators in the set {=, <, ≤, >, ≥, ≠}, and xᵢ and xⱼ are domain variables.

3. An atom of the form xᵢ op c or c op xⱼ, where op is one of the comparison operators in the set {=, <, ≤, >, ≥, ≠}, xᵢ and xⱼ are domain variables, and c is a constant value.

As in tuple calculus, atoms evaluate to either TRUE or FALSE for a specific set of values, called the truth values of the atoms. In case 1, if the domain variables are assigned values corresponding to a tuple of the specified relation R, then the atom is TRUE. In cases 2 and 3, if the domain variables are assigned values that satisfy the condition, then the atom is TRUE.

In a similar way to the tuple relational calculus, formulas are made up of atoms, variables, and quantifiers, so we will not repeat the specifications for formulas here.

Some examples of queries specified in the domain calculus follow. We will use lowercase letters l, m, n, ..., x, y, z for domain variables.

QUERY 0
Retrieve the birthdate and address of the employee whose name is 'John B. Smith'.

Q0: {uv | (∃q)(∃r)(∃s)(∃t)(∃w)(∃x)(∃y)(∃z)(EMPLOYEE(qrstuvwxyz) AND q='John' AND r='B' AND s='Smith')}

We need ten variables for the EMPLOYEE relation, one to range over the domain of each attribute in order. Of the ten variables q, r, s, ..., z, only u and v are free.
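A rough Python analogue of Q0 binds one variable per attribute position by unpacking each EMPLOYEE row, which mirrors the way domain variables range over single values rather than whole tuples. The sketch rests on assumptions: each row is a plain 10-value tuple in the attribute order that the variables q through z follow, the meaning assigned to positions w, x, and y is a guess made for illustration, and the sample rows are invented.

```python
# Each row has ten values, matched positionally to q, r, s, t, u, v, w, x, y, z.
# Assumed order: first name, middle initial, last name, SSN, birth date,
# address, then three further attributes (assumed here: sex, salary,
# supervisor SSN), and the department number.
EMPLOYEE = {
    ("John", "B", "Smith", "123456789", "1965-01-09",
     "731 Fondren, Houston, TX", "M", 30000, "333445555", 5),
    ("Alicia", "J", "Zelaya", "999887777", "1968-07-19",
     "3321 Castle, Spring, TX", "F", 25000, "987654321", 4),
}

# u and v are the free variables (the result columns); the other eight
# variables are bound by the unpacking, as the existential quantifiers bind
# them in Q0.
q0 = {
    (u, v)
    for (q, r, s, t, u, v, w, x, y, z) in EMPLOYEE
    if q == "John" and r == "B" and s == "Smith"
}
print(q0)
```

Writing EMPLOYEE(qrstuvwxyz) in the calculus corresponds to the unpacking in the for clause: the ten variables must together form a row of the relation.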
We first specify the requested attributes, BDATE and ADDRESS, by the free domain variables u for BDATE and v for ADDRESS. Then we specify the condition for selecting a tuple following the bar ( | ), namely, that the sequence of values assigned to the variables qrstuvwxyz be a tuple of the EMPLOYEE relation and that the values for q (FNAME), r (MINIT), and s (LNAME) be 'John', 'B', and 'Smith', respectively. For convenience, we will quantify only those variables actually appearing in a condition (these would be q, r, and s in Q0) in the rest of our examples.¹³

An alternative shorthand notation, used in QBE, for writing this query is to assign the constants 'John', 'B', and 'Smith' directly as shown in Q0A. Here, all variables not appearing to the left of the bar are implicitly existentially quantified.¹⁴

Q0A: {uv | EMPLOYEE('John','B','Smith',t,u,v,w,x,y,z)}

QUERY 1
Retrieve the name and address of all employees who work for the 'Research' department.

Q1: {qsv | (∃z)(∃l)(∃m)(EMPLOYEE(qrstuvwxyz) AND DEPARTMENT(lmno) AND l='Research' AND m=z)}

A condition relating two domain variables that range over attributes from two relations, such as m = z in Q1, is a join condition; whereas a condition that relates a domain variable to a constant, such as l = 'Research', is a selection condition.

QUERY 2
For every project located in 'Stafford', list the project number, the controlling department number, and the department manager's last name, birth date, and address.

Q2: {iksuv | (∃j)(∃m)(∃n)(∃t)(PROJECT(hijk) AND EMPLOYEE(qrstuvwxyz) AND DEPARTMENT(lmno) AND k=m AND n=t AND j='Stafford')}

QUERY 6
Find the names of employees who have no dependents.

Q6: {qs | (∃t)(EMPLOYEE(qrstuvwxyz) AND (NOT(∃l)(DEPENDENT(lmnop) AND t=l)))}

Query 6 can be restated using universal quantifiers instead of the existential quantifiers, as shown in Q6A:

Q6A: {qs | (∃t)(EMPLOYEE(qrstuvwxyz) AND ((∀l)(NOT(DEPENDENT(lmnop)) OR NOT(t=l))))}

13. Note that the notation of quantifying only the domain variables actually used in conditions and of showing a predicate such as EMPLOYEE(qrstuvwxyz) without separating domain variables with commas is an abbreviated notation used for convenience; it is not the correct formal notation.
14. Again, this is not formally accurate notation.

QUERY 7
List the names of managers who have at least one dependent.

Q7: {sq | (∃t)(∃j)(∃l)(EMPLOYEE(qrstuvwxyz) AND DEPARTMENT(hijk) AND DEPENDENT(lmnop) AND t=j AND l=t)}

As we mentioned earlier, it can be shown that any query that can be expressed in the relational algebra can also be expressed in the domain or tuple relational calculus. Also, any safe expression in the domain or tuple relational calculus can be expressed in the relational algebra.

The Query-By-Example (QBE) language was based on the domain relational calculus, although this was realized later, after the domain calculus was formalized. QBE was one of the first graphical query languages with minimum syntax developed for database systems. It was developed at IBM Research and is available as an IBM commercial product as part of the QMF (Query Management Facility) interface option to DB2. It has been mimicked by several other commercial products. Because of its important place in the field of relational languages, we have included an overview of QBE in Appendix D.

6.8 SUMMARY

In this chapter we presented two formal languages for the relational model of data. They are used to manipulate relations and produce new relations as answers to queries. We discussed the relational algebra and its operations, which are used to specify a query as a sequence of operations. Then we introduced two types of relational calculi called tuple calculus and domain calculus; they are declarative in that they specify the result of a query without specifying how to produce the query result.
In Sections 6.1 through 6.3, we introduced the basic relational algebra operations and illustrated the types of queries for which each is used. The unary relational operators SELECT and PROJECT, as well as the RENAME operation, were discussed first. Then we discussed binary set theoretic operations requiring that relations on which they are applied be union compatible; these include UNION, INTERSECTION, and SET DIFFERENCE. The CARTESIAN PRODUCT operation is a set operation that can be used to combine tuples from two relations, producing all possible combinations. It is rarely used in practice; however, we showed how CARTESIAN PRODUCT followed by SELECT can be used to define matching tuples from two relations and leads to the JOIN operation. Different JOIN operations called THETA JOIN, EQUIJOIN, and NATURAL JOIN were introduced.

We then discussed some important types of queries that cannot be stated with the basic relational algebra operations but are important for practical situations. We introduced the AGGREGATE FUNCTION operation to deal with aggregate types of requests. We discussed recursive queries, for which there is no direct support in the algebra but which can be handled step by step, as we demonstrated. We then presented the OUTER JOIN and OUTER UNION operations, which extend JOIN and UNION and allow all information in source relations to be preserved in the result.

The last two sections described the basic concepts behind relational calculus, which is based on the branch of mathematical logic called predicate calculus. There are two types of relational calculi: (1) the tuple relational calculus, which uses tuple variables that range over tuples (rows) of relations, and (2) the domain relational calculus, which uses domain variables that range over domains (columns of relations).
In relational calculus, a query is specified in a single declarative statement, without specifying any order or method for retrieving the query result. Hence, relational calculus is often considered to be a higher-level language than the relational algebra because a relational calculus expression states what we want to retrieve regardless of how the query may be executed.

We discussed the syntax of relational calculus queries using both tuple and domain variables. We also discussed the existential quantifier (∃) and the universal quantifier (∀). We saw that relational calculus variables are bound by these quantifiers. We described in detail how queries with universal quantification are written, and we discussed the problem of specifying safe queries whose results are finite. We also discussed rules for transforming universal into existential quantifiers, and vice versa. It is the quantifiers that give expressive power to the relational calculus, making it equivalent to relational algebra. There is no analog to grouping and aggregation functions in basic relational calculus, although some extensions have been suggested.

Review Questions

6.1. List the operations of relational algebra and the purpose of each.
6.2. What is union compatibility? Why do the UNION, INTERSECTION, and DIFFERENCE operations require that the relations on which they are applied be union compatible?
6.3. Discuss some types of queries for which renaming of attributes is necessary in order to specify the query unambiguously.
6.4. Discuss the various types of inner join operations. Why is theta join required?
6.5. What role does the concept of foreign key play when specifying the most common types of meaningful join operations?
6.6. What is the FUNCTION operation? What is it used for?
6.7. How are the OUTER JOIN operations different from the INNER JOIN operations? How is the OUTER UNION operation different from UNION?
6.8. In what sense does relational calculus differ from relational algebra, and in what sense are they similar?
6.9. How does tuple relational calculus differ from domain relational calculus?
6.10. Discuss the meanings of the existential quantifier (∃) and the universal quantifier (∀).
6.11. Define the following terms with respect to the tuple calculus: tuple variable, range relation, atom, formula, and expression.
6.12. Define the following terms with respect to the domain calculus: domain variable, range relation, atom, formula, and expression.
6.13. What is meant by a safe expression in relational calculus?
6.14. When is a query language called relationally complete?

Exercises

6.15. Show the result of each of the example queries in Section 6.5 as it would apply to the database state of Figure 5.6.
6.16. Specify the following queries on the database schema shown in Figure 5.5, using the relational operators discussed in this chapter. Also show the result of each query as it would apply to the database state of Figure 5.6.
   a. Retrieve the names of all employees in department 5 who work more than 10 hours per week on the 'ProductX' project.
   b. List the names of all employees who have a dependent with the same first name as themselves.
   c. Find the names of all employees who are directly supervised by 'Franklin Wong'.
   d. For each project, list the project name and the total hours per week (by all employees) spent on that project.
   e. Retrieve the names of all employees who work on every project.
   f. Retrieve the names of all employees who do not work on any project.
   g. For each department, retrieve the department name and the average salary of all employees working in that department.
   h. Retrieve the average salary of all female employees.
   i. Find the names and addresses of all employees who work on at least one project located in Houston but whose department has no location in Houston.
   j. List the last names of all department managers who have no dependents.
6.17. Consider the AIRLINE relational database schema shown in Figure 5.8, which was described in Exercise 5.11. Specify the following queries in relational algebra:
   a. For each flight, list the flight number, the departure airport for the first leg of the flight, and the arrival airport for the last leg of the flight.
   b. List the flight numbers and weekdays of all flights or flight legs that depart from Houston Intercontinental Airport (airport code 'IAH') and arrive in Los Angeles International Airport (airport code 'LAX').
   c. List the flight number, departure airport code, scheduled departure time, arrival airport code, scheduled arrival time, and weekdays of all flights or flight legs that depart from some airport in the city of Houston and arrive at some airport in the city of Los Angeles.
   d. List all fare information for flight number 'co197'.
   e. Retrieve the number of available seats for flight number 'co197' on '1999-10-09'.
6.18. Consider the LIBRARY relational database schema shown in Figure 6.12, which is used to keep track of books, borrowers, and book loans. Referential integrity constraints are shown as directed arcs in Figure 6.12, as in the notation of Figure 5.7. Write down relational expressions for the following queries:
   a. How many copies of the book titled The Lost Tribe are owned by the library branch whose name is 'Sharpstown'?
   b. How many copies of the book titled The Lost Tribe are owned by each library branch?
   c. Retrieve the names of all borrowers who do not have any books checked out.
   d. For each book that is loaned out from the 'Sharpstown' branch and whose DueDate is today, retrieve the book title, the borrower's name, and the borrower's address.
   e. For each library branch, retrieve the branch name and the total number of books loaned out from that branch.
   f. Retrieve the names, addresses, and number of books checked out for all borrowers who have more than five books checked out.
   g. For each book authored (or coauthored) by 'Stephen King,' retrieve the title and the number of copies owned by the library branch whose name is 'Central.'

[FIGURE 6.12 A relational database schema for a LIBRARY database.]

6.19. Specify the following queries in relational algebra on the database schema given in Exercise 5.13:
   a. List the Order# and Ship_date for all orders shipped from Warehouse number 'W2'.
   b. List the Warehouse information from which the Customer named 'Jose Lopez' was supplied his orders. Produce a listing: Order#, Warehouse#.
   c. Produce a listing CUSTNAME, #OFORDERS, AVG_ORDER_AMT, where the middle column is the total number of orders by the customer and the last column is the average order amount for that customer.
   d. List the orders that were not shipped within 30 days of ordering.
   e. List the Order# for orders that were shipped from all warehouses that the company has in New York.
6.20. Specify the following queries in relational algebra on the database schema given in Exercise 5.14:
   a. Give the details (all attributes of TRIP relation) for trips that exceeded $2000 in expenses.
   b. Print the SSN of salesman who took trips to 'Honolulu'.
   c. Print the total trip expenses incurred by the salesman with SSN = '234-56-7890'.
6.21. Specify the following queries in relational algebra on the database schema given in Exercise 5.15:
   a. List the number of courses taken by all students named 'John Smith' in Winter 1999 (i.e., Quarter = 'W99').
   b. Produce a list of textbooks {include Course#, Book_ISBN, Book_Title} for courses offered by the 'CS' department that have used more than two books.
   c. List any department that has all its adopted books published by 'AWL Publishing'.
6.22.
Consider the two tables T1 and T2 shown in Figure 6.13. Show the results of the following operations:
   a. T1 ⋈ (T1.P = T2.A) T2
   b. T1 ⋈ (T1.Q = T2.B) T2
   c. T1 :>1 (T1.P = T2.A) T2
   d. T1 i>1 (T1.P = T2.A AND T1.R = T2.C) T2
6.23. Specify the following queries in relational algebra on the database schema of Exercise 5.16:
   a. For the salesperson named 'Jane Doe', list the following information for all the cars she sold: Serial#, Manufacturer, Sale-price.
   b. List the Serial# and Model of cars that have no options.
   c. Consider the NATURAL JOIN operation between SALESPERSON and SALES. What is the meaning of a left OUTER JOIN for these tables (do not change the order of relations)? Explain with an example.
   d. Write a query in relational algebra involving selection and one set operation and say in words what the query does.
6.24. Specify queries a, b, c, e, f, i, and j of Exercise 6.16 in both tuple and domain relational calculus.
6.25. Specify queries a, b, c, and d of Exercise 6.17 in both tuple and domain relational calculus.
6.26. Specify queries c, d, f, and g of Exercise 6.18 in both tuple and domain relational calculus.

Table T1           Table T2
P    Q    R        A    B    C
10   a    5        10   b    6
15   b    8        25   c    3
25   a    6        10   b    5

FIGURE 6.13 A database state for the relations T1 and T2.

6.27. In a tuple relational calculus query with n tuple variables, what would be the typical minimum number of join conditions? Why? What is the effect of having a smaller number of join conditions?
6.28. Rewrite the domain relational calculus queries that followed Q0 in Section 6.7 in the style of the abbreviated notation of Q0A, where the objective is to minimize the number of domain variables by writing constants in place of variables wherever possible.
6.29. Consider this query: Retrieve the SSNs of employees who work on at least those projects on which the employee with SSN = 123456789 works.
This may be stated as (FORALL x) (IF P THEN Q), where
   • x is a tuple variable that ranges over the PROJECT relation.
   • P ≡ employee with SSN = 123456789 works on project x.
   • Q ≡ employee e works on project x.
Express the query in tuple relational calculus, using the rules
   • (∀x)(P(x)) ≡ NOT(∃x)(NOT(P(x))).
   • (IF P THEN Q) ≡ (NOT(P) OR Q).
6.30. Show how you may specify the following relational algebra operations in both tuple and domain relational calculus.
   a. σ A=C (R(A, B, C))
   b. π <A, B> (R(A, B, C))
   c. R(A, B, C) * S(C, D, E)
   d. R(A, B, C) ∪ S(A, B, C)
   e. R(A, B, C) ∩ S(A, B, C)
   f. R(A, B, C) − S(A, B, C)
   g. R(A, B, C) × S(D, E, F)
   h. R(A, B) ÷ S(A)
6.31. Suggest extensions to the relational calculus so that it may express the following types of operations that were discussed in Section 6.4: (a) aggregate functions and grouping; (b) OUTER JOIN operations; (c) recursive closure queries.

Selected Bibliography

Codd (1970) defined the basic relational algebra. Date (1983a) discusses outer joins. Work on extending relational operations is discussed by Carlis (1986) and Ozsoyoglu et al. (1985). Cammarata et al. (1989) extends the relational model integrity constraints and joins.

Codd (1971) introduced the language Alpha, which is based on concepts of tuple relational calculus. Alpha also includes the notion of aggregate functions, which goes beyond relational calculus. The original formal definition of relational calculus was given by Codd (1972), which also provided an algorithm that transforms any tuple relational calculus expression to relational algebra. The QUEL (Stonebraker et al. 1976) is based on tuple relational calculus, with implicit existential quantifiers but no universal quantifiers, and was implemented in the Ingres system as a commercially available language. Codd defined relational completeness of a query language to mean at least as powerful as relational calculus.
Ullman (1988) describes a formal proof of the equivalence of relational algebra with the safe expressions of tuple and domain relational calculus. Abiteboul et al. (1995) and Atzeni and deAntonellis (1993) give a detailed treatment of formal relational languages.

Although ideas of domain relational calculus were initially proposed in the QBE language (Zloof 1975), the concept was formally defined by Lacroix and Pirotte (1977). The experimental version of the Query-By-Example system is described in Zloof (1977). The ILL (Lacroix and Pirotte 1977a) is based on domain relational calculus. Whang et al. (1990) extends QBE with universal quantifiers. Visual query languages, of which QBE is an example, are being proposed as a means of querying databases; conferences such as the Visual Database Systems Workshop (e.g., Arisawa and Catarci (2000) or Zhou and Pu (2002)) have a number of proposals for such languages.

Relational Database Design by ER- and EER-to-Relational Mapping

We now focus on how to design a relational database schema based on a conceptual schema design. This corresponds to the logical database design or data model mapping step discussed in Section 3.1 (see Figure 3.1). We present the procedures to create a relational schema from an entity-relationship (ER) or an enhanced ER (EER) schema. Our discussion relates the constructs of the ER and EER models, presented in Chapters 3 and 4, to the constructs of the relational model, presented in Chapters 5 and 6. Many CASE (computer-aided software engineering) tools are based on the ER or EER models, or other similar models, as we have discussed in Chapters 3 and 4. These computerized tools are used interactively by database designers to develop an ER or EER schema for a database application.
Many tools use ER or EER diagrams or variations to develop the schema graphically, and then automatically convert it into a relational database schema in the DDL of a specific relational DBMS by employing algorithms similar to the ones presented in this chapter.

We outline a seven-step algorithm in Section 7.1 to convert the basic ER model constructs (entity types, both strong and weak; binary relationships with various structural constraints; n-ary relationships; and attributes, whether simple, composite, or multivalued) into relations. Then, in Section 7.2, we continue the mapping algorithm by describing how to map EER model constructs, namely specialization/generalization and union types (categories), into relations.

7.1 RELATIONAL DATABASE DESIGN USING ER-TO-RELATIONAL MAPPING

7.1.1 ER-to-Relational Mapping Algorithm

We now describe the steps of an algorithm for ER-to-relational mapping. We will use the COMPANY database example to illustrate the mapping procedure. The COMPANY ER schema is shown again in Figure 7.1, and the corresponding COMPANY relational database schema is shown in Figure 7.2 to illustrate the mapping steps.

[FIGURE 7.1 The ER conceptual schema diagram for the COMPANY database.]

[FIGURE 7.2 Result of mapping the COMPANY ER schema into a relational database schema.]

Step 1: Mapping of Regular Entity Types. For each regular (strong) entity type E in the ER schema, create a relation R that includes all the simple attributes of E. Include only the simple component attributes of a composite attribute. Choose one of the key attributes of E as primary key for R. If the chosen key of E is composite, the set of simple attributes that form it will together form the primary key of R.
If multiple keys were identified for E during the conceptual design, the information describing the attributes that form each additional key is kept in order to specify secondary (unique) keys of relation R. Knowledge about keys is also kept for indexing purposes and other types of analyses.

In our example, we create the relations EMPLOYEE, DEPARTMENT, and PROJECT in Figure 7.2 to correspond to the regular entity types EMPLOYEE, DEPARTMENT, and PROJECT from Figure 7.1. The foreign key and relationship attributes, if any, are not included yet; they will be added during subsequent steps. These include the attributes SUPERSSN and DNO of EMPLOYEE, MGRSSN and MGRSTARTDATE of DEPARTMENT, and DNUM of PROJECT. In our example, we choose SSN, DNUMBER, and PNUMBER as primary keys for the relations EMPLOYEE, DEPARTMENT, and PROJECT, respectively. Knowledge that DNAME of DEPARTMENT and PNAME of PROJECT are secondary keys is kept for possible use later in the design.

The relations that are created from the mapping of entity types are sometimes called entity relations because each tuple (row) represents an entity instance.

Step 2: Mapping of Weak Entity Types. For each weak entity type W in the ER schema with owner entity type E, create a relation R and include all simple attributes (or simple components of composite attributes) of W as attributes of R. In addition, include as foreign key attributes of R the primary key attribute(s) of the relation(s) that correspond to the owner entity type(s); this takes care of the identifying relationship type of W. The primary key of R is the combination of the primary key(s) of the owner(s) and the partial key of the weak entity type W, if any. If there is a weak entity type E2 whose owner is also a weak entity type E1, then E1 should be mapped before E2 to determine its primary key first.
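As a hedged illustration of Steps 1 and 2, the sketch below declares the three entity relations and the DEPENDENT relation (whose example is discussed next) in SQL DDL, executed through Python's sqlite3 module. The relation names and keys follow the COMPANY example; the column types, the shortened attribute lists, and the sample rows are our own assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
-- Step 1: one relation per regular entity type, with a chosen primary key.
CREATE TABLE EMPLOYEE (
    SSN   CHAR(9) PRIMARY KEY,
    FNAME VARCHAR(15),           -- simple components of the composite attribute Name
    MINIT CHAR(1),
    LNAME VARCHAR(15),
    BDATE DATE
);
CREATE TABLE DEPARTMENT (
    DNUMBER INTEGER PRIMARY KEY,
    DNAME   VARCHAR(15) UNIQUE   -- secondary key remembered from the conceptual design
);
CREATE TABLE PROJECT (
    PNUMBER   INTEGER PRIMARY KEY,
    PNAME     VARCHAR(15) UNIQUE,
    PLOCATION VARCHAR(15)
);
-- Step 2: weak entity type; owner key plus partial key form the primary key.
CREATE TABLE DEPENDENT (
    ESSN           CHAR(9) REFERENCES EMPLOYEE(SSN)
                       ON UPDATE CASCADE ON DELETE CASCADE,
    DEPENDENT_NAME VARCHAR(15),
    RELATIONSHIP   VARCHAR(8),
    PRIMARY KEY (ESSN, DEPENDENT_NAME)
);
""")

# Illustrative rows (assumed data): one employee with one dependent.
conn.execute("INSERT INTO EMPLOYEE (SSN, FNAME, LNAME) VALUES ('123456789', 'John', 'Smith')")
conn.execute("INSERT INTO DEPENDENT VALUES ('123456789', 'Alice', 'DAUGHTER')")

# Deleting the owner entity propagates to its dependents via ON DELETE CASCADE.
conn.execute("DELETE FROM EMPLOYEE WHERE SSN = '123456789'")
remaining_dependents = conn.execute("SELECT COUNT(*) FROM DEPENDENT").fetchone()[0]
```

The foreign key and relationship attributes of EMPLOYEE, DEPARTMENT, and PROJECT are deliberately left out here, since the algorithm adds them only in later steps.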
In our example, we create the relation DEPENDENT in this step to correspond to the weak entity type DEPENDENT. We include the primary key SSN of the EMPLOYEE relation, which corresponds to the owner entity type, as a foreign key attribute of DEPENDENT; we renamed it ESSN, although this is not necessary. The primary key of the DEPENDENT relation is the combination {ESSN, DEPENDENT_NAME} because DEPENDENT_NAME is the partial key of DEPENDENT.

It is common to choose the propagate (CASCADE) option for the referential triggered action (see Section 8.2) on the foreign key in the relation corresponding to the weak entity type, since a weak entity has an existence dependency on its owner entity. This can be used for both ON UPDATE and ON DELETE.

Step 3: Mapping of Binary 1:1 Relationship Types. For each binary 1:1 relationship type R in the ER schema, identify the relations S and T that correspond to the entity types participating in R. There are three possible approaches: (1) the foreign key approach, (2) the merged relation approach, and (3) the cross-reference or relationship relation approach. Approach 1 is the most useful and should be followed unless special conditions exist, as we discuss below.

1. Foreign key approach: Choose one of the relations (S, say) and include as a foreign key in S the primary key of T. It is better to choose an entity type with total participation in R in the role of S. Include all the simple attributes (or simple components of composite attributes) of the 1:1 relationship type R as attributes of S.
In our example, we map the 1:1 relationship type MANAGES from Figure 7.1 by choosing the participating entity type DEPARTMENT to serve in the role of S, because its participation in the MANAGES relationship type is total (every department has a manager). We include the primary key of the EMPLOYEE relation as foreign key in the DEPARTMENT relation and rename it MGRSSN.
We also include the simple attribute STARTDATE of the MANAGES relationship type in the DEPARTMENT relation and rename it MGRSTARTDATE.
Note that it is possible to include the primary key of S as a foreign key in T instead. In our example, this amounts to having a foreign key attribute, say DEPARTMENT_MANAGED, in the EMPLOYEE relation, but it will have a null value for employee tuples that do not manage a department. If only 10 percent of employees manage a department, then 90 percent of the foreign keys would be null in this case. Another possibility is to have foreign keys in both relations S and T redundantly, but this incurs a penalty for consistency maintenance.

2. Merged relation option: An alternative mapping of a 1:1 relationship type is possible by merging the two entity types and the relationship into a single relation. This may be appropriate when both participations are total.

3. Cross-reference or relationship relation option: The third alternative is to set up a third relation R for the purpose of cross-referencing the primary keys of the two relations S and T representing the entity types. As we shall see, this approach is required for binary M:N relationships. The relation R is called a relationship relation (or sometimes a lookup table), because each tuple in R represents a relationship instance that relates one tuple from S with one tuple of T.

Step 4: Mapping of Binary 1:N Relationship Types. For each regular binary 1:N relationship type R, identify the relation S that represents the participating entity type at the N-side of the relationship type. Include as foreign key in S the primary key of the relation T that represents the other entity type participating in R; this is done because each entity instance on the N-side is related to at most one entity instance on the 1-side of the relationship type. Include any simple attributes (or simple components of composite attributes) of the 1:N relationship type as attributes of S.

In our example, we now map the 1:N relationship types WORKS_FOR, CONTROLS, and SUPERVISION from Figure 7.1. For WORKS_FOR we include the primary key DNUMBER of the DEPARTMENT relation as foreign key in the EMPLOYEE relation and call it DNO. For SUPERVISION we include the primary key of the EMPLOYEE relation as foreign key in the EMPLOYEE relation itself, because the relationship is recursive, and call it SUPERSSN. The CONTROLS relationship is mapped to the foreign key attribute DNUM of PROJECT, which references the primary key DNUMBER of the DEPARTMENT relation.

An alternative approach we can use here is again the relationship relation (cross-reference) option as in the case of binary 1:1 relationships. We create a separate relation R whose attributes are the keys of S and T, and whose primary key is the same as the key of S. This option can be used if few tuples in S participate in the relationship, to avoid excessive null values in the foreign key.

Step 5: Mapping of Binary M:N Relationship Types. For each binary M:N relationship type R, create a new relation S to represent R. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types; their combination will form the primary key of S. Also include any simple attributes of the M:N relationship type (or simple components of composite attributes) as attributes of S. Notice that we cannot represent an M:N relationship type by a single foreign key attribute in one of the participating relations (as we did for 1:1 or 1:N relationship types) because of the M:N cardinality ratio; we must create a separate relationship relation S.

In our example, we map the M:N relationship type WORKS_ON from Figure 7.1 by creating the relation WORKS_ON in Figure 7.2.
We include the primary keys of the PROJECT and EMPLOYEE relations as foreign keys in WORKS_ON and rename them PNO and ESSN, respectively. We also include an attribute HOURS in WORKS_ON to represent the HOURS attribute of the relationship type. The primary key of the WORKS_ON relation is the combination of the foreign key attributes {ESSN, PNO}.

The propagate (CASCADE) option for the referential triggered action (see Section 8.2) should be specified on the foreign keys in the relation corresponding to the relationship R, since each relationship instance has an existence dependency on each of the entities it relates. This can be used for both ON UPDATE and ON DELETE.

Notice that we can always map 1:1 or 1:N relationships in a manner similar to M:N relationships by using the cross-reference (relationship relation) approach, as we discussed earlier. This alternative is particularly useful when few relationship instances exist, in order to avoid null values in foreign keys. In this case, the primary key of the relationship relation will be only one of the foreign keys that reference the participating entity relations. For a 1:N relationship, the primary key of the relationship relation will be the foreign key that references the entity relation on the N-side. For a 1:1 relationship, either foreign key can be used as the primary key of the relationship relation as long as no null entries are present in that relation.

Step 6: Mapping of Multivalued Attributes. For each multivalued attribute A, create a new relation R. This relation R will include an attribute corresponding to A, plus the primary key attribute K (as a foreign key in R) of the relation that represents the entity type or relationship type that has A as an attribute. The primary key of R is the combination of A and K. If the multivalued attribute is composite, we include its simple components.
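Looking back at Steps 3 through 5, the foreign keys and the WORKS_ON relationship relation can be sketched as follows. This is only a minimal sketch run through Python's sqlite3 module: the attribute names come from the COMPANY example, while the column types, the shortened attribute lists, the NOT NULL on MGRSSN (our way of expressing total participation), and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE EMPLOYEE (
    SSN      CHAR(9) PRIMARY KEY,
    LNAME    VARCHAR(15),
    SUPERSSN CHAR(9) REFERENCES EMPLOYEE(SSN),        -- Step 4: SUPERVISION (recursive 1:N)
    DNO      INTEGER REFERENCES DEPARTMENT(DNUMBER)   -- Step 4: WORKS_FOR (1:N)
);
CREATE TABLE DEPARTMENT (
    DNUMBER      INTEGER PRIMARY KEY,
    DNAME        VARCHAR(15) UNIQUE,
    MGRSSN       CHAR(9) NOT NULL REFERENCES EMPLOYEE(SSN),  -- Step 3: MANAGES (1:1)
    MGRSTARTDATE DATE                                        -- attribute of MANAGES
);
CREATE TABLE PROJECT (
    PNUMBER   INTEGER PRIMARY KEY,
    PNAME     VARCHAR(15) UNIQUE,
    PLOCATION VARCHAR(15),
    DNUM      INTEGER REFERENCES DEPARTMENT(DNUMBER)  -- Step 4: CONTROLS (1:N)
);
-- Step 5: M:N relationship type becomes its own relationship relation.
CREATE TABLE WORKS_ON (
    ESSN  CHAR(9) REFERENCES EMPLOYEE(SSN)    ON UPDATE CASCADE ON DELETE CASCADE,
    PNO   INTEGER REFERENCES PROJECT(PNUMBER) ON UPDATE CASCADE ON DELETE CASCADE,
    HOURS DECIMAL(3,1),
    PRIMARY KEY (ESSN, PNO)
);
""")

# Illustrative rows; the employee is inserted first with a null DNO because
# EMPLOYEE and DEPARTMENT reference each other.
conn.execute("INSERT INTO EMPLOYEE (SSN, LNAME) VALUES ('333445555', 'Wong')")
conn.execute("INSERT INTO DEPARTMENT VALUES (5, 'Research', '333445555', '1988-05-22')")
conn.execute("UPDATE EMPLOYEE SET DNO = 5 WHERE SSN = '333445555'")
conn.execute("INSERT INTO PROJECT VALUES (2, 'ProductY', 'Sugarland', 5)")
conn.execute("INSERT INTO WORKS_ON VALUES ('333445555', 2, 10.0)")

# The composite primary key allows one tuple per (employee, project) pair.
try:
    conn.execute("INSERT INTO WORKS_ON VALUES ('333445555', 2, 5.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Deleting a project removes the relationship instances that depend on it.
conn.execute("DELETE FROM PROJECT WHERE PNUMBER = 2")
works_on_rows = conn.execute("SELECT COUNT(*) FROM WORKS_ON").fetchone()[0]
```

A single foreign key column suffices for the 1:1 and 1:N relationship types, whereas WORKS_ON needs its own relation because neither EMPLOYEE nor PROJECT could hold a single-valued reference to the other side.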
In our example, we create a relation DEPT_LOCATIONS. The attribute DLOCATION represents the multivalued attribute LOCATIONS of DEPARTMENT, while DNUMBER, as foreign key, represents the primary key of the DEPARTMENT relation. The primary key of DEPT_LOCATIONS is the combination of {DNUMBER, DLOCATION}. A separate tuple will exist in DEPT_LOCATIONS for each location that a department has.

The propagate (CASCADE) option for the referential triggered action (see Section 8.2) should be specified on the foreign key in the relation R corresponding to the multivalued attribute for both ON UPDATE and ON DELETE. We should also note that the key of R when mapping a composite, multivalued attribute requires some analysis of the meaning of the component attributes. In some cases when a multivalued attribute is composite, only some of the component attributes are required to be part of the key of R; these attributes are similar to a partial key of a weak entity type that corresponds to the multivalued attribute (see Section 3.5).

Figure 7.2 shows the COMPANY relational database schema obtained through steps 1 to 6, and Figure 5.6 shows a sample database state. Notice that we did not yet discuss the mapping of n-ary relationship types (n > 2), because none exist in Figure 7.1; these are mapped in a similar way to M:N relationship types by including the following additional step in the mapping algorithm.

Step 7: Mapping of N-ary Relationship Types. For each n-ary relationship type R, where n > 2, create a new relation S to represent R. Include as foreign key attributes in S the primary keys of the relations that represent the participating entity types. Also include any simple attributes of the n-ary relationship type (or simple components of composite attributes) as attributes of S. The primary key of S is usually a combination of all the foreign keys that reference the relations representing the participating entity types. However, if the cardinality constraint on any of the entity types E participating in R is 1, then the primary key of S should not include the foreign key attribute that references the relation E' corresponding to E (see Section 4.7).

For example, consider the relationship type SUPPLY of Figure 4.11a. This can be mapped to the relation SUPPLY shown in Figure 7.3, whose primary key is the combination of the three foreign keys {SNAME, PARTNO, PROJNAME}.

FIGURE 7.3 Mapping the n-ary relationship type SUPPLY from Figure 4.11a. Relations shown: SUPPLIER (key SNAME), PROJECT (key PROJNAME), PART (key PARTNO), and SUPPLY(SNAME, PROJNAME, PARTNO, QUANTITY).

7.1.2 Discussion and Summary of Mapping for Model Constructs

Table 7.1 summarizes the correspondences between ER and relational model constructs and constraints.

TABLE 7.1 CORRESPONDENCE BETWEEN ER AND RELATIONAL MODELS

ER MODEL                        RELATIONAL MODEL
Entity type                     "Entity" relation
1:1 or 1:N relationship type    Foreign key (or "relationship" relation)
M:N relationship type           "Relationship" relation and two foreign keys
n-ary relationship type         "Relationship" relation and n foreign keys
Simple attribute                Attribute
Composite attribute             Set of simple component attributes
Multivalued attribute           Relation and foreign key
Value set                       Domain
Key attribute                   Primary (or secondary) key

One of the main points to note in a relational schema, in contrast to an ER schema, is that relationship types are not represented explicitly; instead, they are represented by having two attributes A and B, one a primary key and the other a foreign key (over the same domain), included in two relations S and T. Two tuples in S and T are related when they have the same value for A and B. By using the EQUIJOIN operation (or NATURAL JOIN if the two join attributes have the same name) over S.A and T.B, we can combine all pairs of related tuples from S and T and materialize the relationship. When a binary 1:1 or 1:N relationship type is involved, a single join operation is usually needed. For a binary M:N relationship type, two join operations are needed, whereas for n-ary relationship types, n joins are needed to fully materialize the relationship instances.

For example, to form a relation that includes the employee name, project name, and hours that the employee works on each project, we need to connect each EMPLOYEE tuple to the related PROJECT tuples via the WORKS_ON relation of Figure 7.2. Hence, we must apply the EQUIJOIN operation to the EMPLOYEE and WORKS_ON relations with the join condition SSN = ESSN, and then apply another EQUIJOIN operation to the resulting relation and the PROJECT relation with join condition PNO = PNUMBER. In general, when multiple relationships need to be traversed, numerous join operations must be specified. A relational database user must always be aware of the foreign key attributes in order to use them correctly in combining related tuples from two or more relations. This is sometimes considered to be a drawback of the relational data model because the foreign key/primary key correspondences are not always obvious upon inspection of relational schemas. If an equijoin is performed among attributes of two relations that do not represent a foreign key/primary key relationship, the result can often be meaningless and may lead to spurious (invalid) data.
For example, the reader can try joining the PROJECT and DEPT_LOCATIONS relations on the condition DLOCATION = PLOCATION and examine the result (see also Chapter 10).

Another point to note in the relational schema is that we create a separate relation for each multivalued attribute. For a particular entity with a set of values for the multivalued attribute, the key attribute value of the entity is repeated once for each value of the multivalued attribute in a separate tuple. This is because the basic relational model does not allow multiple values (a list, or a set of values) for an attribute in a single tuple. For example, because department 5 has three locations, three tuples exist in the DEPT_LOCATIONS relation of Figure 5.6; each tuple specifies one of the locations. In our example, we apply EQUIJOIN to DEPT_LOCATIONS and DEPARTMENT on the DNUMBER attribute to get the values of all locations along with other DEPARTMENT attributes. In the resulting relation, the values of the other department attributes are repeated in separate tuples for every location that a department has.

The basic relational algebra does not have a NEST or COMPRESS operation that would produce from the DEPT_LOCATIONS relation of Figure 5.6 a set of tuples of the form {<1, Houston>, <4, Stafford>, <5, {Bellaire, Sugarland, Houston}>}. This is a serious drawback of the basic normalized or "flat" version of the relational model. On this score, the object-oriented model and the legacy hierarchical and network models have better facilities than does the relational model. The nested relational model and object-relational systems (see Chapter 22) attempt to remedy this.

7.2 MAPPING EER MODEL CONSTRUCTS TO RELATIONS

We now discuss the mapping of EER model constructs to relations by extending the ER-to-relational mapping algorithm that was presented in Section 7.1.1.
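Before turning to the EER constructs, the two observations of Section 7.1.2 can be made concrete with a small, hedged sketch in Python's sqlite3 module. It first materializes the WORKS_ON relationship with two equijoins, and then imitates a NEST of DEPT_LOCATIONS in application code, since the flat relation itself cannot hold a set-valued attribute. The column types and the few sample rows (modeled on the COMPANY example) are assumptions, and foreign key declarations are omitted to keep the sketch short.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, LNAME TEXT);
CREATE TABLE PROJECT  (PNUMBER INTEGER PRIMARY KEY, PNAME TEXT);
CREATE TABLE WORKS_ON (ESSN TEXT, PNO INTEGER, HOURS REAL, PRIMARY KEY (ESSN, PNO));
CREATE TABLE DEPT_LOCATIONS (DNUMBER INTEGER, DLOCATION TEXT,
                             PRIMARY KEY (DNUMBER, DLOCATION));
INSERT INTO EMPLOYEE VALUES ('123456789', 'Smith'), ('333445555', 'Wong');
INSERT INTO PROJECT  VALUES (1, 'ProductX'), (2, 'ProductY');
INSERT INTO WORKS_ON VALUES ('123456789', 1, 32.5), ('123456789', 2, 7.5),
                            ('333445555', 2, 10.0);
INSERT INTO DEPT_LOCATIONS VALUES (1, 'Houston'), (4, 'Stafford'),
                                  (5, 'Bellaire'), (5, 'Sugarland'), (5, 'Houston');
""")

# Two equijoins along the foreign keys: SSN = ESSN, then PNO = PNUMBER.
rows = conn.execute("""
    SELECT E.LNAME, P.PNAME, W.HOURS
    FROM EMPLOYEE E
    JOIN WORKS_ON W ON E.SSN = W.ESSN
    JOIN PROJECT  P ON W.PNO = P.PNUMBER
    ORDER BY E.LNAME, P.PNAME
""").fetchall()

# NEST-like regrouping performed outside the relational engine:
# one entry per department, holding the set of its locations.
nested = defaultdict(set)
for dnumber, dlocation in conn.execute("SELECT DNUMBER, DLOCATION FROM DEPT_LOCATIONS"):
    nested[dnumber].add(dlocation)
```

The regrouping step is ordinary application code, not a relational algebra operation; it merely shows the shape of the nested result that the flat model cannot store directly.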
7.2.1 Mapping of Specialization or Generalization

There are several options for mapping a number of subclasses that together form a specialization (or alternatively, that are generalized into a superclass), such as the {SECRETARY, TECHNICIAN, ENGINEER} subclasses of EMPLOYEE in Figure 4.4. We can add a further step to our ER-to-relational mapping algorithm from Section 7.1.1, which has seven steps, to handle the mapping of specialization. Step 8, which follows, gives the most common options; other mappings are also possible. We then discuss the conditions under which each option should be used. We use Attrs(R) to denote the attributes of relation R, and PK(R) to denote the primary key of R.

Step 8: Options for Mapping Specialization or Generalization. Convert each specialization with m subclasses {S_1, S_2, ..., S_m} and (generalized) superclass C, where the attributes of C are {k, a_1, ..., a_n} and k is the (primary) key, into relation schemas using one of the four following options:

• Option 8A: Multiple relations-Superclass and subclasses. Create a relation L for C with attributes Attrs(L) = {k, a_1, ..., a_n} and PK(L) = k. Create a relation L_i for each subclass S_i, 1 ≤ i ≤ m, with the attributes Attrs(L_i) = {k} ∪ {attributes of S_i} and PK(L_i) = k. This option works for any specialization (total or partial, disjoint or overlapping).
• Option 8B: Multiple relations-Subclass relations only. Create a relation L_i for each subclass S_i, 1 ≤ i ≤ m, with the attributes Attrs(L_i) = {attributes of S_i} ∪ {k, a_1, ..., a_n} and PK(L_i) = k. This option only works for a specialization whose subclasses are total (every entity in the superclass must belong to (at least) one of the subclasses).
• Option 8C: Single relation with one type attribute. Create a single relation L with attributes Attrs(L) = {k, a_1, ..., a_n} ∪ {attributes of S_1} ∪ ... ∪ {attributes of S_m} ∪ {t} and PK(L) = k.
The attribute t is called a type (or discriminating) attribute that indicates the subclass to which each tuple belongs, if any. This option works only for a specialization whose subclasses are disjoint, and has the potential for generating many null values if many specific attributes exist in the subclasses.
• Option 8D: Single relation with multiple type attributes. Create a single relation schema L with attributes Attrs(L) = {k, a_1, ..., a_n} ∪ {attributes of S_1} ∪ ... ∪ {attributes of S_m} ∪ {t_1, t_2, ..., t_m} and PK(L) = k. Each t_i, 1 ≤ i ≤ m, is a Boolean type attribute indicating whether a tuple belongs to subclass S_i. This option works for a specialization whose subclasses are overlapping (but will also work for a disjoint specialization).

Options 8A and 8B can be called the multiple-relation options, whereas options 8C and 8D can be called the single-relation options. Option 8A creates a relation L for the superclass C and its attributes, plus a relation L_i for each subclass S_i; each L_i includes the specific (or local) attributes of S_i, plus the primary key of the superclass C, which is propagated to L_i and becomes its primary key. An EQUIJOIN operation on the primary key between any L_i and L produces all the specific and inherited attributes of the entities in S_i. This option is illustrated in Figure 7.4a for the EER schema in Figure 4.4. Option 8A works for any constraints on the specialization: disjoint or overlapping, total or partial. Notice that the constraint π_<k>(L_i) ⊆ π_<k>(L) must hold for each L_i. This specifies a foreign key from each L_i to L, as well as an inclusion dependency L_i.k < L.k (see Section 11.5).

FIGURE 7.4 Options for mapping specialization or generalization. (a) Mapping the EER schema in Figure 4.4 using option 8A. (b) Mapping the EER schema in Figure 4.3b using option 8B. (c) Mapping the EER schema in Figure 4.4 using option 8C. (d) Mapping Figure 4.5 using option 8D with Boolean type fields MFlag and PFlag.

In option 8B, the EQUIJOIN operation is built into the schema, and the relation L is done away with, as illustrated in Figure 7.4b for the EER specialization in Figure 4.3b. This option works well only when both the disjoint and total constraints hold. If the specialization is not total, an entity that does not belong to any of the subclasses S_i is lost. If the specialization is not disjoint, an entity belonging to more than one subclass will have its inherited attributes from the superclass C stored redundantly in more than one L_i. With option 8B, no relation holds all the entities in the superclass C; consequently, we must apply an OUTER UNION (or FULL OUTER JOIN) operation to the L_i relations to retrieve all the entities in C. The result of the outer union will be similar to the relations under options 8C and 8D except that the type fields will be missing. Whenever we search for an arbitrary entity in C, we must search all the m relations L_i.

Options 8C and 8D create a single relation to represent the superclass C and all its subclasses. An entity that does not belong to some of the subclasses will have null values for the specific attributes of these subclasses. These options are hence not recommended if many specific attributes are defined for the subclasses. If few specific subclass attributes exist, however, these mappings are preferable to options 8A and 8B because they do away with the need to specify EQUIJOIN and OUTER UNION operations and hence can yield a more efficient implementation.
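Option 8A can be sketched for the {SECRETARY, TECHNICIAN, ENGINEER} specialization of EMPLOYEE as follows. This is a minimal sketch in Python's sqlite3 module: the relation names and the specific attributes TypingSpeed, TGrade, and EngType follow Figure 7.4a, while the column types, the shortened superclass attribute list, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
-- Relation L for the superclass C (attribute list shortened for the sketch).
CREATE TABLE EMPLOYEE (
    SSN   TEXT PRIMARY KEY,
    LName TEXT
);
-- One relation L_i per subclass S_i: the key k plus the specific attributes.
CREATE TABLE SECRETARY (
    SSN         TEXT PRIMARY KEY REFERENCES EMPLOYEE(SSN),
    TypingSpeed INTEGER
);
CREATE TABLE TECHNICIAN (
    SSN    TEXT PRIMARY KEY REFERENCES EMPLOYEE(SSN),
    TGrade TEXT
);
CREATE TABLE ENGINEER (
    SSN     TEXT PRIMARY KEY REFERENCES EMPLOYEE(SSN),
    EngType TEXT
);
-- Hypothetical rows; employee '333' belongs to no subclass (partial specialization).
INSERT INTO EMPLOYEE VALUES ('111', 'Adams'), ('222', 'Baker'), ('333', 'Clark');
INSERT INTO SECRETARY VALUES ('111', 70);
INSERT INTO ENGINEER  VALUES ('222', 'Civil');
""")

# EQUIJOIN on the key yields the specific plus inherited attributes of engineers.
engineers = conn.execute("""
    SELECT E.SSN, E.LName, G.EngType
    FROM ENGINEER G JOIN EMPLOYEE E ON G.SSN = E.SSN
""").fetchall()

# The foreign key enforces the inclusion dependency: a subclass tuple
# without a matching superclass tuple is rejected.
try:
    conn.execute("INSERT INTO TECHNICIAN VALUES ('999', 'T1')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Under option 8B the EMPLOYEE relation would disappear and LName would be repeated in each subclass relation, so employee '333', who belongs to no subclass, could not be stored at all.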
Option 8C is used to handle disjoint subclasses by including a single type (or image or discriminating) attribute t to indicate the subclass to which each tuple belongs; hence, the domain of t could be {1, 2, ..., m}. If the specialization is partial, t can have null values in tuples that do not belong to any subclass. If the specialization is attribute-defined, that attribute serves the purpose of t and t is not needed; this option is illustrated in Figure 7.4c for the EER specialization in Figure 4.4.

Option 8D is designed to handle overlapping subclasses by including m Boolean type fields, one for each subclass. It can also be used for disjoint subclasses. Each type field t_i can have a domain {yes, no}, where a value of yes indicates that the tuple is a member of subclass S_i. If we use this option for the EER specialization in Figure 4.4, we would include three type attributes (IsASecretary, IsAEngineer, and IsATechnician) instead of the JobType attribute in Figure 7.4c. Notice that it is also possible to create a single type attribute of m bits instead of the m type fields.

When we have a multilevel specialization (or generalization) hierarchy or lattice, we do not have to follow the same mapping option for all the specializations. Instead, we can use one mapping option for part of the hierarchy or lattice and other options for other parts. Figure 7.5 shows one possible mapping into relations for the EER lattice of Figure 4.6. Here we used option 8A for PERSON/{EMPLOYEE, ALUMNUS, STUDENT}, option 8C for EMPLOYEE/{STAFF, FACULTY, STUDENT_ASSISTANT}, and option 8D for STUDENT_ASSISTANT/{RESEARCH_ASSISTANT, TEACHING_ASSISTANT}, STUDENT/STUDENT_ASSISTANT (in STUDENT), and STUDENT/{GRADUATE_STUDENT, UNDERGRADUATE_STUDENT}. In Figure 7.5, all attributes whose names end with 'Type' or 'Flag' are type fields.
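The two single-relation options can be sketched side by side, again through Python's sqlite3 module. The EMPLOYEE relation uses the JobType attribute of Figure 7.4c (option 8C), and the PART relation uses the Boolean fields MFlag and PFlag named in the caption of Figure 7.4d (option 8D). The remaining PART attribute names, the CHECK constraints, the use of 0/1 for Boolean values, and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option 8C: one relation, one type attribute, nullable specific attributes.
CREATE TABLE EMPLOYEE (
    SSN         TEXT PRIMARY KEY,
    LName       TEXT,
    JobType     TEXT CHECK (JobType IN ('Secretary', 'Technician', 'Engineer')),
    TypingSpeed INTEGER,   -- specific to SECRETARY
    TGrade      TEXT,      -- specific to TECHNICIAN
    EngType     TEXT       -- specific to ENGINEER
);
-- Option 8D: one relation, one Boolean flag per subclass (0/1 stands in for Boolean).
CREATE TABLE PART (
    PartNo          INTEGER PRIMARY KEY,
    Description     TEXT,
    MFlag           INTEGER NOT NULL CHECK (MFlag IN (0, 1)),  -- manufactured part?
    DrawingNo       TEXT,
    ManufactureDate TEXT,
    BatchNo         TEXT,
    PFlag           INTEGER NOT NULL CHECK (PFlag IN (0, 1)),  -- purchased part?
    SupplierName    TEXT,
    ListPrice       REAL
);
-- Hypothetical rows. A null JobType marks an employee in no subclass.
INSERT INTO EMPLOYEE VALUES ('111', 'Adams', 'Secretary', 70, NULL, NULL);
INSERT INTO EMPLOYEE VALUES ('333', 'Clark', NULL, NULL, NULL, NULL);
-- Part 1 is both manufactured and purchased (overlapping subclasses).
INSERT INTO PART VALUES (1, 'Bolt', 1, 'D-17', '2003-01-15', 'B-9', 1, 'Acme', 0.25);
INSERT INTO PART VALUES (2, 'Gasket', 0, NULL, NULL, NULL, 1, 'Acme', 1.10);
""")

secretaries = conn.execute(
    "SELECT SSN FROM EMPLOYEE WHERE JobType = 'Secretary'").fetchall()
no_subclass = conn.execute(
    "SELECT SSN FROM EMPLOYEE WHERE JobType IS NULL").fetchall()
in_both_subclasses = conn.execute(
    "SELECT PartNo FROM PART WHERE MFlag = 1 AND PFlag = 1").fetchall()
```

No join is needed to reach the subclass attributes, at the cost of null values in every tuple that is not a member of the corresponding subclass.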
FIGURE 7.5 Mapping the EER specialization lattice in Figure 4.6 using multiple options.

7.2.2 Mapping of Shared Subclasses (Multiple Inheritance)

A shared subclass, such as ENGINEERING_MANAGER of Figure 4.6, is a subclass of several superclasses, indicating multiple inheritance. These classes must all have the same key attribute; otherwise, the shared subclass would be modeled as a category. We can apply any of the options discussed in step 8 to a shared subclass, subject to the restrictions discussed in step 8 of the mapping algorithm. In Figure 7.5, both options 8C and 8D are used for the shared subclass STUDENT_ASSISTANT. Option 8C is used in the EMPLOYEE relation (EmployeeType attribute) and option 8D is used in the STUDENT relation (StudAssistFlag attribute).

7.2.3 Mapping of Categories (Union Types)

We now add another step to the mapping procedure, step 9, to handle categories. A category (or union type) is a subclass of the union of two or more superclasses that can have different keys because they can be of different entity types. An example is the OWNER category shown in Figure 4.7, which is a subset of the union of three entity types PERSON, BANK, and COMPANY. The other category in that figure, REGISTERED_VEHICLE, has two superclasses that have the same key attribute.

Step 9: Mapping of Union Types (Categories). For mapping a category whose defining superclasses have different keys, it is customary to specify a new key attribute, called a surrogate key, when creating a relation to correspond to the category. This is because the keys of the defining classes are different, so we cannot use any one of them exclusively to identify all entities in the category. In our example of Figure 4.7, we can create a relation OWNER to correspond to the OWNER category, as illustrated in Figure 7.6, and include any attributes of the category in this relation. The primary key of the OWNER relation is the surrogate key, which we called OwnerId. We also include the surrogate key attribute OwnerId as foreign key in each relation corresponding to a superclass of the category, to specify the correspondence in values between the surrogate key and the key of each superclass. Notice that if a particular PERSON (or BANK or COMPANY) entity is not a member of OWNER, it would have a null value for its OwnerId attribute in its corresponding tuple in the PERSON (or BANK or COMPANY) relation, and it would not have a tuple in the OWNER relation.

FIGURE 7.6 Mapping the EER categories (union types) in Figure 4.7 to relations.

For a category whose superclasses have the same key, such as REGISTERED_VEHICLE in Figure 4.7, there is no need for a surrogate key. The mapping of the REGISTERED_VEHICLE category, which illustrates this case, is also shown in Figure 7.6.

7.3 SUMMARY

In Section 7.1, we showed how a conceptual schema design in the ER model can be mapped to a relational database schema. An algorithm for ER-to-relational mapping was given and illustrated by examples from the COMPANY database. Table 7.1 summarized the correspondences between the ER and relational model constructs and constraints. We then added additional steps to the algorithm in Section 7.2 for mapping the constructs from the EER model into the relational model.
Similar algorithms are incorporated into graphical database design tools to automatically create a relational schema from a conceptual schema design.

Review Questions

7.1. Discuss the correspondences between the ER model constructs and the relational model constructs. Show how each ER model construct can be mapped to the relational model, and discuss any alternative mappings.
7.2. Discuss the options for mapping EER model constructs to relations.

Exercises

7.3. Try to map the relational schema of Figure 6.12 into an ER schema. This is part of a process known as reverse engineering, where a conceptual schema is created for an existing implemented database. State any assumptions you make.
7.4. Figure 7.7 shows an ER schema for a database that may be used to keep track of transport ships and their locations for maritime authorities. Map this schema into a relational schema, and specify all primary keys and foreign keys.
7.5. Map the BANK ER schema of Exercise 3.23 (shown in Figure 3.17) into a relational schema. Specify all primary keys and foreign keys. Repeat for the AIRLINE schema (Figure 3.16) of Exercise 3.19 and for the other schemas for Exercises 3.16 through 3.24.
7.6. Map the EER diagrams in Figures 4.10 and 4.17 into relational schemas. Justify your choice of mapping options.

FIGURE 7.7 An ER schema for a SHIP_TRACKING database.

Selected Bibliography

The original ER-to-relational mapping algorithm was described in Chen's classic paper (Chen 1976) that presented the original ER model.

Chapter 8 SQL-99: Schema Definition, Basic Constraints, and Queries

The SQL language may be considered one of the major reasons for the success of relational databases in the commercial world.
Because it became a standard for relational databases, users were less concerned about migrating their database applications from other types of database systems (for example, network or hierarchical systems) to relational systems. The reason is that even if users became dissatisfied with the particular relational DBMS product they chose to use, converting to another relational DBMS product would not be expected to be too expensive and time-consuming, since both systems would follow the same language standards. In practice, of course, there are many differences between various commercial relational DBMS packages. However, if the user is diligent in using only those features that are part of the standard, and if both relational systems faithfully support the standard, then conversion between the two systems should be much simplified. Another advantage of having such a standard is that users may write statements in a database application program that can access data stored in two or more relational DBMSs without having to change the database sublanguage (SQL) if both relational DBMSs support standard SQL.

This chapter presents the main features of the SQL standard for commercial relational DBMSs, whereas Chapter 5 presented the most important concepts underlying the formal relational data model. In Chapter 6 (Sections 6.1 through 6.5) we discussed the relational algebra operations, which are very important for understanding the types of requests that may be specified on a relational database. They are also important for query processing and optimization in a relational DBMS, as we shall see in Chapters 15 and 16. However, the relational algebra operations are considered to be too technical for most commercial DBMS users because a query in relational algebra is written as a sequence of operations that, when executed, produces the required result.
Hence, the user must specify how (that is, in what order) to execute the query operations. On the other hand, the SQL language provides a higher-level declarative language interface, so the user only specifies what the result is to be, leaving the actual optimization and decisions on how to execute the query to the DBMS. Although SQL includes some features from relational algebra, it is based to a greater extent on the tuple relational calculus, which we described in Section 6.6. However, the SQL syntax is more user-friendly than either of the two formal languages.

The name SQL is derived from Structured Query Language. Originally, SQL was called SEQUEL (for Structured English QUEry Language) and was designed and implemented at IBM Research as the interface for an experimental relational database system called SYSTEM R. SQL is now the standard language for commercial relational DBMSs. A joint effort by ANSI (the American National Standards Institute) and ISO (the International Organization for Standardization) has led to a standard version of SQL (ANSI 1986), called SQL-86 or SQL1. A revised and much expanded standard called SQL2 (also referred to as SQL-92) was subsequently developed. The next version of the standard was originally called SQL3, but is now called SQL-99. We will try to cover the latest version of SQL as much as possible.

SQL is a comprehensive database language: It has statements for data definition, query, and update. Hence, it is both a DDL and a DML. In addition, it has facilities for defining views on the database, for specifying security and authorization, for defining integrity constraints, and for specifying transaction controls. It also has rules for embedding SQL statements into a general-purpose programming language such as Java or COBOL or C/C++.[1] We will discuss most of these topics in the following subsections.
Because the specification of the SQL standard is expanding, with more features in each version of the standard, the latest SQL-99 standard is divided into a core specification plus optional specialized packages. The core is supposed to be implemented by all RDBMS vendors that are SQL-99 compliant. The packages can be implemented as optional modules to be purchased independently for specific database applications such as data mining, spatial data, temporal data, data warehousing, on-line analytical processing (OLAP), multimedia data, and so on. We give a summary of some of these packages, and where they are discussed in the book, at the end of this chapter.

Because SQL is very important (and quite large) we devote two chapters to its basic features. In this chapter, Section 8.1 describes the SQL DDL commands for creating schemas and tables, and gives an overview of the basic data types in SQL. Section 8.2 presents how basic constraints such as key and referential integrity are specified. Section 8.3 discusses statements for modifying schemas, tables, and constraints. Section 8.4 describes the basic SQL constructs for specifying retrieval queries, and Section 8.5 goes over more complex features of SQL queries, such as aggregate functions and grouping. Section 8.6 describes the SQL commands for insertion, deletion, and updating of data. Section 8.7 lists some SQL features that are presented in other chapters of the book; these include transaction control in Chapter 17, security/authorization in Chapter 23, active databases (triggers) in Chapter 24, object-oriented features in Chapter 22, and OLAP (Online Analytical Processing) features in Chapter 28. Section 8.8 summarizes the chapter.

[1] Originally, SQL had statements for creating and dropping indexes on the files that represent relations, but these have been dropped from the SQL standard for some time.
In the next chapter, we discuss the concept of views (virtual tables), and then describe how more general constraints may be specified as assertions or checks. This is followed by a description of the various database programming techniques for programming with SQL. For the reader who desires a less comprehensive introduction to SQL, parts of Section 8.5 may be skipped.

8.1 SQL DATA DEFINITION AND DATA TYPES

SQL uses the terms table, row, and column for the formal relational model terms relation, tuple, and attribute, respectively. We will use the corresponding terms interchangeably. The main SQL command for data definition is the CREATE statement, which can be used to create schemas, tables (relations), and domains (as well as other constructs such as views, assertions, and triggers). Before we describe the relevant CREATE statements, we discuss schema and catalog concepts in Section 8.1.1 to place our discussion in perspective. Section 8.1.2 describes how tables are created, and Section 8.1.3 describes the most important data types available for attribute specification. Because the SQL specification is very large, we give a description of the most important features. Further details can be found in the various SQL standards documents (see bibliographic notes).

8.1.1 Schema and Catalog Concepts in SQL

Early versions of SQL did not include the concept of a relational database schema; all tables (relations) were considered part of the same schema. The concept of an SQL schema was incorporated starting with SQL2 in order to group together tables and other constructs that belong to the same database application. An SQL schema is identified by a schema name, and includes an authorization identifier to indicate the user or account who owns the schema, as well as descriptors for each element in the schema. Schema elements include tables, constraints, views, domains, and other constructs (such as authorization grants) that describe the schema.
A schema is created via the CREATE SCHEMA statement, which can include all the schema elements' definitions. Alternatively, the schema can be assigned a name and authorization identifier, and the elements can be defined later. For example, the following statement creates a schema called COMPANY, owned by the user with authorization identifier JSMITH:

    CREATE SCHEMA COMPANY AUTHORIZATION JSMITH;

In general, not all users are authorized to create schemas and schema elements. The privilege to create schemas, tables, and other constructs must be explicitly granted to the relevant user accounts by the system administrator or DBA.

In addition to the concept of a schema, SQL2 uses the concept of a catalog: a named collection of schemas in an SQL environment. An SQL environment is basically an installation of an SQL-compliant RDBMS on a computer system.[2] A catalog always contains a special schema called INFORMATION_SCHEMA, which provides information on all the schemas in the catalog and all the element descriptors in these schemas. Integrity constraints such as referential integrity can be defined between relations only if they exist in schemas within the same catalog. Schemas within the same catalog can also share certain elements, such as domain definitions.

8.1.2 The CREATE TABLE Command in SQL

The CREATE TABLE command is used to specify a new relation by giving it a name and specifying its attributes and initial constraints. The attributes are specified first, and each attribute is given a name, a data type to specify its domain of values, and any attribute constraints, such as NOT NULL. The key, entity integrity, and referential integrity constraints can be specified within the CREATE TABLE statement after the attributes are declared, or they can be added later using the ALTER TABLE command (see Section 8.3).
Figure 8.1 shows sample data definition statements in SQL for the relational database schema shown in Figure 5.7.

Typically, the SQL schema in which the relations are declared is implicitly specified in the environment in which the CREATE TABLE statements are executed. Alternatively, we can explicitly attach the schema name to the relation name, separated by a period. For example, by writing

    CREATE TABLE COMPANY.EMPLOYEE ...

rather than

    CREATE TABLE EMPLOYEE ...

as in Figure 8.1, we can explicitly (rather than implicitly) make the EMPLOYEE table part of the COMPANY schema.

The relations declared through CREATE TABLE statements are called base tables (or base relations); this means that the relation and its tuples are actually created and stored as a file by the DBMS. Base relations are distinguished from virtual relations, created through the CREATE VIEW statement (see Section 9.2), which may or may not correspond to an actual physical file. In SQL the attributes in a base table are considered to be ordered in the sequence in which they are specified in the CREATE TABLE statement. However, rows (tuples) are not considered to be ordered within a relation.

[2] SQL also includes the concept of a cluster of catalogs within an environment, but it is not very clear if so many levels of nesting are required in most applications.
(a)
    CREATE TABLE EMPLOYEE
    ( FNAME           VARCHAR(15)    NOT NULL,
      MINIT           CHAR,
      LNAME           VARCHAR(15)    NOT NULL,
      SSN             CHAR(9)        NOT NULL,
      BDATE           DATE,
      ADDRESS         VARCHAR(30),
      SEX             CHAR,
      SALARY          DECIMAL(10,2),
      SUPERSSN        CHAR(9),
      DNO             INT            NOT NULL,
      PRIMARY KEY (SSN),
      FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN),
      FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER) );

    CREATE TABLE DEPARTMENT
    ( DNAME           VARCHAR(15)    NOT NULL,
      DNUMBER         INT            NOT NULL,
      MGRSSN          CHAR(9)        NOT NULL,
      MGRSTARTDATE    DATE,
      PRIMARY KEY (DNUMBER),
      UNIQUE (DNAME),
      FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN) );

    CREATE TABLE DEPT_LOCATIONS
    ( DNUMBER         INT            NOT NULL,
      DLOCATION       VARCHAR(15)    NOT NULL,
      PRIMARY KEY (DNUMBER, DLOCATION),
      FOREIGN KEY (DNUMBER) REFERENCES DEPARTMENT(DNUMBER) );

    CREATE TABLE PROJECT
    ( PNAME           VARCHAR(15)    NOT NULL,
      PNUMBER         INT            NOT NULL,
      PLOCATION       VARCHAR(15),
      DNUM            INT            NOT NULL,
      PRIMARY KEY (PNUMBER),
      UNIQUE (PNAME),
      FOREIGN KEY (DNUM) REFERENCES DEPARTMENT(DNUMBER) );

    CREATE TABLE WORKS_ON
    ( ESSN            CHAR(9)        NOT NULL,
      PNO             INT            NOT NULL,
      HOURS           DECIMAL(3,1)   NOT NULL,
      PRIMARY KEY (ESSN, PNO),
      FOREIGN KEY (ESSN) REFERENCES EMPLOYEE(SSN),
      FOREIGN KEY (PNO) REFERENCES PROJECT(PNUMBER) );

    CREATE TABLE DEPENDENT
    ( ESSN            CHAR(9)        NOT NULL,
      DEPENDENT_NAME  VARCHAR(15)    NOT NULL,
      SEX             CHAR,
      BDATE           DATE,
      RELATIONSHIP    VARCHAR(8),
      PRIMARY KEY (ESSN, DEPENDENT_NAME),
      FOREIGN KEY (ESSN) REFERENCES EMPLOYEE(SSN) );

FIGURE 8.1 SQL CREATE TABLE data definition statements for defining the COMPANY schema from Figure 5.7

8.1.3 Attribute Data Types and Domains in SQL

The basic data types available for attributes include numeric, character string, bit string, boolean, date, and time.

• Numeric data types include integer numbers of various sizes (INTEGER or INT, and SMALLINT) and floating-point (real) numbers of various precision (FLOAT or REAL, and DOUBLE PRECISION).
Formatted numbers can be declared by using DECIMAL(i,j) or DEC(i,j) or NUMERIC(i,j), where i, the precision, is the total number of decimal digits and j, the scale, is the number of digits after the decimal point. The default for scale is zero, and the default for precision is implementation-defined.

• Character-string data types are either fixed length (CHAR(n) or CHARACTER(n), where n is the number of characters) or varying length (VARCHAR(n) or CHAR VARYING(n) or CHARACTER VARYING(n), where n is the maximum number of characters). When specifying a literal string value, it is placed between single quotation marks (apostrophes), and it is case sensitive (a distinction is made between uppercase and lowercase).[3] For fixed-length strings, a shorter string is padded with blank characters to the right. For example, if the value 'Smith' is for an attribute of type CHAR(10), it is padded with five blank characters to become 'Smith     ' if needed. Padded blanks are generally ignored when strings are compared. For comparison purposes, strings are considered ordered in alphabetic (or lexicographic) order; if a string str1 appears before another string str2 in alphabetic order, then str1 is considered to be less than str2.[4] There is also a concatenation operator denoted by || (double vertical bar) that can concatenate two strings in SQL. For example, 'abc' || 'XYZ' results in a single string 'abcXYZ'.

• Bit-string data types are either of fixed length n (BIT(n)) or varying length (BIT VARYING(n), where n is the maximum number of bits). The default for n, the length of a character string or bit string, is 1. Literal bit strings are placed between single quotes but preceded by a B to distinguish them from character strings; for example, B'10101'.[5]

• A boolean data type has the traditional values of TRUE or FALSE. In SQL, because of the presence of NULL values, a three-valued logic is used, so a third possible value for a boolean data type is UNKNOWN.
We discuss the need for UNKNOWN and the three-valued logic in Section 8.5.1.

• New data types for date and time were added in SQL2. The DATE data type has ten positions, and its components are YEAR, MONTH, and DAY in the form YYYY-MM-DD. The TIME data type has at least eight positions, with the components HOUR, MINUTE, and SECOND in the form HH:MM:SS. Only valid dates and times should be allowed by the SQL implementation. The < (less than) comparison can be used with dates or times: an earlier date is considered to be smaller than a later date, and similarly with time. Literal values are represented by single-quoted strings preceded by the keyword DATE or TIME; for example, DATE '2002-09-27' or TIME '09:12:47'. In addition, a data type TIME(i), where i is called time fractional seconds precision, specifies i + 1 additional positions for TIME: one position for an additional separator character, and i positions for specifying decimal fractions of a second. A TIME WITH TIME ZONE data type includes an additional six positions for specifying the displacement from the standard universal time zone, which is in the range +13:00 to -12:59 in units of HOURS:MINUTES. If WITH TIME ZONE is not included, the default is the local time zone for the SQL session.

[3] This is not the case with SQL keywords, such as CREATE or CHAR. With keywords, SQL is case insensitive, meaning that SQL treats uppercase and lowercase letters as equivalent in keywords.
[4] For nonalphabetic characters, there is a defined order.
[5] Bit strings whose length is a multiple of 4 can also be specified in hexadecimal notation, where the literal string is preceded by X and each hexadecimal character represents 4 bits.

• A timestamp data type (TIMESTAMP) includes both the DATE and TIME fields, plus a minimum of six positions for decimal fractions of seconds and an optional WITH TIME ZONE qualifier.
Literal values are represented by single-quoted strings preceded by the keyword TIMESTAMP, with a blank space between date and time; for example, TIMESTAMP '2002-09-27 09:12:47.648302'.

• Another data type related to DATE, TIME, and TIMESTAMP is the INTERVAL data type. This specifies an interval: a relative value that can be used to increment or decrement an absolute value of a date, time, or timestamp. Intervals are qualified to be either YEAR/MONTH intervals or DAY/TIME intervals.

• The format of DATE, TIME, and TIMESTAMP can be considered as a special type of string. Hence, they can generally be used in string comparisons by being cast (or coerced or converted) into the equivalent strings.

It is possible to specify the data type of each attribute directly, as in Figure 8.1; alternatively, a domain can be declared, and the domain name used with the attribute specification. This makes it easier to change the data type for a domain that is used by numerous attributes in a schema, and improves schema readability. For example, we can create a domain SSN_TYPE by the following statement:

    CREATE DOMAIN SSN_TYPE AS CHAR(9);

We can use SSN_TYPE in place of CHAR(9) in Figure 8.1 for the attributes SSN and SUPERSSN of EMPLOYEE, MGRSSN of DEPARTMENT, ESSN of WORKS_ON, and ESSN of DEPENDENT. A domain can also have an optional default specification via a DEFAULT clause, as we discuss later for attributes.

8.2 SPECIFYING BASIC CONSTRAINTS IN SQL

We now describe the basic constraints that can be specified in SQL as part of table creation. These include key and referential integrity constraints, as well as restrictions on attribute domains and NULLs, and constraints on individual tuples within a relation. We discuss the specification of more general constraints, called assertions, in Section 9.1.
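Before moving on to constraints, the string and date behavior described in Section 8.1.3 can be tried out in a few lines. The following is only a sketch, and it rests on an assumption that is not part of the text: it uses SQLite through Python's standard sqlite3 module, which is not a full SQL-99 implementation. In particular, SQLite has no DATE '...' literal syntax, so the dates below are written as plain ISO-format strings, which also illustrates the remark that dates can be compared as strings.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for an SQL environment.
conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a query that returns one row with one column; return that value."""
    return conn.execute(sql).fetchone()[0]

# The || operator concatenates two strings.
concatenated = scalar("SELECT 'abc' || 'XYZ'")

# String literals are case sensitive (SQLite reports booleans as 1 or 0).
case_equal = scalar("SELECT 'Smith' = 'smith'")

# Strings are ordered lexicographically.
smith_before_wong = scalar("SELECT 'Smith' < 'Wong'")

# Dates written as YYYY-MM-DD strings compare in chronological order.
earlier_is_smaller = scalar("SELECT '2002-09-27' < '2003-01-15'")

print(concatenated, case_equal, smith_before_wong, earlier_is_smaller)
```

The blank-padding rule for CHAR(n) is deliberately left out, because SQLite does not pad fixed-length strings; a system that follows the standard more closely may behave differently there.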
8.2.1 Specifying Attribute Constraints and Attribute Defaults

Because SQL allows NULLs as attribute values, a constraint NOT NULL may be specified if NULL is not permitted for a particular attribute. This is always implicitly specified for the attributes that are part of the primary key of each relation, but it can be specified for any other attributes whose values are required not to be NULL, as shown in Figure 8.1.

It is also possible to define a default value for an attribute by appending the clause DEFAULT <value> to an attribute definition. The default value is included in any new tuple if an explicit value is not provided for that attribute. Figure 8.2 illustrates an example of specifying a default manager for a new department and a default department for a new employee. If no DEFAULT clause is specified, the default value is NULL for attributes that do not have the NOT NULL constraint.

Another type of constraint can restrict attribute or domain values using the CHECK clause following an attribute or domain definition.[6] For example, suppose that department numbers are restricted to integer numbers between 1 and 20; then, we can change the attribute declaration of DNUMBER in the DEPARTMENT table (see Figure 8.1) to the following:

    DNUMBER INT NOT NULL CHECK (DNUMBER > 0 AND DNUMBER < 21);

The CHECK clause can also be used in conjunction with the CREATE DOMAIN statement. For example, we can write the following statement:

    CREATE DOMAIN D_NUM AS INTEGER CHECK (D_NUM > 0 AND D_NUM < 21);

We can then use the created domain D_NUM as the attribute type for all attributes that refer to department numbers in Figure 8.1, such as DNUMBER of DEPARTMENT, DNUM of PROJECT, DNO of EMPLOYEE, and so on.

8.2.2 Specifying Key and Referential Integrity Constraints

Because keys and referential integrity constraints are very important, there are special clauses within the CREATE TABLE statement to specify them.
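The DEFAULT and CHECK clauses of Section 8.2.1 can be seen in action with a small experiment. This is a minimal sketch under stated assumptions: it runs on SQLite via Python's sqlite3 module rather than on a fully standard-compliant DBMS, and the DEPARTMENT table is cut down to three columns for brevity. SQLite does not support CREATE DOMAIN, so only the attribute-level form is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A reduced DEPARTMENT table (assumption: only three of the columns from
# Figure 8.1 are kept) with an attribute-level CHECK and a DEFAULT.
conn.execute("""
    CREATE TABLE DEPARTMENT (
        DNAME   VARCHAR(15) NOT NULL,
        DNUMBER INT         NOT NULL CHECK (DNUMBER > 0 AND DNUMBER < 21),
        MGRSSN  CHAR(9)     NOT NULL DEFAULT '888665555'
    )
""")

# MGRSSN is not supplied, so the declared default is stored.
conn.execute("INSERT INTO DEPARTMENT (DNAME, DNUMBER) VALUES ('Research', 5)")
default_mgr = conn.execute(
    "SELECT MGRSSN FROM DEPARTMENT WHERE DNUMBER = 5"
).fetchone()[0]

# DNUMBER = 21 falls outside the range 1..20, so the insert is rejected.
try:
    conn.execute("INSERT INTO DEPARTMENT (DNAME, DNUMBER) VALUES ('Overflow', 21)")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

print(default_mgr, check_rejected)
```

The department name 'Overflow' is a made-up value used only to trigger the violation.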
Some examples to illustrate the specification of keys and referential integrity are shown in Figure 8.1.[7] The PRIMARY KEY clause specifies one or more attributes that make up the primary key of a relation. If a primary key has a single attribute, the clause can follow the attribute directly. For example, the primary key of DEPARTMENT can be specified as follows (instead of the way it is specified in Figure 8.1):

    DNUMBER INT PRIMARY KEY;

    CREATE TABLE EMPLOYEE
    ( ... ,
      DNO      INT      NOT NULL  DEFAULT 1,
      CONSTRAINT EMPPK
        PRIMARY KEY (SSN),
      CONSTRAINT EMPSUPERFK
        FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN)
          ON DELETE SET NULL ON UPDATE CASCADE,
      CONSTRAINT EMPDEPTFK
        FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER)
          ON DELETE SET DEFAULT ON UPDATE CASCADE );

    CREATE TABLE DEPARTMENT
    ( ... ,
      MGRSSN   CHAR(9)  NOT NULL  DEFAULT '888665555',
      CONSTRAINT DEPTPK
        PRIMARY KEY (DNUMBER),
      CONSTRAINT DEPTSK
        UNIQUE (DNAME),
      CONSTRAINT DEPTMGRFK
        FOREIGN KEY (MGRSSN) REFERENCES EMPLOYEE(SSN)
          ON DELETE SET DEFAULT ON UPDATE CASCADE );

    CREATE TABLE DEPT_LOCATIONS
    ( ... ,
      PRIMARY KEY (DNUMBER, DLOCATION),
      FOREIGN KEY (DNUMBER) REFERENCES DEPARTMENT(DNUMBER)
        ON DELETE CASCADE ON UPDATE CASCADE );

FIGURE 8.2 Example illustrating how default attribute values and referential triggered actions are specified in SQL

[6] The CHECK clause can also be used for other purposes, as we shall see.
[7] Key and referential integrity constraints were not included in early versions of SQL. In some earlier implementations, keys were specified implicitly at the internal level via the CREATE INDEX command.

The UNIQUE clause specifies alternate (secondary) keys, as illustrated in the DEPARTMENT and PROJECT table declarations in Figure 8.1.

Referential integrity is specified via the FOREIGN KEY clause, as shown in Figure 8.1.
As we discussed in Section 5.2.4, a referential integrity constraint can be violated when tuples are inserted or deleted, or when a foreign key or primary key attribute value is modified. The default action that SQL takes for an integrity violation is to reject the update operation that will cause a violation. However, the schema designer can specify an alternative action to be taken if a referential integrity constraint is violated, by attaching a referential triggered action clause to any foreign key constraint. The options include SET NULL, CASCADE, and SET DEFAULT. An option must be qualified with either ON DELETE or ON UPDATE. We illustrate this with the examples shown in Figure 8.2. Here, the database designer chooses ON DELETE SET NULL and ON UPDATE CASCADE for the foreign key SUPERSSN of EMPLOYEE. This means that if the tuple for a supervising employee is deleted, the value of SUPERSSN is automatically set to NULL for all employee tuples that were referencing the deleted employee tuple. On the other hand, if the SSN value for a supervising employee is updated (say, because it was entered incorrectly), the new value is cascaded to SUPERSSN for all employee tuples referencing the updated employee tuple.

In general, the action taken by the DBMS for SET NULL or SET DEFAULT is the same for both ON DELETE and ON UPDATE: The value of the affected referencing attributes is changed to NULL for SET NULL, and to the specified default value for SET DEFAULT. The action for CASCADE ON DELETE is to delete all the referencing tuples, whereas the action for CASCADE ON UPDATE is to change the value of the foreign key to the updated (new) primary key value for all referencing tuples. It is the responsibility of the database designer to choose the appropriate action and to specify it in the database schema.
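The effect of the referential triggered actions chosen for SUPERSSN can be traced step by step. The sketch below is an illustration under assumptions, not the behavior of any particular commercial DBMS: it uses SQLite through Python's sqlite3 module, where foreign key enforcement must first be switched on with a pragma, and it keeps only three EMPLOYEE columns. The employee 'Ghost' and the SSN values '000000000', '111111111', and '333445556' are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: SQLite enforces foreign keys only after this pragma is issued.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE EMPLOYEE (
        LNAME    VARCHAR(15) NOT NULL,
        SSN      CHAR(9)     NOT NULL,
        SUPERSSN CHAR(9),
        CONSTRAINT EMPPK PRIMARY KEY (SSN),
        CONSTRAINT EMPSUPERFK FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN)
            ON DELETE SET NULL ON UPDATE CASCADE
    )
""")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Borg", "888665555", None),          # Borg has no supervisor
     ("Wong", "333445555", "888665555"),   # Wong is supervised by Borg
     ("Smith", "123456789", "333445555")], # Smith is supervised by Wong
)

def supervisor_of(lname):
    """Return the SUPERSSN currently stored for the given last name."""
    return conn.execute(
        "SELECT SUPERSSN FROM EMPLOYEE WHERE LNAME = ?", (lname,)
    ).fetchone()[0]

# Default action: an insert that references a nonexistent supervisor is rejected.
try:
    conn.execute("INSERT INTO EMPLOYEE VALUES ('Ghost', '000000000', '111111111')")
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

# ON UPDATE CASCADE: correcting Wong's SSN propagates to Smith's SUPERSSN.
conn.execute("UPDATE EMPLOYEE SET SSN = '333445556' WHERE LNAME = 'Wong'")
after_update = supervisor_of("Smith")

# ON DELETE SET NULL: deleting Wong leaves Smith with a NULL supervisor.
conn.execute("DELETE FROM EMPLOYEE WHERE LNAME = 'Wong'")
after_delete = supervisor_of("Smith")

print(insert_rejected, after_update, after_delete)
```

Replacing SET NULL with CASCADE in the constraint would instead delete Smith's row along with Wong's, which is why the choice of action is left to the schema designer.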
As a general rule, the CASCADE option is suitable for "relationship" relations (see Section 7.1), such as WORKS_ON; for relations that represent multivalued attributes, such as DEPT_LOCATIONS; and for relations that represent weak entity types, such as DEPENDENT.

8.2.3 Giving Names to Constraints

Figure 8.2 also illustrates how a constraint may be given a constraint name, following the keyword CONSTRAINT. The names of all constraints within a particular schema must be unique. A constraint name is used to identify a particular constraint in case the constraint must be dropped later and replaced with another constraint, as we discuss in Section 8.3. Giving names to constraints is optional.

8.2.4 Specifying Constraints on Tuples Using CHECK

In addition to key and referential integrity constraints, which are specified by special keywords, other table constraints can be specified through additional CHECK clauses at the end of a CREATE TABLE statement. These can be called tuple-based constraints because they apply to each tuple individually and are checked whenever a tuple is inserted or modified. For example, suppose that the DEPARTMENT table in Figure 8.1 had an additional attribute DEPT_CREATE_DATE, which stores the date when the department was created. Then we could add the following CHECK clause at the end of the CREATE TABLE statement for the DEPARTMENT table to make sure that a manager's start date is later than the department creation date:

    CHECK (DEPT_CREATE_DATE < MGRSTARTDATE);

The CHECK clause can also be used to specify more general constraints using the CREATE ASSERTION statement of SQL. We discuss this in Section 9.1 because it requires the full power of queries, which are discussed in Sections 8.4 and 8.5.
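The tuple-based CHECK of Section 8.2.4 can be combined with a constraint name from Section 8.2.3, as the following sketch suggests. It makes several assumptions: SQLite (through Python's sqlite3 module) is used as the test system, dates are stored as ISO-format strings so that < compares them chronologically, the DEPARTMENT table is reduced to four columns, and the constraint name DEPTDATES and all date values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DEPARTMENT (
        DNAME            VARCHAR(15) NOT NULL,
        DNUMBER          INT         NOT NULL,
        DEPT_CREATE_DATE DATE,
        MGRSTARTDATE     DATE,
        CONSTRAINT DEPTPK PRIMARY KEY (DNUMBER),
        -- DEPTDATES is a hypothetical name for the tuple-based constraint.
        CONSTRAINT DEPTDATES CHECK (DEPT_CREATE_DATE < MGRSTARTDATE)
    )
""")

# A tuple that satisfies the constraint is accepted.
conn.execute(
    "INSERT INTO DEPARTMENT VALUES ('Research', 5, '1980-01-01', '1988-05-22')"
)

def violates(sql):
    """Execute sql; report True if an integrity constraint rejected it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Checked on insertion: the manager would start before the department exists.
insert_rejected = violates(
    "INSERT INTO DEPARTMENT VALUES ('Administration', 4, '1996-01-01', '1995-01-01')"
)

# Checked on modification as well.
update_rejected = violates(
    "UPDATE DEPARTMENT SET MGRSTARTDATE = '1979-12-31' WHERE DNUMBER = 5"
)

row_count = conn.execute("SELECT COUNT(*) FROM DEPARTMENT").fetchone()[0]
print(insert_rejected, update_rejected, row_count)
```

Because the constraint mentions two attributes of the same tuple, it cannot be written as an attribute-level CHECK on either column alone; that is what makes it a tuple-based constraint.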
8.3 SCHEMA CHANGE STATEMENTS IN SQL

In this section, we give an overview of the schema evolution commands available in SQL, which can be used to alter a schema by adding or dropping tables, attributes, constraints, and other schema elements.

8.3.1 The DROP Command

The DROP command can be used to drop named schema elements, such as tables, domains, or constraints. One can also drop a schema. For example, if a whole schema is not needed any more, the DROP SCHEMA command can be used. There are two drop behavior options: CASCADE and RESTRICT. For example, to remove the COMPANY database schema and all its tables, domains, and other elements, the CASCADE option is used as follows:

    DROP SCHEMA COMPANY CASCADE;

If the RESTRICT option is chosen in place of CASCADE, the schema is dropped only if it has no elements in it; otherwise, the DROP command will not be executed.

If a base relation within a schema is not needed any longer, the relation and its definition can be deleted by using the DROP TABLE command. For example, if we no longer wish to keep track of dependents of employees in the COMPANY database of Figure 8.1, we can get rid of the DEPENDENT relation by issuing the following command:

    DROP TABLE DEPENDENT CASCADE;

If the RESTRICT option is chosen instead of CASCADE, a table is dropped only if it is not referenced in any constraints (for example, by foreign key definitions in another relation) or views (see Section 9.2). With the CASCADE option, all such constraints and views that reference the table are dropped automatically from the schema, along with the table itself.

The DROP command can also be used to drop other types of named schema elements, such as constraints or domains.

8.3.2 The ALTER Command

The definition of a base table or of other named schema elements can be changed by using the ALTER command.
For base tables, the possible alter table actions include adding or dropping a column (attribute), changing a column definition, and adding or dropping table constraints. For example, to add an attribute for keeping track of jobs of employees to the EMPLOYEE base relation in the COMPANY schema, we can use the command

    ALTER TABLE COMPANY.EMPLOYEE ADD JOB VARCHAR(12);

We must still enter a value for the new attribute JOB for each individual EMPLOYEE tuple. This can be done either by specifying a default clause or by using the UPDATE command (see Section 8.6). If no default clause is specified, the new attribute will have NULLs in all the tuples of the relation immediately after the command is executed; hence, the NOT NULL constraint is not allowed in this case.

To drop a column, we must choose either CASCADE or RESTRICT for drop behavior. If CASCADE is chosen, all constraints and views that reference the column are dropped automatically from the schema, along with the column. If RESTRICT is chosen, the command is successful only if no views or constraints (or other elements) reference the column. For example, the following command removes the attribute ADDRESS from the EMPLOYEE base table:

    ALTER TABLE COMPANY.EMPLOYEE DROP ADDRESS CASCADE;

It is also possible to alter a column definition by dropping an existing default clause or by defining a new default clause. The following examples illustrate this clause:

    ALTER TABLE COMPANY.DEPARTMENT ALTER MGRSSN DROP DEFAULT;
    ALTER TABLE COMPANY.DEPARTMENT ALTER MGRSSN SET DEFAULT '333445555';

One can also change the constraints specified on a table by adding or dropping a constraint. To be dropped, a constraint must have been given a name when it was specified.
For example, to drop the constraint named EMPSUPERFK in Figure 8.2 from the EMPLOYEE relation, we write:

    ALTER TABLE COMPANY.EMPLOYEE DROP CONSTRAINT EMPSUPERFK CASCADE;

Once this is done, we can redefine a replacement constraint by adding a new constraint to the relation, if needed. This is specified by using the ADD keyword in the ALTER TABLE statement followed by the new constraint, which can be named or unnamed and can be of any of the table constraint types discussed.

The preceding subsections gave an overview of the schema evolution commands of SQL. There are many other details and options, and we refer the interested reader to the SQL documents listed in the bibliographical notes. The next two sections discuss the querying capabilities of SQL.

8.4 BASIC QUERIES IN SQL

SQL has one basic statement for retrieving information from a database: the SELECT statement. The SELECT statement has no relationship to the SELECT operation of relational algebra, which was discussed in Chapter 6. There are many options and flavors to the SELECT statement in SQL, so we will introduce its features gradually. We will use example queries specified on the schema of Figure 5.5 and will refer to the sample database state shown in Figure 5.6 to show the results of some of the example queries.

Before proceeding, we must point out an important distinction between SQL and the formal relational model discussed in Chapter 5: SQL allows a table (relation) to have two or more tuples that are identical in all their attribute values. Hence, in general, an SQL table is not a set of tuples, because a set does not allow two identical members; rather, it is a multiset (sometimes called a bag) of tuples. Some SQL relations are constrained to be sets because a key constraint has been declared or because the DISTINCT option has been used with the SELECT statement (described later in this section). We should be aware of this distinction as we discuss the examples.
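The difference between a multiset and a set is easy to observe on a table that has no key constraint. The following sketch assumes SQLite accessed through Python's sqlite3 module; the table PAY_RECORD and its contents are hypothetical and are not part of the COMPANY schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# PAY_RECORD is a hypothetical table with no key, so identical rows may coexist.
conn.execute("CREATE TABLE PAY_RECORD (LNAME VARCHAR(15), SALARY INT)")
conn.executemany(
    "INSERT INTO PAY_RECORD VALUES (?, ?)",
    [("Smith", 30000), ("Smith", 30000), ("Wong", 40000)],
)

# Without DISTINCT the result is a multiset: the duplicate row appears twice.
all_rows = conn.execute("SELECT LNAME, SALARY FROM PAY_RECORD").fetchall()

# With DISTINCT duplicates are eliminated and the result is a set of tuples.
distinct_rows = conn.execute(
    "SELECT DISTINCT LNAME, SALARY FROM PAY_RECORD"
).fetchall()

print(len(all_rows), len(distinct_rows))
```

Declaring PRIMARY KEY (LNAME) on such a table would make the second 'Smith' insert fail, which is the other way the text mentions for constraining a relation to be a set.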
8.4.1 The SELECT-FROM-WHERE Structure of Basic SQL Queries

Queries in SQL can be very complex. We will start with simple queries, and then progress to more complex ones in a step-by-step manner. The basic form of the SELECT statement, sometimes called a mapping or a select-from-where block, is formed of the three clauses SELECT, FROM, and WHERE and has the following form:

    SELECT <attribute list>
    FROM   <table list>
    WHERE  <condition>;

where

• <attribute list> is a list of attribute names whose values are to be retrieved by the query.
• <table list> is a list of the relation names required to process the query.
• <condition> is a conditional (Boolean) expression that identifies the tuples to be retrieved by the query.

In SQL, the basic logical comparison operators for comparing attribute values with one another and with literal constants are =, <, <=, >, >=, and <>. These correspond to the relational algebra operators $=$, $<$, $\leq$, $>$, $\geq$, and $\neq$, respectively, and to the C/C++ programming language operators ==, <, <=, >, >=, and !=. The main differences are the equality and not-equal operators. SQL has many additional comparison operators that we shall present gradually as needed.

We now illustrate the basic SELECT statement in SQL with some example queries. The queries are labeled here with the same query numbers that appear in Chapter 6 for easy cross reference.

QUERY 0
Retrieve the birthdate and address of the employee(s) whose name is 'John B. Smith'.

    Q0: SELECT BDATE, ADDRESS
        FROM   EMPLOYEE
        WHERE  FNAME='John' AND MINIT='B' AND LNAME='Smith';

This query involves only the EMPLOYEE relation listed in the FROM clause. The query selects the EMPLOYEE tuples that satisfy the condition of the WHERE clause, then projects the result on the BDATE and ADDRESS attributes listed in the SELECT clause. Q0 is similar to the following relational algebra expression, except that duplicates, if any, would not be eliminated:

    $\pi_{\text{BDATE, ADDRESS}}(\sigma_{\text{FNAME='John' AND MINIT='B' AND LNAME='Smith'}}(\text{EMPLOYEE}))$

Hence, a simple SQL query with a single relation name in the FROM clause is similar to a SELECT-PROJECT pair of relational algebra operations. The SELECT clause of SQL specifies the projection attributes, and the WHERE clause specifies the selection condition. The only difference is that in the SQL query we may get duplicate tuples in the result, because the constraint that a relation is a set is not enforced. Figure 8.3a shows the result of query Q0 on the database of Figure 5.6.
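Query Q0 can be run against a small test table to see the selection and projection at work. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module, keeps only six EMPLOYEE columns, and loads just two of the employees from the sample state, with values taken from Figure 8.3. A second query is added to show the <> (not equal) operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced EMPLOYEE table (assumption: only the columns Q0 needs, plus SSN).
conn.execute("""
    CREATE TABLE EMPLOYEE (
        FNAME   VARCHAR(15) NOT NULL,
        MINIT   CHAR,
        LNAME   VARCHAR(15) NOT NULL,
        SSN     CHAR(9)     NOT NULL,
        BDATE   DATE,
        ADDRESS VARCHAR(30),
        PRIMARY KEY (SSN)
    )
""")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?)",
    [("John", "B", "Smith", "123456789", "1965-01-09", "731 Fondren, Houston, TX"),
     ("Franklin", "T", "Wong", "333445555", "1955-12-08", "638 Voss, Houston, TX")],
)

# Q0: the WHERE clause selects tuples; the SELECT clause projects two attributes.
q0_result = conn.execute("""
    SELECT BDATE, ADDRESS
    FROM   EMPLOYEE
    WHERE  FNAME='John' AND MINIT='B' AND LNAME='Smith'
""").fetchall()

# The SQL not-equal operator is written <>.
not_smith = conn.execute(
    "SELECT LNAME FROM EMPLOYEE WHERE LNAME <> 'Smith'"
).fetchall()

print(q0_result, not_smith)
```

Each result row comes back as a Python tuple, and the list of rows may contain duplicates in general, which mirrors the multiset semantics discussed above.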
The query Q0 is also similar to the following tuple relational calculus expression, except that duplicates, if any, would again not be eliminated in the SQL query:

Q0: {t.BDATE, t.ADDRESS | EMPLOYEE(t) AND t.FNAME='John' AND t.MINIT='B' AND t.LNAME='Smith'}

Hence, we can think of an implicit tuple variable in the SQL query ranging over each tuple in the EMPLOYEE table and evaluating the condition in the WHERE clause. Only those tuples that satisfy the condition (that is, those tuples for which the condition evaluates to TRUE after substituting their corresponding attribute values) are selected.

QUERY 1
Retrieve the name and address of all employees who work for the 'Research' department.

Q1: SELECT FNAME, LNAME, ADDRESS
    FROM   EMPLOYEE, DEPARTMENT
    WHERE  DNAME='Research' AND DNUMBER=DNO;

Query Q1 is similar to a SELECT-PROJECT-JOIN sequence of relational algebra operations. Such queries are often called select-project-join queries. In the WHERE clause of Q1, the condition DNAME = 'Research' is a selection condition and corresponds to a SELECT operation in the relational algebra. The condition DNUMBER = DNO is a join condition, which corresponds to a JOIN condition in the relational algebra. The result of query Q1 is shown in Figure 8.3b. In general, any number of select and join conditions may be specified in a single SQL query. The next example is a select-project-join query with two join conditions.

QUERY 2
For every project located in 'Stafford', list the project number, the controlling department number, and the department manager's last name, address, and birthdate.
Q2: SELECT PNUMBER, DNUM, LNAME, ADDRESS, BDATE
    FROM   PROJECT, DEPARTMENT, EMPLOYEE
    WHERE  DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford';

The join condition DNUM = DNUMBER relates a project to its controlling department, whereas the join condition MGRSSN = SSN relates the controlling department to the employee who manages that department. The result of query Q2 is shown in Figure 8.3c.

FIGURE 8.3 Results of SQL queries when applied to the COMPANY database state shown in Figure 5.6. (a) Q0. (b) Q1. (c) Q2. (d) Q8. (e) Q9. (f) Q10. (g) Q1C.

(a)
BDATE       ADDRESS
1965-01-09  731 Fondren, Houston, TX

(b)
FNAME     LNAME    ADDRESS
John      Smith    731 Fondren, Houston, TX
Franklin  Wong     638 Voss, Houston, TX
Ramesh    Narayan  975 FireOak, Humble, TX
Joyce     English  5631 Rice, Houston, TX

(c)
PNUMBER  DNUM  LNAME    ADDRESS                  BDATE
10       4     Wallace  291 Berry, Bellaire, TX  1941-06-20
30       4     Wallace  291 Berry, Bellaire, TX  1941-06-20

(d)
E.FNAME   E.LNAME  S.FNAME   S.LNAME
John      Smith    Franklin  Wong
Franklin  Wong     James     Borg
Alicia    Zelaya   Jennifer  Wallace
Jennifer  Wallace  James     Borg
Ramesh    Narayan  Franklin  Wong
Joyce     English  Franklin  Wong
Ahmad     Jabbar   Jennifer  Wallace

(e)
SSN
123456789
333445555
999887777
987654321
666884444
453453453
987987987
888665555

(f)
SSN        DNAME
123456789  Research
333445555  Research
999887777  Research
987654321  Research
666884444  Research
453453453  Research
987987987  Research
888665555  Research
123456789  Administration
333445555  Administration
999887777  Administration
987654321  Administration
666884444  Administration
453453453  Administration
987987987  Administration
888665555  Administration
123456789  Headquarters
333445555  Headquarters
999887777  Headquarters
987654321  Headquarters
666884444  Headquarters
453453453  Headquarters
987987987  Headquarters
888665555  Headquarters

(g)
FNAME     MINIT  LNAME    SSN        BDATE       ADDRESS                   SEX  SALARY  SUPERSSN   DNO
John      B      Smith    123456789  1965-09-01  731 Fondren, Houston, TX  M    30000   333445555  5
Franklin  T      Wong     333445555  1955-12-08  638 Voss, Houston, TX     M    40000   888665555  5
Ramesh    K      Narayan  666884444  1962-09-15  975 FireOak, Humble, TX   M    38000   333445555  5
Joyce     A      English  453453453  1972-07-31  5631 Rice, Houston, TX    F    25000   333445555  5

8.4.2 Ambiguous Attribute Names, Aliasing, and Tuple Variables

In SQL the same name can be used for two (or more) attributes as long as the attributes are in different relations. If this is the case, and a query refers to two or more attributes with the same name, we must qualify the attribute name with the relation name to prevent ambiguity. This is done by prefixing the relation name to the attribute name and separating the two by a period. To illustrate this, suppose that in Figures 5.5 and 5.6 the DNO and LNAME attributes of the EMPLOYEE relation were called DNUMBER and NAME, and the DNAME attribute of DEPARTMENT was also called NAME; then, to prevent ambiguity, query Q1 would be rephrased as shown in Q1A. We must prefix the attributes NAME and DNUMBER in Q1A to specify which ones we are referring to, because the attribute names are used in both relations:

Q1A: SELECT FNAME, EMPLOYEE.NAME, ADDRESS
     FROM   EMPLOYEE, DEPARTMENT
     WHERE  DEPARTMENT.NAME='Research' AND
            DEPARTMENT.DNUMBER=EMPLOYEE.DNUMBER;

Ambiguity also arises in the case of queries that refer to the same relation twice, as in the following example.

QUERY 8
For each employee, retrieve the employee's first and last name and the first and last name of his or her immediate supervisor.

Q8: SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
    FROM   EMPLOYEE AS E, EMPLOYEE AS S
    WHERE  E.SUPERSSN=S.SSN;

In this case, we are allowed to declare alternative relation names E and S, called aliases or tuple variables, for the EMPLOYEE relation.
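A self-join such as Q8 can be tried out with the following sketch. It assumes Python's built-in sqlite3 module and a hypothetical three-row EMPLOYEE table restricted to the attributes Q8 needs; the ORDER BY clause is added only to make the output order predictable and is not part of Q8.

```python
import sqlite3

# Assumption: a cut-down EMPLOYEE table with three illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (FNAME TEXT, LNAME TEXT, SSN TEXT PRIMARY KEY, SUPERSSN TEXT)"
)
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [
        ("John", "Smith", "123456789", "333445555"),
        ("Franklin", "Wong", "333445555", "888665555"),
        ("James", "Borg", "888665555", None),  # no supervisor recorded
    ],
)

# E ranges over supervisees and S over supervisors of the same relation.
q8_rows = conn.execute(
    "SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME "
    "FROM EMPLOYEE AS E, EMPLOYEE AS S "
    "WHERE E.SUPERSSN = S.SSN "
    "ORDER BY E.LNAME"
).fetchall()

for row in q8_rows:
    print(row)
```

Without the aliases E and S, every attribute reference in this query would be ambiguous, since both operands of the join are the same relation.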
An alias can follow the keyword AS, as shown in Q8, or it can directly follow the relation name, for example by writing EMPLOYEE E, EMPLOYEE S in the FROM clause of Q8. It is also possible to rename the relation attributes within the query in SQL by giving them aliases. For example, if we write

    EMPLOYEE AS E(FN, MI, LN, SSN, BD, ADDR, SEX, SAL, SSSN, DNO)

in the FROM clause, FN becomes an alias for FNAME, MI for MINIT, LN for LNAME, and so on.

In Q8, we can think of E and S as two different copies of the EMPLOYEE relation; the first, E, represents employees in the role of supervisees; the second, S, represents employees in the role of supervisors. We can now join the two copies. Of course, in reality there is only one EMPLOYEE relation, and the join condition is meant to join the relation with itself by matching the tuples that satisfy the join condition E.SUPERSSN = S.SSN. Notice that this is an example of a one-level recursive query, as we discussed in Section 6.4.2. In earlier versions of SQL, as in relational algebra, it was not possible to specify a general recursive query, with an unknown number of levels, in a single SQL statement. A construct for specifying recursive queries has been incorporated into SQL-99, as described in Chapter 22.

The result of query Q8 is shown in Figure 8.3d. Whenever one or more aliases are given to a relation, we can use these names to represent different references to that relation. This permits multiple references to the same relation within a query. Notice that, if we want to, we can use this alias-naming mechanism in any SQL query to specify tuple variables for every table in the WHERE clause, whether or not the same relation needs to be referenced more than once. In fact, this practice is recommended since it results in queries that are easier to comprehend.
For example, we could specify query Q1A as in Q1B:

Q1B: SELECT E.FNAME, E.NAME, E.ADDRESS
     FROM   EMPLOYEE E, DEPARTMENT D
     WHERE  D.NAME='Research' AND D.DNUMBER=E.DNUMBER;

If we specify tuple variables for every table in the WHERE clause, a select-project-join query in SQL closely resembles the corresponding tuple relational calculus expression (except for duplicate elimination). For example, compare Q1B with the following tuple relational calculus expression:

Q1: {e.FNAME, e.LNAME, e.ADDRESS | EMPLOYEE(e) AND (∃d)(DEPARTMENT(d) AND d.DNAME='Research' AND d.DNUMBER=e.DNO)}

Notice that the main difference, other than syntax, is that in the SQL query, the existential quantifier is not specified explicitly.

8.4.3 Unspecified WHERE Clause and Use of the Asterisk

We discuss two more features of SQL here. A missing WHERE clause indicates no condition on tuple selection; hence, all tuples of the relation specified in the FROM clause qualify and are selected for the query result. If more than one relation is specified in the FROM clause and there is no WHERE clause, then the CROSS PRODUCT (all possible tuple combinations) of these relations is selected. For example, Query 9 selects all EMPLOYEE SSNs (Figure 8.3e), and Query 10 selects all combinations of an EMPLOYEE SSN and a DEPARTMENT DNAME (Figure 8.3f).

QUERIES 9 AND 10
Select all EMPLOYEE SSNs (Q9), and all combinations of EMPLOYEE SSN and DEPARTMENT DNAME (Q10) in the database.

Q9:  SELECT SSN
     FROM   EMPLOYEE;

Q10: SELECT SSN, DNAME
     FROM   EMPLOYEE, DEPARTMENT;

It is extremely important to specify every selection and join condition in the WHERE clause; if any such condition is overlooked, incorrect and very large relations may result. Notice that Q10 is similar to a CROSS PRODUCT operation followed by a PROJECT operation in relational algebra.
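To see how quickly a forgotten join condition inflates a result, consider the following sketch. It assumes Python's built-in sqlite3 module and hypothetical cut-down EMPLOYEE and DEPARTMENT tables with four and three rows, respectively.

```python
import sqlite3

# Assumption: tiny illustrative tables, not the full COMPANY state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (SSN TEXT, DNO INTEGER)")
conn.execute("CREATE TABLE DEPARTMENT (DNAME TEXT, DNUMBER INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [("123456789", 5), ("333445555", 5), ("999887777", 4), ("888665555", 1)],
)
conn.executemany(
    "INSERT INTO DEPARTMENT VALUES (?, ?)",
    [("Research", 5), ("Administration", 4), ("Headquarters", 1)],
)

# Q10: no WHERE clause, so every employee is paired with every department.
q10_rows = conn.execute("SELECT SSN, DNAME FROM EMPLOYEE, DEPARTMENT").fetchall()

# The same query with the join condition restored.
joined_rows = conn.execute(
    "SELECT SSN, DNAME FROM EMPLOYEE, DEPARTMENT WHERE DNO = DNUMBER"
).fetchall()

print(len(q10_rows), len(joined_rows))
```

With 4 employees and 3 departments the cross product has 4 × 3 = 12 tuples, whereas the joined version has one tuple per employee.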
If we specify all the attributes of EMPLOYEE and DEPARTMENT in Q10, we get the CROSS PRODUCT (except for duplicate elimination, if any).

To retrieve all the attribute values of the selected tuples, we do not have to list the attribute names explicitly in SQL; we just specify an asterisk (*), which stands for all the attributes. For example, query Q1C retrieves all the attribute values of any EMPLOYEE who works in DEPARTMENT number 5 (Figure 8.3g), query Q1D retrieves all the attributes of an EMPLOYEE and the attributes of the DEPARTMENT in which he or she works for every employee of the 'Research' department, and Q10A specifies the CROSS PRODUCT of the EMPLOYEE and DEPARTMENT relations.

Q1C:  SELECT *
      FROM   EMPLOYEE
      WHERE  DNO=5;

Q1D:  SELECT *
      FROM   EMPLOYEE, DEPARTMENT
      WHERE  DNAME='Research' AND DNO=DNUMBER;

Q10A: SELECT *
      FROM   EMPLOYEE, DEPARTMENT;

8.4.4 Tables as Sets in SQL

As we mentioned earlier, SQL usually treats a table not as a set but rather as a multiset; duplicate tuples can appear more than once in a table, and in the result of a query. SQL does not automatically eliminate duplicate tuples in the results of queries, for the following reasons:

• Duplicate elimination is an expensive operation. One way to implement it is to sort the tuples first and then eliminate duplicates.
• The user may want to see duplicate tuples in the result of a query.
• When an aggregate function (see Section 8.5.7) is applied to tuples, in most cases we do not want to eliminate duplicates.

An SQL table with a key is restricted to being a set, since the key value must be distinct in each tuple.[8] If we do want to eliminate duplicate tuples from the result of an SQL query, we use the keyword DISTINCT in the SELECT clause, meaning that only distinct tuples should remain in the result. In general, a query with SELECT DISTINCT eliminates duplicates, whereas a query with SELECT ALL does not.
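The difference between SELECT ALL and SELECT DISTINCT can be sketched as follows, assuming Python's built-in sqlite3 module and a hypothetical one-column EMPLOYEE table loaded with eight illustrative salary values, three of which are equal.

```python
import sqlite3

# Assumption: only the SALARY attribute is modeled; the values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (SALARY INTEGER)")
salaries = [30000, 40000, 25000, 43000, 38000, 25000, 25000, 55000]
conn.executemany("INSERT INTO EMPLOYEE VALUES (?)", [(s,) for s in salaries])

# SELECT ALL keeps one row per employee; DISTINCT keeps one row per value.
all_rows = [r[0] for r in conn.execute("SELECT ALL SALARY FROM EMPLOYEE")]
distinct_rows = [r[0] for r in conn.execute("SELECT DISTINCT SALARY FROM EMPLOYEE")]

print(len(all_rows), sorted(distinct_rows))
```

The result of SELECT DISTINCT is sorted in Python before printing because SQL itself promises no particular order unless an ORDER BY clause is given.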
Specifying SELECT with neither ALL nor DISTINCT, as in our previous examples, is equivalent to SELECT ALL. For example, Query 11 retrieves the salary of every employee; if several employees have the same salary, that salary value will appear as many times in the result of the query, as shown in Figure 8.4a. If we are interested only in distinct salary values, we want each value to appear only once, regardless of how many employees earn that salary. By using the keyword DISTINCT as in Q11A, we accomplish this, as shown in Figure 8.4b.

8. In general, an SQL table is not required to have a key, although in most cases there will be one.

QUERY 11
Retrieve the salary of every employee (Q11) and all distinct salary values (Q11A).

Q11:  SELECT ALL SALARY
      FROM   EMPLOYEE;

Q11A: SELECT DISTINCT SALARY
      FROM   EMPLOYEE;

SQL has directly incorporated some of the set operations of relational algebra. There are set union (UNION), set difference (EXCEPT), and set intersection (INTERSECT) operations. The relations resulting from these set operations are sets of tuples; that is, duplicate tuples are eliminated from the result. Because these set operations apply only to union-compatible relations, we must make sure that the two relations on which we apply the operation have the same attributes and that the attributes appear in the same order in both relations. The next example illustrates the use of UNION.

QUERY 4
Make a list of all project numbers for projects that involve an employee whose last name is 'Smith', either as a worker or as a manager of the department that controls the project.

Q4: (SELECT DISTINCT PNUMBER
     FROM   PROJECT, DEPARTMENT, EMPLOYEE
     WHERE  DNUM=DNUMBER AND MGRSSN=SSN AND LNAME='Smith')
    UNION
    (SELECT DISTINCT PNUMBER
     FROM   PROJECT, WORKS_ON, EMPLOYEE
     WHERE  PNUMBER=PNO AND ESSN=SSN AND LNAME='Smith');

FIGURE 8.4 Results of additional SQL queries when applied to the COMPANY database state shown in Figure 5.6. (a) Q11. (b) Q11A. (c) Q16. (d) Q18.

(a)
SALARY
30000
40000
25000
43000
38000
25000
25000
55000

(b)
SALARY
30000
40000
25000
43000
38000
55000

(c)
FNAME  LNAME
(no tuples)

(d)
FNAME  LNAME
James  Borg

The first SELECT query retrieves the projects that involve a 'Smith' as manager of the department that controls the project, and the second retrieves the projects that involve a 'Smith' as a worker on the project. Notice that if several employees have the last name 'Smith', the project numbers involving any of them will be retrieved. Applying the UNION operation to the two SELECT queries gives the desired result.

SQL also has corresponding multiset operations, which are followed by the keyword ALL (UNION ALL, EXCEPT ALL, INTERSECT ALL). Their results are multisets (duplicates are not eliminated). The behavior of these operations is illustrated by the examples in Figure 8.5. Basically, each tuple, whether it is a duplicate or not, is considered as a different tuple when applying these operations.

[FIGURE 8.5 The results of SQL multiset operations. (a) Two tables, R(A) and S(A). (b) R(A) UNION ALL S(A). (c) R(A) EXCEPT ALL S(A). (d) R(A) INTERSECT ALL S(A).]

8.4.5 Substring Pattern Matching and Arithmetic Operators

In this section we discuss several more features of SQL. The first feature allows comparison conditions on only parts of a character string, using the LIKE comparison operator. This can be used for string pattern matching. Partial strings are specified using two reserved characters: % replaces an arbitrary number of zero or more characters, and the underscore (_) replaces a single character. For example, consider the following query.

QUERY 12
Retrieve all employees whose address is in Houston, Texas.
Q12: SELECT FNAME, LNAME
     FROM   EMPLOYEE
     WHERE  ADDRESS LIKE '%Houston, TX%';

To retrieve all employees who were born during the 1950s, we can use Query 12A. Here, '5' must be the third character of the string (according to our format for date), so we use the value '__5_______', with each underscore serving as a placeholder for an arbitrary character.

QUERY 12A
Find all employees who were born during the 1950s.

Q12A: SELECT FNAME, LNAME
      FROM   EMPLOYEE
      WHERE  BDATE LIKE '__5_______';

If an underscore or % is needed as a literal character in the string, the character should be preceded by an escape character, which is specified after the string using the keyword ESCAPE. For example, 'AB\_CD\%EF' ESCAPE '\' represents the literal string 'AB_CD%EF', because \ is specified as the escape character. Any character not used in the string can be chosen as the escape character. Also, we need a rule to specify apostrophes or single quotation marks (') if they are to be included in a string, because they are used to begin and end strings. If an apostrophe (') is needed, it is represented as two consecutive apostrophes ('') so that it will not be interpreted as ending the string.

Another feature allows the use of arithmetic in queries. The standard arithmetic operators for addition (+), subtraction (-), multiplication (*), and division (/) can be applied to numeric values or attributes with numeric domains. For example, suppose that we want to see the effect of giving all employees who work on the 'ProductX' project a 10 percent raise; we can issue Query 13 to see what their salaries would become. This example also shows how we can rename an attribute in the query result using AS in the SELECT clause.

QUERY 13
Show the resulting salaries if every employee working on the 'ProductX' project is given a 10 percent raise.
Q13: SELECT FNAME, LNAME, 1.1*SALARY AS INCREASED_SAL
     FROM   EMPLOYEE, WORKS_ON, PROJECT
     WHERE  SSN=ESSN AND PNO=PNUMBER AND PNAME='ProductX';

For string data types, the concatenate operator || can be used in a query to append two string values. For date, time, timestamp, and interval data types, operators include incrementing (+) or decrementing (-) a date, time, or timestamp by an interval. In addition, an interval value is the result of the difference between two date, time, or timestamp values. Another comparison operator that can be used for convenience is BETWEEN, which is illustrated in Query 14.

QUERY 14
Retrieve all employees in department 5 whose salary is between $30,000 and $40,000.

Q14: SELECT *
     FROM   EMPLOYEE
     WHERE  (SALARY BETWEEN 30000 AND 40000) AND DNO=5;

The condition (SALARY BETWEEN 30000 AND 40000) in Q14 is equivalent to the condition ((SALARY >= 30000) AND (SALARY <= 40000)).

8.4.6 Ordering of Query Results

SQL allows the user to order the tuples in the result of a query by the values of one or more attributes, using the ORDER BY clause. This is illustrated by Query 15.

QUERY 15
Retrieve a list of employees and the projects they are working on, ordered by department and, within each department, ordered alphabetically by last name, first name.

Q15: SELECT   DNAME, LNAME, FNAME, PNAME
     FROM     DEPARTMENT, EMPLOYEE, WORKS_ON, PROJECT
     WHERE    DNUMBER=DNO AND SSN=ESSN AND PNO=PNUMBER
     ORDER BY DNAME, LNAME, FNAME;

The default order is in ascending order of values. We can specify the keyword DESC if we want to see the result in a descending order of values. The keyword ASC can be used to specify ascending order explicitly.
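The effect of the ASC and DESC keywords can be sketched as follows. The example assumes Python's built-in sqlite3 module and a hypothetical pre-joined table EMP_DEPT (not part of the COMPANY schema) so that the ordering can be shown without the joins of Q15.

```python
import sqlite3

# Assumption: EMP_DEPT is an invented, already-joined table used only here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP_DEPT (DNAME TEXT, LNAME TEXT, FNAME TEXT)")
conn.executemany(
    "INSERT INTO EMP_DEPT VALUES (?, ?, ?)",
    [
        ("Research", "Wong", "Franklin"),
        ("Administration", "Zelaya", "Alicia"),
        ("Research", "English", "Joyce"),
        ("Headquarters", "Borg", "James"),
        ("Research", "Smith", "John"),
    ],
)

# Default ordering: ascending on every ORDER BY attribute.
default_order = [
    r[1] for r in conn.execute(
        "SELECT DNAME, LNAME, FNAME FROM EMP_DEPT ORDER BY DNAME, LNAME, FNAME"
    )
]

# Mixed ordering: departments descending, names ascending within a department.
mixed_order = [
    r[1] for r in conn.execute(
        "SELECT DNAME, LNAME, FNAME FROM EMP_DEPT "
        "ORDER BY DNAME DESC, LNAME ASC, FNAME ASC"
    )
]

print(default_order)
print(mixed_order)
```

Only the last names are collected, which is enough to see that DESC on DNAME reverses the department groups while the names inside each group stay in ascending order.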
For example, if we want descending order on DNAME and ascending order on LNAME, FNAME, the ORDER BY clause of Q15 can be written as

    ORDER BY DNAME DESC, LNAME ASC, FNAME ASC

8.5 MORE COMPLEX SQL QUERIES

In the previous section, we described some basic types of queries in SQL. Because of the generality and expressive power of the language, there are many additional features that allow users to specify more complex queries. We discuss several of these features in this section.

8.5.1 Comparisons Involving NULL and Three-Valued Logic

SQL has various rules for dealing with NULL values. Recall from Section 5.1.2 that NULL is used to represent a missing value, but that it usually has one of three different interpretations: value unknown (exists but is not known), value not available (exists but is purposely withheld), or attribute not applicable (undefined for this tuple). Consider the following examples to illustrate each of the three meanings of NULL.

1. Unknown value: A particular person has a date of birth but it is not known, so it is represented by NULL in the database.
2. Unavailable or withheld value: A person has a home phone but does not want it to be listed, so it is withheld and represented as NULL in the database.
3. Not applicable attribute: An attribute LastCollegeDegree would be NULL for a person who has no college degrees, because it does not apply to that person.

It is often not possible to determine which of the three meanings is intended; for example, a NULL for the home phone of a person can have any of the three meanings. Hence, SQL does not distinguish between the different meanings of NULL.

In general, each NULL is considered to be different from every other NULL in the database. When a NULL is involved in a comparison operation, the result is considered to be UNKNOWN (it may be TRUE or it may be FALSE).
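The behavior of comparisons involving NULL can be observed with a short sketch. It assumes Python's built-in sqlite3 module; SQLite has no separate UNKNOWN value and reports it as NULL, which Python receives as None. The two-row EMPLOYEE table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: SQLite reports UNKNOWN as NULL, returned to Python as None.
null_eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
null_ne_five = conn.execute("SELECT NULL <> 5").fetchone()[0]

# Hypothetical table: Borg has no supervisor, so SUPERSSN is NULL.
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, SUPERSSN TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?)",
    [("Smith", "333445555"), ("Borg", None)],
)

# Neither comparison is TRUE for any tuple, so neither query selects anything.
eq_rows = conn.execute("SELECT LNAME FROM EMPLOYEE WHERE SUPERSSN = NULL").fetchall()
ne_rows = conn.execute("SELECT LNAME FROM EMPLOYEE WHERE SUPERSSN <> NULL").fetchall()

print(null_eq_null, null_ne_five, eq_rows, ne_rows)
```

Both WHERE conditions evaluate to UNKNOWN for every tuple, including Smith's, so even the row with a non-NULL SUPERSSN is not returned by the <> comparison.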
Hence, SQL uses a three-valued logic with values TRUE, FALSE, and UNKNOWN instead of the standard two-valued logic with values TRUE or FALSE. It is therefore necessary to define the results of three-valued logical expressions when the logical connectives AND, OR, and NOT are used. Table 8.1 shows the resulting values.

TABLE 8.1 LOGICAL CONNECTIVES IN THREE-VALUED LOGIC

AND      | TRUE     FALSE    UNKNOWN
---------+---------------------------
TRUE     | TRUE     FALSE    UNKNOWN
FALSE    | FALSE    FALSE    FALSE
UNKNOWN  | UNKNOWN  FALSE    UNKNOWN

OR       | TRUE     FALSE    UNKNOWN
---------+---------------------------
TRUE     | TRUE     TRUE     TRUE
FALSE    | TRUE     FALSE    UNKNOWN
UNKNOWN  | TRUE     UNKNOWN  UNKNOWN

NOT      |
---------+---------
TRUE     | FALSE
FALSE    | TRUE
UNKNOWN  | UNKNOWN

In select-project-join queries, the general rule is that only those combinations of tuples that evaluate the logical expression of the query to TRUE are selected. Tuple combinations that evaluate to FALSE or UNKNOWN are not selected. However, there are exceptions to that rule for certain operations, such as outer joins, as we shall see.

SQL allows queries that check whether an attribute value is NULL. Rather than using = or <> to compare an attribute value to NULL, SQL uses IS or IS NOT. This is because SQL considers each NULL value as being distinct from every other NULL value, so equality comparison is not appropriate. It follows that when a join condition is specified, tuples with NULL values for the join attributes are not included in the result (unless it is an OUTER JOIN; see Section 8.5.6). Query 18 illustrates this; its result is shown in Figure 8.4d.

QUERY 18
Retrieve the names of all employees who do not have supervisors.

Q18: SELECT FNAME, LNAME
     FROM   EMPLOYEE
     WHERE  SUPERSSN IS NULL;

8.5.2 Nested Queries, Tuples, and Set/Multiset Comparisons

Some queries require that existing values in the database be fetched and then used in a comparison condition. Such queries can be conveniently formulated by using nested queries, which are complete select-from-where blocks within the WHERE clause of another query. That other query is called the outer query. Query 4 is formulated in Q4 without a nested query, but it can be rephrased to use nested queries as shown in Q4A. Q4A introduces the comparison operator IN, which compares a value v with a set (or multiset) of values V and evaluates to TRUE if v is one of the elements in V.

Q4A: SELECT DISTINCT PNUMBER
     FROM   PROJECT
     WHERE  PNUMBER IN (SELECT PNUMBER
                        FROM   PROJECT, DEPARTMENT, EMPLOYEE
                        WHERE  DNUM=DNUMBER AND MGRSSN=SSN AND LNAME='Smith')
            OR
            PNUMBER IN (SELECT PNO
                        FROM   WORKS_ON, EMPLOYEE
                        WHERE  ESSN=SSN AND LNAME='Smith');

The first nested query selects the project numbers of projects that have a 'Smith' involved as manager, while the second selects the project numbers of projects that have a 'Smith' involved as worker. In the outer query, we use the OR logical connective to retrieve a PROJECT tuple if the PNUMBER value of that tuple is in the result of either nested query.

If a nested query returns a single attribute and a single tuple, the query result will be a single (scalar) value. In such cases, it is permissible to use = instead of IN for the comparison operator. In general, the nested query will return a table (relation), which is a set or multiset of tuples.

SQL allows the use of tuples of values in comparisons by placing them within parentheses. To illustrate this, consider the following query:

    SELECT DISTINCT ESSN
    FROM   WORKS_ON
    WHERE  (PNO, HOURS) IN (SELECT PNO, HOURS
                            FROM   WORKS_ON
                            WHERE  ESSN='123456789');

This query will select the social security numbers of all employees who work the same (project, hours) combination on some project that employee 'John Smith' (whose SSN = '123456789') works on. In this example, the IN operator compares the subtuple of values in parentheses (PNO, HOURS) for each tuple in WORKS_ON with the set of union-compatible tuples produced by the nested query.
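The tuple comparison with IN can be tried out with the sketch below. It assumes Python's built-in sqlite3 module, a SQLite version that supports row values (3.15 or later), and a hypothetical five-row WORKS_ON table whose hours are invented so that one other employee shares a (PNO, HOURS) pair with employee '123456789'.

```python
import sqlite3

# Assumption: illustrative WORKS_ON rows, not the Figure 5.6 state.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WORKS_ON (ESSN TEXT, PNO INTEGER, HOURS REAL)")
conn.executemany(
    "INSERT INTO WORKS_ON VALUES (?, ?, ?)",
    [
        ("123456789", 1, 32.5),
        ("123456789", 2, 7.5),
        ("453453453", 1, 20.0),  # same project as Smith, different hours
        ("453453453", 2, 7.5),   # same (PNO, HOURS) pair as Smith
        ("333445555", 3, 10.0),
    ],
)

# The pair (PNO, HOURS) of each outer tuple is matched against the pairs
# produced by the nested query. ORDER BY is added for a stable output order.
same_combo = [
    r[0] for r in conn.execute(
        "SELECT DISTINCT ESSN FROM WORKS_ON "
        "WHERE (PNO, HOURS) IN (SELECT PNO, HOURS FROM WORKS_ON "
        "                       WHERE ESSN = '123456789') "
        "ORDER BY ESSN"
    )
]

print(same_combo)
```

Employee '123456789' appears in the result as well, because that employee's own WORKS_ON tuples trivially match the pairs returned by the nested query.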
In addition to the IN operator, a number of other comparison operators can be used to compare a single value v (typically an attribute name) to a set or multiset V (typically a nested query). The = ANY (or = SOME) operator returns TRUE if the value v is equal to some value in the set V and is hence equivalent to IN. The keywords ANY and SOME have the same meaning. Other operators that can be combined with ANY (or SOME) include >, >=, <, <=, and <>. The keyword ALL can also be combined with each of these operators. For example, the comparison condition (v > ALL V) returns TRUE if the value v is greater than all the values in the set (or multiset) V. An example is the following query, which returns the names of employees whose salary is greater than the salary of all the employees in department 5:

    SELECT LNAME, FNAME
    FROM   EMPLOYEE
    WHERE  SALARY > ALL (SELECT SALARY
                         FROM   EMPLOYEE
                         WHERE  DNO=5);

In general, we can have several levels of nested queries. We can once again be faced with possible ambiguity among attribute names if attributes of the same name exist, one in a relation in the FROM clause of the outer query, and another in a relation in the FROM clause of the nested query. The rule is that a reference to an unqualified attribute refers to the relation declared in the innermost nested query. For example, in the SELECT clause and WHERE clause of the first nested query of Q4A, a reference to any unqualified attribute of the PROJECT relation refers to the PROJECT relation specified in the FROM clause of the nested query. To refer to an attribute of the PROJECT relation specified in the outer query, we can specify and refer to an alias (tuple variable) for that relation. These rules are similar to scope rules for program variables in most programming languages that allow nested procedures and functions.
To illustrate the potential ambiguity of attribute names in nested queries, consider Query 16, whose result is shown in Figure 8.4c.

QUERY 16
Retrieve the name of each employee who has a dependent with the same first name and same sex as the employee.

Q16: SELECT E.FNAME, E.LNAME
     FROM   EMPLOYEE AS E
     WHERE  E.SSN IN (SELECT ESSN
                      FROM   DEPENDENT
                      WHERE  E.FNAME=DEPENDENT_NAME AND E.SEX=SEX);

In the nested query of Q16, we must qualify E.SEX because it refers to the SEX attribute of EMPLOYEE from the outer query, and DEPENDENT also has an attribute called SEX. All unqualified references to SEX in the nested query refer to SEX of DEPENDENT. However, we do not have to qualify FNAME and SSN because the DEPENDENT relation does not have attributes called FNAME and SSN, so there is no ambiguity.

It is generally advisable to create tuple variables (aliases) for all the tables referenced in an SQL query to avoid potential errors and ambiguities.

8.5.3 Correlated Nested Queries

Whenever a condition in the WHERE clause of a nested query references some attribute of a relation declared in the outer query, the two queries are said to be correlated. We can understand a correlated query better by considering that the nested query is evaluated once for each tuple (or combination of tuples) in the outer query. For example, we can think of Q16 as follows: For each EMPLOYEE tuple, evaluate the nested query, which retrieves the ESSN values for all DEPENDENT tuples with the same sex and name as that EMPLOYEE tuple; if the SSN value of the EMPLOYEE tuple is in the result of the nested query, then select that EMPLOYEE tuple.

In general, a query written with nested select-from-where blocks and using the = or IN comparison operators can always be expressed as a single block query.
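The relationship between a correlated nested query and a single-block join can be checked with the following sketch. It assumes Python's built-in sqlite3 module and hypothetical EMPLOYEE and DEPENDENT rows; the dependent named 'John' is invented so that the result is not empty.

```python
import sqlite3

# Assumption: cut-down tables with illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (FNAME TEXT, LNAME TEXT, SSN TEXT, SEX TEXT)")
conn.execute("CREATE TABLE DEPENDENT (ESSN TEXT, DEPENDENT_NAME TEXT, SEX TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [
        ("John", "Smith", "123456789", "M"),
        ("Franklin", "Wong", "333445555", "M"),
        ("Jennifer", "Wallace", "987654321", "F"),
    ],
)
conn.executemany(
    "INSERT INTO DEPENDENT VALUES (?, ?, ?)",
    [
        ("123456789", "John", "M"),      # invented: same name and sex as Smith
        ("123456789", "Alice", "F"),
        ("333445555", "Franklin", "F"),  # same name as Wong, different sex
        ("987654321", "Abner", "M"),
    ],
)

# Correlated form: unqualified SEX inside the nested query is DEPENDENT.SEX.
nested_rows = conn.execute(
    "SELECT E.FNAME, E.LNAME FROM EMPLOYEE AS E "
    "WHERE E.SSN IN (SELECT ESSN FROM DEPENDENT "
    "                WHERE E.FNAME = DEPENDENT_NAME AND E.SEX = SEX)"
).fetchall()

# Single-block form: the same conditions expressed as a join.
joined_rows = conn.execute(
    "SELECT E.FNAME, E.LNAME FROM EMPLOYEE AS E, DEPENDENT AS D "
    "WHERE E.SSN = D.ESSN AND E.SEX = D.SEX AND E.FNAME = D.DEPENDENT_NAME"
).fetchall()

print(nested_rows, joined_rows)
```

The two forms agree on this data. One caveat worth keeping in mind: if an employee had two matching dependents, the join form would list that employee twice unless DISTINCT were added, whereas the IN form would list the employee once.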
For example, Q16 may be written as in Q16A:

Q16A: SELECT E.FNAME, E.LNAME
      FROM   EMPLOYEE AS E, DEPENDENT AS D
      WHERE  E.SSN=D.ESSN AND E.SEX=D.SEX AND
             E.FNAME=D.DEPENDENT_NAME;

The original SQL implementation on SYSTEM R also had a CONTAINS comparison operator, which was used to compare two sets or multisets. This operator was subsequently dropped from the language, possibly because of the difficulty of implementing it efficiently. Most commercial implementations of SQL do not have this operator. The CONTAINS operator compares two sets of values and returns TRUE if one set contains all values in the other set. Query 3 illustrates the use of the CONTAINS operator.

QUERY 3
Retrieve the name of each employee who works on all the projects controlled by department number 5.

Q3: SELECT FNAME, LNAME
    FROM   EMPLOYEE
    WHERE  ( (SELECT PNO
              FROM   WORKS_ON
              WHERE  SSN=ESSN)
             CONTAINS
             (SELECT PNUMBER
              FROM   PROJECT
              WHERE  DNUM=5) );

In Q3, the second nested query (which is not correlated with the outer query) retrieves the project numbers of all projects controlled by department 5. For each employee tuple, the first nested query (which is correlated) retrieves the project numbers on which the employee works; if these contain all projects controlled by department 5, the employee tuple is selected and the name of that employee is retrieved. Notice that the CONTAINS comparison operator has a similar function to the DIVISION operation of the relational algebra (see Section 6.3.4) and to universal quantification in relational calculus (see Section 6.6.6). Because the CONTAINS operation is not part of SQL, we have to use other techniques, such as the EXISTS function, to specify these types of queries, as described in Section 8.5.4.

8.5.4 The EXISTS and UNIQUE Functions in SQL

The EXISTS function in SQL is used to check whether the result of a correlated nested query is empty (contains no tuples) or not.
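As a first feel for this emptiness test, the following sketch runs one EXISTS query and one NOT EXISTS query. It assumes Python's built-in sqlite3 module and hypothetical EMPLOYEE and DEPENDENT rows; the ORDER BY clauses are added only to fix the output order.

```python
import sqlite3

# Assumption: illustrative rows; Borg and Zelaya have no dependents here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (FNAME TEXT, LNAME TEXT, SSN TEXT)")
conn.execute("CREATE TABLE DEPENDENT (ESSN TEXT, DEPENDENT_NAME TEXT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [
        ("John", "Smith", "123456789"),
        ("Franklin", "Wong", "333445555"),
        ("James", "Borg", "888665555"),
        ("Alicia", "Zelaya", "999887777"),
    ],
)
conn.executemany(
    "INSERT INTO DEPENDENT VALUES (?, ?)",
    [("123456789", "Alice"), ("123456789", "Michael"), ("333445555", "Joy")],
)

# The nested query is re-evaluated per EMPLOYEE tuple; SSN refers to the
# outer EMPLOYEE because DEPENDENT has no attribute of that name.
with_dependents = [
    r[0] for r in conn.execute(
        "SELECT LNAME FROM EMPLOYEE "
        "WHERE EXISTS (SELECT * FROM DEPENDENT WHERE SSN = ESSN) ORDER BY LNAME"
    )
]
without_dependents = [
    r[0] for r in conn.execute(
        "SELECT LNAME FROM EMPLOYEE "
        "WHERE NOT EXISTS (SELECT * FROM DEPENDENT WHERE SSN = ESSN) ORDER BY LNAME"
    )
]

print(with_dependents, without_dependents)
```

Every employee lands in exactly one of the two lists, since the nested query result for a given employee is either empty or not.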
We illustrate the use of EXISTS, and of NOT EXISTS, with some examples. First, we formulate Query 16 in an alternative form that uses EXISTS. This is shown as Q16B:

Q16B: SELECT E.FNAME, E.LNAME
      FROM   EMPLOYEE AS E
      WHERE  EXISTS (SELECT *
                     FROM   DEPENDENT
                     WHERE  E.SSN=ESSN AND E.SEX=SEX
                            AND E.FNAME=DEPENDENT_NAME);

EXISTS and NOT EXISTS are usually used in conjunction with a correlated nested query. In Q16B, the nested query references the SSN, FNAME, and SEX attributes of the EMPLOYEE relation from the outer query. We can think of Q16B as follows: For each EMPLOYEE tuple, evaluate the nested query, which retrieves all DEPENDENT tuples with the same social security number, sex, and name as the EMPLOYEE tuple; if at least one tuple EXISTS in the result of the nested query, then select that EMPLOYEE tuple. In general, EXISTS(Q) returns TRUE if there is at least one tuple in the result of the nested query Q, and it returns FALSE otherwise. On the other hand, NOT EXISTS(Q) returns TRUE if there are no tuples in the result of nested query Q, and it returns FALSE otherwise. Next, we illustrate the use of NOT EXISTS.

QUERY 6
Retrieve the names of employees who have no dependents.

Q6: SELECT FNAME, LNAME
    FROM   EMPLOYEE
    WHERE  NOT EXISTS (SELECT *
                       FROM   DEPENDENT
                       WHERE  SSN=ESSN);

In Q6, the correlated nested query retrieves all DEPENDENT tuples related to a particular EMPLOYEE tuple. If none exist, the EMPLOYEE tuple is selected. We can explain Q6 as follows: For each EMPLOYEE tuple, the correlated nested query selects all DEPENDENT tuples whose ESSN value matches the EMPLOYEE SSN; if the result is empty, no dependents are related to the employee, so we select that EMPLOYEE tuple and retrieve its FNAME and LNAME.

QUERY 7
List the names of managers who have at least one dependent.
Q7: (SELECT * FROM DEPENDENT WHERE SSN=ESSN) SELECT FNAME, LNAME FROM EMPLOYEE WHERE EXISTS 8.5 More Complex SQL Queries I 235 AND EXISTS (SELECT * FROM DEPARTMENT WHERE SSN=MGRSSN); One way to write this query is shown in Q7, where we specify two nested correlated queries; the first selects all DEPENDENT tuples related to an EMPLOYEE, and the second selects all DEPARTMENT tuples managed by the EMPLOYEE. If at least one of the first and at least one of the second exists, we select the EMPLOYEE tuple. Can you rewrite this query using only a single nested query or no nested queries? Query 3 ("Retrieve the name of each employee who works on all the projects controlled by department number 5," see Section 8.5.3) can be stated using EXISTS and NOTEXISTS in SQL systems. There are two options. The first is to use the well-known set theory transformation that (51 CONTAINS 52) is logically equivalent to (52 EXCEPT 51) is emptv,''This option is shown as Q3A. Q3A: SELECT FNAME, LNAME FROM EMPLOYEE NOT EXISTS WHERE (SELECT PNUMBER ( FROM PROJECT WHERE DNUM=5) EXCEPT (SELECT FROM WHERE PNO WORKS_ON SSN=ESSN) ); In Q3A, the first subquery (which is not correlated) selects all projects controlled by department 5, and the second subquery (which is correlated) selects all projects that the particular employee being considered works on. If the set difference of the first subquery MINUS (EXCEPT) the second subquery is empty, it means that the employee works on all theprojects and is hence selected. The second option is shown as Q3B. Notice that we need two-level nesting in Q3B and that this formulation is quite a bit more complex than Q3, which used the CONTAINS comparison operator, and Q3A, which uses NOT EXISTS and EXCEPT. However, CONTAINS is not part of SQL, and not all relational systems have the EXCEPT operator even though it is part of sQL-99. Q3B: SELECT LNAME, FNAME FROM EMPLOYEE 9.Recall that EXCEPT is the set difference operator. 
     WHERE  NOT EXISTS
            (SELECT *
             FROM   WORKS_ON B
             WHERE  (B.PNO IN (SELECT PNUMBER
                               FROM   PROJECT
                               WHERE  DNUM=5))
                    AND
                    NOT EXISTS (SELECT *
                                FROM   WORKS_ON C
                                WHERE  C.ESSN=SSN
                                       AND C.PNO=B.PNO) );

In Q3B, the outer nested query selects any WORKS_ON (B) tuples whose PNO is of a project controlled by department 5, if there is not a WORKS_ON (C) tuple with the same PNO and the same SSN as that of the EMPLOYEE tuple under consideration in the outer query. If no such tuple exists, we select the EMPLOYEE tuple. The form of Q3B matches the following rephrasing of Query 3: Select each employee such that there does not exist a project controlled by department 5 that the employee does not work on. It corresponds to the way we wrote this query in tuple relational calculus in Section 6.6.6.

There is another SQL function, UNIQUE(Q), which returns TRUE if there are no duplicate tuples in the result of query Q; otherwise, it returns FALSE. This can be used to test whether the result of a nested query is a set or a multiset.

8.5.5 Explicit Sets and Renaming of Attributes in SQL

We have seen several queries with a nested query in the WHERE clause. It is also possible to use an explicit set of values in the WHERE clause, rather than a nested query. Such a set is enclosed in parentheses in SQL.

QUERY 17
Retrieve the social security numbers of all employees who work on project numbers 1, 2, or 3.

Q17: SELECT DISTINCT ESSN
     FROM   WORKS_ON
     WHERE  PNO IN (1, 2, 3);

In SQL, it is possible to rename any attribute that appears in the result of a query by adding the qualifier AS followed by the desired new name. Hence, the AS construct can be used to alias both attribute and relation names, and it can be used in both the SELECT and FROM clauses.
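The EXISTS and NOT EXISTS formulations discussed above (Q6, Q3A, and Q3B) can be tried out with a minimal sketch. It assumes Python's standard sqlite3 module as a stand-in DBMS and a tiny, made-up subset of the COMPANY schema; the table contents, the helper name `names`, and the aliases are illustrative assumptions, not the book's data. SQLite's grammar does not accept the parentheses around the operands of EXCEPT used in Q3A, so they are dropped here.

```python
import sqlite3

# Hypothetical miniature of the COMPANY database (values are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE  (SSN TEXT PRIMARY KEY, LNAME TEXT);
CREATE TABLE DEPENDENT (ESSN TEXT, DEPENDENT_NAME TEXT);
CREATE TABLE PROJECT   (PNUMBER INTEGER PRIMARY KEY, DNUM INTEGER);
CREATE TABLE WORKS_ON  (ESSN TEXT, PNO INTEGER);

INSERT INTO EMPLOYEE  VALUES ('1', 'Smith'), ('2', 'Wong'), ('3', 'Zelaya');
INSERT INTO DEPENDENT VALUES ('1', 'Alice'), ('1', 'Michael');
INSERT INTO PROJECT   VALUES (10, 5), (20, 5), (30, 4);
INSERT INTO WORKS_ON  VALUES ('1', 10), ('1', 20), ('2', 10), ('3', 30);
""")

def names(sql):
    """Run a one-column query and return its values as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

# Q6 style: employees for whom the correlated nested query is empty.
no_dependents = names("""
    SELECT E.LNAME FROM EMPLOYEE AS E
    WHERE NOT EXISTS (SELECT * FROM DEPENDENT AS D WHERE D.ESSN = E.SSN)""")

# Q3A style: (department-5 projects) EXCEPT (the employee's projects) is empty.
all_dept5_q3a = names("""
    SELECT E.LNAME FROM EMPLOYEE AS E
    WHERE NOT EXISTS (SELECT PNUMBER FROM PROJECT WHERE DNUM = 5
                      EXCEPT
                      SELECT PNO FROM WORKS_ON WHERE ESSN = E.SSN)""")

# Q3B style: no department-5 project exists that the employee does not work on.
all_dept5_q3b = names("""
    SELECT E.LNAME FROM EMPLOYEE AS E
    WHERE NOT EXISTS
          (SELECT * FROM WORKS_ON AS B
           WHERE B.PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 5)
             AND NOT EXISTS (SELECT * FROM WORKS_ON AS C
                             WHERE C.ESSN = E.SSN AND C.PNO = B.PNO))""")

print(no_dependents, all_dept5_q3a, all_dept5_q3b)
```

With this sample data, both formulations of Query 3 select the same employee, which is the point of the set-theoretic rewriting: only the employee who works on every department-5 project survives either test.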
For example, Q8A shows how query Q8 can be slightly changed to retrieve the last name of each employee and his or her supervisor, while renaming the resulting attribute names as EMPLOYEE_NAME and SUPERVISOR_NAME. The new names will appear as column headers in the query result.

Q8A: SELECT E.LNAME AS EMPLOYEE_NAME, S.LNAME AS SUPERVISOR_NAME
     FROM   EMPLOYEE AS E, EMPLOYEE AS S
     WHERE  E.SUPERSSN=S.SSN;

8.5.6 Joined Tables in SQL

The concept of a joined table (or joined relation) was incorporated into SQL to permit users to specify a table resulting from a join operation in the FROM clause of a query. This construct may be easier to comprehend than mixing together all the select and join conditions in the WHERE clause. For example, consider query Q1, which retrieves the name and address of every employee who works for the 'Research' department. It may be easier first to specify the join of the EMPLOYEE and DEPARTMENT relations, and then to select the desired tuples and attributes. This can be written in SQL as in Q1A:

Q1A: SELECT FNAME, LNAME, ADDRESS
     FROM   (EMPLOYEE JOIN DEPARTMENT ON DNO=DNUMBER)
     WHERE  DNAME='Research';

The FROM clause in Q1A contains a single joined table. The attributes of such a table are all the attributes of the first table, EMPLOYEE, followed by all the attributes of the second table, DEPARTMENT. The concept of a joined table also allows the user to specify different types of join, such as NATURAL JOIN and various types of OUTER JOIN. In a NATURAL JOIN on two relations R and S, no join condition is specified; an implicit equijoin condition for each pair of attributes with the same name from R and S is created. Each such pair of attributes is included only once in the resulting relation (see Section 6.4.3).

If the names of the join attributes are not the same in the base relations, it is possible to rename the attributes so that they match, and then to apply NATURAL JOIN.
In this case, the AS construct can be used to rename a relation and all its attributes in the FROM clause. This is illustrated in Q1B, where the DEPARTMENT relation is renamed as DEPT and its attributes are renamed as DNAME, DNO (to match the name of the desired join attribute DNO in EMPLOYEE), MSSN, and MSDATE. The implied join condition for this NATURAL JOIN is EMPLOYEE.DNO = DEPT.DNO, because this is the only pair of attributes with the same name after renaming.

Q1B: SELECT FNAME, LNAME, ADDRESS
     FROM   (EMPLOYEE NATURAL JOIN
             (DEPARTMENT AS DEPT (DNAME, DNO, MSSN, MSDATE)))
     WHERE  DNAME='Research';

The default type of join in a joined table is an inner join, where a tuple is included in the result only if a matching tuple exists in the other relation. For example, in query Q8A, only employees that have a supervisor are included in the result; an EMPLOYEE tuple whose value for SUPERSSN is NULL is excluded. If the user requires that all employees be included, an OUTER JOIN must be used explicitly (see Section 6.4.3 for the definition of OUTER JOIN). In SQL, this is handled by explicitly specifying the OUTER JOIN in a joined table, as illustrated in Q8B:

Q8B: SELECT E.LNAME AS EMPLOYEE_NAME, S.LNAME AS SUPERVISOR_NAME
     FROM   (EMPLOYEE AS E LEFT OUTER JOIN EMPLOYEE AS S
             ON E.SUPERSSN=S.SSN);

The options available for specifying joined tables in SQL include INNER JOIN (same as JOIN), LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN. In the latter three options, the keyword OUTER may be omitted. If the join attributes have the same name, one may also specify the natural join variation of outer joins by using the keyword NATURAL before the operation (for example, NATURAL LEFT OUTER JOIN).

The keyword CROSS JOIN is used to specify the Cartesian product operation (see Section 6.2.2), although this should be used only with the utmost care because it generates all possible tuple combinations.
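The difference between the join forms above can be made concrete with a small sketch, again assuming Python's standard sqlite3 module and a hypothetical three-employee subset of COMPANY (all data values are made up). SQLite does not support the column-renaming form DEPARTMENT AS DEPT (DNAME, DNO, ...) used in Q1B, so the sketch substitutes a subquery that renames DNUMBER to DNO; that substitution is an assumption of this example, not the book's syntax.

```python
import sqlite3

# Hypothetical subset of COMPANY: Borg has no supervisor (SUPERSSN is NULL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (DNAME TEXT, DNUMBER INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE   (LNAME TEXT, SSN TEXT PRIMARY KEY,
                         SUPERSSN TEXT, DNO INTEGER);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Headquarters', 1);
INSERT INTO EMPLOYEE VALUES ('Borg',  '888', NULL,  1),
                            ('Wong',  '333', '888', 5),
                            ('Smith', '123', '333', 5);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Q1 style: join condition written in the WHERE clause.
q1 = run("""SELECT LNAME FROM EMPLOYEE, DEPARTMENT
            WHERE DNAME = 'Research' AND DNUMBER = DNO
            ORDER BY LNAME""")

# Q1A style: the same join written as a joined table in the FROM clause.
q1a = run("""SELECT LNAME FROM EMPLOYEE JOIN DEPARTMENT ON DNO = DNUMBER
             WHERE DNAME = 'Research'
             ORDER BY LNAME""")

# Q1B style: NATURAL JOIN after renaming DNUMBER so the names match.
q1b = run("""SELECT LNAME
             FROM EMPLOYEE NATURAL JOIN
                  (SELECT DNAME, DNUMBER AS DNO FROM DEPARTMENT) AS DEPT
             WHERE DNAME = 'Research'
             ORDER BY LNAME""")

# Q8A style (inner join): employees without a supervisor drop out.
q8a = run("""SELECT E.LNAME, S.LNAME
             FROM EMPLOYEE AS E, EMPLOYEE AS S
             WHERE E.SUPERSSN = S.SSN
             ORDER BY E.LNAME""")

# Q8B style (left outer join): every employee appears; a missing
# supervisor shows up as NULL (None in Python).
q8b = run("""SELECT E.LNAME, S.LNAME
             FROM EMPLOYEE AS E LEFT OUTER JOIN EMPLOYEE AS S
                  ON E.SUPERSSN = S.SSN
             ORDER BY E.LNAME""")

print(q1, q1a, q1b)
print(q8a)
print(q8b)
```

The three Q1 variants return the same rows, while Q8B returns one more row than Q8A: the employee whose SUPERSSN is NULL.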
It is also possible to nest join specifications; that is, one of the tables in a join may itself be a joined table. This is illustrated by Q2A, which is a different way of specifying query Q2, using the concept of a joined table:

Q2A: SELECT PNUMBER, DNUM, LNAME, ADDRESS, BDATE
     FROM   ((PROJECT JOIN DEPARTMENT ON DNUM=DNUMBER)
             JOIN EMPLOYEE ON MGRSSN=SSN)
     WHERE  PLOCATION='Stafford';

8.5.7 Aggregate Functions in SQL

In Section 6.4.1, we introduced the concept of an aggregate function as a relational operation. Because grouping and aggregation are required in many database applications, SQL has features that incorporate these concepts. A number of built-in functions exist: COUNT, SUM, MAX, MIN, and AVG (see footnote 10). The COUNT function returns the number of tuples or values as specified in a query. The functions SUM, MAX, MIN, and AVG are applied to a set or multiset of numeric values and return, respectively, the sum, maximum value, minimum value, and average (mean) of those values. These functions can be used in the SELECT clause or in a HAVING clause (which we introduce later). The functions MAX and MIN can also be used with attributes that have nonnumeric domains if the domain values have a total ordering among one another (see footnote 11). We illustrate the use of these functions with example queries.

10. Additional aggregate functions for more advanced statistical calculation have been added in SQL-99.
11. Total order means that for any two values in the domain, it can be determined that one appears before the other in the defined order; for example, DATE, TIME, and TIMESTAMP domains have total orderings on their values, as do alphabetic strings.

QUERY 19
Find the sum of the salaries of all employees, the maximum salary, the minimum salary, and the average salary.
Q19: SELECT SUM (SALARY), MAX (SALARY), MIN (SALARY), AVG (SALARY)
     FROM   EMPLOYEE;

If we want to get the preceding function values for employees of a specific department (say, the 'Research' department), we can write Query 20, where the EMPLOYEE tuples are restricted by the WHERE clause to those employees who work for the 'Research' department.

QUERY 20
Find the sum of the salaries of all employees of the 'Research' department, as well as the maximum salary, the minimum salary, and the average salary in this department.

Q20: SELECT SUM (SALARY), MAX (SALARY), MIN (SALARY), AVG (SALARY)
     FROM   (EMPLOYEE JOIN DEPARTMENT ON DNO=DNUMBER)
     WHERE  DNAME='Research';

QUERIES 21 AND 22
Retrieve the total number of employees in the company (Q21) and the number of employees in the 'Research' department (Q22).

Q21: SELECT COUNT (*)
     FROM   EMPLOYEE;

Q22: SELECT COUNT (*)
     FROM   EMPLOYEE, DEPARTMENT
     WHERE  DNO=DNUMBER AND DNAME='Research';

Here the asterisk (*) refers to the rows (tuples), so COUNT (*) returns the number of rows in the result of the query. We may also use the COUNT function to count values in a column rather than tuples, as in the next example.

QUERY 23
Count the number of distinct salary values in the database.

Q23: SELECT COUNT (DISTINCT SALARY)
     FROM   EMPLOYEE;

If we write COUNT(SALARY) instead of COUNT(DISTINCT SALARY) in Q23, then duplicate values will not be eliminated. However, any tuples with NULL for SALARY will not be counted. In general, NULL values are discarded when aggregate functions are applied to a particular column (attribute).

The preceding examples summarize a whole relation (Q19, Q21, Q23) or a selected subset of tuples (Q20, Q22), and hence all produce single tuples or single values. They illustrate how functions are applied to retrieve a summary value or summary tuple from the database. These functions can also be used in selection conditions involving nested queries.
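The treatment of duplicates and NULLs by the aggregate functions can be checked with a short sketch. It assumes Python's standard sqlite3 module and a hypothetical four-row EMPLOYEE table in which one salary is NULL and one value is repeated; the data are made up for illustration.

```python
import sqlite3

# Hypothetical EMPLOYEE rows: 30000 appears twice, one salary is NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (LNAME TEXT, SALARY INTEGER);
INSERT INTO EMPLOYEE VALUES ('Smith', 30000), ('Wong', 40000),
                            ('Zelaya', 30000), ('Hatcher', NULL);
""")

(count_rows, count_salary, count_distinct,
 total, highest, lowest, average) = conn.execute("""
    SELECT COUNT(*),                 -- counts rows, NULL salary included
           COUNT(SALARY),            -- skips the NULL, keeps duplicates
           COUNT(DISTINCT SALARY),   -- skips the NULL, removes duplicates
           SUM(SALARY), MAX(SALARY), MIN(SALARY),
           AVG(SALARY)               -- averages the non-NULL values only
    FROM EMPLOYEE""").fetchone()

print(count_rows, count_salary, count_distinct)
print(total, highest, lowest, average)
```

Note that AVG divides by the number of non-NULL salaries (three here), not by the number of rows, which follows from NULL values being discarded before the function is applied.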
We can specify a correlated nested query with an aggregate function, and then use the nested query in the WHERE clause of an outer query. For example, to retrieve the names of all employees who have two or more dependents (Query 5), we can write the following:

Q5: SELECT LNAME, FNAME
    FROM   EMPLOYEE
    WHERE  (SELECT COUNT (*)
            FROM   DEPENDENT
            WHERE  SSN=ESSN) >= 2;

The correlated nested query counts the number of dependents that each employee has; if this is greater than or equal to two, the employee tuple is selected.

8.5.8 Grouping: The GROUP BY and HAVING Clauses

In many cases we want to apply the aggregate functions to subgroups of tuples in a relation, where the subgroups are based on some attribute values. For example, we may want to find the average salary of employees in each department or the number of employees who work on each project. In these cases we need to partition the relation into nonoverlapping subsets (or groups) of tuples. Each group (partition) will consist of the tuples that have the same value of some attribute(s), called the grouping attribute(s). We can then apply the function to each such group independently. SQL has a GROUP BY clause for this purpose. The GROUP BY clause specifies the grouping attributes, which should also appear in the SELECT clause, so that the value resulting from applying each aggregate function to a group of tuples appears along with the value of the grouping attribute(s).

QUERY 24
For each department, retrieve the department number, the number of employees in the department, and their average salary.

Q24: SELECT   DNO, COUNT (*), AVG (SALARY)
     FROM     EMPLOYEE
     GROUP BY DNO;

In Q24, the EMPLOYEE tuples are partitioned into groups, each group having the same value for the grouping attribute DNO. The COUNT and AVG functions are applied to each such group of tuples. Notice that the SELECT clause includes only the grouping attribute and the functions to be applied on each group of tuples.
Figure 8.6a illustrates how grouping works on Q24; it also shows the result of Q24.

If NULLs exist in the grouping attribute, then a separate group is created for all tuples with a NULL value in the grouping attribute. For example, if the EMPLOYEE table had some tuples that had NULL for the grouping attribute DNO, there would be a separate group for those tuples in the result of Q24.

QUERY 25
For each project, retrieve the project number, the project name, and the number of employees who work on that project.

Q25: SELECT   PNUMBER, PNAME, COUNT (*)
     FROM     PROJECT, WORKS_ON
     WHERE    PNUMBER=PNO
     GROUP BY PNUMBER, PNAME;

Q25 shows how we can use a join condition in conjunction with GROUP BY. In this case, the grouping and functions are applied after the joining of the two relations. Sometimes we want to retrieve the values of these functions only for groups that satisfy certain conditions. For example, suppose that we want to modify Query 25 so that only projects with more than two employees appear in the result. SQL provides a HAVING clause, which can appear in conjunction with a GROUP BY clause, for this purpose. HAVING provides a condition on the group of tuples associated with each value of the grouping attributes. Only the groups that satisfy the condition are retrieved in the result of the query. This is illustrated by Query 26.

QUERY 26
For each project on which more than two employees work, retrieve the project number, the project name, and the number of employees who work on the project.

Q26: SELECT   PNUMBER, PNAME, COUNT (*)
     FROM     PROJECT, WORKS_ON
     WHERE    PNUMBER=PNO
     GROUP BY PNUMBER, PNAME
     HAVING   COUNT (*) > 2;

Notice that, while selection conditions in the WHERE clause limit the tuples to which functions are applied, the HAVING clause serves to choose whole groups. Figure 8.6b illustrates the use of HAVING and displays the result of Q26.
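Grouping, the separate group for NULL grouping values, and the effect of HAVING can be sketched as follows. The example assumes Python's standard sqlite3 module and a hypothetical single-table variation on Q24 and Q26 (it groups EMPLOYEE by DNO rather than joining PROJECT with WORKS_ON); the seven rows are made up.

```python
import sqlite3

# Hypothetical EMPLOYEE rows; Hatcher has no department (DNO is NULL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (LNAME TEXT, SALARY INTEGER, DNO INTEGER);
INSERT INTO EMPLOYEE VALUES
    ('Smith', 30000, 5), ('Wong', 40000, 5), ('Narayan', 38000, 5),
    ('Zelaya', 25000, 4), ('Wallace', 43000, 4),
    ('Borg', 55000, 1), ('Hatcher', 20000, NULL);
""")

# Q24 style: one result row per distinct DNO value, NULL included.
# The dictionary maps each grouping value to (COUNT(*), AVG(SALARY)).
groups = {dno: (n, avg) for dno, n, avg in conn.execute("""
    SELECT DNO, COUNT(*), AVG(SALARY)
    FROM EMPLOYEE
    GROUP BY DNO""")}

# Q26 style: HAVING keeps only whole groups that satisfy the condition.
large_groups = conn.execute("""
    SELECT DNO, COUNT(*)
    FROM EMPLOYEE
    GROUP BY DNO
    HAVING COUNT(*) > 2""").fetchall()

print(groups)
print(large_groups)
```

The key `None` in `groups` is the separate group formed by the tuple whose DNO is NULL; only department 5 has more than two employees, so it is the only group that passes the HAVING condition.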
FIGURE 8.6 Results of GROUP BY and HAVING. (a) Q24. (b) Q26.

(a) Grouping EMPLOYEE tuples by the value of DNO. Result of Q24:

DNO   COUNT (*)   AVG (SALARY)
5     4           33250
4     3           31000
1     1           55000

(b) After applying the WHERE clause but before applying HAVING, the joined PROJECT and WORKS_ON tuples form one group per project. The groups for ProductX (PNUMBER 1) and ProductZ (PNUMBER 3) contain two tuples each; these groups are not selected by the HAVING condition of Q26. Result of Q26 after applying the HAVING clause condition (PNUMBER not shown):

PNAME             COUNT (*)
ProductY          3
Computerization   3
Reorganization    3
Newbenefits       3
QUERY 27
For each project, retrieve the project number, the project name, and the number of employees from department 5 who work on the project.

Q27: SELECT   PNUMBER, PNAME, COUNT (*)
     FROM     PROJECT, WORKS_ON, EMPLOYEE
     WHERE    PNUMBER=PNO AND SSN=ESSN AND DNO=5
     GROUP BY PNUMBER, PNAME;

Here we restrict the tuples in the relation (and hence the tuples in each group) to those that satisfy the condition specified in the WHERE clause, namely, that they work in department number 5. Notice that we must be extra careful when two different conditions apply (one to the function in the SELECT clause and another to the function in the HAVING clause). For example, suppose that we want to count the total number of employees whose salaries exceed $40,000 in each department, but only for departments where more than five employees work. Here, the condition (SALARY > 40000) applies only to the COUNT function in the SELECT clause. Suppose that we write the following incorrect query:

     SELECT   DNAME, COUNT (*)
     FROM     DEPARTMENT, EMPLOYEE
     WHERE    DNUMBER=DNO AND SALARY>40000
     GROUP BY DNAME
     HAVING   COUNT (*) > 5;

This is incorrect because it will select only departments that have more than five employees who each earn more than $40,000. The rule is that the WHERE clause is executed first, to select individual tuples; the HAVING clause is applied later, to select individual groups of tuples. Hence, the tuples are already restricted to employees who earn more than $40,000, before the function in the HAVING clause is applied. One way to write this query correctly is to use a nested query, as shown in Query 28.

QUERY 28
For each department that has more than five employees, retrieve the department number and the number of its employees who are making more than $40,000.
Q28: SELECT   DNUMBER, COUNT (*)
     FROM     DEPARTMENT, EMPLOYEE
     WHERE    DNUMBER=DNO AND SALARY>40000 AND
              DNO IN (SELECT   DNO
                      FROM     EMPLOYEE
                      GROUP BY DNO
                      HAVING   COUNT (*) > 5)
     GROUP BY DNUMBER;

8.5.9 Discussion and Summary of SQL Queries

A query in SQL can consist of up to six clauses, but only the first two, SELECT and FROM, are mandatory. The clauses are specified in the following order, with the clauses between square brackets [ ... ] being optional:

     SELECT <attribute and function list>
     FROM <table list>
     [WHERE <condition>]
     [GROUP BY <grouping attribute(s)>]
     [HAVING <group condition>]
     [ORDER BY <attribute list>];

The SELECT clause lists the attributes or functions to be retrieved. The FROM clause specifies all relations (tables) needed in the query, including joined relations, but not those in nested queries. The WHERE clause specifies the conditions for selection of tuples from these relations, including join conditions if needed. GROUP BY specifies grouping attributes, whereas HAVING specifies a condition on the groups being selected rather than on the individual tuples. The built-in aggregate functions COUNT, SUM, MIN, MAX, and AVG are used in conjunction with grouping, but they can also be applied to all the selected tuples in a query without a GROUP BY clause. Finally, ORDER BY specifies an order for displaying the result of a query.

A query is evaluated conceptually (see footnote 12) by first applying the FROM clause (to identify all tables involved in the query or to materialize any joined tables), followed by the WHERE clause, and then by GROUP BY and HAVING. Conceptually, ORDER BY is applied at the end to sort the query result. If none of the last three clauses (GROUP BY, HAVING, and ORDER BY) are specified, we can think conceptually of a query as being executed as follows: For each combination of tuples (one from each of the relations specified in the FROM clause), evaluate the WHERE clause; if it evaluates to TRUE, place the values of the attributes specified in the SELECT clause from this tuple combination in the result of the query. Of course, this is not an efficient way to implement the query in a real system, and each DBMS has special query optimization routines to decide on an execution plan that is efficient. We discuss query processing and optimization in Chapters 15 and 16.

In general, there are numerous ways to specify the same query in SQL. This flexibility in specifying queries has advantages and disadvantages. The main advantage is that users can choose the technique with which they are most comfortable when specifying a query.
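The conceptual evaluation order described above (FROM, then WHERE, then GROUP BY and HAVING) is what separates the incorrect query from Q28. A minimal sketch, assuming Python's standard sqlite3 module and hypothetical data in which department 5 has six employees, two of them earning more than 40000:

```python
import sqlite3

# Hypothetical data: department 5 has six employees (two above 40000);
# department 4 has two employees (both above 40000).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (DNAME TEXT, DNUMBER INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE   (SSN TEXT PRIMARY KEY, SALARY INTEGER, DNO INTEGER);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Administration', 4);
INSERT INTO EMPLOYEE VALUES
    ('1', 30000, 5), ('2', 45000, 5), ('3', 38000, 5),
    ('4', 25000, 5), ('5', 50000, 5), ('6', 28000, 5),
    ('7', 43000, 4), ('8', 41000, 4);
""")

# Incorrect formulation: WHERE removes the low earners first, so HAVING
# only ever sees groups of high earners and no group has more than five.
incorrect = conn.execute("""
    SELECT DNAME, COUNT(*)
    FROM DEPARTMENT, EMPLOYEE
    WHERE DNUMBER = DNO AND SALARY > 40000
    GROUP BY DNAME
    HAVING COUNT(*) > 5""").fetchall()

# Q28 style: the nested query sizes each department using all of its
# employees; the outer query then counts only the high earners.
q28 = conn.execute("""
    SELECT DNUMBER, COUNT(*)
    FROM DEPARTMENT, EMPLOYEE
    WHERE DNUMBER = DNO AND SALARY > 40000
      AND DNO IN (SELECT DNO
                  FROM EMPLOYEE
                  GROUP BY DNO
                  HAVING COUNT(*) > 5)
    GROUP BY DNUMBER""").fetchall()

print(incorrect)
print(q28)
```

With this data the incorrect query returns nothing, even though department 5 qualifies, while the Q28 formulation reports department 5 with its two high earners.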
For example, many queries may be specified with join conditions in the WHERE clause, or by using joined relations in the FROM clause, or with some form of nested queries and the IN comparison operator. Some users may be more comfortable with one approach, whereas others may be more comfortable with another. From the programmer's and the system's point of view regarding query optimization, it is generally preferable to write a query with as little nesting and implied ordering as possible.

12. The actual order of query evaluation is implementation dependent; this is just a way to conceptually view a query in order to correctly formulate it.

The disadvantage of having numerous ways of specifying the same query is that this may confuse the user, who may not know which technique to use to specify particular types of queries. Another problem is that it may be more efficient to execute a query specified in one way than the same query specified in an alternative way. Ideally, this should not be the case: The DBMS should process the same query in the same way regardless of how the query is specified. But this is quite difficult in practice, since each DBMS has different methods for processing queries specified in different ways. Thus, an additional burden on the user is to determine which of the alternative specifications is the most efficient. Ideally, the user should worry only about specifying the query correctly. It is the responsibility of the DBMS to execute the query efficiently. In practice, however, it helps if the user is aware of which types of constructs in a query are more expensive to process than others (see Chapter 16).

8.6 INSERT, DELETE, AND UPDATE STATEMENTS IN SQL

In SQL, three commands can be used to modify the database: INSERT, DELETE, and UPDATE. We discuss each of these in turn.
8.6.1 The INSERT Command

In its simplest form, INSERT is used to add a single tuple to a relation. We must specify the relation name and a list of values for the tuple. The values should be listed in the same order in which the corresponding attributes were specified in the CREATE TABLE command. For example, to add a new tuple to the EMPLOYEE relation shown in Figure 5.5 and specified in the CREATE TABLE EMPLOYEE ... command in Figure 8.1, we can use U1:

U1: INSERT INTO EMPLOYEE
    VALUES      ('Richard', 'K', 'Marini', '653298653', '1962-12-30',
                 '98 Oak Forest,Katy,TX', 'M', 37000, '987654321', 4);

A second form of the INSERT statement allows the user to specify explicit attribute names that correspond to the values provided in the INSERT command. This is useful if a relation has many attributes but only a few of those attributes are assigned values in the new tuple. However, the values must include all attributes with NOT NULL specification and no default value. Attributes with NULL allowed or DEFAULT values are the ones that can be left out. For example, to enter a tuple for a new EMPLOYEE for whom we know only the FNAME, LNAME, DNO, and SSN attributes, we can use U1A:

U1A: INSERT INTO EMPLOYEE (FNAME, LNAME, DNO, SSN)
     VALUES      ('Richard', 'Marini', 4, '653298653');

Attributes not specified in U1A are set to their DEFAULT or to NULL, and the values are listed in the same order as the attributes are listed in the INSERT command itself. It is also possible to insert into a relation multiple tuples separated by commas in a single INSERT command. The attribute values forming each tuple are enclosed in parentheses.

A DBMS that fully implements SQL-99 should support and enforce all the integrity constraints that can be specified in the DDL.
However, some DBMSs do not incorporate all the constraints, in order to maintain the efficiency of the DBMS and because of the complexity of enforcing all constraints. If a system does not support some constraint (say, referential integrity), the users or programmers must enforce the constraint. For example, if we issue the command in U2 on the database shown in Figure 5.6, a DBMS not supporting referential integrity will do the insertion even though no DEPARTMENT tuple exists in the database with DNUMBER = 2. It is the responsibility of the user to check that any such constraints whose checks are not implemented by the DBMS are not violated. However, the DBMS must implement checks to enforce all the SQL integrity constraints it supports. A DBMS enforcing NOT NULL will reject an INSERT command in which an attribute declared to be NOT NULL does not have a value; for example, U2A would be rejected because no SSN value is provided.

U2: INSERT INTO EMPLOYEE (FNAME, LNAME, SSN, DNO)
    VALUES      ('Robert', 'Hatcher', '980760540', 2);
    (* U2 is rejected if referential integrity checking is provided by the DBMS *)

U2A: INSERT INTO EMPLOYEE (FNAME, LNAME, DNO)
     VALUES      ('Robert', 'Hatcher', 5);
     (* U2A is rejected if NOT NULL checking is provided by the DBMS *)

A variation of the INSERT command inserts multiple tuples into a relation in conjunction with creating the relation and loading it with the result of a query. For example, to create a temporary table that has the name, number of employees, and total salaries for each department, we can write the statements in U3A and U3B:

U3A: CREATE TABLE DEPTS_INFO
     (DEPT_NAME   VARCHAR(15),
      NO_OF_EMPS  INTEGER,
      TOTAL_SAL   INTEGER);

U3B: INSERT INTO DEPTS_INFO (DEPT_NAME, NO_OF_EMPS, TOTAL_SAL)
     SELECT      DNAME, COUNT (*), SUM (SALARY)
     FROM        (DEPARTMENT JOIN EMPLOYEE ON DNUMBER=DNO)
     GROUP BY    DNAME;

A table DEPTS_INFO is created by U3A and is loaded with the summary information retrieved from the database by the query in U3B.
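The INSERT variations above (explicit attribute lists, multiple tuples in one command, constraint enforcement, and loading a table from a query) can be exercised with a short sketch. It assumes Python's standard sqlite3 module and a simplified, hypothetical EMPLOYEE schema with fewer attributes than the one in Figure 8.1; the helper `try_insert` is introduced only for this illustration. SQLite happens to be a system in which referential integrity checking is off unless it is switched on, which mirrors the remark that not every DBMS enforces every constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this setting is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE DEPARTMENT (DNAME TEXT NOT NULL, DNUMBER INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE (
    FNAME  TEXT NOT NULL,
    LNAME  TEXT NOT NULL,
    SSN    TEXT NOT NULL PRIMARY KEY,
    SALARY INTEGER DEFAULT 0,
    DNO    INTEGER NOT NULL REFERENCES DEPARTMENT (DNUMBER));
INSERT INTO DEPARTMENT VALUES ('Headquarters', 1), ('Administration', 4),
                              ('Research', 5);
""")

def try_insert(sql):
    """Return True if the INSERT is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# U1A style: explicit attribute names; SALARY falls back to its DEFAULT.
u1a_ok = try_insert("""INSERT INTO EMPLOYEE (FNAME, LNAME, DNO, SSN)
                       VALUES ('Richard', 'Marini', 4, '653298653')""")

# Several tuples in a single INSERT, values in CREATE TABLE order.
multi_ok = try_insert("""INSERT INTO EMPLOYEE VALUES
                         ('John', 'Smith', '123456789', 30000, 5),
                         ('Franklin', 'Wong', '333445555', 40000, 5)""")

# U2 style: no DEPARTMENT tuple has DNUMBER = 2.
u2_ok = try_insert("""INSERT INTO EMPLOYEE (FNAME, LNAME, SSN, DNO)
                      VALUES ('Robert', 'Hatcher', '980760540', 2)""")

# U2A style: no SSN value is provided for a NOT NULL attribute.
u2a_ok = try_insert("""INSERT INTO EMPLOYEE (FNAME, LNAME, DNO)
                       VALUES ('Robert', 'Hatcher', 5)""")

# U3A and U3B style: create a table and load it with a query result.
conn.execute("""CREATE TABLE DEPTS_INFO
                (DEPT_NAME VARCHAR(15), NO_OF_EMPS INTEGER, TOTAL_SAL INTEGER)""")
conn.execute("""INSERT INTO DEPTS_INFO (DEPT_NAME, NO_OF_EMPS, TOTAL_SAL)
                SELECT DNAME, COUNT(*), SUM(SALARY)
                FROM DEPARTMENT JOIN EMPLOYEE ON DNUMBER = DNO
                GROUP BY DNAME""")
depts_info = conn.execute(
    "SELECT * FROM DEPTS_INFO ORDER BY DEPT_NAME").fetchall()

print(u1a_ok, multi_ok, u2_ok, u2a_ok)
print(depts_info)
```

DEPTS_INFO has no row for Headquarters because the join in the loading query is an inner join and that department has no employees in this sample data.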
We can now query DEPTS_INFO as we would any other relation; when we do not need it any more, we can remove it by using the DROP TABLE command. Notice that the DEPTS_INFO table may not be up to date; that is, if we update either the DEPARTMENT or the EMPLOYEE relations after issuing U3B, the information in DEPTS_INFO becomes outdated. We have to create a view (see Section 9.2) to keep such a table up to date.

8.6.2 The DELETE Command

The DELETE command removes tuples from a relation. It includes a WHERE clause, similar to that used in an SQL query, to select the tuples to be deleted. Tuples are explicitly deleted from only one table at a time. However, the deletion may propagate to tuples in other relations if referential triggered actions are specified in the referential integrity constraints of the DDL (see Section 8.2.2 and footnote 13). Depending on the number of tuples selected by the condition in the WHERE clause, zero, one, or several tuples can be deleted by a single DELETE command. A missing WHERE clause specifies that all tuples in the relation are to be deleted; however, the table remains in the database as an empty table (see footnote 14). The DELETE commands in U4A to U4D, if applied independently to the database of Figure 5.6, will delete zero, one, four, and all tuples, respectively, from the EMPLOYEE relation:

U4A: DELETE FROM EMPLOYEE
     WHERE       LNAME='Brown';

U4B: DELETE FROM EMPLOYEE
     WHERE       SSN='123456789';

U4C: DELETE FROM EMPLOYEE
     WHERE       DNO IN (SELECT DNUMBER
                         FROM   DEPARTMENT
                         WHERE  DNAME='Research');

U4D: DELETE FROM EMPLOYEE;

13. Other actions can be automatically applied through triggers (see Section 24.1) and other mechanisms.
14. We must use the DROP TABLE command to remove the table definition (see Section 8.3.1).

8.6.3 The UPDATE Command

The UPDATE command is used to modify attribute values of one or more selected tuples. As in the DELETE command, a WHERE clause in the UPDATE command selects the tuples to be modified from a single relation. However, updating a primary key value may propagate to the foreign key values of tuples in other relations if such a referential triggered action is specified in the referential integrity constraints of the DDL (see Section 8.2.2). An additional SET clause in the UPDATE command specifies the attributes to be modified and their new values. For example, to change the location and controlling department number of project number 10 to 'Bellaire' and 5, respectively, we use U5:

U5: UPDATE PROJECT
    SET    PLOCATION = 'Bellaire', DNUM = 5
    WHERE  PNUMBER=10;

Several tuples can be modified with a single UPDATE command. An example is to give all employees in the 'Research' department a 10 percent raise in salary, as shown in U6. In this request, the modified SALARY value depends on the original SALARY value in each tuple, so two references to the SALARY attribute are needed. In the SET clause, the reference to the SALARY attribute on the right refers to the old SALARY value before modification, and the one on the left refers to the new SALARY value after modification:

U6: UPDATE EMPLOYEE
    SET    SALARY = SALARY * 1.1
    WHERE  DNO IN (SELECT DNUMBER
                   FROM   DEPARTMENT
                   WHERE  DNAME='Research');

It is also possible to specify NULL or DEFAULT as the new attribute value. Notice that each UPDATE command explicitly refers to a single relation only. To modify multiple relations, we must issue several UPDATE commands.

8.7 ADDITIONAL FEATURES OF SQL

SQL has a number of additional features that we have not described in this chapter but discuss elsewhere in the book. These are as follows:

• SQL has the capability to specify more general constraints, called assertions, using the CREATE ASSERTION statement. This is described in Section 9.1.
• SQL has language constructs for specifying views, also known as virtual tables, using the CREATE VIEW statement. Views are derived from the base tables declared through the CREATE TABLE statement, and are discussed in Section 9.2.

• SQL has several different techniques for writing programs in various programming languages that can include SQL statements to access one or more databases. These include embedded (and dynamic) SQL, SQL/CLI (Call Level Interface) and its predecessor ODBC (Open Database Connectivity), and SQL/PSM (Persistent Stored Modules). We discuss the differences among these techniques in Section 9.3, then discuss each technique in Sections 9.4 through 9.6. We also discuss how to access SQL databases through the Java programming language using JDBC and SQLJ.

• Each commercial RDBMS will have, in addition to the SQL commands, a set of commands for specifying physical database design parameters, file structures for relations, and access paths such as indexes. We called these commands a storage definition language (SDL) in Chapter 2. Earlier versions of SQL had commands for creating indexes, but these were removed from the language because they were not at the conceptual schema level (see Chapter 2).

• SQL has transaction control commands. These are used to specify units of database processing for concurrency control and recovery purposes. We discuss these commands in Chapter 17 after we discuss the concept of transactions in more detail.

• SQL has language constructs for specifying the granting and revoking of privileges to users. Privileges typically correspond to the right to use certain SQL commands to access certain relations. Each relation is assigned an owner, and either the owner or the DBA staff can grant to selected users the privilege to use an SQL statement (such as SELECT, INSERT, DELETE, or UPDATE) to access the relation.
In addition, the DBA staff can grant the privileges to create schemas, tables, or views to certain users. These SQL commands, called GRANT and REVOKE, are discussed in Chapter 23, where we discuss database security and authorization.

• SQL has language constructs for creating triggers. These are generally referred to as active database techniques, since they specify actions that are automatically triggered by events such as database updates. We discuss these features in Section 24.1, where we discuss active database concepts.

• SQL has incorporated many features from object-oriented models to have more powerful capabilities, leading to enhanced relational systems known as object-relational. Capabilities such as creating complex-structured attributes (also called nested relations), specifying abstract data types (called UDTs or user-defined types) for attributes and tables, creating object identifiers for referencing tuples, and specifying operations on types are discussed in Chapter 22.

• SQL and relational databases can interact with new technologies such as XML (eXtensible Markup Language; see Chapter 26) and OLAP (Online Analytical Processing for Data Warehouses; see Chapter 28).

8.8 SUMMARY

In this chapter we presented the SQL database language. This language or variations of it have been implemented as interfaces to many commercial relational DBMSs, including Oracle, IBM's DB2 and SQL/DS, Microsoft's SQL Server and ACCESS, INGRES, INFORMIX, and SYBASE. The original version of SQL was implemented in the experimental DBMS called SYSTEM R, which was developed at IBM Research. SQL is designed to be a comprehensive language that includes statements for data definition, queries, updates, view definition, and constraint specification. We discussed many of these in separate sections of this chapter. In the final section we discussed additional features that are described elsewhere in the book. Our emphasis was on the SQL-99 standard.
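As a recap of the update statements of Sections 8.6.2 and 8.6.3, the following minimal sketch applies a U6-style UPDATE and then DELETE commands in the style of U4A, U4B, and U4D. It assumes Python's standard sqlite3 module and a hypothetical three-employee table, so the numbers of affected tuples differ from those quoted for the database of Figure 5.6. Salaries are rounded when read back because multiplying by 1.1 produces floating-point values.

```python
import sqlite3

# Hypothetical three-employee subset of COMPANY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (DNAME TEXT, DNUMBER INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE   (LNAME TEXT, SSN TEXT PRIMARY KEY,
                         SALARY INTEGER, DNO INTEGER);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Administration', 4);
INSERT INTO EMPLOYEE VALUES ('Smith',  '123456789', 30000, 5),
                            ('Wong',   '333445555', 40000, 5),
                            ('Zelaya', '999887777', 25000, 4);
""")

# U6 style: the SALARY on the right of SET is the old value of each tuple.
raised = conn.execute("""
    UPDATE EMPLOYEE
    SET SALARY = SALARY * 1.1
    WHERE DNO IN (SELECT DNUMBER FROM DEPARTMENT
                  WHERE DNAME = 'Research')""").rowcount
salaries = {name: round(salary) for name, salary in
            conn.execute("SELECT LNAME, SALARY FROM EMPLOYEE")}

# U4A style: the condition selects no tuple, so nothing is deleted.
deleted_brown = conn.execute(
    "DELETE FROM EMPLOYEE WHERE LNAME = 'Brown'").rowcount

# U4B style: the condition selects exactly one tuple.
deleted_one = conn.execute(
    "DELETE FROM EMPLOYEE WHERE SSN = '123456789'").rowcount

# U4D style: no WHERE clause, so every remaining tuple is deleted,
# but the (now empty) table itself stays in the database.
conn.execute("DELETE FROM EMPLOYEE")
remaining = conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0]

print(raised, salaries)
print(deleted_brown, deleted_one, remaining)
```

The final COUNT(*) query succeeding with a result of zero shows that DELETE without a WHERE clause empties the table without removing its definition; DROP TABLE would be needed for that.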
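Table 8.2 summarizes the syntax (or structure) of various SQL statements. This summary is not meant to be comprehensive nor to describe every possible SQL construct; rather, it is meant to serve as a quick reference to the major types of constructs available in SQL.

TABLE 8.2 SUMMARY OF SQL SYNTAX

CREATE TABLE <table name> (<column name> <column type> [<attribute constraint>]
    {, <column name> <column type> [<attribute constraint>]}
    [<table constraint> {, <table constraint>}])

DROP TABLE <table name>

ALTER TABLE <table name> ADD <column name> <column type>

SELECT [DISTINCT] <attribute list>
FROM (<table name> {<alias>} | <joined table>) {, (<table name> {<alias>} | <joined table>)}
[WHERE <condition>]
[GROUP BY <grouping attributes> [HAVING <group selection condition>]]
[ORDER BY <column name> [<order>] {, <column name> [<order>]}]

<attribute list> ::= (* | (<column name> | <function>(([DISTINCT] <column name> | *)))
    {, (<column name> | <function>(([DISTINCT] <column name> | *)))})
<grouping attributes> ::= <column name> {, <column name>}
<order> ::= (ASC | DESC)

INSERT INTO <table name> [(<column name> {, <column name>})]
(VALUES (<constant value> {, <constant value>}) {, (<constant value> {, <constant value>})}
    | <select statement>)

DELETE FROM <table name>
[WHERE <selection condition>]

UPDATE <table name>
SET <column name> = <value expression> {, <column name> = <value expression>}
[WHERE <selection condition>]

CREATE [UNIQUE] INDEX <index name>
ON <table name> (<column name> [<order>] {, <column name> [<order>]})
[CLUSTER]

DROP INDEX <index name>

CREATE VIEW <view name> [(<column name> {, <column name>})]
AS <select statement>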
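To make the statement forms in Table 8.2 concrete, the following minimal sketch runs several of them against an in-memory SQLite database through Python's standard sqlite3 module. The table, column names, and rows are illustrative assumptions in the style of the COMPANY examples, and SQLite's dialect differs from SQL-99 in places (for example, it has no CLUSTER option), so this should be read as one possible instantiation rather than as the standard itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE with attribute constraints (illustrative columns).
cur.execute("""
    CREATE TABLE employee (
        ssn    TEXT PRIMARY KEY,
        lname  TEXT NOT NULL,
        salary INTEGER,
        dno    INTEGER
    )
""")

# ALTER TABLE ... ADD
cur.execute("ALTER TABLE employee ADD COLUMN job TEXT")

# INSERT INTO ... VALUES (made-up rows)
cur.executemany(
    "INSERT INTO employee (ssn, lname, salary, dno) VALUES (?, ?, ?, ?)",
    [
        ("111", "Smith", 30000, 5),
        ("222", "Wong", 40000, 5),
        ("333", "Zelaya", 25000, 4),
        ("444", "Wallace", 43000, 4),
        ("555", "Borg", 55000, 1),
    ],
)

# SELECT ... GROUP BY ... HAVING ... ORDER BY
dept_totals = cur.execute("""
    SELECT dno, COUNT(*), SUM(salary)
    FROM employee
    GROUP BY dno
    HAVING COUNT(*) > 1
    ORDER BY dno DESC
""").fetchall()

# CREATE VIEW with an explicit column list
cur.execute("""
    CREATE VIEW dept5 (lname, salary) AS
    SELECT lname, salary FROM employee WHERE dno = 5
""")
view_rows = cur.execute("SELECT lname, salary FROM dept5 ORDER BY salary").fetchall()

# UPDATE and DELETE with a selection condition
cur.execute("UPDATE employee SET salary = salary + 1000 WHERE dno = 1")
cur.execute("DELETE FROM employee WHERE ssn = '333'")
remaining = cur.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

print(dept_totals, view_rows, remaining)
```

Department 1 is filtered out by the HAVING clause because it has only one employee, which illustrates that HAVING selects among groups whereas WHERE selects among tuples.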
<trigger> ::= CREATE TRIGGER <trigger name>
    ( AFTER | BEFORE ) <triggering events> ON <table name>
    [ FOR EACH ROW ]
    [ WHEN <condition> ]
    <trigger actions> ;
<triggering events> ::= <trigger event> {OR <trigger event>}
<trigger event> ::= INSERT | DELETE | UPDATE [OF <column name> {, <column name>}]
<trigger action> ::= <PL/SQL block>

FIGURE 24.3 A syntax summary for specifying triggers in the Oracle system (main options only).

6. Assuming that an appropriate external procedure has been declared. This is a feature that is now available in SQL.

Chapter 24 Enhanced Data Models for Advanced Applications

In addition to creating rules, an active database system should allow users to activate, deactivate, and drop rules by referring to their rule names. A deactivated rule will not be triggered by the triggering event. This feature allows users to selectively deactivate rules for certain periods of time when they are not needed. The activate command will make the rule active again. The drop command deletes the rule from the system. Another option is to group rules into named rule sets, so the whole set of rules can be activated, deactivated, or dropped. It is also useful to have a command that can trigger a rule or rule set via an explicit PROCESS RULES command issued by the user.

The second issue concerns whether the triggered action should be executed before, after, or concurrently with the triggering event. A related issue is whether the action being executed should be considered as a separate transaction or whether it should be part of the same transaction that triggered the rule. We will first try to categorize the various options. It is important to note that not all options may be available for a particular active database system. In fact, most commercial systems are limited to one or two of the options that we will now discuss.

Let us assume that the triggering event occurs as part of a transaction execution. We should first consider the various options for how the triggering event is related to the evaluation of the rule's condition. The rule condition evaluation is also known as rule consideration, since the action is to be executed only after considering whether the condition evaluates to true or false.
There are three main possibilities for rule consideration:
1. Immediate consideration: The condition is evaluated as part of the same transaction as the triggering event, and is evaluated immediately. This case can be further categorized into three options:
   • Evaluate the condition before executing the triggering event.
   • Evaluate the condition after executing the triggering event.
   • Evaluate the condition instead of executing the triggering event.
2. Deferred consideration: The condition is evaluated at the end of the transaction that included the triggering event. In this case, there could be many triggered rules waiting to have their conditions evaluated.
3. Detached consideration: The condition is evaluated as a separate transaction, spawned from the triggering transaction.

The next set of options concerns the relationship between evaluating the rule condition and executing the rule action. Here, again, three options are possible: immediate, deferred, and detached execution. However, most active systems use the first option. That is, as soon as the condition is evaluated, if it returns true, the action is immediately executed.

The Oracle system (see Section 24.1.1) uses the immediate consideration model, but it allows the user to specify for each rule whether the before or after option is to be used with immediate condition evaluation. It also uses the immediate execution model. The STARBURST system (see Section 24.1.3) uses the deferred consideration option, meaning that all rules triggered by a transaction wait until the triggering transaction reaches its end and issues its COMMIT WORK command before the rule conditions are evaluated.⁷

7. STARBURST also allows the user to explicitly start rule consideration via a PROCESS RULES command.

Another issue concerning active database rules is the distinction between row-level rules and statement-level rules.
Because SQL update statements (which act as triggering events) can specify a set of tuples, one has to distinguish between whether the rule should be considered once for the whole statement or whether it should be considered separately for each row (that is, tuple) affected by the statement. The SQL-99 standard (see Section 24.1.5) and the Oracle system (see Section 24.1.1) allow the user to choose which of the above two options is to be used for each rule, whereas STARBURST uses statement-level semantics only. We will give examples of how statement-level triggers can be specified in Section 24.1.3.

One of the difficulties that may have limited the widespread use of active rules, in spite of their potential to simplify database and software development, is that there are no easy-to-use techniques for designing, writing, and verifying rules. For example, it is quite difficult to verify that a set of rules is consistent, meaning that two or more rules in the set do not contradict one another. It is also difficult to guarantee termination of a set of rules under all circumstances. To briefly illustrate the termination problem, consider the rules in Figure 24.4. Here, rule R1 is triggered by an INSERT event on TABLE1 and its action includes an UPDATE event on ATTRIBUTE1 of TABLE2. However, rule R2's triggering event is an UPDATE event on ATTRIBUTE1 of TABLE2, and its action includes an INSERT event on TABLE1. It is easy to see in this example that these two rules can trigger one another indefinitely, leading to nontermination. However, if dozens of rules are written, it is very difficult to determine whether termination is guaranteed or not.

If active rules are to reach their potential, it is necessary to develop tools that help users design, debug, and monitor their rules.
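One simple aid of the kind called for above is a static check on a triggering graph, in which an edge from rule A to rule B means that A's action can generate B's triggering event. The sketch below is a hedged illustration and not a technique described in this section: the dictionary encoding of rules and the function names are assumptions made for the example. A cycle in the graph only signals that nontermination is possible; whether the rules actually loop also depends on their conditions and on the data.

```python
# Hypothetical encoding of the two rules of Figure 24.4: each rule has one
# triggering event ("on") and the events its action may generate ("emits").
RULES = {
    "R1": {"on": ("INSERT", "TABLE1"),
           "emits": [("UPDATE", "TABLE2.ATTRIBUTE1")]},
    "R2": {"on": ("UPDATE", "TABLE2.ATTRIBUTE1"),
           "emits": [("INSERT", "TABLE1")]},
}


def triggering_graph(rules):
    """Map each rule to the rules whose triggering event its action can generate."""
    return {
        name: [other for other, r in rules.items() if r["on"] in rule["emits"]]
        for name, rule in rules.items()
    }


def may_not_terminate(rules):
    """Return True if the triggering graph contains a cycle (depth-first search)."""
    graph = triggering_graph(rules)
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {name: WHITE for name in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:        # back edge: a rule can re-trigger itself
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


print(triggering_graph(RULES))
print(may_not_terminate(RULES))
```

The check is conservative by design: it can flag rule sets that would in fact terminate, but an acyclic graph guarantees that rule processing started by one event eventually stops.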

24.1.3 Examples of Statement-level Active Rules in STARBURST

We now give some examples to illustrate how rules can be specified in the STARBURST experimental DBMS. This will allow us to demonstrate how statement-level rules can be written, since these are the only types of rules allowed in STARBURST.

R1: CREATE TRIGGER T1
    AFTER INSERT ON TABLE1
    FOR EACH ROW
    UPDATE TABLE2
    SET ATTRIBUTE1 = ... ;

R2: CREATE TRIGGER T2
    AFTER UPDATE OF ATTRIBUTE1 ON TABLE2
    FOR EACH ROW
    INSERT INTO TABLE1 VALUES (...);

FIGURE 24.4 An example to illustrate the termination problem for active rules.

The three active rules R1S, R2S, and R3S in Figure 24.5 correspond to the first three rules in Figure 24.2, but use STARBURST notation and statement-level semantics. We can explain the rule structure using rule R1S. The CREATE RULE statement specifies a rule name (TOTALSAL1 for R1S). The ON-clause specifies the relation on which the rule is specified (EMPLOYEE for R1S).
The WHEN-clause is used to specify the events that trigger the rule.⁸ The optional IF-clause is used to specify any conditions that need to be checked.

R1S: CREATE RULE TOTALSAL1 ON EMPLOYEE
     WHEN INSERTED
     IF   EXISTS (SELECT * FROM INSERTED WHERE DNO IS NOT NULL)
     THEN UPDATE DEPARTMENT AS D
          SET    D.TOTAL_SAL = D.TOTAL_SAL +
                 (SELECT SUM(I.SALARY) FROM INSERTED AS I WHERE D.DNO = I.DNO)
          WHERE  D.DNO IN (SELECT DNO FROM INSERTED);

R2S: CREATE RULE TOTALSAL2 ON EMPLOYEE
     WHEN UPDATED (SALARY)
     IF   EXISTS (SELECT * FROM NEW-UPDATED WHERE DNO IS NOT NULL)
          OR EXISTS (SELECT * FROM OLD-UPDATED WHERE DNO IS NOT NULL)
     THEN UPDATE DEPARTMENT AS D
          SET    D.TOTAL_SAL = D.TOTAL_SAL +
                 (SELECT SUM(N.SALARY) FROM NEW-UPDATED AS N WHERE D.DNO = N.DNO) -
                 (SELECT SUM(O.SALARY) FROM OLD-UPDATED AS O WHERE D.DNO = O.DNO)
          WHERE  D.DNO IN (SELECT DNO FROM NEW-UPDATED) OR
                 D.DNO IN (SELECT DNO FROM OLD-UPDATED);

R3S: CREATE RULE TOTALSAL3 ON EMPLOYEE
     WHEN UPDATED (DNO)
     THEN UPDATE DEPARTMENT AS D
          SET    D.TOTAL_SAL = D.TOTAL_SAL +
                 (SELECT SUM(N.SALARY) FROM NEW-UPDATED AS N WHERE D.DNO = N.DNO)
          WHERE  D.DNO IN (SELECT DNO FROM NEW-UPDATED);
          UPDATE DEPARTMENT AS D
          SET    D.TOTAL_SAL = D.TOTAL_SAL -
                 (SELECT SUM(O.SALARY) FROM OLD-UPDATED AS O WHERE D.DNO = O.DNO)
          WHERE  D.DNO IN (SELECT DNO FROM OLD-UPDATED);

FIGURE 24.5 Active rules using statement-level semantics in STARBURST notation.

8. Note that the WHEN keyword specifies events in STARBURST but is used to specify the rule condition in SQL and Oracle triggers.

Finally, the THEN-clause is used to specify the action (or actions) to be taken, which are typically one or more SQL statements. In STARBURST, the basic events that can be specified for triggering the rules are the standard SQL update commands: INSERT, DELETE, and UPDATE. These are specified by the keywords INSERTED, DELETED, and UPDATED in STARBURST notation. The rule designer also needs a way to refer to the tuples that have been modified.
The keywords INSERTED, DELETED, NEW-UPDATED, and OLD-UPDATED are used in STARBURST notation to refer to four transition tables (relations) that include the newly inserted tuples, the deleted tuples, the updated tuples after they were updated, and the updated tuples before they were updated, respectively. Obviously, depending on the triggering events, only some of these transition tables may be available. The rule writer can refer to these tables when writing the condition and action parts of the rule. Transition tables contain tuples of the same type as those in the relation specified in the ON-clause of the rule; for R1S, R2S, and R3S, this is the EMPLOYEE relation.

In statement-level semantics, the rule designer can only refer to the transition tables as a whole and the rule is triggered only once, so the rules must be written differently than for row-level semantics. Because multiple employee tuples may be inserted in a single insert statement, we have to check if at least one of the newly inserted employee tuples is related to a department. In R1S, the condition

EXISTS (SELECT * FROM INSERTED WHERE DNO IS NOT NULL)

is checked, and if it evaluates to true, then the action is executed. The action updates in a single statement the DEPARTMENT tuple(s) related to the newly inserted employee(s) by adding their salaries to the TOTAL_SAL attribute of each related department. Because more than one newly inserted employee may belong to the same department, we use the SUM aggregate function to ensure that all their salaries are added.

Rule R2S is similar to R1S, but is triggered by an UPDATE operation that updates the salary of one or more employees rather than by an INSERT. Rule R3S is triggered by an update to the DNO attribute of EMPLOYEE, which signifies changing one or more employees' assignment from one department to another.
There is no condition in R3S, so the action is executed whenever the triggering event occurs.⁹ The action updates both the old department(s) and new department(s) of the reassigned employees by adding their salary to TOTAL_SAL of each new department and subtracting their salary from TOTAL_SAL of each old department.

9. As in the Oracle examples, rules R1S and R2S can be written without a condition. However, they may be more efficient to execute with the condition since the action is not invoked unless it is required.

In our example, it is more complex to write the statement-level rules than the row-level rules, as can be illustrated by comparing Figures 24.2 and 24.5. However, this is not a general rule, and other types of active rules may be easier to specify using statement-level notation than when using row-level notation.

The execution model for active rules in STARBURST uses deferred consideration. That is, all the rules that are triggered within a transaction are placed in a set, called the conflict set, which is not considered for evaluation of conditions and execution until the transaction ends (by issuing its COMMIT WORK command). STARBURST also allows the user to explicitly start rule consideration in the middle of a transaction via an explicit PROCESS RULES command. Because multiple rules must be evaluated, it is necessary to specify an order among the rules. The syntax for rule declaration in STARBURST allows the specification of ordering among the rules to instruct the system about the order in which a set of rules should be considered.¹⁰ In addition, the transition tables (INSERTED, DELETED, NEW-UPDATED, and OLD-UPDATED) contain the net effect of all the operations within the transaction that affected each table, since multiple operations may have been applied to each table during the transaction.
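The statement-level behavior of rule R1S can be imitated in a system that lacks statement-level rules by materializing the transition table explicitly. The following sketch uses Python's sqlite3 module; SQLite itself supports only row-level triggers, so the temporary table named inserted and the helper function insert_employees are assumptions introduced purely for illustration, and STARBURST's deferred, commit-time processing is simplified here to one consideration per call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dno INTEGER PRIMARY KEY, total_sal INTEGER NOT NULL);
    CREATE TABLE employee (ssn TEXT PRIMARY KEY, salary INTEGER, dno INTEGER);
    -- Stand-in for the INSERTED transition table of a single statement.
    CREATE TEMP TABLE inserted (ssn TEXT, salary INTEGER, dno INTEGER);
    INSERT INTO department VALUES (4, 0), (5, 0);
""")


def insert_employees(rows):
    """Insert a set of employees, then consider rule R1S once for the whole set."""
    with conn:  # one transaction for the insert and the rule action
        conn.execute("DELETE FROM inserted")
        conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
        conn.executemany("INSERT INTO inserted VALUES (?, ?, ?)", rows)
        # IF-clause of R1S: is any new employee related to a department?
        (fire,) = conn.execute(
            "SELECT EXISTS (SELECT * FROM inserted WHERE dno IS NOT NULL)"
        ).fetchone()
        if fire:
            # THEN-clause of R1S: one set-oriented update; SUM handles several
            # new employees who belong to the same department.
            conn.execute("""
                UPDATE department
                SET total_sal = total_sal +
                    (SELECT SUM(i.salary) FROM inserted AS i
                     WHERE i.dno = department.dno)
                WHERE dno IN (SELECT dno FROM inserted)
            """)
        return bool(fire)


fired = insert_employees([("1", 30000, 5), ("2", 40000, 5), ("3", 25000, None)])
totals = dict(conn.execute("SELECT dno, total_sal FROM department"))
print(fired, totals)
```

Both new employees of department 5 are accounted for by a single UPDATE, whereas a row-level rule would have executed its action once per inserted tuple.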

24.1.4 Potential Applications for Active Databases

We now briefly discuss some of the potential applications of active rules. Obviously, one important application is to allow notification of certain conditions that occur. For example, an active database may be used to monitor, say, the temperature of an industrial furnace. The application can periodically insert in the database the temperature reading records directly from temperature sensors, and active rules can be written that are triggered whenever a temperature record is inserted, with a condition that checks if the temperature exceeds the danger level, and the action to raise an alarm.

Active rules can also be used to enforce integrity constraints by specifying the types of events that may cause the constraints to be violated and then evaluating appropriate conditions that check whether the constraints are actually violated by the event or not. Hence, complex application constraints, often known as business rules, may be enforced that way. For example, in the UNIVERSITY database application, one rule may monitor the grade point average of students whenever a new grade is entered, and it may alert the advisor if the GPA of a student falls below a certain threshold; another rule may check that course prerequisites are satisfied before allowing a student to enroll in a course; and so on.

Other applications include the automatic maintenance of derived data, such as the examples of rules R1 through R4 that maintain the derived attribute TOTAL_SAL whenever individual employee tuples are changed. A similar application is to use active rules to maintain the consistency of materialized views (see Chapter 9) whenever the base relations are modified. This application is also relevant to the new data warehousing technologies (see Chapter 28). A related application is to keep replicated tables consistent by specifying rules that modify the replicas whenever the master table is modified.
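The furnace-monitoring application can be sketched as a row-level trigger with an event (insertion of a reading), a condition (temperature above the danger level), and an action (recording an alarm). The example below uses SQLite through Python's sqlite3 module; the table names, the threshold value, and the choice to represent "raising an alarm" as a row in an alarm table are all assumptions made for illustration.

```python
import sqlite3

DANGER_LEVEL_C = 1200  # hypothetical danger level, in degrees Celsius

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
    CREATE TABLE reading (sensor TEXT, taken_at TEXT, temp_c REAL);
    CREATE TABLE alarm   (sensor TEXT, taken_at TEXT, temp_c REAL);

    -- Event: a reading is inserted. Condition: WHEN clause. Action: log an alarm.
    CREATE TRIGGER furnace_alarm
    AFTER INSERT ON reading
    FOR EACH ROW
    WHEN NEW.temp_c > {DANGER_LEVEL_C}
    BEGIN
        INSERT INTO alarm VALUES (NEW.sensor, NEW.taken_at, NEW.temp_c);
    END;
""")

# The monitoring application periodically inserts sensor readings (made-up data).
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [
        ("F1", "2003-06-01 08:00:00", 950.0),
        ("F1", "2003-06-01 08:05:00", 1235.5),
    ],
)

alarms = conn.execute("SELECT sensor, temp_c FROM alarm").fetchall()
print(alarms)
```

Only the second reading satisfies the condition, so the action runs once. In a production setting the action would more likely call an external notification procedure, which SQLite triggers cannot do directly.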

24.1.5 Triggers in SQL-99

Triggers in the SQL-99 standard are quite similar to the examples we discussed in Section 24.1.1, with some minor syntactic differences. The basic events that can be specified for triggering the rules are the standard SQL update commands: INSERT, DELETE, and UPDATE. In the case of UPDATE one may specify the attributes to be updated. Both row-level and statement-level triggers are allowed, indicated in the trigger by the clauses FOR EACH ROW and FOR EACH STATEMENT, respectively. One syntactic difference is that the trigger may specify particular tuple variable names for the old and new tuples instead of using the keywords NEW and OLD as in Figure 24.2. Trigger T1 in Figure 24.6 shows how the row-level trigger R2 from Figure 24.2(a) may be specified in SQL-99. Inside the REFERENCING clause, we named tuple variables (aliases) O and N to refer to the OLD tuple (before modification) and NEW tuple (after modification), respectively. Trigger T2 in Figure 24.6 shows how the statement-level trigger R2S from Figure 24.5 may be specified in SQL-99. For a statement-level trigger, the REFERENCING clause is used to refer to the table of all new tuples (newly inserted or newly updated) as N, whereas the table of all old tuples (deleted tuples or tuples before they were updated) is referred to as O.

10. If no order is specified between a pair of rules, the system default order is based on placing the rule declared first ahead of the other rule.

24.2 TEMPORAL DATABASE CONCEPTS

Temporal databases, in the broadest sense, encompass all database applications that require some aspect of time when organizing their information. Hence, they provide a good example to illustrate the need for developing a set of unifying concepts for application developers to use. Temporal database applications have been developed since the early days of database usage.
However, in creating these applications, it was mainly left to the application designers and developers to discover, design, program, and implement the temporal concepts they need.

T1: CREATE TRIGGER TOTALSAL1
    AFTER UPDATE OF SALARY ON EMPLOYEE
    REFERENCING OLD ROW AS O, NEW ROW AS N
    FOR EACH ROW
    WHEN (N.DNO IS NOT NULL)
    UPDATE DEPARTMENT
    SET TOTAL_SAL = TOTAL_SAL + N.SALARY - O.SALARY
    WHERE DNO = N.DNO;

T2: CREATE TRIGGER TOTALSAL2
    AFTER UPDATE OF SALARY ON EMPLOYEE
    REFERENCING OLD TABLE AS O, NEW TABLE AS N
    FOR EACH STATEMENT
    WHEN EXISTS (SELECT * FROM N WHERE N.DNO IS NOT NULL) OR
         EXISTS (SELECT * FROM O WHERE O.DNO IS NOT NULL)
    UPDATE DEPARTMENT AS D
    SET D.TOTAL_SAL = D.TOTAL_SAL
        + (SELECT SUM(N.SALARY) FROM N WHERE D.DNO = N.DNO)
        - (SELECT SUM(O.SALARY) FROM O WHERE D.DNO = O.DNO)
    WHERE DNO IN ((SELECT DNO FROM N) UNION (SELECT DNO FROM O));

FIGURE 24.6 Trigger T1 illustrating the syntax for defining triggers in SQL-99.

There are many examples of applications where some aspect of time is needed to maintain the information in a database. These include healthcare, where patient histories need to be maintained; insurance, where claims and accident histories are required as well as information on the times when insurance policies are in effect; reservation systems in general (hotel, airline, car rental, train, etc.), where information on the dates and times when reservations are in effect is required; scientific databases, where data collected from experiments includes the time when each data item is measured; and so on. Even the two examples used in this book may be easily expanded into temporal applications. In the COMPANY database, we may wish to keep SALARY, JOB, and PROJECT histories on each employee. In the UNIVERSITY database, time is already included in the SEMESTER and YEAR of each SECTION of a COURSE; the grade history of a STUDENT; and the information on research grants.
In fact, it is realistic to conclude that the majority of database applications have some temporal information. Users often attempted to simplify or ignore temporal aspects because of the complexity that they add to their applications.

In this section, we will introduce some of the concepts that have been developed to deal with the complexity of temporal database applications. Section 24.2.1 gives an overview of how time is represented in databases, the different types of temporal information, and some of the different dimensions of time that may be needed. Section 24.2.2 discusses how time can be incorporated into relational databases. Section 24.2.3 gives some additional options for representing time that are possible in database models that allow complex-structured objects, such as object databases. Section 24.2.4 introduces operations for querying temporal databases, and gives a brief overview of the TSQL2 language, which extends SQL with temporal concepts. Section 24.2.5 focuses on time series data, which is a type of temporal data that is very important in practice.

24.2.1 Time Representation, Calendars, and Time Dimensions

For temporal databases, time is considered to be an ordered sequence of points in some granularity that is determined by the application. For example, suppose that some temporal application never requires time units that are less than one second. Then, each time point represents one second in time using this granularity. In reality, each second is a (short) time duration, not a point, since it may be further divided into milliseconds, microseconds, and so on. Temporal database researchers have used the term chronon instead of point to describe this minimal granularity for a particular application. The main consequence of choosing a minimum granularity (say, one second) is that events occurring within the same second will be considered to be simultaneous events, even though in reality they may not be.
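The effect of choosing a granularity can be seen by mapping timestamps to their chronon. The following is a minimal sketch using Python's standard datetime module; the function name to_chronon and the three supported granularities are assumptions chosen for the example.

```python
from datetime import datetime


def to_chronon(ts: datetime, granularity: str) -> datetime:
    """Map a timestamp to the start of the chronon that contains it."""
    if granularity == "second":
        return ts.replace(microsecond=0)
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


# Two events 800 milliseconds apart, within the same second.
a = datetime(2003, 5, 15, 8, 52, 12, 100_000)
b = datetime(2003, 5, 15, 8, 52, 12, 900_000)

distinct_in_reality = a != b
simultaneous_at_second = to_chronon(a, "second") == to_chronon(b, "second")
print(distinct_in_reality, simultaneous_at_second)
```

With a one-second chronon the two events become indistinguishable in time, which is exactly the loss of ordering information that the choice of granularity implies.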

Because there is no known beginning or ending of time, one needs a reference point from which to measure specific time points. Various calendars are used by various cultures (such as Gregorian (Western), Chinese, Islamic, Hindu, Jewish, Coptic, etc.) with different reference points. A calendar organizes time into different time units for convenience. Most calendars group 60 seconds into a minute, 60 minutes into an hour, 24 hours into a day (based on the physical time of earth's rotation around its axis), and 7 days into a week. Further grouping of days into months and months into years follows either solar or lunar natural phenomena, and is generally irregular. In the Gregorian calendar, which is used in most Western countries, days are grouped into months that are either 28, 29, 30, or 31 days, and 12 months are grouped into a year. Complex formulas are used to map the different time units to one another.

In SQL2, the temporal data types (see Chapter 8) include DATE (specifying Year, Month, and Day as YYYY-MM-DD), TIME (specifying Hour, Minute, and Second as HH:MM:SS), TIMESTAMP (specifying a Date/Time combination, with options for including sub-second divisions if they are needed), INTERVAL (a relative time duration, such as 10 days or 250 minutes), and PERIOD (an anchored time duration with a fixed starting point, such as the 10-day period from January 1, 1999, to January 10, 1999, inclusive).¹¹

Event Information Versus Duration (or State) Information. A temporal database will store information concerning when certain events occur, or when certain facts are considered to be true. There are several different types of temporal information. Point events or facts are typically associated in the database with a single time point in some granularity.
For example, a bank deposit event may be associated with the timestamp when the deposit was made, or the total monthly sales of a product (fact) may be associated with a particular month (say, February 1999). Note that even though such events or facts may have different granularities, each is still associated with a single time value in the database. This type of information is often represented as time series data, as we shall discuss in Section 24.2.5.

Duration events or facts, on the other hand, are associated with a specific time period in the database.¹² For example, an employee may have worked in a company from August 15, 1993, till November 20, 1998. A time period is represented by its start and end time points [START-TIME, END-TIME]. For example, the above period is represented as [1993-08-15, 1998-11-20]. Such a time period is often interpreted to mean the set of all time points from start-time to end-time, inclusive, in the specified granularity. Hence, assuming day granularity, the period [1993-08-15, 1998-11-20] represents the set of all days from August 15, 1993, until November 20, 1998, inclusive.¹³

11. Unfortunately, the terminology has not been used consistently. For example, the term interval is often used to denote an anchored duration. For consistency, we shall use the SQL terminology.
12. This is the same as an anchored duration. It has also been frequently called a time interval, but to avoid confusion we will use period to be consistent with SQL terminology.
13. The representation [1993-08-15, 1998-11-20] is called a closed interval representation. One can also use an open (more precisely, half-open) interval, denoted [1993-08-15, 1998-11-21), where the set of points does not include the end point. Although the latter representation is sometimes more convenient, we shall use closed intervals throughout to avoid confusion.

Valid Time and Transaction Time Dimensions.
Given a particular event or fact that is associated with a particular time point or time period in the database, the association may be interpreted to mean different things. The most natural interpretation is that the associated time is the time that the event occurred, or the period during which the fact was considered to be true in the real world. If this interpretation is used, the associated time is often referred to as the valid time. A temporal database using this interpretation is called a valid time database.

However, a different interpretation can be used, where the associated time refers to the time when the information was actually stored in the database; that is, it is the value of the system time clock when the information is valid in the system.¹⁴ In this case, the associated time is called the transaction time. A temporal database using this interpretation is called a transaction time database.

Other interpretations can also be intended, but these two are considered to be the most common ones, and they are referred to as time dimensions. In some applications, only one of the dimensions is needed and in other cases both time dimensions are required, in which case the temporal database is called a bitemporal database. If other interpretations are intended for time, the user can define the semantics and program the applications appropriately, and it is called a user-defined time.

The next section shows with examples how these concepts can be incorporated into relational databases, and Section 24.2.3 shows an approach to incorporate temporal concepts into object databases.

24.2.2 Incorporating Time in Relational Databases Using Tuple Versioning

Valid Time Relations. Let us now see how the different types of temporal databases may be represented in the relational model. First, suppose that we would like to include the history of changes as they occur in the real world.
Consider again the database in Figure 24.1, and let us assume that, for this application, the granularity is day. Then, we could convert the two relations EMPLOYEE and DEPARTMENT into valid time relations by adding the attributes VST (Valid Start Time) and VET (Valid End Time), whose data type is DATE in order to provide day granularity. This is shown in Figure 24.7a, where the relations have been renamed EMP_VT and DEPT_VT, respectively.

Consider how the EMP_VT relation differs from the nontemporal EMPLOYEE relation (Figure 24.1).¹⁵ In EMP_VT, each tuple v represents a version of an employee's information that is valid (in the real world) only during the time period [v.VST, v.VET], whereas in EMPLOYEE each tuple represents only the current state or current version of each employee. In EMP_VT, the current version of each employee typically has a special value, now, as its valid end time. This special value, now, is a temporal variable that implicitly represents the current time as time progresses. The nontemporal EMPLOYEE relation would only include those tuples from the EMP_VT relation whose VET is now.

14. The explanation is more involved, as we shall see in Section 24.2.3.
15. A nontemporal relation is also called a snapshot relation as it shows only the current snapshot or current state of the database.

(a) EMP_VT (NAME, SSN, SALARY, DNO, SUPERVISOR_SSN, VST, VET)
    DEPT_VT (DNAME, DNO, TOTAL_SAL, MANAGER_SSN, VST, VET)
(b) EMP_TT (NAME, SSN, SALARY, DNO, SUPERVISOR_SSN, TST, TET)
    DEPT_TT (DNAME, DNO, TOTAL_SAL, MANAGER_SSN, TST, TET)
(c) EMP_BT (NAME, SSN, SALARY, DNO, SUPERVISOR_SSN, VST, VET, TST, TET)
    DEPT_BT (DNAME, DNO, TOTAL_SAL, MANAGER_SSN, VST, VET, TST, TET)

FIGURE 24.7 Different types of temporal relational databases. (a) Valid time database schema. (b) Transaction time database schema. (c) Bitemporal database schema.

Figure 24.8 shows a few tuple versions in the valid-time relations EMP_VT and DEPT_VT. There are two versions of Smith, three versions of Wong, one version of Brown, and one version of Narayan. We can now see how a valid time relation should behave when information is changed.
Whenever one or more attributes of an employee are updated, rather than actually overwriting the old values, as would happen in a nontemporal relation, the system should create a new version and close the current version by changing its VET to the end time. Hence, when the user issued the command to update the salary of Smith effective on June 1, 2003, to $30000, the second version of Smith was created (see Figure 24.8). At the time of this update, the first version of Smith was the current version, with now as its VET, but after the update now was changed to May 31, 2003 (one less than June 1, 2003, in day granularity), to indicate that the version has become a closed or history version and that the new (second) version of Smith is now the current one.

EMP_VT
NAME     SSN        SALARY  DNO  SUPERVISOR_SSN  VST         VET
Smith    123456789  25000   5    333445555       2002-06-15  2003-05-31
Smith    123456789  30000   5    333445555       2003-06-01  now
Wong     333445555  25000   4    999887777       1999-08-20  2001-01-31
Wong     333445555  30000   5    999887777       2001-02-01  2002-03-31
Wong     333445555  40000   5    888665555       2002-04-01  now
Brown    222447777  28000   4    999887777       2001-05-01  2002-08-10
Narayan  666884444  38000   5    333445555       2003-08-01  now

DEPT_VT
DNAME     DNO  MANAGER_SSN  VST         VET
Research  5    888665555    2001-09-20  2002-03-31
Research  5    333445555    2002-04-01  now

FIGURE 24.8 Some tuple versions in the valid time relations EMP_VT and DEPT_VT.

It is important to note that in a valid time relation, the user must generally provide the valid time of an update. For example, the salary update of Smith may have been entered in the database on May 15, 2003, at 8:52:12 A.M., say, even though the salary change in the real world is effective on June 1, 2003. This is called a proactive update, since it is applied to the database before it becomes effective in the real world. If the update was applied to the database after it became effective in the real world, it is called a retroactive update.
An update that is applied at the same time when it becomes effective is called a simultaneous update.

The action that corresponds to deleting an employee in a nontemporal database would typically be applied to a valid time database by closing the current version of the employee being deleted. For example, if Smith leaves the company effective January 19, 2004, then this would be applied by changing VET of the current version of Smith from now to 2004-01-19. In Figure 24.8, there is no current version for Brown, because he presumably left the company on 2002-08-10 and was logically deleted. However, because the database is temporal, the old information on Brown is still there. The operation to insert a new employee would correspond to creating the first tuple version for that employee, and making it the current version, with the VST being the effective (real world) time when the employee starts work. In Figure 24.8, the tuple on Narayan illustrates this, since the first version has not been updated yet.

Notice that in a valid time relation, the nontemporal key, such as SSN in EMPLOYEE, is no longer unique in each tuple (version). The new relation key for EMP_VT is a combination of the nontemporal key and the valid start time attribute VST,¹⁶ so we use (SSN, VST) as primary key. This is because, at any point in time, there should be at most one valid version of each entity. Hence, the constraint that any two tuple versions representing the same entity should have nonintersecting valid time periods should hold on valid time relations. Notice that if the nontemporal primary key value may change over time, it is important to have a unique surrogate key attribute, whose value never changes for each real world entity, in order to relate together all versions of the same real world entity.

16. A combination of the nontemporal key and the valid end time attribute VET could also be used.
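The insert, update, and delete behavior described above for valid time relations can be sketched with tuple versioning in an ordinary relational table. The example below uses SQLite through Python's sqlite3 module. Several choices are assumptions made for illustration: the temporal variable now is approximated by the sentinel date 9999-12-31, the relation is reduced to a few attributes, the helper names (hire, update_salary, leave, salary_on) are hypothetical, and the update helper assumes the effective date is later than the start of the current version.

```python
import sqlite3
from datetime import date, timedelta

NOW = "9999-12-31"  # assumed sentinel standing in for the temporal variable `now`

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE emp_vt (
        ssn    TEXT,
        name   TEXT,
        salary INTEGER,
        vst    TEXT NOT NULL,   -- valid start time (ISO date, day granularity)
        vet    TEXT NOT NULL,   -- valid end time (ISO date or the NOW sentinel)
        PRIMARY KEY (ssn, vst)
    )
""")


def hire(ssn, name, salary, start):
    """Insert: create the first (current) version of an employee."""
    with conn:
        conn.execute("INSERT INTO emp_vt VALUES (?, ?, ?, ?, ?)",
                     (ssn, name, salary, start, NOW))


def update_salary(ssn, new_salary, effective):
    """Update: close the current version the day before `effective`, add a new one."""
    day_before = (date.fromisoformat(effective) - timedelta(days=1)).isoformat()
    with conn:
        row = conn.execute("SELECT name FROM emp_vt WHERE ssn = ? AND vet = ?",
                           (ssn, NOW)).fetchone()
        if row is None:
            raise LookupError(f"no current version for {ssn}")
        conn.execute("UPDATE emp_vt SET vet = ? WHERE ssn = ? AND vet = ?",
                     (day_before, ssn, NOW))
        conn.execute("INSERT INTO emp_vt VALUES (?, ?, ?, ?, ?)",
                     (ssn, row[0], new_salary, effective, NOW))


def leave(ssn, last_day):
    """Logical delete: close the current version instead of removing it."""
    with conn:
        conn.execute("UPDATE emp_vt SET vet = ? WHERE ssn = ? AND vet = ?",
                     (last_day, ssn, NOW))


def salary_on(ssn, day):
    """Return the salary valid on `day` (closed period [vst, vet]), or None."""
    row = conn.execute(
        "SELECT salary FROM emp_vt WHERE ssn = ? AND vst <= ? AND ? <= vet",
        (ssn, day, day)).fetchone()
    return None if row is None else row[0]


hire("123456789", "Smith", 25000, "2002-06-15")
update_salary("123456789", 30000, "2003-06-01")
versions = conn.execute(
    "SELECT vst, vet, salary FROM emp_vt WHERE ssn = ? ORDER BY vst",
    ("123456789",)).fetchall()
print(versions)
```

Because ISO dates compare correctly as strings, the closed-period test in salary_on needs no date arithmetic. The primary key (ssn, vst) prevents two versions from starting on the same day but does not by itself enforce nonintersecting periods; that constraint is maintained here only by the update logic.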
Valid time relations basically keep track of the history of changes as they become effective in the real world. Hence, if all real-world changes are applied, the database keeps a history of the real-world states that are represented. However, because updates, insertions, and deletions may be applied retroactively or proactively, there is no record of the actual database state at any point in time. If the actual database states are more important to an application, then one should use transaction time relations.

Transaction Time Relations. In a transaction time database, whenever a change is applied to the database, the actual timestamp of the transaction that applied the change (insert, delete, or update) is recorded. Such a database is most useful when changes are applied simultaneously in the majority of cases, for example, real-time stock trading or banking transactions. If we convert the nontemporal database of Figure 24.1 into a transaction time database, then the two relations EMPLOYEE and DEPARTMENT are converted into transaction time relations by adding the attributes TST (Transaction Start Time) and TET (Transaction End Time), whose data type is typically TIMESTAMP. This is shown in Figure 24.7b, where the relations have been renamed EMP_TT and DEPT_TT, respectively.

In EMP_TT, each tuple v represents a version of an employee's information that was created at actual time v.TST and was (logically) removed at actual time v.TET (because the information was no longer correct). In EMP_TT, the current version of each employee typically has a special value, uc (Until Changed), as its transaction end time, which indicates that the tuple represents correct information until it is changed by some other transaction.¹⁷ A transaction time database has also been called a rollback database,¹⁸ because a user can logically roll back to the actual database state at any past point in time T by retrieving all tuple versions v whose transaction time period [v.TST, v.TET] includes time point T.

17. The uc variable in transaction time relations corresponds to the now variable in valid time relations. The semantics are slightly different though.

18. The term rollback here does not have the same meaning as transaction rollback (see Chapter 19) during recovery, where the transaction updates are physically undone. Rather, here the updates can be logically undone, allowing the user to examine the database as it appeared at a previous time point.

Bitemporal Relations. Some applications require both valid time and transaction time, leading to bitemporal relations. In our example, Figure 24.7c shows how the EMPLOYEE and DEPARTMENT nontemporal relations in Figure 24.1 would appear as bitemporal relations EMP_BT and DEPT_BT, respectively. Figure 24.9 shows a few tuples in these relations. In these tables, tuples whose transaction end time TET is uc are the ones representing currently valid information, whereas tuples whose TET is an absolute timestamp are tuples that were valid until (just before) that timestamp. Hence, the tuples with uc in Figure 24.9 correspond to the valid time tuples in Figure 24.8. The transaction start time attribute TST in each tuple is the timestamp of the transaction that created that tuple.

[Figure 24.9: tuple versions in the bitemporal relations EMP_BT and DEPT_BT.]

... ordered by valid start time.
Whenever an attribute is changed in this model, the current attribute version is closed and a new attribute version for this attribute only is appended to the list. This allows attributes to change asynchronously. The current value for each attribute has now for its VALID_END_TIME. When using attribute versioning, it is useful to include a lifespan temporal attribute associated with the whole object whose value is one or more valid time periods that indicate the valid time of existence for the whole object. Logical deletion of the object is implemented by closing the lifespan. The constraint that any time period of an attribute within an object should be a subset of the object's lifespan should be enforced.

21. Attribute versioning can also be used in the nested relational model (see Chapter 22).

class Temporal_Salary
{   attribute Date                      valid_start_time;
    attribute Date                      valid_end_time;
    attribute float                     salary;
};

class Temporal_Dept
{   attribute Date                      valid_start_time;
    attribute Date                      valid_end_time;
    attribute Department_VT             dept;
};

class Temporal_Supervisor
{   attribute Date                      valid_start_time;
    attribute Date                      valid_end_time;
    attribute Employee_VT               supervisor;
};

class Temporal_Lifespan
{   attribute Date                      valid_start_time;
    attribute Date                      valid_end_time;
};

class Employee_VT
(   extent employees )
{   attribute list<Temporal_Lifespan>   lifespan;
    attribute string                    name;
    attribute string                    ssn;
    attribute list<Temporal_Salary>     sal_history;
    attribute list<Temporal_Dept>       dept_history;
    attribute list<Temporal_Supervisor> supervisor_history;
};

FIGURE 24.10 Possible ODL schema for a temporal valid time Employee_VT object class using attribute versioning.

For bitemporal databases, each attribute version would have a tuple with five components:

<VALID_START_TIME, VALID_END_TIME, TRANS_START_TIME, TRANS_END_TIME, VALUE>

The object lifespan would also include both valid and transaction time dimensions. The full capabilities of bitemporal databases can hence be available with attribute versioning.
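To make the attribute versioning idea concrete, the following Python sketch mimics part of the Employee_VT class of Figure 24.10 with plain in-memory objects. It is an illustration only: the helper names set_attribute and within_lifespan are hypothetical, only the salary and department histories are modeled, date.max stands in for now, and day granularity is assumed.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Any, List, Tuple

NOW = date.max               # assumption: stand-in for "now"
ONE_DAY = timedelta(days=1)  # assumption: day granularity


@dataclass
class AttrVersion:
    """One version of a single attribute, with its valid time period."""
    valid_start_time: date
    valid_end_time: date
    value: Any


@dataclass
class EmployeeVT:
    name: str
    ssn: str
    lifespan: List[Tuple[date, date]]  # one or more valid time periods
    sal_history: List[AttrVersion] = field(default_factory=list)
    dept_history: List[AttrVersion] = field(default_factory=list)


def set_attribute(history: List[AttrVersion], value: Any, effective: date) -> None:
    """Close the current version of this attribute only, then append a new one."""
    if history:
        history[-1].valid_end_time = effective - ONE_DAY
    history.append(AttrVersion(effective, NOW, value))


def within_lifespan(emp: EmployeeVT) -> bool:
    """Check that every attribute period lies inside some lifespan period."""
    return all(
        any(start <= v.valid_start_time and v.valid_end_time <= end
            for start, end in emp.lifespan)
        for history in (emp.sal_history, emp.dept_history)
        for v in history
    )


smith = EmployeeVT("Smith", "123456789", [(date(2002, 6, 15), NOW)])
set_attribute(smith.sal_history, 25000, date(2002, 6, 15))
set_attribute(smith.dept_history, 5, date(2002, 6, 15))
set_attribute(smith.sal_history, 30000, date(2003, 6, 1))  # only salary changes
```

Here the salary history grows to two versions while the department history still has one, which is the asynchronous change the text describes. The lifespan check is written as a separate function so that it could be run after every update.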
Mechanisms similar to those discussed earlier for updating tuple versions can be applied to updating attribute versions.

24.2.4 Temporal Querying Constructs and the TSQL2 Language

So far, we have discussed how data models may be extended with temporal constructs. We now give a brief overview of how query operations need to be extended for temporal querying. Then we briefly discuss the TSQL2 language, which extends SQL for querying valid time, transaction time, and bitemporal relational databases.

In nontemporal relational databases, the typical selection conditions involve attribute conditions, and tuples that satisfy these conditions are selected from the set of current tuples. Following that, the attributes of interest to the query are specified by a projection operation (see Chapter 5). For example, in the query to retrieve the names of all employees working in department 5 whose salary is greater than 30000, the selection condition would be:

((SALARY > 30000) AND (DNO = 5))

The projected attribute would be NAME. In a temporal database, the conditions may involve time in addition to attributes. A pure time condition involves only time, for example, to select all employee tuple versions that were valid on a certain time point T or that were valid during a certain time period [T1, T2]. In this case, the specified time period is compared with the valid time period of each tuple version [T.VST, T.VET], and only those tuples that satisfy the condition are selected. In these operations, a period is considered to be equivalent to the set of time points from T1 to T2 inclusive, so the standard set comparison operations can be used. Additional operations, such as whether one time period ends before another starts, are also needed.²² Some of the more common operations used in queries are as follows:

[t.VST, t.VET] INCLUDES [t1, t2]        Equivalent to t1 ≥ t.VST AND t2 ≤ t.VET
[t.VST, t.VET] INCLUDED_IN [t1, t2]     Equivalent to t1 ≤ t.VST AND t2 ≥ t.VET
[t.VST, t.VET] OVERLAPS [t1, t2]        Equivalent to (t1 ≤ t.VET AND t2 ≥ t.VST) ²³
[t.VST, t.VET] BEFORE [t1, t2]          Equivalent to t1 ≥ t.VET
[t.VST, t.VET] AFTER [t1, t2]           Equivalent to t2 ≤ t.VST
[t.VST, t.VET] MEETS_BEFORE [t1, t2]    Equivalent to t1 = t.VET + 1 ²⁴
[t.VST, t.VET] MEETS_AFTER [t1, t2]     Equivalent to t2 + 1 = t.VST

22. A complete set of operations, known as Allen's algebra, has been defined for comparing time periods.

23. This operation returns true if the intersection of the two periods is not empty; it has also been called INTERSECTS_WITH.

24. Here, 1 (one) refers to one time point in the specified granularity. The MEETS operations basically specify if one period starts immediately after the other period ends.

In addition, operations are needed to manipulate time periods, such as computing the union or intersection of two time periods. The results of these operations may not themselves be periods, but rather temporal elements: a collection of one or more disjoint time periods such that no two time periods in a temporal element are directly adjacent. That is, for any two time periods [T1, T2] and [T3, T4] in a temporal element, the following three conditions must hold:

• [T1, T2] intersection [T3, T4] is empty.
• T3 is not the time point following T2 in the given granularity.
• T1 is not the time point following T4 in the given granularity.

The latter conditions are necessary to ensure unique representations of temporal elements. If two time periods [T1, T2] and [T3, T4] are adjacent, they are combined into a single time period [T1, T4]. This is called coalescing of time periods. Coalescing also combines intersecting time periods.
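Several of the period comparisons listed above, together with coalescing, can be sketched in Python as follows. This is a minimal illustration rather than a TSQL2 implementation: a period is assumed to be a pair (start, end) of dates with both ends included, the granularity is assumed to be one day, and the function names are simply lowercase versions of the operation names.

```python
from datetime import date
from typing import List, Tuple

Period = Tuple[date, date]  # closed period [start, end], day granularity


def includes(p: Period, q: Period) -> bool:
    """p INCLUDES q: q starts no earlier and ends no later than p."""
    return q[0] >= p[0] and q[1] <= p[1]


def included_in(p: Period, q: Period) -> bool:
    """p INCLUDED_IN q: the mirror image of INCLUDES."""
    return q[0] <= p[0] and q[1] >= p[1]


def overlaps(p: Period, q: Period) -> bool:
    """p OVERLAPS q: the two periods share at least one time point."""
    return q[0] <= p[1] and q[1] >= p[0]


def meets_before(p: Period, q: Period) -> bool:
    """p MEETS_BEFORE q: q starts on the time point right after p ends."""
    return (q[0] - p[1]).days == 1


def coalesce(periods: List[Period]) -> List[Period]:
    """Merge intersecting or adjacent periods into a temporal element."""
    result: List[Period] = []
    for start, end in sorted(periods):
        # Adjacent (gap of one day) or intersecting (gap <= 0): merge.
        if result and (start - result[-1][1]).days <= 1:
            result[-1] = (result[-1][0], max(result[-1][1], end))
        else:
            result.append((start, end))
    return result


# Periods taken from the Wong and Brown versions in Figure 24.8.
brown = (date(2001, 5, 1), date(2002, 8, 10))
year_2002 = (date(2002, 1, 1), date(2002, 12, 31))
wong_closed = [(date(2001, 2, 1), date(2002, 3, 31)),
               (date(1999, 8, 20), date(2001, 1, 31))]
wong_coalesced = coalesce(wong_closed)
```

The two closed Wong periods are adjacent, so coalescing yields the single period from 1999-08-20 to 2002-03-31. Differences between dates are used instead of adding one day to an end point, so the functions also work when an end point is the largest representable date.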
To illustrate how pure time conditions can be used, suppose a user wants to select all employee versions that were valid at any point during 2002. The appropriate selection condition applied to the relation in Figure 24.8 would be

[T.VST, T.VET] OVERLAPS [2002-01-01, 2002-12-31]

Typically, most temporal selections are applied to the valid time dimension. For a bitemporal database, one usually applies the conditions to the currently correct tuples with uc as their transaction end times. However, if the query needs to be applied to a previous database state, an AS_OF T clause is appended to the query, which means that the query is applied to the valid time tuples that were correct in the database at time T.

In addition to pure time conditions, other selections involve attribute and time conditions. For example, suppose we wish to retrieve all EMP_VT tuple versions T for employees who worked in department 5 at any time during 2002. In this case, the condition is

([T.VST, T.VET] OVERLAPS [2002-01-01, 2002-12-31]) AND (T.DNO = 5)

Finally, we give a brief overview of the TSQL2 query language, which extends SQL with constructs for temporal databases. The main idea behind TSQL2 is to allow users to specify whether a relation is nontemporal (that is, a standard SQL relation) or temporal. The CREATE TABLE statement is extended with an optional AS clause to allow users to declare different temporal options. The following options are available:

• AS VALID STATE (valid time relation with valid time period)
• AS VALID EVENT (valid time relation with valid time point)
• AS TRANSACTION (transaction time relation with transaction time period)
• AS VALID STATE AND TRANSACTION (bitemporal relation, valid time period)
• AS VALID EVENT AND TRANSACTION (bitemporal relation, valid time point)

The keywords STATE and EVENT are used to specify whether a time period or time point is associated with the valid time dimension.
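The combined attribute and time condition given earlier in this section, with and without an AS_OF time point on a bitemporal relation, can be sketched in Python as follows. This is not TSQL2 syntax and not a definitive implementation: the class EmpBT, the function select, and the three Smith tuples with their timestamps are illustrative assumptions, date.max and datetime.max stand in for now and uc, and a tuple is assumed to be correct from its TST up to just before its TET.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional, Tuple

NOW = date.max     # assumption: stand-in for the valid time value "now"
UC = datetime.max  # assumption: stand-in for "uc" (until changed)


@dataclass(frozen=True)
class EmpBT:
    """One tuple version of a bitemporal employee relation (hypothetical)."""
    name: str
    salary: int
    dno: int
    vst: date
    vet: date
    tst: datetime
    tet: datetime = UC


def select(rel: List[EmpBT], period: Tuple[date, date], dno: int,
           as_of: Optional[datetime] = None) -> List[EmpBT]:
    """Versions in department dno whose valid time OVERLAPS period.

    Without as_of, only currently correct tuples (TET = uc) are used;
    with as_of, only tuples that were correct in the database at that time.
    """
    t1, t2 = period
    result = []
    for v in rel:
        if as_of is None:
            correct = v.tet == UC
        else:
            correct = v.tst <= as_of < v.tet
        if correct and v.dno == dno and t1 <= v.vet and t2 >= v.vst:
            result.append(v)
    return result


# Illustrative tuples: Smith is entered, then a salary update is recorded.
T_ENTER = datetime(2002, 6, 8, 13, 5, 58)
T_UPDATE = datetime(2003, 6, 4, 8, 56, 12)
emp_bt = [
    EmpBT("Smith", 25000, 5, date(2002, 6, 15), NOW, T_ENTER, T_UPDATE),
    EmpBT("Smith", 25000, 5, date(2002, 6, 15), date(2003, 5, 31), T_UPDATE),
    EmpBT("Smith", 30000, 5, date(2003, 6, 1), NOW, T_UPDATE),
]
year_2003 = (date(2003, 1, 1), date(2003, 12, 31))
current = select(emp_bt, year_2003, dno=5)
earlier = select(emp_bt, year_2003, dno=5, as_of=datetime(2003, 1, 1))
```

Against the current state the query returns the two uc tuples, whereas as of January 1, 2003, it returns the single tuple that the database held at that time, before the salary update was recorded.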
In TSQL2, rather than have the user actually see how the temporal tables are implemented (as we discussed in the previous sections), the TSQL2 language adds query language constructs to specify various types of temporal selections, temporal projections, temporal aggregations, transformation among granularities, and many other concepts. The book by Snodgrass et al. (1995) describes the language.

24.2.5 Time Series Data

Time series data is used very often in financial, sales, and economics applications. It involves data values that are recorded according to a specific predefined sequence of time points. It is hence a special type of valid event data, where the event time points are predetermined according to a fixed calendar. Consider the example of closing daily stock prices of a particular company on the New York Stock Exchange. The granularity here is day, but the days that the stock market is open are known (nonholiday weekdays). Hence, it has been common to specify a computational procedure that calculates the particular calendar associated with a time series. Typical queries on time series involve temporal aggregation over higher granularity intervals, for example, finding the average or maximum weekly closing stock price or the maximum and minimum monthly closing stock price from the daily information.

As another example, consider the daily sales dollar amount at each store of a chain of stores owned by a particular company. Again, typical temporal aggregates would be retrieving the weekly, monthly, or yearly sales from the daily sales information (using the sum aggregate function), or comparing same store monthly sales with previous monthly sales, and so on.

Because of the specialized nature of time series data, and the lack of support in older DBMSs, it has been common to use specialized time series management systems rather than general purpose DBMSs for managing such information.
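The temporal aggregation described above, from day granularity to week granularity, can be sketched in Python. The closing prices below are made-up values for illustration, the function name weekly_aggregates is hypothetical, and ISO calendar weeks are assumed as the weekly granularity.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

# Made-up daily closing prices, recorded only on days the market is open.
closing: Dict[date, float] = {
    date(2003, 6, 2): 10.0,
    date(2003, 6, 3): 11.0,
    date(2003, 6, 4): 12.0,
    date(2003, 6, 5): 11.5,
    date(2003, 6, 6): 10.5,
    date(2003, 6, 9): 12.5,
    date(2003, 6, 10): 13.5,
}


def weekly_aggregates(series: Dict[date, float]) -> Dict[Tuple[int, int], Dict[str, float]]:
    """Group daily values by (ISO year, ISO week) and aggregate each group."""
    groups: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for day, value in series.items():
        iso_year, iso_week = day.isocalendar()[:2]
        groups[(iso_year, iso_week)].append(value)
    return {
        week: {"avg": sum(vals) / len(vals), "max": max(vals), "min": min(vals)}
        for week, vals in sorted(groups.items())
    }


weekly = weekly_aggregates(closing)
```

Replacing the grouping key with (day.year, day.month) would give the monthly aggregates in the same way, and using sum instead of the average would cover the store sales example.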
In specialized time series management systems, it has been common to store time series values in sequential order in a file, and apply specialized time series procedures to analyze the information. The problem with this approach is that the full power of high-level querying in languages such as SQL will not be available in such systems.

More recently, some commercial DBMS packages are offering time series extensions, such as the time series datablade of Informix Universal Server (see Chapter 22). In addition, the TSQL2 language provides some support for time series in the form of event tables.

24.3 MULTIMEDIA DATABASES

Because the two topics discussed in this section are very broad, we can give only a very brief introduction to these fields. Section 24.3.1 introduces spatial databases, and Section 24.3.2 briefly discusses multimedia databases.

24.3.1 Introduction to Spatial Database Concepts

Spatial databases provide concepts for databases that keep track of objects in a multidimensional space. For example, cartographic databases that store maps include two-dimensional spatial descriptions of their objects, from countries and states to rivers, cities, roads, seas, and so on. These applications are also known as Geographical Information Systems (GIS), and are used in areas such as environmental, emergency, and battle management. Other databases, such as meteorological databases for weather information, are three-dimensional, since temperatures and other meteorological information are related to three-dimensional spatial points. In general, a spatial database stores objects that have spatial characteristics that describe them. The spatial relationships among the objects are important, and they are often needed when querying the database. Although a spatial database can in general refer to an n-dimensional space for any n, we will limit our discussion to two dimensions as an illustration.
The main extensions that are needed for spatial databases are models that can interpret spatial characteristics. In addition, special indexing and storage structures are often needed to improve performance. Let us first discuss some of the model extensions for two-dimensional spatial databases. The basic extensions needed are to include two-dimensional geometric concepts, such as points, lines and line segments, circles, polygons, and arcs, in order to specify the spatial characteristics of objects. In addition, spatial operations are needed to operate on the objects' spatial characteristics (for example, to compute the distance between two objects), as well as spatial Boolean conditions (for example, to check whether two objects spatially overlap).

To illustrate, consider a database that is used for emergency management applications. A description of the spatial positions of many types of objects would be needed. Some of these objects generally have static spatial characteristics, such as streets and highways, water pumps (for fire control), police stations, fire stations, and hospitals. Other objects have dynamic spatial characteristics that change over time, such as police vehicles, ambulances, or fire trucks.

The following categories illustrate three typical types of spatial queries:

• Range query: Finds the objects of a particular type that are within a given spatial area or within a particular distance from a given location. (For example, finds all hospitals within the Dallas city area, or finds all ambulances within five miles of an accident location.)
• Nearest neighbor query: Finds an object of a particular type that is closest to a given location. (For example, finds the police car that is closest to a particular location.)
• Spatial joins or overlays: Typically joins the objects of two types based on some spatial condition, such as the objects intersecting or overlapping spatially or being within a certain distance of one another.
(For example, finds all cities that fall on a major highway or finds all homes that are within two miles of a lake.)

For these and other types of spatial queries to be answered efficiently, special techniques for spatial indexing are needed. One of the best known techniques is the use of R-trees and their variations. R-trees group together objects that are in close spatial physical proximity on the same leaf nodes of a tree-structured index. Since a leaf node can point to only a certain number of objects, algorithms for dividing the space into rectangular subspaces that include the objects are needed. Typical criteria for dividing the space include minimizing the rectangle areas, since this would lead to a quicker narrowing of the search space. Problems such as having objects with overlapping spatial areas are handled in different ways by the many different variations of R-trees. The internal nodes of R-trees are associated with rectangles whose area covers all the rectangles in their subtrees. Hence, R-trees can easily answer queries such as finding all objects in a given area, by limiting the tree search to those subtrees whose rectangles intersect with the area given in the query.

Other spatial storage structures include quadtrees and their variations. Quadtrees generally divide each space or subspace into equally sized areas, and proceed with the subdivisions of each subspace to identify the positions of various objects. Recently, many newer spatial access structures have been proposed, and this remains an active research area.

24.3.2 Introduction to Multimedia Database Concepts

Multimedia databases provide features that allow users to store and query different types of multimedia information, which includes images (such as photos or drawings), video clips (such as movies, newsreels, or home videos), audio clips (such as songs, phone messages, or speeches), and documents (such as books or articles).
The main types of database queries that are needed involve locating multimedia sources that contain certain objects of interest. For example, one may want to locate all video clips in a video database that include a certain person in them, say Bill Clinton. One may also want to retrieve video clips based on certain activities included in them, such as video clips in which a goal is scored in a soccer game by a certain player or team.

The above types of queries are referred to as content-based retrieval, because the multimedia source is being retrieved based on its containing certain objects or activities. Hence, a multimedia database must use some model to organize and index the multimedia sources based on their contents. Identifying the contents of multimedia sources is a difficult and time-consuming task. There are two main approaches. The first is based on automatic analysis of the multimedia sources to identify certain mathematical characteristics of their contents. This approach uses different techniques depending on the type of multimedia source (image, text, video, or audio). The second approach depends on manual identification of the objects and activities of interest in each multimedia source and on using this information to index the sources. This approach can be applied to all the different multimedia sources, but it requires a manual preprocessing phase where a person has to scan each multimedia source to identify and catalog the objects and activities it contains so that they can be used to index these sources.

In the remainder of this section, we will very briefly discuss some of the characteristics of each type of multimedia source: images, video, audio, and text sources, in that order.

An image is typically stored either in raw form as a set of pixel or cell values, or in compressed form to save space. The image shape descriptor describes the geometric shape of the raw image, which is typically a rectangle of cells of a certain width and height.
Hence, each image can be represented by an m by n grid of cells. Each cell contains a pixel value that describes the cell content. In black/white images, pixels can be one bit. In gray scale or color images, a pixel is multiple bits. Because images may require large amounts of space, they are often stored in compressed form. Compression standards, such as GIF or JPEG, use various mathematical transformations to reduce the number of cells stored but still maintain the main image characteristics. The mathematical transforms that can be used include Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), and wavelet transforms.

To identify objects of interest in an image, the image is typically divided into homogeneous segments using a homogeneity predicate. For example, in a color image, cells that are adjacent to one another and whose pixel values are close are grouped into a segment. The homogeneity predicate defines the conditions for how to automatically group those cells. Segmentation and compression can hence identify the main characteristics of an image.

A typical image database query would be to find images in the database that are similar to a given image. The given image could be an isolated segment that contains, say, a pattern of interest, and the query is to locate other images that contain that same pattern. There are two main techniques for this type of search. The first approach uses a distance function to compare the given image with the stored images and their segments. If the distance value returned is small, the probability of a match is high. Indexes can be created to group together stored images that are close in the distance metric so as to limit the search space. The second approach, called the transformation approach, measures image similarity by having a small number of transformations that can transform one image's cells to match the other image.
Transformations include rotations, translations, and scaling. Although the latter approach is more general, it is also more time consuming and difficult.

A video source is typically represented as a sequence of frames, where each frame is a still image. However, rather than identifying the objects and activities in every individual frame, the video is divided into video segments, where each segment is made up of a sequence of contiguous frames that includes the same objects/activities. Each segment is identified by its starting and ending frames. The objects and activities identified in each video segment can be used to index the segments. An indexing technique called frame segment trees has been proposed for video indexing. The index includes both objects, such as persons, houses, and cars, and activities, such as a person delivering a speech or two people talking. Videos are also often compressed using standards such as MPEG.

A text/document source is basically the full text of some article, book, or magazine. These sources are typically indexed by identifying the keywords that appear in the text and their relative frequencies. However, filler words are eliminated from that process. Because there could be too many keywords when attempting to index a collection of documents, techniques have been developed to reduce the number of keywords to those that are most relevant to the collection. A technique called singular value decomposition (SVD), which is based on matrix transformations, can be used for this purpose. An indexing technique called telescoping vector trees, or TV-trees, can then be used to group similar documents together.

Audio sources include stored recorded messages, such as speeches, class presentations, or even surveillance recording of phone messages or conversations by law enforcement. Here, discrete transforms can be used to identify the main characteristics of a certain person's voice in order to have similarity based indexing and retrieval.
Audio characteristic features include loudness, intensity, pitch, and clarity.

24.4 INTRODUCTION TO DEDUCTIVE DATABASES

24.4.1 Overview of Deductive Databases

In a deductive database system, we typically specify rules through a declarative language, that is, a language in which we specify what to achieve rather than how to achieve it. An inference engine (or deduction mechanism) within the system can deduce new facts from the database by interpreting these rules. The model used for deductive databases is closely related to the relational data model, and particularly to the domain relational calculus formalism (see Section 6.6). It is also related to the field of logic programming and the Prolog language. The deductive database work based on logic has used Prolog as a starting point. A variation of Prolog called Datalog is used to define rules declaratively in conjunction with an existing set of relations, which are themselves treated as literals in the language. Although the language structure of Datalog resembles that of Prolog, its operational semantics (that is, how a Datalog program is to be executed) is still different.

A deductive database uses two main types of specifications: facts and rules. Facts are specified in a manner similar to the way relations are specified, except that it is not necessary to include the attribute names. Recall that a tuple in a relation describes some real-world fact whose meaning is partly determined by the attribute names. In a deductive database, the meaning of an attribute value in a tuple is determined solely by its position within the tuple. Rules are somewhat similar to relational views. They specify virtual relations that are not actually stored but that can be formed from the facts by applying inference mechanisms based on the rule specifications.
The main difference between rules and views is that rules may involve recursion and hence may yield virtual relations that cannot be defined in terms of basic relational views.

The evaluation of Prolog programs is based on a technique called backward chaining, which involves a top-down evaluation of goals. In the deductive databases that use Datalog, attention has been devoted to handling large volumes of data stored in a relational database. Hence, evaluation techniques have been devised that resemble those for a bottom-up evaluation. Prolog suffers from the limitation that the order of specification of facts and rules is significant in evaluation; moreover, the order of literals (defined later in Section 24.4.3) within a rule is significant. The execution techniques for Datalog programs attempt to circumvent these problems.

24.4.2 Prolog/Datalog Notation

The notation used in Prolog/Datalog is based on providing predicates with unique names. A predicate has an implicit meaning, which is suggested by the predicate name, and a fixed number of arguments. If the arguments are all constant values, the predicate simply states that a certain fact is true. If, on the other hand, the predicate has variables as arguments, it is either considered as a query or as part of a rule or constraint. Throughout this chapter, we adopt the Prolog convention that all constant values in a predicate are either numeric or character strings; they are represented as identifiers (or names) starting with lowercase letters only, whereas variable names always start with an uppercase letter.

Consider the example shown in Figure 24.11, which is based on the relational database of Figure 5.6, but in a much simplified form. There are three predicate names: supervise, superior, and subordinate. The supervise predicate is defined via a set of facts, each of which has two arguments: a supervisor name, followed by the name of a direct supervisee (subordinate) of that supervisor. These facts correspond to the actual data that is stored in the database, and they can be considered as constituting a set of tuples in a relation SUPERVISE with two attributes whose schema is

SUPERVISE(Supervisor, Supervisee)

Thus, supervise(X, Y) states the fact that "X supervises Y." Notice the omission of the attribute names in the Prolog notation. Attribute names are only represented by virtue of the position of each argument in a predicate: the first argument represents the supervisor, and the second argument represents a direct subordinate.

The other two predicate names are defined by rules. The main contribution of deductive databases is the ability to specify recursive rules, and to provide a framework for inferring new information based on the specified rules. A rule is of the form head :- body, where :- is read as "if." A rule usually has a single predicate to the left of the :- symbol, called the head or left-hand side (LHS) or conclusion of the rule, and one or more predicates to the right of the :- symbol, called the body or right-hand side (RHS) or premise(s) of the rule. A predicate with constants as arguments is said to be ground; we also refer to it as an instantiated predicate. The arguments of the predicates that appear in a rule typically include a number of variable symbols, although predicates can also contain constants as arguments. A rule specifies that, if a particular assignment or binding of constant values to the variables in the body (RHS predicates) makes all the RHS predicates true, it also makes the head (LHS predicate) true by using the same assignment of constant values to variables. Hence, a rule provides us with a way of generating new facts that are instantiations of the head of the rule. These new facts are based on facts that already exist, corresponding to the instantiations (or bindings) of predicates in the body of the rule. Notice that by listing multiple predicates in the body of a rule we implicitly apply the logical and operator to these predicates. Hence, the commas between the RHS predicates may be read as meaning "and."

(a) Facts
    supervise(franklin, john).
    supervise(franklin, ramesh).
    supervise(franklin, joyce).
    supervise(jennifer, alicia).
    supervise(jennifer, ahmad).
    supervise(james, franklin).
    supervise(james, jennifer).

    Rules
    superior(X, Y) :- supervise(X, Y).
    superior(X, Y) :- supervise(X, Z), superior(Z, Y).
    subordinate(X, Y) :- superior(Y, X).

    Queries
    superior(james, Y)?
    superior(james, joyce)?

(b) james
      franklin
        john
        ramesh
        joyce
      jennifer
        alicia
        ahmad

FIGURE 24.11 (a) Prolog notation. (b) The supervisory tree.

Consider the definition of the predicate superior in Figure 24.11, whose first argument is an employee name and whose second argument is an employee who is either a direct or an indirect subordinate of the first employee. By indirect subordinate, we mean the subordinate of some subordinate down to any number of levels. Thus superior(X, Y) stands for the fact that "X is a superior of Y" through direct or indirect supervision. We can write two rules that together specify the meaning of the new predicate. The first rule under Rules in the figure states that, for every value of X and Y, if supervise(X, Y) (the rule body) is true, then superior(X, Y) (the rule head) is also true, since Y would be a direct subordinate of X (at one level down). This rule can be used to generate all direct superior/subordinate relationships from the facts that define the supervise predicate. The second rule states that, if supervise(X, Z) and superior(Z, Y) are both true, then superior(X, Y) is also true. This is an example of a recursive rule, where one of the rule body predicates in the RHS is the same as the rule head predicate in the LHS.
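The effect of the two superior rules can be illustrated with a short Python sketch that derives every superior fact from the supervise facts of Figure 24.11. The sketch applies the rules repeatedly, bottom-up, until no new facts appear; this is a simple stand-in chosen for illustration and is not how a Prolog system evaluates rules. The function name superior_closure is hypothetical.

```python
from typing import Set, Tuple

Fact = Tuple[str, str]

# The supervise facts of Figure 24.11 as (supervisor, supervisee) pairs.
supervise: Set[Fact] = {
    ("franklin", "john"), ("franklin", "ramesh"), ("franklin", "joyce"),
    ("jennifer", "alicia"), ("jennifer", "ahmad"),
    ("james", "franklin"), ("james", "jennifer"),
}


def superior_closure(supervise: Set[Fact]) -> Set[Fact]:
    """Derive all superior(X, Y) facts from the supervise facts."""
    # Rule 1: superior(X, Y) :- supervise(X, Y).
    superior = set(supervise)
    while True:
        # Rule 2: superior(X, Y) :- supervise(X, Z), superior(Z, Y).
        new = {(x, y)
               for (x, z) in supervise
               for (w, y) in superior
               if z == w} - superior
        if not new:  # no new facts: a fixpoint has been reached
            return superior
        superior |= new


superior = superior_closure(supervise)

# subordinate(X, Y) :- superior(Y, X).
subordinate = {(y, x) for (x, y) in superior}

# The two queries of Figure 24.11.
subordinates_of_james = sorted(y for (x, y) in superior if x == "james")
james_over_joyce = ("james", "joyce") in superior
```

The first query yields all seven other employees, since james is at the root of the supervisory tree, and the second query is true. Because the facts form a finite tree, the loop is guaranteed to stop after a few rounds.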
In general, the rule body defines a number of premises such that, if they are all true, we can deduce that the conclusion in the rule head is also true. Notice that, if we have two (or more) rules with the same head (LHS predicate), it is equivalent to saying that the predicate is true (that is, that it can be instantiated) if either one of the bodies is true; hence, it is equivalent to a logical or operation. For example, if we have two rules X :- Y and X :- Z, they are equivalent to a rule X :- Y or Z. The latter form is not used in deductive systems, however, because it is not in the standard form of rule, called a Horn clause, as we discuss in Section 24.4.4.

A Prolog system contains a number of built-in predicates that the system can interpret directly. These typically include the equality comparison operator =(X, Y), which returns true if X and Y are identical and can also be written as X=Y by using the standard infix notation.²⁵ Other comparison operators for numbers, such as <, <=, >, and >=, can be treated as binary predicates. Arithmetic functions such as +, -, *, and / can be used as arguments in predicates in Prolog. In contrast, Datalog (in its basic form) does not allow functions such as arithmetic operations as arguments; indeed, this is one of the main differences between Prolog and Datalog. However, later extensions to Datalog have been proposed to include functions.

25. A Prolog system typically has a number of different equality predicates that have different interpretations.

A query typically involves a predicate symbol with some variable arguments, and its meaning (or "answer") is to deduce all the different constant combinations that, when bound (assigned) to the variables, can make the predicate true. For example, the first query in Figure 24.11 requests the names of all subordinates of "james" at any level. A different type of query, which has only constant symbols as arguments, returns either a true or a false result, depending on whether the arguments provided can be deduced from the facts and rules. For example, the second query in Figure 24.11 returns true, since superior(james, joyce) can be deduced.

24.4.3 Datalog Notation

In Datalog, as in other logic-based languages, a program is built from basic objects called atomic formulas. It is customary to define the syntax of logic-based languages by describing the syntax of atomic formulas and identifying how they can be combined to form a program.

In Datalog, atomic formulas are literals of the form p(a1, a2, ..., an), where p is the predicate name and n is the number of arguments for predicate p. Different predicate symbols can have different numbers of arguments, and the number of arguments n of predicate p is sometimes called the arity or degree of p. The arguments can be either constant values or variable names. As mentioned earlier, we use the convention that constant values either are numeric or start with a lowercase character, whereas variable names always start with an uppercase character.

A number of built-in predicates are included in Datalog, which can also be used to construct atomic formulas. The built-in predicates are of two main types: the binary comparison predicates < (less), <= (less_or_equal), > (greater), and >= (greater_or_equal) over ordered domains; and the comparison predicates = (equal) and /= (not_equal) over ordered or unordered domains. These can be used as binary predicates with the same functional syntax as other predicates (for example, by writing less(X, 3)), or they can be specified by using the customary infix notation X<3. Notice that, because the domains of these predicates are potentially infinite, they should be used with care in rule definitions. For example, the predicate greater(X, 3), if used alone, generates an infinite set of values for X that satisfy the predicate (all integer numbers greater than 3).

A literal is either an atomic formula as defined earlier, called a positive literal, or an atomic formula preceded by not. The latter is a negated atomic formula, called a negative literal. Datalog programs can be considered to be a subset of the predicate calculus formulas, which are somewhat similar to the formulas of the domain relational calculus (see Section 6.7). In Datalog, however, these formulas are first converted into what is known as clausal form before they are expressed in Datalog; and only formulas given in a restricted clausal form, called Horn clauses,²⁶ can be used in Datalog.

26. Named after the mathematician Alfred Horn.

24.4.4 Clausal Form and Horn Clauses

Recall from Section 6.6 that a formula in the relational calculus is a condition that includes predicates called atoms (based on relation names). In addition, a formula can have quantifiers, namely, the universal quantifier (for all) and the existential quantifier (there exists). In clausal form, a formula must be transformed into another formula with the following characteristics:

• All variables in the formula are universally quantified. Hence, it is not necessary to include the universal quantifiers (for all) explicitly; the quantifiers are removed, and all variables in the formula are implicitly quantified by the universal quantifier.
• In clausal form, the formula is made up of a number of clauses, where each clause is composed of a number of literals connected by OR logical connectives only. Hence, each clause is a disjunction of literals.
• The clauses themselves are connected by AND logical connectives only, to form a formula. Hence, the clausal form of a formula is a conjunction of clauses.

It can be shown that any formula can be converted into clausal form. For our purposes, we are mainly interested in the form of the individual clauses, each of which is a disjunction of literals. Recall that literals can be positive literals or negative literals. Consider a clause of the form:

(1) not(P1) OR not(P2) OR ... OR not(Pn) OR Q1 OR Q2 OR ... OR Qm

This clause has n negative literals and m positive literals. Such a clause can be transformed into the following equivalent logical formula:

(2) P1 AND P2 AND ... AND Pn => Q1 OR Q2 OR ... OR Qm

where => is the implies symbol. The formulas (1) and (2) are equivalent, meaning that their truth values are always the same. This is the case because, if all the Pi literals (i = 1, 2, ..., n) are true, the formula (2) is true only if at least one of the Qj's is true, which is the meaning of the => (implies) symbol. For formula (1), if all the Pi literals (i = 1, 2, ..., n) are true, their negations are all false; so in this case formula (1) is true only if at least one of the Qj's is true.

In Datalog, rules are expressed as a restricted form of clauses called Horn clauses, in which a clause can contain at most one positive literal. Hence, a Horn clause is either of the form

(3) not(P1) OR not(P2) OR ... OR not(Pn) OR Q

or of the form

(4) not(P1) OR not(P2) OR ... OR not(Pn)

The Horn clause in (3) can be transformed into the clause

(5) P1 AND P2 AND ... AND Pn => Q

which is written in Datalog as the following rule:

(6) Q :- P1, P2, ..., Pn.

The Horn clause in (4) can be transformed into

(7) P1 AND P2 AND ... AND Pn =>

which is written in Datalog as follows:

(8) P1, P2, ..., Pn.

A Datalog rule, as in (6), is hence a Horn clause, and its meaning, based on formula (5), is that if the predicates P1 and P2 and ... and Pn are all true for a particular binding to their variable arguments, then Q is also true and can hence be inferred. The Datalog expression (8) can be considered as an integrity constraint, where all the predicates must be true to satisfy the query.

In general, a query in Datalog consists of two components:

• A Datalog program, which is a finite set of rules.
• A literal P(X1, X2, ...
, Xn), where each Xi is a variable or a constant.

A Prolog or Datalog system has an internal inference engine that can be used to process and compute the results of such queries. Prolog inference engines typically return one result to the query (that is, one set of values for the variables in the query) at a time and must be prompted to return additional results. In contrast, Datalog returns results set-at-a-time.

24.4.5 Interpretations of Rules

There are two main alternatives for interpreting the theoretical meaning of rules: proof-theoretic and model-theoretic. In practical systems, the inference mechanism within a system defines the exact interpretation, which may not coincide with either of the two theoretical interpretations. The inference mechanism is a computational procedure and hence provides a computational interpretation of the meaning of rules. In this section, we first discuss the two theoretical interpretations. Inference mechanisms are then discussed briefly as a way of defining the meaning of rules.

In the proof-theoretic interpretation of rules, we consider the facts and rules to be true statements, or axioms. Ground axioms contain no variables. The facts are ground axioms that are given to be true. Rules are called deductive axioms, since they can be used to deduce new facts. The deductive axioms can be used to construct proofs that derive new facts from existing facts. For example, Figure 24.12 shows how to prove the fact superior(james, ahmad) from the rules and facts given in Figure 24.11. The proof-theoretic interpretation gives us a procedural or computational approach for computing an answer to the Datalog query. The process of proving whether a certain fact (theorem) holds is known as theorem proving.

1. superior(X,Y) :- supervise(X,Y).                    (rule 1)
2. superior(X,Y) :- supervise(X,Z), superior(Z,Y).     (rule 2)
3. supervise(jennifer,ahmad).                          (ground axiom, given)
4. supervise(james,jennifer).                          (ground axiom, given)
5. superior(jennifer,ahmad).                           (apply rule 1 on 3)
6. superior(james,ahmad).                              (apply rule 2 on 4 and 5)

FIGURE 24.12 Proving a new fact.

The second type of interpretation is called the model-theoretic interpretation. Here, given a finite or an infinite domain of constant values,²⁷ we assign to a predicate every possible combination of values as arguments. We must then determine whether the predicate is true or false. In general, it is sufficient to specify the combinations of arguments that make the predicate true, and to state that all other combinations make the predicate false. If this is done for every predicate, it is called an interpretation of the set of predicates. For example, consider the interpretation shown in Figure 24.13 for the predicates supervise and superior. This interpretation assigns a truth value (true or false) to every possible combination of argument values (from a finite domain) for the two predicates.

27. The most commonly chosen domain is finite and is called the Herbrand Universe.

Rules
superior(X,Y) :- supervise(X,Y).
superior(X,Y) :- supervise(X,Z), superior(Z,Y).

Interpretation

Known Facts:
supervise(franklin,john) is true.
supervise(franklin,ramesh) is true.
supervise(franklin,joyce) is true.
supervise(jennifer,alicia) is true.
supervise(jennifer,ahmad) is true.
supervise(james,franklin) is true.
supervise(james,jennifer) is true.
supervise(X,Y) is false for all other possible (X,Y) combinations.

Derived Facts:
superior(franklin,john) is true.
superior(franklin,ramesh) is true.
superior(franklin,joyce) is true.
superior(jennifer,alicia) is true.
superior(jennifer,ahmad) is true.
superior(james,franklin) is true.
superior(james,jennifer) is true.
superior(james,john) is true.
superior(james,ramesh) is true.
superior(james,joyce) is true.
superior(james,alicia) is true.
superior(james,ahmad) is true.
superior(X,Y) is false for all other possible (X,Y) combinations.

FIGURE 24.13 An interpretation that is a minimal model.

An interpretation is called a model for a specific set of rules if those rules are always true under that interpretation; that is, for any values assigned to the variables in the rules, the head of the rules is true when we substitute the truth values assigned to the predicates in the body of the rule by that interpretation. Hence, whenever a particular substitution (binding) to the variables in the rules is applied, if all the predicates in the body of a rule are true under the interpretation, the predicate in the head of the rule must also be true. The interpretation shown in Figure 24.13 is a model for the two rules shown, since it can never cause the rules to be violated. Notice that a rule is violated if a particular binding of constants to the variables makes all the predicates in the rule body true but makes the predicate in the rule head false. For example, if supervise(a, b) and superior(b, c) are both true under some interpretation, but superior(a, c) is not true, the interpretation cannot be a model for the recursive rule:

superior(X,Y) :- supervise(X,Z), superior(Z,Y)

In the model-theoretic approach, the meaning of the rules is established by providing a model for these rules. A model is called a minimal model for a set of rules if we cannot change any fact from true to false and still get a model for these rules. For example, consider the interpretation in Figure 24.13, and assume that the supervise predicate is defined by a set of known facts, whereas the superior predicate is defined as an interpretation (model) for the rules. Suppose that we add the predicate superior(james, bob) to the true predicates. This remains a model for the rules shown, but it is not a minimal model, since changing the truth value of superior(james, bob) from true to false still provides us with a model for the rules.
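The two definitions can be checked mechanically on this small example. The Python sketch below is an illustration under stated assumptions: an interpretation is represented as the set of facts it makes true, the domain is the finite set of names in Figure 24.13 plus bob, and minimality is tested exactly as worded above, by changing one superior fact at a time from true to false. The names is_model and is_minimal_model are hypothetical helpers, not part of any Datalog system.

```python
from itertools import product

SUPERVISE = {
    ("franklin", "john"), ("franklin", "ramesh"), ("franklin", "joyce"),
    ("jennifer", "alicia"), ("jennifer", "ahmad"),
    ("james", "franklin"), ("james", "jennifer"),
}

# Derived facts listed as true in Figure 24.13.
SUPERIOR = SUPERVISE | {
    ("james", "john"), ("james", "ramesh"), ("james", "joyce"),
    ("james", "alicia"), ("james", "ahmad"),
}

# Finite domain of constants; "bob" is added for the counterexample in the text.
CONSTANTS = {name for pair in SUPERVISE for name in pair} | {"bob"}


def is_model(supervise, superior, constants=CONSTANTS):
    """True if no binding of X, Y, Z violates either superior rule."""
    for x, y in product(constants, repeat=2):
        # Rule 1: superior(X,Y) :- supervise(X,Y).
        if (x, y) in supervise and (x, y) not in superior:
            return False
    for x, y, z in product(constants, repeat=3):
        # Rule 2: superior(X,Y) :- supervise(X,Z), superior(Z,Y).
        if (x, z) in supervise and (z, y) in superior and (x, y) not in superior:
            return False
    return True


def is_minimal_model(supervise, superior, constants=CONSTANTS):
    """True if it is a model and no single superior fact can be made false."""
    if not is_model(supervise, superior, constants):
        return False
    return not any(
        is_model(supervise, superior - {fact}, constants) for fact in superior
    )


with_bob = SUPERIOR | {("james", "bob")}
```

Here is_model(SUPERVISE, with_bob) holds while is_minimal_model(SUPERVISE, with_bob) does not, matching the superior(james, bob) example.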
The model shown in Figure 24.13 is the minimal model for the set of facts that are defined by the supervise predicate.

In general, the minimal model that corresponds to a given set of facts in the model-theoretic interpretation should be the same as the facts generated by the proof-theoretic interpretation for the same original set of ground and deductive axioms. However, this is generally true only for rules with a simple structure. Once we allow negation in the specification of rules, the correspondence between interpretations does not hold. In fact, with negation, numerous minimal models are possible for a given set of facts.

A third approach to interpreting the meaning of rules involves defining an inference mechanism that is used by the system to deduce facts from the rules. This inference mechanism would define a computational interpretation to the meaning of the rules. The Prolog logic programming language uses its inference mechanism to define the meaning of the rules and facts in a Prolog program. Not all Prolog programs correspond to the proof-theoretic or model-theoretic interpretations; it depends on the type of rules in the program. However, for many simple Prolog programs, the Prolog inference mechanism infers the facts that correspond either to the proof-theoretic interpretation or to a minimal model under the model-theoretic interpretation.

24.4.6 Datalog Programs and Their Safety

There are two main methods of defining the truth values of predicates in actual Datalog programs. Fact-defined predicates (or relations) are defined by listing all the combinations of values (the tuples) that make the predicate true. These correspond to base relations whose contents are stored in a database system. Figure 24.14 shows the fact-defined predicates employee, male, female, salary, department, supervise, project, and workson, which correspond to part of the relational database shown in Figure 5.6. Rule-defined predicates (or views) are defined by being the head (LHS) of one or more Datalog rules; they correspond to virtual relations whose contents can be inferred by the inference engine. Figure 24.15 shows a number of rule-defined predicates.

employee(john).
employee(franklin).
employee(alicia).
employee(jennifer).
employee(ramesh).
employee(joyce).
employee(ahmad).
employee(james).

male(john).
male(franklin).
male(ramesh).
male(ahmad).
male(james).

female(alicia).
female(jennifer).
female(joyce).

salary(john,30000).
salary(franklin,40000).
salary(alicia,25000).
salary(jennifer,43000).
salary(ramesh,38000).
salary(joyce,25000).
salary(ahmad,25000).
salary(james,55000).

department(john,research).
department(franklin,research).
department(alicia,administration).
department(jennifer,administration).
department(ramesh,research).
department(joyce,research).
department(ahmad,administration).
department(james,headquarters).

project(productx).
project(producty).
project(productz).
project(computerization).
project(reorganization).
project(newbenefits).

supervise(franklin,john).
supervise(franklin,ramesh).
supervise(franklin,joyce).
supervise(jennifer,alicia).
supervise(jennifer,ahmad).
supervise(james,franklin).
supervise(james,jennifer).

workson(john,productx,32).
workson(john,producty,8).
workson(ramesh,productz,40).
workson(joyce,productx,20).
workson(joyce,producty,20).
workson(franklin,producty,10).
workson(franklin,productz,10).
workson(franklin,computerization,10).
workson(franklin,reorganization,10).
workson(alicia,newbenefits,30).
workson(alicia,computerization,10).
workson(ahmad,computerization,35).
workson(ahmad,newbenefits,5).
workson(jennifer,newbenefits,20).
workson(jennifer,reorganization,15).
workson(james,reorganization,10).

FIGURE 24.14 Fact predicates for part of the database from Figure 5.6.

A program or a rule is said to be safe if it generates a finite set of facts.
The general theoretical problem of determining whether a set of rules is safe is undecidable. However, one can determine the safety of restricted forms of rules. For example, the rules shown in Figure 24.16 are safe. One situation where we get unsafe rules that can generate an infinite number of facts arises when one of the variables in the rule can range over an infinite domain of values, and that variable is not limited to ranging over a finite relation. For example, consider the rule

big_salary(Y) :- Y>60000

Here, we can get an infinite result if Y ranges over all possible integers. But suppose that we change the rule as follows:

big_salary(Y) :- employee(X), salary(X,Y), Y>60000

superior(X,Y) :- supervise(X,Y).
superior(X,Y) :- supervise(X,Z), superior(Z,Y).
subordinate(X,Y) :- superior(Y,X).
supervisor(X) :- employee(X), supervise(X,Y).
over_40K_emp(X) :- employee(X), salary(X,Y), Y>=40000.
under_40K_supervisor(X) :- supervisor(X), not(over_40K_emp(X)).
main_productx_emp(X) :- employee(X), workson(X,productx,Y), Y>=20.
president(X) :- employee(X), not(supervise(Y,X)).

FIGURE 24.15 Rule-defined predicates.

In the second rule, the result is not infinite, since the values that Y can be bound to are now restricted to values that are the salary of some employee in the database, presumably a finite set of values. We can also rewrite the rule as follows:

big_salary(Y) :- Y>60000, employee(X), salary(X,Y)

In this case, the rule is still theoretically safe. However, in Prolog or any other system that uses a top-down, depth-first inference mechanism, the rule creates an infinite loop, since we first search for a value for Y and then check whether it is a salary of an employee. The result is generation of an infinite number of Y values, even though these, after a certain point, cannot lead to a set of true RHS predicates.
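The contrast between the two big_salary rules can be checked syntactically. The Python sketch below is a deliberately conservative simplification, not a complete safety test: it only asks whether every variable of a rule occurs in some regular (non-built-in) body predicate, so that it ranges over a finite stored relation. The rule encoding, a (name, arguments) pair for each predicate, and the helper names are assumptions made for this illustration.

```python
# Built-in comparison predicates range over potentially infinite domains.
BUILT_INS = {"<", "<=", ">", ">=", "=", "/="}


def is_variable(term):
    """Datalog convention: variable names start with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def unrestricted_variables(head, body):
    """Variables of the rule that no regular body predicate binds."""
    bound = {
        term
        for name, args in body
        if name not in BUILT_INS
        for term in args
        if is_variable(term)
    }
    used = {
        term
        for _name, args in [head, *body]
        for term in args
        if is_variable(term)
    }
    return used - bound


# big_salary(Y) :- Y>60000
first_rule = (("big_salary", ("Y",)), [(">", ("Y", 60000))])

# big_salary(Y) :- employee(X), salary(X,Y), Y>60000
second_rule = (
    ("big_salary", ("Y",)),
    [("employee", ("X",)), ("salary", ("X", "Y")), (">", ("Y", 60000))],
)

first_result = unrestricted_variables(*first_rule)    # Y is not restricted
second_result = unrestricted_variables(*second_rule)  # every variable is restricted
```

Because the check ignores the order of the body predicates, it treats the reordered rule the same way as the second rule; whether a particular engine loops on that ordering is a separate question.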
One definition of Datalog considers both rules to be safe, since it does not depend on a particular inference mechanism. Nonetheless, it is generally advisable to write such a rule in the safest form, with the predicates that restrict possible bindings of variables placed first. As another example of an unsafe rule, consider the following rule:

has_something(X,Y) :- employee(X)

Here, an infinite number of Y values can again be generated, since the variable Y appears only in the head of the rule and hence is not limited to a finite set of values. To define safe rules more formally, we use the concept of a limited variable. A variable X is limited in a rule if (1) it appears in a regular (not built-in) predicate in the body of the rule; (2) it appears in a predicate of the form X=c or c=X or (c1<=X and X<=c2) in the rule body, where c, c1, and c2 are constant values; or (3) it appears in a predicate of the form X=Y or Y=X in the rule body, where Y is a limited variable. A rule is said to be safe if all its variables are limited.

24.4.7 Use of Relational Operations

It is straightforward to specify many operations of the relational algebra in the form of Datalog rules that define the result of applying these operations on the database relations (fact predicates). This means that relational queries and views can easily be specified in Datalog. The additional power that Datalog provides is in the specification of recursive queries, and views based on recursive queries. In this section, we show how some of the standard relational operations can be specified as Datalog rules. Our examples will use the base relations (fact-defined predicates) rel_one, rel_two, and rel_three, whose schemas are shown in Figure 24.16. In Datalog, we do not need to specify the attribute names as in Figure 24.16; rather, the arity (degree) of each predicate is the important aspect.
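As a rough point of comparison, the same operations can be written as Python set expressions over tuples, where only the position of each value matters, just as with predicate arguments. The sample tuples below are hypothetical; only the predicate names and their arities (3, 3, and 4) come from the text. Python sets discard duplicates, which corresponds to treating each predicate as a set of tuples.

```python
# Hypothetical sample tuples; only the names and arities come from the text.
rel_one = {("c", 3, "g1"), ("c", 7, "g2"), ("d", 4, "g1")}
rel_two = {("c", 3, "g1"), ("e", 9, "g3")}
rel_three = {("g1", "h1", "i1", "j1"), ("g2", "h2", "i2", "j2")}

# SELECT: tuples of rel_one whose first argument equals the constant c.
select_one_a_eq_c = {(x, y, z) for (x, y, z) in rel_one if x == "c"}

# SELECT with a comparison: second argument less than 5.
select_one_b_less_5 = {(x, y, z) for (x, y, z) in rel_one if y < 5}

# PROJECT: keep only the first two arguments of rel_three.
project_three_on_g_h = {(w, x) for (w, x, _y, _z) in rel_three}

# UNION, INTERSECTION, and DIFFERENCE on relations of the same arity.
union_one_two = rel_one | rel_two
intersect_one_two = rel_one & rel_two
difference_two_one = rel_two - rel_one

# NATURAL JOIN: third argument of rel_one equals first argument of rel_three.
natural_join_one_three = {
    (u, v, w, x, y, z)
    for (u, v, w) in rel_one
    for (w2, x, y, z) in rel_three
    if w == w2
}
```

Each comprehension plays the role of one rule body: the loop patterns correspond to the regular predicates, and the if clause corresponds to a built-in comparison or to a shared variable.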
In a practical system, the domain (data type) of each attribute is also important for operations such as UNION, INTERSECTION, and JOIN, and we assume that the attribute types are compatible for the various operations, as discussed in Chapter 5.

Figure 24.16 illustrates a number of basic relational operations. Notice that, if the Datalog model is based on the relational model and hence assumes that predicates (fact relations and query results) specify sets of tuples, duplicate tuples in the same predicate are automatically eliminated. This may or may not be true, depending on the Datalog inference engine. However, it is definitely not the case in Prolog, so any of the rules in Figure 24.16 that involve duplicate elimination are not correct for Prolog. For example, if we want to specify Prolog rules for the UNION operation with duplicate elimination, we must rewrite them as follows:

union_one_two(X,Y,Z) :- rel_one(X,Y,Z).
union_one_two(X,Y,Z) :- rel_two(X,Y,Z), not(rel_one(X,Y,Z)).

However, the rules shown in Figure 24.16 should work for Datalog, if duplicates are automatically eliminated. Similarly, the rules for the PROJECT operation shown in Figure 24.16 should work for Datalog in this case, but they are not correct for Prolog, since duplicates would appear in the latter case.

rel_one(A,B,C).
rel_two(D,E,F).
rel_three(G,H,I,J).

select_one_A_eq_c(X,Y,Z) :- rel_one(X,Y,Z), X=c.
select_one_B_less_5(X,Y,Z) :- rel_one(X,Y,Z), Y<5.
select_one_A_eq_c_and_B_less_5(X,Y,Z) :- rel_one(X,Y,Z), X=c, Y<5.

select_one_A_eq_c_or_B_less_5(X,Y,Z) :- rel_one(X,Y,Z), X=c.
select_one_A_eq_c_or_B_less_5(X,Y,Z) :- rel_one(X,Y,Z), Y<5.

project_three_on_G_H(W,X) :- rel_three(W,X,Y,Z).

union_one_two(X,Y,Z) :- rel_one(X,Y,Z).
union_one_two(X,Y,Z) :- rel_two(X,Y,Z).

intersect_one_two(X,Y,Z) :- rel_one(X,Y,Z), rel_two(X,Y,Z).

difference_two_one(X,Y,Z) :- rel_two(X,Y,Z), not(rel_one(X,Y,Z)).

cart_prod_one_three(T,U,V,W,X,Y,Z) :- rel_one(T,U,V), rel_three(W,X,Y,Z).

natural_join_one_three_C_eq_G(U,V,W,X,Y,Z) :- rel_one(U,V,W), rel_three(W,X,Y,Z).

FIGURE 24.16 Predicates for illustrating relational operations.

24.4.8 Evaluation of Nonrecursive Datalog Queries

In order to use Datalog as a deductive database system, it is appropriate to define an inference mechanism based on relational database query processing concepts. The inherent strategy involves a bottom-up evaluation, starting with base relations; the order of operations is kept flexible and subject to query optimization. In this section, we discuss an inference mechanism based on relational operations that can be applied to nonrecursive Datalog queries. We use the fact and rule base shown in Figures 24.14 and 24.15 to illustrate our discussion.

If a query involves only fact-defined predicates, the inference becomes one of searching among the facts for the query result. For example, a query such as

department(X,research)?

is a selection of all employee names X who work for the research department. In relational algebra, it is the query:

π$1 (σ$2 = "Research" (department))

which can be answered by searching through the fact-defined predicate department(X,Y). The query involves relational SELECT and PROJECT operations on a base relation, and it can be handled by the database query processing and optimization techniques discussed in Chapter 15.

When a query involves rule-defined predicates, the inference mechanism must compute the result based on the rule definitions. If a query is nonrecursive and involves a predicate p that appears as the head of a rule p :- p1, p2, ..., pn, the strategy is first to compute the relations corresponding to p1, p2, ..., pn and then to compute the relation corresponding to p. It is useful to keep track of the dependency among the predicates of a deductive database in a predicate dependency graph.
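One way to picture such a graph is to record, for each rule head, the predicates used in the bodies of its rules. The Python sketch below does this for the rule-defined predicates of Figure 24.15, ignoring built-in comparisons and the not marker. The dictionary layout and the helper names depends_on and needs_recursion are assumptions made for this illustration.

```python
# Head predicate -> list of rule bodies (regular predicates only),
# transcribed from the rules of Figure 24.15.
RULE_BODIES = {
    "superior": [["supervise"], ["supervise", "superior"]],
    "subordinate": [["superior"]],
    "supervisor": [["employee", "supervise"]],
    "over_40K_emp": [["employee", "salary"]],
    "under_40K_supervisor": [["supervisor", "over_40K_emp"]],
    "main_productx_emp": [["employee", "workson"]],
    "president": [["employee", "supervise"]],
}


def depends_on(predicate, rule_bodies=RULE_BODIES):
    """All predicates that `predicate` depends on, directly or indirectly."""
    seen = set()
    stack = [predicate]
    while stack:
        current = stack.pop()
        # Fact-defined predicates have no rules, hence no outgoing lookups.
        for body in rule_bodies.get(current, []):
            for used in body:
                if used not in seen:
                    seen.add(used)
                    stack.append(used)
    return seen


def needs_recursion(predicate, rule_bodies=RULE_BODIES):
    """True if the predicate lies on a cycle or depends on one that does."""
    candidates = depends_on(predicate, rule_bodies) | {predicate}
    return any(p in depends_on(p, rule_bodies) for p in candidates)


to_compute_first = depends_on("under_40K_supervisor")
```

For a predicate that does not need recursion, the set returned by depends_on lists everything that has to be computed before the predicate itself.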
Figure 24.17 shows the graph for the fact and rule predicates shown in Figures 24.14 and 24.15. The dependency graph contains a node for each predicate. Whenever a predicate A is specified in the body (RHS) of a rule, and the head (LHS) of that rule is the predicate B, we say that B depends on A, and we draw a directed edge from A to B. This indicates that, in order to compute the facts for the predicate B (the rule head), we must first compute the facts for all the predicates A in the rule body. If the dependency graph has no cycles, we call the rule set nonrecursive. If there is at least one cycle, the rule set is called recursive. In Figure 24.17, there is one recursively defined predicate, namely superior, which has a recursive edge pointing back to itself. In addition, because the predicate subordinate depends on superior, it also requires recursion in computing its result.

FIGURE 24.17 Predicate dependency graph for Figures 24.14 and 24.15.

A query that includes only nonrecursive predicates is called a nonrecursive query. In this section, we discuss only inference mechanisms for nonrecursive queries. In Figure 24.17, any query that does not involve the predicates subordinate or superior is nonrecursive. In the predicate dependency graph, the nodes corresponding to fact-defined predicates do not have any incoming edges, since all fact-defined predicates have their facts stored in a database relation. The contents of a fact-defined predicate can be computed by directly retrieving the tuples in the corresponding database relation.

The main function of an inference mechanism is to compute the facts that correspond to query predicates. This can be accomplished by generating a relational expression involving relational operators such as SELECT, PROJECT, JOIN, UNION, and SET DIFFERENCE (with appropriate provision for dealing with safety issues) that, when executed, provides the query result. The query can then be executed by utilizing the internal query processing and optimization operations of a relational database management system. Whenever the inference mechanism needs to compute the fact set corresponding to a nonrecursive rule-defined predicate p, it first locates all the rules that have p as their head. The idea is to compute the fact set for each such rule and then to apply the UNION operation to the results, since UNION corresponds to a logical OR operation. The dependency graph indicates all predicates q on which each p depends, and since we assume that the predicate is nonrecursive, we can always determine a partial order among such predicates q. Before computing the fact set for p, we first compute the fact sets for all predicates q on which p depends, based on their partial order. For example, if a query involves the predicate under_40K_supervisor, we must first compute both supervisor and over_40K_emp. Since the latter two depend only on the fact-defined predicates employee, salary, and supervise, they can be computed directly from the stored database relations.

This concludes our introduction to deductive databases. Additional material may be found at the book Web site, where the complete Chapter 25 from the third edition is available. This includes a discussion on algorithms for recursive query processing.

24.5 SUMMARY

In this chapter, we introduced database concepts for some of the common features that are needed by advanced applications: active databases, temporal databases, and spatial and multimedia databases. It is important to note that each of these topics is very broad and warrants a complete textbook.
We first introduced the topic of active databases, which provide additional functionality for specifying active rules. We introduced the event-condition-action or ECA model for active databases. The rules can be automatically triggered by events that occur, such as a database update, and they can initiate certain actions that have been specified in the rule declaration if certain conditions are true. Many commercial packages already have some of the functionality provided by active databases in the form of triggers. We discussed the different options for specifying rules, such as row-level versus statement-level, before versus after, and immediate versus deferred. We gave examples of row-level triggers in the Oracle commercial system, and statement-level rules in the STARBURST experimental system. The syntax for triggers in the SQL-99 standard was also discussed. We briefly discussed some design issues and some possible applications for active databases.

We then introduced some of the concepts of temporal databases, which permit the database system to store a history of changes and allow users to query both current and past states of the database. We discussed how time is represented and distinguished between the valid time and transaction time dimensions. We then discussed how valid time, transaction time, and bitemporal relations can be implemented using tuple versioning in the relational model, with examples to illustrate how updates, inserts, and deletes are implemented. We also showed how complex objects can be used to implement temporal databases using attribute versioning. We then looked at some of the querying operations for temporal relational databases and gave a very brief introduction to the TSQL2 language.

We then turned to spatial and multimedia databases.
Spatial databases provide concepts for databases that keep track of objects that have spatial characteristics, and they require models for representing these spatial characteristics and operators for comparing and manipulating them. Multimedia databases provide features that allow users to store and query different types of multimedia information, which includes images (such as pictures or drawings), video clips (such as movies, news reels, or home videos), audio clips (such as songs, phone messages, or speeches), and documents (such as books or articles). We gave a very brief overview of the various types of media sources and how multimedia sources may be indexed. We concluded the chapter with an introduction to deductive databases and Datalog. Review Questions 24.1. What are the differences between row-level and statement-level active rules? 24.2. What are the differences among immediate, deferred, and detached consideration of active rule conditions? 24.3. What are the differences among immediate, deferred, and detached execution of active rule actions? 798 I Chapter 24 Enhanced Data Models for Advanced Applications 24.4. Briefly discuss the consistency and termination problems when designing a set of active rules. 24.5. Discuss some applications of active databases. 24.6. Discuss how time is represented in temporal databases and compare the different time dimensions. 24.7. What are the differences between valid time, transaction time, and bitemporal relations? 24.8. Describe how the insert, delete, and update commands should be implemented on a valid time relation. 24.9. Describe how the insert, delete, and update commands should be implemented on a bitemporal relation. 24.10. Describe how the insert, delete, and update commands should be implemented on a transaction time relation. 24.1 L What are the main differences between tuple versioning and attribute versioning? 24.12. How do spatial databases differ from regular databases? 24.13. 
What are the different types of multimedia sources?
24.14. How are multimedia sources indexed for content-based retrieval?

Exercises

24.15. Consider the COMPANY database described in Figure 5.6. Using the syntax of Oracle triggers, write active rules to do the following:
a. Whenever an employee's project assignments are changed, check if the total hours per week spent on the employee's projects are less than 30 or greater than 40; if so, notify the employee's direct supervisor.
b. Whenever an EMPLOYEE is deleted, delete the PROJECT tuples and DEPENDENT tuples related to that employee, and if the employee is managing a department or supervising any employees, set the MGRSSN for that department to null and set the SUPERSSN for those employees to null.
24.16. Repeat 24.15 but use the syntax of STARBURST active rules.
24.17. Consider the relational schema shown in Figure 24.18. Write active rules for keeping the SUM_COMMISSIONS attribute of SALES_PERSON equal to the sum of the COMMISSION attribute in SALES for each sales person. Your rules should also check if the SUM_COMMISSIONS exceeds 100000; if it does, call a procedure NOTIFY_MANAGER(S_ID). Write both statement-level rules in STARBURST notation and row-level rules in Oracle.

FIGURE 24.18 Database schema for sales and salesperson commissions in Exercise 24.17. (Schema diagram not reproduced; the labels that survive are SALES, COMMISSION, SALESPERSON_ID, and SUM_COMMISSIONS.)

24.18. Consider the UNIVERSITY EER schema of Figure 4.10. Write some rules (in English) that could be implemented via active rules to enforce some common integrity constraints that you think are relevant to this application.
24.19. Discuss which of the updates that created each of the tuples shown in Figure 24.9 were applied retroactively and which were applied proactively.
24.20. Show how the following updates, if applied in sequence, would change the contents of the bitemporal EMP_BT relation in Figure 24.9. For each update, state whether it is a retroactive or proactive update.
a.
On 2004-03-10,17:30:00, the salary of NARAYAN is updated to 40000, effective on 2004-03-01.
b. On 2003-07-30,08:31:00, the salary of SMITH was corrected to show that it should have been entered as 31000 (instead of 30000 as shown), effective on 2003-06-01.
c. On 2004-03-18,08:31:00, the database was changed to indicate that NARAYAN was leaving the company (i.e., logically deleted) effective 2004-03-31.
d. On 2004-04-20,14:07:33, the database was changed to indicate the hiring of a new employee called JOHNSON, with the tuple <'JOHNSON', '334455667', 1, NULL> effective on 2004-04-20.
e. On 2004-04-28,12:54:02, the database was changed to indicate that WONG was leaving the company (i.e., logically deleted) effective 2004-06-01.
f. On 2004-05-05,13:07:33, the database was changed to indicate the rehiring of BROWN, with the same department and supervisor but with salary 35000 effective on 2004-05-01.
24.21. Show how the updates given in Exercise 24.20, if applied in sequence, would change the contents of the valid time EMP_VT relation in Figure 24.8.
24.22. Add the following facts to the example database in Figure 24.3:
supervise(ahmad,bob), supervise(franklin,gwen).
First modify the supervisory tree in Figure 24.1b to reflect this change. Then modify the diagram in Figure 24.4 showing the top-down evaluation of the query superior(james,Y).
24.23. Consider the following set of facts for the relation parent(X,Y), where Y is the parent of X:
parent(a,aa), parent(a,ab), parent(aa,aaa), parent(aa,aab), parent(aaa,aaaa), parent(aaa,aaab).
Consider the rules
r1: ancestor(X,Y) :- parent(X,Y)
r2: ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)
which define ancestor Y of X as above.
a. Show how to solve the Datalog query ancestor(aa,X)? using the naive strategy. Show your work at each step.
b.
Show the same query by computing only the changes in the ancestor relation and using that in rule 2 each time. [This question is derived from Bancilhon and Ramakrishnan (1986).]
24.24. Consider a deductive database with the following rules:
ancestor(X,Y) :- father(X,Y)
ancestor(X,Y) :- father(X,Z), ancestor(Z,Y)
Notice that "father(X,Y)" means that Y is the father of X; "ancestor(X,Y)" means that Y is the ancestor of X. Consider the fact base
father(Harry,Issac), father(Issac,John), father(John,Kurt).
a. Construct a model theoretic interpretation of the above rules using the given facts.
b. Consider that a database contains the above relations father(X,Y), another relation brother(X,Y), and a third relation birth(X,B), where B is the birthdate of person X. State a rule that computes the first cousins of the following variety: their fathers must be brothers.
c. Show a complete Datalog program with fact-based and rule-based literals that computes the following relation: list of pairs of cousins, where the first person is born after 1960 and the second after 1970. You may use "greater than" as a built-in predicate. (Note: Sample facts for brother, birth, and person must also be shown.)
24.25. Consider the following rules:
reachable(X,Y) :- flight(X,Y)
reachable(X,Y) :- flight(X,Z), reachable(Z,Y)
where reachable(X,Y) means that city Y can be reached from city X, and flight(X,Y) means that there is a flight to city Y from city X.
a. Construct fact predicates that describe the following:
i. Los Angeles, New York, Chicago, Atlanta, Frankfurt, Paris, Singapore, Sydney are cities.
ii. The following flights exist: LA to NY, NY to Atlanta, Atlanta to Frankfurt, Frankfurt to Atlanta, Frankfurt to Singapore, and Singapore to Sydney. (Note: No flight in reverse direction can be automatically assumed.)
b. Is the given data cyclic? If so, in what sense?
c.
Construct a model theoretic interpretation (that is, an interpretation similar to the one shown in Figure 25.3) of the above facts and rules.
d. Consider the query reachable(Atlanta,Sydney)? How will this query be executed using naive and seminaive evaluation? List the series of steps it will go through.
e. Consider the following rule-defined predicates:
round-trip-reachable(X,Y) :- reachable(X,Y), reachable(Y,X)
duration(X,Y,Z)
Draw a predicate dependency graph for the above predicates. (Note: duration(X,Y,Z) means that you can take a flight from X to Y in Z hours.)
f. Consider the following query: What cities are reachable in 12 hours from Atlanta? Show how to express it in Datalog. Assume built-in predicates like greater-than(X,Y). Can this be converted into a relational algebra statement in a straightforward way? Why or why not?
g. Consider the predicate population(X,Y) where Y is the population of city X. Consider the following query: List all possible bindings of the predicate pair(X,Y), where Y is a city that can be reached in two flights from city X, which has over 1 million people. Show this query in Datalog. Draw a corresponding query tree in relational algebraic terms.

Selected Bibliography

The book by Zaniolo et al. (1997) consists of several parts, each describing an advanced database concept such as active, temporal, and spatial/text/multimedia databases. Widom and Ceri (1996) and Ceri and Fraternali (1997) focus on active database concepts and systems. Snodgrass et al. (1995) describe the TSQL2 language and data model. Khoshafian and Baker (1996), Faloutsos (1996), and Subrahmanian (1998) describe multimedia database concepts. Tansel et al. (1992) is a collection of chapters on temporal databases. STARBURST rules are described in Widom and Finkelstein (1990). Early work on active databases includes the HiPAC project, discussed in Chakravarthy et al. (1989) and Chakravarthy (1990).
A glossary for temporal databases is given in Jensen et al. (1994). Snodgrass (1987) focuses on TQuel, an early temporal query language. Temporal normalization is defined in Navathe and Ahmed (1989). Paton (1999) and Paton and Diaz (1999) survey active databases. Chakravarthy et al. (1994) describe SENTINEL and object-based active systems. Lee et al. (1998) discuss time series management.

The early developments of the logic and database approach are surveyed by Gallaire et al. (1984). Reiter (1984) provides a reconstruction of relational database theory, while Levesque (1984) provides a discussion of incomplete knowledge in light of logic. Gallaire and Minker (1978) provide an early book on this topic. A detailed treatment of logic and databases appears in Ullman (1989, vol. 2), and there is a related chapter in Volume 1 (1988). Ceri, Gottlob, and Tanca (1990) present a comprehensive yet concise treatment of logic and databases. Das (1992) is a comprehensive book on deductive databases and logic programming. The early history of Datalog is covered in Maier and Warren (1988). Clocksin and Mellish (1994) is an excellent reference on the Prolog language.

Aho and Ullman (1979) provide an early algorithm for dealing with recursive queries, using the least fixed-point operator. Bancilhon and Ramakrishnan (1986) give an excellent and detailed description of the approaches to recursive query processing, with detailed examples of the naive and seminaive approaches. Excellent survey articles on deductive databases and recursive query processing include Warren (1992) and Ramakrishnan and Ullman (1993). A complete description of the seminaive approach based on relational algebra is given in Bancilhon (1985).
Other approaches to recursive query processing include the recursive query/subquery strategy of Vieille (1986), which is a top-down interpreted strategy, and the Henschen-Naqvi (1984) top-down compiled iterative strategy. Balbin and Rao (1987) discuss an extension of the seminaive differential approach for multiple predicates.

The original paper on magic sets is by Bancilhon et al. (1986). Beeri and Ramakrishnan (1987) extend it. Mumick et al. (1990) show the applicability of magic sets to nonrecursive nested SQL queries. Other approaches to optimizing rules without rewriting them appear in Vieille (1986, 1987). Kifer and Lozinskii (1986) propose a different technique. Bry (1990) discusses how the top-down and bottom-up approaches can be reconciled. Whang and Navathe (1992) describe an extended disjunctive normal form technique to deal with recursion in relational algebra expressions for providing an expert system interface over a relational DBMS.

Chang (1981) describes an early system for combining deductive rules with relational databases. The LDL system prototype is described in Chimenti et al. (1990). Krishnamurthy and Naqvi (1989) introduce the "choice" notion in LDL. Zaniolo (1988) discusses the language issues for the LDL system. A language overview of CORAL is provided in Ramakrishnan et al. (1992), and the implementation is described in Ramakrishnan et al. (1993). An extension to support object-oriented features, called CORAL++, is described in Srivastava et al. (1993). Ullman (1985) provides the basis for the NAIL! system, which is described in Morris et al. (1987). Phipps et al. (1991) describe the GLUE-NAIL! deductive database system.

Zaniolo (1990) reviews the theoretical background and the practical importance of deductive databases. Nicolas (1997) gives an excellent history of the developments leading up to DOODs. Falcone et al. (1997) survey the DOOD landscape. References on the VALIDITY system include Friesen et al.
(1995), Vieille (1997), and Dietrich et al. (1999).

Distributed Databases and Client-Server Architectures

In this chapter we turn our attention to distributed databases (DDBs), distributed database management systems (DDBMSs), and how the client-server architecture is used as a platform for database application development. The DDB technology emerged as a merger of two technologies: (1) database technology, and (2) network and data communication technology. The latter has made tremendous strides in terms of wired and wireless technologies, from satellite and cellular communications and Metropolitan Area Networks (MANs) to the standardization of protocols like Ethernet, TCP/IP, and the Asynchronous Transfer Mode (ATM), as well as the explosion of the Internet. While early databases moved toward centralization and resulted in monolithic gigantic databases in the seventies and early eighties, the trend reversed toward more decentralization and autonomy of processing in the late eighties. With advances in distributed processing and distributed computing that occurred in the operating systems arena, the database research community did considerable work to address the issues of data distribution, distributed query and transaction processing, distributed database metadata management, and other topics, and developed many research prototypes. However, a full-scale comprehensive DDBMS that implements the functionality and techniques proposed in DDB research never emerged as a commercially viable product. Most major vendors redirected their efforts from developing a "pure" DDBMS product into developing systems based on client-server, or toward developing technologies for accessing distributed heterogeneous data sources.
Organizations, however, have been very interested in the decentralization of processing (at the system level) while achieving an integration of the information resources (at the logical level) within their geographically distributed systems of databases, applications, and users. Coupled with the advances in communications, there is now a general endorsement of the client-server approach to application development, which assumes many of the DDB issues.

In this chapter we discuss both distributed databases and client-server architectures¹ in the development of database technology that is closely tied to advances in communications and network technology. Details of the latter are outside our scope; the reader is referred to a series of texts on data communications and networking (see the Selected Bibliography at the end of this chapter).

Section 25.1 introduces distributed database management and related concepts. Detailed issues of distributed database design, involving fragmenting of data and distributing it over multiple sites with possible replication, are discussed in Section 25.2. Section 25.3 introduces different types of distributed database systems, including federated and multidatabase systems, and highlights the problems of heterogeneity and the needs of autonomy in federated database systems, which will dominate for years to come. Sections 25.4 and 25.5 introduce distributed database query and transaction processing techniques, respectively. Section 25.6 discusses how the client-server architectural concepts are related to distributed databases. Section 25.7 elaborates on future issues in client-server architectures. Section 25.8 discusses distributed database features of the Oracle RDBMS.

For a short introduction to the topic, only Sections 25.1, 25.3, and 25.6 may be covered.
25.1 DISTRIBUTED DATABASE CONCEPTS

Distributed databases bring the advantages of distributed computing to the database management domain. A distributed computing system consists of a number of processing elements, not necessarily homogeneous, that are interconnected by a computer network, and that cooperate in performing certain assigned tasks. As a general goal, distributed computing systems partition a big, unmanageable problem into smaller pieces and solve it efficiently in a coordinated manner. The economic viability of this approach stems from two reasons: (1) more computer power is harnessed to solve a complex task, and (2) each autonomous processing element can be managed independently and develop its own applications.

We can define a distributed database (DDB) as a collection of multiple logically interrelated databases distributed over a computer network, and a distributed database management system (DDBMS) as a software system that manages a distributed database while making the distribution transparent to the user.² A collection of files stored at different nodes of a network and the maintaining of interrelationships among them via hyperlinks has become a common organization on the Internet, with files of Web pages. The common functions of database management, including uniform query processing and transaction processing, do not apply to this scenario yet. The technology is, however, moving in a direction such that distributed World Wide Web (WWW) databases will become a reality in the near future. We shall discuss issues of accessing databases on the Web in Chapter 26. None of these qualifies as a DDB by the definition given earlier.

1. The reader should review the introduction to client-server architecture in Section 2.5.
2. This definition and some of the discussion in this section are based on Ozsu and Valduriez (1999).
25.1.1 Parallel Versus Distributed Technology

Turning our attention to parallel system architectures, there are two main types of multiprocessor system architectures that are commonplace:
• Shared memory (tightly coupled) architecture: Multiple processors share secondary (disk) storage and also share primary memory.
• Shared disk (loosely coupled) architecture: Multiple processors share secondary (disk) storage but each has its own primary memory.

These architectures enable processors to communicate without the overhead of exchanging messages over a network.³ Database management systems developed using the above types of architectures are termed parallel database management systems rather than DDBMSs, since they utilize parallel processor technology. Another type of multiprocessor architecture is called shared nothing architecture. In this architecture, every processor has its own primary and secondary (disk) memory, no common memory exists, and the processors communicate over a high-speed interconnection network (bus or switch). Although the shared nothing architecture resembles a distributed database computing environment, major differences exist in the mode of operation. In shared nothing multiprocessor systems, there is symmetry and homogeneity of nodes; this is not true of the distributed database environment, where heterogeneity of hardware and operating system at each node is very common. Shared nothing architecture is also considered as an environment for parallel databases. Figure 25.1 contrasts these different architectures.

25.1.2 Advantages of Distributed Databases

Distributed database management has been proposed for various reasons ranging from organizational decentralization and economical processing to greater autonomy. We highlight some of these advantages here.

1.
Management of distributed data with different levels of transparency: Ideally, a DBMS should be distribution transparent in the sense of hiding the details of where each file (table, relation) is physically stored within the system. Consider the company database in Figure 5.5 that we have been discussing throughout the book. The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally (that is, into sets of rows, as we shall discuss in Section 25.2) and stored with possible replication as shown in Figure 25.2.

3. If both primary and secondary memories are shared, the architecture is also known as shared everything architecture.

FIGURE 25.1 Some different database system architectures. (a) Shared nothing architecture. (b) A networked architecture with a centralized database at one of the sites. (c) A truly distributed database architecture.

The following types of transparencies are possible:
• Distribution or network transparency: This refers to freedom for the user from the operational details of the network. It may be divided into location transparency and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.
• Replication transparency: As we show in Figure 25.2, copies of data may be stored at multiple sites for better availability, performance, and reliability.
Replication transparency makes the user unaware of the existence of copies.
• Fragmentation transparency: Two types of fragmentation are possible. Horizontal fragmentation distributes a relation into sets of tuples (rows). Vertical fragmentation distributes a relation into subrelations where each subrelation is defined by a subset of the columns of the original relation. A global query by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments.

FIGURE 25.2 Data distribution and replication among distributed databases. The sites are connected by a communications network and store the following data:
• San Francisco: EMPLOYEES (San Francisco and Los Angeles); PROJECTS (San Francisco); WORKS_ON (San Francisco employees).
• Los Angeles: EMPLOYEES (Los Angeles); PROJECTS (Los Angeles and San Francisco); WORKS_ON (Los Angeles employees).
• New York: EMPLOYEES (New York); PROJECTS (all); WORKS_ON (New York employees).
• Atlanta: EMPLOYEES (Atlanta); PROJECTS (Atlanta); WORKS_ON (Atlanta employees).
• Remaining site: EMPLOYEES (all); PROJECTS (all); WORKS_ON (all).

2. Increased reliability and availability: These are two of the most common potential advantages cited for distributed databases. Reliability is broadly defined as the probability that a system runs continuously, without failure, during a time interval, whereas availability is the probability that the system is running (not down) at a certain time point. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate. Only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site. In a centralized system, failure at a single site makes the whole system unavailable to all users.
In a distributed database, some of the data may be unreachable, but users may still be able to access other parts of the database.

3. Improved performance: A distributed DBMS fragments the database by keeping the data closer to where it is needed most. Data localization reduces the contention for CPU and I/O services and simultaneously reduces access delays involved in wide area networks. When a large database is distributed over multiple sites, smaller databases exist at each site. As a result, local queries and transactions accessing data at a single site have better performance because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions are submitted to a single centralized database. Moreover, interquery and intraquery parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of subqueries that execute in parallel. This contributes to improved performance.

4. Easier expansion: In a distributed environment, expansion of the system in terms of adding more data, increasing database sizes, or adding more processors is much easier.

The transparencies we discussed in (1) above lead to a compromise between ease of use and the overhead cost of providing transparency. Total transparency provides the global user with a view of the entire DDBS as if it is a single centralized system. Transparency is provided as a complement to autonomy, which gives the users tighter control over their own local databases. Transparency features may be implemented as a part of the user language, which may translate the required services into appropriate operations. In addition, transparency impacts the features that must be provided by the operating system and the DBMS.

25.1.3 Additional Functions of Distributed Databases

Distribution leads to increased complexity in the system design and implementation.
To achieve the potential advantages listed previously, the DDBMS software must be able to provide the following functions in addition to those of a centralized DBMS:
• Keeping track of data: The ability to keep track of the data distribution, fragmentation, and replication by expanding the DDBMS catalog.
• Distributed query processing: The ability to access remote sites and transmit queries and data among the various sites via a communication network.
• Distributed transaction management: The ability to devise execution strategies for queries and transactions that access data from more than one site and to synchronize the access to distributed data and maintain integrity of the overall database.
• Replicated data management: The ability to decide which copy of a replicated data item to access and to maintain the consistency of copies of a replicated data item.
• Distributed database recovery: The ability to recover from individual site crashes and from new types of failures such as the failure of a communication link.
• Security: Distributed transactions must be executed with the proper management of the security of the data and the authorization/access privileges of users.
• Distributed directory (catalog) management: A directory contains information (metadata) about data in the database. The directory may be global for the entire DDB, or local for each site. The placement and distribution of the directory are design and policy issues.

These functions themselves increase the complexity of a DDBMS over a centralized DBMS. Before we can realize the full potential advantages of distribution, we must find satisfactory solutions to these design issues and problems. Including all this additional functionality is hard to accomplish, and finding optimal solutions is a step beyond that.
At the physical hardware level, the following main factors distinguish a DDBMS from a centralized system:
• There are multiple computers, called sites or nodes.
• These sites must be connected by some type of communication network to transmit data and commands among sites, as shown in Figure 25.1c.

The sites may all be located in physical proximity (say, within the same building or group of adjacent buildings) and connected via a local area network, or they may be geographically distributed over large distances and connected via a long-haul or wide area network. Local area networks typically use cables, whereas long-haul networks use telephone lines or satellites. It is also possible to use a combination of the two types of networks.

Networks may have different topologies that define the direct communication paths among sites. The type and topology of the network used may have a significant effect on performance and hence on the strategies for distributed query processing and distributed database design. For high-level architectural issues, however, it does not matter which type of network is used; it only matters that each site is able to communicate, directly or indirectly, with every other site. For the remainder of this chapter, we assume that some type of communication network exists among sites, regardless of the particular topology. We will not address any network-specific issues, although it is important to understand that for efficient operation of a DDBS, network design and performance issues are critical.

25.2 DATA FRAGMENTATION, REPLICATION, AND ALLOCATION TECHNIQUES FOR DISTRIBUTED DATABASE DESIGN

In this section we discuss techniques that are used to break up the database into logical units, called fragments, which may be assigned for storage at the various sites.
We also discuss the use of data replication, which permits certain data to be stored in more than one site, and the process of allocating fragments (or replicas of fragments) for storage at the various sites. These techniques are used during the process of distributed database design. The information concerning data fragmentation, allocation, and replication is stored in a global directory that is accessed by the DDBS applications as needed.

25.2.1 Data Fragmentation

In a DDB, decisions must be made regarding which site should be used to store which portions of the database. For now, we will assume that there is no replication; that is, each relation (or portion of a relation) is to be stored at only one site. We discuss replication and its effects later in this section. We also use the terminology of relational databases; similar concepts apply to other data models. We assume that we are starting with a relational database schema and must decide on how to distribute the relations over the various sites. To illustrate our discussion, we use the relational database schema in Figure 5.5.

Before we decide on how to distribute the data, we must determine the logical units of the database that are to be distributed. The simplest logical units are the relations themselves; that is, each whole relation is to be stored at a particular site. In our example, we must decide on a site to store each of the relations EMPLOYEE, DEPARTMENT, PROJECT, WORKS_ON, and DEPENDENT of Figure 5.5. In many cases, however, a relation can be divided into smaller logical units for distribution. For example, consider the company database shown in Figure 5.6, and assume there are three computer sites, one for each department in the company.⁴ We may want to store the database information relating to each department at the computer site for that department. A technique called horizontal fragmentation can be used to partition each relation by department.

Horizontal Fragmentation.
A horizontal fragment of a relation is a subset of the tuples in that relation. The tuples that belong to the horizontal fragment are specified by a condition on one or more attributes of the relation. Often, only a single attribute is involved. For example, we may define three horizontal fragments on the EMPLOYEE relation of Figure 5.6 with the following conditions: (DNO = 5), (DNO = 4), and (DNO = 1); each fragment contains the EMPLOYEE tuples working for a particular department. Similarly, we may define three horizontal fragments for the PROJECT relation, with the conditions (DNUM = 5), (DNUM = 4), and (DNUM = 1); each fragment contains the PROJECT tuples controlled by a particular department. Horizontal fragmentation divides a relation "horizontally" by grouping rows to create subsets of tuples, where each subset has a certain logical meaning. These fragments can then be assigned to different sites in the distributed system. Derived horizontal fragmentation applies the partitioning of a primary relation (DEPARTMENT in our example) to other secondary relations (EMPLOYEE and PROJECT in our example), which are related to the primary via a foreign key. This way, related data between the primary and the secondary relations gets fragmented in the same way.

4. Of course, in an actual situation, there will be many more tuples in the relations than those shown in Figure 5.6.

Vertical Fragmentation. Each site may not need all the attributes of a relation, which would indicate the need for a different type of fragmentation. Vertical fragmentation divides a relation "vertically" by columns. A vertical fragment of a relation keeps only certain attributes of the relation. For example, we may want to fragment the EMPLOYEE relation into two vertical fragments.
The first fragment includes personal information (NAME, BDATE, ADDRESS, and SEX), and the second includes work-related information (SSN, SALARY, SUPERSSN, and DNO). This vertical fragmentation is not quite proper because, if the two fragments are stored separately, we cannot put the original employee tuples back together, since there is no common attribute between the two fragments. It is necessary to include the primary key or some candidate key attribute in every vertical fragment so that the full relation can be reconstructed from the fragments. Hence, we must add the SSN attribute to the personal information fragment.

Notice that each horizontal fragment on a relation R can be specified by a $\sigma_{C_i}(R)$ operation in the relational algebra. A set of horizontal fragments whose conditions $C_1, C_2, \ldots, C_n$ include all the tuples in R (that is, every tuple in R satisfies $(C_1 \text{ OR } C_2 \text{ OR } \ldots \text{ OR } C_n)$) is called a complete horizontal fragmentation of R. In many cases a complete horizontal fragmentation is also disjoint; that is, no tuple in R satisfies $(C_i \text{ AND } C_j)$ for any $i \neq j$. Our two earlier examples of horizontal fragmentation for the EMPLOYEE and PROJECT relations were both complete and disjoint. To reconstruct the relation R from a complete horizontal fragmentation, we need to apply the UNION operation to the fragments.

A vertical fragment on a relation R can be specified by a $\pi_{L_i}(R)$ operation in the relational algebra. A set of vertical fragments whose projection lists $L_1, L_2, \ldots, L_n$ include all the attributes in R but share only the primary key attribute of R is called a complete vertical fragmentation of R. In this case the projection lists satisfy the following two conditions:

• $L_1 \cup L_2 \cup \ldots \cup L_n = \text{ATTRS}(R)$.
• $L_i \cap L_j = \text{PK}(R)$ for any $i \neq j$, where ATTRS(R) is the set of attributes of R and PK(R) is the primary key of R.
To reconstruct the relation R from a complete vertical fragmentation, we apply the OUTER UNION operation to the vertical fragments (assuming no horizontal fragmentation is used). Notice that we could also apply a FULL OUTER JOIN operation and get the same result for a complete vertical fragmentation, even when some horizontal fragmentation may also have been applied. The two vertical fragments of the EMPLOYEE relation with projection lists $L_1$ = {SSN, NAME, BDATE, ADDRESS, SEX} and $L_2$ = {SSN, SALARY, SUPERSSN, DNO} constitute a complete vertical fragmentation of EMPLOYEE.

Two horizontal fragments that are neither complete nor disjoint are those defined on the EMPLOYEE relation of Figure 5.5 by the conditions (SALARY > 50000) and (DNO = 4); they may not include all EMPLOYEE tuples, and they may include common tuples. Two vertical fragments that are not complete are those defined by the attribute lists $L_1$ = {NAME, ADDRESS} and $L_2$ = {SSN, NAME, SALARY}; these lists violate both conditions of a complete vertical fragmentation.

Mixed (Hybrid) Fragmentation. We can intermix the two types of fragmentation, yielding a mixed fragmentation. For example, we may combine the horizontal and vertical fragmentations of the EMPLOYEE relation given earlier into a mixed fragmentation that includes six fragments. In this case the original relation can be reconstructed by applying UNION and OUTER UNION (or OUTER JOIN) operations in the appropriate order. In general, a fragment of a relation R can be specified by a SELECT-PROJECT combination of operations $\pi_L(\sigma_C(R))$. If C = TRUE (that is, all tuples are selected) and $L \neq \text{ATTRS}(R)$, we get a vertical fragment, and if $C \neq \text{TRUE}$ and $L = \text{ATTRS}(R)$, we get a horizontal fragment. Finally, if $C \neq \text{TRUE}$ and $L \neq \text{ATTRS}(R)$, we get a mixed fragment. Notice that a relation can itself be considered a fragment with C = TRUE and $L = \text{ATTRS}(R)$.
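The fragment definitions above can be tried out on a few rows. The following is a minimal sketch, not a DDBMS implementation: relations are modeled as lists of dictionaries, the four sample rows are illustrative, and the helper names (select, project, join_on_key, as_set) are hypothetical. A join on the key SSN stands in for the OUTER UNION or FULL OUTER JOIN, which gives the same result here because the vertical fragmentation is complete and no horizontal fragmentation is mixed in.

```python
# Illustrative sketch: relations are lists of dicts. The sample rows and the
# helper names (select, project, join_on_key, as_set) are assumptions.
EMPLOYEE = [
    {"SSN": "123456789", "NAME": "John Smith", "SALARY": 30000, "DNO": 5},
    {"SSN": "333445555", "NAME": "Franklin Wong", "SALARY": 40000, "DNO": 5},
    {"SSN": "999887777", "NAME": "Alicia Zelaya", "SALARY": 25000, "DNO": 4},
    {"SSN": "888665555", "NAME": "James Borg", "SALARY": 55000, "DNO": 1},
]


def select(relation, condition):
    """Horizontal fragment: keep the tuples that satisfy the condition."""
    return [row for row in relation if condition(row)]


def project(relation, attrs):
    """Vertical fragment: keep the listed attributes, dropping duplicates."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def join_on_key(left, right, key):
    """Recombine two vertical fragments that share the key attribute."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def as_set(relation):
    """Order-independent view of a relation, used only for comparisons."""
    return {tuple(sorted(row.items())) for row in relation}


# Horizontal fragmentation by department: (DNO = 5), (DNO = 4), (DNO = 1).
h_frags = [select(EMPLOYEE, lambda r, d=d: r["DNO"] == d) for d in (5, 4, 1)]
union = [row for frag in h_frags for row in frag]
horizontal_complete = as_set(union) == as_set(EMPLOYEE)   # every tuple covered
horizontal_disjoint = len(union) == len(as_set(union))    # no tuple repeated

# Vertical fragmentation: both projection lists keep the key SSN.
L1 = ["SSN", "NAME"]
L2 = ["SSN", "SALARY", "DNO"]
rebuilt = join_on_key(project(EMPLOYEE, L1), project(EMPLOYEE, L2), "SSN")
vertical_lossless = as_set(rebuilt) == as_set(EMPLOYEE)
```

Dropping SSN from L1 would leave join_on_key with nothing to match on, which is the reconstruction problem described above for the fragmentation without a shared key.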
In the following discussion, the term fragment is used to refer to a relation or to any of the preceding types of fragments. A fragmentation schema of a database is a definition of a set of fragments that includes all attributes and tuples in the database and satisfies the condition that the whole database can be reconstructed from the fragments by applying some sequence of OUTER UNION (or OUTER JOIN) and UNION operations. It is also sometimes useful, although not necessary, to have all the fragments be disjoint except for the repetition of primary keys among vertical (or mixed) fragments. In the latter case, all replication and distribution of fragments is clearly specified at a subsequent stage, separately from fragmentation.

An allocation schema describes the allocation of fragments to sites of the DDBS; hence, it is a mapping that specifies for each fragment the site(s) at which it is stored. If a fragment is stored at more than one site, it is said to be replicated. We discuss data replication and allocation next.

25.2.2 Data Replication and Allocation

Replication is useful in improving the availability of data. The most extreme case is replication of the whole database at every site in the distributed system, thus creating a fully replicated distributed database. This can improve availability remarkably because the system can continue to operate as long as at least one site is up. It also improves performance of retrieval for global queries, because the result of such a query can be obtained locally from any one site; hence, a retrieval query can be processed at the local site where it is submitted, if that site includes a server module. The disadvantage of full replication is that it can slow down update operations drastically, since a single logical update must be performed on every copy of the database to keep the copies consistent. This is especially true if many copies of the database exist.
Full replication makes the concurrency control and recovery techniques more expensive than they would be if there were no replication, as we shall see in Section 25.5.

The other extreme from full replication involves having no replication; that is, each fragment is stored at exactly one site. In this case all fragments must be disjoint, except for the repetition of primary keys among vertical (or mixed) fragments. This is also called nonredundant allocation.

Between these two extremes, we have a wide spectrum of partial replication of the data; that is, some fragments of the database may be replicated whereas others may not. The number of copies of each fragment can range from one up to the total number of sites in the distributed system. A special case of partial replication occurs frequently in applications where mobile workers (such as sales forces, financial planners, and claims adjustors) carry partially replicated databases with them on laptops and personal digital assistants and synchronize them periodically with the server database.⁵ A description of the replication of fragments is sometimes called a replication schema.

Each fragment, or each copy of a fragment, must be assigned to a particular site in the distributed system. This process is called data distribution (or data allocation). The choice of sites and the degree of replication depend on the performance and availability goals of the system and on the types and frequencies of transactions submitted at each site. For example, if high availability is required, transactions can be submitted at any site, and most transactions are retrieval only, a fully replicated database is a good choice. However, if certain transactions that access particular parts of the database are mostly submitted at a particular site, the corresponding set of fragments can be allocated at that site only.
Data that is accessed at multiple sites can be replicated at those sites. If many updates are performed, it may be useful to limit replication. Finding an optimal or even a good solution to distributed data allocation is a complex optimization problem.

5. For a scalable approach to synchronize partially replicated databases, see Mahajan et al. (1998).

25.2.3 Example of Fragmentation, Allocation, and Replication

We now consider an example of fragmenting and distributing the company database of Figures 5.5 and 5.6. Suppose that the company has three computer sites, one for each current department. Sites 2 and 3 are for departments 5 and 4, respectively. At each of these sites, we expect frequent access to the EMPLOYEE and PROJECT information for the employees who work in that department and the projects controlled by that department. Further, we assume that these sites mainly access the NAME, SSN, SALARY, and SUPERSSN attributes of EMPLOYEE. Site 1 is used by company headquarters and accesses all employee and project information regularly, in addition to keeping track of DEPENDENT information for insurance purposes.

According to these requirements, the whole database of Figure 5.6 can be stored at site 1. To determine the fragments to be replicated at sites 2 and 3, we can first horizontally fragment DEPARTMENT by its key DNUMBER. We then apply derived fragmentation to the EMPLOYEE, PROJECT, and DEPT_LOCATIONS relations based on their foreign keys for department number, called DNO, DNUM, and DNUMBER, respectively, in Figure 5.5. We can then vertically fragment the resulting EMPLOYEE fragments to include only the attributes {NAME, SSN, SALARY, SUPERSSN, DNO}. Figure 25.3 shows the mixed fragments EMPD5 and EMPD4, which include the EMPLOYEE tuples satisfying the conditions DNO = 5 and DNO = 4, respectively. The horizontal fragments of PROJECT, DEPARTMENT, and DEPT_LOCATIONS are similarly fragmented by department number. All these fragments, stored at sites 2 and 3, are replicated because they are also stored at the headquarters site 1.

FIGURE 25.3 Allocation of fragments to sites. (a) Relation fragments at site 2 corresponding to department 5: EMPD5, DEP5, DEP5_LOCS, PROJS5, and WORKS_ON5. (b) Relation fragments at site 3 corresponding to department 4: EMPD4, DEP4, DEP4_LOCS, PROJS4, and WORKS_ON4.

We must now fragment the WORKS_ON relation and decide which fragments of WORKS_ON to store at sites 2 and 3. We are confronted with the problem that no attribute of WORKS_ON directly indicates the department to which each tuple belongs. In fact, each tuple in WORKS_ON relates an employee e to a project p.
We could fragment WORKS_ON based on the department d in which e works or based on the department d' that controls p. Fragmentation becomes easy if we have a constraint stating that d = d' for all WORKS_ON tuples; that is, if employees can work only on projects controlled by the department they work for. However, there is no such constraint in our database of Figure 5.6. For example, the WORKS_ON tuple <333445555, 10, 10.0> relates an employee who works for department 5 with a project controlled by department 4. In this case we could fragment WORKS_ON based on the department in which the employee works (which is expressed by the condition C) and then fragment further based on the department that controls the projects that employee is working on, as shown in Figure 25.4.

In Figure 25.4, the union of fragments G1, G2, and G3 gives all WORKS_ON tuples for employees who work for department 5. Similarly, the union of fragments G4, G5, and G6 gives all WORKS_ON tuples for employees who work for department 4. On the other hand, the union of fragments G1, G4, and G7 gives all WORKS_ON tuples for projects controlled by department 5. The condition for each of the fragments G1 through G9 is shown in Figure 25.4. The relations that represent M:N relationships, such as WORKS_ON, often have several possible logical fragmentations. In our distribution of Figure 25.3, we choose to include all fragments that can be joined to either an EMPLOYEE tuple or a PROJECT tuple at sites 2 and 3. Hence, we place the union of fragments G1, G2, G3, G4, and G7 at site 2 and the union of fragments G4, G5, G6, G2, and G8 at site 3. Notice that fragments G2 and G4 are replicated at both sites. This allocation strategy permits the join between the local EMPLOYEE or PROJECT fragments at site 2 or site 3 and the local WORKS_ON fragment to be performed completely locally. This clearly demonstrates how complex the problem of database fragmentation and allocation is for large databases.
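To make the nine-way split and its allocation concrete, here is a small sketch under stated assumptions: only a handful of WORKS_ON tuples are used, the two lookup tables stand in for the subqueries on EMPLOYEE and PROJECT, the department controlling project 20 is taken to be department 1 as in Figure 5.6, and the names fragment_number and union_of are hypothetical.

```python
# Hypothetical sample data: a few WORKS_ON tuples plus lookup tables giving
# each employee's department and each project's controlling department.
EMP_DEPT = {"123456789": 5, "333445555": 5, "999887777": 4, "987654321": 4}
PROJ_DEPT = {1: 5, 2: 5, 10: 4, 30: 4, 20: 1}
WORKS_ON = [
    ("123456789", 1, 32.5),
    ("333445555", 2, 10.0),
    ("333445555", 10, 10.0),
    ("333445555", 20, 10.0),
    ("999887777", 30, 30.0),
    ("999887777", 10, 10.0),
    ("987654321", 20, 15.0),
]
DEPTS = (5, 4, 1)  # department order used for numbering G1 through G9


def fragment_number(essn, pno):
    """Return i such that the tuple belongs to fragment Gi.

    The employee's department selects the group (G1-G3, G4-G6, G7-G9);
    the controlling department of the project selects the fragment in it.
    """
    row = DEPTS.index(EMP_DEPT[essn])
    col = DEPTS.index(PROJ_DEPT[pno])
    return 3 * row + col + 1


fragments = {g: [] for g in range(1, 10)}
for tup in WORKS_ON:
    fragments[fragment_number(tup[0], tup[1])].append(tup)


def union_of(*numbers):
    """Union of the named fragments, as a set of tuples."""
    return {tup for g in numbers for tup in fragments[g]}


# Allocation described in the text: site 2 serves department 5,
# site 3 serves department 4.
site2 = union_of(1, 2, 3, 4, 7)
site3 = union_of(4, 5, 6, 2, 8)
replicated = site2 & site3  # tuples stored at both sites (from G2 and G4)
```

With this sample, the only tuple stored at both sites is <333445555, 10, 10.0>, the cross-department tuple singled out in the discussion; G4 and G7 happen to be empty.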
The Selected Bibliography at the end of this chapter discusses some of the work done in this area.

25.3 TYPES OF DISTRIBUTED DATABASE SYSTEMS

The term distributed database management system can describe various systems that differ from one another in many respects. The main thing that all such systems have in common is the fact that data and software are distributed over multiple sites connected by some form of communication network. In this section we discuss a number of types of DDBMSs and the criteria and factors that make some of these systems different.

The first factor we consider is the degree of homogeneity of the DDBMS software. If all servers (or individual local DBMSs) use identical software and all users (clients) use identical software, the DDBMS is called homogeneous; otherwise, it is called heterogeneous. Another factor related to the degree of homogeneity is the degree of local autonomy. If there is no provision for the local site to function as a stand-alone DBMS, then the system has no local autonomy. On the other hand, if direct access by local transactions to a server is permitted, the system has some degree of local autonomy.

At one extreme of the autonomy spectrum, we have a DDBMS that "looks like" a centralized DBMS to the user. A single conceptual schema exists, and all access to the system is obtained through a site that is part of the DDBMS, which means that no local autonomy exists. At the other extreme we encounter a type of DDBMS called a federated DDBMS (or a multidatabase system). In such a system, each server is an independent and autonomous centralized DBMS that has its own local users, local transactions, and DBA and hence has a very high degree of local autonomy. The term federated database system (FDBS) is used when there is some global view or schema of the federation of databases that is shared by the applications.
On the other hand, a multidatabase system does not have a global schema and interactively constructs one as needed by the application. Both systems are hybrids between distributed and centralized systems, and the distinction we made between them is not strictly followed. We will refer to them as FDBSs in a generic sense.

FIGURE 25.4 Complete and disjoint fragments of the WORKS_ON relation. (a) Fragments of WORKS_ON for employees working in department 5 (C = [ESSN IN (SELECT SSN FROM EMPLOYEE WHERE DNO = 5)]). (b) Fragments of WORKS_ON for employees working in department 4 (C = [ESSN IN (SELECT SSN FROM EMPLOYEE WHERE DNO = 4)]). (c) Fragments of WORKS_ON for employees working in department 1 (C = [ESSN IN (SELECT SSN FROM EMPLOYEE WHERE DNO = 1)]).

Fragment conditions shown in the figure, with C as defined for each panel:
(a) Employees in department 5
G1: C1 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 5))
G2: C2 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 4))
G3: C3 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 1))
(b) Employees in department 4
G4: C4 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 5))
G5: C5 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 4))
G6: C6 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 1))
(c) Employees in department 1
G7: C7 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 5))
G8: C8 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 4))
G9: C9 = C AND (PNO IN (SELECT PNUMBER FROM PROJECT WHERE DNUM = 1))

In a heterogeneous FDBS, one server may be a relational DBMS, another a network DBMS, and a third an object or hierarchical DBMS; in such a case it is necessary to have a canonical system language and to include language translators to translate subqueries from the canonical language to the language of each server.
We briefly discuss the issues affecting the design of FDBSs below.

Federated Database Management Systems Issues. The type of heterogeneity present in FDBSs may arise from several sources. We discuss these sources first and then point out how the different types of autonomies contribute to a semantic heterogeneity that must be resolved in a heterogeneous FDBS.

• Differences in data models: Databases in an organization come from a variety of data models, including the so-called legacy models (network and hierarchical; see Appendixes E and F), the relational data model, the object data model, and even files. The modeling capabilities of the models vary. Hence, dealing with them uniformly via a single global schema or processing them in a single language is challenging. Even if two databases are both from the RDBMS environment, the same information may be represented as an attribute name, as a relation name, or as a value in different databases. This calls for an intelligent query-processing mechanism that can relate information based on metadata.
• Differences in constraints: Constraint facilities for specification and implementation vary from system to system. There are comparable features that must be reconciled in the construction of a global schema. For example, the relationships from ER models are represented as referential integrity constraints in the relational model. Triggers may have to be used to implement certain constraints in the relational model. The global schema must also deal with potential conflicts among constraints.
• Differences in query languages: Even with the same data model, the languages and their versions vary. For example, SQL has multiple versions, such as SQL-89, SQL-92, and SQL-99, and each system has its own set of data types, comparison operators, string manipulation features, and so on.

Semantic Heterogeneity.
Semantic heterogeneity occurs when there are differences in the meaning, interpretation, and intended use of the same or related data. Semantic heterogeneity among component database systems (DBSs) creates the biggest hurdle in designing global schemas of heterogeneous databases. The design autonomy of component DBSs refers to their freedom of choosing the following design parameters, which in turn affect the eventual complexity of the FDBS:

• The universe of discourse from which the data is drawn: For example, two customer accounts databases in the federation may be from the United States and Japan, with entirely different sets of attributes about customer accounts required by the accounting practices. Currency rate fluctuations would also present a problem. Hence, relations in these two databases that have identical names (CUSTOMER or ACCOUNT) may have some common and some entirely distinct information.
• Representation and naming: The representation and naming of data elements and the structure of the data model may be prespecified for each local database.
• The understanding, meaning, and subjective interpretation of data: This is a chief contributor to semantic heterogeneity.
• Transaction and policy constraints: These deal with serializability criteria, compensating transactions, and other transaction policies.
• Derivation of summaries: Aggregation, summarization, and other data-processing features and operations supported by the system.

Communication autonomy of a component DBS refers to its ability to decide whether to communicate with another component DBS. Execution autonomy refers to the ability of a component DBS to execute local operations without interference from external operations by other component DBSs and its ability to decide the order in which to execute them.
The association autonomy of a component DBS implies that it has the ability to decide whether and how much to share its functionality (operations it supports) and resources (data it manages) with other component DBSs. The major challenge of designing FDBSs is to let component DBSs interoperate while still providing the above types of autonomies to them.

A typical five-level schema architecture to support global applications in the FDBS environment is shown in Figure 25.5. In this architecture, the local schema is the conceptual schema (full database definition) of a component database, and the component schema is derived by translating the local schema into a canonical data model or common data model (CDM) for the FDBS. Schema translation from the local schema to the component schema is accompanied by generation of mappings to transform commands on a component schema into commands on the corresponding local schema. The export schema represents the subset of a component schema that is available to the FDBS. The federated schema is the global schema or view, which is the result of integrating all the shareable export schemas. The external schemas define the schema for a user group or an application, as in the three-level schema architecture.⁶

All the problems related to query processing, transaction processing, and directory and metadata management and recovery apply to FDBSs with additional considerations. It is not within our scope to discuss them in detail here.

6. For a detailed discussion of the autonomies and the five-level architecture of FDBMSs, see Sheth and Larson (1990).

25.4 QUERY PROCESSING IN DISTRIBUTED DATABASES

We now give an overview of how a DDBMS processes and optimizes a query. We first discuss the communication costs of processing a distributed query; we then discuss a special operation, called a semijoin, that is used in optimizing some types of queries in a DDBMS.
FIGURE 25.5 The five-level schema architecture in a federated database system (FDBS). Source: Adapted from Sheth and Larson, Federated Database Systems for Managing Distributed Heterogeneous Autonomous Databases. ACM Computing Surveys (Vol. 22, No. 3, September 1990).

25.4.1 Data Transfer Costs of Distributed Query Processing

We discussed the issues involved in processing and optimizing a query in a centralized DBMS in Chapter 15. In a distributed system, several additional factors further complicate query processing. The first is the cost of transferring data over the network. This data includes intermediate files that are transferred to other sites for further processing, as well as the final result files that may have to be transferred to the site where the query result is needed. Although these costs may not be very high if the sites are connected via a high-performance local area network, they become quite significant in other types of networks. Hence, DDBMS query optimization algorithms consider the goal of reducing the amount of data transfer as an optimization criterion in choosing a distributed query execution strategy.

We illustrate this with two simple example queries. Suppose that the EMPLOYEE and DEPARTMENT relations of Figure 5.5 are distributed as shown in Figure 25.6. We will assume in this example that neither relation is fragmented. According to Figure 25.6, the size of the EMPLOYEE relation is 100 * 10,000 = $10^6$ bytes, and the size of the DEPARTMENT relation is 35 * 100 = 3500 bytes.
FIGURE 25.6 Example to illustrate volume of data transferred.
SITE 1: EMPLOYEE. 10,000 records; each record is 100 bytes long. SSN field is 9 bytes long; DNO field is 4 bytes long; FNAME field is 15 bytes long; LNAME field is 15 bytes long.
SITE 2: DEPARTMENT (DNAME, DNUMBER, MGRSSN, MGRSTARTDATE). 100 records; each record is 35 bytes long. DNUMBER field is 4 bytes long; DNAME field is 10 bytes long; MGRSSN field is 9 bytes long.

Consider the query Q: "For each employee, retrieve the employee name and the name of the department for which the employee works." This can be stated as follows in the relational algebra:

Q: $\pi_{\text{FNAME, LNAME, DNAME}}(\text{EMPLOYEE} \bowtie_{\text{DNO = DNUMBER}} \text{DEPARTMENT})$

The result of this query will include 10,000 records, assuming that every employee is related to a department. Suppose that each record in the query result is 40 bytes long. The query is submitted at a distinct site 3, which is called the result site because the query result is needed there. Neither the EMPLOYEE nor the DEPARTMENT relation resides at site 3. There are three simple strategies for executing this distributed query:

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform the join at site 3. In this case a total of 1,000,000 + 3500 = 1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3. In this case 400,000 + 3500 = 403,500 bytes must be transferred.

If minimizing the amount of data transfer is our optimization criterion, we should choose strategy 3. Now consider another query Q': "For each department, retrieve the department name and the name of the department manager."
This can be stated as follows in the relational algebra:

Q': $\pi_{\text{FNAME, LNAME, DNAME}}(\text{DEPARTMENT} \bowtie_{\text{MGRSSN = SSN}} \text{EMPLOYEE})$

Again, suppose that the query is submitted at site 3. The same three strategies for executing query Q apply to Q', except that the result of Q' includes only 100 records, assuming that each department has a manager:

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and perform the join at site 3. In this case a total of 1,000,000 + 3500 = 1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2, execute the join at site 2, and send the result to site 3. The size of the query result is 40 * 100 = 4000 bytes, so 4000 + 1,000,000 = 1,004,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1, execute the join at site 1, and send the result to site 3. In this case 4000 + 3500 = 7500 bytes must be transferred.

Again, we would choose strategy 3, in this case by an overwhelming margin over strategies 1 and 2. The preceding three strategies are the most obvious ones for the case where the result site (site 3) is different from all the sites that contain files involved in the query (sites 1 and 2). However, suppose that the result site is site 2; then we have two simple strategies:

1. Transfer the EMPLOYEE relation to site 2, execute the query, and present the result to the user at site 2. Here, the same number of bytes (1,000,000) must be transferred for both Q and Q'.
2. Transfer the DEPARTMENT relation to site 1, execute the query at site 1, and send the result back to site 2. In this case 400,000 + 3500 = 403,500 bytes must be transferred for Q and 4000 + 3500 = 7500 bytes for Q'.

A more complex strategy, which sometimes works better than these simple strategies, uses an operation called semijoin. We introduce this operation and discuss distributed execution using semijoins next.
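The byte counts for the three strategies with result site 3 can be reproduced with a few lines of arithmetic. This is only a sketch of the cost comparison, using the sizes given for Figure 25.6; the function and variable names are hypothetical, and the model counts transferred bytes only, ignoring local processing cost.

```python
# Sizes taken from the example: record counts and record lengths in bytes.
EMP_RECORDS, EMP_RECORD_BYTES = 10_000, 100
DEPT_RECORDS, DEPT_RECORD_BYTES = 100, 35
RESULT_RECORD_BYTES = 40

EMPLOYEE_BYTES = EMP_RECORDS * EMP_RECORD_BYTES        # 1,000,000
DEPARTMENT_BYTES = DEPT_RECORDS * DEPT_RECORD_BYTES    # 3,500


def strategy_costs(result_records):
    """Bytes transferred by each simple strategy when the result site is site 3."""
    result_bytes = result_records * RESULT_RECORD_BYTES
    return {
        1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both relations to site 3
        2: EMPLOYEE_BYTES + result_bytes,      # ship EMPLOYEE to site 2, result to site 3
        3: DEPARTMENT_BYTES + result_bytes,    # ship DEPARTMENT to site 1, result to site 3
    }


costs_q = strategy_costs(10_000)      # Q: one result record per employee
costs_q_prime = strategy_costs(100)   # Q': one result record per department
best_q = min(costs_q, key=costs_q.get)
best_q_prime = min(costs_q_prime, key=costs_q_prime.get)
```

Under these assumptions strategy 3 has the lowest transfer volume for both queries, matching the figures worked out above.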
25.4.2 Distributed Query Processing Using Semijoin

The idea behind distributed query processing using the semijoin operation is to reduce the number of tuples in a relation before transferring it to another site. Intuitively, the idea is to send the joining column of one relation R to the site where the other relation S is located; this column is then joined with S. Following that, the join attributes, along with the attributes required in the result, are projected out and shipped back to the original site and joined with R. Hence, only the joining column of R is transferred in one direction, and a subset of S with no extraneous tuples or attributes is transferred in the other direction. If only a small fraction of the tuples in S participate in the join, this can be quite an efficient solution to minimizing data transfer.

To illustrate this, consider the following strategy for executing Q or Q':

1. Project the join attributes of DEPARTMENT at site 2, and transfer them to site 1. For Q, we transfer $F = \pi_{\text{DNUMBER}}(\text{DEPARTMENT})$, whose size is 4 * 100 = 400 bytes, whereas, for Q', we transfer $F' = \pi_{\text{MGRSSN}}(\text{DEPARTMENT})$, whose size is 9 * 100 = 900 bytes.
2. Join the transferred file with the EMPLOYEE relation at site 1, and transfer the required attributes from the resulting file to site 2. For Q, we transfer $R = \pi_{\text{DNO, FNAME, LNAME}}(F \bowtie_{\text{DNUMBER = DNO}} \text{EMPLOYEE})$, whose size is 34 * 10,000 = 340,000 bytes, whereas, for Q', we transfer $R' = \pi_{\text{MGRSSN, FNAME, LNAME}}(F' \bowtie_{\text{MGRSSN = SSN}} \text{EMPLOYEE})$, whose size is 39 * 100 = 3900 bytes.
3. Execute the query by joining the transferred file R or R' with DEPARTMENT, and present the result to the user at site 2.

Using this strategy, we transfer 340,400 bytes for Q and 4800 bytes for Q'. We limited the EMPLOYEE attributes and tuples transmitted to site 2 in step 2 to only those that will actually be joined with a DEPARTMENT tuple in step 3.
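The three steps can be walked through on a toy instance of Q'. The sketch below is illustrative only: both "sites" are ordinary Python lists in one process, the sample rows are assumed, and the helper semijoin_bytes is a hypothetical name that simply adds the two transfers described in steps 1 and 2.

```python
# Site 1 holds EMPLOYEE; site 2 holds DEPARTMENT (small illustrative samples).
EMPLOYEE = [
    {"SSN": "333445555", "FNAME": "Franklin", "LNAME": "Wong", "DNO": 5},
    {"SSN": "987654321", "FNAME": "Jennifer", "LNAME": "Wallace", "DNO": 4},
    {"SSN": "123456789", "FNAME": "John", "LNAME": "Smith", "DNO": 5},
    {"SSN": "999887777", "FNAME": "Alicia", "LNAME": "Zelaya", "DNO": 4},
]
DEPARTMENT = [
    {"DNAME": "Research", "DNUMBER": 5, "MGRSSN": "333445555"},
    {"DNAME": "Administration", "DNUMBER": 4, "MGRSSN": "987654321"},
]

# Step 1 (site 2): project the join column, F' = pi_MGRSSN(DEPARTMENT),
# and ship it to site 1.
f_prime = {d["MGRSSN"] for d in DEPARTMENT}

# Step 2 (site 1): join F' with EMPLOYEE on MGRSSN = SSN and keep only the
# attributes needed later; ship R' back to site 2.
r_prime = [
    {"MGRSSN": e["SSN"], "FNAME": e["FNAME"], "LNAME": e["LNAME"]}
    for e in EMPLOYEE
    if e["SSN"] in f_prime
]

# Step 3 (site 2): join R' with DEPARTMENT to produce the result of Q'.
by_manager = {r["MGRSSN"]: r for r in r_prime}
result = [
    (by_manager[d["MGRSSN"]]["FNAME"], by_manager[d["MGRSSN"]]["LNAME"], d["DNAME"])
    for d in DEPARTMENT
    if d["MGRSSN"] in by_manager
]


def semijoin_bytes(join_attr_bytes, dept_records, shipped_record_bytes, shipped_records):
    """Total bytes moved: join column one way, reduced relation the other way."""
    return join_attr_bytes * dept_records + shipped_record_bytes * shipped_records


# Field sizes from Figure 25.6: DNUMBER/DNO 4, MGRSSN/SSN 9, FNAME 15, LNAME 15.
bytes_q = semijoin_bytes(4, 100, 4 + 15 + 15, 10_000)     # 400 + 340,000
bytes_q_prime = semijoin_bytes(9, 100, 9 + 15 + 15, 100)  # 900 + 3,900
```

Only two of the four sample EMPLOYEE rows are shipped in step 2, which is the reduction the strategy is designed to achieve.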
For query Q, this turned out to include all EMPLOYEE tuples, so little improvement was achieved. However, for Q' only 100 out of the 10,000 EMPLOYEE tuples were needed.

The semijoin operation was devised to formalize this strategy. A semijoin operation $R \ltimes_{A=B} S$, where A and B are domain-compatible attributes of R and S, respectively, produces the same result as the relational algebra expression $\pi_R(R \bowtie_{A=B} S)$. In a distributed environment where R and S reside at different sites, the semijoin is typically implemented by first transferring $F = \pi_B(S)$ to the site where R resides and then joining F with R, thus leading to the strategy discussed here.

Notice that the semijoin operation is not commutative; that is,

$R \ltimes S \neq S \ltimes R$

25.4.3 Query and Update Decomposition

In a DDBMS with no distribution transparency, the user phrases a query directly in terms of specific fragments. For example, consider another query Q: "Retrieve the names and hours per week for each employee who works on some project controlled by department 5," which is specified on the distributed database where the relations at sites 2 and 3 are shown in Figure 25.3, and those at site 1 are shown in Figure 5.6, as in our earlier example. A user who submits such a query must specify whether it references the PROJS5 and WORKS_ON5 relations at site 2 (Figure 25.3) or the PROJECT and WORKS_ON relations at site 1 (Figure 5.6). The user must also maintain consistency of replicated data items when updating a DDBMS with no replication transparency.

On the other hand, a DDBMS that supports full distribution, fragmentation, and replication transparency allows the user to specify a query or update request on the schema of Figure 5.5 just as though the DBMS were centralized. For updates, the DDBMS is responsible for maintaining consistency among replicated items by using one of the distributed concurrency control algorithms to be discussed in Section 25.5.
For queries, a query decomposition module must break up or decompose a query into subqueries that can be executed at the individual sites. In addition, a strategy for combining the results of the subqueries to form the query result must be generated. Whenever the DDBMS determines that an item referenced in the query is replicated, it must choose or materialize a particular replica during query execution.

To determine which replicas include the data items referenced in a query, the DDBMS refers to the fragmentation, replication, and distribution information stored in the DDBMS catalog. For vertical fragmentation, the attribute list for each fragment is kept in the catalog. For horizontal fragmentation, a condition, sometimes called a guard, is kept for each fragment. This is basically a selection condition that specifies which tuples exist in the fragment; it is called a guard because only tuples that satisfy this condition are permitted to be stored in the fragment. For mixed fragments, both the attribute list and the guard condition are kept in the catalog.

In our earlier example, the guard conditions for fragments at site 1 (Figure 5.6) are TRUE (all tuples), and the attribute lists are * (all attributes). For the fragments shown in Figure 25.3, we have the guard conditions and attribute lists shown in Figure 25.7. When the DDBMS decomposes an update request, it can determine which fragments must be updated by examining their guard conditions.
For example, a user request to insert a new EMPLOYEE tuple <'Alex', 'B', 'Coleman', '345671239', '22-APR-64', '3306 Sandstone, Houston, TX', M, 33000, '987654321', 4> would be decomposed by the DDBMS into two insert requests: the first inserts the preceding tuple in the EMPLOYEE fragment at site 1, and the second inserts the projected tuple <'Alex', 'B', 'Coleman', '345671239', 33000, '987654321', 4> in the EMPD4 fragment at site 3.

For query decomposition, the DDBMS can determine which fragments may contain the required tuples by comparing the query condition with the guard conditions.

(a) Site 2 fragments

EMPD5
  attribute list: FNAME, MINIT, LNAME, SSN, SALARY, SUPERSSN, DNO
  guard condition: DNO=5
DEP5
  attribute list: * (all attributes DNAME, DNUMBER, MGRSSN, MGRSTARTDATE)
  guard condition: DNUMBER=5
DEP5_LOCS
  attribute list: * (all attributes DNUMBER, LOCATION)
  guard condition: DNUMBER=5
PROJS5
  attribute list: * (all attributes PNAME, PNUMBER, PLOCATION, DNUM)
  guard condition: DNUM=5
WORKS_ON5
  attribute list: * (all attributes ESSN, PNO, HOURS)
  guard condition: ESSN IN ($\pi_{SSN}$(EMPD5)) OR PNO IN ($\pi_{PNUMBER}$(PROJS5))

(b) Site 3 fragments

EMPD4
  attribute list: FNAME, MINIT, LNAME, SSN, SALARY, SUPERSSN, DNO
  guard condition: DNO=4
DEP4
  attribute list: * (all attributes DNAME, DNUMBER, MGRSSN, MGRSTARTDATE)
  guard condition: DNUMBER=4
DEP4_LOCS
  attribute list: * (all attributes DNUMBER, LOCATION)
  guard condition: DNUMBER=4
PROJS4
  attribute list: * (all attributes PNAME, PNUMBER, PLOCATION, DNUM)
  guard condition: DNUM=4
WORKS_ON4
  attribute list: * (all attributes ESSN, PNO, HOURS)
  guard condition: ESSN IN ($\pi_{SSN}$(EMPD4)) OR PNO IN ($\pi_{PNUMBER}$(PROJS4))

FIGURE 25.7 Guard conditions and attribute lists for fragments. (a) Site 2 fragments. (b) Site 3 fragments.
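The decomposition of the insert request described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the catalog format of any real DDBMS: the `CATALOG` structure and the function `decompose_insert` are hypothetical, only the EMPLOYEE-related fragments are modeled, and the attribute names BDATE, ADDRESS, and SEX are assumed for the full EMPLOYEE schema because this section does not list them.

```python
# Full EMPLOYEE schema; BDATE, ADDRESS, and SEX are assumed attribute names.
EMPLOYEE_ATTRS = ["FNAME", "MINIT", "LNAME", "SSN", "BDATE", "ADDRESS",
                  "SEX", "SALARY", "SUPERSSN", "DNO"]
# Attribute list shared by the EMPD5 and EMPD4 fragments (Figure 25.7).
EMPD_ATTRS = ["FNAME", "MINIT", "LNAME", "SSN", "SALARY", "SUPERSSN", "DNO"]

# Hypothetical catalog: one entry per fragment, with its site, attribute
# list, and guard condition expressed as a predicate on a tuple.
CATALOG = [
    {"fragment": "EMPLOYEE", "site": 1, "attrs": EMPLOYEE_ATTRS,
     "guard": lambda t: True},            # guard TRUE, attribute list *
    {"fragment": "EMPD5", "site": 2, "attrs": EMPD_ATTRS,
     "guard": lambda t: t["DNO"] == 5},
    {"fragment": "EMPD4", "site": 3, "attrs": EMPD_ATTRS,
     "guard": lambda t: t["DNO"] == 4},
]


def decompose_insert(new_tuple):
    """Return one (site, fragment, projected tuple) request per fragment
    whose guard condition the new tuple satisfies."""
    requests = []
    for entry in CATALOG:
        if entry["guard"](new_tuple):
            projected = {a: new_tuple[a] for a in entry["attrs"]}
            requests.append((entry["site"], entry["fragment"], projected))
    return requests


alex = dict(zip(EMPLOYEE_ATTRS,
                ["Alex", "B", "Coleman", "345671239", "22-APR-64",
                 "3306 Sandstone, Houston, TX", "M", 33000, "987654321", 4]))

for site, fragment, projected in decompose_insert(alex):
    print(site, fragment, list(projected.values()))
```

With DNO = 4 the tuple satisfies the guards of EMPLOYEE at site 1 and EMPD4 at site 3 but not EMPD5, so two insert requests are produced, the second carrying only the seven attributes in the EMPD4 attribute list.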
As an example of query decomposition, consider the query Q: "Retrieve the names and hours per week for each employee who works on some project controlled by department 5"; this can be specified in SQL on the schema of Figure 5.5 as follows:

    Q: SELECT FNAME, LNAME, HOURS
       FROM   EMPLOYEE, PROJECT, WORKS_ON
       WHERE  DNUM=5 AND PNUMBER=PNO AND ESSN=SSN;

Suppose that the query is submitted at site 2, which is where the query result will be needed. The DDBMS can determine from the guard conditions on PROJS5 and WORKS_ON5 that all tuples satisfying the conditions (DNUM = 5 AND PNUMBER = PNO) reside at site 2. Hence, it may decompose the query into the following relational algebra subqueries:

    $T1 \leftarrow \pi_{ESSN}(PROJS5 \bowtie_{PNUMBER=PNO} WORKS\_ON5)$
    $T2 \leftarrow \pi_{ESSN, FNAME, LNAME}(T1 \bowtie_{ESSN=SSN} EMPLOYEE)$
    $RESULT \leftarrow \pi_{FNAME, LNAME, HOURS}(T2 * WORKS\_ON5)$

This decomposition can be used to execute the query by using a semijoin strategy. The DDBMS knows from the guard conditions that PROJS5 contains exactly those tuples satisfying (DNUM = 5) and that WORKS_ON5 contains all tuples to be joined with PROJS5; hence, subquery T1 can be executed at site 2, and the projected column ESSN can be sent to site 1. Subquery T2 can then be executed at site 1, and the result can be sent back to site 2, where the final query result is calculated and displayed to the user. An alternative strategy would be to send the query Q itself to site 1, which includes all the database tuples, where it would be executed locally and from which the result would be sent back to site 2. The query optimizer would estimate the costs of both strategies and would choose the one with the lower cost estimate.

25.5 OVERVIEW OF CONCURRENCY CONTROL AND RECOVERY IN DISTRIBUTED DATABASES

For concurrency control and recovery purposes, numerous problems arise in a distributed DBMS environment that are not encountered in a centralized DBMS environment.
These include the following:

• Dealing with multiple copies of the data items: The concurrency control method is responsible for maintaining consistency among these copies. The recovery method is responsible for making a copy consistent with other copies if the site on which the copy is stored fails and recovers later.

• Failure of individual sites: The DDBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site recovers, its local database must be brought up to date with the rest of the sites before it rejoins the system.

• Failure of communication links: The system must be able to deal with failure of one or more of the communication links that connect the sites. An extreme case of this problem is that network partitioning may occur. This breaks up the sites into two or more partitions, where the sites within each partition can communicate only with one another and not with sites in other partitions.

• Distributed commit: Problems can arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during the commit process. The two-phase commit protocol (see Chapter 19) is often used to deal with this problem.

• Distributed deadlock: Deadlock may occur among several sites, so techniques for dealing with deadlocks must be extended to take this into account.

Distributed concurrency control and recovery techniques must deal with these and other problems. In the following subsections, we review some of the techniques that have been suggested to deal with recovery and concurrency control in DDBMSs.

25.5.1 Distributed Concurrency Control Based on a Distinguished Copy of a Data Item

To deal with replicated data items in a distributed database, a number of concurrency control methods have been proposed that extend the concurrency control techniques for centralized databases.
We discuss these techniques in the context of extending centralized locking. Similar extensions apply to other concurrency control techniques. The idea is to designate a particular copy of each data item as a distinguished copy. The locks for this data item are associated with the distinguished copy, and all locking and unlocking requests are sent to the site that contains that copy.

A number of different methods are based on this idea, but they differ in their method of choosing the distinguished copies. In the primary site technique, all distinguished copies are kept at the same site. A modification of this approach is the primary site with a backup site. Another approach is the primary copy method, where the distinguished copies of the various data items can be stored in different sites. A site that includes a distinguished copy of a data item basically acts as the coordinator site for concurrency control on that item. We discuss these techniques next.

Primary Site Technique. In this method a single primary site is designated to be the coordinator site for all database items. Hence, all locks are kept at that site, and all requests for locking or unlocking are sent there. This method is thus an extension of the centralized locking approach. For example, if all transactions follow the two-phase locking protocol, serializability is guaranteed. The advantage of this approach is that it is a simple extension of the centralized approach and hence is not overly complex. However, it has certain inherent disadvantages. One is that all locking requests are sent to a single site, possibly overloading that site and causing a system bottleneck. A second disadvantage is that failure of the primary site paralyzes the system, since all locking information is kept at that site. This can limit system reliability and availability.
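To make the role of the primary site concrete, the following Python sketch shows the kind of lock table a single coordinator site might keep. It is an illustration under simplifying assumptions rather than a description of any particular system: the class and method names are hypothetical, conflicting requests are simply denied instead of being queued, and message passing between sites is omitted.

```python
class PrimarySiteLockManager:
    """Lock table held at the single primary site (illustrative sketch)."""

    def __init__(self):
        self.readers = {}  # item -> set of transactions holding a READ_LOCK
        self.writer = {}   # item -> transaction holding the WRITE_LOCK

    def read_lock(self, txn, item):
        """Grant a shared lock unless another transaction holds the write lock."""
        holder = self.writer.get(item)
        if holder is not None and holder != txn:
            return False
        self.readers.setdefault(item, set()).add(txn)
        return True

    def write_lock(self, txn, item):
        """Grant an exclusive lock only if no other transaction holds any lock."""
        holder = self.writer.get(item)
        if holder is not None and holder != txn:
            return False
        if self.readers.get(item, set()) - {txn}:
            return False
        self.writer[item] = txn
        return True

    def unlock(self, txn, item):
        """Release whatever locks txn holds on item."""
        if self.writer.get(item) == txn:
            del self.writer[item]
        self.readers.get(item, set()).discard(txn)


manager = PrimarySiteLockManager()
print(manager.read_lock("T1", "X"))   # True
print(manager.read_lock("T2", "X"))   # True: read locks are shared
print(manager.write_lock("T1", "X"))  # False: T2 still reads X
manager.unlock("T2", "X")
print(manager.write_lock("T1", "X"))  # True
print(manager.read_lock("T2", "X"))   # False: T1 holds the write lock
```

Because every request from every site goes through this one object, the sketch also shows why the primary site can become both a bottleneck and a single point of failure.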
Although all locks are accessed at the primary site, the items themselves can be accessed at any site at which they reside. For example, once a transaction obtains a READ_LOCK on a data item from the primary site, it can access any copy of that data item. However, once a transaction obtains a WRITE_LOCK and updates a data item, the DDBMS is responsible for updating all copies of the data item before releasing the lock.

Primary Site with Backup Site. This approach addresses the second disadvantage of the primary site method by designating a second site to be a backup site. All locking information is maintained at both the primary and the backup sites. In case of primary site failure, the backup site takes over as primary site, and a new backup site is chosen. This simplifies the process of recovery from failure of the primary site, since the backup site takes over and processing can resume after a new backup site is chosen and the lock status information is copied to that site. It slows down the process of acquiring locks, however, because all lock requests and granting of locks must be recorded at both the primary and the backup sites before a response is sent to the requesting transaction. The problem of the primary and backup sites becoming overloaded with requests and slowing down the system remains undiminished.

Primary Copy Technique. This method attempts to distribute the load of lock coordination among various sites by having the distinguished copies of different data items stored at different sites. Failure of one site affects any transactions that are accessing locks on items whose primary copies reside at that site, but other transactions are not affected. This method can also use backup sites to enhance reliability and availability.

Choosing a New Coordinator Site in Case of Failure.
Whenever a coordinator site fails in any of the preceding techniques, the sites that are still running must choose a new coordinator. In the case of the primary site approach with no backup site, all executing transactions must be aborted and restarted in a tedious recovery process. Part of the recovery process involves choosing a new primary site and creating a lock manager process and a record of all lock information at that site. For methods that use backup sites, transaction processing is suspended while the backup site is designated as the new primary site and a new backup site is chosen and is sent copies of all the locking information from the new primary site.

If a backup site X is about to become the new primary site, X can choose the new backup site from among the system's running sites. However, if no backup site existed, or if both the primary and the backup sites are down, a process called election can be used to choose the new coordinator site. In this process, any site Y that attempts to communicate with the coordinator site repeatedly and fails to do so can assume that the coordinator is down and can start the election process by sending a message to all running sites proposing that Y become the new coordinator. As soon as Y receives a majority of yes votes, Y can declare that it is the new coordinator. The election algorithm itself is quite complex, but this is the main idea behind the election method. The algorithm also resolves any attempt by two or more sites to become coordinator at the same time. The references in the Selected Bibliography at the end of this chapter discuss the process in detail.

25.5.2 Distributed Concurrency Control Based on Voting

The concurrency control methods for replicated items discussed earlier all use the idea of a distinguished copy that maintains the locks for that item.
In the voting method, there is no distinguished copy; rather, a lock request is sent to all sites that include a copy of the data item. Each copy maintains its own lock and can grant or deny the request for it. If a transaction that requests a lock is granted that lock by a majority of the copies, it holds the lock and informs all copies that it has been granted the lock. If a transaction does not receive a majority of votes granting it a lock within a certain time-out period, it cancels its request and informs all sites of the cancellation.

The voting method is considered a truly distributed concurrency control method, since the responsibility for a decision resides with all the sites involved. Simulation studies have shown that voting has higher message traffic among sites than do the distinguished copy methods. If the algorithm takes into account possible site failures during the voting process, it becomes extremely complex.

25.5.3 Distributed Recovery

The recovery process in distributed databases is quite involved. We give only a very brief idea of some of the issues here. In some cases it is quite difficult even to determine whether a site is down without exchanging numerous messages with other sites. For example, suppose that site X sends a message to site Y and expects a response from Y but does not receive it. There are several possible explanations:

• The message was not delivered to Y because of communication failure.

• Site Y is down and could not respond.

• Site Y is running and sent a response, but the response was not delivered.

Without additional information or the sending of additional messages, it is difficult to determine what actually happened.

Another problem with distributed recovery is distributed commit. When a transaction is updating data at several sites, it cannot commit until it is sure that the effect of the transaction on every site cannot be lost.
This means that every site must first have recorded the local effects of the transactions permanently in the local site log on disk. The two-phase commit protocol, discussed in Section 19.6, is often used to ensure the correctness of distributed commit.

25.6 AN OVERVIEW OF 3-TIER CLIENT-SERVER ARCHITECTURE

As we pointed out in the chapter introduction, full-scale DDBMSs have not been developed to support all the types of functionalities that we discussed so far. Instead, distributed database applications are being developed in the context of the client-server architectures. We already introduced the two-tier client-server architecture in Section 2.5. It is now more common to use a three-tier architecture, particularly in Web applications. This architecture is illustrated in Figure 25.8.

In the three-tier client-server architecture, the following three layers exist:

1. Presentation layer (client): This provides the user interface and interacts with the user. The programs at this layer present Web interfaces or forms to the client in order to interface with the application. Web browsers are often utilized, and the languages used include HTML, JAVA, JavaScript, PERL, Visual Basic, and so on. This layer handles user input, output, and navigation by accepting user commands and displaying the needed information, usually in the form of static or dynamic Web pages. The latter are employed when the interaction involves database access. When a Web interface is used, this layer typically communicates with the application layer via the HTTP protocol.

2. Application layer (business logic): This layer implements the application logic. For example, queries can be formulated based on user input from the client, or query results can be formatted and sent to the client for presentation. Additional application functionality can be handled at this layer, such as security checks, identity verification, and other functions. The application layer can interact with one or more databases or data sources as needed by connecting to the database using ODBC, JDBC, SQL/CLI, or other database access techniques.

3. Database server: This layer handles query and update requests from the application layer, processes the requests, and sends the results. Usually SQL is used to access the database if it is relational or object-relational, and stored database procedures may also be invoked. Query results (and queries) may be formatted into XML (see Chapter 26) when transmitted between the application server and the database server.

FIGURE 25.8 The three-tier client-server architecture.

Exactly how to divide the DBMS functionality between client, application server, and database server may vary. The common approach is to include the functionality of a centralized DBMS at the database server level. A number of relational DBMS products have taken this approach, where an SQL server is provided. The application server must then formulate the appropriate SQL queries and connect to the database server when needed. The client provides the processing for user interface interactions. Since SQL is a relational standard, various SQL servers, possibly provided by different vendors, can accept SQL commands through standards such as ODBC, JDBC, and SQL/CLI (see Chapter 9).

In this architecture, the application server may also refer to a data dictionary that includes information on the distribution of data among the various SQL servers, as well as modules for decomposing a global query into a number of local queries that can be executed at the various sites.
Interaction between application server and database server might proceed as follows during the processing of an SQL query:

1. The application server formulates a user query based on input from the client layer and decomposes it into a number of independent site queries. Each site query is sent to the appropriate database server site.

2. Each database server processes the local query and sends the results to the application server site. Increasingly, XML is being touted as the standard for data exchange (see Chapter 26), so the database server may format the query result into XML before sending it to the application server.

3. The application server combines the results of the subqueries to produce the result of the originally required query, formats it into HTML or some other form accepted by the client, and sends it to the client site for display.

The application server is responsible for generating a distributed execution plan for a multisite query or transaction and for supervising distributed execution by sending commands to servers. These commands include local queries and transactions to be executed, as well as commands to transmit data to other clients or servers. Another function controlled by the application server (or coordinator) is that of ensuring consistency of replicated copies of a data item by employing distributed (or global) concurrency control techniques. The application server must also ensure the atomicity of global transactions by performing global recovery when certain sites fail. We discussed distributed recovery and concurrency control in Section 25.5.

If the DDBMS has the capability to hide the details of data distribution from the application server, then it enables the application server to execute global queries and transactions as though the database were centralized, without having to specify the sites at which the data referenced in the query or transaction resides. This property is called distribution transparency.
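The three-step interaction between application server and database servers listed above can be sketched as a small scatter-and-gather routine. Everything in this Python example is a hypothetical stand-in: the two sites are in-memory lists, the sample rows are made up for illustration, and a real application server would reach its database servers through an interface such as ODBC or JDBC rather than a function call.

```python
# Hypothetical data held by two database server sites: (name, dept, salary).
SITE_DATA = {
    "site2": [("Smith", 5, 30000), ("Wong", 5, 40000)],
    "site3": [("Wallace", 4, 43000), ("Zelaya", 4, 25000)],
}


def database_server(site, min_salary):
    """Step 2: a site evaluates its local query and returns matching rows."""
    return [(name, dno) for name, dno, salary in SITE_DATA[site]
            if salary >= min_salary]


def application_server(min_salary):
    """Steps 1 and 3: decompose the request, then combine and format results."""
    # Step 1: one independent site query per database server site.
    site_queries = {site: min_salary for site in SITE_DATA}
    # Step 2 happens at the sites; here it is a plain function call.
    partial_results = [database_server(site, arg)
                       for site, arg in site_queries.items()]
    # Step 3: merge the partial results and format them for the client.
    rows = sorted(row for part in partial_results for row in part)
    items = "".join(f"<li>{name} (dept {dno})</li>" for name, dno in rows)
    return f"<ul>{items}</ul>"


print(application_server(30000))
# <ul><li>Smith (dept 5)</li><li>Wallace (dept 4)</li><li>Wong (dept 5)</li></ul>
```

The client never sees `SITE_DATA` or the per-site queries; in this toy setting that separation is what distribution transparency amounts to from the client's point of view.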
Some DDBMSs do not provide distribution transparency, instead requiring that applications be aware of the details of data distribution.

25.7 DISTRIBUTED DATABASES IN ORACLE

In the client-server architecture, the Oracle database system is divided into two parts: (1) a front-end as the client portion, and (2) a back-end as the server portion. The client portion is the front-end database application that interacts with the user. The client has no data access responsibility and merely handles the requesting, processing, and presentation of data managed by the server. The server portion runs Oracle and handles the functions related to concurrent shared access. It accepts SQL and PL/SQL statements originating from client applications, processes them, and sends the results back to the client. Oracle client-server applications provide location transparency by making location of data transparent to users; several features like views, synonyms, and procedures contribute to this. Global naming is achieved by using <TABLENAME@DATABASENAME> to refer to tables uniquely.

Oracle uses a two-phase commit protocol to deal with concurrent distributed transactions. The COMMIT statement triggers the two-phase commit mechanism. The RECO (recoverer) background process automatically resolves the outcome of those distributed transactions in which the commit was interrupted. The RECO of each local Oracle Server automatically commits or rolls back any "in-doubt" distributed transactions consistently on all involved nodes. For long-term failures, Oracle allows each local DBA to manually commit or roll back any in-doubt transactions and free up resources. Global consistency can be maintained by restoring the database at each site to a predetermined fixed point in the past.

Oracle's distributed database architecture is shown in Figure 25.9.
A node in a distributed database system can act as a client, as a server, or both, depending on the situation. The figure shows two sites where databases called HQ (headquarters) and Sales are kept. For example, in the application shown running at the headquarters, for an SQL statement issued against local data (for example, DELETE FROM DEPT ...), the HQ computer acts as a server, whereas for a statement against remote data (for example, INSERT INTO EMP@SALES), the HQ computer acts as a client.

[Figure: two database servers, one holding the HQ database (DEPT table) and one holding the Sales database (EMP table), connected through Net8. An application at HQ runs a transaction consisting of INSERT INTO EMP@SALES ...; DELETE FROM DEPT ...; SELECT ... FROM EMP@SALES ...; COMMIT;]

FIGURE 25.9 Oracle distributed database systems. Source: From Oracle (1997a). Copyright © Oracle Corporation 1997. All rights reserved.

All Oracle databases in a distributed database system (DDBS) use Oracle's networking software Net8 for interdatabase communication. Net8 allows databases to communicate across networks to support remote and distributed transactions. It packages SQL statements into one of the many communication protocols to facilitate client to server communication and then packages the results back similarly to the client. Each database has a unique global name provided by a hierarchical arrangement of network domain names that is prefixed to the database name to make it unique.

Oracle supports database links that define a one-way communication path from one Oracle database to another. For example,

    CREATE DATABASE LINK sales.us.americas;

establishes a connection to the sales database in Figure 25.9 under the network domain us that comes under domain americas.

Data in an Oracle DDBS can be replicated using snapshots or replicated master tables. Replication is provided at the following levels:

• Basic replication: Replicas of tables are managed for read-only access. For updates, data must be accessed at a single primary site.

• Advanced (symmetric) replication: This extends beyond basic replication by allowing applications to update table replicas throughout a replicated DDBS. Data can be read and updated at any site. This requires additional software called Oracle's advanced replication option.

A snapshot generates a copy of a part of the table by means of a query called the snapshot defining query. A simple snapshot definition looks like this:

    CREATE SNAPSHOT sales.orders AS
    SELECT * FROM sales.orders@hq.us.americas;

Oracle groups snapshots into refresh groups. By specifying a refresh interval, the snapshot is automatically refreshed periodically at that interval by up to ten Snapshot Refresh Processes (SNPs). If the defining query of a snapshot contains a distinct or aggregate function, a GROUP BY or CONNECT BY clause, or join or set operations, the snapshot is termed a complex snapshot and requires additional processing. Oracle (up to version 7.3) also supports ROWID snapshots that are based on physical row identifiers of rows in the master table.

Heterogeneous Databases in Oracle. In a heterogeneous DDBS, at least one database is a non-Oracle system. Oracle Open Gateways provides access to a non-Oracle database from an Oracle server, which uses a database link to access data or to execute remote procedures in the non-Oracle system. The Open Gateways feature includes the following:

• Distributed transactions: Under the two-phase commit mechanism, transactions may span Oracle and non-Oracle systems.
• Transparent SQL access: SQL statements issued by an application are transparently transformed into SQL statements understood by the non-Oracle system.

• Pass-through SQL and stored procedures: An application can directly access a non-Oracle system using that system's version of SQL. Stored procedures in a non-Oracle SQL-based system are treated as if they were PL/SQL remote procedures.

• Global query optimization: Cardinality information, indexes, etc., at the non-Oracle system are accounted for by the Oracle Server query optimizer to perform global query optimization.

• Procedural access: Procedural systems like messaging or queuing systems are accessed by the Oracle server using PL/SQL remote procedure calls.

In addition to the above, data dictionary references are translated to make the non-Oracle data dictionary appear as a part of the Oracle Server's dictionary. Character set translations are done between national language character sets to connect multilingual databases.

25.8 SUMMARY

In this chapter we provided an introduction to distributed databases. This is a very broad topic, and we discussed only some of the basic techniques used with distributed databases. We first discussed the reasons for distribution and the potential advantages of distributed databases over centralized systems. We also defined the concept of distribution transparency and the related concepts of fragmentation transparency and replication transparency. We discussed the design issues related to data fragmentation, replication, and distribution, and we distinguished between horizontal and vertical fragments of relations. We discussed the use of data replication to improve system reliability and availability. We categorized DDBMSs by using criteria such as degree of homogeneity of software modules and degree of local autonomy.
We dis- Review Questions I 833 cussed the issues of federated database management in some detail focusing on the needs of supporting various types of autonomies and dealing with semantic heterogeneity. We illustrated some of the techniques used in distributed query processing, and discussed the cost of communication among sites, which is considered a major factor in distributed query optimization. We compared different techniques for executing joins and presented the semijoin technique for joining relations that reside on different sites. We briefly discussed the concurrency control and recovery techniques used in DDBMSs. We reviewed some of the additional problems that must be dealt with in a distributed environment that do not appear in a centralized environment. We then discussed the client-server architecture concepts and related them to distributed databases, and we described some of the facilities in Oracle to support distributed databases. Review Questions 25.1. What are the main reasons for and potential advantages of distributed databases? 25.2. What additional functions does a DDBMS have over a centralized DBMS? 25.3. What are the main software modules of a DDBMS? Discuss the main functions of each of these modules in the context of the client-server architecture. 25.4. What is a fragment of a relation? What are the main types of fragments? Why is fragmentation a useful concept in distributed database design? 25.5. Why is data replication useful in DDBMSs? What typical units of data are replicated? 25.6. What is meant by data allocation in distributed database design? What typical units of data are distributed over sites? 25.7. How is a horizontal partitioning of a relation specified? How can a relation be put back together from a complete horizontal partitioning? 25.8. How is a vertical partitioning of a relation specified? How can a relation be put back together from a complete vertical partitioning? 25.9. 
Discuss what is meant by the following terms: degree of homogeneity of a DDBMS, degree of local autonomy of a DDBMS, federated DBMS, distribution transparency, frag? mentation transparency, replication transparency, multidatabase system. 25.10. Discuss the naming problem in distributed databases. 25.11. Discuss the different techniques for executing an equijoin of two files located at different sites. What main factors affect the cost of data transfer? 25.12. Discuss the semijoin method for executing an equijoin of two files located at dif? ferent sites. Under what conditions is an equijoin strategy efficient? 25.13. Discuss the factors that affect query decomposition. How are guard conditions and attribute lists of fragments used during the query decomposition process? 25.14. How is the decomposition of an update request different from the decomposition of a query? How are guard conditions and attribute lists of fragments used during the decomposition of an update request? 25.15. Discuss the factors that do not appear in centralized systems that affect concur? rency control and recovery in distributed systems. 834 I Chapter 25 Distributed Databases and Client-Server Architectures 25.16. Compare the primary site method with the primary copy method for distributed concurrency control. How does the use of backup sites affect each? 25.17. When are voting and elections used in distributed databases? 25.18. What are the software components in a client-server DDBMS? Compare the two? tier and three-tier client-server architectures. Exercises 25.19. Consider the data distribution of the COMPANY database, where the fragments at sites 2 and 3 are as shown in Figure 25.3 and the fragments at site 1 are as shown in Figure 5.6. For each of the following queries, show at least two strategies of decomposing and executing the query. Under what conditions would each of your strategies work well? a. 
For each employee in department 5, retrieve the employee name and the names of the employee's dependents.
  b. Print the names of all employees who work in department 5 but who work on some project not controlled by department 5.
25.20. Consider the following relations:
    BOOKS (Book#, Primary_author, Topic, Total_stock, $price)
    BOOKSTORE (Store#, City, State, Zip, Inventory_value)
    STOCK (Store#, Book#, Qty)
  TOTAL_STOCK is the total number of books in stock, and INVENTORY_VALUE is the total inventory value for the store in dollars.
  a. Give an example of two simple predicates that would be meaningful for the BOOKSTORE relation for horizontal partitioning.
  b. How would a derived horizontal partitioning of STOCK be defined based on the partitioning of BOOKSTORE?
  c. Show predicates by which BOOKS may be horizontally partitioned by topic.
  d. Show how STOCK may be further partitioned from the partitions in (b) by adding the predicates in (c).
25.21. Consider a distributed database for a bookstore chain called National Books with three sites called EAST, MIDDLE, and WEST. The relation schemas are given in Exercise 25.20. Consider that BOOKS are fragmented by $PRICE amounts into:
    B1: BOOK1: up to $20.
    B2: BOOK2: from $20.01 to $50.
    B3: BOOK3: from $50.01 to $100.
    B4: BOOK4: $100.01 and above.
  Similarly, BOOKSTORES are divided by Zipcodes into:
    S1: EAST: Zipcodes up to 35000.
    S2: MIDDLE: Zipcodes 35001 to 70000.
    S3: WEST: Zipcodes 70001 to 99999.
  Assume that STOCK is a derived fragment based on BOOKSTORE only.
  a. Consider the query:
      SELECT Book#, Total_stock
      FROM Books
      WHERE $price > 15 AND $price < 55;
    Assume that fragments of BOOKSTORE are non-replicated and assigned based on region. Assume further that BOOKS are allocated as:
      EAST: B1, B4
      MIDDLE: B1, B2
      WEST: B1, B2, B3, B4
    Assuming the query was submitted in EAST, what remote subqueries does it generate? (Write in SQL.)
  b.
If the book price of Book# = 1234 is updated from $45 to $55 at site MIDDLE, what updates does that generate? Write in English and then in SQL.
  c. Give an example query issued at WEST that will generate a subquery for MIDDLE.
  d. Write a query involving selection and projection on the above relations and show two possible query trees that denote different ways of execution.
25.22. Consider that you have been asked to propose a database architecture in a large organization, General Motors, as an example, to consolidate all data including legacy databases (from Hierarchical and Network models, which are explained in Appendices C and D; no specific knowledge of these models is needed) as well as relational databases, which are geographically distributed so that global applications can be supported. Assume that alternative one is to keep all databases as they are, while alternative two is to first convert them to relational and then support the applications over a distributed integrated database.
  a. Draw two schematic diagr