
6. Bioinformatics Databases

2012-07-02, 50 pages, PPT, 18 MB, 27 reads

Databases for Bioinformatics

Chen Yanjiong (陈艳炯), chenyanjiong@mail.xjtu.edu.cn
Department of Immunology and Pathogen Biology, School of Medicine

Fundamentals of database systems
    Basic concepts of databases
    Development of data management systems
    Development of database technology
    Components of a database system
    Architectures of database application systems

Data
    Definition: symbols that describe objective things (objects).
    Kinds: text, graphics, images and sound.
    Characteristic: data cannot be separated from its semantics.

The term data means groups of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data (plural of "datum", which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.

How the concept of data has changed
    In kind: from simple to integrated, and from private to shared.
    In quantity: from small, to large, to massive volumes.
    In role: from a subordinate position in software to the dominant one.

Information
Information is an abstract reflection of the things, events and concepts that actually exist in the objective world, carried by data.
    Information = data + data processing

Data processing
Computer data processing is any process that uses a computer program to enter data and summarise, analyse or otherwise convert data into usable information. The process may be automated and run on a computer. It involves recording, analysing, sorting, summarising, calculating, disseminating and storing data. Because data are most useful when well presented and actually informative, data-processing systems are often referred to as information systems.

Data analysis
When the domain from which the data are harvested is a science or an engineering discipline, "data processing" and "information systems" are considered too broad as terms, and the more specialized term "data analysis" is typically used. It focuses on the highly specialized and highly accurate algorithmic derivations and statistical calculations that are less often seen in a typical general business environment. Data analysis packages such as DAP, gretl or PSPP are often used.

Elements of data processing
In order to be processed by a computer, data must first be converted into a machine-readable format.
Once data is in digital format, various procedures can be applied to it to extract useful information. Data processing may involve various processes, including:
    Data acquisition
    Data entry
    Data cleaning
    Data validation
    Data tabulation
    Statistical analysis
    Computer graphics
    Data warehousing
    Data mining

Data acquisition
In computer data processing, data acquisition is the sampling of real-world physical conditions and the conversion of the resulting samples into digital numeric values that can be manipulated by a computer. The components of data acquisition systems include:
    Sensors that convert physical parameters to electrical signals.
    Signal conditioning circuitry that converts sensor signals into a form that can be digitized.
    Analog-to-digital converters, which convert conditioned sensor signals to digital values.
Depending on the application, acquired data may be displayed, analyzed, or recorded, or some combination thereof. Data acquisition applications may be controlled by commercial DAQ software or by custom programs developed in general-purpose programming languages such as BASIC or C. Specialized environments used for data acquisition include EPICS, for building large-scale data acquisition systems; LabVIEW, which offers a graphical programming environment; and MATLAB, which provides graphical tools and libraries for data acquisition and analysis.

Data cleansing, or data scrubbing, is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying or deleting this dirty data. After cleansing, a data set will be consistent with other similar data sets in the system.
The inconsistencies detected or removed may have been caused originally by different data dictionary definitions of similar entities in different stores, by user entry errors, or by corruption in transmission or storage. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data. The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records).

A data entry clerk is a member of staff who reads hand-written or printed records and types them into a computer. Data entry clerks are sometimes employed on a temporary basis, but most large companies that hold large amounts of data hire them on a near-permanent basis.

In computer science, data validation is the process of ensuring that a program operates on clean, correct and useful data. It uses routines, often called "validation rules" or "check routines", that check the correctness, meaningfulness and security of data that are input to the system. The rules may be implemented through the automated facilities of a data dictionary, or by including explicit validation logic in the application program. Incorrect data validation can lead to data corruption or a security vulnerability. Data validation checks that data are valid, sensible, reasonable and secure before they are processed.

Computer graphics are graphics created using computers and, more generally, the representation and manipulation of pictorial data by a computer. The development of computer graphics, often abbreviated CG, has made computers easier to interact with and better suited to understanding and interpreting many types of data.
Developments in computer graphics have had a profound impact on many types of media and have revolutionized the animation and video game industries.

Data mining is the process of extracting patterns from data. As more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently. The logical representation and physical storage of a data structure are reflected in the data's logical structure, its storage structure, the methods used to process it (algorithms), and the results of that processing.

The two main structures of a database are TABLES and INDEXES. Tables are the structures that store your data in the database. Each table is composed of a number of FIELDS, also known as COLUMNS in some database engines. Indexes do not store data, and you do not use them directly. They are used internally by the database engine to speed up certain search operations.

Field names and types are defined when you create a table. To create an index you have to specify the table, the field to be indexed, and the indexing order (ascending or descending). Indexes can also be UNIQUE, in which case the indexed field does not allow duplicate values to be inserted in different records or rows. For example, you could not have two employees with the same userid value if the userid field is indexed as UNIQUE.

Data manipulation
Data manipulation covers classifying, merging, sorting, accessing, retrieving, inputting, outputting and updating data, where updating includes inserting, deleting and modifying. A data manipulation language (DML) is a family of syntax elements, similar to a computer programming language, used for inserting, deleting and updating data in a database.
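The tables, UNIQUE indexes and insert/update/delete operations described above can be tried out with a small sketch. It uses SQL through Python's built-in sqlite3 module; the employee table, its columns other than userid, and the sample rows are hypothetical and chosen only to mirror the userid example in the text.

```python
import sqlite3


def build_employee_db():
    """Create an in-memory database with one table and one UNIQUE index."""
    conn = sqlite3.connect(":memory:")
    # Field names and types are defined when the table is created.
    conn.execute("CREATE TABLE employee (userid TEXT, name TEXT, dept TEXT)")
    # An index names the table, the indexed field and the order; UNIQUE forbids duplicates.
    conn.execute("CREATE UNIQUE INDEX idx_employee_userid ON employee (userid ASC)")
    return conn


def demo():
    conn = build_employee_db()
    # DML: insert two (hypothetical) records.
    conn.execute("INSERT INTO employee VALUES (?, ?, ?)", ("u001", "Li", "Immunology"))
    conn.execute("INSERT INTO employee VALUES (?, ?, ?)", ("u002", "Wang", "Microbiology"))
    # A second row with userid 'u001' violates the UNIQUE index.
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", ("u001", "Zhao", "Genetics"))
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    # DML: update one record and delete another.
    conn.execute("UPDATE employee SET dept = ? WHERE userid = ?", ("Genetics", "u002"))
    conn.execute("DELETE FROM employee WHERE userid = ?", ("u001",))
    rows = conn.execute(
        "SELECT userid, name, dept FROM employee ORDER BY userid"
    ).fetchall()
    conn.close()
    return duplicate_rejected, rows


if __name__ == "__main__":
    print(demo())
```

SQLite is used here only because it ships with Python; the same statements would look much the same in the other relational systems named in these slides, although details of index syntax can differ between engines.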
Common data manipulation languages include Structured Query Language (SQL), which is used to retrieve and manipulate data in a relational database, and the languages used by IMS/DL/I and by CODASYL databases such as IDMS.

Data management
Data management means classifying, organizing, encoding, storing, retrieving and maintaining data. It is the central problem of data processing.
Stages in the development of data management technology:
    Manual management (before the mid-1950s)
    File systems (late 1950s to mid-1960s)
    Database systems (late 1960s to the present)

Comparison of the stages of data management technology

Database technology is a computer-assisted approach to managing data. It studies how to organize and store data and how to retrieve and process data efficiently, and it is an important branch of computer science. It investigates the basic theory and implementation methods of database structure, storage, design, management and application, and uses them to process, analyze and understand the data held in a database.

Database technology includes the theory and experimental methodology for building computer systems that handle large data volumes. Central to it is the development of concepts, languages, software and methods for describing, storing, searching, analyzing and distributing data, and for other data processing, so that access to data is simple, efficient, scalable, reliable and adaptable to new application areas.

Definition of a database (DB)
    A database is "a warehouse that organizes, stores and manages data according to a data structure".
    A database is a computerized record-keeping system.
    A database can be regarded as an electronic filing cabinet, a place for storing computerized files. Users can add or delete files, and can insert, retrieve, update and delete the data within them.
    A database is a large, organized, shareable collection of data stored in a computer over the long term.

Basic characteristics of a database
    Data is organized, described and stored according to a data model
    Can be shared by many kinds of users
    Low redundancy
    High data independence
    Easy to extend

Main features of a database
    1. Compactness: there is no need for bulky paper files.
    2. Speed: a computer can retrieve and update stored data far faster than a person working by hand.
    3. Less drudgery: the computer takes over much of the routine work.
    4. Currency: specific, up-to-date information is available whenever it is requested.
    5. Simplicity: an easy way to collect, access, connect and display information.
    6. Stability: prevents unnecessary loss of data.
    7. Security: protects against unauthorized access to private data.

Architectures
A number of database architectures exist. Many databases use a combination of strategies.
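Databases are software-based "containers" that are structured to collect and store information so that it can be retrieved, added to, updated or removed in an automatic fashion. Database programs are designed so that users can add or delete any information needed. The basic structure of a database is the table, which consists of rows and columns of information.

Main characteristics of a database
    (1) Data sharing: all users can access the data in the database at the same time, and users can work with the database in various ways through its interfaces.
    (2) Reduced redundancy: removing large amounts of duplicated data reduces redundancy and helps keep the data consistent.
    (3) Data independence: the logical structure of the database is independent of the application programs, and changes to the physical structure of the data do not affect its logical structure.
    (4) Centralized control: the database controls and manages data centrally, and uses a data model to represent how the data is organized and how data items relate to one another.
    (5) Consistency and maintainability, to ensure that data is secure and reliable:
        a. Security control: prevents data loss, erroneous updates and unauthorized use.
        b. Integrity control: ensures that data is correct, valid and compatible.
        c. Concurrency control: allows multiple accesses to the data within the same period of time while preventing abnormal interactions between users.
        d. Fault detection and recovery: the database management system provides methods for detecting and repairing faults promptly, so that data is not damaged.

Position of the database in a computer system
[Diagram; labels: hardware platform, basic software platform, software infrastructure platform, application software platform, software products, collaboration software, office software, database system, operating system, middleware, application server]

Components of a database system (DBS)
    Hardware
    Database
    Database management system (DBMS)
    Personnel

Database: a collection of related data, organized according to a definite structure and stored on tape, magnetic disk, optical disc or other external storage media.
Database management system (DBMS): a set of programs that describe, manage and maintain the database. It inserts new data and modifies and retrieves existing data in a common, controlled way.
Personnel:
    End users
    Database designers
    Systems analysts and application programmers
    Database administrators (DBA)

Database management system (DBMS)
    A layer of data management software located between the user and the operating system
    Basic software, and a large, complex software system in its own right
    Purpose: to organize and store data in a principled way and to retrieve and maintain it efficiently

A DBMS manages and shares data in a unified way. The data model is the core and foundation of a database system, and every DBMS product is based on some data model. Traditional database systems are therefore usually classified by their data model into three types: network databases, hierarchical databases and relational databases.

A database management system (DBMS) is a set of computer programs that controls the creation, maintenance and use of a database.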
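As a small illustration of the integrity control and recovery ideas listed above, the following sketch lets a DBMS reject an invalid update and undo a half-finished change. It assumes Python's built-in sqlite3 module; the sample_stock table, the freezer names and the tube counts are hypothetical.

```python
import sqlite3


def make_db():
    """Create a (hypothetical) inventory table with an integrity rule."""
    conn = sqlite3.connect(":memory:")
    with conn:  # commits the setup as one transaction
        conn.execute(
            "CREATE TABLE sample_stock ("
            " freezer TEXT PRIMARY KEY,"
            " tubes INTEGER NOT NULL CHECK (tubes >= 0))"  # integrity control
        )
        conn.executemany(
            "INSERT INTO sample_stock VALUES (?, ?)", [("A", 10), ("B", 0)]
        )
    return conn


def move_tubes(conn, src, dst, amount):
    """Move tubes between freezers atomically; return True on success."""
    try:
        with conn:  # commit if both updates succeed, roll back if either fails
            conn.execute(
                "UPDATE sample_stock SET tubes = tubes + ? WHERE freezer = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE sample_stock SET tubes = tubes - ? WHERE freezer = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative count; the first update was undone.
        return False


def stock(conn):
    """Return the current contents as a {freezer: tubes} dictionary."""
    return dict(conn.execute("SELECT freezer, tubes FROM sample_stock"))


if __name__ == "__main__":
    conn = make_db()
    print(move_tubes(conn, "A", "B", 4), stock(conn))
    print(move_tubes(conn, "A", "B", 100), stock(conn))
```

The point of the design is that the rule "a count is never negative" lives in the database rather than in each application, so every program that touches the table is held to it.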
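Main functions of a DBMS
    Data definition
        Provides a data definition language (DDL)
        Defines the data objects in the database
    Data organization, storage and management
        Classifies, organizes, stores and manages all kinds of data
        Determines the file structures and access methods used to organize data, and implements the relationships among data
        Provides multiple access methods to improve access efficiency
    Data manipulation
        Provides a data manipulation language (DML)
        Implements the basic operations on the database (query, insert, delete and modify)
    Transaction management and runtime management
        The DBMS manages and controls the database uniformly while it is built, run and maintained
        Guarantees data security, data integrity and concurrent use of the data by multiple users
        Recovers the system after a failure
    Database creation and maintenance (utilities)
        Loading and converting the initial data
        Database dumps
        Recovery from media failure
        Database reorganization
        Performance monitoring and analysis
    Other functions
        Communication between the DBMS and other software systems on the network
        Data conversion between two DBMSs
        Mutual access and interoperation between heterogeneous databases

Some of the more popular relational database management systems include:
    Microsoft Access
    Filemaker
    Microsoft SQL Server
    MySQL
    Oracle

Microsoft SQL Server, Microsoft Access

The SQL language is divided into four categories: data query language (DQL), data manipulation language (DML), data definition language (DDL) and data control language (DCL).

The interdisciplinary nature of bioinformatics requires the use of a variety of discipline-specific databases.

Oracle Database Architecture on Windows

A database is an integrated collection of logically related records or files, consolidated into a common pool that provides data for one or more uses. The data in a database is organized according to a database model.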
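Common database models include the relational model, the hierarchical model and the network model.

Architectures of database application systems
    Host-based (master-slave) database systems
    Distributed database systems
    Client/server (C/S) database systems
    Browser/server (B/S) database systems

Host-based (master-slave) database systems
This is a multi-user structure in which one host serves many terminals. The whole database system, including application programs, the DBMS and the data, is stored on the host, and the host carries out all processing. Users access the database concurrently through the host's terminals and share its data.
    Advantage: the data is easy to manage and maintain.
    Disadvantages: the host can become overloaded and turn into a bottleneck, sharply reducing system performance. When the host fails the entire system is unusable, so reliability is low.

Distributed database systems
In a distributed database system the data forms a single logical whole but is physically spread across different nodes of a computer network. Each node can process the data in its local database independently, running local applications, and can also access and process data in several remote databases at once, running global applications.
    Advantage: distributed database systems are a natural product of the development of computer networks, and they meet the database needs of geographically dispersed companies, groups and organizations.
    Disadvantages: distributed storage makes the data harder to process, manage and maintain. When users frequently access remote data, system efficiency is noticeably limited by network traffic.

Client/server (C/S) database systems
    Server: one or more computers on the network are dedicated to running the DBMS and are called database servers.
    Client: computers on the other nodes run the peripheral application development tools of the DBMS and support the users' applications. They are called clients.
    How it works: a user request from the client is sent to the database server, which processes it and returns only the result to the user, not the whole data set.
    Advantages: the amount of data transferred over the network is greatly reduced, which improves performance, throughput and load capacity. Client/server databases also tend to be more open (many different hardware and software platforms and development tools), which makes applications more portable and can reduce software maintenance costs.

Browser/server (B/S) database systems
Browser/server database systems on the Internet and on intranets are, in essence, like traditional C/S systems: both run applications through the same request-and-response mechanism. A traditional C/S system, however, concentrates a large amount of application software on the client, whereas B/S is a three-tier or multi-tier C/S structure based on hyperlinks, HTML and Java in which the client needs only a browser. It is a new architecture.

Data models
In a database, the data model is the tool used to abstract, represent and process the data and information of the real world. A data model should meet three requirements:
    It models the real world reasonably faithfully
    It is easy for people to understand
    It is convenient to implement on a computer

1. Conceptual data model, or conceptual model for short. This is a model of the real world oriented toward the users of the database, and it is mainly used to describe the conceptual structure of that world. It frees database designers, in the early stage of design, from the technical details of the computer system and the DBMS so that they can concentrate on analyzing the data and the relationships within it. It is independent of any particular database management system (DBMS). A conceptual data model must be converted into a logical data model before it can be implemented in a DBMS.

2. Logical data model, or data model for short. This is the model that users see from the database, and it is the data model supported by a particular DBMS, such as the network data model or the hierarchical data model. It must serve both the user and the system, and it is mainly used in implementing the DBMS.

3. Physical data model, or physical model for short. This is a model oriented toward the physical representation in the computer. It describes how data is organized on the storage medium, and it depends not only on the particular DBMS but also on the operating system and hardware. Every logical data model has a corresponding physical data model when it is implemented. To preserve independence and portability, the DBMS carries out most of the physical data model automatically, and the designer only designs special structures such as indexes and clusters.

Most commonly used data models
    Non-relational models
        Hierarchical model
        Network model
    Relational model
    Object-oriented model
    Object-relational model

Database model
A database model or database schema is the structure or format of a database, described in a formal language supported by the database management system.

(1) Hierarchical model
A hierarchical model is essentially a directed, ordered tree with a root node (in mathematics a "tree" is defined as a connected graph with no cycles). The organization chart of a university is an example. The chart looks like a tree: the university administration is the root (the root node); departments, majors, teachers, students and so on are the branch points (nodes); and the links between the root and the branch points are called edges. The ratio of root to branches is 1:N, that is, there is one root and N branches.
A database system built on the hierarchical model is called a hierarchical database system. IMS (Information Management System) is the typical example.

In a hierarchical model, data is organized into a tree-like structure, implying a single upward link in each record to describe the nesting, and a sort field to keep the records in a particular order in each same-level list.

(2) Network model
A database system built on a network data structure is called a network database system. The typical example is DBTG (Data Base Task Group). A network data structure can be converted into a hierarchical one by mathematical methods.

The network model (defined by the CODASYL specification) organizes data using two fundamental constructs, called records and sets. Records contain fields (which may be organized hierarchically, as in the programming language COBOL). Sets (not to be confused with mathematical sets) define one-to-many relationships between records: one owner, many members.

(3) Relational model
The relational data structure reduces complex data structures to simple relations, that is, two-dimensional tables. The staff of an organization, for example, can be represented as one such relation. Relational database systems rest on the solid theoretical foundation of relational algebra, and decades of development and practical use have made the technology increasingly mature. A database system built from relational data structures is called a relational database system.

The relational model was introduced by E.F. Codd in 1970 as a way to make database management systems more independent of any particular application. It is a mathematical model defined in terms of predicate logic and set theory.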
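The three models above can be contrasted with a toy sketch in Python. This is an illustration under assumptions and does not show how IMS, a CODASYL system or a relational engine is implemented. All record names (University, Teacher A and so on) are hypothetical and loosely follow the university example.

```python
from dataclasses import dataclass, field


# Hierarchical model: a tree in which every record has one upward link.
@dataclass
class Node:
    name: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

    def add(self, name):
        """Attach a child record and return it."""
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Follow the single upward links back to the root."""
        node, parts = self, []
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " / ".join(reversed(parts))


# Network model: named sets, each linking one owner record to many members.
# A record may be a member of several sets, so it can have several owners.
network_sets = {
    "department_offers": {"Immunology Dept": ["Bioinformatics course"]},
    "teacher_teaches": {"Teacher A": ["Bioinformatics course"]},
}


def owners_of(member):
    """Return every owner, across all sets, that has `member` as a member."""
    return sorted(
        owner
        for members_by_owner in network_sets.values()
        for owner, members in members_by_owner.items()
        if member in members
    )


# Relational model: flat tables (lists of tuples) related only by shared values.
staff = [(1, "Teacher A", 10), (2, "Teacher B", 20)]    # (id, name, dept_id)
departments = [(10, "Immunology Dept"), (20, "Genetics Dept")]  # (dept_id, name)


def staff_with_department():
    """Join the two tables on dept_id."""
    dept_name = dict(departments)
    return [(name, dept_name[dept_id]) for _, name, dept_id in staff]


if __name__ == "__main__":
    major = Node("University").add("Department").add("Major")
    print(major.path())
    print(owners_of("Bioinformatics course"))
    print(staff_with_department())
```

In the hierarchical sketch a record is reachable only through its one parent, in the network sketch the same record is reached from two owners, and in the relational sketch no links are stored at all: the relationship is recomputed from matching values.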
Although relational database technology is mature, its limitations have become obvious: it handles so-called "tabular data" very well, but it copes poorly with the growing number of complex data types appearing in technical fields.

(4) Object-oriented database systems
Object orientation is both a way of understanding the world and a newer programming methodology. Combining the object-oriented approach with database technology lets the analysis and design of a database system match, as closely as possible, the way people understand the objective world. Object-oriented database systems are a new generation of database systems created to meet the needs of new database applications.

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program.

Objects and object identity: in this model, everything is modeled as an object. An object can be any physical or abstract thing: a person, a place, a thing or a concept. An object can model the overall structure of something, not just a part of it, and the behavior of the thing being modeled is also specified in the object. This feature is called encapsulation. All we need to know at this stage is that an object stores both information and behavior in the same entity.
Example. Car: color, brand, model number, gears, engine cylinders, capacity, number of doors. This information is sufficient to model any car.

Computer systems

Introduction to bioinformatics databases

Notable characteristics of bioinformatics databases
    (1) Databases are updated ever faster, and the volume of data grows exponentially
    (2) Database usage grows even faster
    (3) Databases are becoming more complex
    (4) Databases are networked
    (5) Databases are application-oriented
    (6) Databases run on advanced software and hardware

GenBank overview: http://www.ncbi.nlm.nih.gov/genbank/GenbankOverview.html
Sample GenBank record: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Fields of a GenBank record

LOCUS
The LOCUS field contains a number of different data elements, including locus name, sequence length, molecule type, GenBank division, and modification date.

DEFINITION
Brief description of the sequence; includes information such as source organism, gene name or protein name, or some description of the sequence's function (if the sequence is non-coding).

ACCESSION
The unique identifier for a sequence record.
VERSION
A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.

GI
"GenInfo Identifier" sequence identification number, in this case for the nucleotide sequence. If a sequence changes in any way, a new GI number will be assigned.

Organism
The formal scientific name for the source organism (genus and species, where appropriate) and its lineage, based on the phylogenetic classification scheme used in the NCBI Taxonomy Database.

AUTHORS
List of authors in the order in which they appear in the cited article.

source
Mandatory feature in each record that summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID number. Can also include other information such as map location, strain, clone, tissue type, etc., if provided by the submitter.

Taxon
A stable unique identification number for the taxon of the source organism. A taxonomy ID number is assigned to each taxon (species, genus, family, etc.) in the NCBI Taxonomy Database.

CDS
Coding sequence; the region of nucleotides that corresponds to the sequence of amino acids in a protein (the location includes start and stop codons). The CDS feature includes an amino acid translation.

protein_id
A protein sequence identification number, similar to the Version number of a nucleotide sequence. Protein IDs consist of three letters followed by five digits, a dot, and a version number.

gene
A region of biological interest identified as a gene and for which a name has been assigned. The base span for the gene feature depends on the furthest 5' and 3' features.
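To make the field list above concrete, the following sketch pulls the top-level fields out of a GenBank-style flat-file header. It relies on the layout convention that top-level keywords start in the first column and continuation lines are indented. The record text is invented for illustration (EXAMPLE01 and ZZ000000 are not real identifiers), and parse_top_level_fields is a hypothetical helper, not part of any NCBI tool. For real records an established parser, such as Biopython's Bio.SeqIO with the "genbank" format, is the safer choice.

```python
# Invented header, for illustration only; not a real GenBank record.
SAMPLE = """\
LOCUS       EXAMPLE01                 30 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example sequence, invented for
            illustration only.
ACCESSION   ZZ000000
VERSION     ZZ000000.1
"""


def parse_top_level_fields(record_text):
    """Map each top-level keyword (LOCUS, DEFINITION, ...) to its text.

    Lines that begin with a space are treated as continuations of the
    previous keyword, so indented sub-keywords are simply folded into
    their parent field by this simplified parser.
    """
    fields = {}
    current = None
    for line in record_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] != " ":
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:
            fields[current] += " " + line.strip()
    return fields


def split_version(version_field):
    """Split 'ACCESSION.N' into the accession and the integer version."""
    accession, _, number = version_field.split()[0].rpartition(".")
    return accession, int(number)


if __name__ == "__main__":
    fields = parse_top_level_fields(SAMPLE)
    print(fields["DEFINITION"])
    print(split_version(fields["VERSION"]))
```

The VERSION field is split at the last dot because, as described above, it combines the accession with a version number that changes when the sequence changes.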
ACCESSION (RefSeq)
Records from the RefSeq database of reference sequences have a different accession number format that begins with two letters followed by an underscore and six or more digits, for example:
    NT_123456    constructed genomic contigs
    NM_123456    mRNAs
    NP_123456    proteins
    NC_123456    chromosomes

GenBank division
The GenBank division to which a record belongs is indicated with a three-letter abbreviation. In this example, the GenBank division is PLN. The GenBank database is divided into 18 divisions:
    1. PRI - primate sequences
    2. ROD - rodent sequences
    3. MAM - other mammalian sequences
    4. VRT - other vertebrate sequences
    5. INV - invertebrate sequences
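The RefSeq accession format described above (two letters, an underscore, six or more digits) can be checked with a short regular expression. The sketch below is a minimal illustration that knows only the four prefixes listed in these slides. The optional ".N" version suffix it accepts is an assumption added for convenience, and classify_accession is a hypothetical helper name.

```python
import re

# Only the prefixes listed in the text above; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}

# Two letters, an underscore, six or more digits, and (assumed) an optional ".version".
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6,})(?:\.(\d+))?$")


def classify_accession(accession):
    """Describe an accession string using the RefSeq prefixes listed above."""
    match = _REFSEQ_PATTERN.match(accession)
    if match is None:
        return "not RefSeq-style"
    return REFSEQ_PREFIXES.get(match.group(1), "RefSeq-style, prefix not listed here")


if __name__ == "__main__":
    for acc in ["NM_123456", "NP_123456.2", "NC_123456", "AB123456"]:
        print(acc, "->", classify_accession(acc))
```

A check like this is useful mainly for routing: it tells a script whether an identifier points to a curated reference sequence or to an ordinary archival record before any database lookup is made.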