Friday, April 12, 2019

Tutorial 07 – Data Persistence

1. Discuss the role of data in information systems indicating the need for data persistence 

What does data persistence mean?

Data persistence means that data outlives the process that created it: it is written to non-volatile storage, so it is still available after the application exits or the machine restarts. In data processing, the term persistent data is also used for information that is infrequently accessed and not likely to be modified. Static data is information, for example a record, that does not change and may be intended to be permanent.
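
As a minimal sketch of persistence in the sense of data surviving the code that created it, the Python snippet below writes a record to disk and reads it back through a separate call, the way a restarted program would. The file name and the record are hypothetical, and a plain JSON file is only an illustration, not how a DBMS stores data.

```python
import json
import os
import tempfile


def save_record(path, record):
    """Persist a record by writing it to non-volatile storage as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)


def load_record(path):
    """Read the record back, as a later run of the program would."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical file location and record, chosen only for illustration.
path = os.path.join(tempfile.gettempdir(), "student_record.json")
save_record(path, {"name": "Alice", "age": 21})

# Only the copy on disk is used here, not the original in-memory dict.
restored = load_record(path)
```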


2. Explain the terms: Data, Database, Database Server, and Database Management System 

Data : Information in raw or unorganized form
Database : A database is a collection of information that is organized so that it can be easily accessed, managed and updated
Database Server : A computer or program that runs a DBMS and provides database services to other programs or computers (clients), following the client-server model
Database Management System : system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data.

3. Compare Files and Databases, discussing pros and cons of them 

File System
Pros of the File System
  • Performance can be better than when you do it in a database. To justify this, if you store large files in a DB, it may slow down performance, because a simple query to retrieve the list of files or a filename will also load the file data if you used SELECT * in your query. In a file system, accessing a file is quite simple and lightweight.
  • Saving the files and downloading them in the file system is much simpler than it is in a database since a simple "Save As" function will help you out. Downloading can be done by addressing a URL with the location of the saved file.
  • Migrating the data is an easy process. You can just copy and paste the folder to your desired destination while ensuring that write permissions are provided to your destination.
  • It's cost effective in most cases to expand your web server rather than pay for certain databases.
  • It's easy to migrate it to cloud storage i.e. Amazon S3, CDNs, etc. in the future
Cons of the File System
  • Loosely packed. There are no ACID (Atomicity, Consistency, Isolation, Durability) guarantees for files kept in the file system, so nothing keeps them consistent with the records that refer to them. Consider a scenario in which your files are deleted from the location manually or by an attacker. You might not know whether the file still exists or not. Painful, right?
  • Low security. Since your files can be saved in a folder where you should have provided write permissions, it is prone to safety issues and invites trouble, like hacking. It's best to avoid saving in the file system if you cannot afford to compromise in terms of security.
Database
Pros of Database
  • ACID consistency, including the rollback of an update, which is complicated when files are stored outside the database.
  • Files will be in sync with the database and cannot be orphaned, which gives you the upper hand in tracking transactions
  • Backups automatically include file binaries.
  • It's more secure than saving in a file system.

Cons of Database
  • You may have to convert the files to blob in order to store them in the database.
  • Database backups will be more hefty and heavy
  • Memory use is inefficient. RDBMSs are often RAM-driven, so all data has to go through RAM first. Yeah, that's right. Have you ever thought about what happens when an RDBMS has to find and sort data? It tracks each data page (even for the smallest amount of data read or written), and it has to track whether the page is in memory or on disk, whether it is indexed, whether it is physically sorted, and so on.
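
The rollback advantage listed under the pros of a database can be sketched with Python's built-in sqlite3 module. The account table, the names and the amounts are hypothetical; the point is only that two updates inside one transaction either both apply or both get undone, which plain files do not give you for free.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move an amount between accounts; both updates apply or neither does."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)
# The second transfer would make alice's balance negative, so the CHECK
# constraint fails and the credit already given to bob is rolled back.
failed = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM account"))
```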


4. Discuss different arrangements of data, giving examples for each 

•Un-structured 
Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents.

•Semi-structured 
Semi-structured data is data that is neither raw data nor typed data in a conventional database system. It is structured data, but it is not organized in a relational model, like a table or an object-based graph. A lot of data found on the Web can be described as semi-structured. Data integration especially makes use of semi-structured data.

Examples of semi-structured data: XML and JSON documents are semi-structured, and CSV files are often counted as well. Data held in NoSQL databases is also considered semi-structured.

•Structured
Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets.

Examples of structured data include numbers, dates, and groups of words and numbers called strings. Most experts agree that this kind of data accounts for about 20 percent of the data that is out there. Structured data is the data you're probably used to dealing with. It's usually stored in a database.

5. Explain different types of databases, providing examples for their use 

Hierarchical Databases
In the hierarchical database management system (hierarchical DBMS) model, data is stored in nodes with parent-child relationships. In a hierarchical database, besides the actual data, records also contain information about their groups of parent/child relationships.

Example : IBM Information Management System (IMS) and the RDM Mobile
Network Databases
Network database management systems (network DBMSs) use a network structure to create relationships between entities. Network databases are mainly used on large digital computers. They are similar to hierarchical databases, but where a node in a hierarchical database can have only one parent, a node in a network database can have relationships with multiple entities. A network database looks more like a cobweb or an interconnected network of records.

Example : If we have to design a School Database, then Student will be an entity with attributes name, age, address etc. As Address is generally complex, it can be another entity with attributes street name, pincode, city etc, and there will be a relationship between them.
Relational Databases
In relational database management systems (RDBMS), the relationship between data is relational and data is stored in tabular form, in columns and rows. Each column of a table represents an attribute and each row in a table represents a record. Each field in a table represents a data value.
Structured Query Language (SQL) is the language used to query an RDBMS, including inserting, updating, deleting, and searching records.

Example : Most well known DBMS applications fall into the RDBMS category. Examples include Oracle Database, MySQL, Microsoft SQL Server, and IBM DB2

Object-Oriented Model
The object-oriented model brings the functionality of object-oriented programming to the database. It involves more than storing programming language objects: object DBMSs extend the semantics of languages such as C++ and Java to provide full-featured database programming capability while retaining native language compatibility. In other words, it adds database functionality to object programming languages. This approach unifies application and database development into a consistent data model and language environment. Applications require less code, use more natural data modeling, and code bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort.
Graph Databases
Graph databases are NoSQL databases and use a graph structure for semantic queries. The data is stored in the form of nodes, edges, and properties. In a graph database, a node represents an entity or instance such as a customer, a person, or a car. A node is equivalent to a record in a relational database system. An edge represents a relationship that connects nodes. Properties are additional information attached to the nodes.
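
A rough in-memory sketch of the node, edge and property vocabulary, using plain Python dictionaries. All labels, names and relationship types here are hypothetical, and a real graph database adds storage, indexing and a query language on top of this idea.

```python
# Nodes keyed by id; each node has a label and a dict of properties.
nodes = {
    1: {"label": "Person", "properties": {"name": "Alice"}},
    2: {"label": "Person", "properties": {"name": "Bob"}},
    3: {"label": "Car", "properties": {"model": "Civic"}},
}

# Each edge is (source node id, relationship type, target node id).
edges = [
    (1, "KNOWS", 2),
    (1, "OWNS", 3),
]


def neighbours(node_id, relationship):
    """Follow the outgoing edges of one relationship type from a node."""
    return [
        nodes[dst]
        for src, rel, dst in edges
        if src == node_id and rel == relationship
    ]


# Which car models does node 1 (Alice) own?
owned = [n["properties"]["model"] for n in neighbours(1, "OWNS")]
```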


ER Model Databases 
An ER model is typically implemented as a database. In a simple relational database implementation, each row of a table represents one instance of an entity type, and each field in a table represents an attribute type. In a relational database a relationship between entities is implemented by storing the primary key of one entity as a pointer or "foreign key" in the table of another entity.
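
The foreign-key idea can be tried out with Python's built-in sqlite3 module. The sketch below reuses the Student and Address entities from the school example above; the table layout and the sample values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE address (id INTEGER PRIMARY KEY, street TEXT, city TEXT)"
)
conn.execute(
    "CREATE TABLE student ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT, "
    "address_id INTEGER REFERENCES address(id))"
)
conn.execute("INSERT INTO address VALUES (1, '12 Main Street', 'Colombo')")
conn.execute("INSERT INTO student VALUES (1, 'Alice', 1)")

# The relationship is followed by joining on the stored key.
row = conn.execute(
    "SELECT s.name, a.city "
    "FROM student s JOIN address a ON a.id = s.address_id"
).fetchone()

# A student pointing at an address that does not exist is rejected.
try:
    conn.execute("INSERT INTO student VALUES (2, 'Bob', 99)")
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False
```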

6. Compare and contrast data warehouse with Big data 

A data warehouse is typically built on a relational database, so storing and fetching data works much like it does with ordinary SQL queries. Big data does not follow a fixed database structure; to look at the data we need tools such as Hive or Spark SQL and their own query dialects. All of the data loaded into a data warehouse is used for analytics and reports.


7. Explain how the application components communicate with files and databases 
Connecting BRM Components
To allow BRM components to communicate with each other, you use entries in configuration or properties files. The basic connection entries in the files identify the host names and port numbers of each component.

These connection entries are set when you install BRM and when you install each client application. You can change them if you change your configuration. Depending on how you install BRM, you might have to change some entries to connect BRM components.
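
As a generic illustration of connection entries kept in a configuration file (this is not BRM's actual file format), the Python sketch below reads a host name, a port number and a storage directory with the standard configparser module. Every section name, key and value is hypothetical.

```python
import configparser

# Hypothetical configuration text; in practice this would live in a file
# that is edited at install time or when the deployment changes.
CONFIG_TEXT = """
[database]
host = db.example.internal
port = 5432

[storage]
upload_dir = /var/app/uploads
"""


def load_connection_entries(text):
    """Parse the entries a component needs to reach its database and files."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "db_host": parser.get("database", "host"),
        "db_port": parser.getint("database", "port"),
        "upload_dir": parser.get("storage", "upload_dir"),
    }


entries = load_connection_entries(CONFIG_TEXT)
```

Keeping these values outside the code means the same component can be pointed at a different database host or storage folder without being rebuilt.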

8. Differentiate the SQL statements, Prepared statements, and Callable statements 
  • The Statement interface is used for executing a static SQL statement: a normal SQL query whose text is complete and takes no parameters.
  • The PreparedStatement interface is used for executing a precompiled SQL statement. It is used for dynamic or parameterized SQL queries, where values are bound to placeholders at execution time.
  • The CallableStatement interface is used to execute SQL stored procedures and functions.
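
The difference between a static and a parameterized statement can be sketched outside Java as well. The snippet below uses Python's built-in sqlite3 module rather than JDBC, so it is only an analogy: the `?` placeholder plays the role of a PreparedStatement parameter, and SQLite has no stored procedures, so there is no counterpart to CallableStatement here. The table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?)",
    [("Book A", "Linus"), ("Book B", "Ada")],
)

# Static statement: the SQL text is fixed and contains the value itself.
static_rows = conn.execute(
    "SELECT title FROM book WHERE author = 'Linus'"
).fetchall()


def titles_by(author):
    """Parameterized statement: the value is bound to the ? placeholder."""
    cur = conn.execute("SELECT title FROM book WHERE author = ?", (author,))
    return [title for (title,) in cur]


# A bound value is treated as data, never as SQL, so this injection
# attempt simply matches no author.
injected = titles_by("x' OR '1'='1")
```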


9. Argue the need for ORM, explaining the development with and without ORM 

Object-Relational Mapping (ORM) is a technique that lets you query and manipulate data from a database using an object-oriented paradigm. When talking about ORM, most people are referring to a library that implements the Object-Relational Mapping technique, hence the phrase "an ORM".

An ORM library is a completely ordinary library written in your language of choice that encapsulates the code needed to manipulate the data, so you don't use SQL anymore; you interact directly with an object in the same language you're using.

For example, here is a completely imaginary case with a pseudo language:

You have a book class, you want to retrieve all the books of which the author is "Linus". Manually, you would do something like that:

book_list = new List();
sql = "SELECT book FROM library WHERE author = 'Linus'";
data = query(sql); // I over simplify ...
while (row = data.next())
{
     book = new Book();
     book.setAuthor(row.get('author'));
     book_list.add(book);
}

With an ORM library, it would look like this:

book_list = BookTable.query(author="Linus");

The mechanical part is taken care of automatically via the ORM library.
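
To make the "mechanical part" concrete, here is a hand-rolled sketch in Python of what a call like BookTable.query might do internally. It is not the API of any real ORM library: the Book class, the BookTable mapper and the library table are all hypothetical, and real ORMs handle far more (relationships, caching, schema mapping).

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    author: str


class BookTable:
    """Hypothetical mapper between rows of 'library' and Book objects."""

    COLUMNS = ("title", "author")

    def __init__(self, conn):
        self.conn = conn

    def query(self, **filters):
        # Only known column names may be placed into the SQL text.
        for column in filters:
            if column not in self.COLUMNS:
                raise ValueError(f"unknown column: {column}")
        sql = "SELECT title, author FROM library"
        if filters:
            sql += " WHERE " + " AND ".join(f"{c} = ?" for c in filters)
        rows = self.conn.execute(sql, tuple(filters.values()))
        # The mechanical part: turn each row into an object.
        return [Book(*row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE library (title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO library VALUES (?, ?)",
    [("Book A", "Linus"), ("Book B", "Ada")],
)

book_list = BookTable(conn).query(author="Linus")
```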

10. Discuss the POJO, Java Beans, and JPA, indicating their similarities and differences 

POJO (Plain Old Java Object): A Plain Old Java Object or POJO is a term initially introduced to designate a simple lightweight Java object, not implementing any javax.ejb interface, as opposed to heavyweight EJB 2.x (especially Entity Beans, Stateless Session Beans are not that bad IMO). Today, the term is used for any simple object with no extra stuff.


JavaBeans: JavaBeans are reusable software components for Java that can be manipulated visually in a builder tool. Practically, they are classes written in the Java programming language conforming to a particular convention. They are used to encapsulate many objects into a single object (the bean), so that they can be passed around as a single bean object instead of as multiple individual objects. A JavaBean is a Java Object that is serializable, has a nullary constructor, and allows access to properties using getter and setter methods.


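Enterprise JavaBeans (EJB) is a managed, server-side component architecture for the modular construction of enterprise software, and one of several Java APIs. An EJB is a server-side software component that encapsulates the business logic of an application.

JPA (Java Persistence API) is a Java specification for object-relational mapping. It defines how plain Java classes (entities) are mapped to relational tables, typically through annotations, and it is implemented by providers such as Hibernate.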


11. Identify the ORM tools available for different development platforms (Java, PHP, and .Net) 

Hibernate
Hibernate is an object-relational mapping (ORM) library for the Java language, providing a framework for mapping an object-oriented domain model to a traditional relational database. Hibernate solves object-relational impedance mismatch problems by replacing direct persistence-related database accesses with high-level object handling functions.

Features of Hibernate:

  • Transparent persistence without byte code processing
  • Object-oriented query language
  • Object / Relational mappings
  • Automatic primary key generation


iBATIS / MyBatis
iBATIS is a persistence framework which automates the mapping between SQL databases and objects in Java, .NET, and Ruby on Rails. In Java, the objects are POJOs (Plain Old Java Objects). The mappings are decoupled from the application logic by packaging the SQL statements in XML configuration files. The result is a significant reduction in the amount of code that a developer needs to access a relational database using lower level APIs like JDBC and ODBC.

Features of iBATIS:

  • Support for Unit of work / object level transactions
  • In memory object filtering
  • Providing an ODMG compliant API and/or OCL and/or OPath
  • Supports multiservers (clustering) and simultaneous access by other applications without loss of transaction integrity


Toplink
In computing, TopLink is an object-relational mapping (ORM) package for Java developers. It provides a framework for storing Java objects in a relational database or for converting Java objects to XML documents

Features of Toplink:

  • Query framework that supports an object-oriented expression framework, Query by Example (QBE), EJB QL, SQL, and stored procedures
  • Object-level transaction framework
  • Caching to ensure object identity
  • Set of direct and relational mappings

12. Discuss the need for NoSQL indicating the benefits, also explain different types of NoSQL databases 

Organizations are increasingly adopting NoSQL databases in response to the complexity and limitations of traditional, legacy relational databases. NoSQL databases are more scalable, can help you achieve better performance, and offer a more cost-effective way of developing, implementing and sharing software.

Key benefits of NoSQL include:

  • Efficient, scale-out architecture instead of monolithic architecture
  • The ability to handle high volumes of structured, semi-structured, and unstructured data
  • Being better aligned with object-oriented programming
  • Working well with today's software development methodologies that involve agile sprints and frequent code pushes

Types of NoSQL databases-
There are 4 basic types of NoSQL databases:

Key-Value Store – It has a Big Hash Table of keys & values {Example- Riak, Amazon S3 (Dynamo)}

The schema-less format of a key-value database like Riak is just about what you need for your storage needs. The key can be synthetic or auto-generated, while the value can be a String, JSON, a BLOB (binary large object), etc.

The key-value type basically uses a hash table in which there exists a unique key and a pointer to a particular item of data. A bucket is a logical group of keys, but buckets don't physically group the data. There can be identical keys in different buckets.

Performance is enhanced to a great degree because of the cache mechanisms that accompany the mappings. To read a value you need to know both the key and the bucket because the real key is a hash (Bucket+ Key).

There is little complexity around the key-value store model, so it can be implemented in a breeze. It is not an ideal choice if you only want to update part of a value or query the database by anything other than the key.
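
The bucket-plus-key lookup described above can be sketched with a Python dict, which is itself a hash table. The class and the sample data are hypothetical and leave out everything a real store such as Riak adds (distribution, replication, persistence).

```python
class KeyValueStore:
    """Toy key-value store where the real key is the (bucket, key) pair."""

    def __init__(self):
        self._table = {}  # a Python dict is a hash table

    def put(self, bucket, key, value):
        self._table[(bucket, key)] = value

    def get(self, bucket, key, default=None):
        return self._table.get((bucket, key), default)


store = KeyValueStore()
# The same key may appear in different buckets without clashing.
store.put("users", "42", {"name": "Alice"})
store.put("orders", "42", {"total": 99})
```

Note that the store can only fetch or replace a whole value by its key; there is no way to ask for "all users named Alice" without scanning every entry, which is the limitation mentioned above.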

Document-based Store- It stores documents made up of tagged elements. {Example- CouchDB}

A document store holds data as collections of key-value pairs packaged into documents. It is quite similar to a key-value store; the difference is that the stored values (referred to as "documents") provide some structure and encoding of the managed data. XML, JSON (JavaScript Object Notation), and BSON (a binary encoding of JSON objects) are some common standard encodings.

Column-based Store- Each storage block contains data from only one column, {Example- HBase, Cassandra}

In a column-oriented NoSQL database, data is stored in cells grouped into columns of data rather than as rows of data. Columns are logically grouped into column families. Column families can contain a virtually unlimited number of columns, which can be created at runtime or in the schema definition. Reads and writes are done using columns rather than rows.

In comparison, most relational DBMSs store data in rows. The benefit of storing data in columns is fast search/access and data aggregation. Relational databases store a single row as a continuous disk entry, and different rows are stored in different places on disk. Columnar databases store all the cells corresponding to a column as a continuous disk entry, which makes search/access faster.
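
The two layouts can be compared with ordinary Python lists and dicts. This is only a sketch of the data arrangement with made-up sample values; it says nothing about how HBase or Cassandra lay out data on disk.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Colombo", "amount": 120},
    {"id": 2, "city": "Kandy", "amount": 80},
    {"id": 3, "city": "Colombo", "amount": 50},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating one column from rows touches every full record...
total_from_rows = sum(row["amount"] for row in rows)
# ...while the column layout reads just the one contiguous list.
total_from_columns = sum(columns["amount"])
```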


Graph-based- A network database that uses edges and nodes to represent and store data. {Example- Neo4J}

In a graph-based NoSQL database, you will not find the rigid format of SQL or the table-and-column representation. A flexible graph representation is used instead, which is well suited to addressing scalability concerns. Graph structures are built from nodes, edges and properties, which provides index-free adjacency. Data can be easily transformed from one model to the other using a graph-based NoSQL database.

13. Discuss what Hadoop is, explaining the core concepts of it 

What Is Hadoop?
When you learn about Big Data you will sooner or later come across this odd sounding word: Hadoop - but what exactly is it?

Put simply, Hadoop can be thought of as a set of open source programs and procedures (meaning essentially they are free for anyone to use or modify, with a few exceptions) which anyone can use as the "backbone" of their big data operations.

I'll try to keep things simple as I know a lot of people reading this aren't software engineers, so I hope I don't over-simplify anything - think of this as a brief guide for someone who wants to know a bit more about the nuts and bolts that make big data analysis possible.
4 Modules of Hadoop
1. Distributed File-System
The most important two are the Distributed File System, which allows data to be stored in an easily accessible format, across a large number of linked storage devices, and the MapReduce - which provides the basic tools for poking around in the data.
(A "file system" is the method used by a computer to store data, so it can be found and used. Normally this is determined by the computer's operating system, however a Hadoop system uses its own file system which sits "above" the file system of the host computer - meaning it can be accessed using any computer running any supported OS).

2. MapReduce
MapReduce is named after the two basic operations this module carries out - reading data from the database and putting it into a format suitable for analysis (map), and performing mathematical operations, e.g. counting the number of males aged 30+ in a customer database (reduce).

3. Hadoop Common
The other module is Hadoop Common, which provides the tools (in Java) needed for the user's computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.

4. YARN
The final module is YARN, which manages resources of the systems storing the data and running the analysis.
Various other procedures, libraries or features have come to be considered part of the Hadoop "framework" over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are the principal four.
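
The map and reduce steps described for the MapReduce module can be imitated in a single Python process. This sketch does not use the Hadoop API and runs on one machine; the customer records are invented, and the example counts males aged 30+ as in the description above.

```python
from collections import defaultdict

# Hypothetical customer records.
customers = [
    {"name": "A", "gender": "male", "age": 34},
    {"name": "B", "gender": "female", "age": 41},
    {"name": "C", "gender": "male", "age": 25},
    {"name": "D", "gender": "male", "age": 52},
]


def map_phase(record):
    """Map: emit a (key, 1) pair for every record of interest."""
    if record["gender"] == "male" and record["age"] >= 30:
        yield ("male_30_plus", 1)


def shuffle(pairs):
    """Group the emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


pairs = [pair for record in customers for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and only the small (key, value) pairs travel over the network to the reducers.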

14. Explain the concept of IR, identifying tools for IR 

Information Retrieval is the process of obtaining relevant information from a collection of informational resources. So, we have to think about what concepts IR systems use to model this data so that they can return all the documents that are relevant to the query term and ranked based on certain importance measures.

Retrieval tools are systems created for the retrieval of information. They are essential basic building blocks for a system that organizes the recorded information collected by libraries, archives and museums.
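
One common way to model documents for retrieval is an inverted index, sketched below in Python. The three documents are invented, and the ranking is a deliberately simple assumption (sum of term frequencies, ties broken by document id); real IR systems use more refined importance measures.

```python
from collections import Counter, defaultdict

# Hypothetical document collection.
documents = {
    "d1": "database systems store persistent data",
    "d2": "information retrieval systems rank documents",
    "d3": "retrieval of data from a database",
}


def build_index(docs):
    """Inverted index: term -> how often it occurs in each document."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index


def search(index, query):
    """Return ids of matching documents, best score first."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]


index = build_index(documents)
results = search(index, "database retrieval")
```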
