Governors State University
OPUS Open Portal to University Scholarship
All Capstone Projects, Student Capstone Projects
Spring 2016

Data Migration from RDBMS to Hadoop

Naga Sruthi Tiyyagura, Governors State University
Monika Rallabandi, Governors State University
Radhakrishna Nalluri, Governors State University

Follow this and additional works at: http://opus.govst.edu/capstones
Part of the Computer Sciences Commons

For more information about the academic degree, extended learning, and certificate programs of Governors State University, go to http://www.govst.edu/Academics/Degree_Programs_and_Certifications/ and visit the Governors State Computer Science Department.

This Project Summary is brought to you for free and open access by the Student Capstone Projects at OPUS Open Portal to University Scholarship. It has been accepted for inclusion in All Capstone Projects by an authorized administrator of OPUS Open Portal to University Scholarship. For more information, please contact opus@govst.edu.

Recommended Citation
Tiyyagura, Naga Sruthi; Rallabandi, Monika; and Nalluri, Radhakrishna, "Data Migration from RDBMS to Hadoop" (2016). All Capstone Projects. 184. http://opus.govst.edu/capstones/184

Table of Contents

1 Project Description
  1.1 Project Abstract
  1.2 Competitive Information
  1.3 Relationship to Other Applications/Projects
  1.4 Assumptions and Dependencies
  1.5 Future Enhancements
  1.6 Definitions and Acronyms
2 Technical Description
  2.1 Project/Application Architecture
  2.2 Project/Application Information Flows
  2.3 Interactions with Other Projects
  2.4 Interactions with Other Applications
  2.5 Capabilities
  2.6 Risk Assessment and Management
3 Project Requirements
  3.1 Identification of Requirements
  3.2 Operations, Administration, Maintenance and Provisioning (OAM&P)
4 Project Design Description
5 Project Internal/External Interface Impacts and Specification
6 Functional Overview
  6.1 Impact
7 Open Issues
8 References

1 Project Description

1.1 Project Abstract

Oracle, IBM, Microsoft and Teradata own a large portion of the information on the planet: run a query in any part of the world and you are likely reading data from a database owned by one of them. Moving a large volume of data from Oracle to DB2 or another vendor is a challenging task for a business. The arrival of Hadoop and NoSQL technology represented a seismic shift that shook the RDBMS market and offered organizations an alternative, and the database vendors moved quickly to position themselves in Big Data, and vice versa; nearly every vendor now has its own big data technology, such as Oracle NoSQL and MongoDB. There is therefore a huge market for high-performance data migration tools that can copy data stored in RDBMS databases into Hadoop or NoSQL databases. Current data lives in RDBMS databases such as Oracle, SQL Server, MySQL and Teradata. We plan to migrate this RDBMS data to a big data platform backed by a NoSQL database, which can hold a variety of data from the existing systems. Migrating petabytes of data takes enormous resources and time, and both time and resources may be constraints on the migration process.

1.2 Competitive Information

In the summary report "Big Data Best Practices", we see 45 different examples of big data exploitation by businesses and industries of all sorts: entertainment, manufacturing, retail, financial, food services, travel, sports, fashion, politics, gaming, and much more. This report refers to the now famous 2011 McKinsey Global Institute report (Big Data: The Next Frontier for Innovation, Competition, and Productivity) when making this declaration: "there are no [Big Data] best practices. I'd say there are emerging next practices." Consequently, it is now time for all businesses to get on board that train and exploit big data for competitive advantage. The white paper "25 Data Stories from GNIP" provides another rich compilation of stories that demonstrate the "unlimited value and near limitless application" of big data, with a focus on business growth and competitive advantage through social data exploitation. The established data-integration products in this market offer capabilities such as:

• Hundreds of built-in data-type conversions, transformers, look-up matching, and aggregations
• Robust metadata, data lineage, and data modeling capabilities
• Data quality and profiling subsystems
• Workflow management, i.e., a GUI for generating ETL scripts and handling errors
• Fine-grained, role-based security

1.3 Relationship to Other Applications/Projects

Normally, NoSQL databases (such as HBase or Cassandra) are used together with Hadoop, and using these databases with Hadoop is merely a matter of configuration: no connecting program is needed to achieve this. Beyond that, there are a few other reasons for choosing a NoSQL database over a SQL database. One is size: NoSQL databases provide great horizontal scalability, which lets you store petabytes of data easily, whereas traditional systems scale only vertically. Another is the complexity of the data: the places where these databases are used mostly handle highly unstructured data, such as sensor data and log data, which is not easy to deal with in traditional systems. (A minimal configuration sketch follows.)
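To make the "matter of configuration" point concrete, here is a minimal sketch of an hbase-site.xml that points HBase at an existing HDFS cluster. The NameNode host and port are hypothetical placeholders, and only two core properties are shown; a real deployment would set more (for example, the ZooKeeper quorum).

    <!-- hbase-site.xml (sketch): run HBase on top of an existing HDFS cluster. -->
    <!-- namenode.example.com:8020 is a hypothetical NameNode address. -->
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://namenode.example.com:8020/hbase</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
    </configuration>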
A natural question is why Sqoop exists at all: why can we not use the SQL data directly on Hadoop? Although Hadoop is very good at handling big data needs, it is not the solution to every need; in particular, it is not suitable for real-time workloads. Suppose you are an online transaction company with a very large dataset. You find that you can process this data very easily using Hadoop, but you cannot serve the real-time needs of your customers with Hadoop. This is where Sqoop comes into the picture. It is an import/export tool that moves data between a SQL database and Hadoop: you move your big data into your Hadoop cluster, process it there, and then push the results back into your SQL database using Sqoop to serve the real-time needs of your customers, as sketched below.
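As an illustration, the sketch below drives a Sqoop 1 import programmatically through Sqoop's runTool entry point, which parses the same arguments as the sqoop command line. It assumes Sqoop 1.4.x (1.4.4 or later for --password-file) and its dependencies on the classpath; the JDBC URL, credentials file, table name and target directory are hypothetical placeholders.

    import org.apache.sqoop.Sqoop;

    /** Sketch: import one RDBMS table into HDFS with Sqoop 1. */
    public class SqoopImportSketch {
        public static void main(String[] args) {
            // Hypothetical connection details; substitute your own database, table and paths.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost.example.com:3306/sales",
                "--username", "etl_user",
                "--password-file", "/user/etl/.dbpassword", // keeps the password off the command line
                "--table", "employees",
                "--target-dir", "/user/etl/employees",      // HDFS directory that receives the files
                "--num-mappers", "4"                        // degree of parallelism for the copy
            };
            // runTool behaves like invoking `sqoop import ...` and returns a process-style exit code.
            System.exit(Sqoop.runTool(importArgs));
        }
    }

The same arguments work verbatim with the sqoop command-line client, and pushing processed results back into the SQL database is the mirror-image export tool, which reads an HDFS directory via --export-dir.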
A traditional RDBMS is used to handle relational data. Hadoop works well with structured as well as unstructured data, and supports various serialization and data formats, for example Text, JSON, XML and Avro. There are problems for which SQL databases are a perfect choice: if your data size permits it and your data type is relational, you are fine using the RDBMS approach. It has worked well in the past, it is a mature technology, and it has its place. Where the data size or type is such that you are unable to store it in an RDBMS, go for a solution like Hadoop. One such example is a product catalog: a car has different attributes than a television, and it is tough to create a new table per product type. Another example is machine-generated data, where the data size puts great pressure on a traditional RDBMS; that is a classic Hadoop problem. Document indexing is another. There are many such examples.

1.4 Assumptions and Dependencies

In CDH 3, all of the Hadoop API implementations were confined to a single JAR file (hadoop-core) plus a few of its dependencies, so it was relatively straightforward to make sure that classes from these JAR files were available at runtime. CDH 4 and CDH 5 are more complex: they bundle both MRv1 and MRv2 (YARN). To simplify things, CDH 4 and CDH 5 provide a Maven-based way of managing client-side Hadoop API dependencies that saves you from having to figure out the exact names and locations of all the JAR files needed to provide the Hadoop APIs. In CDH 5, Cloudera recommends that you use the hadoop-client artifact for all clients instead of managing JAR-file-based dependencies manually. The relevant topics are the following (a minimal Maven sketch follows the list):

• Flavors of the hadoop-client artifact
• Versions of the hadoop-client artifact
• Using hadoop-client for Maven-based Java projects
• Using hadoop-client for Ivy-based Java projects
• Using the JAR files provided in the hadoop-client package
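For a Maven-based project, depending on the client-side Hadoop APIs then reduces to a single declaration in pom.xml. The fragment below is a sketch for a CDH 5 build: the version string (2.6.0-cdh5.4.0) and the Cloudera repository URL are illustrative and should be matched to the actual cluster.

    <!-- pom.xml fragment (sketch): depend on hadoop-client instead of individual Hadoop JARs. -->
    <repositories>
      <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      </repository>
    </repositories>

    <dependencies>
      <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <!-- Illustrative CDH 5 version; use the one that matches your cluster. -->
        <version>2.6.0-cdh5.4.0</version>
      </dependency>
    </dependencies>

With this in place, the build tool resolves the right client JARs automatically instead of you tracking their names and locations by hand.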
1.5 Future Enhancements

Hadoop is an immature technology. As such, it naturally offers much room for improvement in both industrial strength and performance, and since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:

• Cloudera's proprietary code is focused on management, set-up, and the like.
• The "Phase 1" plans Hortonworks has shared for Apache Hadoop are focused on industrial strength, as are significant parts of "Phase 2".
• MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce (one aspect of which is simply C++ versus Java).
• So does Hadapt, but mainly versus Hive.
• Cloudera also reports a potential 4-5x performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite. (Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)

Hortonworks, a new Hadoop company spun out of Yahoo, has permitted publication of a slide deck outlining an Apache Hadoop roadmap. Phase 1 refers to work that is underway more or less now; Phase 2 is scheduled for alpha in October 2011, with production availability not too late in 2012.