Monday, September 3, 2012

Relevance of Hadoop in Enterprise Application Development


RDBMS compared to Hadoop/MapReduce (taken from the book 'Hadoop: The Definitive Guide'):

data size - gigabytes                  v/s  petabytes
access    - interactive and batch     v/s  batch
updates   - read and write many times v/s  write once, read many times
structure - static schema             v/s  dynamic schema
integrity - high                      v/s  low
scaling   - nonlinear                 v/s  linear

Enterprise apps are generally characterized by data volumes in gigabytes rather than petabytes, and they are required to be interactive: a user makes a request and expects reports to reflect it instantly. Apps often work on transactional data (insert/update/delete) rather than write-once, read-many data. Enterprise apps also need high data integrity.
In an enterprise app development scenario, most applications are in control of deciding whether the data they store will be structured or unstructured. Why would they choose to store data in unstructured form? If all applications within an enterprise store data in structured form, then when one enterprise app needs data from another, it can simply access the structured data in that app through formal interfaces like direct DB access, SOAP, REST or a variety of other EAI options, with support for transactions and other sharing semantics.
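As an illustration of such a formal interface, here is a minimal sketch of exposing structured data over REST using JAX-RS; the Order entity and its stubbed lookup are hypothetical, and a JSON provider such as Jackson is assumed to be registered with the JAX-RS runtime.

```java
// A hypothetical JAX-RS resource exposing one app's structured data to other apps.
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/orders")
public class OrderResource {

    // Hypothetical structured entity; in a real app this would map to a DB table.
    public static class Order {
        public long id;
        public String status;
        public Order(long id, String status) { this.id = id; this.status = status; }
    }

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON) // assumes a JSON provider (e.g. Jackson) is registered
    public Order getOrder(@PathParam("id") long id) {
        // Stubbed lookup; another enterprise app consumes this JSON
        // without needing access to our raw storage.
        return new Order(id, "SHIPPED");
    }
}
```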


Enterprise apps may contain data in the form of uploaded documents, etc., which can represent unstructured data. Using tools like Solr and Lucene, this unstructured data can be indexed, stored off and searched relatively easily.
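For instance, here is a minimal sketch of indexing an uploaded document's extracted text with SolrJ (the Solr 4.x-style API); the core name "documents" and the field names are assumptions.

```java
// A minimal SolrJ sketch; core name and field names are assumptions.
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DocumentIndexer {

    public static void main(String[] args) throws IOException, SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/documents");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "upload-42");                  // unique key
        doc.addField("title", "Q3 expense report");       // metadata kept alongside the text
        doc.addField("content", "text extracted from the uploaded file");

        solr.add(doc);    // send the document to the index
        solr.commit();    // make it searchable

        solr.shutdown();
    }
}
```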

Within an enterprise it would not make much sense to gather stats from logs etc., unless it is for purposes like finding app feature usage through click-model analysis. But even there, batch processes can analyse logs periodically and store the results in summary tables, instead of raw logs being retained for years and years and then running Hadoop-based computations on the raw data.
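A minimal sketch of that batch alternative, assuming hypothetical request_log and daily_feature_usage tables in a PostgreSQL database: a scheduled job rolls raw click logs up into a summary table, after which old raw rows can be purged.

```java
// Hypothetical nightly rollup job: summarize raw click logs, then trim the raw table.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class DailyUsageRollup {

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret")) {
            Statement st = con.createStatement();

            // Aggregate yesterday's clicks per feature into the summary table.
            st.executeUpdate(
                "INSERT INTO daily_feature_usage (day, feature, hits) " +
                "SELECT CURRENT_DATE - 1, feature, COUNT(*) " +
                "FROM request_log " +
                "WHERE logged_on >= CURRENT_DATE - 1 AND logged_on < CURRENT_DATE " +
                "GROUP BY feature");

            // Raw rows older than a 30-day window are no longer needed once summarized.
            st.executeUpdate(
                "DELETE FROM request_log WHERE logged_on < CURRENT_DATE - 30");
        }
    }
}
```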

Most Hadoop use cases seem to exhibit the following characteristics:

  • the data under analysis is unstructured and not under the control of the design, so traditional RDBMS / data warehousing / ETL alternatives to Hadoop cannot be employed
  • there is a need to retain the raw data as-is, so a cheap, scalable, commodity-hardware-based distributed file system is needed anyway
  • the data is of the order of terabytes or more, rather than gigabytes
Hadoop seems to be well suited for uses such as an internet search engine, wherein:
  • not only is there massive data of the order of terabytes and petabytes, but more importantly,
  • this data is produced at a very fast rate, so the conventional design of a batch process summarizing it periodically is not feasible from an economics point of view
  • the massive rate at which data is generated is not under control or determinate
  • the sources which generate the data are massive in number and highly distributed
  • also, data sources are unknown and cannot be intercepted; e.g. it would not be feasible to ask Twitter for an event every time someone in the world tweets


Most enterprise applications seem to exhibit the following characteristics:

  • An enterprise application produces and consumes data in various formats
  • The data produced by the application can, to some extent, be controlled and forced into a structured format; even documents uploaded by users can be indexed and made searchable
  • The rate at which data is produced is not internet-scale, nor is it uncontrolled or indeterminate
  • Data sources are well known and can often be intercepted; e.g. you can ask another enterprise app to send out an async message, poll its database, or use some other notification mechanism at the source where the data is generated (see the sketch after this list)
  • The data consumed by an enterprise application mostly comes from other enterprise apps, through well-established EAI and notification mechanisms
  • If an enterprise needs huge data from external sources for analytics purposes, there can be ready-made tools and COTS applications which mine public-domain or external data and give summaries to the enterprise application. The enterprise application should not have to bring that external data onboard and write its own Hadoop jobs to process it. Hadoop-based analysis tools should ideally find their way into integration with enterprise applications.
  • For example, if I were an automobile / finance company with a robust IT applications requirement, to get marketing data I would not think of asking my IT to use Hadoop to get that data; instead, I would buy that data, or at least use tools for getting it, since that work is not specific to my enterprise. Contrast this with having SOAP, REST or EAI expertise in my IT, so that all apps in my enterprise can integrate.
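The async-message option mentioned in the list above could look something like this: a minimal sketch using plain JMS, where the JNDI names (jms/ConnectionFactory, jms/OrderEvents) and the message payload are assumptions, not a prescribed setup.

```java
// Hypothetical publisher: notify interested apps at the source, the moment data is generated.
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class OrderEventPublisher {

    public static void main(String[] args) throws JMSException, NamingException {
        // Assumed JNDI names; these depend on the app server / broker configuration.
        InitialContext jndi = new InitialContext();
        ConnectionFactory factory = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) jndi.lookup("jms/OrderEvents");

        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(queue);

        // Consuming apps receive this event instead of polling our database.
        TextMessage msg = session.createTextMessage("{\"orderId\": 42, \"event\": \"CREATED\"}");
        producer.send(msg);

        connection.close();
    }
}
```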
Disclaimer:
I have not yet worked on any Hadoop-based application, nor can I claim extensive experience in big data.
Still, I thought that putting across my thoughts, based on experience in enterprise app development, might bring up questions about Hadoop's relevance in the enterprise that are also on the minds of other enterprise architects.


Note:

Structured data is data that is organized into entities that have a predefined format, like XML docs or database tables; this is the realm of the RDBMS. Semi-structured data, like spreadsheets, has a looser structure, while unstructured data, like plain text or image data, does not have any internal structure. MapReduce works well on data that is semi-structured or unstructured.

Some Hadoop use cases mentioned in the book:

last.fm

  • using user-generated track listening data to produce different charts, like weekly charts of top tracks per country and per user; users listen to tracks using last.fm's own client or one of hundreds of third-party client apps (a sketch of this kind of job follows)
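To make the shape of such a job concrete, here is a minimal sketch of a Hadoop MapReduce job counting plays per track; the tab-separated userId/trackId input format is an assumption for illustration, not last.fm's actual log schema.

```java
// A minimal play-count job over listen logs (one "userId<TAB>trackId" per line).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrackPlayCount {

    public static class PlayMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t"); // assumed: userId \t trackId
            if (fields.length == 2) {
                ctx.write(new Text(fields[1]), ONE);       // emit (track, 1) per listen
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text track, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            ctx.write(track, new LongWritable(total));     // (track, total plays)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "track play count");
        job.setJarByClass(TrackPlayCount.class);
        job.setMapperClass(PlayMapper.class);
        job.setCombinerClass(SumReducer.class); // safe: sum is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```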


facebook.com

  • producing daily and hourly summaries over large amounts of data
    • reports based on these summaries are used to drive product decisions by engineering and non-engineering teams; these summaries include reports on user growth, page views, and average time spent on the site by users
    • providing performance numbers about ad campaigns that are run on facebook
    • backend processing of site features like "likes" on people, applications, etc.
  • running ad hoc jobs over historical data, for analysis by the product team
  • as the de facto long-term archival store for log datasets
  • to look up log events by specific attributes, to maintain site integrity and protect users against spam bots
  • complementing existing data warehousing infrastructure by storing unstructured data


Rackspace

  • Log processing
    • Hadoop is used to process logs generated by user interactions, and the end result is a set of Lucene indexes that customer support can query
  • to improve the scalability of an RDBMS we resort to sharding, which loses the ability to analyse the data as a whole; instead of sharding, Rackspace decided to use Hadoop. Scaling is linear, and processing of raw data can be done in parallel, using the same algorithms for small, large or extremely large datasets



Hypothetical use cases

  • advertiser insights and performance
    • advertisers have to be provided standard aggregated stats about their ads
  • ad hoc analysis and product/feature feedback
  • data analysis for
    • websites
    • bioinformatics
    • oil and other exploration companies




