Real-Time Kafka / MapR Streams Data Ingestion into HBase / MapR-DB via PySpark

Streaming data is becoming an essential part of every data integration project: if not a core requirement, then second nature. Real-time data streaming brings many advantages. To name a few: real-time analytics and decision making, better resource utilization, data pipelining, and support for microservices. Python has many modules out […]
Perfecting Lambda Architecture with Oracle Data Integrator (and Kafka / MapR Streams)

Republished by: MapR Technologies, Datafloq. Introduction: “Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online […]
Analyzing The World Factbook by CIA

“The World Factbook provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 267 world entities.” In this post I’m going to work on a public dataset provided by the CIA, which can be obtained from https://www.cia.gov/library/publications/the-world-factbook/. The data provided, which you’ll be able to download from my notebook, is […]
Hacker News Data Analysis using Python

This is going to be a short and quick one before the weekend. I’ll be working with a dataset that has submissions to Hacker News from 2006 to 2015. Hacker News is a site where “users can submit articles from across the internet (usually about technology and startups), and others can “upvote” the articles, signifying […]
US Public Schools Civil Rights Data Analysis using Python

In this post I’m going to analyze the 2013-14 Civil Rights Data Collection (CRDC). The CRDC is “a survey of all public schools and school districts in the United States. It measures student access to courses, programs, instructional and other staff, and resources — as well as school climate factors, such as […]
Star Wars Analytics using Python

Yes, you have read that correctly. In this post I’m going to clean up a dataset collected from 835 people: a survey with several questions about Star Wars episodes 1 to 6. This will enable me to answer questions like “Does the rest of America realize that “The Empire Strikes Back” is […]