November 8, 2015

Big Data Analysis

My first attempt at Big Data Analysis!!! Disclaimer: it is ‘Big’ for what I have done so far, not ‘Big’ by industry standards. Although the data had only about 30,000 rows and 4 columns, it exposed me to several caveats to watch for while doing data analysis: the importance of prepping the data, the need to understand the data before drawing any conclusions, and the risk of applying inappropriate statistical methods, such as taking the average of a distribution with two peaks.

The data is a WhatsApp chat archive of about 11 months from an active group. Excel was used for the analysis, with the exception of the Word Cloud, for which I used an online tool. This analysis also showed the importance of quantitative evidence and empirical data over what you feel or ‘think’; many of my assumptions were shattered 🙂

Here is what I found, enjoy (feel free to click the images to see a bigger version):

  1. Total number of messages over a period of 11 months (Nov 2014 to Sept 2015): 29,752 messages
  2. Unique number of members: 35
  3. Total number of media (pics, video, audio) files: 3,079 files
  4. Approximately every 10th msg is a pic/video/audio file
  5. People have typed messages of every length (no. of characters) from 1 to 256. The shortest length that no one has typed is 257 characters. After that point the distribution starts to show gaps.
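The arithmetic behind items 4 and 5 can be sketched in a few lines of Python. This is only an illustration, not how the analysis was done (that was Excel): the function names `messages_per_media_file` and `smallest_missing_length` are made up for this sketch, and the list of lengths is toy data, not the real chat.

```python
def messages_per_media_file(total_messages, media_messages):
    """Return how many messages there are for each media file."""
    return total_messages / media_messages


def smallest_missing_length(lengths):
    """Return the smallest positive length that never occurs in `lengths`."""
    seen = set(lengths)
    n = 1
    while n in seen:
        n += 1
    return n


# Figures taken from items 1 and 3 above.
ratio = messages_per_media_file(29_752, 3_079)
print(round(ratio, 2))  # 9.66, i.e. roughly every 10th message

# Toy data (hypothetical): lengths 1 to 5 all occur, 6 does not.
toy_lengths = [3, 1, 2, 5, 4, 9, 2]
print(smallest_missing_length(toy_lengths))  # 6
```

On the real data, the second function would be fed the character count of every message, and per item 5 it should come back with 257.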

6. Top message contributor …………. Flavin, the most talkative person in the group, followed by Mahalakshmi and Sriroopa.

Top Contributors


7. Top media contributor ……… Mahalakshmi loves to forward pics/videos, followed by SK and Mahadevan.

Media Contributors

8. Message distribution over the span of a week: surprizeeeeee, they are evenly distributed over all 7 days of the week, with a slight reduction over the weekend 🙂 A very consistent group.

Messages per Day of the Week
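A per-day count like this one can also be produced outside Excel. Here is a minimal Python sketch, assuming the message timestamps have already been parsed into `datetime` objects; the helper name `messages_per_weekday` and the three sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def messages_per_weekday(timestamps):
    """Count messages per day of the week (datetime.weekday(): Monday is 0)."""
    counts = Counter(ts.weekday() for ts in timestamps)
    # Return every day, including days with zero messages, in calendar order.
    return {name: counts.get(i, 0) for i, name in enumerate(DAY_NAMES)}


# Hypothetical sample: two messages on a Monday, one on a Saturday.
weekday_sample = [
    datetime(2015, 9, 7, 21, 30),
    datetime(2015, 9, 7, 22, 5),
    datetime(2015, 9, 12, 9, 0),
]
print(messages_per_weekday(weekday_sample))
```

Listing all seven days explicitly keeps a quiet day visible as a zero instead of silently dropping it from the chart.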

9. What is the peak traffic time and when is the lull? X-axis is US time, the labels are in India time. People love to chat around 9:30 pm India time!! Why is there a dip at 12:30pm India time?

Messages per hr of the day
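Juggling US time and India time is easy to get wrong by hand, so here is a rough Python sketch of bucketing messages by hour in India time. It assumes the timestamps are timezone-aware; the US offset used for the sample (UTC-5, ignoring daylight saving) is purely an assumption, since the post does not say which US time zone the export used.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# India Standard Time is a fixed UTC+5:30 offset (no daylight saving).
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Assumption for the sample only: a fixed UTC-5 offset, ignoring daylight saving.
US_EASTERN_STD = timezone(timedelta(hours=-5), "EST")


def messages_per_ist_hour(timestamps):
    """Count messages per hour of the day after converting to India time."""
    return Counter(ts.astimezone(IST).hour for ts in timestamps)


# Hypothetical sample timestamps.
hour_sample = [
    datetime(2015, 1, 10, 11, 0, tzinfo=US_EASTERN_STD),   # 21:30 in India
    datetime(2015, 1, 10, 11, 20, tzinfo=US_EASTERN_STD),  # 21:50 in India
    datetime(2015, 1, 10, 2, 0, tzinfo=US_EASTERN_STD),    # 12:30 in India
]
print(messages_per_ist_hour(hour_sample))
```

Because of the half-hour offset, a message sent on the hour in the US lands on the half hour in India, which is why the peak and the dip both show up at :30.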

10. All top contributors except Flavin don’t care whether it is a weekday or the weekend, they just type away; Flavin throttles down over the weekend.

Top contributors’ performance over the span of a week

11. My favorite, the Word Cloud, a visual representation of word usage, the bigger the word, the more frequently it is used. Our group loves exchanging pleasantries ‘good morning friends’ and for some reason the word ‘Flavin’. Note: this Word Cloud excludes common words like ‘the’, ‘a’, ‘and’, ‘is’ etc.

Word Cloud
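The Word Cloud itself came from an online tool, but the counting underneath it is simple. Below is a minimal Python sketch of a word-frequency count that skips common words; the stop-word set only contains the four examples named above, and the function name `word_frequencies` and the toy messages are hypothetical.

```python
import re
from collections import Counter

# Only the common words named in the note above; a real list would be longer.
STOP_WORDS = {"the", "a", "and", "is"}


def word_frequencies(messages, stop_words=STOP_WORDS):
    """Count lowercase words across all messages, skipping stop words."""
    counts = Counter()
    for message in messages:
        words = re.findall(r"[a-z']+", message.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts


# Toy messages, not taken from the real chat.
toy_messages = [
    "Good morning friends",
    "good morning Flavin",
    "Flavin is the best",
]
frequencies = word_frequencies(toy_messages)
print(frequencies.most_common(3))
```

Passing an empty set as `stop_words` gives the all-inclusive count with no exclusions.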

12. All inclusive list of frequently used words without any exclusions. ‘Flavin’ is the only name that makes it into this list. Why does everyone mention him by name?

Most commonly used words

13. We also love our emoticons. Ignore the 16th bar.

Message sizes

14. Longest message award goes to ……………….. Mahadevan. How long you ask? 23,412 characters, one giant forwarded msg on 24th Nov, 2014. Do click the image to see it in its full glory.

Message length distribution

<<< THE END >>>

It took me about 20 hrs to do this analysis; for any more data analysis, please be ready to pay me 😎

