My first attempt at Big Data Analysis!!! Disclaimer: it is ‘Big’ for what I have done so far, not ‘Big’ by industry standards. Although the data had only about 30,000 rows and 4 columns, it exposed me to several caveats that one needs to watch for while doing data analysis: the importance of prepping the data, the need to understand the data before drawing any conclusions, and the risk of applying inappropriate statistical methods, such as taking the average of a distribution with two peaks.
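To make the two-peaks caveat concrete, here is a small Python sketch. The analysis itself was done in Excel, so this is only an illustration, and the numbers are made up rather than taken from the chat archive. It shows how the average of a two-peaked sample can land in the valley between the peaks, where hardly any data lives.

```python
from statistics import mean

# Hypothetical hours of the day (0-23) at which messages were sent:
# one cluster in the morning and one in the evening. Not real chat data.
morning = [8, 8, 9, 9, 9, 10]
evening = [20, 21, 21, 21, 22, 22]
hours = morning + evening

average_hour = mean(hours)

# Count how many messages actually fall within one hour of the average.
near_average = sum(1 for h in hours if abs(h - average_hour) <= 1)

print(f"average hour: {average_hour}")
print(f"messages within 1 hour of the average: {near_average}")
```

With this made-up sample the average works out to hour 15 (3 pm), yet not a single message falls within an hour of it. Looking at the histogram first would have made that obvious.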
The data is a WhatsApp chat archive covering about 11 months from an active group. I used Excel for the analysis, except for the Word Cloud, for which I used an online tool, http://www.wordle.net/. This analysis also showed the importance of quantitative evidence and empirical data over what you feel or ‘think’; many of my assumptions were shattered 🙂
Here is what I found, enjoy (feel free to click the images to see a bigger version):
1. Total number of messages over a period of 11 months (Nov 2014 to Sept 2015): 29,752 messages
2. Number of unique members: 35
3. Total number of media (pics, video, audio) files: 3,079 files
4. Approximately every 10th message is a pic/video/audio file
5. People have typed messages of every length (no. of characters) from 1 to 256. The shortest length that no one has typed is 257 characters. After that point gaps start to appear in the distribution.
6. Top message contributor …………. Flavin, the most talkative person in the group, followed by Mahalakshmi and Sriroopa.
7. Top media contributor ……… Mahalakshmi, who loves to forward pics/videos, followed by SK and Mahadevan.
8. Message distribution over the span of a week: surprizeeeeee, the messages are evenly distributed over all 7 days of the week, with a slight reduction over the weekend 🙂 A very consistent group.
9. What is the peak traffic time and when is the lull? The x-axis is in US time; the labels are in India time. People love to chat around 9:30 pm India time!! Why is there a dip at 12:30 pm India time?
10. All top contributors, except Flavin, don’t care if it is a weekday or the weekend; they just type away. Flavin, however, throttles down over the weekend.
11. My favorite, the Word Cloud: a visual representation of word usage where the bigger the word, the more frequently it is used. Our group loves exchanging pleasantries like ‘good morning friends’ and, for some reason, the word ‘Flavin’. Note: this Word Cloud excludes common words like ‘the’, ‘a’, ‘and’, ‘is’, etc.
12. All-inclusive list of frequently used words, without any exclusions. ‘Flavin’ is the only name that makes it into this list. Why does everyone mention him by name?
13. We also love our emoticons. Ignore the 16th bar.
14. Longest message award goes to ……………….. Mahadevan. How long, you ask? 23,412 characters, one giant forwarded msg on 24th Nov 2014. Do click the image to see it in its full glory.
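For anyone who would rather script a couple of these checks than build them in Excel, here is a rough Python sketch of two of them: counting words with and without common-word exclusions (items 11 and 12) and finding the shortest message length that nobody typed (item 5). This is not how the charts above were made. The function names, the stop-word list and the sample messages are all hypothetical, and the sketch assumes the message texts have already been pulled out of the chat export into a plain list of strings.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the exact list the Word Cloud tool uses is not known.
STOP_WORDS = {"the", "a", "and", "is"}


def word_counts(messages, stop_words=frozenset()):
    """Count lowercase words across all messages, skipping any in stop_words."""
    counts = Counter()
    for text in messages:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


def shortest_unused_length(messages):
    """Return the smallest length in characters (starting at 1) that no message has."""
    used = {len(text) for text in messages}
    length = 1
    while length in used:
        length += 1
    return length


# Made-up sample messages, not taken from the real archive.
sample = ["Good morning friends", "good morning", "The match is on", "a", "ok"]

print(word_counts(sample, STOP_WORDS).most_common(3))  # with exclusions
print(word_counts(sample).most_common(3))              # without exclusions
print(shortest_unused_length(sample))
```

Passing no stop words gives the all-inclusive list, while passing `STOP_WORDS` mimics the filtered Word Cloud input. On the real archive, `shortest_unused_length` should report 257 if the Excel finding holds.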
<<< THE END >>>
It took me about 20 hrs to do this analysis; for any more data analysis, please be ready to pay me 😎