The sources of big data

Where does big data come from?
Sometimes big data comes from a single very large source. More often, it's a collection of data from many smaller sources. With 7.5 billion people in the world, and even more computing devices, there's a lot of data out there to collect.
Let's explore a variety of sources.

Scientific research

The Large Hadron Collider, the world's largest particle accelerator, is used by physicists around the world to study the nature of matter. LHC experiments produce around 50–75 petabytes each year, the equivalent of 15–20 million high-definition movies.[1]
3-dimensional rendering of a particle collision. Labeled "CMS Experiment at the LHC, CERN, Data recorded: 2012-May-13".
A proton–proton collision recorded by the CMS experiment at the LHC. Image source: CERN.
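That movie comparison is easy to sanity-check: dividing the yearly data volume by the movie count tells us how big each movie would have to be, and the answer should land in the few-gigabytes range typical of HD video files. Here's a quick back-of-the-envelope check in Python (the per-movie size is derived from the article's numbers, not a CERN figure):

```python
PETABYTE = 10**15  # bytes, using decimal (SI) units
GIGABYTE = 10**9

# Yearly LHC output and the movie counts quoted above
data_low, data_high = 50 * PETABYTE, 75 * PETABYTE
movies_low, movies_high = 15_000_000, 20_000_000

# Implied size of one high-definition movie at each end of the estimate
print(data_low / movies_low / GIGABYTE)    # ≈ 3.3 GB per movie
print(data_high / movies_high / GIGABYTE)  # ≈ 3.8 GB per movie
```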
Earth is surrounded by thousands of satellites. NASA's EOSDIS (Earth Observing System Data and Information System) is one of the groups collecting imagery and sensor reports from those satellites, adding 23 terabytes of data to its archive every day.[5]
An artistic rendering of Earth with 16 satellites in orbit around it. Each satellite has a different name.
NASA's Earth science satellite fleet. Image source: NASA.
Because many scientific research projects are government funded, much of the data they collect is openly available in standard formats. That enables researchers and hobbyists everywhere to turn the data into valuable insights and opportunities.
You can explore a vast array of open data on Data.gov, an initiative from the U.S. government. You can analyze the data yourself or turn it into beautiful visualizations, like this animated Earth.
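You can also explore the catalog programmatically. Data.gov runs on CKAN, an open-source data portal that exposes a JSON search API. Here's a minimal sketch, assuming the standard CKAN `package_search` endpoint on catalog.data.gov (check the Data.gov developer documentation for current details):

```python
import json
import urllib.request

# CKAN's standard search endpoint; q= is the search term, rows= the page size.
url = "https://catalog.data.gov/api/3/action/package_search?q=climate&rows=5"

with urllib.request.urlopen(url) as response:
    result = json.load(response)["result"]

print(f"{result['count']} datasets match 'climate'. First five:")
for dataset in result["results"]:
    print("-", dataset["title"])
```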

Digital libraries

Digital libraries archive vast numbers of historical documents, artifacts, and media.
The Internet Archive is a non-profit that attempts to archive every webpage at multiple points in its history. Our own website, Khan Academy, has been captured more than 8,000 times, so we can reflect fondly on our early days in 2008. A single copy of the archive takes up more than 30 petabytes of space, and since the Internet Archive certainly doesn't want to lose that data, it keeps multiple copies of that 30-petabyte archive.[2]
Screenshot of Internet Archive for khanacademy.org, shows timeline of captures along the top and old Khan Academy homepage underneath timeline.
Khan Academy on December 30, 2008, the 5th of 8,974 captures.
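The Internet Archive also lets you look up captures programmatically through its Wayback Machine availability API, which returns the archived snapshot closest to a requested date. A small sketch (endpoint as documented on archive.org; the response format could change):

```python
import json
import urllib.request

# Ask the Wayback Machine for the capture closest to Dec 30, 2008.
url = ("https://archive.org/wayback/available"
       "?url=khanacademy.org&timestamp=20081230")

with urllib.request.urlopen(url) as response:
    snapshots = json.load(response)["archived_snapshots"]

closest = snapshots.get("closest")  # empty dict means no capture exists
if closest:
    print("Closest capture:", closest["timestamp"], closest["url"])
```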
Google Books is a related project that has scanned over 25 million books and hopes to eventually scan every book in the world.[2] The scanning algorithms use optical character recognition (OCR) to turn the scanned book pages into text, so you may find results from books in Google search queries. The Google Ngram Viewer uses the scanned text database to visualize how often words were used by authors over the last few hundred years.
Screenshot from Google Ngram Viewer for the words "computer", "telegram", and "typewriter". Chart goes from 1840 to 2000, shows rapidly increasing line for "computer" and much smaller decreasing lines for "telegram" and "typewriter".
Google Ngram Viewer for "computer", "typewriter", "telegram".
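Under the hood, the Ngram Viewer is doing a giant word tally: for each year, it counts how often a word appears in books published that year and divides by the total words for that year. Here's a toy sketch of that idea on a tiny made-up corpus (the real system runs over millions of scanned volumes and handles multi-word phrases too):

```python
from collections import Counter

# Toy corpus: (publication year, OCR'd text) pairs standing in for scanned books.
corpus = [
    (1950, "the typewriter sat beside the telegram on the desk"),
    (1985, "the computer replaced the typewriter in most offices"),
    (2000, "every computer on the network ran the new computer program"),
]

for year, text in corpus:
    words = text.split()
    tally = Counter(words)
    # Relative frequency: occurrences of "computer" per word published that year
    print(year, tally["computer"] / len(words))
```

Run it and the frequency of "computer" climbs from 0.0 in 1950 to 0.2 in 2000, the same rising curve the chart above shows.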

Medical records

An increasing number of health care providers store patient data in an electronic health record (EHR). An electronic health record includes the patient's demographics, medical issues, medications ordered and taken, laboratory results, and imaging results.[6]
Medical imagery is the bulkiest of the data in an EHR, since images take up so much more space than text. Hospitals often use imagery to diagnose internal injuries and tumors, and they may use a variety of technologies, such as magnetic resonance imaging (MRI), positron emission tomography (PET), and X-ray computed tomography (CT).
A CT scan creates cross-sectional images of a body part or the entire body. The animation below shows 34 slices from a CT brain scan, from the top of the skull to the base:
Animated GIF of CT scans of a brain, starting from the top of the skull and ending at the base.
A typical CT scan captures images of 512 x 512 pixels and stores each pixel using 16 bits. The 34-slice brain scan above would take up about 18 MB of storage space, and a more detailed scan or a scan of a larger region of the body would take up even more. A single hospital can easily generate terabytes of imagery data each year.[7]
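The 18 MB figure follows directly from the scan parameters: each slice stores 512 × 512 pixels at 16 bits (2 bytes) apiece, and the animation above has 34 slices. A quick check:

```python
WIDTH, HEIGHT = 512, 512   # pixels per CT slice
BITS_PER_PIXEL = 16
NUM_SLICES = 34            # slices in the brain scan above

bytes_per_slice = WIDTH * HEIGHT * BITS_PER_PIXEL // 8   # 524,288 bytes
total_bytes = bytes_per_slice * NUM_SLICES

print(total_bytes / 10**6, "MB")  # 17.825792 -> roughly the 18 MB quoted
```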
In the US, health care providers need to store all that patient data in a way that's compliant with the Health Insurance Portability and Accountability Act (HIPAA). Their data storage mechanism must have privacy safeguards, to ensure only authorized health care providers can access the data. It also needs a backup copy and a disaster recovery strategy, to ensure the data isn't accidentally destroyed.[8]

User-facing applications

Any application with millions of users is also collecting big data about its users' interactions.
Back in 2014, Facebook reportedly generated 4 new petabytes of data every single day.[4] That amount of data presents huge challenges for processing, storage, and privacy.
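To put 4 petabytes per day in perspective, we can compute the sustained data rate that volume implies (a rough average, ignoring daily peaks):

```python
PETABYTE = 10**15  # bytes, using decimal (SI) units
SECONDS_PER_DAY = 24 * 60 * 60

rate = 4 * PETABYTE / SECONDS_PER_DAY
print(rate / 10**9, "GB per second")  # ≈ 46 GB of new data every second
```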
We'll look at some of the challenges of dealing with large data sets in the next article.
🤔 What other sources of big data can you think of? Is your own data becoming part of a big data collection somewhere?
