Finding Great Datasets

By John Sutor
May 13, 2020 · 5 minute read

"Data is like garbage. You better know what you're going to do with it before you collect it." This quote is often attributed to Mark Twain. While it arguably applies more to planning and beginning a project, this post is concerned with finding great datasets. After all, every great research project is built around quality data. Typically, a research project involves collecting your own data through experimentation. However, this may be difficult in some contexts. Getting involved in a state-of-the-art cancer research laboratory is no walk in the park when you're only a high school student. Furthermore, collecting atmospheric data isn't as simple as walking around in your backyard while holding a thermometer. Given these limitations, openly available datasets are the way to go for exploring data that you are unable to collect yourself. Therefore, we'll help you get started with locating some phenomenal openly available datasets online.

Finding Great Data

Okay, you're probably thinking "why can't I just Google 'great dataset' and click the first link that comes up? Bam! I'll be ready to start my own project." By all means, if you want to do this, feel free! But are you sure that the data that you're working with is "clean" (not a pain in the neck to work with)? Are you confident that you understand how the data is labeled? Just because a dataset is openly available online doesn't necessarily mean that it's easy to work with. Luckily, there are great websites and resources that provide easy-to-use open data that you can quickly get up and running with.
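
One quick way to judge how "clean" a dataset is before committing to it is to count how many values are missing in each column. The snippet below is a minimal sketch of that idea, not a complete cleaning routine. The sample data, the column names, and the list of markers treated as "missing" are all assumptions made up for illustration; a real dataset may mark missing values differently.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file you downloaded.
SAMPLE_CSV = """date,temperature_c,station
2020-05-01,18.2,A
2020-05-02,,A
2020-05-03,19.1,
2020-05-04,NA,B
"""

def count_missing(csv_text, missing_markers=("", "NA", "NaN", "null")):
    """Return {column: number of missing cells} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # A short row yields None; otherwise compare against the assumed markers.
            if value is None or value.strip() in missing_markers:
                counts[name] += 1
    return counts

missing = count_missing(SAMPLE_CSV)
print(missing)
```

If one column is mostly empty, or the dataset uses a marker you did not expect, that is a sign you will need to spend some time cleaning before you can start exploring.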

Our first highly recommended resource for open data is NEON (National Ecological Observatory Network). At https://data.neonscience.org/, you can find a myriad of data related to the atmosphere, biogeochemistry (that's a mouthful), ecohydrology, land processes, and organisms/communities. You can further refine this data by the location where it was collected, when it was collected, and the type of data that was collected. The majority of the data is available in a time series format, which means that you can identify trends in the dataset that you're working with over a period of time. Perhaps the best part about the NEON platform is the in-depth tutorials for how to work with their datasets. These tutorials cover everything from how the data is collected, to loading the data onto your computer, to examining the data yourself. While many of these tutorials and examples are currently written for the R programming language, NEON is working on data tutorials in the Python language as well. You can find these tutorials and workshops at https://www.neonscience.org/resources.
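
To make "identifying a trend in a time series" concrete, here is a small sketch in Python that fits a straight line to a handful of dated readings and reports the slope. The readings are invented for illustration and are not real NEON data, and the function name is our own. A least-squares line is only one simple way to summarize a trend; it is an assumption here, not something NEON prescribes.

```python
from datetime import date

# Hypothetical daily mean air temperatures (degrees C); not real NEON data.
readings = [
    (date(2020, 5, 1), 14.0),
    (date(2020, 5, 2), 15.0),
    (date(2020, 5, 3), 16.0),
    (date(2020, 5, 4), 17.0),
    (date(2020, 5, 5), 18.0),
]

def trend_per_day(series):
    """Least-squares slope of value against time, in units per day."""
    start = series[0][0]
    xs = [(day - start).days for day, _ in series]  # days since first reading
    ys = [value for _, value in series]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

slope = trend_per_day(readings)
print(f"Temperature changes by about {slope:.2f} degrees C per day")
```

A positive slope means the values are rising over the period, and a negative slope means they are falling. With real data you would expect noise, so it is worth plotting the series as well rather than relying on the slope alone.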

Another phenomenal source of data is Kaggle. At https://www.kaggle.com/datasets, you can find thousands of datasets on topics ranging from image recognition to medical data. You can search for datasets and filter them by the data format, the size of the dataset, and even how openly available the dataset is. One great feature of Kaggle datasets that we appreciate is that they are assigned a "usability" score ranging from 1 to 10. This allows you to choose data at a usability level that you are comfortable working with, so you can spend less time organizing and "cleaning" your data and more time actually diving in and exploring it. Furthermore, you can see how other Kaggle users worked with the open data in their own code by clicking the Kernels tab on each dataset. There, you can see user submissions sorted by popularity that will help you get up and running quickly with the data.
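
The idea of picking datasets by usability score can be sketched in a few lines of Python. Everything in this example is hypothetical: the dataset names and scores are made up, and the code filters a plain list rather than talking to Kaggle. It only illustrates the decision of setting a minimum score and looking at the most usable candidates first.

```python
# Hypothetical search results; names and scores are made up for illustration.
datasets = [
    {"name": "city-air-quality", "usability": 9.4},
    {"name": "scanned-lab-notes", "usability": 4.1},
    {"name": "daily-river-levels", "usability": 7.6},
]

def shortlist(candidates, min_usability):
    """Keep datasets at or above the threshold, most usable first."""
    keep = [d for d in candidates if d["usability"] >= min_usability]
    return sorted(keep, key=lambda d: d["usability"], reverse=True)

picked = shortlist(datasets, min_usability=7.0)
print([d["name"] for d in picked])
```

If you are just starting out, a higher threshold is a reasonable choice; you can lower it later once you are more comfortable cleaning data yourself.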

If you want to work with data provided by a certain country, be sure to check out Open Government Data websites at https://www.data.gov/open-gov/. For the United States Government (as in, ALL data provided by the United States Government), simply access https://www.data.gov/. With hundreds of thousands of open datasets from departments such as the National Science Foundation and the National Aeronautics and Space Administration, you can find virtually all of the data you would ever need for creating a great research project. You can even further narrow down the data to topics such as agriculture, energy, ecosystems, climate, and the ocean. While there are many more datasets offered through this data portal, they do not include tutorials on how to work with the data similar to those NEON and Kaggle provide. However, if you're willing to dig in and figure out how to work with the data yourself, the sheer number of datasets available is a huge plus. Similarly, you can access India's open data at https://data.gov.in/, open data from Seoul, South Korea at http://data.seoul.go.kr/, and Brazil's open data at http://www.dados.gov.br/, to name a few.

Finally, we recommend checking out the GitHub page https://github.com/awesomedata/awesome-public-datasets to find public datasets across virtually all subject areas. This page is open source and maintained by GitHub collaborators eager to share the amazing datasets that they have found. The datasets are constantly monitored to determine whether they are being actively maintained or need to be fixed, so you don't have to worry about working with antiquated or unreliable data. The one downside, as with the United States Government data portal, is the lack of tutorials for investigating the provided data. However, you can explore this data on your own and figure out how to use it for the thrill of the adventure.

Moving Forward

Once you've found a great dataset, it's time to get to work! If you don't know where to start, feel free to check out some of our other blog posts for help. From creating a hypothesis to formulating a method for your research to follow, we've got you covered. If you think we missed any great resources, click the feedback link below to tell us what you found useful or how we can improve. We're always updating our website with more great resources, so stay tuned for useful posts to come. If you're currently working on a project or plan to work on one in the near future, be sure to sign up for our website! Once signed up, you can request mentorship, ask questions, share your work, and receive feedback on your project! If you want to make sure that your science research or project doesn't go unnoticed, be sure to share it on our website for all to see.
