Finding Great Datasets

By John Sutor
May 13, 2020 · 5 minute read

"Data is like garbage. You better know what you're going to do with it before you collect it." This quote, commonly attributed to Mark Twain, arguably applies more to planning and starting a project, but this post is concerned with finding great datasets. After all, every great research project is built around quality data. Typically, a research project involves collecting your own data through experimentation. However, this may be difficult in some contexts. Getting involved in a state-of-the-art cancer research laboratory is no walk in the park when you're only a high school student. Furthermore, collecting atmospheric data isn't as simple as walking around in your backyard while holding a thermometer. Given these limitations, openly available datasets are the way to go for exploring data that you are unable to collect yourself. To that end, we'll help you get started with locating some phenomenal openly available datasets online.

Finding Great Data

Okay, you're probably thinking, "Why can't I just Google 'great dataset' and click the first link that comes up? Bam! I'll be ready to start my own project." By all means, if you want to do this, feel free! But are you sure that the data you're working with is "clean" (not a pain in the neck to work with)? Are you confident that you understand how the data is labeled? Just because a dataset is openly available online doesn't necessarily mean that it's easy to work with. Luckily, there are great websites and resources that provide easy-to-use open data that you can quickly get up and running with.
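One quick way to judge how "clean" a dataset is: count how many values are missing in each column before you commit to it. Here is a minimal sketch using only Python's standard library. The sample CSV text, the column names, and the list of "missing" markers are all made up for illustration; a real dataset may mark missing values differently.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file you downloaded.
SAMPLE_CSV = """site,date,temperature_c,humidity_pct
A,2020-01-01,4.2,81
A,2020-01-02,,79
B,2020-01-01,NA,
B,2020-01-02,5.1,77
"""

# Strings treated as "missing" (an assumption; check your dataset's docs).
MISSING_MARKERS = {"", "na", "nan", "null", "n/a"}


def count_missing(csv_text):
    """Return (row_count, {column_name: number_of_missing_values})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in reader.fieldnames:
            value = row[name]
            # DictReader fills short rows with None, so check for that too.
            if value is None or value.strip().lower() in MISSING_MARKERS:
                missing[name] += 1
    return row_count, missing


row_count, missing = count_missing(SAMPLE_CSV)
for column, n in missing.items():
    print(f"{column}: {n} of {row_count} values missing")
```

If a column you care about turns out to be mostly empty, that is a good sign to keep looking for a different dataset.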

Our first highly recommended resource for open data is NEON (National Ecological Observatory Network). At https://data.neonscience.org/, you can find a wide range of data related to the atmosphere, biogeochemistry (that's a mouthful), ecohydrology, land processes, and organisms/communities. You can further filter this data by the location where it was collected, when it was collected, and the type of data that was collected. The majority of the data is available in a time series format, which means that you can identify trends in the dataset that you're working with over a period of time. Perhaps the best part about the NEON platform is its in-depth tutorials on how to work with the datasets. These tutorials cover everything from how the data is collected, to loading the data onto your computer, to examining the data yourself. While many of these tutorials and examples are written for the R programming language, NEON is working on data tutorials in Python as well. You can find these tutorials and workshops at https://www.neonscience.org/resources.
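To make "identifying a trend" in time series data concrete, here is a rough sketch that fits a straight line to a handful of readings and reports the change per day. The readings are invented for illustration and are not NEON data, and this is only one simple way to measure a trend.

```python
from datetime import date

# Hypothetical daily mean temperatures: (date, degrees Celsius).
READINGS = [
    (date(2020, 5, 1), 14.0),
    (date(2020, 5, 2), 15.0),
    (date(2020, 5, 3), 16.0),
    (date(2020, 5, 4), 17.0),
]


def trend_per_day(readings):
    """Least-squares slope of the values against days since the first reading.

    Needs at least two readings taken on different dates.
    """
    start = readings[0][0]
    xs = [(day - start).days for day, _ in readings]
    ys = [value for _, value in readings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


slope = trend_per_day(READINGS)
print(f"Temperature changes by about {slope:.2f} degrees C per day")
```

A positive slope means the values are rising over time and a negative slope means they are falling. Real measurements are noisier than this example, so plot the data as well before drawing conclusions.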

Another phenomenal source of data is Kaggle. At https://www.kaggle.com/datasets, you can find thousands of datasets on topics ranging from image recognition to medical data. You can search for datasets and filter them by the data format, the size of the dataset, and even how openly available the dataset is. One feature of Kaggle datasets that we appreciate is that each one is assigned a "usability" score ranging from 1 to 10. This lets you pick data at a usability level that you are comfortable with, so that you can potentially spend less time organizing and "cleaning" your data and more time actually diving in and exploring it. Furthermore, you can see how other Kaggle users worked with the open data in their own code by clicking the Kernels tab on each dataset. There, you can see user submissions sorted by popularity, which will help you get up and running with the data quickly.

If you want to work with data provided by a certain country, be sure to check out the Open Government Data websites listed at https://www.data.gov/open-gov/. For data published by the United States Government, simply access https://www.data.gov/. With hundreds of thousands of open datasets from agencies such as the National Science Foundation and the National Aeronautics and Space Administration, you can find virtually all of the data you would ever need for creating a great research project. You can further narrow down the data to topics such as agriculture, energy, ecosystems, climate, and the ocean. While this data portal offers many more datasets than NEON or Kaggle, it does not include tutorials on how to work with the data like the ones they provide. However, if you're willing to dig in and figure out how to work with the data yourself, the sheer number of datasets available is a huge plus. Similarly, you can access India's open data at https://data.gov.in/, open data from Seoul, South Korea at http://data.seoul.go.kr/, and Brazil's open data at http://www.dados.gov.br/, to name a few.

Finally, we recommend checking out the GitHub page https://github.com/awesomedata/awesome-public-datasets to find public datasets across virtually all subject areas. This page is open source and maintained by GitHub contributors eager to share the amazing datasets that they have found. The listed datasets are monitored to determine whether they are being actively maintained or need to be fixed, so you are less likely to end up working with antiquated or unreliable data. The one downside, similar to the United States Government data portal, is the lack of tutorials for investigating the provided data. However, you can explore this data on your own and figure out how to use it for the thrill of the adventure.

Moving Forward

Once you've found a great dataset, it's time to get to work! If you don't know where to start, feel free to check out some of our other blog posts for help. From creating a hypothesis to formulating a method for your research to follow, we've got you covered. If you think we missed any great resources, click the feedback link below to let us know what you found useful or how we can improve. We're always updating our website with more great resources, so stay tuned for useful posts to come. If you're currently working on a project or plan to work on one in the near future, be sure to sign up for our website! Once signed up, you can request mentorship, ask questions, share your work, and receive feedback on your project. If you want to make sure that your science research or project doesn't go unnoticed, be sure to share it on our website for all to see.
