Proposal
What decision-making context will you support? What are some decisions in that context you might support?
Our decision-making context is getting a dog. Decisions in this context include what to name your dog, what to feed it, how often and where to walk it, whether to hire a dog-sitter or a dog-walker, which dog parks to visit based on popularity or crowdedness, as well as many other decisions regarding the dog’s life and how it will be treated.
What should you name your dog if you want your dog’s name to be timeless?
Dog owners
Many dog owners treat their pets like their own children, so the name chosen for the dog holds a lot of value. Some dog names come into style, and others become old and outdated. Dogs are usually not with you for your whole life, but they are still an important and integral part of it. To some owners it is very important that the name they choose will always be a cool, memorable name. The chosen name is also what the dog will learn to respond to, and it is what the owner and other people will call the dog for the rest of its life, so a lot of thought may go into this decision.
What data will you work with? Please include background on who collected the data, where you accessed it, and any additional information we should know about how this data came to be.
Our two main datasets cover dogs in Seattle and New York City.
- https://data.seattle.gov/Community/Seattle-Pet-Licenses/jguv-t9rb This data is provided by the City of Seattle Department of Finance and Administrative Services. The dataset contains active/current Seattle pet licenses, including animal type (species), pet's name, breed and the owner's ZIP code. It is a public government dataset that can be exported from their website. The list of pet licenses was created on January 24, 2017 and is current as of January 11th, 2017. The data only goes back to 2005; we might need to make a public records request for previous years’ data.
- https://github.com/Kaz-A/dog_names/ This data was collected by the NYC Department of Health and is based on the 2015 results. It is provided by a GitHub user because, unfortunately, NYC does not release the raw data to the public. The dataset lists each dog’s name, breed, gender, age (as of 2015), and owner’s borough. By law, every dog 4 months or older in New York City must be licensed.
Who is affected by those decisions? Depending on the domain of your data, there may be a variety of audiences interested in using your analysis. You should hone in on one of these audiences.
Those affected include dog owners, veterinarians, friends of dog owners, dog sitters, and anyone else interacting with dogs. We are mainly focusing on new dog owners or caregivers who will be naming dogs. Other audiences have no dog to name and so would have less use for the tool, although they may use it for fun and to learn more about dogs. Animal shelters may also benefit from this data, as it could help them choose a popular name that makes a dog more adoptable.
How will your project support decisions? List out at least one decision your project will support for your audience.
Our project will support the decision of naming your dog something timeless by analyzing and elucidating what makes names timeless, and by offering scientifically grounded timeless dog names based on dog name popularity over time. These suggestions will give dog owners ideas for what they can name their dog, and one of them may become the chosen name.
Sub-goals:
Defining “timeless” and the qualities that make a dog name timeless
- How many years of data should we collect, and do the years we choose affect our results?
- Do dog name trends differ nationwide? Should our results and analysis be location based?
Identifying popular dog name trends
- What are the top dog names?
- What dog names have been popular the longest?
Understanding dog name trends across various dog categories
- What are the differences and similarities in dog names across dog breeds, genders, sizes, etc.?
- Do these dog categories influence the timelessness of a dog name?
What will be the format of your final product (Shiny app, HTML page or slideshow compiled with KnitR, etc.)?
The final product of our project will be a Shiny app, or a similar interactive web application, that showcases our findings and allows users to view our timeless name suggestions and potentially filter the results by certain dog qualities (such as breed, size, gender).
- For past dog name registration data, we might need to make a public records request in order to access the data.
- It may be challenging to collect consistent data when scraping various websites.
- Cleaning and integrating data from multiple datasources may also be challenging.
- If we build a Shiny Application, we will need to learn how to do that, as it will be the first time most of us have used it.
- We also might need to research new methods of analysis/modeling based on our specific project needs, such as the best or most appropriate way to define and calculate popularity/timelessness.
- We will also need to learn how to use and work with Beautiful Soup for scraping websites.
How will you conduct your analysis? Please include a detailed description of your intended modeling approach.
We will be analyzing dog name data, in order to identify and suggest timeless dog names. Time and dog name are two of the main inputs for understanding timelessness. By modeling dog name frequencies across time, we can figure out the most popular dog names for certain years or time periods, and the names that have remained popular for the longest time.
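As a rough illustration of modeling name frequencies across time, the sketch below counts names per year, converts the counts to yearly shares, and tallies how many years each name ranks among the top names. It assumes each record has already been reduced to a `(year, name)` pair; the function names and the sample records are hypothetical, and "years in the top N" is only one possible stand-in for timelessness, not a settled definition.

```python
from collections import Counter, defaultdict


def counts_by_year(records):
    """Group (year, name) pairs into one Counter of names per year."""
    counts = defaultdict(Counter)
    for year, name in records:
        counts[year][name] += 1
    return counts


def yearly_shares(records):
    """Return {year: {name: share of that year's records with that name}}."""
    shares = {}
    for year, counter in counts_by_year(records).items():
        total = sum(counter.values())
        shares[year] = {name: n / total for name, n in counter.items()}
    return shares


def years_in_top(records, top_n=10):
    """Count how many years each name ranks among the top_n names."""
    tally = Counter()
    for counter in counts_by_year(records).values():
        for name, _ in counter.most_common(top_n):
            tally[name] += 1
    return tally


# Made-up records for illustration only; not taken from either dataset.
sample = [
    (2014, "Bella"), (2014, "Bella"), (2014, "Max"),
    (2015, "Bella"), (2015, "Luna"), (2015, "Luna"),
    (2016, "Luna"), (2016, "Bella"), (2016, "Bella"),
]

persistence = years_in_top(sample, top_n=1)
print(persistence.most_common())
```

Using shares rather than raw counts keeps years with different numbers of licensed dogs comparable. In the made-up sample, Bella is the top name in two of the three years and Luna in one.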
Our analysis will also include other features of a dog, such as its breed, size and gender. We will model the popular name trends for each of these factors, as well as the commonalities and differences across them. For example, we may find that certain names are more popular for large dogs than for small dogs, or that certain names are more popular for some breeds than for others. Overall, our analysis aims to identify timeless dog names and the qualities that affect timelessness. Our end results should be timeless dog name suggestions, both overall and for specific categories of dogs. Our model could also potentially allow us to predict which modern names may become timeless in the future.
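The category comparison described above could start from something like the following sketch, which groups records by one category (for example breed or gender) and lists the most common names in each group. The dictionary keys (`name`, `gender`, `breed`) are assumed field names in a cleaned dataset, not the actual column names of either source, and the rows are invented.

```python
from collections import Counter, defaultdict


def top_names_by_category(rows, category, top_n=3):
    """Return {category value: list of the top_n most common names}."""
    groups = defaultdict(Counter)
    for row in rows:
        value = row.get(category)
        name = row.get("name")
        # Skip rows that lack either the category value or the name.
        if value and name:
            groups[value][name] += 1
    return {
        value: [name for name, _ in counter.most_common(top_n)]
        for value, counter in groups.items()
    }


# Invented rows for illustration only.
rows = [
    {"name": "Bella", "gender": "F", "breed": "Labrador"},
    {"name": "Bella", "gender": "F", "breed": "Poodle"},
    {"name": "Max", "gender": "M", "breed": "Labrador"},
    {"name": "Max", "gender": "M", "breed": "Labrador"},
    {"name": "Luna", "gender": "F", "breed": "Labrador"},
]

print(top_names_by_category(rows, "gender", top_n=1))
print(top_names_by_category(rows, "breed", top_n=1))
```

Comparing the resulting lists between groups is one simple way to spot names that are popular in one category but not in another.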
Getting enough year-specific data may be difficult, because it will be hard to attach a date to much of our data. Most of the websites show only the ‘current’ standings and are vague about when the data was collected or published. Different datasources also capture time in different ways: some indicate the year the pet license was created, while others rely on a dog's age or estimated birthday. Even if we have time data, it may still be challenging to study it accurately. Furthermore, timelessness itself will be difficult to define. "Timeless" is a subjective concept that involves more factors than just how frequently a name appears. A challenge for our analysis will be to understand and identify which factors influence timelessness and then integrate those into our models.
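To make the mismatch between time fields concrete, here is a minimal sketch of two helpers: one reads a year from a license date, the other approximates a birth year from an age recorded as of a reference year (2015 for the NYC data). The ISO date format is an assumption about how dates might look after export, and both function names are hypothetical.

```python
from datetime import datetime


def year_from_license_date(text, fmt="%Y-%m-%d"):
    """Return the year of a license date string, or None if it cannot be parsed.

    The default format is an assumption; adjust it to match the exported data.
    """
    try:
        return datetime.strptime(text.strip(), fmt).year
    except (ValueError, AttributeError):
        return None


def birth_year_from_age(age_years, reference_year=2015):
    """Approximate a birth year from an age recorded as of reference_year."""
    try:
        age = int(age_years)
    except (TypeError, ValueError):
        return None
    if age < 0:
        return None
    return reference_year - age


print(year_from_license_date("2017-01-24"))
print(birth_year_from_age("3"))
```

Note that the two helpers return different kinds of year: a license year says when the dog was registered, while a birth year says roughly when it was named. An analysis that mixes the two would need to state that choice explicitly.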
- Data Collection - Identify and gather data from data sources (completed by Nov 12)
- Export government/public datasets (submit access request if needed)
- Scrape data from websites
- Data Cleaning – Join and consolidate data so that it is ready for analysis (completed by Nov 22)
- Make sure data sources have all the necessary information we will need to study
- Decide on standard format/structure for the dataset (schema and instances)
- Clean data and join data sources into the standardized format/structure
- Data Analysis – Understand and identify timeless dog names (completed by Nov 30)
- Analyze dog name popularity trends overall
- Analyze dog name popularity trends across different categories
- Define timelessness
- Build models
- Build an interface to showcase our findings (completed by Dec 11)
- Use Shiny App to build an interactive website
Akush has experience with web scraping and working with Shiny App, and so can help lead the data collection process and the building of the final interface. Trevor, Aridna, Kathryn and Jillian will focus on the data cleaning and data analysis using R.
- Not having access to necessary data or not having enough necessary data
We will need expansive data in order to understand popularity trends. If we are unable to access this data, or are denied access to it, from our chosen government datasets, we can mitigate the problem by collecting data from other datasources that don't need access approval, such as scraping websites or using public datasets. During the datasource selection process we will need to make sure that the datasources include all of the data fields that we need, and that there is enough data within these fields. If we do identify missing information within our datasets, then we will need to find and use additional datasets that can fill in these gaps. Ultimately, mitigating this risk may mean adjusting the scope of our project/decision context so that it is feasible with the data that we have access to.
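One simple way to check whether a candidate datasource has "enough data within these fields" is to measure what fraction of its rows have a non-empty value for each required field. The sketch below does that for rows represented as dictionaries; the field names and rows are invented for illustration.

```python
def field_coverage(rows, required_fields):
    """Return {field: fraction of rows with a non-empty value for that field}."""
    total = len(rows)
    coverage = {}
    for field in required_fields:
        # Treat missing keys, None, and whitespace-only strings as empty.
        filled = sum(1 for row in rows if str(row.get(field) or "").strip())
        coverage[field] = filled / total if total else 0.0
    return coverage


# Invented rows for illustration only.
rows = [
    {"name": "Bella", "breed": "Poodle"},
    {"name": "", "breed": "Labrador"},
    {"name": "Max"},
    {"name": "  ", "breed": None},
]

print(field_coverage(rows, ["name", "breed"]))
```

A datasource with low coverage for a field we depend on, such as the year, would be a candidate for replacement or for supplementing with another source.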
Other potential datasources with aggregated dog name information:
- http://dogtime.com/dog-names (Top dog names, based on data from VPI pet insurance, the largest pet insurance company)
- http://www.mans-best-friend.org.uk/puppy-dog-names-by-breed.htm (Dog Names and Breeds Site Index)
- https://www.rover.com/blog/dog-names/ (Good link for city/location specific analysis)
- http://www.dogbreedplus.com/dog_names/dog_names_by_breed.htm (Good for breed analysis)
- Data is unclean
To mitigate this, we will need to make sure the datasets that we use have consistent formatting and that unclean data (missing values, errors, etc.) is minimized. We will need to settle on a standardized schema and instances for our dataset, so that our data can be consolidated in an organized way and then accurately analyzed. However, if we take all the necessary steps to clean the data and there are still issues, then we will have to be mindful of how that affects our analysis and models and accommodate accordingly.
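A standardized schema and a basic cleaning step might look like the sketch below. The `DogRecord` fields, the placeholder values treated as missing, and the source column names in the usage example are all assumptions made for illustration; the real columns would come from inspecting each dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DogRecord:
    """Hypothetical common schema shared by all datasources."""
    name: str
    breed: Optional[str]
    gender: Optional[str]
    year: Optional[int]
    source: str


# Assumed placeholders for missing values; extend after inspecting the data.
MISSING = {"", "n/a", "na", "none", "unknown"}


def clean_text(value):
    """Collapse whitespace, map placeholders to None, and normalize casing."""
    if value is None:
        return None
    text = " ".join(str(value).split())
    if text.lower() in MISSING:
        return None
    return text.title()


def to_record(row, source, field_map):
    """Convert one raw row to a DogRecord, or None if it has no usable name.

    field_map maps standard field names to that source's column names.
    """
    name = clean_text(row.get(field_map.get("name", "")))
    if name is None:
        return None
    try:
        year = int(row.get(field_map.get("year", "")))
    except (TypeError, ValueError):
        year = None
    return DogRecord(
        name=name,
        breed=clean_text(row.get(field_map.get("breed", ""))),
        gender=clean_text(row.get(field_map.get("gender", ""))),
        year=year,
        source=source,
    )


# Column names below are invented, not the real export headers.
raw = {"Pet Name": "  bella ", "Breed": "POODLE", "Year": "2016"}
mapping = {"name": "Pet Name", "breed": "Breed", "year": "Year"}
print(to_record(raw, "seattle", mapping))
```

Keeping one `field_map` per datasource means the cleaning logic stays the same while only the column mapping changes. Rows without a usable name are dropped because the name is the one field every analysis here depends on.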