Today’s post is something of a precursor to the thought process behind the analysis. Recall that the goal is to establish a set of recommendations to help a not-for-profit advertise its fundraising event while simultaneously recruiting potential donors. Keeping the needs of the client in mind is key for a) acquiring and utilizing the relevant information to address them, and more importantly b) making sure that whatever the results end up being, they are presented in a way the client will be receptive to. In this scenario, the not-for-profit’s mission is to further the careers of women in STEM who come from groups traditionally underrepresented in the industry. If I were to take a purely results-based approach of chasing the most donations possible, I don’t think that would translate well into long-term sustainability. So the next step is to consider who the ideal person my client is looking for would be: from my understanding, someone who would be enthusiastic about supporting diverse women in STEM and willing to help financially.
Having the perfect donor in mind is certainly the path we want to go down, but actually locating and persuading them shows that the way there is fraught with thorns and unexpected issues. There is also the component of ethics in data science, something that was especially highlighted throughout my time at Metis. A multitude of tools can and will be used to develop “customer profiles” with far-reaching impacts on everyday folks, so it is our duty to push back in a way that keeps people’s privacy and dignity intact and does not reduce them to mere potential consumers of a product. Finding the right balance is crucial for anyone working in spaces that collect data on their customers. I do my absolute best to live up to that promise, and I hope to continue doing so in the future! Short post today, but I wanted to highlight some of the thoughts I had going into this first project. Tomorrow, I will conclude with a brief on the additional datasets I used and the visualizations that went along with them!
Hello! Welcome to Lingomath! My name is Syed Razvi and I am an aspiring Natural Language Processing Data Scientist. A little about me: I have been a lifelong enthusiast of linguistics and data analysis, and in light of this year, I wanted to formally combine my passions and delve into this amazing industry by building experience with a myriad of tools across a variety of projects. I am chronicling the process so that I can look back and see how far I’ve come, and to demonstrate my abilities! Here is my linktree, which encompasses my main socials and github page. I’m excited to see where this takes me, and I’d love to hear from you if you’re on a similar path!
My first step on this journey began when I decided to join Metis. It is an intensive data science program that helped me gain laser focus on the core areas of data science as well as the latest developments in the field. In fact, our cohort’s very first project was an analysis of the publicly available MTA Turnstile data. The main goal of this project was to gain familiarity with the most popular data science libraries, like Pandas, NumPy, Scikit-learn, and Matplotlib, in order to accomplish the first pass of any project: Exploratory Data Analysis, or EDA.
The first hurdle of this project was learning how to obtain the data and format it into a data frame, which is essentially the building block of statistical analysis. I had spent a lot of time working with arrays and data tables in Excel, which was my first real foray into data analysis, but what stands out about data frames is the incredibly intuitive flexibility with which you can manipulate and filter your data.
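To make that concrete, here is a minimal sketch of that filtering style in Pandas. The column names and values below are stand-ins I made up to resemble the turnstile file, not the actual data:

```python
import pandas as pd

# A tiny synthetic frame standing in for the MTA turnstile data
# (the real file has fields along the lines of STATION, DATE, ENTRIES).
df = pd.DataFrame({
    "STATION": ["34 ST-PENN STA", "34 ST-PENN STA", "TIMES SQ-42 ST"],
    "DATE": ["03/01/2021", "03/01/2021", "03/01/2021"],
    "ENTRIES": [7465231, 7465890, 5120044],
})

# Filtering reads almost like the question you want to ask:
# "show me only the rows for Penn Station."
penn = df[df["STATION"] == "34 ST-PENN STA"]
print(penn)
```

Compared with Excel, where a filter is something you click through, here the boolean mask is itself data you can combine, negate, or reuse.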
We were given an initial prompt: the eventual result of our analysis should be recommendations for the best locations at which to gather signatures for, and simultaneously advertise, a fundraising event. Keeping that in mind, I chose to identify the stations that had the most traffic over the past six months. Looking through the data, I found that each turnstile reports a cumulative count, so by taking the difference between consecutive readings I could recover the actual count for each period.
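The cumulative-to-per-period step can be sketched in one line with `diff()`; the register values below are invented for illustration, not real turnstile readings:

```python
import pandas as pd

# Synthetic cumulative register readings for a single turnstile,
# one value per audit period.
entries = pd.Series([7465000, 7465231, 7465890, 7466102])

# diff() subtracts each reading from the next, turning running totals
# into per-period counts; the first row has no predecessor and comes
# back as NaN, so it is dropped.
counts = entries.diff().dropna()
print(counts.tolist())  # [231.0, 659.0, 212.0]
```

One wrinkle worth checking in the real data is counters that reset or run backwards, which show up as negative or implausibly large diffs and need to be filtered out.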
This naturally led to another discovery about the data-collection intervals: they were not equal in length across turnstiles. This snag was more difficult to work around, but it gave me the opportunity to learn more about time formatting, and I was able to restrict the data to four-hour intervals that much better represented the traffic at each station. From there, I condensed the turnstile totals into a station-level measure and obtained the counts! Looking back, something like this was very daunting to get started on, but as always seems to be the case, once you take that first leap, the rest comes through grit and good old trial and error. This represents part 1 of most projects: the data collection and cleaning components. Tomorrow, I want to delve deeper into part 2, the visualization and recommendation narrative!
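The binning-and-rollup idea can be sketched as follows. This is one way to do it, not necessarily the approach I took at the time, and the station names, turnstile IDs, and counts are all made up:

```python
import pandas as pd

# Synthetic per-turnstile counts with uneven timestamps; SCP plays the
# role of the turnstile identifier within a station.
df = pd.DataFrame({
    "STATION": ["A", "A", "A", "A", "B", "B"],
    "SCP": ["00-00-00", "00-00-00", "00-00-01", "00-00-01",
            "00-00-00", "00-00-00"],
    "DATETIME": pd.to_datetime([
        "2021-03-01 04:00", "2021-03-01 08:17",
        "2021-03-01 04:00", "2021-03-01 08:00",
        "2021-03-01 05:00", "2021-03-01 09:00",
    ]),
    "COUNT": [120, 95, 80, 60, 40, 55],
})

# Snap each reading down to a regular four-hour grid, so readings taken
# at slightly different times land in comparable buckets...
df["BIN"] = df["DATETIME"].dt.floor("4h")

# ...then collapse the turnstile level away by summing up to the station.
station_totals = (
    df.groupby(["STATION", "BIN"])["COUNT"].sum().reset_index()
)
print(station_totals)
```

With the totals at station level, ranking the busiest stations is just a sort on the summed counts.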