Hello! Welcome to Lingomath! My name is Syed Razvi, and I am an aspiring Natural Language Processing data scientist. I have been a lifelong enthusiast of linguistics and data analysis, and in light of this year, I wanted to formally combine my passions and delve into this amazing industry by building experience with a range of tools across a variety of projects. I am chronicling the process so that I can look back and see how far I’ve come, and to demonstrate my abilities! Here is my Linktree, which links to my main socials and GitHub page. I’m excited to see where this takes me, and I’d love to hear from you if you’re on a similar path!
My first step on this journey was joining Metis, an intensive data science program that helped me gain laser focus on the core areas of data science as well as the latest developments in the field. Our cohort’s very first project was an analysis of the publicly available MTA turnstile data. The main goal of this project was to gain familiarity with the most popular data science libraries, like pandas, NumPy, scikit-learn, and Matplotlib, in order to accomplish the first pass of any project: Exploratory Data Analysis, or EDA.
The first hurdle of this project was learning how to obtain data and format it into a data frame, which is essentially the building block of statistical analysis. I have spent a lot of time working with arrays and data tables in Excel, which was my first real foray into data analysis, but what stands out about data frames is how flexibly and intuitively you can manipulate and filter your data.
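To give a feel for what that looks like, here is a minimal sketch of loading tabular data into a pandas data frame and filtering it. The rows are made up for illustration, and the column names are my assumption about the general shape of a turnstile export, so treat this as a toy example rather than the actual project code.

```python
import io

import pandas as pd

# Made-up rows for illustration only; these are not real MTA records, and the
# column names are an assumption about the general shape of a turnstile export.
SAMPLE_CSV = """STATION,UNIT,SCP,DATE,TIME,ENTRIES
STATION A,R001,02-00-00,01/04/2021,03:00:00,1000
STATION A,R001,02-00-00,01/04/2021,07:00:00,1012
STATION B,R002,00-00-01,01/04/2021,03:00:00,500
STATION B,R002,00-00-01,01/04/2021,07:00:00,560
"""

# read_csv accepts a path, a URL, or a file-like object; StringIO stands in for a file here.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Filter rows with a boolean mask, similar in spirit to an Excel filter.
busy = df[df["ENTRIES"] > 900]

print(df.shape)
print(busy["STATION"].unique())
```

The boolean mask in the middle is the kind of one-line filtering that would take several clicks in a spreadsheet.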
We were given an initial prompt: our analysis should ultimately recommend the best locations at which to gather signatures for, and simultaneously advertise, a fundraising event. Keeping that in mind, I chose to identify the stations that had the most traffic during the past 6 months. Looking through the data, I found that each turnstile reports a cumulative count, so by taking the difference between consecutive readings I could find the actual count for each period.
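The cumulative-to-period conversion can be sketched roughly as follows. The readings are made up, and `TURNSTILE`, `DATETIME`, `ENTRIES`, and `PERIOD_ENTRIES` are hypothetical column names chosen for this example. The key idea is to take differences within each turnstile, so that one counter is never subtracted from another.

```python
import pandas as pd

# Made-up cumulative readings for two hypothetical turnstiles.
readings = pd.DataFrame({
    "TURNSTILE": ["A", "A", "A", "B", "B", "B"],
    "DATETIME": pd.to_datetime([
        "2021-01-04 03:00", "2021-01-04 07:00", "2021-01-04 11:00",
        "2021-01-04 03:00", "2021-01-04 07:00", "2021-01-04 11:00",
    ]),
    "ENTRIES": [1000, 1012, 1050, 500, 560, 575],
})

# Sort so that consecutive rows are consecutive readings of the same turnstile.
readings = readings.sort_values(["TURNSTILE", "DATETIME"])

# diff() within each group turns the running total into a per-period count.
readings["PERIOD_ENTRIES"] = readings.groupby("TURNSTILE")["ENTRIES"].diff()

# The first reading of each turnstile has no predecessor, so its diff is NaN.
period_counts = readings.dropna(subset=["PERIOD_ENTRIES"])

print(period_counts[["TURNSTILE", "DATETIME", "PERIOD_ENTRIES"]])
```

Grouping before differencing matters: a plain `diff()` over the whole column would subtract the last reading of turnstile A from the first reading of turnstile B and produce a meaningless value.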
This naturally led to another discovery about the data collection intervals: they were not the same length across turnstiles. This snag was more difficult to work around, but it gave me the opportunity to learn more about time formatting, and I was able to restrict the data to 4-hour intervals, which represented station traffic much better. From there, I condensed the turnstile totals into a station-level measure, and I was able to obtain the counts! Looking back, something like this was very daunting to get started on, but as always seems to be the case, once you take that first leap, the rest comes through grit and good old trial and error. This covers part 1 of most projects: data collection and cleaning. Tomorrow, I want to delve deeper into part 2, which is the visualization and recommendation narrative!
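As a postscript on the cleaning steps above, here is one possible way to keep only 4-hour intervals and roll turnstiles up to stations. It is a sketch under assumptions, not necessarily the exact approach from the project: the data is made up, the column names are hypothetical, and "restricting to 4-hour intervals" is interpreted here as keeping only readings that arrive exactly 4 hours after the previous reading from the same turnstile.

```python
import pandas as pd

# Made-up readings: turnstile A2 has one irregular reading at 08:30.
rows = [
    ("STATION A", "A1", "2021-01-04 03:00", 1000),
    ("STATION A", "A1", "2021-01-04 07:00", 1012),
    ("STATION A", "A1", "2021-01-04 11:00", 1050),
    ("STATION A", "A2", "2021-01-04 03:00", 200),
    ("STATION A", "A2", "2021-01-04 07:00", 230),
    ("STATION A", "A2", "2021-01-04 08:30", 240),
    ("STATION B", "B1", "2021-01-04 03:00", 500),
    ("STATION B", "B1", "2021-01-04 07:00", 560),
    ("STATION B", "B1", "2021-01-04 11:00", 575),
]
df = pd.DataFrame(rows, columns=["STATION", "TURNSTILE", "DATETIME", "ENTRIES"])
df["DATETIME"] = pd.to_datetime(df["DATETIME"])
df = df.sort_values(["TURNSTILE", "DATETIME"])

# Time elapsed and entries counted since the previous reading of the same turnstile.
df["GAP"] = df.groupby("TURNSTILE")["DATETIME"].diff()
df["PERIOD_ENTRIES"] = df.groupby("TURNSTILE")["ENTRIES"].diff()

# Keep only rows that close out an exact 4-hour interval (assumed interpretation).
# First readings have a missing gap, so they fail the comparison and drop out too.
regular = df[df["GAP"] == pd.Timedelta(hours=4)]

# Condense turnstile-level counts into one total per station, busiest first.
station_totals = (
    regular.groupby("STATION")["PERIOD_ENTRIES"].sum().sort_values(ascending=False)
)

print(station_totals)
```

Dropping irregular rows outright is the simplest choice, but it does discard some traffic; resampling onto a fixed grid would be an alternative if those counts mattered.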