Data Science with Python

An overview of Data Science with Python from scratch

Besmire Thaqi
Girl and the duck

--

What is Data Science?

Data Science explained with a Venn Diagram [1]

Data Science is a field that includes data cleansing, preparation and analysis. It is a combination of statistics, mathematics, programming and problem-solving. Machine Learning and Artificial Intelligence are often incorporated when working in this area. Please refer to the Venn diagram above to get a better idea of what Data Science consists of!

What does a Data Scientist do?

Did you know? Harvard Business Review called “Data Scientist” the sexiest job of the 21st century!

As a Data Scientist, you work with structured and unstructured data. You determine whether the data sets are correct and accurate enough to use in further processing. Working with data sets means cleaning, validating and verifying them. Once the data is in good shape, you apply different algorithms or machine learning techniques to analyze it as well as you can.
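To make the cleaning and validating steps more concrete, here is a minimal sketch that uses Pandas (one of the libraries described later in this story). The records, the column names and the 0 to 120 age rule are all made up for illustration; real data sets will need their own rules.

```python
import pandas as pd

# Hypothetical raw records: one duplicate row, one missing age, one impossible age.
raw = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Cal", "Dea"],
        "age": [34, 29, 29, None, 210],
    }
)

# Cleaning: drop exact duplicate rows, then drop rows with a missing age.
cleaned = raw.drop_duplicates().dropna(subset=["age"])

# Validating: keep only ages inside a plausible range (0-120 is an assumed rule).
valid = cleaned[cleaned["age"].between(0, 120)].reset_index(drop=True)

print(valid)
```

Only the rows that survive all three checks remain in `valid`, so later analysis steps do not have to deal with duplicates, gaps or obviously wrong values.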

Programming Languages and Libraries that Data Scientists use

The two main programming languages used in this field are Python and R. Both have a lot of libraries for data analysis. For instance, in Python you have libraries like NumPy, Pandas, Matplotlib, Scikit-learn, SciPy, NLTK, etc. for scientific computing.

NumPy → provides multidimensional arrays and matrices. You can apply mathematical operations to a whole array at once, without writing explicit loops.
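As a small illustration of working without loops, the sketch below converts a whole table of temperatures in one expression. The temperature values are invented for the example.

```python
import numpy as np

# Hypothetical data: temperatures in Celsius, one row per city, one column per day.
celsius = np.array([
    [21.0, 23.5, 19.0, 22.0],
    [30.0, 31.5, 29.0, 28.5],
    [10.0, 12.0, 11.5, 9.0],
])

# A single expression converts every element; no explicit Python loop is needed.
fahrenheit = celsius * 9 / 5 + 32

# Reductions work along an axis: axis=1 gives one mean per row (per city).
mean_per_city = celsius.mean(axis=1)

print(fahrenheit.shape)
print(mean_per_city)
```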

Pandas → makes the process of data analysis easier. You can load your data set into data frames (labeled tables of rows and columns) and then filter, group and summarize it.
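Here is a short sketch of what creating and structuring a data frame can look like. The sales records and column names are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical sales records; the column names are illustrative only.
sales = pd.DataFrame({
    "city": ["Oslo", "Lyon", "Oslo", "Lyon"],
    "product": ["book", "book", "pen", "pen"],
    "units": [3, 5, 10, 4],
    "unit_price": [12.0, 12.0, 1.5, 1.5],
})

# Add a derived column computed from two existing columns.
sales["revenue"] = sales["units"] * sales["unit_price"]

# Group the rows by city and sum the revenue within each group.
revenue_by_city = sales.groupby("city")["revenue"].sum()

print(sales)
print(revenue_by_city)
```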

Matplotlib → is used for data visualization. You can create histograms, pie charts, line graphs and lots of other professional-grade figures.
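For example, a histogram takes only a few lines. This sketch draws one from randomly generated sample data (the "heights" are not real measurements) and saves it to a file; the file name and the `Agg` backend are choices made for this example so that it also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Generate 500 made-up height values around 170 cm (fixed seed for repeatability).
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=170, scale=10, size=500)

# Draw a histogram with 20 bins and label the axes.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(heights, bins=20)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Count")
ax.set_title("Distribution of sample heights")

# Save the figure as an image file in the current directory.
fig.savefig("heights_histogram.png")
```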

Scikit-learn → provides a lot of machine learning algorithms to use on your data sets. It is built on NumPy and SciPy.
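A typical Scikit-learn workflow is to split the data, fit a model and score it. The sketch below does this with the Iris data set that ships with the library; the choice of a k-nearest-neighbors classifier, the 25% test split and the random seed are arbitrary picks for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the built-in Iris data set: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing (split size and seed are arbitrary choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit a k-nearest-neighbors classifier on the training rows.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Accuracy on the held-out rows: the fraction of correct predictions.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most other estimators in the library follow the same `fit` and `score` pattern, so swapping in a different algorithm usually means changing only the line that creates `model`.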

SciPy → builds on NumPy's fast N-dimensional arrays. It has different modules for optimization, linear algebra, integration and other numerical tasks that come up in Data Science.
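The sketch below touches three of those modules on small toy problems. The function, the matrix and the integration limits are made up so that the correct answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Optimization: find the x that minimizes (x - 2)^2 + 1 (the minimum is at x = 2).
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2 + 1)

# Linear algebra: solve the system 3x + y = 9 and x + 2y = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

# Integration: numerically integrate x^2 from 0 to 3 (the exact value is 9).
area, abs_error = integrate.quad(lambda x: x ** 2, 0, 3)

print(result.x, solution, area)
```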

NLTK → is a platform for building Python programs that work with human language data. You can use it for computational linguistics tasks such as tokenizing, stemming and tagging text.
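As a small taste, this sketch splits a sentence into words, counts them and reduces them to stems. The sentence is invented, and the example deliberately sticks to parts of NLTK that should work without downloading extra corpora.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical example sentence.
text = "Data scientists clean data, analyze data and visualize data."

# Split the lowercased text into word and punctuation tokens.
tokens = wordpunct_tokenize(text.lower())

# Keep only alphabetic tokens, which drops the comma and the period.
words = [token for token in tokens if token.isalpha()]

# Count how often each word occurs.
freq = FreqDist(words)

# Reduce each word to its stem with the Porter stemming algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

print(freq.most_common(1))
print(stems)
```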

The libraries mentioned above are only some of the most widely used ones; Python has a lot more to offer in this field.

Data Science vs. Data Analytics vs. Big Data

These terms are often confused with each other. The picture below shows the differences between the fields in detail.

Differences between Data Science, Big Data and Data Analytics [2]

And if you were wondering, yes, it’s one of the highest-paid fields!

Start learning Python and become a Data Scientist

If you are new to this field and want to start a real journey, I strongly recommend learning the basics of Python first and then moving on to more advanced programming with the Data Science libraries developed in Python.

If you need more information about how to start with Python (installation, coding, IDEs, editors, etc.), please check my story “Introduction to Python”.

I also found DataCamp courses really helpful for getting to know Data Science. Check them out!

References:

[1]. Jake VanderPlas, Python Data Science Handbook, 2017, p. 11
[2]. Avantika Monnappa, “Data Science vs. Big Data vs. Data Analytics”, Simplilearn, 2018
