Code

Using the Pandas library

Using the Pandas library

Free Course: "Quick Start in Python"

Learn More

Source Data: Excel

We use a table from one of our webinars, which contains a thousand rows of training data. This data contains information about tourists and their travels, allowing for a deeper understanding of traveler behavior and preferences. The table includes parameters such as age, gender, country of residence, and travel details, making it a valuable resource for analyzing tourism trends and optimizing marketing strategies in the tourism industry.

  • monthly income (salary);
  • hometown (city);
  • age (age);
  • preferred type of vacation (vacation preference);
  • preferred mode of transport (transport preference);
  • number of family members (family members);
  • and, finally, the city that the tourist ultimately chose for the trip (target).

We will use the free Google Colab service, which allows you to write and run code for data processing directly in the browser, eliminating the need to install additional software. Check out our article to learn how to get started with Google Colab and effectively use its functionality for your projects.

Transforming Data: Translating Words into Numbers

The salary column is an important element for data analysis in the field of machine learning, as it contains exclusively numeric values. However, our table also contains categorical features, such as cities, types of vacation, transportation preferences, and the target city. These data may be less obvious for machine learning algorithms to process, but they play a key role in creating complete models. Proper processing of categorical features, such as encoding and normalization, allows for improved prediction quality and greater model performance.

To transform text features into numeric values, it is necessary to perform the encoding process. Each unique feature value will be represented as a separate column, which will expand the original table. This method, often called "one-hot encoding", helps improve the quality of data analysis and increases the efficiency of machine learning algorithms. Each new column will contain binary values, simplifying data interpretation and allowing for better use in models.

The "city" column contains 11 cities, meaning 11 new columns with their names will be added to the original table. For example, if a tourist is from Yaroslavl, the "city_Yaroslavl" column will contain the value 1, while the remaining ten columns, corresponding to other cities, will contain zeros. This will allow for more efficient processing and analysis of tourist data, providing a clear understanding of their origin.

The process of converting words into numerical values ​​is called categorical feature encoding. One popular method for this conversion is one-hot encoding, also known as one-hot encoding. This method creates binary features for each category, which allows for the efficient use of categorical data in machine learning algorithms. There are other, more complex encoding methods that can be applied depending on the specifics of the problem and the structure of the data.

We will do the same with the columns reflecting tourists' preferences for types of vacation and transportation. If a tourist prefers train, the new column transport_preference_Train will contain one, and the other columns will contain zeros. This approach clearly structures the data and facilitates the analysis of customer preferences.

To one-hot encode categorical data, the Pandas library uses the get_dummies() function. This function automatically converts categorical variables into a set of binary (zero and one) variables, which helps improve the quality of data analysis and the machine learning model. Using get_dummies() makes the data preparation process more efficient, as it simplifies working with categorical features that cannot be used directly in most algorithms. To use the function, simply pass it a DataFrame with categorical variables, after which it will return a new DataFrame in which each category will be represented by a separate column with binary values. This significantly simplifies further analysis and working with the data.

In this operation, we used the .get_dummies() function of the Pandas library to transform the data. We created a new variable, trips_df_2, into which we placed the data from the original trips_df variable. In this case, the values ​​​​in the columns ‘city’, ‘vacation_preference’, were converted into separate binary columns. This allows for improved data analysis and preparation for further processing, ensuring the possibility of using these categorical variables in modeling and other analytical tasks.

As a result, we have generated a table containing 24 columns. To get a list of just the names of these columns, we can use the .columns attribute.

Old columns such as income, age, and number of family members remain in the data, but many new numeric variables have been added instead of categorical features. This change allows for more detailed data analysis and the identification of new patterns. Using numerical features can significantly improve forecasting models and increase the accuracy of the results.