The correct data transformation process

<aside> 💡 Bottom line: Never transform your data before splitting it.

</aside>

In more detail:

Split the dataset before you do anything else. Set aside the data you'll use to test the model.
Explore and analyze your data using only the training split. Leave the test data alone. Don't look at it.
When it's time to preprocess the data, transform each split separately. Fit your transformation on the training data only, then reuse the statistics learned there (for example, the mean and standard deviation) to transform the rest of the dataset.
Evaluate your model on the test data as few times as possible. Ideally, use your test data only once.
If you do reuse your test data, don't use the model's test results as feedback to improve the model. The more you stick to this, the longer you'll delay the inevitable overfitting to the test set.
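
As a minimal sketch of the steps above, the following uses only the Python standard library. The helper names (`train_test_split`, `fit_scaler`, `transform`), the toy dataset, and the 80/20 split are assumptions made for illustration; a library scaler with separate fit and transform steps follows the same pattern.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return mean, std


def transform(values, mean, std):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical toy dataset: a single numeric feature.
data = [float(x) for x in range(1, 101)]

# 1. Split first, before any other processing.
train, test = train_test_split(data)

# 2. Fit the transformation on the training split only.
mean, std = fit_scaler(train)

# 3. Apply the same training statistics to every split.
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)
```

Note that `fit_scaler` never sees `test`, so nothing about the test data can influence the statistics used for preprocessing.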

Wrong and correct example in the picture below:

[Image: wrong and correct examples of the data transformation process]
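
A small numeric sketch of the wrong and the correct order, using made-up values chosen so the effect is easy to see. The variable names and the `center` helper are illustrative assumptions, not part of the original example.

```python
import statistics

# Hypothetical splits: the extreme value appears only in the test data.
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]

# Wrong: the mean is computed before splitting, so the test value leaks in.
leaky_mean = statistics.fmean(train + test)

# Correct: the mean is computed from the training split alone.
clean_mean = statistics.fmean(train)


def center(values, mean):
    """Subtract a previously computed mean from every value."""
    return [v - mean for v in values]


train_wrong = center(train, leaky_mean)
train_right = center(train, clean_mean)
test_right = center(test, clean_mean)
```

In the wrong version, the training features themselves are shifted by a value the model should never have seen, which is the leak the process above is designed to prevent.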