Sometimes you are actually more interested in predicting the higher values than the lower values (which seem to have a higher density). The trick is explained here.

A machine learning algorithm may benefit from data that is "standardized". If one column in a dataframe has a completely different scale than the rest, that column may dominate the others in the model, which can lead to a poor fit or an overfit. To mitigate this you may consider scaling your dataset.
Scaling algorithms (StandardScaler, MinMaxScaler, QuantileTransformer) are explained here.
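To make the difference between the three scalers concrete, here is a minimal sketch using scikit-learn. The data is made up for illustration (one roughly normal column and one heavy-tailed column on a much larger scale), and the `n_quantiles=100` setting is an assumption chosen to fit the small sample.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

# Made-up data: two columns on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=200),    # small scale, symmetric
    rng.exponential(scale=10_000.0, size=200),   # large scale, skewed
])

# StandardScaler: subtract the mean and divide by the standard deviation.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: squeeze every column into the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# QuantileTransformer: map each value to its (approximate) quantile,
# which also flattens the skew and dampens outliers.
X_quant = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform"
).fit_transform(X)

print(X.std(axis=0))      # very different spreads before scaling
print(X_std.std(axis=0))  # comparable spreads after scaling
```

StandardScaler and MinMaxScaler keep the shape of each column's distribution, so outliers still stand out. QuantileTransformer changes the shape, which may help when a column is heavily skewed.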


When you're doing a regression you're sometimes not so much interested in predicting the single most "likely" value; sometimes you're more interested in predicting a spectrum of likely values. Put differently: you may be interested in predicting several quantiles of a distribution, instead of only a central value such as the mean or median.
The trick is explained here.
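One way to do this in scikit-learn (not necessarily the approach the linked trick uses) is to fit one `GradientBoostingRegressor` per quantile with `loss="quantile"`. The sketch below assumes synthetic data where the noise grows with `x`, so the predicted band should widen for larger inputs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (an assumption for illustration): noise grows with x.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = X[:, 0] + rng.normal(scale=1.0 + 0.3 * X[:, 0])

# Fit one model per quantile; `alpha` is the quantile to predict.
quantiles = [0.1, 0.5, 0.9]
models = {
    q: GradientBoostingRegressor(
        loss="quantile", alpha=q, n_estimators=100, random_state=0
    ).fit(X, y)
    for q in quantiles
}

# Predict a low, middle and high estimate for two new points.
X_new = np.array([[2.0], [8.0]])
preds = {q: model.predict(X_new) for q, model in models.items()}

for q in quantiles:
    print(f"q={q}: {preds[q]}")
```

The 0.1 and 0.9 predictions together form a rough 80% interval around the median prediction. Because each quantile is a separate model, nothing forces the predictions to stay ordered, so it is worth checking that they do not cross on your own data.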

There are many ways to do feature engineering for your machine learning pipeline and there are moments when you might favour one technique over another. In particular, when dealing with a category that behaves like text ... why not just model it as text?
You can get many insights out of dirty categorical features!

There might be one small problem: depending on the algorithm you want to use, you have to convert the sparse array into a dense array. This is shown here. So what if we use PCA to turn those sparse arrays into dense embeddings?
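Here is a minimal sketch of the idea: encode a dirty categorical column as character n-grams, then compress the sparse counts into a small dense embedding. The category values are hypothetical. Note that the sketch swaps in `TruncatedSVD` instead of PCA, because it accepts sparse input directly; whether `PCA` accepts sparse input depends on your scikit-learn version.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dirty category values: same concepts, inconsistent spelling.
categories = [
    "Senior Police Officer", "senior police officer", "Police Officer II",
    "Fire Fighter", "firefighter", "Fire Fighter/Paramedic",
    "School Teacher", "school teacher (substitute)",
]

# Treat each category as text: count character n-grams of length 1 to 3.
# CountVectorizer lowercases by default, so casing differences disappear.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X_sparse = vectorizer.fit_transform(categories)  # scipy sparse matrix

# Compress the sparse counts into a dense, low-dimensional embedding.
# n_components=3 is an arbitrary choice for this tiny example.
svd = TruncatedSVD(n_components=3, random_state=0)
X_dense = svd.fit_transform(X_sparse)  # dense numpy array

print(X_sparse.shape, "->", X_dense.shape)
```

Similar spellings share most of their n-grams, so they should end up close together in the embedding, and the dense output can be fed to estimators that do not accept sparse input.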