Model Fitting is running an algorithm to learn the relationship between predictors and outcome so that you can predict the future values of the outcome.
It proceeds in three steps:
First, you need a function that takes in a set of parameters and returns a predicted data set.
Second you need an ‘error function’ that provides a number representing the difference between your data and the model’s prediction for any given set of model parameters.
Third, you need to find the parameters that minimize this difference. Once you set things up properly, this third step is easy.
Markov Model in probability theory is a stochastic model used to model randomly changing systems where it is assumed that future states depend only on the current state not on the events that occurred before it (defined as the Markov property). Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modeling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property. There are four most common Markov models used in different situations, depending on whether every sequential state is observable or not, and whether the system is to be adjusted on the basis of observations made.
These are Markov chain (the simplest model), Hidden Markov Model (Markov chain with only part of states observable), Markov decision process (chain with applied action vector) and Hidden Markov decision process. There is also a Markov random field, or Markov network may be considered to be a generalization of a Markov chain in multiple dimensions, and Hierarchical Markov Models which can be applied to categorize human behavior at various levels of abstraction.
Manhattan Distance is the distance between two points measured along axes at right angles. The name hints to the grid layout of the streets of Manhattan, which causes the shortest path a car could take between two points in the city. The limitation of the Manhattan Distance heuristic is that it considers each tile independently, while in fact, tiles interfere with each other.
MAE – Mean Absolute Error in statistics is a quantity used to measure how close forecasts or predictions are to the eventual outcomes.The mean absolute error is an average of the absolute error where is the prediction and the true value. Note that alternative formulations may include relative frequencies as weight factors. The mean absolute error used the same scale as the data being measured. This is known as a scale-dependent accuracy measure and therefore cannot be used to make comparisons between series using different scales. The mean absolute error is a common measure of forecast error in time series analysis, where the terms “mean absolute deviation” is sometimes used in confusion with the more standard definition of mean absolute deviation. The same confusion exists more generally.
Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another. On a basic level, MT performs simple substitution of words in one language for words in another, but that alone usually cannot produce a good translation of a text because recognition of whole phrases and their closest counterparts in the target language is needed. Solving this problem with corpus statistical, and neural techniques is a rapidly growing field that is leading to better translations, handling differences in linguistic typology, translation of idioms, and the isolation of anomalies. Current machine translation software often allows for customization by domain or profession, improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal language is used. It follows that machine translation of government and legal documents more readily produces usable output than conversation or less standardized text. Improved output quality can also be achieved by human intervention. With the assistance of these techniques, MT has proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is.
Loss Function in mathematical optimization, statistics, decision theory and machine learning is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative (sometimes called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over. In optimal control, the loss is the penalty for failing to achieve the desired value. In financial risk management, the function is precisely mapped to a monetary loss.
LOOCV or Leave-One-Out Cross Validation. LOOCV uses one observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. This is the same as a K-fold cross-validation with K being equal to the number of observations in the original sample. Leave-one-out cross-validation is usually very expensive from a computational point of view because of a large number of times the training process is repeated.
Long-Tailed Distribution in statistics and business is the portion of the distribution having a large number of occurrences far from the “head” or central part of the distribution. The term is often used loosely, with no definition or arbitrary definition, but precise definitions are possible. Broadly speaking, for such population distributions, the majority of occurrences (more than half, and where the Pareto principle applies, 80%) are accounted for by the first 20% of items in the distribution. What is unusual about a long-tailed distribution is that the most frequently occurring 20% of items represent less than 50% of occurrences; or in other words, the least frequently occurring 80% of items are more important as a proportion of the total population.The long tail concept has found some ground for application, research, and experimentation. It is a term used in online business, mass media, micro-finance, user-driven innovation, and social network mechanisms (e.g. crowdsourcing, crowd casting, peer-to-peer), economic models, and marketing.
Long Short-Term Memory usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is their default behavior. All recurrent neural networks have the form of a chain of repeating modules of a neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer. LSTMs also have this chain like structure, but the repeating module has a different structure. The key to LSTMs is the cell state which is acting like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. The first step in LSTM is to decide what information it is going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” The next step is to decide what new information we’re going to store in the cell state. The last step is a decision on output.
Log-Normal Distribution in probability theory is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable is log-normally distributed, then has a normal distribution. Likewise, if Y has a normal distribution, then X=exp(y) has a log-normal distribution. A random variable which is log-normally distributed takes only positive real values. The distribution is occasionally referred to as the Galton distribution or Galton’s distribution, after Francis Galton.A log-normal process is the statistical realization of the multiplicative product of many independent random variables, each of which is positive. This is justified by considering the central limit theorem in the log domain. The log-normal distribution is the maximum entropy probability distribution for a random variate for which the mean and variance of are specified.