Basic categorization of RL

We can gain cool insights by knowing the main types of RL algorithms:


Policy gradient: Learn by directly adjusting the policy to maximize the reward. E.g. shooting a basketball: if you miss (low reward), you adjust how you shoot the ball.

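Value-based: Learn a value function that estimates how good each state or action is, then act by picking what it rates highest. E.g. we hold the core value that it is good to be optimistic, so we can still trust ourselves in difficult situations because this value function helps us know what to do.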

Actor-Critic: Learn with a teacher who tells us how well we are doing. E.g. when we grow up, our parents teach us to do some things (e.g. to treat others well) and not to do others (e.g. to bully others). We are the actors and our parents are the critics. When we are criticized, we adjust our policy (the critic itself is a learned value function).

Model-based: Learn a model of our world and use it to predict what our actions will lead to. E.g. we know that if we let go of a cup in the air, it will fall to the ground because of gravity on Earth.

It is not hard to see that we actually use all of these methods to learn.
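To make the first two categories a bit more concrete, here is a minimal sketch in plain Python. It is my own toy and not from the course: the two-armed bandit, its win probabilities, and every name in it (WIN_PROB, pull, policy_gradient, value_based) are assumptions made up for illustration. The policy-gradient learner adjusts action preferences directly, while the value-based learner estimates how good each arm is and mostly picks the best one.

```python
import math
import random

# Hypothetical two-armed bandit: arm 1 pays off more often than arm 0.
WIN_PROB = [0.2, 0.8]


def pull(arm, rng):
    """Return reward 1.0 with the arm's win probability, else 0.0."""
    return 1.0 if rng.random() < WIN_PROB[arm] else 0.0


def softmax(prefs):
    """Turn action preferences into action probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(steps=2000, lr=0.1, seed=0):
    """Adjust the policy directly: raise the preference of actions that
    earned more than the average reward, lower it otherwise."""
    rng = random.Random(seed)
    prefs = [0.0, 0.0]
    baseline = 0.0  # running average of the rewards seen so far
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        arm = 0 if rng.random() < probs[0] else 1
        reward = pull(arm, rng)
        baseline += (reward - baseline) / t
        for a in range(2):
            # Gradient of log pi(arm) with respect to preference a.
            grad = (1.0 if a == arm else 0.0) - probs[a]
            prefs[a] += lr * (reward - baseline) * grad
    return softmax(prefs)


def value_based(steps=2000, epsilon=0.1, seed=0):
    """Estimate the value of each arm, act greedily on the estimates,
    and explore a random arm with probability epsilon."""
    rng = random.Random(seed)
    values = [0.0, 0.0]
    counts = [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(2)
        else:
            arm = 0 if values[0] >= values[1] else 1
        reward = pull(arm, rng)
        counts[arm] += 1
        # Incremental average of the rewards observed for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values


probs = policy_gradient()
values = value_based()
print("policy probabilities:", probs)
print("estimated values:", values)
```

Both learners should end up preferring arm 1, but they get there differently: one stores a policy, the other stores value estimates and derives its behaviour from them.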


The pictures above come from the Deep RL course at UC Berkeley.


Dive into TensorFlow (3) – Init the variable

The tf.Variable class creates a tensor with an initial value that can be modified, much like a normal Python variable. This tensor stores its state in the session, so you must initialize the state of the tensor manually.

We use the tf.global_variables_initializer() function to initialize the state of all the Variable tensors.

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

The tf.global_variables_initializer() call returns an operation that will initialize all TensorFlow variables from the graph. You call the operation using a session to initialize all the variables as shown above.


Dive into TensorFlow (2) – Feeding dataset to session

x = tf.placeholder(tf.string)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Hello World'})

The feed_dict parameter in sess.run() lets us set the value of a placeholder tensor. The example above sets the tensor x to the string "Hello World".

It’s also possible to set more than one tensor using feed_dict as shown below.

x = tf.placeholder(tf.string)
y = tf.placeholder(tf.int32)
z = tf.placeholder(tf.float32)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Test String', y: 123, z: 45.67})

How to install and setup MuJoCo 1.31?

OK, first make sure your setup matches the one described in the older installation guide in the MuJoCo repo.


Next, run the test by going to the bin directory and running ./test ../model/humanoid.xml xs 10:

ros@ros-K401UB:~/.mujoco/mjpro131/bin$ ./test ../model/humanoid.xml xs 10

Comparison of original and saved model
Max difference : 2.61e-06
Field name : body_mass

Simulation ..........
Simulation time : 0.64 s
Realtime factor : 15.59 x
Time per step : 0.128 ms
Contacts per step : 9
Constraints per step : 39
Degrees of freedom : 27

In my case, I had to copy my mjkey.txt to ~/.mujoco/mjpro131/bin before the test would run successfully.
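If you prefer to script that copy step, a small helper along these lines should do. This is only a sketch: copy_key is a name I made up, and where your mjkey.txt lives before copying is an assumption you need to fill in yourself.

```python
import shutil
from pathlib import Path


def copy_key(key_file, bin_dir):
    """Copy the license key file into bin_dir and return the new path."""
    key_file = Path(key_file).expanduser()
    bin_dir = Path(bin_dir).expanduser()
    if not key_file.is_file():
        raise FileNotFoundError(f"license key not found: {key_file}")
    if not bin_dir.is_dir():
        raise NotADirectoryError(f"bin directory not found: {bin_dir}")
    # shutil.copy returns the path of the file it created inside bin_dir.
    return Path(shutil.copy(key_file, bin_dir))
```

For example, if the key sits in your home directory, you would call copy_key("~/mjkey.txt", "~/.mujoco/mjpro131/bin") and then rerun the test.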




Dive into TensorFlow (1) – Tensor and Session


In TensorFlow, data isn’t stored as integers, floats, or strings. These values are encapsulated in an object called a tensor. Tensors come in a variety of sizes as shown below:

# A is a 0-dimensional int32 tensor
A = tf.constant(1234) 
# B is a 1-dimensional int32 tensor
B = tf.constant([123,456,789]) 
# C is a 2-dimensional int32 tensor
C = tf.constant([ [123,456,789], [222,333,444] ])

tf.constant() is one of many TensorFlow operations. The tensor returned by tf.constant() is called a constant tensor, because the value of the tensor never changes.


TensorFlow’s API is built around the idea of a computational graph, a way of visualizing a mathematical process. A “TensorFlow Session” is an environment for running a graph. The session is in charge of allocating the operations to GPU(s) and/or CPU(s), including remote machines.
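The split between building a graph and running it can feel odd at first, so here is a toy sketch of the idea in plain Python. It is not TensorFlow code and says nothing about how TensorFlow is implemented: Node, constant, placeholder, add and ToySession are all hypothetical names invented for this illustration. The point is only that creating nodes computes nothing, and values appear when a session runs a node.

```python
class Node:
    """A graph node: an operation plus the nodes it takes as inputs."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value


def constant(value):
    """A node whose value is fixed when the graph is built."""
    return Node("const", value=value)


def placeholder():
    """A node whose value must be supplied when the graph is run."""
    return Node("placeholder")


def add(a, b):
    """A node that sums two other nodes when evaluated."""
    return Node("add", inputs=(a, b))


class ToySession:
    """Evaluates nodes on demand; nothing is computed until run() is called."""

    def run(self, node, feed_dict=None):
        feed_dict = feed_dict or {}
        if node in feed_dict:
            # A fed value overrides whatever the node would compute.
            return feed_dict[node]
        if node.op == "const":
            return node.value
        if node.op == "placeholder":
            raise ValueError("placeholder was not fed a value")
        if node.op == "add":
            left, right = (self.run(i, feed_dict) for i in node.inputs)
            return left + right
        raise ValueError("unknown op: " + node.op)


x = placeholder()
total = add(x, constant(10))  # builds the graph only, no arithmetic yet
sess = ToySession()
result = sess.run(total, feed_dict={x: 5})
print(result)  # 15
```

Running total without feeding x raises an error, which mirrors why a placeholder needs a value at run time.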

Hello World

import tensorflow as tf

# Create TensorFlow object called hello_constant
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)

The code first creates the tensor hello_constant. The next step is to evaluate the tensor in a session.

The code creates a session instance, sess, using tf.Session. The sess.run() function then evaluates the tensor and returns the result.